Samsung SSD 990 Pro in RAID 1 on Servers - Disks Vanishing Issue

Posted by Otherwise-Ad-424@reddit | buildapc | 49 comments

First reddit post!

I've seen very technical questions and issues discussed on Reddit, so here I am!

We have been using Samsung 990 Pro in several servers. We are aware that it doesn't have power protection like a PM9A3 but it's way faster so practical for many use cases.

Some of our servers use this motherboard: ASRock B650D4U-2L2T. To fit two SSDs in RAID 1, we use PCIe to M.2 adapters (like this one):

PCIe to M.2

Some servers are very stable, others seem to "lose" one drive once in a while. We don't know why, but we get this from the syslog/kernel log on Linux:

[136244.461088] nvme nvme1: I/O 177 QID 7 timeout, aborting
[136244.461105] nvme nvme1: I/O 852 QID 12 timeout, aborting
[136244.461112] nvme nvme1: I/O 853 QID 12 timeout, aborting
[136244.557074] nvme nvme1: I/O 309 QID 3 timeout, aborting
[136275.185578] nvme nvme1: I/O 309 QID 3 timeout, reset controller
[136281.325896] nvme nvme1: I/O 18 QID 0 timeout, reset controller
[136357.126884] nvme nvme1: Device not ready; aborting reset, CSTS=0x1
[136357.159275] nvme nvme1: Abort status: 0x371
[136357.159278] nvme nvme1: Abort status: 0x371
[136357.159279] nvme nvme1: Abort status: 0x371
[136357.159280] nvme nvme1: Abort status: 0x371
[136377.703231] nvme nvme1: Device not ready; aborting reset, CSTS=0x1
[136377.703256] nvme nvme1: Removing after probe failure status: -19
[136398.247561] nvme nvme1: Device not ready; aborting reset, CSTS=0x1
[136398.247959] nvme1n1: detected capacity change from 3907029168 to 0
[136398.247963] blk_update_request: I/O error, dev nvme1n1, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0
[136398.247965] blk_update_request: I/O error, dev nvme1n1, sector 687804416 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
[136398.247969] blk_update_request: I/O error, dev nvme1n1, sector 2599914832 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
[136398.247980] blk_update_request: I/O error, dev nvme1n1, sector 2203664 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0
[136398.247988] md/raid1:md1: Disk failure on nvme1n1p2, disabling device.

Because this RAID 1 holds the system partition, a dropped drive sometimes affects system stability.
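Since md only reports the failure once it has already kicked the member out, it can help to poll /proc/mdstat for degraded arrays. The following is a minimal Python sketch, assuming the usual /proc/mdstat layout; the function name `degraded_arrays` and the sample text are illustrative and not taken from the thread.

```python
import re

# An array header looks like "md1 : active raid1 nvme0n1p2[0] nvme1n1p2[1](F)".
HEADER_RE = re.compile(r"^(md\S+)\s*:")
# The status line carries "[expected/active] [UU]", e.g. "[2/1] [U_]" when degraded.
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def degraded_arrays(mdstat_text):
    """Return {array: member_map} for arrays with fewer active than expected members."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        header = HEADER_RE.match(line)
        if header:
            current = header.group(1)
            continue
        status = STATUS_RE.search(line)
        if current and status:
            expected, active, member_map = status.groups()
            if int(active) < int(expected):
                degraded[current] = member_map
            current = None  # only the first status line belongs to this array
    return degraded


# Hypothetical sample; on a real host, read the text from /proc/mdstat instead.
SAMPLE = """\
Personalities : [raid1]
md1 : active raid1 nvme0n1p2[0] nvme1n1p2[1](F)
      1953382464 blocks super 1.2 [2/1] [U_]
      bitmap: 3/15 pages [12KB], 65536KB chunk

md0 : active raid1 nvme0n1p1[0] nvme1n1p1[1]
      523264 blocks super 1.2 [2/2] [UU]

unused devices: <none>
"""

print(degraded_arrays(SAMPLE))  # {'md1': 'U_'}
```

Run from cron or a monitoring agent, a check like this would flag the degraded mirror before the second drive has a chance to drop as well.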

We investigated whether this could be a firmware issue, but the 3B2QJXD7 firmware seems to be relatively stable (although 4B2QJXD7 does exist).

Does anyone have good advice on how to find the root cause of the disk randomly disconnecting?

Smartctl reports no specific issues. Are there any other logs to check besides syslog and dmesg? Could this be related to a temperature problem, as the active disks appear to be more affected?
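To compare servers, one option is to tally the NVMe events per controller from saved kernel logs. Below is a minimal Python sketch that counts the three message types visible in the log above; the labels (`io_timeout`, `reset_failed`, `removed`) and the function name are made up for illustration, and the patterns are only matched against the message wording shown in this post.

```python
import re
from collections import Counter, defaultdict

# Patterns copied from the message wording in the kernel log above.
PATTERNS = {
    "io_timeout": re.compile(r"nvme (nvme\d+): I/O \d+ QID \d+ timeout"),
    "reset_failed": re.compile(r"nvme (nvme\d+): Device not ready; aborting reset"),
    "removed": re.compile(r"nvme (nvme\d+): Removing after probe failure"),
}


def summarize_nvme_events(log_text):
    """Count timeout, failed-reset and removal messages per NVMe controller."""
    summary = defaultdict(Counter)
    for line in log_text.splitlines():
        for label, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                summary[match.group(1)][label] += 1
    return {ctrl: dict(counts) for ctrl, counts in summary.items()}


# A few lines taken from the log in this post.
SAMPLE = """\
[136244.461088] nvme nvme1: I/O 177 QID 7 timeout, aborting
[136275.185578] nvme nvme1: I/O 309 QID 3 timeout, reset controller
[136357.126884] nvme nvme1: Device not ready; aborting reset, CSTS=0x1
[136377.703256] nvme nvme1: Removing after probe failure status: -19
"""

print(summarize_nvme_events(SAMPLE))
```

The same function can be fed the text of `dmesg` or `journalctl -k` captured from each host, which makes it easier to see whether the affected controller is always the one behind the adapter card.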

[-]

BuyAccomplished3460@reddit

Hello, Sorry for the late reply but I hope this helps you.

We have 45 servers all running (4) 2TB and 4TB Samsung 990 Pros. They would all drop the nvme drives from raid randomly.

What finally resolved this for us was adding the following lines to the kernelopts line in grub.cfg and rebuilding grub:

nvme_core.default_ps_max_latency_us=0 default_ps_max_latency_us=0 pcie_aspm=off

Otherwise, when the drives change power states they will desync and the raid will degrade.

Before we found this solution we switched multiple servers over to the HP FX900 Pro series line and those don't seem to have the same issue.
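For readers on Debian or Ubuntu, where kernel parameters usually live in the `GRUB_CMDLINE_LINUX_DEFAULT` line of /etc/default/grub rather than in a `kernelopts` line, the edit could be scripted roughly as follows. This is a sketch under that assumption: the function name `merge_kernel_params` is hypothetical, it only transforms text, and it covers just the `nvme_core.default_ps_max_latency_us=0` and `pcie_aspm=off` parameters from the comment above.

```python
import re

# Parameters discussed in the thread; whether they help on a given board is unverified.
WANTED = ("nvme_core.default_ps_max_latency_us=0", "pcie_aspm=off")


def merge_kernel_params(grub_text, wanted=WANTED, var="GRUB_CMDLINE_LINUX_DEFAULT"):
    """Return grub_text with `wanted` appended to `var`, replacing old values of the same keys."""
    line_re = re.compile(rf'^({re.escape(var)}=)"([^"]*)"[ \t]*$', re.MULTILINE)
    wanted_keys = {param.split("=", 1)[0] for param in wanted}

    def rewrite(match):
        # Drop any existing setting of a key we are about to set, keep everything else.
        kept = [p for p in match.group(2).split() if p.split("=", 1)[0] not in wanted_keys]
        merged = " ".join(kept + list(wanted))
        return f'{match.group(1)}"{merged}"'

    new_text, count = line_re.subn(rewrite, grub_text)
    if count == 0:
        raise ValueError(f"{var} not found")
    return new_text


# Hypothetical file content, for demonstration only.
SAMPLE = 'GRUB_DEFAULT=0\nGRUB_CMDLINE_LINUX_DEFAULT="quiet"\n'
print(merge_kernel_params(SAMPLE))
```

After writing the result back to /etc/default/grub, the boot configuration still has to be regenerated (`update-grub` on Debian and Ubuntu) and the machine rebooted before the parameters take effect.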

[-]

BuyAccomplished3460@reddit

We use Dell PowerEdge servers, currently R620, R630 and R640.

[-]

Rare_Airline1418@reddit

That is so odd. I replaced the Supermicro H13SAE-MF with an ASUS desktop mainboard. Problem gone.

[-]

Bad news after bad news: While I had kernel panics with the Micron 7400 Pro as well, I put the Samsung 990 Pros back. Immediately after reinstalling Debian 12.9 Bookworm, one of the Samsung 990 Pros disappeared ("Unable to change power state from D3cold to D0, device inaccessible") and degraded the RAID. I am in contact with Supermicro and they said I should try it with ASPM set to Auto; before that it was disabled.

[-]

Rare_Airline1418@reddit

I noticed another thing: Both Samsung 990 Pros have about 188 TB written in 436 hours for no apparent reason. They are brand new. So with a rating of 1,200 TBW they are already about 16% used up.
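The arithmetic behind that figure can be reproduced with a few lines of Python. This is a worked example using only the numbers from the comment (188 TB, 436 hours, 1,200 TBW); the function names are illustrative, and the conversion helper relies on the NVMe convention that SMART "Data Units Written" counts units of 1,000 x 512 bytes.

```python
def data_units_to_tb(data_units):
    """Convert the NVMe SMART 'Data Units Written' counter to decimal terabytes."""
    return data_units * 512_000 / 1e12


def endurance_summary(tb_written, power_on_hours, rated_tbw):
    """Return (fraction of rated TBW used, TB written per day, days until rated TBW)."""
    used_fraction = tb_written / rated_tbw
    tb_per_day = tb_written / power_on_hours * 24
    days_to_rated = (rated_tbw - tb_written) / tb_per_day
    return used_fraction, tb_per_day, days_to_rated


used, rate, days_left = endurance_summary(188, 436, 1200)
print(f"{used:.1%} of rated TBW used, {rate:.1f} TB/day, about {days_left:.0f} days to the rated limit")
```

At roughly 10 TB per day, the drives would reach their rated endurance in about three more months if the write rate stayed the same, which is why the number stands out.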

[-]

colin_1972@reddit

Dragging this up, but if these are in RAID, every time one drops it will have to rebuild. Would this be the cause of the large data writes noted above?
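One way to sanity-check that idea is to estimate how many full rebuilds it would take. The sketch below assumes each rebuild rewrites the whole 2 TB member once (no write-intent bitmap shortening the resync) and reuses the 188 TB and 436 hours quoted earlier; the function name is hypothetical.

```python
def full_resyncs_needed(tb_written, member_size_tb):
    """Number of complete resyncs that would account for tb_written on one member."""
    return tb_written / member_size_tb


resyncs = full_resyncs_needed(188, 2)   # 2 TB drives, as mentioned in the thread
hours_between = 436 / resyncs           # power-on hours divided by resync count
print(f"{resyncs:.0f} full resyncs, one every {hours_between:.1f} hours")
```

Under these assumptions a drive would have to be fully rebuilt every few hours for the whole uptime, so readers can compare that against how often their array actually degraded before attributing all the writes to resyncs.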

[-]

Rare_Airline1418@reddit

Probably. I switched the mainboard: ASUS Desktop (B650E) instead of Supermicro. No problems so far.

[-]

colin_1972@reddit

I was literally about to buy a 980 pro for the Windows OS disk on an H12SSL-i, I'm still trying Samsung, but a 983 Enterprise M.2. Hopefully won't get any problems.

[-]

Otherwise-Ad-424@reddit (OP)

Hello, for me, this looks like a different issue. In our case after a power cycle (and not just a reboot), the disks are back. What about you?

[-]

Rare_Airline1418@reddit

I rebooted the server four times and the NVMe remained missing. Then I powered the server off for a few minutes and started it again; after that, the NVMe came back.

[-]

Am3thystXR@reddit

https://i.imgur.com/WY0OFnm.png

I have nothing to add other than I am experiencing similar issues on very similar drives. Piece of junk SSD

[-]

Jamira40@reddit

Can confirm this is happening randomly to us too. 990 Pro and 980 Pro, all 2TB versions. Multiple systems. Today it happened to a 990 Pro with FW 4B2QJXD7: I/O error, then disconnected.

We have already RMA'd dozens of 990 Pros, but it keeps happening. It also happens on different kernel versions.

[-]

IntelligentHoliday71@reddit

Did it happen even after the firmware fix?

[-]

Spooky-Mulder@reddit

No solution, but exact same issue here with two 990 pro in raid 0 truenas scale on an asrock mainboard

[-]

SilverDetective@reddit

I have the same problem with a Samsung 990 PRO 2TB. Intel CPU. A reboot doesn't bring it back; I need to cut power. I moved the drive to a different slot, which didn't help.

I also get these messages:

[ 2557.778707] pcieport 0000:00:1a.0: AER: Correctable error message received from 0000:00:1a.0
[ 2557.778722] pcieport 0000:00:1a.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)

I disabled the lowest power state for this drive, which got rid of the AER messages, but the drive still stops working.

Now I have disabled APST. I'm not sure yet, but it seems this helped. It has now been working for 6 days. But I don't want to have it always in the highest power state.

[-]

SilverDetective@reddit

Just reporting my status. After disabling APST, the drive has now been working for 63 days. So this seems to help. But it's now always in the highest power state.

[-]

Objective-Entry-4416@reddit

I don't want to have it offline ;)

Disabling APST reduces how often the problem appears, but it doesn't fully solve it.

I think turning off APST may not even be the right way to handle the problem, because it works on PCI. I'm not sure whether this affects NVMe drives that are connected to the processor's lanes.

There might be a way to do it with "nvme get-feature" and "nvme set-feature": read the possible power states and restrict the drive to full power ... I think I will have to check that.

[-]

SilverDetective@reddit

It's now 13 days and it still works. Last time it was offline after 3 days. But I think max was 3 weeks, so I'm still not sure if this really helps.

APST is actually disabled:

nvme get-feature /dev/nvme0 -f 0xc -H | grep 'APST'
        Autonomous Power State Transition Enable (APSTE): Disabled

It's now always in PS 0:

nvme get-feature /dev/nvme0 -f 2 -H
get-feature:0x02 (Power Management), Current value:00000000
        Workload Hint (WH): 0 - No Workload
        Power State   (PS): 0

Supported Power States

St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     9.39W       -        -    0  0  0  0        0       0
 1 +     9.39W       -        -    1  1  1  1        0       0
 2 +     9.39W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3     4200    2700
 4 -   0.0050W       -        -    4  4  4  4      500   21800

But I don't know how to disable APST for just one drive, so I actually patched the kernel. There are already some "quirks" for other drives, and I just added NVME_QUIRK_NO_APST for this drive.

As I understand it, disabling APST with "nvme set-feature" won't work, or it will only work temporarily until the kernel resets the state.
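To see why `nvme_core.default_ps_max_latency_us` matters for this drive, here is a simplified Python model of how the Linux NVMe driver appears to pick APST target states: only non-operational states are considered, and a state is skipped when its entry plus exit latency exceeds the configured tolerance. The class and function names are made up, the latencies come from the table above, and the 100,000 µs default is my understanding of the kernel default rather than something stated in the thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PowerState:
    index: int
    operational: bool   # "+" in the Op column of the table
    max_watts: float
    entry_lat_us: int
    exit_lat_us: int


# Values copied from the "Supported Power States" table in this comment.
STATES_990_PRO = [
    PowerState(0, True, 9.39, 0, 0),
    PowerState(1, True, 9.39, 0, 0),
    PowerState(2, True, 9.39, 0, 0),
    PowerState(3, False, 0.04, 4200, 2700),
    PowerState(4, False, 0.005, 500, 21800),
]


def apst_eligible(states, max_latency_us):
    """Return indexes of states APST may enter under this simplified model."""
    if max_latency_us == 0:
        return []  # a tolerance of 0 disables APST entirely
    return [
        s.index
        for s in states
        if not s.operational and s.entry_lat_us + s.exit_lat_us <= max_latency_us
    ]


print(apst_eligible(STATES_990_PRO, 100_000))  # assumed kernel default
print(apst_eligible(STATES_990_PRO, 10_000))   # excludes the deepest state
print(apst_eligible(STATES_990_PRO, 0))        # APST off
```

In this model a tolerance between the two totals (6,900 µs and 22,300 µs) would keep PS3 but rule out PS4, which might be a middle ground worth testing instead of forcing PS0 all the time. It may also be possible to set the tolerance per controller through the `power/pm_qos_latency_tolerance_us` attribute under /sys/class/nvme/nvmeX instead of patching the kernel, but treat that as an assumption to verify on your own kernel.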

[-]

Fletch_to_99@reddit

I'm seeing a similar issue in unraid home server. Setup is a Crosshair VII hero with an AMD 5950x and I've got 2 990 Pros in a ZFS mirror. For some reason they seem to intermittently drop out with similar logs to what the OP has posted. I checked and both are on the latest firmware. I tried to disable pcie_aspm but that didn't seem to help.

Did you have any luck figuring out the issue?

[-]

Maunose@reddit

I have the same problem. I have 2 Samsung 990 Pro 2TB drives, one with a heatsink and one without. The one without a heatsink is in a slot with lanes directly from the CPU, and it has never had a problem with ASPM. The one with the heatsink "disappears" from the system every 2 to 3 days with the "change power state from D3cold to D0" error. My motherboard is an ASUS Pro WS W680-ACE, the processor is an Intel i7-14700 and the operating system is Proxmox 8.3 (for those who don't know it, it is based on Debian Bookworm). For reasons I can't understand, the motherboard doesn't keep ASPM disabled, so I don't know what to do.

[-]

Objective-Entry-4416@reddit

There are some differences between Proxmox 8.3 and Debian 12. Proxmox uses kernel 6.8 while Debian uses 6.1. Proxmox uses ZFS while Debian uses ext4 by default. That might make some difference. Funnily enough, in reality it doesn't.

We are also using an i7-14700 which is known for additional problems.

We have all 990 PRO connected to the processor's lanes and using the heatsinks of the mainboards. Usually nvme1n1 disappears. One time nvme0n1 followed after one day. One time nvme0n1 disappeared.

It could be a matter of one NVMe being the "active" one while the other, the one being synced to, disappears. Who knows ...

I found threads in other forums saying that it helped to install Samsung's Magician tool and put the M.2 into full-power mode. The Magician tool is not available for Linux, and setting ASPM = off doesn't do the job ...

[-]

Maunose@reddit

Does setting the drive to full-power mode solve the issue? Have you noticed whether the drives run hotter in full-power mode? One last question: does that setting persist if I use another machine to set it and then move the drive back into my server? Thanks!

[-]

Objective-Entry-4416@reddit

I read that it does under Windows. But I cannot confirm, because we don't use Windows.

A colleague told me that this writes something to the M.2. So it might persist when you move it back to another machine. Might ...

Since I didn't have the chance to try it in Windows, I can't say anything about temperature.

[-]

Maunose@reddit

Thankfully I use these drives in a ZRAID1 array, as I had to wipe out the ZFS partition and format it as NTFS for Samsung Magician to work; without formatting it as NTFS it just shows "No Supported Volumes found.".
After formatting it to NTFS I was able to set Full Power Mode. Now, back in Proxmox after I "replaced" the zpool drive, smartctl shows only one supported power state, the full-power one.

Supported Power States

St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 9.39W - - 0 0 0 0 0 0

I hope this solves the issue and at the same time doesn't lower the drive's lifetime.

[-]

Objective-Entry-4416@reddit

Interesting!

I would read it like that:

Samsung's tool needs Windows to be able to "see" the M.2. So it has to be formatted in a way that Windows can read and write.
When Windows can read and write it, Samsung's tool can disable power states in the M.2's firmware.
Because changes to the power states are written into the firmware, the M.2 can be formatted any way you like afterwards.

If that is the case, then Samsung is just too lazy to release a Linux tool that writes power state changes into the firmware.

I guess I will test that.

[-]

Maunose@reddit

I installed Windows 10 on a spare nvme to try setting the 990 Pro into full-power mode and it's a no go. It probably just works on windows (or drives formatted as NTFS or FAT32) as I can't do anything to the drive since it says: "No supported volumes found.".

[-]

Objective-Entry-4416@reddit

Hey there,

Same prob here on seven machines with 2x 990 Pro 4TB in SW-RAID1 on Debian Bookworm.

On three machines the problem never appeared, on three it appears every few weeks or months, and on one it appears on a daily basis.

Two weeks ago I added the kernel flags nvme_core.default_ps_max_latency_us=0 pcie_aspm=off. Since then it has only happened once, and only on the machine where it used to happen nearly every day.

I found out that the NVMe needs to be fully powered down to recover, so a reboot doesn't help. It is better to shut down, pull the power cord, press the power button to drain any residual voltage, then power it on again.

We also saw that before the 990 Pro disappears, the temperature in monitoring is unrealistically high (~90°C and above).
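For anyone without monitoring in place, the temperature can be read directly from the kernel's hwmon interface, which on reasonably recent kernels exposes NVMe sensors under /sys/class/hwmon with the name "nvme" and values in millidegrees Celsius. A minimal Python sketch, assuming that layout; the function names and the 80°C threshold are arbitrary choices for illustration.

```python
from pathlib import Path


def nvme_temperatures(hwmon_root="/sys/class/hwmon"):
    """Return {hwmon_name: degrees_celsius} for hwmon devices named 'nvme'."""
    readings = {}
    root = Path(hwmon_root)
    if not root.is_dir():
        return readings
    for hwmon in sorted(root.iterdir()):
        name_file = hwmon / "name"
        temp_file = hwmon / "temp1_input"  # composite temperature, in millidegrees
        if not name_file.is_file() or not temp_file.is_file():
            continue
        if name_file.read_text().strip() != "nvme":
            continue
        readings[hwmon.name] = int(temp_file.read_text().strip()) / 1000.0
    return readings


def over_threshold(readings, limit_c=80.0):
    """Keep only the sensors at or above limit_c (an arbitrary example threshold)."""
    return {name: temp for name, temp in readings.items() if temp >= limit_c}


# Prints an empty dict on machines without NVMe hwmon sensors.
print(over_threshold(nvme_temperatures()))
```

Logging these values every minute or so would show whether the spike really precedes the drop or is just a bogus reading from a controller that has already stopped responding.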

Since not all M.2 ports on the main board are connected to PCIe, but to the processor's lane, I am wondering if pcie_aspm=off helps there ...

On the one machine with those massive problems we already replaced both 990 Pros with new ones and also swapped the mainboard from a Gigabyte Z790 Gaming to a Microstar MS-7E06. Next we will replace both 990 Pros with 4th-generation NVMe drives from another manufacturer to finally get rid of the problem.

Greetz

[-]

Rare_Airline1418@reddit

We replaced the Samsung 990 Pros with Micron 7400 Pros and then with Samsung PM9A1s. And we changed the mainboard. Then we changed the CPU. Still no success.

Do you use AMD or Intel?

[-]

Objective-Entry-4416@reddit

I use 14th generation of i-processors, mostly i7-14700k.

What you describe leads to only one conclusion: you sync the problem onto every new M.2 or SSD by putting it into the RAID1. THAT is kinda weird. I don't like to believe it ...

At least I would expect the problem to disappear when you change from a 5th generation M.2 like the 990 PRO to a 4th generation M.2 like the Micron 7400 Pro.

[-]

Rare_Airline1418@reddit

It's most likely a firmware bug of Supermicro, that is why the problem is not gone after all the hardware changes.

[-]

Rare_Airline1418@reddit

I have bad news: As I already mentioned, I bought brand new Micron 9400 NVMes as a replacement for the Samsung 990 Pro NVMes. The system still has a kernel panic on reboot, so the Microns will likely also disappear like the Samsungs did. There is something really strange going on here.

[-]

kendrickluong@reddit

Howdy!

Don't have a solution but have the same problem. 4 of these in a Dell R740, and 3 of them have done it, all within a span of 3 months or so. 990 PRO 4TB, latest firmware (4B2). I'm planning on scrapping the lot; it's caused nothing but issues with an MDADM RAID 1 used as an iSCSI target.

I'm running Debian 12 also with kernel flags as well:
nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off
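Because a typo in the bootloader config fails silently, it is worth confirming that the running kernel actually received these flags. A minimal Python sketch that compares a kernel command line (as found in /proc/cmdline) against the expected values; the helper names and the sample command line are hypothetical.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {key: value}; bare flags map to ''."""
    params = {}
    for token in cmdline.split():
        key, _, value = token.partition("=")
        params[key] = value
    return params


def mismatched_params(cmdline, expected):
    """Return {key: actual_value_or_None} for expected parameters that are absent or differ."""
    active = parse_cmdline(cmdline)
    return {key: active.get(key) for key, value in expected.items() if active.get(key) != value}


EXPECTED = {
    "nvme_core.default_ps_max_latency_us": "0",
    "pcie_aspm": "off",
    "pcie_port_pm": "off",
}

# Hypothetical command line; on a real host use open("/proc/cmdline").read().
SAMPLE = "BOOT_IMAGE=/vmlinuz root=/dev/md1 ro nvme_core.default_ps_max_latency_us=0 pcie_aspm=off"
print(mismatched_params(SAMPLE, EXPECTED))  # {'pcie_port_pm': None}
```

An empty result means every expected flag is present with the intended value; anything else points at a bootloader config that was not regenerated or a misspelled parameter.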

People with the same issue:

https://www.reddit.com/r/linuxquestions/comments/1gnmucp/what_to_do_about_a_samsung_4tb_990_pro_crashing/

https://www.reddit.com/r/PcBuild/comments/18kir2d/samsung_4tb_990_pro_nvme_keeps_disconnecting_nov/

https://www.reddit.com/r/techsupport/comments/17pbshx/my_samsung_990_pro_keeps_disconnectingmaking_pc/

https://www.reddit.com/r/pcmasterrace/comments/1eebawt/samsung_990_pro_disconnects/

So apparently, on Windows with Samsung Magician you can set it to performance mode and that seems to stop it. I thought I could be smart and set it on Windows and bring the drive over, but it appears to require a Windows partition to set and is not actually a hardware flag.

[-]

SkunkDeRay@reddit

Bringing my 2 cents to this discussion ...

Setup:

2 Homeserver/WS
HW Raid Trimode Controller: broadcom 9670w-i16
Raid 10
990pro NVMEs : one Server 2 TB and on the other 4TB

Same issues here. Losing random drives, and therefore the RAID is degrading. If I reboot, the devices are not present in the enclosure (Controller Diag). I have to completely power off and on for the lost devices to reappear. Only then is it possible to rebuild the underlying virtual RAID drive. The 990s are not on Broadcom's compatibility list, but I gave this setup a shot. I had expected a lot, but not such freak behaviour. Before seeing this post I changed the OCuLink cables, and yesterday I updated the controller firmware on one server. Didn't expect the NVMes themselves could be such a pain in the *ss.

Maybe someone benefits knowing about this ...

[-]

Rare_Airline1418@reddit

Thanks for sharing your experience. I moved now to Micron and will give feedback if I should encounter any problems. So far, after many reboots, no kernel panic at all.

[-]

tylerwatt12@reddit

Same problem. Across two different systems with completely different specs and different batches of drives. I’m no longer buying Samsung drives. This is ridiculous.

I’m running these on desktop boards with 12/13th Gen Intel i7 CPUs. One is my personal workstation in the office. Another is an NVR camera system. On my workstation, I switched to Inland’s fastest model SSDs and haven’t had an issue since.

These SSDs are not in RAID.

[-]

Rare_Airline1418@reddit

Interesting. So you're not even using a workstation mainboard (Supermicro, AsRock Rack) and not AMD, but Intel instead and still encounter the same problems? Do you use Samsung 990 Pros as well?

[-]

tylerwatt12@reddit

All 1TB 990 pros

[-]

Rare_Airline1418@reddit

Thanks. And the error messages, are they similar? Today in idle I got 'Unable to change power state from D3cold to D0, device inaccessible' and then the disk disappeared from 'fdisk -l'.

[-]

Otherwise-Ad-424@reddit (OP)

I guess this should be further investigated: https://en.wikipedia.org/wiki/Active_State_Power_Management

[-]

tylerwatt12@reddit

I'm on windows and I can only see that the drive effectively disappears from the system entirely, even on system reset. Gone from BIOS. I have to pull the power, hold the power button for a few seconds and that fixes it most of the time. Works for anywhere between a week and a month. It's very random.

[-]

Rare_Airline1418@reddit

Today, once again an NVMe (still 990 Pro 2 TB) disappeared, this time `/dev/nvme0n1`:

`Unable to change power state from D3cold to D0, device inaccessible
...
Disk failure on nvme0n1p3, disabling device.`

[-]

Rare_Airline1418@reddit

Very interesting. I use a Supermicro H13SAE-MF and 2 x 990 Pro 2 TB and have the same problem with Debian 12 Bookworm. The secondary SSD /dev/nvme1n1 suddenly disappeared with the same error message:

[...] md/raid1:md0: Disk failure on nvme1n1p1, disabling device.
[...] md/raid1:md0: Operation continuing on 1 devices.

The device wasn't even visible in rescue (grml), so I updated the BMC from version 01.03.02 to 01.03.06 and the BIOS from 2.1 to 2.2. The SSD then came back, but a few days later I had a kernel panic and the server froze:

[...] ? do_wp_page+...
[...] ? srso_alias_return_thunk+...
[...] ? __handle_mm_fault+...
[...] ? srso_alias_return_thunk+...
[...] ? handle_mm_fault+...
[...] ? srso_alias_return_thunk+...
[...] ? do_user_addr_fault+...
...

I contacted Supermicro about that issue and they said they aren't aware of any issues. They also refused to give any information about the BMC/BIOS changelog.

[-]

Otherwise-Ad-424@reddit (OP)

Same CPU, 7900. Super nice to see you also found it interesting to use the 7900 in a server setup. If you don't need many PCIe lanes, it's amazing.

After some reading, PCIe ASPM could be a reason. I'll continue to investigate. It mostly happens on PCIe to NVME card adapter. So maybe signal integrity?
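As a starting point for that investigation, the kernel reports its global ASPM policy in /sys/module/pcie_aspm/parameters/policy, with the active choice shown in square brackets. A minimal Python sketch that extracts it; the function name is illustrative, the sample string mirrors the usual format of that file, and the file may be absent on kernels built without ASPM support.

```python
import re
from pathlib import Path

POLICY_PATH = Path("/sys/module/pcie_aspm/parameters/policy")


def active_aspm_policy(policy_text):
    """Return the bracketed (active) policy name, or None if none is marked."""
    match = re.search(r"\[(\w+)\]", policy_text)
    return match.group(1) if match else None


# Typical file content looks like this sample; read POLICY_PATH on a real host.
SAMPLE = "default performance [powersave] powersupersave"
print(active_aspm_policy(SAMPLE))  # powersave

if POLICY_PATH.is_file():
    print(active_aspm_policy(POLICY_PATH.read_text()))
```

This only shows the kernel-wide policy. The per-link state for the adapter slot is reported by `lspci -vv` in the LnkCtl line, which is the place to check whether ASPM is really off for the drive that keeps dropping.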

FYI, I also have Supermicro boards with EPYC and see this issue there too, but it's not frequent at all... My servers (~20) also run different firmware versions. I don't see a correlation for now. Some have the "0" firmware and have been working flawlessly for years.

[-]

Rare_Airline1418@reddit

New SSDs from Micron are ordered now and will hopefully be installed in a few days. I've been running a stress test for days with no kernel panic so far, so I think your assumption that it may have something to do with PCIe power management is not so unlikely, since the kernel panic occurred right at the time of reboot.

[-]

Rare_Airline1418@reddit

I had to choose between the Ryzen 7900 (12C) and the EPYC 4464P (12C) and was unable to find out what the difference between the two is, since the single and multithreading performance seems to be exactly the same (according to PassMark), but the 7900 was cheaper. That's always a problem with hardware manufacturers: you need to invest so much time to find out the product differences ... I moved from Intel to AMD recently, so it was even more confusing (but as far as I know, AMD usually offers more PCIe lanes than Intel anyway).

My NVMEs directly sit on the mainboard. That you use the same CPU (or even AMD), is another interesting fact. The worst would be a software bug in AMD AGESA (https://en.wikipedia.org/wiki/AGESA), but I really don't know if that could be the case.

For now, I decided to drop Samsung since I stopped liking them anyway for their inferior customer support, so I will go with Micron in future. A Micron 7400 (22110) might not be the most modern NVME, but still a robust product for my use case. If then the problem persists, I will have a problem 😳

[-]

Otherwise-Ad-424@reddit (OP)

More info: We use software Raid on Ubuntu Server. The disk is back after a power cycle.