Samsung SSD 990 Pro in RAID 1 on Servers - Disks Vanishing Issue
Posted by Otherwise-Ad-424@reddit | buildapc
First reddit post!
I've seen very technical questions/issues on R so here I am!
We have been using the Samsung 990 Pro in several servers. We are aware that it doesn't have power-loss protection like a PM9A3, but it's much faster, which makes it practical for many use cases.
For some of our servers we use this motherboard: ASRock B650D4U-2L2T. To fit 2 SSDs in RAID 1, we are using PCIe to M.2 adapters.
Some servers are very stable, others seem to "lose" one drive once in a while. We don't know why, but we get this from syslog/kernel on Linux:
[136244.461088] nvme nvme1: I/O 177 QID 7 timeout, aborting
[136244.461105] nvme nvme1: I/O 852 QID 12 timeout, aborting
[136244.461112] nvme nvme1: I/O 853 QID 12 timeout, aborting
[136244.557074] nvme nvme1: I/O 309 QID 3 timeout, aborting
[136275.185578] nvme nvme1: I/O 309 QID 3 timeout, reset controller
[136281.325896] nvme nvme1: I/O 18 QID 0 timeout, reset controller
[136357.126884] nvme nvme1: Device not ready; aborting reset, CSTS=0x1
[136357.159275] nvme nvme1: Abort status: 0x371
[136357.159278] nvme nvme1: Abort status: 0x371
[136357.159279] nvme nvme1: Abort status: 0x371
[136357.159280] nvme nvme1: Abort status: 0x371
[136377.703231] nvme nvme1: Device not ready; aborting reset, CSTS=0x1
[136377.703256] nvme nvme1: Removing after probe failure status: -19
[136398.247561] nvme nvme1: Device not ready; aborting reset, CSTS=0x1
[136398.247959] nvme1n1: detected capacity change from 3907029168 to 0
[136398.247963] blk_update_request: I/O error, dev nvme1n1, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0
[136398.247965] blk_update_request: I/O error, dev nvme1n1, sector 687804416 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
[136398.247969] blk_update_request: I/O error, dev nvme1n1, sector 2599914832 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
[136398.247980] blk_update_request: I/O error, dev nvme1n1, sector 2203664 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0
[136398.247988] md/raid1:md1: Disk failure on nvme1n1p2, disabling device.
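One way to compare affected servers is to reduce logs like the one above to a short per-controller summary. The following is only a minimal sketch: the event labels and the function names `classify` and `summarize` are made up for illustration, and the sample lines are copied from the log above.

```python
import re
from collections import Counter

# Matches lines such as "[136244.461088] nvme nvme1: I/O 177 QID 7 timeout, aborting"
LINE_RE = re.compile(r"^\[\s*(?P<ts>\d+\.\d+)\]\s+nvme (?P<ctrl>nvme\d+): (?P<msg>.*)$")

def classify(msg):
    """Map a kernel NVMe message to a coarse label (labels are hypothetical)."""
    if "timeout, aborting" in msg:
        return "io_timeout"
    if "timeout, reset controller" in msg:
        return "controller_reset"
    if "Device not ready" in msg:
        return "reset_failed"
    if "Removing after probe failure" in msg:
        return "removed"
    return "other"

def summarize(lines):
    """Return (first_ts, last_ts, Counter keyed by (controller, label))."""
    counts = Counter()
    stamps = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not an "nvme nvmeX:" kernel line
        stamps.append(float(m.group("ts")))
        counts[(m.group("ctrl"), classify(m.group("msg")))] += 1
    if not stamps:
        return None, None, counts
    return min(stamps), max(stamps), counts

SAMPLE = """\
[136244.461088] nvme nvme1: I/O 177 QID 7 timeout, aborting
[136275.185578] nvme nvme1: I/O 309 QID 3 timeout, reset controller
[136357.126884] nvme nvme1: Device not ready; aborting reset, CSTS=0x1
[136357.159275] nvme nvme1: Abort status: 0x371
[136377.703256] nvme nvme1: Removing after probe failure status: -19
"""

if __name__ == "__main__":
    first, last, counts = summarize(SAMPLE.splitlines())
    print(f"window: {last - first:.1f} s")
    for (ctrl, label), n in sorted(counts.items()):
        print(ctrl, label, n)
```

Feeding it the output of `dmesg` from each server would show whether the sequence (timeouts, then a failed reset, then removal) is the same everywhere.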
As this RAID 1 holds the system partition, it sometimes has an impact on system stability.
We did investigate if this could be a firmware issue, but 3B2QJXD7 firmware seems to be relatively stable (although 4B2QJXD7 does exist).
Anyone have good advice on how to find the root cause of the disk randomly disconnecting?
Smartctl reports no specific issues. Are there any other logs to check besides syslog and dmesg? Could this be related to a temperature problem, as the active disks appear to be more affected?
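To test the temperature idea, the drive temperature could be logged over time and compared with the timestamps of the dropouts. A minimal sketch, assuming the kernel exposes NVMe sensors through hwmon (devices whose `name` file reads `nvme`, with `temp1_input` in millidegrees Celsius); the function name is made up for illustration:

```python
from pathlib import Path

def nvme_temperatures(hwmon_root="/sys/class/hwmon"):
    """Return {hwmon_dir_name: degrees_C} for hwmon devices named 'nvme'."""
    temps = {}
    root = Path(hwmon_root)
    if not root.is_dir():
        return temps
    for dev in sorted(root.iterdir()):
        try:
            if (dev / "name").read_text().strip() != "nvme":
                continue  # some other sensor (CPU, board, ...)
            # temp1_input is reported in millidegrees Celsius
            temps[dev.name] = int((dev / "temp1_input").read_text().strip()) / 1000.0
        except (OSError, ValueError):
            continue  # sensor missing or unreadable, e.g. the drive has vanished
    return temps

if __name__ == "__main__":
    for name, celsius in nvme_temperatures().items():
        print(f"{name}: {celsius:.1f} C")
```

Running this from cron every minute and appending the result to a file would give a temperature history to line up against the kernel log.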
Rare_Airline1418@reddit
Bad news after bad news: since I had kernel panics with the Micron 7400 Pro as well, I put the Samsung 990 Pros back. Immediately after reinstalling Debian 12.9 Bookworm, one of the Samsung 990 Pros disappeared ("Unable to change power state from D3cold to D0, device inaccessible") and degraded the RAID. I am in contact with Supermicro and they said I should try setting ASPM to Auto; before that it was disabled.
Rare_Airline1418@reddit
I noticed another thing: both Samsung 990 Pros have about 188 TB written in 436 hours for no apparent reason. They are brand new. So with a rating of 1,200 TBW they are already about 16 % dead.
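The arithmetic behind that figure, plus the average write rate it implies, can be sketched as follows. This assumes decimal terabytes (as endurance ratings use) and the NVMe convention that one "Data Units Written" unit is 1000 × 512 bytes; the function names are made up for illustration.

```python
TB = 10**12  # decimal terabyte, as used in SSD endurance ratings

def wear_fraction(written_tb, rated_tbw):
    """Fraction of the rated endurance already consumed."""
    return written_tb / rated_tbw

def average_write_rate_mb_s(written_tb, power_on_hours):
    """Average write rate in MB/s over the whole power-on time."""
    return written_tb * TB / (power_on_hours * 3600) / 10**6

def data_units_to_tb(data_units_written):
    """Convert the NVMe SMART 'Data Units Written' counter to TB."""
    return data_units_written * 512_000 / TB

if __name__ == "__main__":
    print(f"wear: {wear_fraction(188, 1200):.1%}")
    print(f"rate: {average_write_rate_mb_s(188, 436):.0f} MB/s")
```

188 TB out of 1,200 TBW is a little under 16 %, and 188 TB in 436 hours averages out to roughly 120 MB/s of continuous writes, which is far more than an idle system should produce.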
Otherwise-Ad-424@reddit (OP)
Hello, for me, this looks like a different issue. In our case after a power cycle (and not just a reboot), the disks are back. What about you?
Rare_Airline1418@reddit
I rebooted the server four times and the NVMe drive stayed missing. Then I powered the server off for a few minutes and started it again; after that, the drive came back.
Rare_Airline1418@reddit
I have bad news: as I already mentioned, I bought brand new Micron 9400 NVMEs as a replacement for the Samsung 990 Pro NVMEs. I still get a kernel panic on reboot, so the Microns will likely disappear just like the Samsungs. There is something really strange going on here.
kendrickluong@reddit
Howdy!
I don't have a solution but I have the same problem. 4 of these in a Dell R740, and 3 of them have done it. 990 PRO 4TB, latest firmware (4B2), all within a span of 3 months or so. I'm planning to scrap the lot; it has caused nothing but issues with an mdadm RAID 1 used as an iSCSI target.
I'm running Debian 12 also with kernel flags as well:
nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off
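Since flags like these are easy to lose on a reinstall or bootloader update, it may be worth verifying that they are actually active on each server. A minimal sketch, assuming a simple whitespace-separated kernel command line (quoted values with spaces are not handled); the helper names are made up for illustration:

```python
from pathlib import Path

# The flags from the post above and the values we expect to see.
WANTED = {
    "nvme_core.default_ps_max_latency_us": "0",
    "pcie_aspm": "off",
    "pcie_port_pm": "off",
}

def parse_cmdline(cmdline):
    """Split a kernel command line into {key: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params

def missing_flags(cmdline, wanted=WANTED):
    """Return the wanted flags that are absent or set to a different value."""
    params = parse_cmdline(cmdline)
    return {k: v for k, v in wanted.items() if params.get(k) != v}

def read_text_or_none(path):
    """Read a small text file, returning None if it does not exist."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None

if __name__ == "__main__":
    cmdline = read_text_or_none("/proc/cmdline") or ""
    print("missing or different:", missing_flags(cmdline))
    # Effective values as reported by the running kernel
    print("APST max latency:", read_text_or_none(
        "/sys/module/nvme_core/parameters/default_ps_max_latency_us"))
    print("ASPM policy:", read_text_or_none(
        "/sys/module/pcie_aspm/parameters/policy"))
```

An empty result from `missing_flags` means all three flags are present on the running kernel's command line.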
People with the same issue:
https://www.reddit.com/r/linuxquestions/comments/1gnmucp/what_to_do_about_a_samsung_4tb_990_pro_crashing/
https://www.reddit.com/r/PcBuild/comments/18kir2d/samsung_4tb_990_pro_nvme_keeps_disconnecting_nov/
https://www.reddit.com/r/techsupport/comments/17pbshx/my_samsung_990_pro_keeps_disconnectingmaking_pc/
https://www.reddit.com/r/pcmasterrace/comments/1eebawt/samsung_990_pro_disconnects/
So apparently, on Windows with Samsung Magician you can set the drive to performance mode, and that seems to stop it. I thought I could be smart and set it on Windows and then move the drive over, but it appears to require a Windows partition and is not an actual hardware flag.
SkunkDeRay@reddit
Bringing my 2 cents to this discussion ...
Setup:
Same issues here. Losing random drives, and therefore the RAID is degrading. If I reboot, the devices are not present in the enclosure (Controller Diag). I have to turn the server completely off and on again for the lost devices to reappear. Only then is it possible to rebuild the underlying virtual RAID drive. The 990s are not on BC's compatibility list, but I gave this setup a shot. I had expected a lot, but not such freak behaviour. Before seeing this post I had changed the OCuLink cables, and yesterday I updated one server's controller FW. I didn't expect the NVMEs themselves could be such a pain in the *ss.
Maybe someone benefits knowing about this ...
Rare_Airline1418@reddit
Thanks for sharing your experience. I have now moved to Micron and will report back if I encounter any problems. So far, after many reboots, no kernel panic at all.
tylerwatt12@reddit
Same problem. Across two different systems with completely different specs and different batches of drives. I’m no longer buying Samsung drives. This is ridiculous.
I’m running these on desktop boards with 12/13th Gen Intel i7 CPUs. One is my personal workstation in the office. Another is an NVR camera system. On my workstation, I switched to Inland’s fastest model SSDs and haven’t had an issue since.
These SSDs are not in RAID.
Rare_Airline1418@reddit
Interesting. So you're not even using a workstation mainboard (Supermicro, AsRock Rack) and not AMD, but Intel instead and still encounter the same problems? Do you use Samsung 990 Pros as well?
tylerwatt12@reddit
All 1TB 990 pros
Rare_Airline1418@reddit
Thanks. And the error messages, are they similar? Today, while the server was idle, I got 'Unable to change power state from D3cold to D0, device inaccessible' and then the disk disappeared from 'fdisk -l'.
Otherwise-Ad-424@reddit (OP)
I guess this should be further investigated: https://en.wikipedia.org/wiki/Active_State_Power_Management
tylerwatt12@reddit
I'm on Windows and I can only see that the drive effectively disappears from the system entirely, even after a system reset. Gone from BIOS. I have to pull the power and hold the power button for a few seconds, and that fixes it most of the time. It then works for anywhere between a week and a month. It's very random.
Rare_Airline1418@reddit
Today, once again an NVME (still 990 Pro 2 TB) disappeared, this time `/dev/nvme0n1`:
`Unable to change power state from D3cold to D0, device inaccessible
...
Disk failure on nvme0n1p3, disabling device.`
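Because the message is about a D3cold to D0 transition, it might help to record what the kernel reports about each controller's PCI power state while the drive is still present. A minimal sketch, assuming the running kernel exposes `power_state`, `d3cold_allowed`, `power/control` and `power/runtime_status` for the PCI device behind `/sys/class/nvme/<controller>/device` (older kernels may lack `power_state`); the function name is made up for illustration:

```python
from pathlib import Path

POWER_ATTRS = ("power_state", "d3cold_allowed",
               "power/control", "power/runtime_status")

def nvme_power_info(sys_root="/sys"):
    """Return {controller: {attribute: value or None}} for NVMe controllers."""
    info = {}
    base = Path(sys_root) / "class" / "nvme"
    if not base.is_dir():
        return info
    for ctrl in sorted(base.iterdir()):
        pci_dev = ctrl / "device"  # link to the underlying PCI device
        attrs = {}
        for name in POWER_ATTRS:
            try:
                attrs[name] = (pci_dev / name).read_text().strip()
            except OSError:
                attrs[name] = None  # attribute not available on this kernel
        info[ctrl.name] = attrs
    return info

if __name__ == "__main__":
    for ctrl, attrs in nvme_power_info().items():
        print(ctrl, attrs)
```

If `power/control` reads `auto`, runtime power management is allowed for that device; comparing these values between stable and unstable servers could narrow things down.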
Rare_Airline1418@reddit
Very interesting. I use a Supermicro H13SAE-MF and 2 x 990 Pro 2 TB and have the same problem with Debian 12 Bookworm. The secondary SSD /dev/nvme1n1 suddenly disappeared with the same error message:
[...] md/raid1:md0: Disk failure on nvme1n1p1, disabling device.
[...] md/raid1:md0: Operation continuing on 1 devices.
The device wasn't even visible in rescue (grml), so I updated the BMC from version 01.03.02 to 01.03.06 and the BIOS from 2.1 to 2.2. The SSD then came back, but a few days later I had a kernel panic and the server froze:
[...] ? do_wp_page+...
[...] ? srso_alias_return_thunk+...
[...] ? __handle_mm_fault+...
[...] ? srso_alias_return_thunk+...
[...] ? handle_mm_fault+...
[...] ? srso_alias_return_thunk+...
[...] ? do_user_addr_fault+...
...
I contacted Supermicro about that issue and they said they aren't aware of any issues. They also refused to provide any information about the BMC/BIOS changelog.
Otherwise-Ad-424@reddit (OP)
Same CPU, 7900. Super nice to see you also found it interesting to use the 7900 in a server setup. If you don't need many PCIe lanes, it's amazing.
After some reading, PCIe ASPM could be a reason. I'll continue to investigate. It mostly happens on the PCIe to NVMe adapter cards. So maybe signal integrity?
FYI, I also have Supermicro boards with Epyc and see this issue there too, but it's not frequent at all... I also have different firmware versions across my servers (~20). I don't see a correlation for now. Some have the "0" firmware and have been working flawlessly for years.
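One cheap check for the signal integrity theory is to see whether a drive behind an adapter has negotiated a lower PCIe link speed or width than it supports. A minimal sketch using the `current_link_speed`, `max_link_speed`, `current_link_width` and `max_link_width` sysfs attributes of the PCI device; the function names are made up for illustration, and a mismatch is only a hint, not proof of a bad adapter or slot.

```python
from pathlib import Path

LINK_ATTRS = ("current_link_speed", "max_link_speed",
              "current_link_width", "max_link_width")

def link_status(pci_dev_dir):
    """Read PCIe link attributes from a PCI device's sysfs directory."""
    dev = Path(pci_dev_dir)
    status = {}
    for name in LINK_ATTRS:
        try:
            status[name] = (dev / name).read_text().strip()
        except OSError:
            status[name] = None  # attribute missing or device gone
    return status

def is_downgraded(status):
    """True if the link runs below its maximum; None if data is incomplete."""
    if any(status.get(name) is None for name in LINK_ATTRS):
        return None
    return (status["current_link_speed"] != status["max_link_speed"]
            or status["current_link_width"] != status["max_link_width"])

if __name__ == "__main__":
    for ctrl in sorted(Path("/sys/class/nvme").glob("nvme*")):
        st = link_status(ctrl / "device")
        print(ctrl.name, st, "downgraded:", is_downgraded(st))
```

Comparing the output for drives on adapters against drives in onboard M.2 slots would show whether the adapters negotiate differently.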
Rare_Airline1418@reddit
New SSDs from Micron are on order and will hopefully be installed in a few days. I've been running a stress test for days with no kernel panic so far, so I think your assumption that it may have something to do with PCIe power management is plausible, since the kernel panic occurred right at the time of reboot.
Rare_Airline1418@reddit
I had to choose between the Ryzen 7900 (12C) and the EPYC 4464P (12C) and was unable to find out what the difference between the two is, since single and multithreaded performance seem to be exactly the same (according to PassMark), but the 7900 was cheaper. That's always a problem with hardware manufacturers: you need to invest so much time to find out how products differ ... I moved from Intel to AMD recently, so it was even more confusing (but as far as I know, AMD usually offers more PCIe lanes than Intel anyway).
My NVMEs directly sit on the mainboard. That you use the same CPU (or even AMD), is another interesting fact. The worst would be a software bug in AMD AGESA (https://en.wikipedia.org/wiki/AGESA), but I really don't know if that could be the case.
For now, I decided to drop Samsung, since I had stopped liking them anyway because of their inferior customer support, so I will go with Micron in future. A Micron 7400 (22110) might not be the most modern NVME, but it is still a robust product for my use case. If the problem then persists, I will have a problem 😳
Otherwise-Ad-424@reddit (OP)
More info: We use software RAID on Ubuntu Server. The disk is back after a power cycle.
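With software RAID, a dropped member shows up in `/proc/mdstat` as an underscore in the status field (for example `[U_]`), so a small watcher can raise an alert before the second disk goes. A minimal sketch; the sample text below is illustrative only (the block counts are invented, the device names follow the log in the original post), and the function name is made up.

```python
import re

ARRAY_RE = re.compile(r"^(md\w+) : ")
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")

def degraded_arrays(mdstat_text):
    """Return {array_name: status_string} for arrays with a missing member."""
    result = {}
    current = None
    for line in mdstat_text.splitlines():
        m = ARRAY_RE.match(line)
        if m:
            current = m.group(1)  # start of a new array entry
            continue
        s = STATUS_RE.search(line)
        if s and current:
            if "_" in s.group(3):  # "_" marks a missing or failed member
                result[current] = s.group(3)
            current = None
    return result

# Illustrative sample only, not taken from a real server.
SAMPLE = """\
Personalities : [raid1]
md1 : active raid1 nvme0n1p2[0] nvme1n1p2[1](F)
      1953382464 blocks super 1.2 [2/1] [U_]

md0 : active raid1 nvme0n1p1[0] nvme1n1p1[1]
      523264 blocks super 1.2 [2/2] [UU]

unused devices: <none>
"""

if __name__ == "__main__":
    try:
        with open("/proc/mdstat") as f:
            text = f.read()
    except OSError:
        text = SAMPLE  # fall back to the sample on systems without md
    print(degraded_arrays(text))
```

Logging the time at which an array first turns degraded also gives a second timestamp to correlate with the kernel messages.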