Samsung SSD 990 Pro in RAID 1 on Servers - Disks Vanishing Issue

Posted by Otherwise-Ad-424@reddit | buildapc

First Reddit post!

I've seen very technical questions and issues discussed on Reddit, so here I am!

We have been using Samsung 990 Pro SSDs in several servers. We are aware that they lack the power-loss protection of a PM9A3, but they are much faster, which makes them practical for many use cases.

For some of our servers we use this motherboard: ASRock B650D4U-2L2T. To fit two SSDs in RAID 1, we use PCIe to M.2 adapters (like this one):

[PCIe to M.2](

Some servers are very stable; others seem to "lose" one drive once in a while. We don't know why, but this is what we get from the syslog/kernel log on Linux:

[136244.461088] nvme nvme1: I/O 177 QID 7 timeout, aborting
[136244.461105] nvme nvme1: I/O 852 QID 12 timeout, aborting
[136244.461112] nvme nvme1: I/O 853 QID 12 timeout, aborting
[136244.557074] nvme nvme1: I/O 309 QID 3 timeout, aborting
[136275.185578] nvme nvme1: I/O 309 QID 3 timeout, reset controller
[136281.325896] nvme nvme1: I/O 18 QID 0 timeout, reset controller
[136357.126884] nvme nvme1: Device not ready; aborting reset, CSTS=0x1
[136357.159275] nvme nvme1: Abort status: 0x371
[136357.159278] nvme nvme1: Abort status: 0x371
[136357.159279] nvme nvme1: Abort status: 0x371
[136357.159280] nvme nvme1: Abort status: 0x371
[136377.703231] nvme nvme1: Device not ready; aborting reset, CSTS=0x1
[136377.703256] nvme nvme1: Removing after probe failure status: -19
[136398.247561] nvme nvme1: Device not ready; aborting reset, CSTS=0x1
[136398.247959] nvme1n1: detected capacity change from 3907029168 to 0
[136398.247963] blk_update_request: I/O error, dev nvme1n1, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0
[136398.247965] blk_update_request: I/O error, dev nvme1n1, sector 687804416 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
[136398.247969] blk_update_request: I/O error, dev nvme1n1, sector 2599914832 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
[136398.247980] blk_update_request: I/O error, dev nvme1n1, sector 2203664 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0
[136398.247988] md/raid1:md1: Disk failure on nvme1n1p2, disabling device.
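
To compare the stable servers with the unstable ones, it may help to condense a log like the one above into per-controller counts. The following is only a minimal sketch, assuming the input is plain `dmesg` output in the format shown here; the function name `summarize_nvme_events` and the summary keys are made up for illustration.

```python
import re
from collections import defaultdict

# Matches lines such as:
# "[136244.461088] nvme nvme1: I/O 177 QID 7 timeout, aborting"
LINE_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+nvme (?P<ctrl>nvme\d+): (?P<msg>.*)$"
)


def summarize_nvme_events(lines):
    """Group nvme driver messages by controller and count notable events."""
    summary = defaultdict(
        lambda: {"timeouts": 0, "resets": 0, "first_ts": None, "removed_ts": None}
    )
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not an nvme driver message in the assumed format
        ts = float(m.group("ts"))
        msg = m.group("msg")
        entry = summary[m.group("ctrl")]
        if entry["first_ts"] is None:
            entry["first_ts"] = ts  # first nvme message seen for this controller
        if "timeout" in msg:
            entry["timeouts"] += 1
        if "reset controller" in msg:
            entry["resets"] += 1
        if msg.startswith("Removing after probe failure"):
            entry["removed_ts"] = ts  # the kernel gave up on the controller
    return dict(summary)
```

Feeding it the saved output of `dmesg` from each server (for example, an open file object) returns one dictionary per controller, so the time between the first timeout and the removal can be compared across machines.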

Because this RAID 1 array holds the system partition, a drive dropping out sometimes affects system stability.

We investigated whether this could be a firmware issue, but firmware 3B2QJXD7 seems to be relatively stable (although 4B2QJXD7 does exist).
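
To rule out a firmware mismatch between the stable and unstable servers, the model and firmware revision of every controller can be collected in one pass. This is a rough sketch, assuming the Linux sysfs layout under `/sys/class/nvme` with `model` and `firmware_rev` attributes; the function name `list_nvme_firmware` is hypothetical.

```python
from pathlib import Path


def list_nvme_firmware(sysfs_root="/sys/class/nvme"):
    """Return {controller: (model, firmware_rev)} for each NVMe controller found."""
    result = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return result  # no NVMe class directory on this system
    for ctrl in sorted(root.iterdir()):
        model_file = ctrl / "model"
        fw_file = ctrl / "firmware_rev"
        if model_file.is_file() and fw_file.is_file():
            result[ctrl.name] = (
                model_file.read_text().strip(),
                fw_file.read_text().strip(),
            )
    return result
```

A controller that has already dropped off the bus will simply be missing from the result, which is itself a quick way to spot an affected server.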

Does anyone have good advice on how to find the root cause of the disks randomly disconnecting?

smartctl reports no specific issues. Are there any other logs to check besides syslog and dmesg? Could this be related to temperature, as the more active disks appear to be more affected?
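
One way to test the temperature theory would be to log drive temperatures periodically and line them up with the timestamps of the timeouts. The sketch below assumes the kernel's NVMe hwmon support is enabled, so that each drive appears under `/sys/class/hwmon` with the name `nvme` and a `temp1_input` file in millidegrees Celsius; the function name `read_nvme_temps` is made up for this example.

```python
from pathlib import Path


def read_nvme_temps(hwmon_root="/sys/class/hwmon"):
    """Return {hwmon_device: temperature_in_celsius} for NVMe hwmon devices."""
    temps = {}
    root = Path(hwmon_root)
    if not root.is_dir():
        return temps
    for dev in sorted(root.iterdir()):
        name_file = dev / "name"
        temp_file = dev / "temp1_input"
        if not (name_file.is_file() and temp_file.is_file()):
            continue
        if name_file.read_text().strip() != "nvme":
            continue  # some other sensor, e.g. the CPU
        # hwmon reports temperatures in millidegrees Celsius
        temps[dev.name] = int(temp_file.read_text().strip()) / 1000.0
    return temps
```

Calling this from a cron job or a small loop and appending the result to a file with a timestamp would show whether the affected drive was running hot shortly before it disappeared.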