illumos NVMe hotplug
Last year I replaced a failed Samsung SSD 980 PRO 1TB with a Crucial CT1000P3SSD8; that was all I had budget for at the time, and it’s not a great NVMe drive overall. Last week it started throwing the occasional error.
Time to replace it with a Samsung SSD 9100 PRO 1TB, but on hard mode: no powering down the system just to figure out which bay it’s in, and no powering down to replace it!
hotplug
I happened to see a few CRs from Oxide having to do with NVMe/PCIe hotplug and LED control, and used those as hints for which commands to use.
First let’s figure out if we even have an LED to identify the bay! Let’s collect some information about the NVMe device.
root@jupiter:~# diskinfo
...
NVME c7t00A0750146DDE966d0 NVMe CT1000P3SSD8 931.51 GiB no yes
...
root@jupiter:~# /usr/lib/pci/pcieadm show-devs -o vendor,device,driver,bdf,path nvme
...
Micron Technology Inc 2550 NVMe SSD (DRAM-less) nvme 18/0/0 /pci@13,0/pci8086,2031@1/pci1344,1100@0
...
I already knew the WWN from the zpool status output where I noticed the occasional error, but it’s good to confirm it’s indeed the device I was thinking of. I used pcieadm to get some more info; the naming is a bit different, but there was only one device with the nvme driver attached that matched what I was looking for.
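To double-check, nvmeadm can also report the model and serial straight from the controller; a quick sketch I didn’t actually need (the nvme3 instance name is taken from the dmesg output further down):
root@jupiter:~# nvmeadm list
root@jupiter:~# nvmeadm identify nvme3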
root@jupiter:~# hotplug list -l /pci@13,0/pci8086,2031@1
/pci@13,0/pci8086,2031@1
/pci@13,0/pci8086,2031@1 [pcie2] (ENABLED)
/pci@13,0/pci8086,2031@1 <pci.0,0> (ONLINE)
/pci@13,0/pci8086,2031@1/pci1344,1100@0
/pci@13,0/pci8086,2031@1/pci1344,1100@0/blkdev@w00A0750146DDE966,0
The listing shows both the connector (pcie2, in square brackets) and the port (pci.0,0, in angle brackets); we’ll need both below. Let’s see if the slot has an LED we can toggle!
root@jupiter:~# hotplug get -o all /pci@13,0/pci8086,2031@1 pcie2
power_led=default
attn_led=default
card_type=nvme
board_type=pci hotplug
slot_condition=ok
Looks like we do :) Let’s make it blink!
root@jupiter:~# hotplug set -o power_led=blink /pci@13,0/pci8086,2031@1 pcie2
root@jupiter:~# hotplug get -o power_led /pci@13,0/pci8086,2031@1 pcie2
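If your enclosure wires up the attention LED rather than the power LED, the same set syntax should work with the attn_led property from the output above; a guess on my part, since mine used the power LED:
root@jupiter:~# hotplug set -o attn_led=blink /pci@13,0/pci8086,2031@1 pcie2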
I verified that one of the two NVMe bays had a blinking LED! Nice. Time to offline the device and remove the sled (drive tray).
root@jupiter:~# zpool offline rpool c7t00A0750146DDE966d0
root@jupiter:~# hotplug offline /pci@13,0/pci8086,2031@1 pci.0,0
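You could also power the connector off entirely before pulling the sled; I didn’t bother, but something like this should do it (untested sketch):
root@jupiter:~# hotplug poweroff /pci@13,0/pci8086,2031@1 pcie2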
I removed the sled and swapped the NVMe drive. I was expecting to have to online it once inserted, but that just worked without doing anything!
root@jupiter:~# dmesg | tail -n 25
Jun 23 17:11:35 jupiter genunix: [ID 408114 kern.info] /pci@13,0/pci8086,2031@1/pci1344,1100@0/blkdev@w00A0750146DDE966,0 (blkdev5) offline
Jun 23 17:11:35 jupiter genunix: [ID 408114 kern.info] /pci@13,0/pci8086,2031@1/pci1344,1100@0 (nvme3) offline
Jun 23 17:11:46 jupiter pcie: [ID 661617 kern.notice] NOTICE: pciehpc (pcieb4): card is inserted in the slot pcie2
Jun 23 17:11:46 jupiter pcie: [ID 126225 kern.notice] NOTICE: pciehpc (pcieb4): card is removed from the slot pcie2
Jun 23 17:16:33 jupiter pcie: [ID 661617 kern.notice] NOTICE: pciehpc (pcieb4): card is inserted in the slot pcie2
Jun 23 17:16:33 jupiter pcie: [ID 661617 kern.notice] NOTICE: pciehpc (pcieb4): card is inserted in the slot pcie2
Jun 23 17:16:33 jupiter nvme: [ID 259564 kern.info] nvme1: NVMe spec version 2.0
Jun 23 17:16:33 jupiter blkdev: [ID 643073 kern.info] NOTICE: blkdev6: dynamic LUN expansion
Jun 23 17:16:33 jupiter blkdev: [ID 348765 kern.info] Block device: blkdev@w002538A5514092D8,0, blkdev6
Jun 23 17:16:33 jupiter genunix: [ID 936769 kern.info] blkdev6 is /pci@13,0/pci8086,2031@1/pci144d,a801@0/blkdev@w002538A5514092D8,0
Jun 23 17:16:33 jupiter genunix: [ID 408114 kern.info] /pci@13,0/pci8086,2031@1/pci144d,a801@0/blkdev@w002538A5514092D8,0 (blkdev6) online
Jun 23 17:16:33 jupiter genunix: [ID 408114 kern.info] /pci@13,0/pci8086,2031@1/pci144d,a801@0 (nvme1) online
Jun 23 17:16:39 jupiter genunix: [ID 408114 kern.info] /pci@13,0/pci8086,2031@1/pci144d,a801@0/blkdev@w002538A5514092D8,0 (blkdev6) offline
Jun 23 17:16:39 jupiter genunix: [ID 408114 kern.info] /pci@13,0/pci8086,2031@1/pci144d,a801@0 (nvme1) offline
Jun 23 17:16:39 jupiter pcie: [ID 661617 kern.notice] NOTICE: pciehpc (pcieb4): card is inserted in the slot pcie2
Jun 23 17:16:40 jupiter pcie: [ID 661617 kern.notice] NOTICE: pciehpc (pcieb4): card is inserted in the slot pcie2
Jun 23 17:16:40 jupiter nvme: [ID 259564 kern.info] nvme1: NVMe spec version 2.0
Jun 23 17:16:40 jupiter blkdev: [ID 643073 kern.info] NOTICE: blkdev6: dynamic LUN expansion
Jun 23 17:16:40 jupiter blkdev: [ID 348765 kern.info] Block device: blkdev@w002538A5514092D8,0, blkdev6
Jun 23 17:16:40 jupiter genunix: [ID 936769 kern.info] blkdev6 is /pci@13,0/pci8086,2031@1/pci144d,a801@0/blkdev@w002538A5514092D8,0
Jun 23 17:16:40 jupiter genunix: [ID 408114 kern.info] /pci@13,0/pci8086,2031@1/pci144d,a801@0/blkdev@w002538A5514092D8,0 (blkdev6) online
Jun 23 17:16:40 jupiter genunix: [ID 408114 kern.info] /pci@13,0/pci8086,2031@1/pci144d,a801@0 (nvme1) online
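Had it not attached by itself, the inverse of the earlier offline should have brought it back; a sketch I never needed to run:
root@jupiter:~# hotplug online /pci@13,0/pci8086,2031@1 pci.0,0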
Just one zpool command left to replace the device.
root@jupiter:~# zpool replace rpool c7t00A0750146DDE966d0 c3t002538A5514092D8d0
root@jupiter:~# zpool status rpool
  pool: rpool
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Jun 23 17:17:30 2025
        164G scanned at 1.12G/s, 10.6G issued at 74.0M/s, 164G total
        11.4G resilvered, 6.47% done, 0 days 00:35:26 to go
config:

        NAME                         STATE     READ WRITE CKSUM
        rpool                        DEGRADED     0     0     0
          mirror-0                   DEGRADED     0     0     0
            replacing-0              DEGRADED     0     0     0
              c7t00A0750146DDE966d0  OFFLINE      0     0     0
              c3t002538A5514092D8d0  ONLINE       0     0     0  (resilvering)
            c2t002538B331B18B22d0    ONLINE       0     0     0

errors: No known data errors
And that was it; about five minutes later the resilver was already done.
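One loose end: the bay LED is presumably still blinking, so put it back to its default with the same set command as before:
root@jupiter:~# hotplug set -o power_led=default /pci@13,0/pci8086,2031@1 pcie2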