Replacing a (silently) failing disk in a ZFS pool

Maybe I can’t read, but I have the feeling that official documentation explains every single corner case for a given tool, except the one you will actually need. Today’s struggle: replacing a disk in a FreeBSD ZFS pool.

What? There’s a shitton of docs on this topic! Are you stupid?

I don’t know, maybe. Yet none of them covered the process in a simple, straightforward and complete manner. Here’s the story:

Since yesterday my personal FreeBSD NAS had felt sluggish, and this morning I saw these horrible messages popping up in my syslog console:

Jul  2 12:49:53 <kern.crit> newcoruscant kernel: ahcich1: Timeout on slot 8 port 0
Jul  2 12:49:53 <kern.crit> newcoruscant kernel: ahcich1: is 00000000 cs 00000000 ss 00000300 rs 00000300 tfd 40 serr 00000000 cmd 0000c917
Jul  2 12:49:53 <kern.crit> newcoruscant kernel: (ada1:ahcich1:0:0:0): READ_FPDMA_QUEUED. ACB: 60 08 50 25 e9 40 3b 00 00 00 00 00
Jul  2 12:49:53 <kern.crit> newcoruscant kernel: (ada1:ahcich1:0:0:0): CAM status: Command timeout
Jul  2 12:49:53 <kern.crit> newcoruscant kernel: (ada1:ahcich1:0:0:0): Retrying command
Jul  2 12:51:02 <kern.crit> newcoruscant kernel: cant/memory/memory-inactive: ds[0] = 52350976.000000
Jul  2 12:51:02 <kern.crit> newcoruscant kernel: ahcich1: AHCI reset: device not ready after 31000ms (tfd = 00000080)

Yeah… that bad.
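To get a quick idea of which disk is misbehaving and how often, the timeout lines above can be tallied per device. This is only a rough sketch: the function name count_cam_timeouts is mine, and it assumes the log lines keep the "(ada1:ahcich1:0:0:0):" prefix shown above.

```shell
#!/bin/sh
# Count "CAM status: Command timeout" lines per device in syslog text
# read from stdin. The device name is taken from the "(ada1:ahcich1:...)"
# token that precedes the message.
count_cam_timeouts() {
    awk '
        /CAM status: Command timeout/ {
            for (i = 1; i <= NF; i++) {
                if (substr($i, 1, 1) == "(") {
                    # Drop the opening parenthesis, keep what precedes the first colon.
                    split(substr($i, 2), parts, ":")
                    count[parts[1]]++
                    break
                }
            }
        }
        END {
            for (dev in count) print dev, count[dev]
        }
    ' | sort
}

# Usage (the log path is an assumption, adjust it to your syslog setup):
#   count_cam_timeouts < /var/log/messages
```

A disk that shows up here again and again deserves a closer look, whatever zpool says about it.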

The first thing that struck me was that ZFS seemed perfectly fine with that:

root@newcoruscant:~ # zpool status
  pool: zroot
 state: ONLINE
  scan: scrub repaired 0 in 2h26m with 0 errors on Tue Jun 25 12:08:56 2019
config:

	NAME        STATE     READ WRITE CKSUM
	zroot       ONLINE       0     0     0
	  raidz1-0  ONLINE       0     0     0
	    ada0p4  ONLINE       0     0     0
	    ada1p4  ONLINE       0     0     0
	    ada2p4  ONLINE       0     0     0

errors: No known data errors

But the input/output error thrown by smartctl -a /dev/ada1 made things clear: I needed to replace this disk quickly!
Thanks to past-me, there was already a disk ready for this task at ada3, so after trustingly reading the zpool administration guide, and in particular the section Replacing a Functioning Device, I entered:

# zpool replace zroot ada1p4 ada3p4

Except it didn’t run as expected:

cannot open 'ada3p4': no such GEOM provider
must be a full path or shorthand device name

What a fantastic and explicit error message just to say that ada3 doesn’t have a matching partition table.
I am no FreeBSD guru, only a very occasional user, so no, I am not used to GEOM, gpart, GELI, etc. In the end, a very well written Stack Exchange post showed me how to replicate the correct partition table onto the new disk:

# gpart backup ada0 | gpart restore -F ada3

Now zpool replace zroot ada1p4 ada3p4 would work! I also did not forget to copy the boot code to the new disk, as instructed by both the documentation and zpool itself (warning: on my disks freebsd-boot is on ada3p2, hence the -i 2):

# gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 2 ada3 
partcode written to ada3p2
bootcode written to ada3
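The -i value has to match the index of the freebsd-boot partition, which is not the same on every install. Rather than guessing, it can be read from gpart show. Below is a small sketch; the function name boot_partition_index is mine, and it assumes the usual gpart show layout where the third column is the partition index and the fourth is the partition type.

```shell
#!/bin/sh
# Print the index of the first freebsd-boot partition, given the output
# of `gpart show <disk>` on stdin. Prints nothing if there is none.
boot_partition_index() {
    awk '$4 == "freebsd-boot" { print $3; exit }'
}

# Usage:
#   gpart show ada3 | boot_partition_index
```

If it prints nothing, the disk has no freebsd-boot partition and the gpart bootcode command above does not apply as is.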

And at last the resilvering was taking place:

root@newcoruscant:~ # zpool status
  pool: zroot
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Jul  2 11:21:24 2019
	3.91M scanned out of 1.84T at 38.5K/s, (scan is slow, no estimated time)
        1.30M resilvered, 0.00% done
config:

	NAME             STATE     READ WRITE CKSUM
	zroot            ONLINE       0     0     0
	  raidz1-0       ONLINE       0     0     0
	    ada0p4       ONLINE       0     0     0
	    replacing-1  ONLINE       0     0     0
	      ada1p4     ONLINE       0     0     0
	      ada3p4     ONLINE       0     0     0
	    ada2p4       ONLINE       0     0     0

errors: No known data errors

But… at less than 40K/s! It turns out, quite logically, that the failing disk and its timeouts were slowing down the resilvering. So I learned that, to avoid this kind of situation, you should offline the failing disk from the zpool:

# zpool offline zroot ada1p4

And then

$ sudo zpool status
  pool: zroot
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Jul  2 16:01:22 2019
	514G scanned out of 1.84T at 167M/s, 2h20m to go
        170G resilvered, 27.22% done
config:

	NAME                        STATE     READ WRITE CKSUM
	zroot                       DEGRADED     0     0     0
	  raidz1-0                  DEGRADED     0     0     0
	    ada0p4                  ONLINE       0     0     0
	    replacing-1             DEGRADED     0     0     8
	      15084350875675872541  OFFLINE      0     0     0  was /dev/ada1p4
	      ada3p4                ONLINE       0     0     0
	    ada2p4                  ONLINE       0     0     0

errors: No known data errors

Much better. Once the resilvering finished, everything was working correctly again:

$ sudo zpool status
  pool: zroot
 state: ONLINE
  scan: resilvered 628G in 2h52m with 0 errors on Tue Jul  2 18:53:48 2019
config:

	NAME        STATE     READ WRITE CKSUM
	zroot       ONLINE       0     0     0
	  raidz1-0  ONLINE       0     0     0
	    ada0p4  ONLINE       0     0     0
	    ada3p4  ONLINE       0     0     0
	    ada2p4  ONLINE       0     0     0

errors: No known data errors
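Staring at zpool status gets old quickly, so here is a small sketch that lists only the entries whose state is not ONLINE. The function name unhealthy_vdevs is mine, and the parsing assumes the config table layout shown in the outputs above. Keep in mind that it only sees what ZFS reports: as this story shows, a disk can be dying while the pool still says ONLINE.

```shell
#!/bin/sh
# Read `zpool status` output on stdin and print "name state" for every
# entry of the config table whose STATE column is not ONLINE.
unhealthy_vdevs() {
    awk '
        # The table starts right after the "NAME STATE ..." header line.
        $1 == "NAME" && $2 == "STATE" { in_config = 1; next }
        # The "errors:" line closes the table for the current pool.
        /^errors:/ { in_config = 0 }
        in_config && NF >= 2 && $2 != "ONLINE" { print $1, $2 }
    '
}

# Usage:
#   zpool status | unhealthy_vdevs
```

No output means every listed device is ONLINE. For a quick look without any parsing, zpool status -x only reports pools that have a problem.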

I read that you should zpool remove the failing disk at the end of this operation, but when trying to do so:

root@newcoruscant:~ # zpool remove zroot ada1p4
cannot remove ada1p4: no such device in pool
root@newcoruscant:~ # zpool remove zroot 15084350875675872541
cannot remove 15084350875675872541: no such device in pool

So zpool did it itself: once the resilver completes, zpool replace detaches the old device automatically.
Now it’s time to buy and add a new spare for the next disk that fails…
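For future me, here is the whole procedure in the order I would follow next time, with the offline step first. This is a sketch, not a tested tool: the function name replace_disk_plan and its arguments are mine, and it only prints the commands so they can be reviewed before being typed as root. Note that on a raidz1, offlining the failing disk leaves the pool without redundancy until the resilver is done.

```shell
#!/bin/sh
# Print (do not run) the command sequence used in this post to swap a
# failing disk. Every argument is a placeholder to adapt to your system.
replace_disk_plan() {
    pool=$1      # pool name, e.g. zroot
    old_part=$2  # failing vdev, e.g. ada1p4
    ref_disk=$3  # healthy disk whose partition table gets copied, e.g. ada0
    new_disk=$4  # replacement disk, e.g. ada3
    zfs_idx=$5   # index of the freebsd-zfs partition, e.g. 4
    boot_idx=$6  # index of the freebsd-boot partition, e.g. 2

    cat <<EOF
zpool offline $pool $old_part
gpart backup $ref_disk | gpart restore -F $new_disk
zpool replace $pool $old_part ${new_disk}p${zfs_idx}
gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i $boot_idx $new_disk
zpool status $pool
EOF
}

# Usage, with the values from this post:
#   replace_disk_plan zroot ada1p4 ada0 ada3 4 2
```

If the old device is no longer listed under its name, the numeric identifier shown by zpool status (15084350875675872541 in my case) should be usable in its place.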