Call For Testing BSD Certification Group

ZFS Resilver Speed Comparisons

http://cft.lv/23

#ZFS

October 27th, 2014

Version 1.0

© Michael Dexter

Only 10 or so more variables to consider...

ZFS is one of those "minute to learn, lifetime to master" kinds of things, and the rapid progress being made by the OpenZFS project is good reason to verify your assumptions from time to time. I have long assumed that a ZFS pool of striped mirrored pairs (the RAID 1+0 equivalent, which I will refer to as "RaidZ10" at the risk of upsetting a few people) would resilver faster than equivalent RaidZ2 pools because the operation would not rely on any complex RaidZ allocation formulas. I also assumed it would achieve higher write speeds for the same reason. Here is what I did and what I found in a first round of assumption testing.

The Test Environment

I performed my tests with FreeNAS 9.2.1.7 because of its simplicity and consistency, using four identical 500GB Hitachi "DeathStar" drives attached to an LSI 9211-4i controller. diskinfo -v reports:

        512             # sectorsize
        500107862016    # mediasize in bytes (465G)
        976773168       # mediasize in sectors
        0               # stripesize
        0               # stripeoffset
        60801           # Cylinders according to firmware.
        255             # Heads according to firmware.
        63              # Sectors according to firmware.
        KRVN03ZAHV****  # Disk ident

The partitioning remained the same for each test with a 2GB FreeNAS default swap partition on each drive:

[root@freenas ~]# gpart show da0
=>       34  976773101  da0  GPT  (465G)
         34         94       - free -  (47k)
        128    4194304    1  freebsd-swap  (2.0G)
    4194432  972578696    2  freebsd-zfs  (463G)
  976773128          7       - free -  (3.5k)

I used a sample data set consisting of a large file full of /dev/random garbage created with the dd utility. dd has its strengths and weaknesses for this purpose and I will touch on a few of those later. Resist the temptation to use /dev/zero to generate test data: ZFS is smart enough to both recognize it as empty and compress it 100%, which would render the tests meaningless. A better data source would have known compressibility, and perhaps a video tool like ffmpeg could be used for this purpose: create a fixed-length video file in a specific format from a consistent source and then experiment with the in-video compression settings. I found various tools to create dummy database data sets but none for general-purpose use. This is worth investigating and I am open to suggestions.
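Short of a proper generator, one crude way to get a file of roughly known compressibility is to interleave random and zero-filled blocks. This is only a sketch under assumptions of my own (file name, block counts, and the use of /dev/urandom are all illustrative, not from any particular tool):

```shell
# Rough sketch: build a ~50% compressible test file by alternating
# random and zero-filled 1 MiB blocks. Adjust the loop count and the
# random/zero ratio to taste; lz4 should squeeze out about the zero half.
rm -f testfile.mixed
i=1
while [ "$i" -le 4 ]; do
    dd if=/dev/urandom bs=1048576 count=1 2>/dev/null >> testfile.mixed
    dd if=/dev/zero    bs=1048576 count=1 2>/dev/null >> testfile.mixed
    i=$((i + 1))
done
```

Varying the ratio of random to zero blocks would give a family of data sets with predictable compression ratios to resilver against.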

For these tests I used a 659GB file that represented a 75% fill of my smallest four-drive pool which happened to be the RaidZ2 layout. I used a Megabyte (1048576 bytes) block size for the sake of easy sizing but could have gone with anything that didn't take too long. The command I used for each test was:

dd if=/dev/random of=testfile bs=1048576 count=675000

I prefixed these with the time command and thought it might serve as a primitive write speed test:

A four-drive "RaidZ10" stripe of two mirrors pool, no compression
707788800000 bytes transferred in 9369.355315 secs (75542956 bytes/sec)

A four-drive "non-optimal" RaidZ1 pool, no compression
707788800000 bytes transferred in 9462.791934 secs (74797037 bytes/sec)

A four-drive RaidZ2 pool, no compression
707788800000 bytes transferred in 9605.649576 secs (73684637 bytes/sec)

A four-drive ZFS striped pool for comparison, no compression
707788800000 bytes transferred in 9363.161820 secs (75592926 bytes/sec)

I had not put much thought into the possibility of a human-friendly flag for time and found on another machine that FreeBSD's shell builtin does not support one but /usr/bin/time -h does. Fortunately, this forced me to break out a spreadsheet and attempt to get my conversions right once and for all.

The first conversion I investigated was time. I eventually found this Excel formula for pushing around seconds remainders:

=TIME(INT(A1/3600),INT(MOD(A1,3600)/60),MOD(MOD(A1,3600),60))
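The same conversion can be done at the shell with integer arithmetic, shown here on the 9369-second "RaidZ10" run from above (the fractional seconds are simply dropped):

```shell
# Convert a dd elapsed-seconds value to H:MM:SS with shell arithmetic.
secs=9369
printf '%d:%02d:%02d\n' $((secs / 3600)) $((secs % 3600 / 60)) $((secs % 60))
# prints 2:36:09
```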

The above data with a time conversion becomes:

A four-drive "RaidZ10" stripe of two mirrors pool, no compression
707788800000 bytes transferred in 2:36:09 (75542956 bytes/sec)

A four-drive "non-optimal" RaidZ1 pool, no compression
707788800000 bytes transferred in 2:37:42 (74797037 bytes/sec)

A four-drive RaidZ2 pool, no compression
707788800000 bytes transferred in 2:40:05 (73684637 bytes/sec)

A four-drive ZFS striped pool for comparison, no compression
707788800000 bytes transferred in 2:36:03 (75592926 bytes/sec)

That is, about two hours and 36 minutes for each of the initial writes.

Now for the bytes. I divided bytes by two to the power of twenty (1048576) for Megabytes and by two to the power of thirty (1073741824) for Gigabytes:

=A1/(2^20) for bytes to Megabytes and
=A1/(2^30) for bytes to Gigabytes

The above data with byte conversions becomes:

A four-drive "RaidZ10" stripe of two mirrors pool, no compression
659 Gbytes transferred in 2:36:09 (72.04 Mbytes/sec)

A four-drive "non-optimal" RaidZ1 pool, no compression
659 Gbytes transferred in 2:37:42 (71.33 Mbytes/sec)

A four-drive RaidZ2 pool, no compression
659 Gbytes transferred in 2:40:05 (70.27 Mbytes/sec)

A four-drive ZFS striped pool for comparison, no compression
659 Gbytes transferred in 2:36:03 (72.09 Mbytes/sec)

We now see that the write speed was pretty consistent at about 71 Megabytes per second.

The dd process ran at 100% CPU for all of these tests and gstat showed roughly 80% disk activity, with the exception of the RaidZ1 pool, which hummed along at roughly 35% disk activity. A faster data source than dd would be appreciated: the drives were most likely not the bottleneck, and removing dd from the equation should yield better write results.

As I was not sure what to look out for when I started the experiment, I ran these tests with compression disabled though I enabled lz4 compression for the resilver tests.

Note that the RaidZ1 "non-optimal" "imbalanced vdev" configuration had no issues keeping up with the other configurations. dd was probably the bottleneck here, but I should re-run the test with /usr/bin/time -h, compression enabled, and a better benchmarking tool.

On the subject of compression, note Matt Ahrens's excellent post, ZFS RAIDZ stripe width, or: How I Learned to Stop Worrying and Love RAIDZ, in which he states that "Due to compression, the physical (allocated) block sizes are not powers of two, they are odd sizes like 3.5KB or 6KB."

It would be great to confirm that "vdev balancing" is obsolete voodoo with lz4 compression.

Capacities

Here is the df -h output before and after the random dummy data was added to each pool with lz4 compression:

Filesystem             Size    Used   Avail Capacity  Mounted on
z10                    905G    144k    905G     0%    /mnt/z10
z10                    905G    659G    245G    73%    /mnt/z10
z1                     1.3T    209k    1.3T     0%    /mnt/z1
z1                     1.3T    659G    656G    50%    /mnt/z1
z2                     877G    209k    877G     0%    /mnt/z2
z2                     877G    659G    217G    75%    /mnt/z2
zstripe                1.8T    659G    1.1T    36%    /mnt/zstripe
zfs500                 452G    144k    452G     0%    /mnt/zfs500
/dev/ufs/ufs500        449G    8.0k    413G     0%    /mnt/ufs500

At the end you will see single ZFS and UFS-formatted drives for comparison. I was quite surprised that the UFS drive appears to have a smaller capacity than the single ZFS-formatted drive.

The Actual Tests

With each pool configuration populated with the fixed-size data set, I proceeded to Offline member drive da0, Wipe it and Replace it using the FreeNAS GUI.
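For anyone working outside the FreeNAS GUI, those steps correspond roughly to the following zpool(8) commands. This is a sketch only: the pool name is taken from this setup, but the member device naming is an assumption (FreeNAS typically addresses members by gptid rather than by daNpN), so check zpool status for the exact label before running anything like this.

```shell
# Take the member offline, then replace it in place; the resilver
# starts automatically once the replacement is attached.
zpool offline z10 da0p2
zpool replace z10 da0p2
zpool status z10    # watch the resilver progress
```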

I made passing note of the resilver throughput as reported by zpool status. If anything, the throughput quickly peaked and slowed somewhat as the process neared completion.

Here is some incremental zpool status output from the "RaidZ10" drive resilver which sustained around 110M/s throughout the process:

4.90G scanned out of 660G at 98.4M/s, 1h53m to go
2.45G resilvered, 0.74% done

11.3G scanned out of 660G at 108M/s, 1h42m to go
5.65G resilvered, 1.71% done

15.5G scanned out of 660G at 110M/s, 1h40m to go
7.77G resilvered, 2.36% done

537G scanned out of 660G at 102M/s, 0h20m to go
268G resilvered, 81.34% done

A RaidZ1 Sample:

46.8G scanned out of 908G at 168M/s, 1h27m to go
11.2G resilvered, 5.15% done

A RaidZ2 Sample:

12.5G scanned out of 1.33T at 156M/s, 2h27m to go
3.02G resilvered, 0.92% done

That's a pretty significant difference in reported throughput and, surprisingly, it has little to do with how long the resilver took to complete. I will attribute these differences to the aggregate throughput of all drives involved in the resilver. A mirrored drive that is part of a larger stripe has only one other drive to restore data from, while RaidZ pools draw from multiple member drives, and gstat output confirms this assumption. I have had clients set up stripes of triple mirrors and it would be interesting to see whether a triple mirror resilvers faster than a two-way mirror. The write would remain the same, but it could read from two drives, and perhaps a real-world load during the resilver would give it an advantage.
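For reference, a stripe of triple mirrors like the ones those clients run could be created along these lines. The pool and device names are hypothetical, and six drives are assumed:

```shell
# Sketch only: a two-vdev stripe of three-way mirrors. During a
# resilver, each mirror can read from its two surviving members.
zpool create ztriple \
    mirror da0 da1 da2 \
    mirror da3 da4 da5
```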

The Results

A four-drive "RaidZ10" stripe of two mirrors pool, lz4 compression
scan: resilvered 330G in 1h55m with 0 errors on Tue Oct  7 12:19:18 2014

A four-drive "RaidZ10" stripe of two mirrors pool, no compression
scan: resilvered 330G in 1h54m with 0 errors on Thu Oct  9 01:53:13 2014

A four-drive "non-optimal" RaidZ1 pool, lz4 compression
scan: resilvered 217G in 1h35m with 0 errors on Thu Oct  9 21:40:57 2014

A four-drive RaidZ2 pool, lz4 compression
scan: resilvered 330G in 2h31m with 0 errors on Tue Oct  7 19:35:57 2014

A four-drive RaidZ2 pool, no compression
scan: resilvered 330G in 2h31m with 0 errors on Thu Oct  9 16:06:44 2014

First off, we see that enabling or disabling compression had essentially zero impact on the resilver speed of random data. This makes sense: random data is incompressible, so lz4 ends up storing it uncompressed either way. Whether this would change with known-compressible data is worth testing.

Including RaidZ1 was an afterthought because of its "non-optimal" configuration, but it turned out to be the resilver speed winner. Given that the data set consumed only 50% of that pool as opposed to 75% with RaidZ2, the resilvered drive required only 217G of replacement data as opposed to the 330G of the other configurations. Rerunning this test at the same 75% fill, with partition sizes adjusted accordingly, should prove enlightening.

Finally, yes, the stripe of two mirrors configuration completed the resilver a good 30 minutes faster than the RaidZ2 configuration.

Now let's hope the controller cards for my 16 drive lab array arrive sooner rather than later.

CFT

Copyright © 2011 – 2014 Michael Dexter unless specified otherwise. Feedback and corrections welcome.