Only 10 or so more variables to consider...
ZFS is one of those "minute to learn, lifetime to master" kinds of things and the rapid progress being made by the OpenZFS project is good reason to verify your assumptions from time to time. I have long assumed that a ZFS pool of striped mirrored pairs (RAID 1+0 equivalent which I will refer to as "RaidZ10" at the risk of upsetting a few people) would resilver faster than equivalent RaidZ2 pools because the operation would not rely on any complex RaidZ allocation formulas. I also assumed they would achieve higher write speeds for the same reason. Here's what I did and what I found for a first round of assumption testing.
I performed my tests with FreeNAS 184.108.40.206 because of its simplicity and consistency, using four identical 500GB Hitachi "DeathStar" drives attached to an LSI 9211-4i controller.
diskinfo -v reports:
    512             # sectorsize
    500107862016    # mediasize in bytes (465G)
    976773168       # mediasize in sectors
    0               # stripesize
    0               # stripeoffset
    60801           # Cylinders according to firmware.
    255             # Heads according to firmware.
    63              # Sectors according to firmware.
    KRVN03ZAHV****  # Disk ident
The partitioning remained the same for each test with a 2GB FreeNAS default swap partition on each drive:
    [root@freenas ~]# gpart show da0
    =>        34  976773101  da0  GPT  (465G)
              34         94       - free -  (47k)
             128    4194304    1  freebsd-swap  (2.0G)
         4194432  972578696    2  freebsd-zfs  (463G)
       976773128          7       - free -  (3.5k)
I used a sample data set consisting of a large file full of /dev/random garbage created with the dd command. dd has its strengths and weaknesses for this purpose and I will touch on a few of those later. Resist the temptation to use /dev/zero to generate data, as ZFS is smart enough to both recognize it as empty and compress it 100%, which would render the tests meaningless. A better data source would have known compressibility, and perhaps a video tool like ffmpeg could be used for this purpose: create a fixed-length video file in a specific format with a consistent source and then experiment with the in-video compression settings. I found various tools to create dummy database datasets but none for general-purpose use. This is worth investigating and I am open to suggestions.
For these tests I used a 659GB file that represented a 75% fill of my smallest four-drive pool which happened to be the RaidZ2 layout. I used a Megabyte (1048576 bytes) block size for the sake of easy sizing but could have gone with anything that didn't take too long. The command I used for each test was:
dd if=/dev/random of=testfile bs=1048576 count=675000
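The sizing behind that command can be checked with a little arithmetic. The following is a minimal Python sketch, not part of the original test procedure; the function names are hypothetical, and the 877G pool size is the RaidZ2 figure that df -h reports further down.

```python
# Sizing arithmetic for the dd test file (illustrative names, not from the tests).
BLOCK_SIZE = 1048576   # bs= value from the dd command (2^20 bytes)
BLOCK_COUNT = 675000   # count= value from the dd command


def dd_total_bytes(block_size, block_count):
    """Total bytes dd writes for a given bs and count."""
    return block_size * block_count


def fill_fraction(file_bytes, pool_gib):
    """Fraction of a pool (size in units of 2^30 bytes) that the file occupies."""
    return file_bytes / (pool_gib * 2**30)


total = dd_total_bytes(BLOCK_SIZE, BLOCK_COUNT)
print(total)                                # 707788800000
print(round(total / 2**30, 2))              # 659.18
print(round(fill_fraction(total, 877), 3))  # 0.752
```

The byte total matches what dd reports in the results, and the fill fraction lands at roughly 75% of the RaidZ2 pool.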
I prefixed these with the time command and thought it might serve as a primitive write speed test:
A four-drive "RaidZ10" stripe of two mirrors pool, no compression
    707788800000 bytes transferred in 9369.355315 secs (75542956 bytes/sec)

A four-drive "non-optimal" RaidZ1 pool, no compression
    707788800000 bytes transferred in 9462.791934 secs (74797037 bytes/sec)

A four-drive RaidZ2 pool, no compression
    707788800000 bytes transferred in 9605.649576 secs (73684637 bytes/sec)

A four-drive ZFS striped pool for comparison, no compression
    707788800000 bytes transferred in 9363.161820 secs (75592926 bytes/sec)
I had not put much thought into the possibility of a human-friendly flag for time and found on another machine that the shell builtin on FreeBSD does not support it but /usr/bin/time does. Fortunately, this forced me to break out a spreadsheet and attempt to get my conversions right once and for all.
The first conversion I investigated was time. I eventually found this Excel formula for pushing around seconds remainders:
The above data with a time conversion becomes:
A four-drive "RaidZ10" stripe of two mirrors pool, no compression
    707788800000 bytes transferred in 2:36:09 (75542956 bytes/sec)

A four-drive "non-optimal" RaidZ1 pool, no compression
    707788800000 bytes transferred in 2:37:42 (74797037 bytes/sec)

A four-drive RaidZ2 pool, no compression
    707788800000 bytes transferred in 2:40:05 (73684637 bytes/sec)

A four-drive ZFS striped pool for comparison, no compression
    707788800000 bytes transferred in 2:36:03 (75592926 bytes/sec)
That is, between about two hours 36 minutes and two hours 40 minutes for each of the initial writes.
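For readers who prefer a script to a spreadsheet, the same seconds-to-h:mm:ss conversion can be sketched in Python. This is an illustration rather than the formula used for the figures above; it assumes fractional seconds are dropped rather than rounded, and the function name to_hms is hypothetical.

```python
def to_hms(seconds):
    """Convert elapsed seconds to h:mm:ss, dropping the fractional second."""
    whole = int(seconds)
    hours, remainder = divmod(whole, 3600)   # whole hours, leftover seconds
    minutes, secs = divmod(remainder, 60)    # whole minutes, leftover seconds
    return f"{hours}:{minutes:02d}:{secs:02d}"


# Elapsed times reported by dd for the four pool layouts.
for elapsed in (9369.355315, 9462.791934, 9605.649576, 9363.161820):
    print(to_hms(elapsed))
# 2:36:09
# 2:37:42
# 2:40:05
# 2:36:03
```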
Now for the bytes. I used bytes divided by two to the power of twenty (1048576) for Megabytes and two to the power of thirty (1073741824) for Gigabytes:
=A1/(2^20) for bytes to Megabytes and =A1/(2^30) for bytes to Gigabytes
The above data with byte conversions becomes:
A four-drive "RaidZ10" stripe of two mirrors pool, no compression
    659 Gbytes transferred in 2:36:09 (72.04 Mbytes/sec)

A four-drive "non-optimal" RaidZ1 pool, no compression
    659 Gbytes transferred in 2:37:42 (71.33 Mbytes/sec)

A four-drive RaidZ2 pool, no compression
    659 Gbytes transferred in 2:40:05 (70.27 Mbytes/sec)

A four-drive ZFS striped pool for comparison, no compression
    659 Gbytes transferred in 2:36:03 (72.09 Mbytes/sec)
We now see that the write speed was pretty consistent at about 71 Megabytes per second.
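The byte conversions can be reproduced the same way. The sketch below applies the power-of-two divisors described above to the figures dd reported; the dictionary keys and function names are illustrative labels only.

```python
MIB = 2**20  # 1048576 bytes, the "Megabyte" used in this article
GIB = 2**30  # 1073741824 bytes, the "Gigabyte" used in this article


def bytes_to_mib(n):
    """Bytes to Megabytes, same as the spreadsheet's =A1/(2^20)."""
    return n / MIB


def bytes_to_gib(n):
    """Bytes to Gigabytes, same as the spreadsheet's =A1/(2^30)."""
    return n / GIB


# bytes/sec figures reported by dd for each layout
rates = {
    "RaidZ10": 75542956,
    "RaidZ1": 74797037,
    "RaidZ2": 73684637,
    "stripe": 75592926,
}

print(round(bytes_to_gib(707788800000), 2))  # 659.18
for name, rate in rates.items():
    print(name, round(bytes_to_mib(rate), 2))
# RaidZ10 72.04
# RaidZ1 71.33
# RaidZ2 70.27
# stripe 72.09
```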
The dd process ran at 100% CPU for all of these tests and gstat showed roughly 80% disk activity, with the exception of the RaidZ1 pool, which hummed along at roughly 35% disk activity. A smarter data source than dd would be appreciated: the drives were most likely not the bottleneck, so a faster source should yield better write results.
As I was not sure what to look out for when I started the experiment, I ran these tests with compression disabled though I enabled lz4 compression for the resilver tests.
Note that the RaidZ1 "non-optimal" "imbalanced vdev" configuration had no issues keeping up with the other configurations. dd was probably the bottleneck here but I should re-run it with time -h, compression enabled and a better benchmarking tool.
Ruling out compression, note Matt Ahrens's excellent post, "ZFS RAIDZ stripe width, or: How I Learned to Stop Worrying and Love RAIDZ", in which he states that "Due to compression, the physical (allocated) block sizes are not powers of two, they are odd sizes like 3.5KB or 6KB."
It would be great to confirm that "vdev balancing" is obsolete voodoo with lz4 compression.
Here is the df -h output before and after the random dummy data was added to each pool with lz4 compression:
    Filesystem        Size   Used   Avail  Capacity  Mounted on
    z10               905G   144k   905G     0%      /mnt/z10
    z10               905G   659G   245G    73%      /mnt/z10
    z1                1.3T   209k   1.3T     0%      /mnt/z1
    z1                1.3T   659G   656G    50%      /mnt/z1
    z2                877G   209k   877G     0%      /mnt/z2
    z2                877G   659G   217G    75%      /mnt/z2
    zstripe           1.8T   659G   1.1T    36%      /mnt/zstripe
    zfs500            452G   144k   452G     0%      /mnt/zfs500
    /dev/ufs/ufs500   449G   8.0k   413G     0%      /mnt/ufs500
At the end you will see single ZFS and UFS-formatted drives for comparison. I was quite surprised that the UFS drive appears to have a smaller capacity than the single ZFS-formatted drive.
With each pool configuration populated with the fixed-size data set, I proceeded to:
1. Offline a member drive,
2. Wipe it, and
3. Replace it using the FreeNAS GUI.
I made passing note of the resilver throughput as reported by zpool status. If anything, the throughput quickly peaked and slowed somewhat as the process neared completion.
Here is some incremental zpool status output from the "RaidZ10" drive resilver, which sustained around 110M/s throughout the process:
    4.90G scanned out of 660G at 98.4M/s, 1h53m to go
    2.45G resilvered, 0.74% done

    11.3G scanned out of 660G at 108M/s, 1h42m to go
    5.65G resilvered, 1.71% done

    15.5G scanned out of 660G at 110M/s, 1h40m to go
    7.77G resilvered, 2.36% done

    537G scanned out of 660G at 102M/s, 0h20m to go
    268G resilvered, 81.34% done

A RaidZ1 Sample:

    46.8G scanned out of 908G at 168M/s, 1h27m to go
    11.2G resilvered, 5.15% done

A RaidZ2 Sample:

    12.5G scanned out of 1.33T at 156M/s, 2h27m to go
    3.02G resilvered, 0.92% done
That's a pretty significant difference in reported throughput and, surprisingly, it has little to do with how long the resilver took to complete. I attribute these differences to the aggregate throughput of all drives involved in the resilver. A mirrored drive that is part of a larger stripe has only one other drive to restore data from, whereas RaidZ pools draw from multiple member drives, and gstat output confirms this assumption. I have had clients set up stripes of triple mirrors and it would be interesting to see if a triple mirror resilvers faster than a two-way mirror. The write would remain the same but it could read from two drives, and perhaps a real-world load during the resilver would give this an advantage.
A four-drive "RaidZ10" stripe of two mirrors pool, lz4 compression
    scan: resilvered 330G in 1h55m with 0 errors on Tue Oct 7 12:19:18 2014

A four-drive "RaidZ10" stripe of two mirrors pool, no compression
    scan: resilvered 330G in 1h54m with 0 errors on Thu Oct 9 01:53:13 2014

A four-drive "non-optimal" RaidZ1 pool, lz4 compression
    scan: resilvered 217G in 1h35m with 0 errors on Thu Oct 9 21:40:57 2014

A four-drive RaidZ2 pool, lz4 compression
    scan: resilvered 330G in 2h31m with 0 errors on Tue Oct 7 19:35:57 2014

A four-drive RaidZ2 pool, no compression
    scan: resilvered 330G in 2h31m with 0 errors on Thu Oct 9 16:06:44 2014
First off, we see that enabling or disabling compression has essentially no impact on the resilver time of random data: 1h55m versus 1h54m for the stripe of mirrors and 2h31m in both RaidZ2 runs. Whether or not this would change with known-compressible data is worth testing.
Including RaidZ1 was an afterthought because of its "non-optimal" configuration but it turned out to be the resilver speed winner. Given that the data set only consumed 50% of each drive as opposed to 75% with RaidZ2, the resilvered drive only required 217G of replacement data as opposed to the 330G on the other configurations. Rerunning this test with the same 75% of consumed space on adjusted partition size drives should prove enlightening.
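The 217G and 330G figures follow from how each layout spreads the data. A rough Python sketch of that arithmetic is below. It assumes data and parity are distributed evenly across member drives and ignores metadata and RaidZ padding, so it only approximates what zpool status reports; the function names are hypothetical and the 660G input is the pool total from the zpool status output above.

```python
def mirror_stripe_per_drive(data_gib, mirror_vdevs):
    """Each mirror vdev holds an equal share of the data; every drive in a
    mirror carries a full copy of its vdev's share."""
    return data_gib / mirror_vdevs


def raidz_per_drive(data_gib, drives, parity):
    """Data plus parity spread evenly over all drives of one RaidZ vdev.
    Raw allocation is data * drives / (drives - parity); dividing by the
    number of drives leaves data / (drives - parity) per drive."""
    return data_gib / (drives - parity)


DATA_GIB = 660  # pool data as reported by zpool status for the mirror pool

print(mirror_stripe_per_drive(DATA_GIB, mirror_vdevs=2))  # 330.0
print(raidz_per_drive(DATA_GIB, drives=4, parity=1))      # 220.0
print(raidz_per_drive(DATA_GIB, drives=4, parity=2))      # 330.0
```

The estimates of 330G, 220G and 330G line up with the reported 330G, 217G and 330G, which is why the RaidZ1 resilver had about a third less data to rebuild.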
Finally, yes, the stripe of two mirrors configuration completed the resilver a good 30 minutes faster than the RaidZ2 configuration (1h55m versus 2h31m).
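To compare the layouts on equal footing, the completion lines can be turned into an effective rate at which data landed on the replacement drive. The following Python sketch parses the "resilvered ... in ..." text shown above; it assumes the G suffix means units of 2^30 bytes and that the pattern always includes hours and minutes, and the function name is hypothetical.

```python
import re


def effective_mib_per_sec(scan_line):
    """Effective resilver rate in units of 2^20 bytes per second, from a
    line such as 'resilvered 330G in 1h55m with 0 errors ...'."""
    match = re.search(r"resilvered (\d+)G in (\d+)h(\d+)m", scan_line)
    if match is None:
        raise ValueError("unrecognized scan line: " + scan_line)
    gib, hours, minutes = (int(group) for group in match.groups())
    seconds = hours * 3600 + minutes * 60
    return gib * 1024 / seconds


results = {
    "RaidZ10": "scan: resilvered 330G in 1h55m with 0 errors",
    "RaidZ1": "scan: resilvered 217G in 1h35m with 0 errors",
    "RaidZ2": "scan: resilvered 330G in 2h31m with 0 errors",
}

for layout, line in results.items():
    print(layout, round(effective_mib_per_sec(line), 1))
# RaidZ10 49.0
# RaidZ1 39.0
# RaidZ2 37.3
```

By this measure the stripe of mirrors wrote to the replacement drive fastest even though zpool status showed it with the lowest scan rate, which is consistent with the scan rate reflecting reads across all member drives.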
Now let's hope the controller cards for my 16 drive lab array arrive sooner rather than later.
Copyright © 2011 – 2014 Michael Dexter unless specified otherwise. Feedback and corrections welcome.