A few weeks ago, Matt and I integrated OpenZFS device removal into the Delphix downstream repository of Illumos. I first presented this work at the OpenZFS Developer Summit; this blog post provides a high-level overview of the feature and explores some of the more interesting challenges we faced.
Device removal allows an administrator to remove top-level disks from a storage pool (there are already mechanisms for removing log and cache devices and for detaching disks from a mirror). This feature is useful in a number of cases, ranging from "Oops, I meant to attach that disk as a mirror" to "I don't need as much storage capacity in this pool as I thought." An example of device removal:
```
# zpool remove test c2t1d0
# zpool status -v test
  pool: test
 state: ONLINE
  scan: none requested
remove: Evacuation of vdev 1 in progress since Mon Dec 10 08:06:43 2014
        340M copied out of 405M at 67.5M/s, 83.90% done, 0h0m to go
config:

        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          c2t1d0    ONLINE       0     0     0
          c2t2d0    ONLINE       0     0     0
          c2t3d0    ONLINE       0     0     0
```
At a high level, device removal takes three steps:

1. Disable new allocations to the device being removed.
2. Copy all allocated data off the device to other disks in the pool.
3. Remove the now-empty device from the pool configuration.
During this process, the challenge is keeping track of the copied data. The naive approach of building a map from block pointers (BPs) on the old device to BPs on the new device quickly requires an unreasonably large data structure that must be stored on disk and kept in memory whenever the pool is in use. Even a technique like the now-defunct BP rewrite project would require a huge temporary table (and brings its own set of problems). We needed an approach that would provide a layer of indirection while using less memory. Ultimately, we decided to construct the mapping table at a lower layer: in the SPA (Storage Pool Allocator). At this layer, we can ignore block boundaries and instead manage contiguously allocated segments on the old disk: the table maps an offset and length on the old disk to a new offset on another disk in the pool. For very fragmented disks, this could still require a lot of memory, but in the general case it is a huge reduction in the number of entries managed by our indirect mapping table. Another benefit of this approach is that we can construct the mapping table by iterating through the allocated segments on the old disk in LBA order, allowing us to issue the reads from the old disk in LBA order as well.
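To make the idea concrete, here is a minimal sketch of such a segment-based indirect mapping. All names (`MappingEntry`, `IndirectMapping`, `translate`) are illustrative assumptions for this post, not the actual ZFS data structures; the real code operates on DVAs inside the kernel.

```python
import bisect
from dataclasses import dataclass

@dataclass
class MappingEntry:
    old_offset: int   # start of the segment on the removed disk
    length: int       # segment length in bytes
    new_vdev: int     # disk now holding the data
    new_offset: int   # start of the copy on new_vdev

class IndirectMapping:
    def __init__(self, entries):
        # Entries are built in LBA order, so sorting is cheap (or a no-op).
        self.entries = sorted(entries, key=lambda e: e.old_offset)
        self.starts = [e.old_offset for e in self.entries]

    def translate(self, offset, length):
        """Map a read of [offset, offset+length) on the old disk to a list
        of (new_vdev, new_offset, chunk_length) tuples on other disks."""
        i = bisect.bisect_right(self.starts, offset) - 1
        out = []
        while length > 0:
            e = self.entries[i]
            skip = offset - e.old_offset          # offset into this segment
            chunk = min(length, e.length - skip)  # bytes this entry serves
            out.append((e.new_vdev, e.new_offset + skip, chunk))
            offset += chunk
            length -= chunk
            i += 1
        return out
```

Note that a single logical read can translate to chunks on several disks when it spans a segment boundary, which is exactly the "split block" situation discussed below.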
When iterating through the segments on the disk, we did not use the existing scan framework. There were two main reasons for this: the existing scan code for scrub and resilver operates at too high a layer (it iterates through BPs in the DMU (Data Management Unit), not allocated segments in the SPA), and it has confusing performance characteristics. Instead, we developed a new approach that uses a background thread running in open context to do most of the work and uses sync tasks to update the on-disk data structures. Every txg, the open context thread copies some number of allocated segments to their new locations and then issues a sync task to record the new mapping entries and the removal's progress on disk.
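The copy loop can be sketched roughly as follows. This is an illustrative simulation only: the per-txg budget, the batching policy, and every name here are assumptions made for the example, not the actual ZFS implementation.

```python
MAX_BYTES_PER_TXG = 64  # assumed per-txg copy budget for this sketch

def run_removal(allocated_segments, committed_mapping):
    """allocated_segments: (offset, length) pairs on the old disk, in LBA
    order. committed_mapping: list standing in for the on-disk mapping.
    Yields the committed entry count once per simulated txg, after the
    'sync task' commits that txg's batch."""
    pending = []   # entries copied this txg but not yet committed
    copied = 0
    for off, length in allocated_segments:
        # ...in the real code: read (off, length) from the old disk and
        # write it to a newly allocated location elsewhere in the pool...
        pending.append((off, length))
        copied += length
        if copied >= MAX_BYTES_PER_TXG:
            committed_mapping.extend(pending)  # sync task: atomic commit
            pending, copied = [], 0
            yield len(committed_mapping)
    if pending:                                # final partial batch
        committed_mapping.extend(pending)
        yield len(committed_mapping)
```

The key design point this models is that the expensive I/O happens in open context, while only the small, bounded metadata update runs as a sync task in syncing context.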
Unfortunately, this process can lead to "split blocks": since we do not know where the DMU block boundaries are when iterating through allocated segments, we might create a split in the middle of a DMU block. As a consequence, we must piece together the data for these blocks when reading or freeing them after the removal has completed.
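Reassembling a split block on read is conceptually simple; a minimal sketch (not the actual ZFS implementation, and with hypothetical names) looks like this:

```python
def read_split_block(chunks, read_chunk):
    """chunks: (vdev, offset, length) tuples covering one DMU block, e.g.
    as produced by an indirect-mapping lookup. read_chunk: a callable that
    performs the I/O for a single chunk. The pieces are concatenated so
    that the block's checksum can be verified over the whole buffer."""
    return b"".join(read_chunk(vdev, off, ln) for vdev, off, ln in chunks)
```

The cost is that one logical read may fan out into several physical reads on different disks, which is part of why heavily fragmented pools make removal more expensive.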
In order to know that we have copied every segment off the old disk, we have to be careful not to change which segments are allocated to it while we are iterating over it. The set of segments on a disk changes when new segments are allocated to it or existing segments are freed from it. Although we can disable new allocations to the disk during a removal, frees are more complicated; we cannot simply disable or delay frees without breaking the space accounting in the DMU. Instead, we allow frees to proceed to the old disk and carefully handle their interaction with the in-progress removal. We always free from the old disk, and then handle one (or more, because of split blocks) of three cases, depending on whether the removal thread has already copied the freed segment, is copying it right now, or has not yet reached it.
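The three cases above can be sketched as a classification of each free against the removal's progress, under the simplifying assumption that the copy proceeds strictly in LBA order: everything below a "copied-through" offset has been copied, one range is being copied in the current txg, and everything beyond it is untouched. The function and region names are illustrative assumptions, not the ZFS code, which tracks this more precisely.

```python
def classify_free(seg_start, seg_end, copied_through, in_flight_end):
    """Classify a freed segment [seg_start, seg_end) on the old disk.
    Regions: [0, copied_through) already copied, [copied_through,
    in_flight_end) being copied this txg, the rest not yet copied.
    A free spanning a boundary (e.g. a split block) hits several cases."""
    cases = set()
    if seg_start < copied_through:
        cases.add("already-copied")    # free the old AND the new copy
    if max(seg_start, copied_through) < min(seg_end, in_flight_end):
        cases.add("being-copied")      # coordinate with the in-flight copy
    if seg_end > in_flight_end:
        cases.add("not-yet-copied")    # free from the old disk and drop it
                                       # from the set still to be copied
    return cases
```

A single free can land in all three buckets at once, which is why the bookkeeping here was one of the trickier parts of the feature.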
This work is currently undergoing testing on our internal fork of Illumos. We're actively working on a couple more features before we're ready to put the code up for general review.
We'll publish the code when we ship our next release, probably in March, but we won't integrate it into Illumos until we address all of the future-work items.
Writeup by Alex Reece, see me on Google+.