Tags: bandwidth, corruption, disk hardware, file system check, hardware trends, ing, intel, intel corporation, linux, linux intel, mance, ms 8, open source technology, rare event, reliability, system crash, technology center, time ms, val henson, valerie,
RepairDriven File System Design
Valerie Henson
Open Source Technology Center
Intel Corporation
val henson@linux.intel.com
2006 2009 2013 Change
Capacity (GB) 500 2000 8000 16x
File system design has traditionally focused on the
online performance of the file system while the perfor- Bandwidth (Mb/s) 1000 2000 5000 5x
mance and reliability of file system check, repair, and Seek time (ms) 8 7.2 6.5 1.2x
restore has been for the most part neglected. Journal-
ing and other techniques handle the most common case Table 1: Projected disk hardware trends[9].
of file system corruption--halffinished updates inter-
rupted by system crash or other event--but repairing
any other form of metadata corruption requires reading
For many years, we could brush off the importance
and checking the entire file system's metadata. Unfortu-
of making file system repair fast and reliable with the
nately, disk capacity is growing faster than disk band-
following chain of reasoning: File system corruption is
width, seek times are hardly budging, and the overall
a rare event, and when it does occur, repairing it takes
chance of an I/O error occurring somewhere on the disk
only a few minutes or maybe a few hours of downtime,
is increasing. The result: the traditional file system check
and if repair is too difficult or timeconsuming, "That's
and repair cycle will be not only longer, but also more
what backups are for." Unfortunately, if this reasoning
frequent, with disastrous consequences for data availabil-
was ever valid, it is being eroded by some inconvenient
ity. Data reliability will also decline with frequency of
truths about disk hardware trends.
corruption. With this reality in mind, we propose repair
driven file system design, the practice of designing file As Table 1 shows, Seagate projects that during the
systems with the explicit goal of fast and reliable file sys- same time that disk capacity increases by 16 times, disk
tem check, repair, and restore. bandwidth will increase by only 5 times, and seek times
will remain nearly unchanged. This is good news for
1 Introduction many common workloads--we can store more data and
File system repair is usually an afterthought for file read and write more of it at once. But it is terrible news
system designers. In our experience with ZFS[2], file for any workload that is (a) proportional to the size of the
system programmers tend not to be excited about writ- disk, (b) throughputintensive, and (c) seekintensive.
ing fsck; XFS was even initially shipped without any file One workload that fits this profile is file system check
system repair tool at all[13]. One reason is that repair is and repair. We calculate that file system check and re-
difficult and annoying to reason about. It's neither pos- pair time will increase by approximately a factor of 10
sible nor worthwhile to fix all error modes so we must between 2006 and 2013 with today's file system formats
focus our efforts on the ones that commonly occur, yet (see § 2.2).
we don't know what they are until we encounter them At the same time that capacity is increasing, the per
in the wild. In practice, most file system repair code is bit error rate is improving. However, for an overall im-
written in response to an observed corruption mode. File provement in the error rate for operations that read data
system repair is annoying because, by definition, some- proportional to the file system size (such as fsck), the
thing went wrong and we must think outside the state perbit error rate must improve as fast as capacity grows,
space of our beautifully designed system. In the end, de- which seems unlikely. We conclude that the frequency
signing a file system is more fun than designing a file of file system corruption and necessary check and repair
system checker. or restore is more likely to increase than decrease. This
1
combination of increasing fsck time and increasing fsck the most time consuming, as we must issue a set of small
frequency is what we call the fsck time crunch. scattered reads, wait for them to complete, read the ad-
Our response to this approaching train wreck is to dress of the next block, then issue another set of reads.
propose repairdriven file system design: the practice Some optimization of this stage has been done; mainly
of designing file systems with repair and recovery as a grouping sets of independent I/Os together, reordering
firstclass goal. The ondisk format should be designed them, and issuing them in order to the disk to minimize
for performance and reliability for both normal online seeks. However, good disk performance requires issuing
workloads and for repair workloads. While we cannot many closely grouped I/Os at once; as long as we don't
repair every possible corruption, we can design ondisk know which blocks we need to read it is difficult to opti-
formats that are robust and fast to repair. mize the manner in which we retrieve them.
Finally, we need CPU time and sufficient memory to
2 Motivation actually compute the consistency checks on the data we
To make a convincing argument for repairdriven file have read off disk. This takes a relatively small amount
system design, we must first investigate fsck in depth: of time compared to the time spent doing what are ef-
how it works, what the major performance factors are, fectively random 4 KB or similarsized reads, although
how performance can be improved. (For more depth, more complex file systems may burn more CPU time in
see [8].) The short version is that most checkers op- computing checksums or similar tasks.
erate in several passes, each of which checks a dif- In summary, the ways to reduce fsck time, in rough
ferent kind of consistency and/or builds up data struc- order of effectiveness, are to reduce seeks, reduce de-
tures for later passes. The things fsck checks may in- pendent reads, reduce the amount of metadata that needs
clude superblock sanity, block group summaries, intra to be read (either by reducing the overall quantity or the
inode consistency, inode link counts, block/inode allo- amount that needs to be read), and to reduce the com-
cation bitmaps, directory connectivity, orphaned inodes, plexity of the consistency checks themselves.
directory consistency, duplicate block allocations, indi- 2.2 Projecting file system check time
rect blocks/extents/trees, and checksums. To complete To get a rough estimate of how 16x capacity increase,
all of these checks, fsck must read every piece of meta- 5x bandwidth increase, and almost no change in seek
data in the file system--every inode, every bitmap, every time will affect fsck time in 2013, we ran fsck on the
directory entry, every indirect block, every block group /home partition (ext3 formatted) of a development laptop
summary. (see Table 2 for details on the disk used). We measured
2.1 Fsck performance the elapsed time and CPU time using the time com-
The fundamental limiting factors in the performance mand, and the number of individual I/O operations and
of fsck are amount of data read, number of separate I/Os, the total data read using iostat. Using this data and
how scattered the data is on disk, number of dependent projected changes in disk hardware, we made a rough es-
reads, and CPU time required to check and repair the timate of the time needed to complete a file system check
data read. The amount of memory available is a factor as on a moderately sized laptop file system in 2013.
well, though most fsck programs operate on an allor First, we estimated how the elapsed time is divided
nothing basis: Either there is enough memory to fit all between using the CPU, reading data off the disk, and
the needed metadata for a particular checking pass or the head travel between blocks (seek time plus rotational la-
checker simply aborts. tency). The total elapsed time to check a 37 GB file sys-
The time required to read the file system metadata tem with 21 GB of data is 450 seconds, 15 seconds of
is partially constrained by the bandwidth of the disk. which are spent in CPU time. That leaves 435 seconds
Depending on the file system, some kinds of file sys- in I/O. We measured 1.3GB of data read. We estimated
tem data, such as blocks of statically allocated inodes the amount of time to read 1.3 GB of data off the disk
or block group summaries, are located in contiguous by using dd to transfer that amount of data from the par-
chunks at known locations. Reading this data off disk tition into /dev/null, which took 42 seconds at optimal
is relatively swift. read size. The remaining time, 393 seconds, we assume
Other kinds of file system data are dynamically allo- is spent seeking between tracks and waiting for the data
cated, such as directory entries, indirect blocks, and ex- to rotate under the head. We measured 340,000 sepa-
tents, and hence are scattered all over the disk. Many rate I/O requests. Given that the time spent seeking is
modern file systems allocate nearly all their metadata dy- 393 seconds, the average seek time for this disk is 12 ms,
namically. The location of much of this kind of metadata and the average rotational latency is 7.4 ms, we estimate
is not known until the block of metadata pointing to it is that fsck required about 393/(0.010 + 0.0074) 20000
read off the disk, introducing many levels of dependent seeks (about one seek per every 15-20 I/Os).
reads. This portion of the file system check is usually To estimate how long fsck will take in 2013, we use
2
Total size 80 GB provements in the specified UER are predicted[4], but
Partition size 40 GB their relationship to shipped product is uncertain. In any
Used 23 GB case, we cannot depend on an improvement in the er-
Files 850,000 ror rate when the consequences of failure are hours of
Avg. seek time 12 ms downtime on even modestlysized file systems, and days
Avg. rotational latency 7.4 ms on any significantly sized file system. Along with hard-
ware errors, increases in the complexity of file systems
bring their own inevitable increase in file system corrup-
Table 2: Experiment disk characteristics (GB = 109 tion due to bugs.
bytes Complete disk failure has been measured at a rate of
1-6% per year, depending on disk batch[12, 6], making
Year Elapsed time I/Os Data read speedy file system restore from backup a priority. The
2006 7.5 m 340,000 1.3 GB author's web site was recently down for a week when the
2013 (est.) 80 m 5,400,000 21 GB RAID system on a file server with several thousand user
accounts failed catastrophically. The filebased restore
took over a week to restore only 320 GB of data.
Table 3: Projected fsck time.
2.4 The fsck time crunch
The combination of a 10fold increase in fsck time
Seagate's projections for changes in capacity, bandwidth, with, if anything, an increase in file system corruption
average seek time, and average rotational latency. We requiring fsck creates the fsck time crunch. File systems
generously assume that the CPU time is negligible. We will spend an ever increasing proportion of time unavail-
assume that the file system is 16 times bigger, and check- able simply due to being checked and repaired. Repair
ing requires reading 16 times the data and 16 times the driven file system design is one solution to the fsck time
I/Os. We'll assume that bandwidth is 5 times bigger, crunch.
and that the seek time is 10 ms, 1.2 times faster than
12 ms. The trend in rotational latency is about a 1.5x 3 Repairdriven file system design
improvement[9], so we will assume a 4.7 ms rotational How can we take file system check, repair, and restore
latency. into account when designing file systems? We discuss
The time required to read the data off the disk is 42s × several general classes of useful techniques: (1) Divide
(16/5) 130s. The time required to execute the seeks and isolate metadata, (2) make file system traversal fast,
is 16 × 20000 × (0.010s + 0.0047s) 4700s. The total (3) use more (relatively cheap) disk space, (4) keep it
time is 4700s + 130s 4800s, or about 80 minutes, simple, (5) make restore fast.
compared to about 7.5 minutes today, an increase of a 3.1 Divide and isolate metadata
factor of about 10. One of the primary reasons that file system check and
The above experiment is also best case in that ext3 repair are so painful is that they are O(file system size).
is an older file system with lots of statically allocated All ondisk file system layouts we are aware of funda-
metadata and no checksums or complex structures. Most mentally require examining every piece of metadata in
modern file systems have more dynamically allocated the system to determine, for example, whether a par-
metadata and take more CPU time to verify their more ticular data block is free or allocated. If, on the other
complex ondisk structures. Running fsck on a full 300 hand, we can divide up the metadata and create isolation
GB reiserfs file system with 60,000,000 files takes about boundaries that limit the effects of metadata corruption
2.5 hours! In any case, we do not expect that a 10fold inside the boundary on things outside the boundary, we
increase in the time it takes to check and repair file sys- will be able to make file system check O(size of isola-
tems will be acceptable to file system users. tion group)--a constant factor. This can be thought of as
2.3 Increasing file system corruption block groups on steroids. Any references that cross iso-
Disk manufacturers currently specify an uncorrectable lation group boundaries must have a simple, direct, con-
read error rate (UER) of 1 in 1013 to 1016 bits read[6], de- stant time method to determine the state of the relevant
pending on quality, or one per 11,000 TB of data read-- outofgroup metadata, such as back pointers. One way
far from a shockingly large amount of data to read by of doing this is to allow only inodes inside an isolation
today's standards. Empirical measurements of disk re- group to point to data blocks inside of it.
liability vary from 1-2 orders of magnitude below spec- A limitation is that we need to be able to find the
ification for a small sample size (n 20)[6] to a 1% exact location of file system corruption. Useful meth-
per year rate of disk replacement due to "substantial sec- ods include checksums, disk error reporting, dirty bits
tor failure" for a sample size in the thousands[12]. Im- or marks on isolation groups showing that an update is
3
in progress, and errorchecking code such as asserts and strip actual data stored. We should use more of that space
sanity checks in the file system code (such as in [11] and to make file system repair fast and reliable. In general,
[2]). In the case of completely silent corruption or cor- optimizations designed to save ondisk space are miss-
ruption spread over the entire file system, this technique ing the forest for the trees; the real point of reducing the
will not be of much help. Isolation group boundaries size of ondisk data is to save bandwidth, if anything.
need to be aligned with underlying device fault bound- Checksums and magic numbers are the most obvious ex-
aries, such as RAID stripes or disks. amples of useful additional ondisk structures. Multiple
A corollary to divide and isolate is structuring file sys- copies of metadata are another. Red zones to detect out
tem metadata so that changes and repairs do not propa- ofbounds writes and buffers between isolation groups
gate excessively to other parts of metadata. Many com- may also be useful.
plex tree structures require rebalancing and removal or 3.4 Keep it simple
addition of one node may affect many other nodes. The The more knobs, gewgaws, and ornamental curlicues
more work we have to do during repair, the longer it takes in the ondisk format, the more difficult it is to repair
and the more fragile the file system. errors correctly and the more failure modes multiply.
Another benefit of divide and isolate is that the checker The concept of having completely pluggable ondisk file
requires less memory, since it can check most of the file formats, as in Reiser4[1], is a file system repair night-
system in pieces, sequentially. If enough memory, CPUs, mare. Ondisk structures that have long tentacles of
and disks are available, checking can easily be multi causeandeffect reaching into and changing other on
threaded. disk structures also increase fragility.
3.2 Make file system traversal fast The problem with very simple, straightforward disk
Even if we only need to read part of the metadata, it formats is that they often do things like lookups in large
is still useful to make reading that metadata fast. Prob- directories very slowly. One way to have your cake and
ably the most useful idea is to add another bitmap: The eat it too is to keep the ondisk format extremely sim-
block metadata/data bitmap. This is a bitmap that marks ple, but build complex lookup structures in memory as
whether a particular block contains metadata or data, re- data is read in off the disk. The first operation may be
gardless of the type of metadata. This allows us to read slow, but subsequent operations will go as fast as if the
all the metadata in one sweep of the disk arm without lookup structure was stored on disk. Another option is
waiting for dependent reads. In the best case, this could to cache complex structures on disk, but have redundant
convert our 2013 fsck time to 42s × (16/5) 134s -- simpler structures which we can fall back on if the com-
only a third as long as current fsck. plex structure is damaged or nonexistent.
The metadata bitmap also improves reliability because 3.5 Make restore fast
it gives an extra piece of information about whether a When all else fails--fsck has a fatal bug, the disk is
block is data or metadata; currently fsck has to guess smoking in the chassis, your RAID system ate itself--we
in some cases. This allows us to rely more heavily on need to restore from backup. An interesting thing about
magic numbers in metadata during repair, as magic num- restoring from backup is that the application knows ex-
bers (used as a sanity check to identify different meta- actly how many files it will create, what size they will
data types) are more useful if we have some kind of out be, and what the directory structure will look like. If we
ofband indication of whether a block is data or meta- have an interface to communicate this information in ad-
data. Consider the reiserfs bug in which any data block vance of the actual creat(), write(), mkdir() calls, the file
containing what looked like a consistent inode, complete system might be able to complete the operations faster.
with magic number, was considered a lost inode and put It will also benefit applications like tar that also have de-
in lost+found. The scheme failed spectacularly when a tailed information about what file system operations it
disk image of a reiserfs file system was stored in a file will be doing.
in a reiserfs file system, since the inodes stored in the
contents of the file looked exactly like lost inodes. 4 Chunkfs
Greater fanout and back pointers (at least for com- We are working on a repairdriven file system de-
mon, single link cases) may speed up file system traver- sign called chunkfs[7]. The file system is divided into
sal. Another idea is to have hints about which blocks chunks which can be checked and repaired mostly in-
to read next--they may not need to be exactly accurate, dependent of other chunks. The metadata inside each
as long as on average they increase the speed at which chunk looks like fairly standard file system metadata
metadata is read into memory. layout--inodes, indirect blocks, etc.. However, any ref-
3.3 Space is cheap erences that cross chunk boundaries must have both for-
Modern disks have about 50% free space on ward and back pointers, so that external chunk references
average[5], and capacity will probably continue to out- can be quickly checked. The only references that cross
4
chunk boundaries are when files or directories outgrow a [2] ZFS at OpenSolaris.org. http://www.
chunk; when this happens, we create a "continuation in- opensolaris.org/os/community/zfs/.
ode" in another chunk and set up forward and back point-
ers between the original inode and its continuation. Only [3] Eric J. Bina and Perry A. Emrath. A faster fsck for
chunks that are damaged or dirty need to be checked, BSD UNIX. In USENIX Winter Technical Confer-
in addition to a crosschunk reference checking pass. ence, pages 173185, 1989.
Chunks can also be checked as needed and on demand, [4] Kurt Chan. A comparison of disk drives for en-
so that the entire file system need not be down while only terprise computing. ;login: The Usenix Magazine,
parts of it are being checked. 31(3), 2006.
5 Related work [5] John R. Douceur and William J. Bolosky. A large-
Some very basic optimizations for fsck, long imple- scale study of file-system contents. In Proceedings
mented in most checkers, are found in [3]. Many file sys- of the International Conference on Measurement
tems use journaling, copyonwrite, or soft updates to and Modeling of Computer Systems, 1999.
recover from partially finished file system updates, as of-
[6] Jim Gray and Catharine van Ingen. Empirical mea-
ten occurs with system panic or power loss. These tech-
surements of disk failure rates and error rates. Tech-
niques work only for completing inprogress updates;
nical Report MSR-TR-2005-166, Microsoft Re-
they have no effect on fsck time when repairing file sys-
search, 2005.
tem corruption from other sources. Peacock, et al. mod-
ified Solaris UFS to speed up fsck through a variety of [7] Val Henson, Arjan van de Ven, Amit Gud, and Zach
techniques[10]. Most techniques were intended to solve Brown. Chunkfs: Using divide-and-conquer to im-
the fast crash recovery problem; some, like recording the prove file system reliability and repair. In Hot Top-
last used inode, are useful for all sources of file system ics in System Dependability, 2006.
corruption. Their work was constrained by the need for
partial backwards compatibility with existing tools. [8] T. J. Kowalski and Marshall K. McKusick. Fsck
- the UNIX file system check program. Technical
6 Conclusion report, Bell Laboratories, 1978.
Traditional file system check and repair will become
[9] Mark Kryder. Future storage tech-
both more frequent and more timeconsuming due to
nologies: A look beyond the horizon.
disk hardware trends--the fsck time crunch. To combat
http://www.snwusa.com/documents/
this, we need repairdriven file system design to become
presentations-s06/MarkKryder.pdf.
the norm. Examples of repairdriven design are making
file system traversal fast using techniques such as meta- [10] J. Kent Peacock, Ashvin Kamaraju, and Sanjay
data bitmaps, dividing and isolating metadata so it can be Agrawal. Fast consistency checking for the Solaris
checked incrementally and so that corruption propagates file system. In USENIX Annual Technical Confer-
as little as possible, keeping the ondisk format as simple ence, 1998.
as possible, and taking advantage of ample disk space to
add checksums and redundancy to ease repair. [11] Vijayan Prabhakaran, Lakshmi N. Bairavasun-
daram, Nitin Agrawal, Haryadi S. Gunawi, An-
7 Acknowledgments drea C. Arpaci-Dusseau, and Remzi H. Arpaci-
Thanks to Arjan van de Ven, Zach Brown, Andreas Dusseau. IRON file systems. In SOSP '05, pages
Dilger, Kristal Pollack, Theodore Y. T'so, Brian Warner, 206220, 2005.
and Ric Wheeler for discussions, ideas, insightful [12] Thomas Schwarz, Mary Baker, Steven Bassi, Bruce
comments, and suggestions. Baumgart, Wayne Flagg, Catherine van Ingen,
Kobus Joste, Mark Manasse, and Mehul Shah.
Note to reviewers: Repair-driven file system design http://www.hpl.hp.com/personal/
has been briefly discussed in two previous publications: Mary_Baker/publications/wip.pdf,
"Chunkfs: Using divideandconquer to improve file 2006.
system reliability and repair," appeared in HotDep '06,
and in a review of the Linux File Systems Workshop '06 [13] Adam Sweeney, Doug Doucette, Wei Hu, Curtis
in Linux Weekly News. Anderson, Michael Nishimoto, and Geoff Peck.
Scalability in the XFS file system. In Proceedings
References of the 1996 USENIX Technical Conference, 1996.
[1] Namesys. http://www.namesys.com/.
5