Information about http://www.valhenson.org/review/repair.pdf

Repair­Driven File System Design …

Tags: bandwidth, corruption, disk hardware, file system check, hardware trends, ing, intel, intel corporation, linux, linux intel, mance, ms 8, open source technology, rare event, reliability, system crash, technology center, time ms, val henson, valerie,
Pages: 5
Language: english
Display cached document
Page 1
image
Page 2
image
Page 3
image
Page 4
image
Page 5
image
                                  Repair­Driven File System Design

                                                 Valerie Henson
                                          Open Source Technology Center
                                                Intel Corporation
                                          val henson@linux.intel.com



                                                                                            2006    2009     2013     Change
                                                                     Capacity (GB)           500    2000     8000        16x
   File system design has traditionally focused on the
on­line performance of the file system while the perfor-             Bandwidth (Mb/s)       1000    2000     5000         5x
mance and reliability of file system check, repair, and              Seek time (ms)            8     7.2      6.5       1.2x
restore has been for the most part neglected. Journal-
ing and other techniques handle the most common case                       Table 1: Projected disk hardware trends[9].
of file system corruption--half­finished updates inter-
rupted by system crash or other event--but repairing
any other form of metadata corruption requires reading
                                                                       For many years, we could brush off the importance
and checking the entire file system's metadata. Unfortu-
                                                                    of making file system repair fast and reliable with the
nately, disk capacity is growing faster than disk band-
                                                                    following chain of reasoning: File system corruption is
width, seek times are hardly budging, and the overall
                                                                    a rare event, and when it does occur, repairing it takes
chance of an I/O error occurring somewhere on the disk
                                                                    only a few minutes or maybe a few hours of downtime,
is increasing. The result: the traditional file system check
                                                                    and if repair is too difficult or time­consuming, "That's
and repair cycle will be not only longer, but also more
                                                                    what backups are for." Unfortunately, if this reasoning
frequent, with disastrous consequences for data availabil-
                                                                    was ever valid, it is being eroded by some inconvenient
ity. Data reliability will also decline with frequency of
                                                                    truths about disk hardware trends.
corruption. With this reality in mind, we propose repair­
driven file system design, the practice of designing file              As Table 1 shows, Seagate projects that during the
systems with the explicit goal of fast and reliable file sys-       same time that disk capacity increases by 16 times, disk
tem check, repair, and restore.                                     bandwidth will increase by only 5 times, and seek times
                                                                    will remain nearly unchanged. This is good news for
1 Introduction                                                      many common workloads--we can store more data and
   File system repair is usually an afterthought for file           read and write more of it at once. But it is terrible news
system designers. In our experience with ZFS[2], file               for any workload that is (a) proportional to the size of the
system programmers tend not to be excited about writ-               disk, (b) throughput­intensive, and (c) seek­intensive.
ing fsck; XFS was even initially shipped without any file           One workload that fits this profile is file system check
system repair tool at all[13]. One reason is that repair is         and repair. We calculate that file system check and re-
difficult and annoying to reason about. It's neither pos-           pair time will increase by approximately a factor of 10
sible nor worthwhile to fix all error modes so we must              between 2006 and 2013 with today's file system formats
focus our efforts on the ones that commonly occur, yet              (see § 2.2).
we don't know what they are until we encounter them                    At the same time that capacity is increasing, the per­
in the wild. In practice, most file system repair code is           bit error rate is improving. However, for an overall im-
written in response to an observed corruption mode. File            provement in the error rate for operations that read data
system repair is annoying because, by definition, some-             proportional to the file system size (such as fsck), the
thing went wrong and we must think outside the state                per­bit error rate must improve as fast as capacity grows,
space of our beautifully designed system. In the end, de-           which seems unlikely. We conclude that the frequency
signing a file system is more fun than designing a file             of file system corruption and necessary check and repair
system checker.                                                     or restore is more likely to increase than decrease. This

                                                                1
combination of increasing fsck time and increasing fsck          the most time consuming, as we must issue a set of small
frequency is what we call the fsck time crunch.                  scattered reads, wait for them to complete, read the ad-
   Our response to this approaching train wreck is to            dress of the next block, then issue another set of reads.
propose repair­driven file system design: the practice           Some optimization of this stage has been done; mainly
of designing file systems with repair and recovery as a          grouping sets of independent I/Os together, reordering
first­class goal. The on­disk format should be designed          them, and issuing them in order to the disk to minimize
for performance and reliability for both normal on­line          seeks. However, good disk performance requires issuing
workloads and for repair workloads. While we cannot              many closely grouped I/Os at once; as long as we don't
repair every possible corruption, we can design on­disk          know which blocks we need to read it is difficult to opti-
formats that are robust and fast to repair.                      mize the manner in which we retrieve them.
                                                                    Finally, we need CPU time and sufficient memory to
2 Motivation                                                     actually compute the consistency checks on the data we
   To make a convincing argument for repair­driven file          have read off disk. This takes a relatively small amount
system design, we must first investigate fsck in depth:          of time compared to the time spent doing what are ef-
how it works, what the major performance factors are,            fectively random 4 KB or similar­sized reads, although
how performance can be improved. (For more depth,                more complex file systems may burn more CPU time in
see [8].) The short version is that most checkers op-            computing checksums or similar tasks.
erate in several passes, each of which checks a dif-                In summary, the ways to reduce fsck time, in rough
ferent kind of consistency and/or builds up data struc-          order of effectiveness, are to reduce seeks, reduce de-
tures for later passes. The things fsck checks may in-           pendent reads, reduce the amount of metadata that needs
clude superblock sanity, block group summaries, intra­           to be read (either by reducing the overall quantity or the
inode consistency, inode link counts, block/inode allo-          amount that needs to be read), and to reduce the com-
cation bitmaps, directory connectivity, orphaned inodes,         plexity of the consistency checks themselves.
directory consistency, duplicate block allocations, indi-        2.2   Projecting file system check time
rect blocks/extents/trees, and checksums. To complete                To get a rough estimate of how 16x capacity increase,
all of these checks, fsck must read every piece of meta-         5x bandwidth increase, and almost no change in seek
data in the file system--every inode, every bitmap, every        time will affect fsck time in 2013, we ran fsck on the
directory entry, every indirect block, every block group         /home partition (ext3 formatted) of a development laptop
summary.                                                         (see Table 2 for details on the disk used). We measured
2.1   Fsck performance                                           the elapsed time and CPU time using the time com-
   The fundamental limiting factors in the performance           mand, and the number of individual I/O operations and
of fsck are amount of data read, number of separate I/Os,        the total data read using iostat. Using this data and
how scattered the data is on disk, number of dependent           projected changes in disk hardware, we made a rough es-
reads, and CPU time required to check and repair the             timate of the time needed to complete a file system check
data read. The amount of memory available is a factor as         on a moderately sized laptop file system in 2013.
well, though most fsck programs operate on an all­or­                First, we estimated how the elapsed time is divided
nothing basis: Either there is enough memory to fit all          between using the CPU, reading data off the disk, and
the needed metadata for a particular checking pass or the        head travel between blocks (seek time plus rotational la-
checker simply aborts.                                           tency). The total elapsed time to check a 37 GB file sys-
   The time required to read the file system metadata            tem with 21 GB of data is 450 seconds, 15 seconds of
is partially constrained by the bandwidth of the disk.           which are spent in CPU time. That leaves 435 seconds
Depending on the file system, some kinds of file sys-            in I/O. We measured 1.3GB of data read. We estimated
tem data, such as blocks of statically allocated inodes          the amount of time to read 1.3 GB of data off the disk
or block group summaries, are located in contiguous              by using dd to transfer that amount of data from the par-
chunks at known locations. Reading this data off disk            tition into /dev/null, which took 42 seconds at optimal
is relatively swift.                                             read size. The remaining time, 393 seconds, we assume
   Other kinds of file system data are dynamically allo-         is spent seeking between tracks and waiting for the data
cated, such as directory entries, indirect blocks, and ex-       to rotate under the head. We measured 340,000 sepa-
tents, and hence are scattered all over the disk. Many           rate I/O requests. Given that the time spent seeking is
modern file systems allocate nearly all their metadata dy-       393 seconds, the average seek time for this disk is 12 ms,
namically. The location of much of this kind of metadata         and the average rotational latency is 7.4 ms, we estimate
is not known until the block of metadata pointing to it is       that fsck required about 393/(0.010 + 0.0074)  20000
read off the disk, introducing many levels of dependent          seeks (about one seek per every 15-20 I/Os).
reads. This portion of the file system check is usually              To estimate how long fsck will take in 2013, we use

                                                             2
           Total size                   80 GB                      provements in the specified UER are predicted[4], but
           Partition size               40 GB                      their relationship to shipped product is uncertain. In any
           Used                         23 GB                      case, we cannot depend on an improvement in the er-
           Files                       850,000                     ror rate when the consequences of failure are hours of
           Avg. seek time                12 ms                     downtime on even modestly­sized file systems, and days
           Avg. rotational latency      7.4 ms                     on any significantly sized file system. Along with hard-
                                                                   ware errors, increases in the complexity of file systems
                                                                   bring their own inevitable increase in file system corrup-
Table 2: Experiment disk characteristics (GB = 109                 tion due to bugs.
bytes                                                                 Complete disk failure has been measured at a rate of
                                                                   1-6% per year, depending on disk batch[12, 6], making
  Year           Elapsed time          I/Os     Data read          speedy file system restore from backup a priority. The
  2006                 7.5 m        340,000       1.3 GB           author's web site was recently down for a week when the
  2013 (est.)           80 m      5,400,000       21 GB            RAID system on a file server with several thousand user
                                                                   accounts failed catastrophically. The file­based restore
                                                                   took over a week to restore only 320 GB of data.
                Table 3: Projected fsck time.
                                                                   2.4   The fsck time crunch
                                                                      The combination of a 10­fold increase in fsck time
Seagate's projections for changes in capacity, bandwidth,          with, if anything, an increase in file system corruption
average seek time, and average rotational latency. We              requiring fsck creates the fsck time crunch. File systems
generously assume that the CPU time is negligible. We              will spend an ever increasing proportion of time unavail-
assume that the file system is 16 times bigger, and check-         able simply due to being checked and repaired. Repair­
ing requires reading 16 times the data and 16 times the            driven file system design is one solution to the fsck time
I/Os. We'll assume that bandwidth is 5 times bigger,               crunch.
and that the seek time is 10 ms, 1.2 times faster than
12 ms. The trend in rotational latency is about a 1.5x             3 Repair­driven file system design
improvement[9], so we will assume a 4.7 ms rotational                 How can we take file system check, repair, and restore
latency.                                                           into account when designing file systems? We discuss
   The time required to read the data off the disk is 42s ×        several general classes of useful techniques: (1) Divide
(16/5)  130s. The time required to execute the seeks               and isolate metadata, (2) make file system traversal fast,
is 16 × 20000 × (0.010s + 0.0047s)  4700s. The total               (3) use more (relatively cheap) disk space, (4) keep it
time is 4700s + 130s  4800s, or about 80 minutes,                  simple, (5) make restore fast.
compared to about 7.5 minutes today, an increase of a              3.1   Divide and isolate metadata
factor of about 10.                                                   One of the primary reasons that file system check and
   The above experiment is also best case in that ext3             repair are so painful is that they are O(file system size).
is an older file system with lots of statically allocated          All on­disk file system layouts we are aware of funda-
metadata and no checksums or complex structures. Most              mentally require examining every piece of metadata in
modern file systems have more dynamically allocated                the system to determine, for example, whether a par-
metadata and take more CPU time to verify their more               ticular data block is free or allocated. If, on the other
complex on­disk structures. Running fsck on a full 300             hand, we can divide up the metadata and create isolation
GB reiserfs file system with 60,000,000 files takes about          boundaries that limit the effects of metadata corruption
2.5 hours! In any case, we do not expect that a 10­fold            inside the boundary on things outside the boundary, we
increase in the time it takes to check and repair file sys-        will be able to make file system check O(size of isola-
tems will be acceptable to file system users.                      tion group)--a constant factor. This can be thought of as
2.3   Increasing file system corruption                            block groups on steroids. Any references that cross iso-
   Disk manufacturers currently specify an uncorrectable           lation group boundaries must have a simple, direct, con-
read error rate (UER) of 1 in 1013 to 1016 bits read[6], de-       stant time method to determine the state of the relevant
pending on quality, or one per 1­1,000 TB of data read--           out­of­group metadata, such as back pointers. One way
far from a shockingly large amount of data to read by              of doing this is to allow only inodes inside an isolation
today's standards. Empirical measurements of disk re-              group to point to data blocks inside of it.
liability vary from 1-2 orders of magnitude below spec-               A limitation is that we need to be able to find the
ification for a small sample size (n  20)[6] to a 1%               exact location of file system corruption. Useful meth-
per year rate of disk replacement due to "substantial sec-         ods include checksums, disk error reporting, dirty bits
tor failure" for a sample size in the thousands[12]. Im-           or marks on isolation groups showing that an update is

                                                               3
in progress, and error­checking code such as asserts and            strip actual data stored. We should use more of that space
sanity checks in the file system code (such as in [11] and          to make file system repair fast and reliable. In general,
[2]). In the case of completely silent corruption or cor-           optimizations designed to save on­disk space are miss-
ruption spread over the entire file system, this technique          ing the forest for the trees; the real point of reducing the
will not be of much help. Isolation group boundaries                size of on­disk data is to save bandwidth, if anything.
need to be aligned with underlying device fault bound-              Checksums and magic numbers are the most obvious ex-
aries, such as RAID stripes or disks.                               amples of useful additional on­disk structures. Multiple
   A corollary to divide and isolate is structuring file sys-       copies of metadata are another. Red zones to detect out­
tem metadata so that changes and repairs do not propa-              of­bounds writes and buffers between isolation groups
gate excessively to other parts of metadata. Many com-              may also be useful.
plex tree structures require rebalancing and removal or             3.4   Keep it simple
addition of one node may affect many other nodes. The                  The more knobs, gew­gaws, and ornamental curlicues
more work we have to do during repair, the longer it takes          in the on­disk format, the more difficult it is to repair
and the more fragile the file system.                               errors correctly and the more failure modes multiply.
   Another benefit of divide and isolate is that the checker        The concept of having completely pluggable on­disk file
requires less memory, since it can check most of the file           formats, as in Reiser4[1], is a file system repair night-
system in pieces, sequentially. If enough memory, CPUs,             mare. On­disk structures that have long tentacles of
and disks are available, checking can easily be multi­              cause­and­effect reaching into and changing other on­
threaded.                                                           disk structures also increase fragility.
3.2   Make file system traversal fast                                  The problem with very simple, straight­forward disk
   Even if we only need to read part of the metadata, it            formats is that they often do things like lookups in large
is still useful to make reading that metadata fast. Prob-           directories very slowly. One way to have your cake and
ably the most useful idea is to add another bitmap: The             eat it too is to keep the on­disk format extremely sim-
block metadata/data bitmap. This is a bitmap that marks             ple, but build complex lookup structures in memory as
whether a particular block contains metadata or data, re-           data is read in off the disk. The first operation may be
gardless of the type of metadata. This allows us to read            slow, but subsequent operations will go as fast as if the
all the metadata in one sweep of the disk arm without               lookup structure was stored on disk. Another option is
waiting for dependent reads. In the best case, this could           to cache complex structures on disk, but have redundant
convert our 2013 fsck time to 42s × (16/5)  134s --                 simpler structures which we can fall back on if the com-
only a third as long as current fsck.                               plex structure is damaged or non­existent.
   The metadata bitmap also improves reliability because            3.5   Make restore fast
it gives an extra piece of information about whether a                 When all else fails--fsck has a fatal bug, the disk is
block is data or metadata; currently fsck has to guess              smoking in the chassis, your RAID system ate itself--we
in some cases. This allows us to rely more heavily on               need to restore from backup. An interesting thing about
magic numbers in metadata during repair, as magic num-              restoring from backup is that the application knows ex-
bers (used as a sanity check to identify different meta-            actly how many files it will create, what size they will
data types) are more useful if we have some kind of out­            be, and what the directory structure will look like. If we
of­band indication of whether a block is data or meta-              have an interface to communicate this information in ad-
data. Consider the reiserfs bug in which any data block             vance of the actual creat(), write(), mkdir() calls, the file
containing what looked like a consistent inode, complete            system might be able to complete the operations faster.
with magic number, was considered a lost inode and put              It will also benefit applications like tar that also have de-
in lost+found. The scheme failed spectacularly when a               tailed information about what file system operations it
disk image of a reiserfs file system was stored in a file           will be doing.
in a reiserfs file system, since the inodes stored in the
contents of the file looked exactly like lost inodes.               4 Chunkfs
   Greater fan­out and back pointers (at least for com-                We are working on a repair­driven file system de-
mon, single link cases) may speed up file system traver-            sign called chunkfs[7]. The file system is divided into
sal. Another idea is to have hints about which blocks               chunks which can be checked and repaired mostly in-
to read next--they may not need to be exactly accurate,             dependent of other chunks. The metadata inside each
as long as on average they increase the speed at which              chunk looks like fairly standard file system metadata
metadata is read into memory.                                       layout--inodes, indirect blocks, etc.. However, any ref-
3.3   Space is cheap                                                erences that cross chunk boundaries must have both for-
  Modern disks have about 50% free space on                         ward and back pointers, so that external chunk references
average[5], and capacity will probably continue to out-             can be quickly checked. The only references that cross

                                                                4
chunk boundaries are when files or directories outgrow a           [2] ZFS at OpenSolaris.org. http://www.
chunk; when this happens, we create a "continuation in-                opensolaris.org/os/community/zfs/.
ode" in another chunk and set up forward and back point-
ers between the original inode and its continuation. Only          [3] Eric J. Bina and Perry A. Emrath. A faster fsck for
chunks that are damaged or dirty need to be checked,                   BSD UNIX. In USENIX Winter Technical Confer-
in addition to a cross­chunk reference checking pass.                  ence, pages 173­185, 1989.
Chunks can also be checked as needed and on demand,                [4] Kurt Chan. A comparison of disk drives for en-
so that the entire file system need not be down while only             terprise computing. ;login: The Usenix Magazine,
parts of it are being checked.                                         31(3), 2006.
5 Related work                                                     [5] John R. Douceur and William J. Bolosky. A large-
   Some very basic optimizations for fsck, long imple-                 scale study of file-system contents. In Proceedings
mented in most checkers, are found in [3]. Many file sys-              of the International Conference on Measurement
tems use journaling, copy­on­write, or soft updates to                 and Modeling of Computer Systems, 1999.
recover from partially finished file system updates, as of-
                                                                   [6] Jim Gray and Catharine van Ingen. Empirical mea-
ten occurs with system panic or power loss. These tech-
                                                                       surements of disk failure rates and error rates. Tech-
niques work only for completing in­progress updates;
                                                                       nical Report MSR-TR-2005-166, Microsoft Re-
they have no effect on fsck time when repairing file sys-
                                                                       search, 2005.
tem corruption from other sources. Peacock, et al. mod-
ified Solaris UFS to speed up fsck through a variety of            [7] Val Henson, Arjan van de Ven, Amit Gud, and Zach
techniques[10]. Most techniques were intended to solve                 Brown. Chunkfs: Using divide-and-conquer to im-
the fast crash recovery problem; some, like recording the              prove file system reliability and repair. In Hot Top-
last used inode, are useful for all sources of file system             ics in System Dependability, 2006.
corruption. Their work was constrained by the need for
partial backwards compatibility with existing tools.               [8] T. J. Kowalski and Marshall K. McKusick. Fsck
                                                                       - the UNIX file system check program. Technical
6 Conclusion                                                           report, Bell Laboratories, 1978.
   Traditional file system check and repair will become
                                                                   [9] Mark Kryder.     Future storage tech-
both more frequent and more time­consuming due to
                                                                       nologies: A look beyond the horizon.
disk hardware trends--the fsck time crunch. To combat
                                                                       http://www.snwusa.com/documents/
this, we need repair­driven file system design to become
                                                                       presentations-s06/MarkKryder.pdf.
the norm. Examples of repair­driven design are making
file system traversal fast using techniques such as meta-         [10] J. Kent Peacock, Ashvin Kamaraju, and Sanjay
data bitmaps, dividing and isolating metadata so it can be             Agrawal. Fast consistency checking for the Solaris
checked incrementally and so that corruption propagates                file system. In USENIX Annual Technical Confer-
as little as possible, keeping the on­disk format as simple            ence, 1998.
as possible, and taking advantage of ample disk space to
add checksums and redundancy to ease repair.                      [11] Vijayan Prabhakaran, Lakshmi N. Bairavasun-
                                                                       daram, Nitin Agrawal, Haryadi S. Gunawi, An-
7 Acknowledgments                                                      drea C. Arpaci-Dusseau, and Remzi H. Arpaci-
  Thanks to Arjan van de Ven, Zach Brown, Andreas                      Dusseau. IRON file systems. In SOSP '05, pages
Dilger, Kristal Pollack, Theodore Y. T'so, Brian Warner,               206­220, 2005.
and Ric Wheeler for discussions, ideas, insightful                [12] Thomas Schwarz, Mary Baker, Steven Bassi, Bruce
comments, and suggestions.                                             Baumgart, Wayne Flagg, Catherine van Ingen,
                                                                       Kobus Joste, Mark Manasse, and Mehul Shah.
   Note to reviewers: Repair-driven file system design                 http://www.hpl.hp.com/personal/
has been briefly discussed in two previous publications:               Mary_Baker/publications/wip.pdf,
"Chunkfs: Using divide­and­conquer to improve file                     2006.
system reliability and repair," appeared in HotDep '06,
and in a review of the Linux File Systems Workshop '06            [13] Adam Sweeney, Doug Doucette, Wei Hu, Curtis
in Linux Weekly News.                                                  Anderson, Michael Nishimoto, and Geoff Peck.
                                                                       Scalability in the XFS file system. In Proceedings
References                                                             of the 1996 USENIX Technical Conference, 1996.
 [1] Namesys. http://www.namesys.com/.

                                                              5