Fossil
an archival file server
Russ Cox
rsc@mit.edu
PDOS Group Meeting
January 7, 2003
http://pdos/~rsc/talks
History .....................................................................................................................
Cached WORM file server (Quinlan and Thompson):
active file system on magnetic disk acts as WORM cache
mark all disk blocks copy-on-write at 5am to take snapshot
slowly dribble snapshot to WORM
maintain forward linked list of snapshots
present snapshot tree to users
became integral part of our computing environment
% ls -lp /n/dump/*/*/386/bin/8c | uniq
--rwxrwxr-x presotto sys 243549 Jan 21 1997 8c
...
--rwxrwxr-x presotto sys 298289 Dec 14 18:55 8c
%
% yesterday -D authsrv.c
diff -n /n/dump/2003/0106/sys/src/cmd/auth/authsrv.c authsrv.c
/n/dump/2003/0106/sys/src/cmd/auth/authsrv.c:100 c authsrv.c:100
< break;
---
> exits(0);
%
Quinlan, ``A Cached WORM File System'', SP&E December 1991.
http://plan9.bell-labs.com/~seanq/cw.pdf
History, ii ................................................................................................................
WORM was right choice in 1990
one jukebox is infinite: capacity grows faster than our storage needs
no head crashes
plausible random access times
magnetic disks too small, tape too slow
bootes (1990): 100MB mem, 1GB disk, 300GB juke box
emelie (1997): 350MB mem, 54GB disk, 1.2TB juke box
What about 1999?
disks cheap and big, getting cheaper and bigger
disks cheaper and bigger than optical disk
disks much faster than optical disk
disks have head crashes
build a better base out of magnetic disk?
Venti .........................................................................................................................
Archival block store (Quinlan and Dorward):
SHA1-addressed
blocks never reclaimed
omit duplicate blocks
compress
Implementation:
log of all blocks ever written
log broken into fixed-size (say, 500MB) chunks called arenas
arenas copied to other media (tape, DVD, etc.) as they fill
index on the side makes lookups efficient
Initial system:
iolaire (1999): 2GB mem, 36GB index, 480GB hw raid arenas
Quinlan and Dorward, ``Venti: a new approach to archival storage'', FAST 2002.
http://plan9.bell-labs.com/sys/doc/venti.pdf
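As a rough illustration of the design on this slide, here is a toy in-memory model in Python: blocks are named by their SHA1 hash, a duplicate write stores nothing new, and an index maps each hash to a position in an append-only log split into arenas. The class and method names, the tiny arena size, and the use of zlib are assumptions made for the sketch; this is not Venti's on-disk format or protocol.

```python
import hashlib
import zlib


class BlockStore:
    """Toy SHA1-addressed, append-only block store (illustrative only)."""

    def __init__(self, arena_size=4096):
        self.arena_size = arena_size   # toy size; the talk suggests about 500MB
        self.arenas = [bytearray()]    # the log, broken into fixed-size chunks
        self.index = {}                # score -> (arena number, offset, length)

    def write(self, data: bytes) -> bytes:
        score = hashlib.sha1(data).digest()
        if score in self.index:        # duplicate block: nothing new is stored
            return score
        packed = zlib.compress(data)
        arena = self.arenas[-1]
        if arena and len(arena) + len(packed) > self.arena_size:
            arena = bytearray()        # current arena is full: start the next one
            self.arenas.append(arena)
        self.index[score] = (len(self.arenas) - 1, len(arena), len(packed))
        arena.extend(packed)           # blocks are only ever appended
        return score

    def read(self, score: bytes) -> bytes:
        n, off, length = self.index[score]
        data = zlib.decompress(bytes(self.arenas[n][off:off + length]))
        # The name of a block is its hash, so every read can be verified.
        assert hashlib.sha1(data).digest() == score
        return data
```

A full arena is never modified again in this model, which is what would make it safe to copy to other media.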
Venti: storing data streams ............................................................................
Venti stores blocks. To store large data, use hash tree:
[Figure: hash tree with leaf blocks of type BtData and pointer-block levels BtData+1, BtData+2, BtData+3 above them.]
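The hash tree can be sketched in a few lines of Python. Leaf blocks hold the data; each level above holds the concatenated SHA1 hashes of the blocks below, until a single block remains. The store here is a plain dict, and the function names and block size are assumptions for the sketch; the returned depth plays the role of the +1, +2, +3 in the block type names.

```python
import hashlib

SCORE_SIZE = 20  # bytes in a SHA1 hash


def put(store, block):
    """Store one block under its SHA1 hash and return the hash."""
    score = hashlib.sha1(block).digest()
    store[score] = block
    return score


def write_stream(store, data, block_size=8192):
    """Store data as a hash tree; return (score of the top block, depth)."""
    fanout = block_size // SCORE_SIZE          # scores per pointer block
    assert fanout >= 2, "block_size must hold at least two scores"
    # Level 0: the data blocks themselves.
    scores = [put(store, data[i:i + block_size])
              for i in range(0, max(len(data), 1), block_size)]
    depth = 0
    while len(scores) > 1:
        # Pack up to `fanout` scores into each pointer block of the next level.
        scores = [put(store, b"".join(scores[i:i + fanout]))
                  for i in range(0, len(scores), fanout)]
        depth += 1
    return scores[0], depth


def read_stream(store, score, depth):
    """Reassemble the data by walking the tree down from the top block."""
    block = store[score]
    if depth == 0:
        return block
    return b"".join(read_stream(store, block[i:i + SCORE_SIZE], depth - 1)
                    for i in range(0, len(block), SCORE_SIZE))
```

The pair (score, depth) is all a reader needs to recover the whole stream.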
Venti: storing complex data structures ....................................................
To store a list of streams, use a stream of VtEntry blocks.
same as data but has block types BtDir, BtDir+1, ...
Can encode tree-like structures
each stream is all data (a Venti file) or all entry blocks (a Venti
directory)
[Figure: tree of streams rooted at VtRoot. Key: Venti file, Venti entry (VtEntry), Venti directory, Venti pointer (SHA1 hash).]
Can traverse hierarchy ignoring higher-level structure
general purpose copy
other utilities
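A general-purpose copy of the kind mentioned above might look like the following Python sketch. It needs to know only three things about a block: whether it is a pointer block, a leaf of a directory stream, or a leaf of a data stream. The 22-byte entry layout (hash, depth byte, directory flag) is invented for the example and is much simpler than a real VtEntry; the stores are plain dicts.

```python
import hashlib

SCORE = 20          # bytes in a SHA1 hash
ENTRY = SCORE + 2   # toy entry: score, depth byte, is-directory byte


def put(store, block):
    """Store one block under its SHA1 hash and return the hash."""
    score = hashlib.sha1(block).digest()
    store[score] = block
    return score


def pack_entry(score, depth, is_dir):
    """Encode a (hypothetical) entry describing one stream."""
    return score + bytes([depth, int(is_dir)])


def copy_tree(src, dst, score, depth, is_dir, seen=None):
    """Copy everything reachable from one stream, never interpreting file data."""
    seen = set() if seen is None else seen
    key = (score, depth, is_dir)
    if key in seen:                  # shared block: already handled
        return
    seen.add(key)
    block = src[score]
    if depth > 0:                    # pointer block: a run of scores
        for i in range(0, len(block), SCORE):
            copy_tree(src, dst, block[i:i + SCORE], depth - 1, is_dir, seen)
    elif is_dir:                     # leaf of a directory: a run of entries
        for i in range(0, len(block), ENTRY):
            e = block[i:i + ENTRY]
            copy_tree(src, dst, e[:SCORE], e[SCORE], bool(e[SCORE + 1]), seen)
    dst[score] = block               # children first, then the block itself
```

Nothing in copy_tree knows about file names or metadata, which is the point: the same walk works for any structure built from these streams.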
Venti: storing a file system .............................................................................
Vac: Venti file system archive format
vac directory can be thought of as stream of inodes plus stream of
directory entries
[Figure: a vac file system stored in Venti. Labeled blocks: VtRoot, fs root block, root directory info block, root metadata. Key: Venti file, Venti entry (Entry), Venti directory, Venti pointer (score); Vac file, Vac entry (DirEntry), Vac directory, Vac pointer (integer index).]
Venti: storing a file system, ii .......................................................................
Vac compresses everything to 45 bytes:
% cd /sys/src/cmd/fossil
% vac -f fossil.vac *
% ls -l fossil.vac
--rw-rw-r-- M 8 rsc sys 45 Jan 6 14:51 fossil.vac
% cat fossil.vac
vac:1bc12e0a81baf8c1ab62aaba382f6c1a0b11633a
% ls -l /n/vac
--rwxrwxr-x rsc sys 61096 Dec 21 15:35 /n/vac/8.9ping
--rwxrwxr-x rsc sys 219307 Jan 5 13:11 /n/vac/8.flchk
--rwxrwxr-x rsc sys 217712 Jan 5 13:11 /n/vac/8.flfmt
...
%
Fossil .........................................................................................................................
Archival Venti-based file server (Quinlan, McKie, Cox)
Conceptually, a rewrite of the cached WORM file server
lots of software engineering advances (not discussed here)
file system layout identical to vac
local disk block pointers: 32-bit disk block address zero-padded to 160 bits (the size of a SHA1 hash)
replace WORM juke box with Venti store
replace disk-based cache with disk-based write buffer
write buffer can store file system if not using Venti
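The zero-padded pointer can be sketched as follows. The slide says only that a 32-bit block address is padded to 160 bits; placing the zeros first and the address last in big-endian order, and the NIL_BLOCK marker, are assumptions of this Python sketch.

```python
SCORE_SIZE = 20          # 160 bits, the size of a SHA1 hash
NIL_BLOCK = 0xFFFFFFFF   # assumed marker meaning "not a local block"


def local_to_score(addr: int) -> bytes:
    """Pad a 32-bit local block address with zeros up to score size."""
    return bytes(SCORE_SIZE - 4) + addr.to_bytes(4, "big")


def score_to_local(score: bytes) -> int:
    """Return the local block address, or NIL_BLOCK for a Venti score."""
    if any(score[:SCORE_SIZE - 4]):      # a nonzero byte in the padding
        return NIL_BLOCK
    return int.from_bytes(score[SCORE_SIZE - 4:], "big")
```

Because local and Venti pointers have the same size, a block can mix the two, and the chance that a genuine SHA1 hash begins with 128 zero bits is negligible.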
Snapshots................................................................................................................
Epoch-based snapshot procedure:
fs.epoch is logical snapshot clock (sequence number)
every block in write buffer records allocation epoch b.epoch
blocks with b.epoch < fs.epoch are copy on write.
To take snapshot: increment epoch, rewrite root block
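A minimal Python model of the epoch rule, assuming a two-level file system (a root block that maps names to file blocks) kept entirely in memory. Class and method names are invented for the sketch; only epoch and the b.epoch < fs.epoch test come from the slide.

```python
class Block:
    def __init__(self, epoch, data=None):
        self.epoch = epoch             # b.epoch: epoch in which b was allocated
        self.data = dict(data or {})   # root: name -> Block; file: its contents


class FS:
    def __init__(self):
        self.epoch = 1                 # fs.epoch: the logical snapshot clock
        self.root = Block(self.epoch)
        self.snapshots = {}            # epoch -> root block frozen at that epoch

    def _writable(self, b):
        """Return b if it may be changed in place, else a fresh copy of it."""
        if b.epoch < self.epoch:       # older than the clock: copy on write
            return Block(self.epoch, b.data)   # shallow copy, children shared
        return b

    def write(self, name, contents):
        self.root = self._writable(self.root)
        child = self._writable(self.root.data.get(name) or Block(self.epoch))
        child.data["contents"] = contents
        self.root.data[name] = child

    def read(self, name, root=None):
        return (root or self.root).data[name].data["contents"]

    def snapshot(self):
        taken = self.epoch
        self.snapshots[taken] = self.root       # the old tree is now frozen
        self.epoch += 1                         # increment epoch ...
        self.root = self._writable(self.root)   # ... and rewrite the root block
        return taken
```

Taking a snapshot touches only the clock and the root; every other block is copied lazily, the first time a later write reaches it.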
My laptop takes snapshots on the hour:
% ls -lp /n/snap/2003/0106/0600/sys/src/cmd/fossil/fs.c
--rw-rw-r-- rsc sys 16943 Jan 5 13:03 fs.c
% ls -lp /n/snap/*/*/*/sys/src/cmd/fossil/fs.c | uniq
--rw-rw-r-- rsc sys 14895 Nov 28 02:05 fs.c
...
--rw-rw-r-- rsc sys 16918 Jan 5 12:48 fs.c
--rw-rw-r-- rsc sys 16943 Jan 5 13:03 fs.c
%
Nothing described so far involves Venti.
Archival...................................................................................................................
An archival snapshot goes into the archival tree.
My laptop takes archival snapshots daily, at 5AM:
% ls -lp /n/dump/2003/0106/sys/src/cmd/fossil/fs.c
--rw-rw-r-- M 1652 rsc sys 16943 Jan 5 13:03 fs.c
% ls -lp /n/dump/*/*/sys/src/cmd/fossil/fs.c | uniq
--rw-rw-r-- rsc sys 14230 Nov 9 02:51 fs.c
...
--rw-rw-r-- rsc sys 16943 Jan 5 13:03 fs.c
%
Background process archives tree to Venti
only knows about Venti hierarchy
rewrites pointers to point at Venti blocks
prints Venti hashes to console
% grep vac: console.log
...
Sat Jan 4 05:01:46 archive vac:c164dba46cbe319bf5a3a6b93a6aec0aa09198f0
Sun Jan 5 05:01:14 archive vac:96f48562b826b5b95fef854e488fb06e66ad9eca
Mon Jan 6 05:02:12 archive vac:722d61f18fff491d00103be309af66ebb7cba9f2
%
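The archiving walk can be sketched as a bottom-up recursion over the pointer hierarchy: write the children to Venti, rewrite the parent's pointers to the resulting hashes, then write the parent. In this Python sketch the disk and Venti are dicts, a local pointer is assumed to be 16 zero bytes followed by a 4-byte address, and the function names are invented; the real archiver also has to deal with entry blocks and concurrent writes, which are left out.

```python
import hashlib

SCORE_SIZE = 20


def local_ptr(addr):
    """Assumed encoding of a local disk pointer: zero padding, then address."""
    return bytes(SCORE_SIZE - 4) + addr.to_bytes(4, "big")


def is_local(ptr):
    return not any(ptr[:SCORE_SIZE - 4])


def archive(disk, venti, ptr, depth):
    """Archive the tree under ptr; return the Venti score that replaces ptr."""
    if not is_local(ptr):
        return ptr                     # already in Venti from an earlier archive
    addr = int.from_bytes(ptr[SCORE_SIZE - 4:], "big")
    block = disk[addr]
    if depth > 0:
        # Archive the children first, rewriting each pointer to a Venti score.
        block = b"".join(
            archive(disk, venti, block[i:i + SCORE_SIZE], depth - 1)
            for i in range(0, len(block), SCORE_SIZE))
        disk[addr] = block             # the local copy now points into Venti
    score = hashlib.sha1(block).digest()
    venti[score] = block
    return score
```

The score returned for the root is the one piece of information needed to find the whole archived tree again, which is why it is worth printing.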
Block reclamation ...............................................................................................
Non-archival snapshots will eventually fill the disk
Want to retire old snapshots to free up disk space
Epoch-based reclamation:
fs.epochLow is epoch of earliest available snapshot
after copy-on-write, block is no longer in active file system
b.epochClose is epoch when b was copied-on-write
block only needed by snapshots in [b.epoch, b.epochClose).
if b.epochClose ≤ fs.epochLow then b can be reused
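The reuse test follows from the half-open interval: a block closed at epochClose is needed only by snapshots with epochs strictly below epochClose, so once the earliest available snapshot is at or past that epoch nothing refers to the block. A small Python sketch, with a hypothetical Block record and None standing for a block that is still in the active file system:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Block:
    epoch: int                          # b.epoch: allocation epoch
    epoch_close: Optional[int] = None   # b.epochClose: set on copy-on-write


def reusable(b: Block, epoch_low: int) -> bool:
    """True if no available snapshot (epoch >= epoch_low) can still need b."""
    if b.epoch_close is None:           # still part of the active file system
        return False
    # b is needed by snapshots in [b.epoch, b.epoch_close) only.
    return b.epoch_close <= epoch_low
```

For example, a block allocated in epoch 3 and copied-on-write in epoch 5 is needed by snapshots 3 and 4; it becomes reusable once fs.epochLow reaches 5.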
Fossil tricks ............................................................................................................
Fs won't boot, need to look at sources (on fs):
vacfs