USENIX Association
Proceedings of the
4th Annual Linux Showcase & Conference,
Atlanta
Atlanta, Georgia, USA
October 10-14, 2000
THE ADVANCED COMPUTING SYSTEMS ASSOCIATION
© 2000 by The USENIX Association. All Rights Reserved.
For more information about the USENIX Association:
Phone: 1 510 528 8649
FAX: 1 510 548 5738
Email: office@usenix.org
WWW: http://www.usenix.org
Rights to individual papers remain with the author or the author's employer.
Permission is granted for noncommercial reproduction of the work for educational or research purposes.
This copyright notice must be included in the reproduced paper. USENIX acknowledges all trademarks herein.
LOBOS: (Linux OS Boots OS) Booting a kernel in 32-bit mode
Ron Minnich
August 14, 2000
Advanced Computing Lab, Los Alamos National Labs,
Los Alamos, New Mexico

Abstract

LOBOS (Linux Os Boots OS) is a system call that allows a running Linux
kernel to boot a new kernel without leaving 32-bit protected mode and,
in particular, without using the BIOS in any way. This capability in
turn allows Linux to be used as a network bootstrap program and even as
a BIOS, both of which we are working on now. In this paper we discuss
how LOBOS works, how we use it, and how LOBOS makes Linux usable as a
BIOS, replacing the proprietary PC BIOSes we have today.[1] LOBOS has
been used by two other groups as a reference implementation for their
Linux-boots-Linux system calls. One of these other implementations,
bootimg, may become a part of the 2.4 kernel.

[1] It's a mere coincidence that many other things in New Mexico are
called LOBOS too.

1 Introduction

At the ACL we have built Linux clusters of 64 nodes, and most recently
have built a larger cluster of 128 nodes. While we currently use the
'magic floppy' approach for loading and reloading cluster nodes, we know
that this approach will not scale to even 256 nodes: it takes far too
much time and effort to put floppies into 256 nodes and make sure they
boot properly. We have also found that we need to have absolute control
at boot time of what the node does, even if we are not reloading or
initializing the node. We might have half the cluster running a
different version of Linux with a different root file system at
different times. We might even let jobs in a queueing system indicate
which kernel they need to run, and part of the work of starting a job on
a set of cluster nodes might be booting those nodes with the proper
kernel.

To support our needs we have decided to use a netboot-style
initialization for normal operation as well as for loading and reloading
the cluster nodes. Each time the node is booted we can control which
kernel to run, how to get the kernel (over the network or on the disk),
and what root partition to use, either local or via NFS, even though in
many cases the operating system is booted from the local disk, and the
root file system is chosen from one partition of the local disk.

In this paper we will describe our approach to netboot, which is to use
Linux as the bootstrap instead of a special bootstrap program. We first
provide an overview of how netboot has been done in the past, how it is
being done now in the Windows/PC world, and the problems with the
current PC approaches. We close with a discussion of how we might extend
our work and use Linux as the BIOS, and hence save a few steps and a lot
of time in the booting process.

2 Netboot overview

Netbooting has been around in the workstation world for many years, with
perhaps the most capable systems being offered by Sun Microsystems. On a
Sun system (or, nowadays, any system that runs OpenBoot firmware such as
a Power Macintosh), one can simply type 'boot net' and the PROM-based
bootstrap code is able to:

1. Initialize the network interface

2. Send out broadcast or point-to-point IP packets to locate a tftp
   server

3. Load a secondary bootstrap from the tftp server
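To make step 3 concrete, here is a minimal sketch in C of the first packet a client sends to a tftp server: a read request (RRQ) in the layout defined by RFC 1350. It is an illustration only, not Sun's PROM code; the function name tftp_build_rrq and the file name used in the usage note are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Build a TFTP read request (RRQ) as laid out in RFC 1350:
 * a 2-byte opcode (1, network byte order), the file name, a NUL,
 * the transfer mode, and a final NUL.
 * Returns the packet length, or 0 if buf is too small.
 * (tftp_build_rrq is a hypothetical helper, not part of any library.)
 */
size_t tftp_build_rrq(uint8_t *buf, size_t buflen,
                      const char *filename, const char *mode)
{
    size_t flen = strlen(filename);
    size_t mlen = strlen(mode);
    size_t need = 2 + flen + 1 + mlen + 1;

    if (need > buflen)
        return 0;

    buf[0] = 0;                                  /* opcode, high byte */
    buf[1] = 1;                                  /* opcode, low byte: 1 = RRQ */
    memcpy(buf + 2, filename, flen + 1);         /* name plus its NUL */
    memcpy(buf + 2 + flen + 1, mode, mlen + 1);  /* mode plus its NUL */
    return need;
}
```

A client would send the resulting bytes in a UDP datagram to port 69 of the server, for instance after calling tftp_build_rrq(pkt, sizeof pkt, "boot.img", "octet"), where "boot.img" is a made-up file name.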
The secondary bootstrap in turn is capable of mounting NFS partitions,
disk partitions, and so on to locate and load the actual kernel to boot.
Net booting on Suns has been used for almost 15 years. The protocols are
open and there are many open source tftp servers that can support Sun
clients for netboot.

In the PC world the situation is not nearly as good. Even today, few PC
BIOSes are capable of supporting a netboot option. Even if the BIOS
understands netboot, the user often has to procure a PROM for the
network card, which of course only works on that one card, and only if
the card vendor has provided PROM software. Both the BIOS and the
network card PROM are 16-bit 8086 code. As a result, 8086 mode operation
is more important than ever. We would like to see 8086 emulation
gradually grow less important and eventually disappear, but the netboot
standards being promulgated by Intel and Microsoft are leading us the
other way.

A further problem is the nature of the standards for netboot. The
network card boot model has to conform to a standard interface (NDIS2, a
16-bit Windows model) designed by Microsoft. Intel is working out the
BIOS API as well as the network protocols.

As a result of these two trends, PC netboot is going to be 16-bit code
cleaving to a network card API defined by Windows, using an
Intel-defined BIOS API and Intel-defined protocols. Much of this code is
proprietary, and using the BIOS for netboot will require us to continue
relying on an 8086 assembler. We end up more dependent on 16-bit code
running on an emulation of a 20-year-old processor, all of which is
proprietary. This is not progress.

2.1 Our requirements for netboot

Given this undesirable situation we decided to give the problem another
look, taking nothing for granted. Our goals are simple: we want to load
something onto the CPU that in turn can load boot parameters over the
network interface, find out what to do, and then load a kernel. Whatever
it is has to be Open Source; we are no longer interested in burning
proprietary binaries into PROMs. We have a few other goals:

1. We don't like assembly code. Also, we have no desire to put a lot of
   effort into x86 assembly and then repeat our effort with, e.g., Alpha
   assembly. Therefore, any code we write will be C or better, unless it
   is impossible to escape assembly.

2. We don't like code that only works for a particular Ethernet card.
   There are a number of packages for netboot available but their
   usefulness is strictly limited to a small number of cards. We want to
   support any network card that Linux supports.

3. We don't see any point in reinventing the wheel. If there is code
   available that supports lots of network cards, file systems, disk
   types, and boot protocols, why start from scratch?

4. We don't want to count on the features of any one motherboard. If a
   motherboard supports netboot, that's no real help, since we don't
   expect to use that motherboard forever.

5. We want standard protocols, such as NFS, bootp, and so on.

3 The New Netboot

We realized that the requirements for our netboot could be met in one of
two ways: we could write a new netboot program from scratch, or we could
build a netboot using a minimal Linux kernel. Although there is an
apparent advantage to writing our own program from scratch, experience
shows that it is not a real advantage. The Sun netboot code has to
support many of the same capabilities as a full-blown operating system:
it has to be able to do NFS mounts, mount disk partitions, and so on. At
the same time, there are many types of file systems it cannot use, such
as msdos or AFS. Finally, there is no huge savings in space: the network
bootstrap is 128 Kbytes. A minimal Linux kernel is 300 Kbytes. Given the
current cost of storage, the difference is insignificant. We decided to
go with a minimal Linux kernel for our bootstrap.

3.1 How the new Netboot works

The new netboot works as follows. The netboot code is actually a tiny
Linux kernel. It doesn't have much: basically disk, filesystem, network
and NFS code. In the current version it does not even need to be able to
run user-mode programs; it never exits kernel mode. All it has to do is
the following:

1. Boot (eventually from NVRAM, for now from floppy, CDROM, or hard
   drive)
2. Contact BOOTP server and get parameters for this machine

3. Mount a remote file system via NFS or AFS; or mount the disk or
   floppy or CDROM.

4. Overlay the currently running kernel with the new file.

Items 1-3 exist in current Linux. The only thing missing is the ability
to overlay the kernel with a new kernel. In a sense we need exec for the
kernel. The steps required to support this operation are:

1. In kernel mode, open the file and read it into memory. This step is
   done in kernel mode so that we need not depend on starting /etc/init
   and having a user program read the file in. In other words, a kernel
   can boot a new kernel without even starting any user-mode programs.
   The file must be read into memory but not into any area of memory
   occupied by the existing kernel: the existing kernel has to keep
   running, so overlaying the current kernel code as the file is read in
   is out of the question. Overlaying the running kernel is the last
   step.

2. Move critical kernel structures into a safe place. These structures
   must be moved out of the way when the new kernel is copied over the
   running kernel. So far these structures include Virtual Memory (VM)
   support structures such as page tables and, on the Pentium, the
   Global Descriptor Table (GDT); and the parameters used by the kernel
   when it boots to locate the root partition, as well as any arguments
   passed to the kernel from the boot command line. These structures
   will soon also include the log buffer, so that kernel printk messages
   are not lost on reboot.

3. Turn off interrupts. This is the point of no return, so any error
   checking should have been done by this point.

4. Switch the VM hardware over to the new page tables (and GDT, on the
   Pentium).

5. Copy the final bootstrap code to a safe place where it will not be
   overlaid by the new kernel code. The final bootstrap code is simple:
   it performs a copy of the kernel to the standard location (0x100000),
   overlaying the currently running kernel.

6. Jump to the final bootstrap code. The final bootstrap code copies the
   new kernel into the right place and jumps to it.

We call this "kernel exec" LOBOS, for Linux Os Boots OS. In the next
section we discuss its operation in more detail.

3.2 LOBOS implementation

The LOBOS implementation consists of five major pieces, resulting in the
addition of 300 or so lines to the kernel. A context diff to apply these
changes to a 2.2.13 kernel is available at www.acl.lanl.gov/~rminnich.
The basic pieces are as follows:

1. Entry for the lobos system call in arch/i386/kernel/entry.S

2. Some additions to arch/i386/kernel/head.S to make room for the 'safe
   areas' for the GDT, page tables, kernel startup parameters, and other
   information

3. The code to read in the new file, in kernel/sys.c

4. The code to turn off interrupts, move the processor page tables and
   GDT, and switch over to the new page tables and GDT, in
   arch/i386/kernel/process.c

5. The code to copy the new kernel to the right place and jump to it, in
   kernel/sys.c

We will go over each of these in turn.

3.2.1 System Call Entry Point

The system call entry point is simply an additional line in
arch/i386/kernel/entry.S, as shown in Figure 1.

        .long SYMBOL_NAME(sys_ni_syscall)       /* streams1 */
        .long SYMBOL_NAME(sys_ni_syscall)       /* streams2 */
        .long SYMBOL_NAME(sys_vfork)            /* 190 */
        .long SYMBOL_NAME(sys_lobos)            /* 191 */

Figure 1: Additional system call entry for lobos at 191 in the 2.2.13
kernel
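Section 3.2.4 remarks that the system call table could just as easily be written in C. As a rough user-space illustration of what the entry in Figure 1 amounts to, the sketch below models the table as an array of function pointers indexed by system call number. Only the slot number 191 comes from the paper; the names init_table, dispatch, and sys_lobos_stub, the table size, and the stub's behaviour are assumptions made for the example.

```c
#include <stddef.h>

#define NR_SYSCALLS 192      /* assumed table size for the sketch */
#define NR_LOBOS    191      /* slot number taken from Figure 1 */
#define ERR_NOSYS   (-38)    /* value of -ENOSYS on Linux */
#define ERR_FAULT   (-14)    /* value of -EFAULT on Linux */

typedef long (*syscall_fn)(const char *arg);

/* Placeholder for unimplemented slots, like sys_ni_syscall. */
static long sys_ni_syscall(const char *arg)
{
    (void)arg;
    return ERR_NOSYS;
}

/* Hypothetical stand-in for sys_lobos: it only validates its argument. */
static long sys_lobos_stub(const char *file)
{
    return file ? 0 : ERR_FAULT;
}

static syscall_fn sys_call_table[NR_SYSCALLS];

/* Fill every slot with the placeholder, then install the new entry. */
void init_table(void)
{
    for (size_t i = 0; i < NR_SYSCALLS; i++)
        sys_call_table[i] = sys_ni_syscall;
    sys_call_table[NR_LOBOS] = sys_lobos_stub;
}

/* Look up the handler by number and call it, as the trap handler would. */
long dispatch(unsigned nr, const char *arg)
{
    if (nr >= NR_SYSCALLS || sys_call_table[nr] == NULL)
        return ERR_NOSYS;
    return sys_call_table[nr](arg);
}
```

After init_table(), a call such as dispatch(191, "/vmlinux") reaches the stub, while any other number falls through to the placeholder.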
3.2.2 Safe Areas

The safe areas consist of a few additional pages at the beginning of the
kernel virtual address space. The lobos bootstrap code knows not to
touch these pages, and they are not used in normal kernel operation.
Hence this memory represents a safe place to put data that will not be
changed by either lobos or the kernel. Currently the GDT, reboot code,
and kernel parameters are saved here. The code for the safe areas is
shown in Figure 2. The reboot_pgdir area is not currently used.

/* here begins the support for a kernel rebooting a kernel. Not all this stuff
 * is used yet. Also, at some point, the logbuffer goes here so that logs are
 * preserved across reboots
 */
ENTRY(reboot_gdt)
.org 0x7000
ENTRY(reboot_pgdir)
.org 0x8000
ENTRY(reboot_code)
/* leave padding for later use, i.e. a log buffer that survives reboot */
.org 0x10000

.globl SYMBOL_NAME(reboot_gdt)
.globl SYMBOL_NAME(reboot_pgdir)
.globl SYMBOL_NAME(reboot_code)
/* end reboot stuff */

Figure 2: How the safe areas are declared in head.S

3.2.3 Reading in the file

The real meat of this system call is the work done to read in a file and
set the kernel up for reboot. This work occurs in a few places. The
first is the sys_lobos system call, which we show in Figure 3. This
function is called with a name. It first gets a copy of the file name
via getname, then performs a lookup on the file.

/* get a dentry via lookup, then use the open_private function to open
 * it, then use read_exec to read it.
 */
asmlinkage int sys_lobos(char *file)
{
    char *name;
    struct dentry *d;

    name = getname(file);
    printk("sys_bootfile: file ptr is %p\n", file);
    if (! name)
        return -EFAULT;
    printk("the name is %s\n", name);
    d = lookup_dentry(name, 0, 1 /* read only */);
    if (d)
    {
        void *v;
        int result;
        int good = 1;
        size_t count;

        printk("good open, dentry is %p\n", d);
        if (! d->d_inode)
            good = 0;
        if (! good) printk("NO INODE!\n");
        if (good) {
            count = d->d_inode->i_size;
            printk("the size is %d\n", count);
            printk("let's try to malloc that much\n");
            v = vmalloc(count);
            if (v) {
                result = read_exec(d, 0, v, count, 1);
                printk("read result is %d\n", result);
                if (result == count)
                    run_boot_file(v, count);
            }
            else printk("alloc failed\n");
        }
    }
    else
        printk("open failed, d is null\n");
    return -EINVAL;
}

Figure 3: Top level of the sys_lobos system call

The function has to get access to a file, which is done via the
lookup_dentry call. We double check to make sure there is a real inode
associated with the dentry, although this level of checking is probably
unnecessary. The size of the file is contained in the inode structure.
We allocate that amount of memory and, if the allocation succeeds, call
the kernel read_exec function to actually read the file into memory.
Although read_exec is intended for reading in executable files, it also
serves perfectly for our purposes.
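The read path just described (take the size from the inode, allocate that much memory, read the file, and go on only if the byte count matches) has a close user-space analogue. The sketch below uses POSIX calls in place of the kernel's lookup_dentry, vmalloc, and read_exec; it illustrates the pattern under those substitutions and is not the kernel code. The helper name read_whole_file is an assumption made for the example.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read an entire regular file into a freshly allocated buffer.
 * On success returns the buffer and stores its size in *size_out;
 * on any failure returns NULL. The caller frees the buffer.
 * (read_whole_file is a hypothetical helper for this sketch.)
 */
void *read_whole_file(const char *path, size_t *size_out)
{
    struct stat st;
    unsigned char *buf = NULL;
    size_t total = 0;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return NULL;

    /* The size comes from the inode, here via fstat. */
    if (fstat(fd, &st) < 0 || !S_ISREG(st.st_mode) || st.st_size <= 0)
        goto fail;

    buf = malloc((size_t)st.st_size);
    if (!buf)
        goto fail;

    /* Keep reading until the whole file is in memory. */
    while (total < (size_t)st.st_size) {
        ssize_t n = read(fd, buf + total, (size_t)st.st_size - total);
        if (n <= 0)             /* error or unexpected end of file */
            goto fail;
        total += (size_t)n;
    }

    close(fd);
    *size_out = total;
    return buf;

fail:
    free(buf);
    close(fd);
    return NULL;
}
```

As in sys_lobos, nothing irreversible happens until the full image is in memory: a short read or a failed allocation simply returns an error to the caller.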
At this point much of the work is done. The final steps
are handled by the function run_boot_file, which is called
with a pointer to the kernel area and a size. This function
is shown in Figure 4.
void run_boot_file(void *kernel, size_t count)
{
    extern void os_restart(int);
    extern char saved_bootparams[4096];
    extern void *reboot_code;
    int result;
    unsigned long *test = kernel;
    void *setup = 0, *kernelstart, *bootsector = 0;
    size_t funcsize = ((unsigned long) end_boot_file) -
                      ((unsigned long) do_boot_file);
    void *v;
    typedef int (*z)(void *v, size_t count, void *setup, void *kernelstart,
                     void *bootsector, int testonly);
    z bf;

    cli();
    kernelstart = __va(0x100000);
    v = &reboot_code;
    /* copy it out */
    memcpy(v, do_boot_file, funcsize);
    os_restart(0);
    /* copy out saved_bootparams ..*/
    printk("copying out saved_bootparams\n");
    memcpy(__va(0x90000), saved_bootparams, 4096);
    /* now call it */
    printk("allocated %d bytes, now call %p\n", funcsize, v);
    bf = v;
    result = (*bf)(kernel, count, setup, kernelstart, bootsector, 0);
    printk("RETURNED FROM do_boot_file: HANGING FOREVER\n");
    while(1);
}

Figure 4: The run_boot_file function.

This function copies the final bootstrap, do_boot_file, to the safe
memory location. It calls os_restart to set up the virtual memory
structures (GDT and page tables on the Pentium), and finally calls the
final bootstrap code to do the actual final step of copying the new
kernel over the current kernel. If anything fails, the current behavior
is to hang forever, although obviously the correct long-term behavior is
to reset the machine.

3.2.4 Setting up the page tables and GDT

This work is done by the os_restart function. This function has to
change the state of the virtual memory hardware and by its very nature
represents the most machine-dependent code in LOBOS (the assembly code
presented above for reserving space and system call table entries could
just as easily be C code, and is in many kernels). The main goal of this
function is to move the GDT and page tables out of the way, and to do it
in a way that allows the VM hardware to function until the new kernel
takes over and loads the hardware with the new kernel's GDT and page
tables. Currently, the GDT is put in the safe area, and the page tables
are put in an area allocated in high memory. We use the allocated memory
for the page tables as they can vary in size for different types of
processor. Pentium-compatible processors that support 4 MByte page table
entries only need one page to address 4 Gbytes of memory; processors
that only support 4 Kbyte page table entries need much more space.

The steps here are as follows: store the current gdt into curgdt, so we
can find out where it is. Get the pointer to the safe gdt, and copy the
first page of the current gdt to it. We only need a very small part of
the GDT, but for now we just grab the whole first page. Next we allocate a
new page table and copy the current page table to it. Note
that for now this code only works for machines with 4 MB
page table entries. Next we switch to the new GDT (the
lgdt instruction); and finally we switch to the new page
tables. At this point the kernel can be safely overwritten
by the final bootstrap. The only assembly code in this
function is for very low-level hardware support.
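Storing and reloading the GDT both go through a 6-byte pseudo-descriptor: a 16-bit limit followed by a 32-bit linear base address. When that operand is viewed as two 32-bit words, as the curgdt array is, the base address ends up split across both words. The following C sketch shows the packing and unpacking arithmetic in portable form. It is not the paper's code; the type gdt_words and the functions gdt_pack, gdt_base, and gdt_limit are assumptions made for the example.

```c
#include <stdint.h>

/*
 * The 6-byte operand of sgdt/lgdt on i386, viewed as two little-endian
 * 32-bit words: the limit sits in the low half of word 0, the low half
 * of the base in the high half of word 0, and the high half of the
 * base in the low half of word 1.
 * (All names here are hypothetical, chosen for this sketch.)
 */
struct gdt_words {
    uint32_t w[2];
};

/* Pack a base address and limit into the two-word layout. */
struct gdt_words gdt_pack(uint32_t base, uint16_t limit)
{
    struct gdt_words g;

    g.w[0] = ((base & 0xffffu) << 16) | limit;
    g.w[1] = base >> 16;
    return g;
}

/* Recover the base address, e.g. from the words stored by sgdt. */
uint32_t gdt_base(struct gdt_words g)
{
    return (g.w[0] >> 16) | ((g.w[1] & 0xffffu) << 16);
}

/* Recover the limit (table size in bytes, minus one). */
uint16_t gdt_limit(struct gdt_words g)
{
    return (uint16_t)(g.w[0] & 0xffffu);
}
```

With a one-page table the limit is 4095, which matches the gdtsize value used in os_restart; gdt_base applied to the stored words gives the address from which the first page of the current GDT can be copied.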
void
do_boot_file(void *v, size_t count, void *kernelstart, int testonly)
{
    int i;
    void (*f)(void) = kernelstart;
    extern char *reboot_gdt, *get_options;
    volatile unsigned char *src = (char *) v;
    volatile unsigned char *dst = (char *) kernelstart;
    unsigned long *l;

    for(i = 0; i < count; i++, src++, dst++)
    {
        /* never overwrite the safe areas */
        if ((dst >= &reboot_gdt) && (dst < &get_options)) {
            continue;
        }
        if (testonly) {
        }
        else {
            *dst = *src;
        }
    }
    if (testonly)
        return;
    f();
}

Figure 6: Final bootstrap, do_boot_file

void os_restart(int notused)
{
    void *newgdt = 0;
    extern char *reboot_gdt;
    pgd_t *newpagedir = 0;
    unsigned long cp;
    void *gdtbase;
    int gdtsize;
    unsigned long l;
    unsigned long curpagetable;
    unsigned long x;
    int i;

    printk("os_restart ...\n");
    curgdt[0] = curgdt[1] = 0;
    __asm__ __volatile__ ("sgdt %0" : "=m" (curgdt));
    newgdt = & reboot_gdt;
    gdtsize = 4095;
    memcpy(newgdt, gdtbase, gdtsize + 1);
    /* build the new page dir that is out of the way ... */
    newpagedir = get_pgd_slow();
    if (! newpagedir) {
        printk("newpagedir allocate failed\n");
        return;
    }
    memcpy(newpagedir, swapper_pg_dir, sizeof(swapper_pg_dir));
    l = (unsigned long) newgdt;
curgdt[1] = l >> 16;
curgdt[0] = ((l & 0xffff) 1)
    name = argv[1];
    printf("name is %p\n", name);
    bootfile(name);
}

Figure 7: The bootfile program

3.4 The LOBOS command

Following the model of fastboot(1), we have created a command called
lobos. Lobos puts a binary, uncompressed kernel image in /tmp, and
creates a file called /lobos. We have modified the reboot script so that
if the /lobos file exists, the bootfile program is invoked with the
uncompressed kernel image as the argument.

To reboot any kernel, the user can type the full kernel path, or simply
the intermediate part of the name, e.g. the command 'lobos linux-2.2.13'
will reboot the kernel /usr/src/linux-2.2.13/vmlinux.

4 Performance and usability

Booting a kernel via LOBOS is much faster and easier than the standard
BIOS-based boot. There is none of the long wait common with BIOS boots.
The unnecessary memory test and zero is a thing of the past, as is the
wait for the many unnecessary tasks that exist only to support DOS 1.0.
We now have a log buffer that survives reboots, and that has proven to
be a major plus. We much prefer this style of booting to the 16-bit
BIOS-based style used on PCs to date.

[...]

into memory, and then calls the bootimg system call. The kernel code is
responsible for parsing the header of the file and unzipping the code.

Bootimg turns virtual memory (i.e. paging) off at some point, but leaves
i386-style segments on. Turning VM off complicates a number of issues.
Since the user buffer is in virtual memory, bootimg must first copy it
into physically contiguous kernel memory that can be addressed with VM
off. Also, in the future kernel components may not all be in physically
contiguous memory; we certainly do not want to count on it. Finally, on
systems such as Alpha, turning off VM is tantamount to turning off all
protection. Given that even the lowest-level NVRAM software on the Alpha
runs with VM enabled, we are worried about any approach that involves
turning VM off.

Bootimg constitutes about 1100 lines of code, of which at least 600 are
architecture-dependent. There are only 40 or so lines of assembly. One
issue is that bootimg defines a number of structures (such as GDTs) that
need to be maintained in sync with the kernel.

Bootimg can be used with the LinuxBIOS.

5.2 Two Kernel Monte

Two Kernel Monte (TKM) takes a very different approach to the problem.
TKM at some point turns off both VM (paging) and i386-style
segmentation. In order to avoid copies and the requirement for a large
area of physically contiguous memory, TKM builds an internal
virtual-to-physical page map so that when VM is off, TKM can still get
to the new kernel image. Also, once the processor is back in real mode,
TKM can call the BIOS to reset hardware that may not work properly after
a reboot. TKM cannot work with the LinuxBIOS, since it depends on the
BIOS for a critical part of the reboot step.
5.3 Summary of the three systems

TKM is probably the most architecture-dependent of the three, and LOBOS
is probably the least architecture-dependent. LOBOS is less than half
the size of the others, and has only a fraction as much assembly code.
Bootimg is the most polished in certain ways: it does the most thorough
permission checking and ramdisk support, for example. TKM will probably
work with just about any kind of video hardware, since it calls the
video BIOS to reset the video card. TKM will probably never work with
the LinuxBIOS.

All three systems deal with the problem of VM in very different ways.
LOBOS keeps paging and segmentation turned on, and relies on the
presence of the "safe areas" to maintain state through the reboot
process. LOBOS needs to relocate the GDT and page tables once. Bootimg
turns VM off, and relies on the presence of physically contiguous memory
in kernel mode to get around the lack of VM. Bootimg relocates the GDT
four times during a boot. TKM turns VM off and relies on its own
virtual-to-physical map to keep track of memory. TKM reloads the GDT
once. We feel most comfortable with keeping VM turned on at all times,
especially as we move to the Alpha, where there is no support for
segmentation.

TKM and Bootimg require an external program to load the image. LOBOS
does not; we supply such a program, but a LOBOS-equipped kernel can,
given a file name, boot that file. When the LinuxBIOS boots from NVRAM,
it can further boot a different kernel (e.g. at the direction of a DHCP
server) without ever having to run a user-mode program.

Only LOBOS allows the kernel log buffer to survive across reboots. We
have found this capability very useful, since we no longer need to wait
for klogd to clean up before rebooting. We are trying to reach a
3-second reboot time, and the fewer processes we have to wait for when
we reboot, the better.

In terms of security, all three systems are no more (and no less) secure
than a standard reboot system call.

There is a final question: which of these systems will make it as the
"standard" in the Linux kernel? While we prefer the LOBOS
implementation, and especially some of the LOBOS design decisions, we
believe that the standard system call for 2.4 will be bootimg. At some
point we will then need to revisit portability issues, since bootimg
depends very heavily (over 30% of the code, as opposed to several tens
of lines in LOBOS) on aspects of i386 Linux that do not exist in other
architectures.

For our own purposes we will probably continue using LOBOS. The two
determinants are the ability to boot a new kernel entirely from the
kernel, and the fact that the log buffer is preserved across reboots.
LOBOS has also demonstrated portability across a wide range of kernels
due to its simplicity.

6 Next Steps

We are working on putting a LOBOS-enabled kernel into the FLASH RAM on
our Intel 440GX motherboards. We are using code from the OpenBIOS
project to bootstrap our kernel into memory. The kernel we boot serves
as a true network bootstrap, in that it comes up and asks a manager node
what it should do, which may include simply booting from the disk. We
report on the new BIOS work in a companion paper.

Our work on LOBOS has been used by other researchers. Werner Almesberger
has developed bootimg, which will probably appear in the 2.4 kernel.
Researchers at Scyld Computing had started a project similar to LOBOS
but had gotten stuck; they were able to use our work to finish their
system, Two Kernel Monte.

7 Conclusions

LOBOS is a system call that allows a running kernel to boot another
kernel. Once a kernel is running it has no need to use the BIOS to boot
other kernels. This new capability allows us to use Linux kernels as a
network bootstrap, as opposed to using a special network bootstrap
program. It is also very easy to boot new kernels: we simply
type in 'lobos ' and the new kernel is up
and running in less than a minute. We don't really need LILO any more.

LOBOS also makes it possible to replace the BIOS with a Linux-based
BIOS. The benefits to our work are clear: the BIOS is the last great
barrier to truly Open Source-based clusters. The BIOS also represents a
major stumbling block to managing large clusters, due to its primitive
structure and limited capabilities, as well as to its 16-bit
unprotected-mode origins. We feel that LOBOS represents a first step to
freeing Linux users from the BIOS and all its constraints.

LOBOS has to date been used as a reference by two other groups to build
working system calls with similar capabilities. One of these system
calls, bootimg, will probably be a part of the standard 2.4 kernel.