Information about http://research.microsoft.com/~dtarditi/dist/ismm-2000b.pdf

Compact Garbage Collection Tables …

Tags: bers, compiler optimizations, counter values, david tarditi, executable code, garbage collection, indexes, location descriptions, map, microsoft, microsoft research, offsets, ping program, pointer, pointers, redmond wa, return addresses, sparse array, stack frames, ues,
Pages: 9
Language: english
Created: Thu Jul 27 17:25:54 2000
Display cached document
Page 1
image
Page 2
image
Page 3
image
Page 4
image
Page 5
image
Page 6
image
Page 7
image
Page 8
image
Page 9
image
                            Compact Garbage Collection Tables

                                                        David Tarditi
                                                    Microsoft Research
                                                Redmond, WA 98052, USA
                                                  dtarditi@microsoft.com



ABSTRACT                                                          location descriptions. Furthermore, the number of indexes
Garbage collection tables for finding pointers on the stack       can be reduced by using compiler optimizations that pack
can be represented in 20-25% of the space previously re-          pointer-containing locations close together in stack frames.
ported. Live pointer information is often the same at many        The design also uses several techniques for compactly map-
call sites because there are few pointers live across most call   ping program counter values to those small indexes. The
sites. This allows live pointer information to be represented     techniques all assign numbers to call sites and use those
compactly by a small index into a table of descriptions of        numbers to index an array of small indexes. One technique
pointer locations. The mapping from program counter val-          is to assign adjacent call sites that have the same live pointer
ues to those small indexes can be represented compactly           information the same number [5]. Another technique is to
using several techniques. The techniques all assign num-          represent an array of return addresses by using a two-level
bers to call sites and use those numbers to index an array        table with 16-bit offsets. A final technique is to use a sparse
of small indexes. One technique is to represent an array of       array of return addresses and interpolate the number by dis-
return addresses by using a two-level table with 16-bit off-      assembling the executable code.
sets. Another technique is to use a sparse array of return           The paper studies the properties of 20 benchmarks com-
addresses and interpolate the exact number via disassembly        piled using a whole-program Java compiler. The bench-
of the executable code.                                           marks include the SPEC JVM98 benchmarks [4]. It also
                                                                  uses those benchmarks to evaluate the design and compare
                                                                  various techniques for mapping program counter values to
1.   INTRODUCTION                                                 small indexes.
   A copying garbage collector has to find all memory lo-            Diwan et. al [5] describe garbage collection tables for opti-
cations on the call stack that point to heap-allocated data.      mized code. Their primary focus is not on building compact
One successful approach to this problem is for a compiler to      tables, although they address the issue of building tables
generate tables that the garbage collector uses to traverse       with reasonable sizes. They describe how to support derived
the call stack and find the locations. A drawback of the ap-      interior pointers that are produced by optimizations such as
proach is that the tables can be large relative to executable     strength reduction and induction variable elimination. The
code size. Diwan et al. report tables that are 16% of VAX         work described here focuses on building compact tables and
code size [5] and Stichnoth et al. report tables that are 20%     does not address support for derived interior pointers. It also
of 80x86 code size [12]. When code size is important, the         studies benchmarks that are one to four orders of magnitude
large tables are a significant cost to pay for garbage collec-    larger than the benchmarks studied by Diwan et. al. Em-
tion support.                                                     pirical observations of these larger benchmarks show effects
   This paper describes a design for compact garbage collec-      that cannot be observed with smaller benchmarks, such as
tion tables that can be accessed and decoded quickly. The         live pointer information being the same at many call sites.
tables are on average 3.6% of code size for programs com-            Stichnoth et al. [12] describe tables that allow garbage
piled with an optimizing whole-program Java compiler for          collection to begin at almost every instruction. For multi-
the 80x86. The design is based on the empirical observa-          threaded programs, it reduces the potential start-up laten-
tion that live pointer information is often the same at many      cies of garbage collection at the expense of larger tables. It
call sites. This is because most call sites have few pointers     is unnecessary for single-threaded programs. In contrast,
live across them. The live pointer information can be repre-      the design described in this paper allows garbage collection
sented compactly using a small index into a table of pointer      to begin only at specified program points. This produces
                                                                  smaller tables, but it may increase the start-up latency of
                                                                  garbage collection. The latency can be bounded by mak-
                                                                  ing the beginnings of extended basic blocks [2] or the tops
                                                                  of loops [5] garbage collection points. An explicit check
                                                                  can be added or the language runtime system can insert
                                                                  a breakpoint to halt thread execution when those points are
                                                                  reached.
                                                                     Stichnoth et al. also compress the binary representation
                                                                  of their tables by using Huffman encoding. The design de-
scribed here relies on explicit sharing instead of a generic        addresses. For each call site, the return address where exe-
compression algorithm.                                              cution will resume and live pointer information is recorded.
   The remainder of the paper is organized as follows. Sec-         Live pointer information includes the set of stack locations
tion 2 describes the basic structure of the tables, how the         and callee-save registers that contain pointers live across the
runtime system interprets the tables, and how the tables are        call site and the set of callee-saved registers that are un-
generated. Section 3 describes how to reduce the size of the        changed on all paths to the call site.
mapping from program counter values to indexes. Section 4              First, the set of descriptors is computed. The list of call
describes the testbed used to evaluate the tables. Section 5        sites is copied and sorted lexicographically using the live
measures the size of the tables, the sharing that occurs, and       pointer information. The sorted list is scanned and every
the effects of improvements.                                        time the live pointer information changes, a new index is
                                                                    assigned and the live pointer information is added to the
2.   USING SMALL INDEXES TO REPRESENT                               list of descriptors. The descriptor index is recorded for the
                                                                    run of call sites with the same information.
     LIVE POINTER INFORMATION                                          Finally, the three tables are generated. The call site table
   This section describes how the garbage collection tables         is formed by creating an array from the return addresses
use small indexes to represent the descriptions of pointer lo-      stored in the list of call sites The descriptor index table
cations and how the tables are interpreted and generated. It        is formed by creating an array from the descriptor indices
also describes compiler optimizations that reduce the num-          stored in the list of call site sets. The descriptor table is
ber of indexes and the format of the descriptions of pointer        created by converting the list of descriptors to an array of
locations for the 80x86.                                            32-bit integers encoding the descriptors.
   Garbage collection is assumed to occur only upon a call
from a thread into the runtime system, that is, at defined          2.4    Compiler optimizations to reduce the num-
program points. For single-threaded programs those points                  ber of indexes
are allocation sites. For multi-threaded programs, those
points are allocation sites or system calls that cause a thread        The number of distinct descriptions of pointer locations
to suspend its execution. The problem of finding pointer lo-        (and thus indexes) can be reduced by improving the spatial
cations becomes the problem of mapping the return address           locality of pointers on the stack. First, the stack frame is
for each call back to a description of the locations of pointers    divided into pointer-containing and non-pointer containing
that are live when the call was made.                               sections. The pointers are placed in the stack slots closest
                                                                    to the beginning of the local variable area. Second, stack
2.1 Table organization                                              slots are colored so that they are reused for variables with
   Figure 1 shows the organization of the tables. The call site     disjoint lifetimes. This is important for large functions that
table and the descriptor index table map program counter            have many live variables.
values to small indexes. More compact representations of
the call site table are discussed in Section 3. The call site ta-   2.5    Describing pointer locations for the 80x86
ble is a sorted array of 32-bit return addresses that maps call        The descriptor formats depend on the register usage, the
sites to their numbers. The descriptor index table maps call        calling convention, and the stack layout. The 80x86 has
site numbers to indexes; it is an array of indexes. The de-         8 general-purpose registers, two of which are used as the
scriptor table contains descriptions of the locations of point-     stack pointer and frame pointer. The remaining registers are
ers.                                                                divided into three callee-save registers and three caller-save
   Descriptors have two forms. The compact form encodes             registers. Arguments are passed in the caller-save registers
the pointer locations into a 32-bit word. The escape form is        and on the stack. The formats currently assume that all
a pointer to a variable-length record that describes pointer        functions use frame pointers.
locations. The least-significant bit of a descriptor is used to        Figure 3 shows the layout of a stack frame. The stack
distinguish between the the two forms. The details of the           grows downward toward lower addresses. The frame pointer
formats are described in Section 2.5.                               points to the saved frame pointer of the caller. Argument
                                                                    values are at positive offsets from the frame pointer and
2.2 Table interpretation                                            local variables are at negative offsets. Callee-save registers
   Figure 2 show pseudo-code for walking the stack and pro-         are saved in the memory locations immediately below the
cessing stack frames. The walkStack function takes the              old value of the frame pointer.
frame pointer for the youngest Java stack frame and the re-            Figure 4 shows the format of a compact descriptor. Each
turn address where execution will resume using that frame.          compact descriptor has five bitfields. The first 3-bit field
It then walks the stack frames until it finds a non-Java stack      describes what callee-save registers were saved on the stack
frame. For each stack frame, it uses a binary search of the         at function entry. The second 6-bit field describes the usage
call site table to find the call site number. It then uses          of the callee-save registers across a call. The third 4-bit field
the call site number to look up the descriptor index in the         describes the highest 32-bit word that contains a live pointer
descriptorIndexTable and uses the descriptor index to look          in the argument section of the stack. The bitfield is zero
up the actual descriptor. The function processFrame (not            if there are no live pointer arguments. The fourth bitfield
shown) decodes the descriptor and forwards live heap point-         is a bitmask that indicates what argument stack locations
ers for the call.                                                   contain live pointers. Its length is variable and determined
                                                                    by the third bitfield. The fifth bitfield is a bitmask that uses
2.3 Table construction                                              the remaining bits and indicates what local variable stack
  Table construction starts with a list of all call sites where     locations contain pointers.
garbage collection may occur. The list is ordered by call site         Escape descriptor records have three variants that de-
               return                          call site                         descriptor
              address         Call site        number           Descriptor         index          Descriptor     descriptor
                               table                           index table                          table




                             Maps program counter to index

                                               Figure 1: Table organization


// , fp = frame pointer, ra = current return address
walkStack(char *fp, char *ra) {
   while (ra is for a Java function) {
      find the call site number n of ra using binary search
      index = descriptorIndexTable[n]
      descriptor = descriptorTable[index]
      processFrame(descriptor,fp);
      a = *(fp+4)
      fp = *(fp)
   }
}


                                       Figure 2: Pseudo-code for walking the stack

                                          Higher addresses


                                                             Arguments to function


                                                             Return address
                                                                                                  Stack growth
                    Frame pointer                            Previous frame
                                                              pointer value

                                                                Callee-save
                                                              register values

                                                              Local variables
                                                                of function

                       Stack pointer
                                           Lower addresses

                                          Figure 3: The layout of a stack frame.


                  31                           13+n            13            9                3       0
                                                                    n


                                                                                                  Descriptor type
                    Bitmask for local                        Highest pointer-
                                                                containing                  Callee-save
                        variables
                                                              argument slot               registers saved
                                              Bitmask for                   Callee-save       on entry
                                              arguments                    register usage
                                                                             across call

                                    Figure 4: The format of a compact descriptor
scribe 8, 16, and 32-bit offsets respectively. The first field      return address and that precedes it. Second, the exact call
describes the variant type and also includes the bitfields for      site set number is determined by interpolation. The code is
the callee-save registers. The second field describes the num-      disassembled from the entry address to the return address
ber of pointers on the stack and the rest of the record con-        and the number of call site sets between the two points is
tains the frame pointer offsets of the pointer locations. All       counted. The call site number of the sparse entry is used to
pointers are aligned, so the offsets are in units of 4 bytes.       index into the call site bitmask to figure out what call sites
                                                                    to count. The number of call site sets is then added to the
                                                                    call site set number of the sparse entry.
3.    COMPACT MAPPINGS FROM PROGRAM                                    The back-mapping table is broken into segments that map
      COUNTER VALUES TO INDEXES                                     sparse call entries to 16-bit offsets for call site and call site
  The call site table described in the previous section was         set numbers. The offsets are added to base call site and call
an array of 4-byte return addresses. This section describes         site set numbers for the segment. On average, each sparse
three techniques for reducing the size of the call site table.      entry needs only 4 bytes of information in the table.
                                                                       The disassembler only needs to recognize call instructions
3.1    Grouping adjacent call sites into sets                       and calculate instruction lengths. The runtime system code
   Diwan et al. [5] observe that adjacent call sites often          to do this for the 80x86 consists of 120 lines of table-driven
have the same live pointer information. These adjacent call         C code and 300 lines of table initialization code, even though
sites, along with call sites that cannot garbage collect, can       the 80x86 has a highly irregular instruction format with vari-
be grouped into sets. The call site table no longer maps            able length instructions. The disassembler would be simpler
individual return addresses to call site numbers. Instead,          for a RISC architecture where all instructions are the same
it maps ranges of program counter values to call site set           length.
numbers.                                                               The cost of disassembling code can be significant on the
                                                                    80x86 because the irregular instruction format means that
3.2    Using a 2-level table with 16-bit offsets                    code must be examined one byte at a time. A 16-element
   The call site table can be implemented as a 2-level table        LRU cache is used to amortize the cost of disassembling
with 16-bit offsets. Each 64K segment of the instruction            code; the cache maps return addresses to their call site set
address space has its own table that keeps the offsets of call      numbers.
sites within the segment. The starting call site number is             An advantage of the sparse call site table over the 2-level
also recorded for each segment.                                     table is that the compiler no longer needs to include an as-
   If call sites are reasonably dense, this technique roughly       sembler because it does not need to calculate precise offsets.
halves the call site table size, replacing 32 bits of information   A disadvantage of the sparse call site table is that a debug-
with 16 bits of information. A drawback of the technique is         ger may confuse the disassembly process when it inserts a
that the compiler must know the offset of each label from the       breakpoint instruction. The breakpoint instruction may re-
base of the code segment to calculate the 64K segment for           place a call instruction or, on a machine like the 80x86 with
the label and the 16-bit offset within that segment. For the        variable length instructions, overwrite part of an instruction.
80x86, which has variable length instructions, the Marmot           To avoid confusing the disassembler, the debugger must in-
compiler uses it own assembler and generates object files           form the disassembler about any changes that it makes to
directly when using a 2-level call site table.                      the executable code.

3.3    Using a sparse table and interpolating via                   4.   EXPERIMENTAL TESTBED
       disassembly                                                     This section describes the experimental testbed. It con-
   Another approach to reducing the call site table size is         sists of an optimizing compiler, a set of benchmarks, and
based on the observation that executable code itself con-           the machine used for timing measurements.
tains some of the information in the table. Specifically, the
number of a call site can be calculated by disassembling the        4.1 The Marmot optimizing compiler for Java
executable code from the start of the code to the current              Marmot [6] is a whole-program, optimizing Java com-
return address, assuming that no data has been intermixed           piler that was developed as a research platform. It pro-
with the code. From this perspective, tables accelerate the         duces 80x86 executables and includes standard scalar op-
disassembly process and allow the disassembly process to            timizations such as constant and copy propagation, com-
skip data that is intermixed with code. The tables also sup-        mon subexpression elimination, dead-assignment elimina-
port grouping call sites into sets as described in Section 3.1.     tion, loop invariant code motion, induction variable elim-
   Three tables can be used to interpolate call site set num-       ination, strength reduction, and inlining of small, statically
bers by disassembling the executable code. The first table          called functions. It also includes basic object-oriented opti-
is a sparse call site table that has a subset of the entries in     mizations such as virtual call rebinding based on class hier-
the original table. For example, the sparse table could have        archy analysis, null check elimination, type test elimination,
1/n of the entries of the original table, where n is a compiler     and removal of uninvoked methods and advanced optimiza-
parameter. The second table is a back-mapping table that            tions such as stack allocation of objects [7], array bounds
maps each sparse entry to its call site number and a call           check elimination, and the elimination of unnecessary syn-
site set number. The third table is a bitmask that indicates        chronization operations [11].
which call sites are the beginnings of call site sets.                 All programs were compiled using a generational garbage
   The calculation of the call site set number for a given          collector with a write barrier. A card-marking write barrier
return address proceeds in two steps. First, the sparse call        was used by default because a sequential store buffer (SSB)
site table is searched to find the entry that is closest to the     write barrier may increase the size of tables (see Section 5.8).
4.2 The benchmarks                                                   In practice, call sites are grouped into call site sets (see
   Table 1 lists the benchmarks. The benchmarks are divided        Section 3.1). Even compared to the number of call site sets,
into three groups. The SPEC JVM98 group includes the               the number of indexes is small. The fifth column of Table 2
programs from the SPEC JVM98 benchmark suite [4]. The              shows the number of call site sets and the sixth column
MiscJava group includes various Java programs. It includes         shows the number of indexes as a percentage of call site
the Marmot compiler, which is the largest program used             sets.
in this study. The IMPACT group includes C programs
translated to Java by the IMPACT/NET project [9]. Several
                                                                   5.3    Most indexes fit into 9 bits
of the original C programs came from the SPEC95 suite [3].           The number of indexes is small enough that for most pro-
   Section 5 focuses on four benchmarks that illustrate a          grams they fit into 8 or 9 bits. Table 2 illustrates this.
range of results: 201 compress, 202 jess, wc, and Marmot.          In fact, of the 20 benchmarks, 13 need 8-bit indexes and
201 compress and 202 jess are typical of the SPEC JVM98            5 need 9-bit indexes. Only two programs need more than
and Misc programs. Wc is typical for the IMPACT pro-               9 bits: 213 javac needs 10-bit indexes and Marmot needs
grams. Marmot has the largest table size relative to code          13-bit indexes.
size of all the benchmarks.
                                                                   5.4    Most call sites have few pointers live across
4.3    Hardware platform                                                  them
  Benchmarks were run on an otherwise unloaded 300 MHz                The number of indexes is small relative to the number of
dual Pentium II (x86 Family 6 Model 5) processor Gateway           call sites because most call sites have few pointers live across
2000 E-5000 running Windows NT 2000. The machine was               them. These call sites need inordinately few indexes.
disconnected from the local area network. High-resolution             The following charts use call site sets to exclude call sites
on-chip cycle counters were used to measure the time spent         where garbage collection cannot occur. Most call site sets
garbage collecting; entry cycle counts were subtracted from        have few pointers live across them. Figure 7 graphs the cu-
exit cycle counts. The benchmarks were run repeatedly until        mulative distribution of call site sets as a function of the
standard deviations were nominal.                                  number of pointers live across those sets. There are 3 or
                                                                   fewer pointers live across 90% of the call sites sets for wc,
                                                                   201 compress, and 202 jess. For Marmot, which is an ex-
5.    EMPIRICAL EVALUATION                                         treme case, there are 7 or fewer pointers live across 90% of
  In this section, the size of the tables, the sharing that oc-    its call site sets. On average, most benchmarks have 4 or
curs and the effects of table improvements on size are evalu-      fewer pointers live across 90% of their call site sets.
ated using the experimental testbed described in Section 4.           The reuse of indexes is very high for call site sets with
                                                                   only a few pointers live across them. Figure 8 shows the
5.1    Size of tables                                              ratio of indexes to call site sets. A low ratio indicates high
   Figure 5 shows the size of garbage collection tables as a       reuse. As program sizes increase, the reuse improves.
fraction of code size using the best configuration (call sites
grouped into sets, a sparse call site table, and a divided stack   5.5    Call site sets reduce program counter map-
with stack slots reused for variables with disjoint lifetimes)            pings by more than half
Code size is the size of generated Java code; code for the            Call site sets substantially reduce the size of the mapping
runtime system is excluded. The average table size for all         from program counter values to descriptor indexes. Table 2
20 benchmarks is 3.6% of code size.                                also shows that grouping call sites into sets typically halves
   Each bar shows the contribution of various tables. The          the amount of data that must be recorded, reducing both
call site table is the sparse version of the call site table. It   the size of the call site table and the descriptor index table.
includes only every tenth entry in the original table. The
auxiliary tables include the back-mapping table and the call       5.6    Sparse call site tables use the least amount
site bitmask. Only two programs need escape descriptors:                  of space and negligibly increase execution
Marmot and impgo. All call sites in the remaining programs                times
can be described with compact descriptors.                            Figure 6 compares three different call site table imple-
   There is considerable variation in the size of the descriptor   mentations: the original implementation that uses 4-byte
index tables. In part, this is because only 8-bit and 16-bit       entries, the two-level table implementation, and the sparse
indexes are supported currently. This causes table sizes to        call site table. As expected, the two-level implementation is
abruptly double when the number of indexes exceeds 256.            roughly half the size of the the original implementation and
With more implementation work, indexes whose sizes are             the sparse call site tables are even better than the two-level
not byte multiples could be used.                                  implementation. In all cases, call sites are grouped into sets.
                                                                      Compressing the call site table is important for producing
5.2 The number of indexes is small                                 compact tables. A comparison with Figure 5 shows that the
   The number of indexes relative to the number of call sites      original call site table is usually larger than the entire size
is small. Table 2 shows some basic statistics for 201 compress,    of the garbage collection tables in the best configuration.
202 jess, Marmot, and wc. The first column shows the dif-             Using a sparse call site table instead of the original call
ferent sizes of the programs; it gives the executable code         site table negligibly increases overall benchmark execution
size for the Java code. The second column shows the num-           times. It also has a small effect on garbage collection time
ber of call sites. The third column shows the number of            unless garbage collection time is very small. Table 3 shows
indexes. The fourth column shows the number of indexes as          the percentage of running time spent garbage collecting for
a percentage of call sites.                                        selected benchmarks when using the original call site table.
                                  Name                      Lines   Description
                                                          of code
                                                                              The SPECJVM98 group
                                  201   compress           927      Compression program compressing and decompressing files
                                  202   jess              11K       Java Expert Shell System solving set of puzzles
                                  209   db                1028      An in-memory database program performing a sequence of operations
                                  213   javac         unknown       Java bytecode compiler
                                  222   mpegaudio     unknown       MPEG audio program decompressing audio files in MPEG Layer-3 format
                                  227   mtrt              3753      Multi-threaded ray tracer
                                  228   jack          unknown       Java parser generator generating its own parser
                                                                                The Misc Java group
                                  marmot                   127K     Marmot compiling 213 javac
                                  cn2                        578    CN2 induction algorithm
                                  javacup                   8760    JavaCup generating a Java parser
                                  jlex100                   14K     JLex generating a lexer for sample.lex, run 100 times.
                                  plasma                     648    A constrained plasma field simulation/visualization
                                  slice                      989    Viewer for 2D slices of 3D radiology data
                                  SVD                       1359    Singular-Value Decomposition (100x600 matrices)
                                                                                 The IMPACT group
                                  impdes                     561    The IMPACT benchmark DES encoding a large file
                                  impgo                     32K     The IMPACT transcription of the SPEC95 go.099 benchmark
                                  impgrep                    551    The IMPACT transcription of the UNIX grep utility on a large file
                                  impijpeg                  8446    The IMPACT transcription of the SPEC95 ijpeg.132 benchmark
                                  docli                     8864    The IMPACT transcription of the SPEC95 li.130 benchmark, modified
                                                                    to use the system garbage collector.
                                  impwc                      148    The IMPACT transcription of the UNIX wc utility on a large file

                                                                    Table 1: The Java benchmark programs



                        0.07                                                                                         0.07
                                                                        Escape descriptors
                                                                                                                                                               Original
                        0.06                                            Descriptor table                             0.06                                      Two-level
                                                                        Descriptor index                                                                       Sparse
                                                                        table
Fraction of code size




                                                                                             Fraction of code size




                        0.05                                            Auxiliary tables                             0.05

                                                                        Call site table
                        0.04                                                                                         0.04


                        0.03                                                                                         0.03


                        0.02                                                                                         0.02


                        0.01                                                                                         0.01


                          0                                                                                            0
                               201_compress    202_jess        Marmot              wc                                        201_compress    202_jess     Marmot           wc


      Figure 5: GC table sizes as fractions of code size                                                                    Figure 6: Comparison of call site tables



                                              Program           Code size           Call     Indexes                           Index/CS      Call site   Indexes/CSS
                                                                                    sites                                                         sets
                                                                  (bytes)                                                             (%)                          (%)
                                              201 compress         64,240         1,845                                115             6.2        554              20.8
                                              202 jess            199,056         5,520                                351             6.4      2,873              12.1
                                              Marmot            2,049,920        61,609                              4,972             8.1     34,477              14.5
                                              wc                   37,008           941                                105            11.1        404              26.3

                                          Table 2: The number of indexes is much smaller than the number of call sites
                                         1                                                                                                                        1

                                        0.9                                                                                                                                     Marmot
                                                                                                                                                                 0.9
                                                                                                                                                                                202_jess
Cumulative fraction of call site sets




                                                                                                                       Indexes as a fraction of call site sets
                                        0.8                                                                                                                      0.8            average

                                        0.7                                                                                                                      0.7            201_compress
                                                                                                                                                                                wc
                                        0.6                                                                                                                      0.6

                                        0.5                                                                                                                      0.5
                                                                                                 Marmot
                                        0.4                                                                                                                      0.4
                                                                                                 202_jess
                                        0.3                                                                                                                      0.3
                                                                                                 201_compress
                                        0.2                                                                                                                      0.2
                                                                                                 wc
                                        0.1                                                                                                                      0.1

                                         0                                                                                                                        0
                                              0   1   2   3   4   5   6   7   8   9   10 11 12 13 14 15 16 17 18 19+                                                    0       1          2         3          4    5   6
                                                                      Number of live pointers                                                                                              Number of live pointers


Figure 7: Most call site sets have few pointers live                                                                                                             Figure 8: The ratio of indexes to call site sets
across them


It also shows the percentage decrease in the speed of the                                                                                                              using previous techniques. The tables share descriptions of
garbage collector due to the use of sparse call site tables.                                                                                                           pointer locations at many garbage collection points. The
   The decrease in GC speed for docli is large because stack                                                                                                           sharing is possible because there are relatively few pointers
walking represents a large fraction of the (small) GC time.                                                                                                            live across most garbage collection points. The size of the
Docli allocates a large amount of data that dies quickly.                                                                                                              descriptions of pointer locations is typically less than 1% of
There are many collections of the younger generation, each                                                                                                             executable code size.
of which includes a stack walk, but very little data survives                                                                                                             The mapping from program counter values to indexes is
each collection, so GC time is small.                                                                                                                                  represented compactly using several techniques. First, adja-
                                                                                                                                                                       cent call sites that have the same live pointer information or
5.7                                           Reusing stack slots is important for larger                                                                              cannot garbage collect are grouped into sets. This roughly
                                              programs                                                                                                                 halves the number of garbage collection points. Second, the
   Reusing stack slots for variables with disjoint lifetimes is                                                                                                        table of return addresses is represented as a two-level table
important for reducing the size of the tables for larger pro-                                                                                                          with 16-bit offsets or sparsely. The two-level table works
grams. Figure 9 shows the effect of disabling the reuse of                                                                                                             well because there are usually many return addresses within
stack slots for variables with disjoint lifetimes. The size of                                                                                                         a 64K segment of code. The sparse table uses the fact that
tables for Marmot increases by 21%. Other benchmarks af-                                                                                                               the executable code contains most of the information in the
fected by this include 213 javac (9%), java cup (16%), and                                                                                                             table. By disassembling the executable code, the exact call
impgo (35%). The primary cause of the increase is an in-                                                                                                               site number can be interpolated.
crease in the size of escape descriptors. A secondary cause
is an increase in the number of descriptors, which increases
the size of the descriptor table.
                                                                                                                                                                       7.   REFERENCES
                                                                                                                                                                        [1] Andrew W. Appel. Simple generational garbage
5.8                                           Effect of using an SSB write barrier                                                                                          collection and fast allocation. Software: Practice and
   A generational collector uses a write barrier to track point-                                                                                                            Experience, 19(2):171­183, February 1989.
ers from older generations to younger generations. This al-                                                                                                             [2] Andrew W. Appel. Compiling with Continuations.
lows younger generations to be collected independently from                                                                                                                 Cambridge University Press, 1992.
older generations. A check is placed at each store of a pointer                                                                                                         [3] Standard Performance Evaluation Corporation. SPEC
into a heap object. A sequential store buffer (SSB) write                                                                                                                   CPU 95 benchmarks. Online version at
barrier appends the locations of cross-generational pointers                                                                                                                http://www.spec.org/osg/cpu95, 1995.
to a buffer [1, 10, 8]. A garbage collection may be triggered                                                                                                           [4] Standard Performance Evaluation Corporation. SPEC
when the buffer overflows.                                                                                                                                                  JVM98 benchmarks. Online version at
   Thus, using an SSB write barrier introduces additional                                                                                                                   http://www.spec.org/osg/jvm98, 1998.
garbage collection points at pointer stores into heap objects.                                                                                                          [5] Amer Diwan, J. Eliot B. Moss, and Richard L.
This increases the size of the tables. Figure 10 shows the                                                                                                                  Hudson. Compiler support for garbage collection in a
effect of this increase. It is primarily due to an increase in                                                                                                              statically typed language. In Proceedings of the
the number of call site sets. This increases the size of the                                                                                                                SIGPLAN '92 Conference on Programming Language
tables that map program counter values to indexes.                                                                                                                          Design and Implementation, pages 273­282, San
                                                                                                                                                                            Francisco, California, June 1992. SIGPLAN, ACM
6.                                            SUMMARY                                                                                                                       Press.
   This paper describes how to produce compact garbage col-                                                                                                             [6] Robert Fitzgerald, Todd B. Knoblock, Erik Ruf,
lection tables that are 20-25% of the size of tables produced                                                                                                               Bjarne Steensgaard, and David Tarditi. Marmot: An
                                                             Program           GC time with                               Decrease in GC speed
                                                                                original table                                with sparse table
                                                                         (% of running time)                                               (%)
                                                             213 javac                      17                                              0.2
                                                             227 mtrt                        3                                              1.3
                                                             228 jack                        3                                              2.0
                                                             cn2                            20                                              0.2
                                                             javacup                        13                                              0.2
                                                             docli                         0.5                                             18.9

                                          Table 3: Decrease in garbage collector speed from using a sparse call site set table




                             25%                                                                                  20%

                                                                                                                  18%

                             20%                                                                                  16%
Increase in size of tables




                                                                                     Increase in size of tables




                                                                                                                  14%

                             15%                                                                                  12%

                                                                                                                  10%

                             10%                                                                                  8%

                                                                                                                  6%

                             5%                                                                                   4%

                                                                                                                  2%

                             0%                                                                                   0%
                                   201_compress   202_jess    Marmot         wc                                         201_compress   202_jess   Marmot   wc


Figure 9: Effect of disabling the reuse of stack slots Figure 10: Effect of using an SSB write barrier on
for variables with disjoint lifetimes on the size of the size of tables
tables
       optimizing compiler for Java. Software: Practice and
       Experience, 30(3):199­232, March 2000.
 [7]   David Gay and Bjarne Steensgaard. Fast escape
       analysis and stack allocation for object-based
       programs. In 9th International Conference on
       Compiler Construction (CC'2000), Lecture Notes in
       Computer Science. Springer-Verlag, March 2000. to
       appear.
 [8]   Antony L. Hosking, J. Eliot B. Moss, and Darko
       Stefanovi`. A comparative performance evaluation of
                 c
       write barrier implementations. In Proceedings of the
       ACM '92 Conference on Object-Oriented Programming
       Systems, Languages, and Applications (OOPSLA),
       pages 92­109, Vancouver, British Columbia, October
       1992. ACM Press.
 [9]   C.-H. A. Hsieh, J. C. Gyllenhaal, and W. W. Hwu.
       Java bytecode to native code translation: the Caffeine
       prototype and preliminary results. In IEEE, editor,
       Proceedings of the 29th annual IEEE/ACM
       International Symposium on Microarchitecture,
       December 2­4, 1996, Paris, France, 1109 Spring
       Street, Suite 300, Silver Spring, MD 20910, USA,
       1996. IEEE Computer Society Press.
[10]   Richard L. Hudson, J. Eliot B. Moss, Amer Diwan,
       and Christopher F. Weight. A language-independent
       garbage collector toolkit. COINS Technical Report
       91-47, University of Massachusetts, Amherst, June
       1991.
[11]   Erik Ruf. Removing synchronization operations from
       Java. In Proceedings of the ACM SIGPLAN '00
       Conference on Programming Language Design and
       Implementation, Vancouver, Canada, June 2000. ACM
       Press. to appear.
[12]   James M. Stichnoth, Guei-Yuan Lueh, and Michal
       Cierniak. Support for garbage collection at every
       instruction in a java compiler. In Proceedings of the
       SIGPLAN '99 Conference on Programming Language
       Design and Implementation, pages 118­127, May 1999.