Tags: bers, compiler optimizations, counter values, david tarditi, executable code, garbage collection, indexes, location descriptions, map, microsoft, microsoft research, offsets, ping program, pointer, pointers, redmond wa, return addresses, sparse array, stack frames, ues,
Compact Garbage Collection Tables
David Tarditi
Microsoft Research
Redmond, WA 98052, USA
dtarditi@microsoft.com
ABSTRACT location descriptions. Furthermore, the number of indexes
Garbage collection tables for finding pointers on the stack can be reduced by using compiler optimizations that pack
can be represented in 20-25% of the space previously re- pointer-containing locations close together in stack frames.
ported. Live pointer information is often the same at many The design also uses several techniques for compactly map-
call sites because there are few pointers live across most call ping program counter values to those small indexes. The
sites. This allows live pointer information to be represented techniques all assign numbers to call sites and use those
compactly by a small index into a table of descriptions of numbers to index an array of small indexes. One technique
pointer locations. The mapping from program counter val- is to assign adjacent call sites that have the same live pointer
ues to those small indexes can be represented compactly information the same number [5]. Another technique is to
using several techniques. The techniques all assign num- represent an array of return addresses by using a two-level
bers to call sites and use those numbers to index an array table with 16-bit offsets. A final technique is to use a sparse
of small indexes. One technique is to represent an array of array of return addresses and interpolate the number by dis-
return addresses by using a two-level table with 16-bit off- assembling the executable code.
sets. Another technique is to use a sparse array of return The paper studies the properties of 20 benchmarks com-
addresses and interpolate the exact number via disassembly piled using a whole-program Java compiler. The bench-
of the executable code. marks include the SPEC JVM98 benchmarks [4]. It also
uses those benchmarks to evaluate the design and compare
various techniques for mapping program counter values to
1. INTRODUCTION small indexes.
A copying garbage collector has to find all memory lo- Diwan et. al [5] describe garbage collection tables for opti-
cations on the call stack that point to heap-allocated data. mized code. Their primary focus is not on building compact
One successful approach to this problem is for a compiler to tables, although they address the issue of building tables
generate tables that the garbage collector uses to traverse with reasonable sizes. They describe how to support derived
the call stack and find the locations. A drawback of the ap- interior pointers that are produced by optimizations such as
proach is that the tables can be large relative to executable strength reduction and induction variable elimination. The
code size. Diwan et al. report tables that are 16% of VAX work described here focuses on building compact tables and
code size [5] and Stichnoth et al. report tables that are 20% does not address support for derived interior pointers. It also
of 80x86 code size [12]. When code size is important, the studies benchmarks that are one to four orders of magnitude
large tables are a significant cost to pay for garbage collec- larger than the benchmarks studied by Diwan et. al. Em-
tion support. pirical observations of these larger benchmarks show effects
This paper describes a design for compact garbage collec- that cannot be observed with smaller benchmarks, such as
tion tables that can be accessed and decoded quickly. The live pointer information being the same at many call sites.
tables are on average 3.6% of code size for programs com- Stichnoth et al. [12] describe tables that allow garbage
piled with an optimizing whole-program Java compiler for collection to begin at almost every instruction. For multi-
the 80x86. The design is based on the empirical observa- threaded programs, it reduces the potential start-up laten-
tion that live pointer information is often the same at many cies of garbage collection at the expense of larger tables. It
call sites. This is because most call sites have few pointers is unnecessary for single-threaded programs. In contrast,
live across them. The live pointer information can be repre- the design described in this paper allows garbage collection
sented compactly using a small index into a table of pointer to begin only at specified program points. This produces
smaller tables, but it may increase the start-up latency of
garbage collection. The latency can be bounded by mak-
ing the beginnings of extended basic blocks [2] or the tops
of loops [5] garbage collection points. An explicit check
can be added or the language runtime system can insert
a breakpoint to halt thread execution when those points are
reached.
Stichnoth et al. also compress the binary representation
of their tables by using Huffman encoding. The design de-
scribed here relies on explicit sharing instead of a generic addresses. For each call site, the return address where exe-
compression algorithm. cution will resume and live pointer information is recorded.
The remainder of the paper is organized as follows. Sec- Live pointer information includes the set of stack locations
tion 2 describes the basic structure of the tables, how the and callee-save registers that contain pointers live across the
runtime system interprets the tables, and how the tables are call site and the set of callee-saved registers that are un-
generated. Section 3 describes how to reduce the size of the changed on all paths to the call site.
mapping from program counter values to indexes. Section 4 First, the set of descriptors is computed. The list of call
describes the testbed used to evaluate the tables. Section 5 sites is copied and sorted lexicographically using the live
measures the size of the tables, the sharing that occurs, and pointer information. The sorted list is scanned and every
the effects of improvements. time the live pointer information changes, a new index is
assigned and the live pointer information is added to the
2. USING SMALL INDEXES TO REPRESENT list of descriptors. The descriptor index is recorded for the
run of call sites with the same information.
LIVE POINTER INFORMATION Finally, the three tables are generated. The call site table
This section describes how the garbage collection tables is formed by creating an array from the return addresses
use small indexes to represent the descriptions of pointer lo- stored in the list of call sites The descriptor index table
cations and how the tables are interpreted and generated. It is formed by creating an array from the descriptor indices
also describes compiler optimizations that reduce the num- stored in the list of call site sets. The descriptor table is
ber of indexes and the format of the descriptions of pointer created by converting the list of descriptors to an array of
locations for the 80x86. 32-bit integers encoding the descriptors.
Garbage collection is assumed to occur only upon a call
from a thread into the runtime system, that is, at defined 2.4 Compiler optimizations to reduce the num-
program points. For single-threaded programs those points ber of indexes
are allocation sites. For multi-threaded programs, those
points are allocation sites or system calls that cause a thread The number of distinct descriptions of pointer locations
to suspend its execution. The problem of finding pointer lo- (and thus indexes) can be reduced by improving the spatial
cations becomes the problem of mapping the return address locality of pointers on the stack. First, the stack frame is
for each call back to a description of the locations of pointers divided into pointer-containing and non-pointer containing
that are live when the call was made. sections. The pointers are placed in the stack slots closest
to the beginning of the local variable area. Second, stack
2.1 Table organization slots are colored so that they are reused for variables with
Figure 1 shows the organization of the tables. The call site disjoint lifetimes. This is important for large functions that
table and the descriptor index table map program counter have many live variables.
values to small indexes. More compact representations of
the call site table are discussed in Section 3. The call site ta- 2.5 Describing pointer locations for the 80x86
ble is a sorted array of 32-bit return addresses that maps call The descriptor formats depend on the register usage, the
sites to their numbers. The descriptor index table maps call calling convention, and the stack layout. The 80x86 has
site numbers to indexes; it is an array of indexes. The de- 8 general-purpose registers, two of which are used as the
scriptor table contains descriptions of the locations of point- stack pointer and frame pointer. The remaining registers are
ers. divided into three callee-save registers and three caller-save
Descriptors have two forms. The compact form encodes registers. Arguments are passed in the caller-save registers
the pointer locations into a 32-bit word. The escape form is and on the stack. The formats currently assume that all
a pointer to a variable-length record that describes pointer functions use frame pointers.
locations. The least-significant bit of a descriptor is used to Figure 3 shows the layout of a stack frame. The stack
distinguish between the the two forms. The details of the grows downward toward lower addresses. The frame pointer
formats are described in Section 2.5. points to the saved frame pointer of the caller. Argument
values are at positive offsets from the frame pointer and
2.2 Table interpretation local variables are at negative offsets. Callee-save registers
Figure 2 show pseudo-code for walking the stack and pro- are saved in the memory locations immediately below the
cessing stack frames. The walkStack function takes the old value of the frame pointer.
frame pointer for the youngest Java stack frame and the re- Figure 4 shows the format of a compact descriptor. Each
turn address where execution will resume using that frame. compact descriptor has five bitfields. The first 3-bit field
It then walks the stack frames until it finds a non-Java stack describes what callee-save registers were saved on the stack
frame. For each stack frame, it uses a binary search of the at function entry. The second 6-bit field describes the usage
call site table to find the call site number. It then uses of the callee-save registers across a call. The third 4-bit field
the call site number to look up the descriptor index in the describes the highest 32-bit word that contains a live pointer
descriptorIndexTable and uses the descriptor index to look in the argument section of the stack. The bitfield is zero
up the actual descriptor. The function processFrame (not if there are no live pointer arguments. The fourth bitfield
shown) decodes the descriptor and forwards live heap point- is a bitmask that indicates what argument stack locations
ers for the call. contain live pointers. Its length is variable and determined
by the third bitfield. The fifth bitfield is a bitmask that uses
2.3 Table construction the remaining bits and indicates what local variable stack
Table construction starts with a list of all call sites where locations contain pointers.
garbage collection may occur. The list is ordered by call site Escape descriptor records have three variants that de-
return call site descriptor
address Call site number Descriptor index Descriptor descriptor
table index table table
Maps program counter to index
Figure 1: Table organization
// , fp = frame pointer, ra = current return address
walkStack(char *fp, char *ra) {
while (ra is for a Java function) {
find the call site number n of ra using binary search
index = descriptorIndexTable[n]
descriptor = descriptorTable[index]
processFrame(descriptor,fp);
a = *(fp+4)
fp = *(fp)
}
}
Figure 2: Pseudo-code for walking the stack
Higher addresses
Arguments to function
Return address
Stack growth
Frame pointer Previous frame
pointer value
Callee-save
register values
Local variables
of function
Stack pointer
Lower addresses
Figure 3: The layout of a stack frame.
31 13+n 13 9 3 0
n
Descriptor type
Bitmask for local Highest pointer-
containing Callee-save
variables
argument slot registers saved
Bitmask for Callee-save on entry
arguments register usage
across call
Figure 4: The format of a compact descriptor
scribe 8, 16, and 32-bit offsets respectively. The first field return address and that precedes it. Second, the exact call
describes the variant type and also includes the bitfields for site set number is determined by interpolation. The code is
the callee-save registers. The second field describes the num- disassembled from the entry address to the return address
ber of pointers on the stack and the rest of the record con- and the number of call site sets between the two points is
tains the frame pointer offsets of the pointer locations. All counted. The call site number of the sparse entry is used to
pointers are aligned, so the offsets are in units of 4 bytes. index into the call site bitmask to figure out what call sites
to count. The number of call site sets is then added to the
call site set number of the sparse entry.
3. COMPACT MAPPINGS FROM PROGRAM The back-mapping table is broken into segments that map
COUNTER VALUES TO INDEXES sparse call entries to 16-bit offsets for call site and call site
The call site table described in the previous section was set numbers. The offsets are added to base call site and call
an array of 4-byte return addresses. This section describes site set numbers for the segment. On average, each sparse
three techniques for reducing the size of the call site table. entry needs only 4 bytes of information in the table.
The disassembler only needs to recognize call instructions
3.1 Grouping adjacent call sites into sets and calculate instruction lengths. The runtime system code
Diwan et al. [5] observe that adjacent call sites often to do this for the 80x86 consists of 120 lines of table-driven
have the same live pointer information. These adjacent call C code and 300 lines of table initialization code, even though
sites, along with call sites that cannot garbage collect, can the 80x86 has a highly irregular instruction format with vari-
be grouped into sets. The call site table no longer maps able length instructions. The disassembler would be simpler
individual return addresses to call site numbers. Instead, for a RISC architecture where all instructions are the same
it maps ranges of program counter values to call site set length.
numbers. The cost of disassembling code can be significant on the
80x86 because the irregular instruction format means that
3.2 Using a 2-level table with 16-bit offsets code must be examined one byte at a time. A 16-element
The call site table can be implemented as a 2-level table LRU cache is used to amortize the cost of disassembling
with 16-bit offsets. Each 64K segment of the instruction code; the cache maps return addresses to their call site set
address space has its own table that keeps the offsets of call numbers.
sites within the segment. The starting call site number is An advantage of the sparse call site table over the 2-level
also recorded for each segment. table is that the compiler no longer needs to include an as-
If call sites are reasonably dense, this technique roughly sembler because it does not need to calculate precise offsets.
halves the call site table size, replacing 32 bits of information A disadvantage of the sparse call site table is that a debug-
with 16 bits of information. A drawback of the technique is ger may confuse the disassembly process when it inserts a
that the compiler must know the offset of each label from the breakpoint instruction. The breakpoint instruction may re-
base of the code segment to calculate the 64K segment for place a call instruction or, on a machine like the 80x86 with
the label and the 16-bit offset within that segment. For the variable length instructions, overwrite part of an instruction.
80x86, which has variable length instructions, the Marmot To avoid confusing the disassembler, the debugger must in-
compiler uses it own assembler and generates object files form the disassembler about any changes that it makes to
directly when using a 2-level call site table. the executable code.
3.3 Using a sparse table and interpolating via 4. EXPERIMENTAL TESTBED
disassembly This section describes the experimental testbed. It con-
Another approach to reducing the call site table size is sists of an optimizing compiler, a set of benchmarks, and
based on the observation that executable code itself con- the machine used for timing measurements.
tains some of the information in the table. Specifically, the
number of a call site can be calculated by disassembling the 4.1 The Marmot optimizing compiler for Java
executable code from the start of the code to the current Marmot [6] is a whole-program, optimizing Java com-
return address, assuming that no data has been intermixed piler that was developed as a research platform. It pro-
with the code. From this perspective, tables accelerate the duces 80x86 executables and includes standard scalar op-
disassembly process and allow the disassembly process to timizations such as constant and copy propagation, com-
skip data that is intermixed with code. The tables also sup- mon subexpression elimination, dead-assignment elimina-
port grouping call sites into sets as described in Section 3.1. tion, loop invariant code motion, induction variable elim-
Three tables can be used to interpolate call site set num- ination, strength reduction, and inlining of small, statically
bers by disassembling the executable code. The first table called functions. It also includes basic object-oriented opti-
is a sparse call site table that has a subset of the entries in mizations such as virtual call rebinding based on class hier-
the original table. For example, the sparse table could have archy analysis, null check elimination, type test elimination,
1/n of the entries of the original table, where n is a compiler and removal of uninvoked methods and advanced optimiza-
parameter. The second table is a back-mapping table that tions such as stack allocation of objects [7], array bounds
maps each sparse entry to its call site number and a call check elimination, and the elimination of unnecessary syn-
site set number. The third table is a bitmask that indicates chronization operations [11].
which call sites are the beginnings of call site sets. All programs were compiled using a generational garbage
The calculation of the call site set number for a given collector with a write barrier. A card-marking write barrier
return address proceeds in two steps. First, the sparse call was used by default because a sequential store buffer (SSB)
site table is searched to find the entry that is closest to the write barrier may increase the size of tables (see Section 5.8).
4.2 The benchmarks In practice, call sites are grouped into call site sets (see
Table 1 lists the benchmarks. The benchmarks are divided Section 3.1). Even compared to the number of call site sets,
into three groups. The SPEC JVM98 group includes the the number of indexes is small. The fifth column of Table 2
programs from the SPEC JVM98 benchmark suite [4]. The shows the number of call site sets and the sixth column
MiscJava group includes various Java programs. It includes shows the number of indexes as a percentage of call site
the Marmot compiler, which is the largest program used sets.
in this study. The IMPACT group includes C programs
translated to Java by the IMPACT/NET project [9]. Several
5.3 Most indexes fit into 9 bits
of the original C programs came from the SPEC95 suite [3]. The number of indexes is small enough that for most pro-
Section 5 focuses on four benchmarks that illustrate a grams they fit into 8 or 9 bits. Table 2 illustrates this.
range of results: 201 compress, 202 jess, wc, and Marmot. In fact, of the 20 benchmarks, 13 need 8-bit indexes and
201 compress and 202 jess are typical of the SPEC JVM98 5 need 9-bit indexes. Only two programs need more than
and Misc programs. Wc is typical for the IMPACT pro- 9 bits: 213 javac needs 10-bit indexes and Marmot needs
grams. Marmot has the largest table size relative to code 13-bit indexes.
size of all the benchmarks.
5.4 Most call sites have few pointers live across
4.3 Hardware platform them
Benchmarks were run on an otherwise unloaded 300 MHz The number of indexes is small relative to the number of
dual Pentium II (x86 Family 6 Model 5) processor Gateway call sites because most call sites have few pointers live across
2000 E-5000 running Windows NT 2000. The machine was them. These call sites need inordinately few indexes.
disconnected from the local area network. High-resolution The following charts use call site sets to exclude call sites
on-chip cycle counters were used to measure the time spent where garbage collection cannot occur. Most call site sets
garbage collecting; entry cycle counts were subtracted from have few pointers live across them. Figure 7 graphs the cu-
exit cycle counts. The benchmarks were run repeatedly until mulative distribution of call site sets as a function of the
standard deviations were nominal. number of pointers live across those sets. There are 3 or
fewer pointers live across 90% of the call sites sets for wc,
201 compress, and 202 jess. For Marmot, which is an ex-
5. EMPIRICAL EVALUATION treme case, there are 7 or fewer pointers live across 90% of
In this section, the size of the tables, the sharing that oc- its call site sets. On average, most benchmarks have 4 or
curs and the effects of table improvements on size are evalu- fewer pointers live across 90% of their call site sets.
ated using the experimental testbed described in Section 4. The reuse of indexes is very high for call site sets with
only a few pointers live across them. Figure 8 shows the
5.1 Size of tables ratio of indexes to call site sets. A low ratio indicates high
Figure 5 shows the size of garbage collection tables as a reuse. As program sizes increase, the reuse improves.
fraction of code size using the best configuration (call sites
grouped into sets, a sparse call site table, and a divided stack 5.5 Call site sets reduce program counter map-
with stack slots reused for variables with disjoint lifetimes) pings by more than half
Code size is the size of generated Java code; code for the Call site sets substantially reduce the size of the mapping
runtime system is excluded. The average table size for all from program counter values to descriptor indexes. Table 2
20 benchmarks is 3.6% of code size. also shows that grouping call sites into sets typically halves
Each bar shows the contribution of various tables. The the amount of data that must be recorded, reducing both
call site table is the sparse version of the call site table. It the size of the call site table and the descriptor index table.
includes only every tenth entry in the original table. The
auxiliary tables include the back-mapping table and the call 5.6 Sparse call site tables use the least amount
site bitmask. Only two programs need escape descriptors: of space and negligibly increase execution
Marmot and impgo. All call sites in the remaining programs times
can be described with compact descriptors. Figure 6 compares three different call site table imple-
There is considerable variation in the size of the descriptor mentations: the original implementation that uses 4-byte
index tables. In part, this is because only 8-bit and 16-bit entries, the two-level table implementation, and the sparse
indexes are supported currently. This causes table sizes to call site table. As expected, the two-level implementation is
abruptly double when the number of indexes exceeds 256. roughly half the size of the the original implementation and
With more implementation work, indexes whose sizes are the sparse call site tables are even better than the two-level
not byte multiples could be used. implementation. In all cases, call sites are grouped into sets.
Compressing the call site table is important for producing
5.2 The number of indexes is small compact tables. A comparison with Figure 5 shows that the
The number of indexes relative to the number of call sites original call site table is usually larger than the entire size
is small. Table 2 shows some basic statistics for 201 compress, of the garbage collection tables in the best configuration.
202 jess, Marmot, and wc. The first column shows the dif- Using a sparse call site table instead of the original call
ferent sizes of the programs; it gives the executable code site table negligibly increases overall benchmark execution
size for the Java code. The second column shows the num- times. It also has a small effect on garbage collection time
ber of call sites. The third column shows the number of unless garbage collection time is very small. Table 3 shows
indexes. The fourth column shows the number of indexes as the percentage of running time spent garbage collecting for
a percentage of call sites. selected benchmarks when using the original call site table.
Name Lines Description
of code
The SPECJVM98 group
201 compress 927 Compression program compressing and decompressing files
202 jess 11K Java Expert Shell System solving set of puzzles
209 db 1028 An in-memory database program performing a sequence of operations
213 javac unknown Java bytecode compiler
222 mpegaudio unknown MPEG audio program decompressing audio files in MPEG Layer-3 format
227 mtrt 3753 Multi-threaded ray tracer
228 jack unknown Java parser generator generating its own parser
The Misc Java group
marmot 127K Marmot compiling 213 javac
cn2 578 CN2 induction algorithm
javacup 8760 JavaCup generating a Java parser
jlex100 14K JLex generating a lexer for sample.lex, run 100 times.
plasma 648 A constrained plasma field simulation/visualization
slice 989 Viewer for 2D slices of 3D radiology data
SVD 1359 Singular-Value Decomposition (100x600 matrices)
The IMPACT group
impdes 561 The IMPACT benchmark DES encoding a large file
impgo 32K The IMPACT transcription of the SPEC95 go.099 benchmark
impgrep 551 The IMPACT transcription of the UNIX grep utility on a large file
impijpeg 8446 The IMPACT transcription of the SPEC95 ijpeg.132 benchmark
docli 8864 The IMPACT transcription of the SPEC95 li.130 benchmark, modified
to use the system garbage collector.
impwc 148 The IMPACT transcription of the UNIX wc utility on a large file
Table 1: The Java benchmark programs
0.07 0.07
Escape descriptors
Original
0.06 Descriptor table 0.06 Two-level
Descriptor index Sparse
table
Fraction of code size
Fraction of code size
0.05 Auxiliary tables 0.05
Call site table
0.04 0.04
0.03 0.03
0.02 0.02
0.01 0.01
0 0
201_compress 202_jess Marmot wc 201_compress 202_jess Marmot wc
Figure 5: GC table sizes as fractions of code size Figure 6: Comparison of call site tables
Program Code size Call Indexes Index/CS Call site Indexes/CSS
sites sets
(bytes) (%) (%)
201 compress 64,240 1,845 115 6.2 554 20.8
202 jess 199,056 5,520 351 6.4 2,873 12.1
Marmot 2,049,920 61,609 4,972 8.1 34,477 14.5
wc 37,008 941 105 11.1 404 26.3
Table 2: The number of indexes is much smaller than the number of call sites
1 1
0.9 Marmot
0.9
202_jess
Cumulative fraction of call site sets
Indexes as a fraction of call site sets
0.8 0.8 average
0.7 0.7 201_compress
wc
0.6 0.6
0.5 0.5
Marmot
0.4 0.4
202_jess
0.3 0.3
201_compress
0.2 0.2
wc
0.1 0.1
0 0
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19+ 0 1 2 3 4 5 6
Number of live pointers Number of live pointers
Figure 7: Most call site sets have few pointers live Figure 8: The ratio of indexes to call site sets
across them
It also shows the percentage decrease in the speed of the using previous techniques. The tables share descriptions of
garbage collector due to the use of sparse call site tables. pointer locations at many garbage collection points. The
The decrease in GC speed for docli is large because stack sharing is possible because there are relatively few pointers
walking represents a large fraction of the (small) GC time. live across most garbage collection points. The size of the
Docli allocates a large amount of data that dies quickly. descriptions of pointer locations is typically less than 1% of
There are many collections of the younger generation, each executable code size.
of which includes a stack walk, but very little data survives The mapping from program counter values to indexes is
each collection, so GC time is small. represented compactly using several techniques. First, adja-
cent call sites that have the same live pointer information or
5.7 Reusing stack slots is important for larger cannot garbage collect are grouped into sets. This roughly
programs halves the number of garbage collection points. Second, the
Reusing stack slots for variables with disjoint lifetimes is table of return addresses is represented as a two-level table
important for reducing the size of the tables for larger pro- with 16-bit offsets or sparsely. The two-level table works
grams. Figure 9 shows the effect of disabling the reuse of well because there are usually many return addresses within
stack slots for variables with disjoint lifetimes. The size of a 64K segment of code. The sparse table uses the fact that
tables for Marmot increases by 21%. Other benchmarks af- the executable code contains most of the information in the
fected by this include 213 javac (9%), java cup (16%), and table. By disassembling the executable code, the exact call
impgo (35%). The primary cause of the increase is an in- site number can be interpolated.
crease in the size of escape descriptors. A secondary cause
is an increase in the number of descriptors, which increases
the size of the descriptor table.
7. REFERENCES
[1] Andrew W. Appel. Simple generational garbage
5.8 Effect of using an SSB write barrier collection and fast allocation. Software: Practice and
A generational collector uses a write barrier to track point- Experience, 19(2):171183, February 1989.
ers from older generations to younger generations. This al- [2] Andrew W. Appel. Compiling with Continuations.
lows younger generations to be collected independently from Cambridge University Press, 1992.
older generations. A check is placed at each store of a pointer [3] Standard Performance Evaluation Corporation. SPEC
into a heap object. A sequential store buffer (SSB) write CPU 95 benchmarks. Online version at
barrier appends the locations of cross-generational pointers http://www.spec.org/osg/cpu95, 1995.
to a buffer [1, 10, 8]. A garbage collection may be triggered [4] Standard Performance Evaluation Corporation. SPEC
when the buffer overflows. JVM98 benchmarks. Online version at
Thus, using an SSB write barrier introduces additional http://www.spec.org/osg/jvm98, 1998.
garbage collection points at pointer stores into heap objects. [5] Amer Diwan, J. Eliot B. Moss, and Richard L.
This increases the size of the tables. Figure 10 shows the Hudson. Compiler support for garbage collection in a
effect of this increase. It is primarily due to an increase in statically typed language. In Proceedings of the
the number of call site sets. This increases the size of the SIGPLAN '92 Conference on Programming Language
tables that map program counter values to indexes. Design and Implementation, pages 273282, San
Francisco, California, June 1992. SIGPLAN, ACM
6. SUMMARY Press.
This paper describes how to produce compact garbage col- [6] Robert Fitzgerald, Todd B. Knoblock, Erik Ruf,
lection tables that are 20-25% of the size of tables produced Bjarne Steensgaard, and David Tarditi. Marmot: An
Program GC time with Decrease in GC speed
original table with sparse table
(% of running time) (%)
213 javac 17 0.2
227 mtrt 3 1.3
228 jack 3 2.0
cn2 20 0.2
javacup 13 0.2
docli 0.5 18.9
Table 3: Decrease in garbage collector speed from using a sparse call site set table
25% 20%
18%
20% 16%
Increase in size of tables
Increase in size of tables
14%
15% 12%
10%
10% 8%
6%
5% 4%
2%
0% 0%
201_compress 202_jess Marmot wc 201_compress 202_jess Marmot wc
Figure 9: Effect of disabling the reuse of stack slots Figure 10: Effect of using an SSB write barrier on
for variables with disjoint lifetimes on the size of the size of tables
tables
optimizing compiler for Java. Software: Practice and
Experience, 30(3):199232, March 2000.
[7] David Gay and Bjarne Steensgaard. Fast escape
analysis and stack allocation for object-based
programs. In 9th International Conference on
Compiler Construction (CC'2000), Lecture Notes in
Computer Science. Springer-Verlag, March 2000. to
appear.
[8] Antony L. Hosking, J. Eliot B. Moss, and Darko
Stefanovi`. A comparative performance evaluation of
c
write barrier implementations. In Proceedings of the
ACM '92 Conference on Object-Oriented Programming
Systems, Languages, and Applications (OOPSLA),
pages 92109, Vancouver, British Columbia, October
1992. ACM Press.
[9] C.-H. A. Hsieh, J. C. Gyllenhaal, and W. W. Hwu.
Java bytecode to native code translation: the Caffeine
prototype and preliminary results. In IEEE, editor,
Proceedings of the 29th annual IEEE/ACM
International Symposium on Microarchitecture,
December 24, 1996, Paris, France, 1109 Spring
Street, Suite 300, Silver Spring, MD 20910, USA,
1996. IEEE Computer Society Press.
[10] Richard L. Hudson, J. Eliot B. Moss, Amer Diwan,
and Christopher F. Weight. A language-independent
garbage collector toolkit. COINS Technical Report
91-47, University of Massachusetts, Amherst, June
1991.
[11] Erik Ruf. Removing synchronization operations from
Java. In Proceedings of the ACM SIGPLAN '00
Conference on Programming Language Design and
Implementation, Vancouver, Canada, June 2000. ACM
Press. to appear.
[12] James M. Stichnoth, Guei-Yuan Lueh, and Michal
Cierniak. Support for garbage collection at every
instruction in a java compiler. In Proceedings of the
SIGPLAN '99 Conference on Programming Language
Design and Implementation, pages 118127, May 1999.