
Getting Real, Getting Dirty (without getting real dirty)

Ron K. Cytron
Joint work with Krishna Kavi

University of Alabama at Huntsville

April 2001

Dante Cannarozzi, Sharath Cholleti, Morgan Deters, Steve Donahue

Mark Franklin, Matt Hampton, Michael Henrichs, Nicholas Leidenfrost, Jonathan Nye, Michael Plezbert, Conrad Warmbold

Center for Distributed Object Computing
Department of Computer Science

Washington University

Funded by the National Science Foundation under grant 0081214

Funded by DARPA under contract F33615-00-C-1697

Outline

• Motivation
• Allocation
• Collection
• Conclusion

Traditional architecture and object-oriented programs

• Caches are still biased toward Fortran-like behavior

• CPU is still responsible for storage management

• Object-management activity invalidates caches
  – GC disruptive
  – Compaction

[Diagram: CPU + cache, L2 cache, and conventional memory banks]

An OO-biased design using IRAMs (with Krishna Kavi)

• CPU and cache stay the same, off-the-shelf

• Memory system redesigned to support OO programs

[Diagram: CPU + cache and L2 cache unchanged; memory banks replaced by an IRAM with on-chip logic]

IRAM interface

[Diagram: the CPU issues malloc to the IRAM interface (logic + memory banks); the IRAM returns addr]

Stable address for an object allows better cache behavior

Object can be relocated within IRAM, but its address to the CPU is constant

IRAM interface

[Diagram: the CPU issues putfield/getfield to the IRAM interface; the IRAM returns the value]

Object referencing, tracked inside the IRAM, supports garbage collection

IRAM interface

[Diagram: gc, compact, and prefetch run inside the IRAM, invisible to the CPU]

Goal: relegate storage-management functions to IRAM

Macro accesses

[Diagram: CPU + cache, L2 cache, IRAM]

p.getLeft().getNext()

*(*(p+12)+32)

Observe: code sequences contain common gestures (superoperators)

Gesture abstraction

[Diagram: CPU + cache, L2 cache, IRAM]

p.getLeft().getNext()

*(*(p+12)+32)

M143(x):

*(*(x+12)+32)

Goal: decrease traffic between CPU and storage
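The gesture idea can be sketched in Python: the two dependent loads of *(*(x+12)+32) happen "inside" the memory system, so the CPU issues one macro request instead of two loads. The flat-memory model and the addresses below are illustrative; only the offsets 12 and 32 come from the slides' example.

```python
# Sketch of a memory-side "superoperator": the dependent loads in
# *(*(x+12)+32) run inside the memory system, so the CPU issues one
# request instead of two. (Flat-memory model and addresses are made up.)

MEMORY = {}          # address -> word, standing in for the IRAM's DRAM

def load(addr):
    return MEMORY[addr]

def m143(x):
    # Macro 143: *(*(x+12)+32), evaluated entirely memory-side.
    return load(load(x + 12) + 32)

# Tiny object graph for p.getLeft().getNext(): the "left" field at
# offset 12, the "next" field at offset 32 of the left object.
p = 1000
MEMORY[p + 12] = 2000        # p's "left" field points at object 2000
MEMORY[2000 + 32] = 3000     # that object's "next" field points at 3000

assert m143(p) == 3000       # one CPU-memory round trip instead of two
```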

Gesture application

[Diagram: CPU + cache, L2 cache, IRAM]

Macro 143 (p)

M143(x):

*(*(x+12)+32)

p.getLeft().getNext()


Automatic prefetching

Goal: decrease traffic between CPU and storage

[Diagram: CPU issues Fetch p; the IRAM returns p and can prefetch p.getLeft().getNext()]


Challenges

• Algorithmic
  – Bounded-time methods for allocation and collection
  – Good average performance as well
• Architectural
  – Lean interface between the CPU and IRAM
  – Efficient realization

Storage Allocation (Real Time)

• Not necessarily fast
• Necessarily predictable
• Able to satisfy any reasonable request
  – Developer should know “maxlive” characteristics of the application
  – This is true for non-embedded systems as well

How much storage?

• curlive—the number of objects live at a point in time

• curspace—the number of bytes live at a point in time

[Diagram: handles point into object space; how many objects are concurrently live, and how much object space do they occupy?]

Storage Allocation—Free List

• Linked list of free blocks

• Search for desired fit

• Worst case O(n) for n blocks in the list

Worst-case free-list behavior

• The longer the free-list, the more pronounced the effect

• No a priori bound on how much worse the list-based scheme could get

• Average performance similar

Slowdown of List-Based Allocator

[Chart: slowdown 3.15 at 180 objects allocated, 72.84 at 3000 objects allocated]

Knuth’s Buddy System

• Free-list segregated by size

• All requests rounded up to a power of 2

[Diagram: segregated free lists for sizes 1, 2, 4, …, 256]

Knuth’s Buddy System (1–6)

• Begin with one large block
• Suppose we want a block of size 16
• Recursively subdivide
• Yield: 2 blocks of size 16
• One of those blocks can be given to the program

[Animation: the free lists for sizes 1–256 as the size-256 block is split down to two size-16 buddies]


Speedup of Buddy over List

[Chart: worst case 3.15 at 180 objects and 72.84 at 3000 objects; average 0.89 and 0.91]

Spec Benchmark Results

[Chart: speedup of Buddy over List per SPEC benchmark; y-axis 0–1.2]

Buddy System

• If a block can be found, it can be found in log(N) time, where N is the size of the heap
• The application cannot make that worse
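A minimal Python sketch of the buddy discipline just described, assuming a heap starting at address 0 and power-of-two sizes up to 256. The recursive split on allocation and the one-bit buddy address used for gluing follow the slides; the data layout is illustrative.

```python
# Minimal buddy allocator sketch (illustrative, not the VHDL design):
# free lists segregated by power-of-two size; allocation splits larger
# blocks recursively; freeing glues a block with its buddy when possible.

HEAP_SIZE = 256

def new_heap():
    # free[size] is a set of start addresses of free blocks of that size
    free = {s: set() for s in (1, 2, 4, 8, 16, 32, 64, 128, 256)}
    free[HEAP_SIZE].add(0)
    return free

def next_pow2(n):
    return 1 << (n - 1).bit_length() if n > 1 else 1

def alloc(free, n):
    size = next_pow2(n)                  # round request up to a power of 2
    s = size
    while s <= HEAP_SIZE and not free[s]:
        s *= 2                           # search upward for a free block
    if s > HEAP_SIZE:
        return None                      # would need defragmentation
    while s > size:                      # recursively subdivide
        addr = free[s].pop()
        s //= 2
        free[s].add(addr)
        free[s].add(addr + s)            # the two halves are buddies
    return free[size].pop(), size

def dealloc(free, addr, size):
    while size < HEAP_SIZE:
        buddy = addr ^ size              # buddy address differs in one bit
        if buddy not in free[size]:
            break
        free[size].remove(buddy)         # glue into a block twice as big
        addr = min(addr, buddy)
        size *= 2
    free[size].add(addr)

heap = new_heap()
a, sz = alloc(heap, 16)                  # splits 256 down to two 16s
assert sz == 16
dealloc(heap, a, sz)                     # glues all the way back up
assert heap[256] == {0}
```

Worst-case work is one split (or one glue) per level, which is the log(N) bound above.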

Defragmentation

• To keep up with the diversity of requested block sizes, an allocator may have to reorganize smaller blocks into larger ones

Defragmentation—Free List

• Free-list permutes adjacent blocks
• Storage becomes fragmented, with many small blocks and no large ones
• Two issues:
  – Join adjacent blocks
  – Reorganize holes (move live storage)
    • Organization by address can help [Kavi]

[Diagram: blocks in memory vs. their order on the free list]

Buddies—joining adjacent blocks

• The blocks resulting from subdivision are viewed as “buddies”
• Their addresses differ by exactly one bit
• The address of a block of size 2^n differs from its buddy’s address at bit n

[Diagram: buddy addresses …0… and …1… differing at bit n]
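The one-bit address property can be checked directly (addresses relative to the heap start; a small sketch):

```python
# The buddy of a size-2^n block differs from it in exactly bit n, so the
# buddy's address is a single XOR away (addresses relative to heap start).

def buddy_of(addr, size):       # size must be a power of two
    return addr ^ size

assert buddy_of(0, 16) == 16    # blocks at 0 and 16 are size-16 buddies
assert buddy_of(16, 16) == 0
assert buddy_of(96, 32) == 64   # bit 5 flips: 96 ^ 32 == 64
```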

Knuth’s Buddy System (6–1)

• When a block becomes free, it tries to rejoin its buddy
• A bit in its buddy tells whether the buddy is free
• If so, they glue together and make a block twice as big

[Animation: freed blocks glue with their buddies level by level, rebuilding the size-256 block]

Two problems

• Oscillation—Buddy looks like it may split, glue, split, glue—isn’t this wasted effort?

• Fragmentation—What happens when Buddy can’t glue but has space it would like to combine?

[Diagram: free lists for sizes 1–256]

Buddy—oscillation

256

128

64

32

16

8

4

2

1

Buddy—oscillation

256

128

64

32

16

8

4

2

1

Buddy—oscillation

256

128

64

32

16

8

4

2

1

Buddy—oscillation

256

128

64

32

16

8

4

2

1

Buddy—oscillation

256

128

64

32

16

8

4

2

1

Buddy—oscillation

Problem is lack of hysteresis

• Some programs allocate objects which are almost immediately deallocated
  – Continuous, incremental approaches to garbage collection only make this worse!
• Oscillation is expensive: blocks are glued only to be quickly subdivided again

Estranged Buddy System

• Variant of Knuth’s idea
• When deallocated, blocks are not eager to rejoin their buddies
• Evidence of value [Kaufman, TOPLAS ’84]
• Slight improvement on spec benchmarks
• Algorithmic improvement over Kaufman

Buddy-Busy and Buddy-Free

[Diagram: at each size 2^k, two lists: blocks whose buddies are busy and blocks whose buddies are free]

Estranged Buddy—Allocation

Allocation heuristic (for a request of size 2^k):

1. Buddy-busy at 2^k
2. Buddy-free at 2^k
3. Glue one level below (buddy-free at 2^(k-1))
4. Search up (Knuth)
5. Glue below

[Diagram: buddy-busy and buddy-free lists at sizes 2^(k-1), 2^k, 2^(k+1)]
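The preference order above can be sketched as a selection function over per-size buddy-busy and buddy-free lists. Only the ordering comes from the slide; the list bookkeeping and the fall-through returns for steps 4 and 5 are illustrative.

```python
# Sketch of the Estranged Buddy allocation preference for a size-2^k
# request. busy[k] holds free blocks whose buddies are allocated;
# freeb[k] holds free blocks whose buddies are also free.

def choose(busy, freeb, k, max_k):
    if busy.get(k):                   # 1. buddy-busy at 2^k: no glue lost
        return busy[k].pop(), "buddy-busy"
    if freeb.get(k):                  # 2. buddy-free at 2^k
        return freeb[k].pop(), "buddy-free"
    if freeb.get(k - 1) and len(freeb[k - 1]) >= 2:
        a = freeb[k - 1].pop()        # 3. glue a buddy-free pair below
        freeb[k - 1].remove(a ^ (1 << (k - 1)))
        return a & ~(1 << (k - 1)), "glued below"
    for j in range(k + 1, max_k + 1): # 4. search upward (Knuth-style split)
        if busy.get(j) or freeb.get(j):
            return None, "split above"
    return None, "glue below"         # 5. deeper gluing needed

busy = {4: [48]}                      # a size-16 block at 48, buddy busy
freeb = {4: [16]}                     # a size-16 buddy-free block at 16
blk, how = choose(busy, freeb, 4, 8)
assert (blk, how) == (48, "buddy-busy")   # rule 1 wins over rule 2
```

Preferring buddy-busy blocks avoids breaking up pairs that could later glue, which is the hysteresis the oscillation slides ask for.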

How well does Estranged Buddy do? (contrived example)

[Chart: allocating size-8 objects (150 of them); Knuth vs. Estranged]

Estranged Buddy on Spec

[Chart: speedup of Estranged Buddy over Knuth on compress, jess, raytrace, db, javac, mpegaudio, mtrt, jack; y-axis 0–1.6]

Recall: two problems

• Oscillation—Buddy looks like it may split, glue, split, glue—isn’t this wasted effort?
  – Typically not, but can be
• Fragmentation—What happens when Buddy can’t glue but has space it would like to combine?

Buddy System—Fragmentation

• Internal fragmentation from rounding-up of requests to powers of two
  – Not really a concern these days
• Assume a program can run in maxlive bytes
• How much storage is needed so Buddy never has to defragment?
• What is a good algorithm for Buddy defragmentation?
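The rounding-up rule and the internal fragmentation it causes can be written down directly (a small sketch):

```python
# Internal fragmentation from rounding requests up to powers of two:
# a request for n bytes occupies next_pow2(n), wasting the difference.

def next_pow2(n):
    return 1 << (n - 1).bit_length() if n > 1 else 1

def internal_frag(n):
    block = next_pow2(n)
    return (block - n) / block          # wasted fraction of the block

assert next_pow2(100) == 128
assert internal_frag(100) == 28 / 128   # just under 22% wasted
assert internal_frag(64) == 0.0         # exact powers waste nothing
```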

Buddy Configurations

[Diagram: free lists for sizes 1–8, showing allocated and free blocks]

Buddy Configurations

[Diagram: heap full, every block allocated]

[Diagram: free space exists, yet Buddy can’t allocate a size-2 block]

How Big a Heap for Non-Blocking Buddy (M = maxlive)?

• Easy bound: M log M
• Better bound: M · k, where k is the number of distinct sizes to be allocated
• Sounds like a good bound, but it isn’t
• Defragmentation may be necessary

[Diagram: free lists for sizes 1–256, with M bytes at each level]

Managing object relocation

• Every object has a stable handle, whose address does not change

• Every handle points to its object’s current location

• All references to objects are indirect, through a handle
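The handle indirection can be sketched as two tables; the handle ids and addresses below are hypothetical.

```python
# Handle indirection sketch: the CPU holds handle addresses, which never
# change; the IRAM may relocate the object and update only the handle.

handles = {}          # handle id -> current object address (inside IRAM)
memory = {}           # address -> object payload

def new_object(handle, addr, payload):
    handles[handle] = addr
    memory[addr] = payload

def read(handle):
    return memory[handles[handle]]     # every access is one indirection

def relocate(handle, new_addr):        # e.g. during compaction
    old = handles[handle]
    memory[new_addr] = memory.pop(old)
    handles[handle] = new_addr         # the CPU-visible handle is unchanged

new_object(7, 0x100, "node")
assert read(7) == "node"
relocate(7, 0x400)                     # compaction moves the object
assert read(7) == "node"               # the handle still works
```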

Buddy Defragmentation

• When stuck at level k
  – No blocks free above level k
  – No glueable blocks free below level k
  – Assume maxlive still suffices
• Example: k=6, size 64 not available

[Diagram: free lists for sizes 1–256; no size-64 block available]

Defragmentation Algorithm

[Animation: size-16 and size-32 blocks are swapped and then glued to assemble a free size-64 block]

Defragmentation Algorithm

• Recursively visit below to develop two buddies that can be glued
• Analogous to the recursive allocation algorithm
• Still, choices to be made… studies underway

Need 4 bytes

[Diagram: allocated and free blocks at sizes 1–8; to open a size-4 block, move 3 bytes? move 1 byte?]

Recall: two problems

• Oscillation—Buddy looks like it may split, glue, split, glue—isn’t this wasted effort?
  – Typically not, but can be
• Fragmentation—What happens when Buddy can’t glue but has space it would like to combine?
  – New algorithm to defragment Buddy
  – Selective approach—should beat List
  – Optimizations needed

Towards an IRAM implementation

• VHDL of Buddy System complete
  – DRAM clocked at 150 MHz
  – 10 cycles per DRAM access
• Need 7 accesses per level to split blocks
• For a 16-Mbyte heap: 24 levels
  – 1680 cycles worst case: 11 µs
  – 168x slower than a read
• Can we do better?

Two tricks

• Find a suitable free block quickly
• Return its address quickly

Finding a suitable free block

• No space at 16, but 16 points to the level above it that has a block to offer

[Diagram: free lists for sizes 1–256; level 16 is empty but points to a level above with a free block]

Finding a suitable free block

• Every level points to the level above it that has a block to offer

• Pointers are maintained using Tarjan’s path-compression

• Locating pointers are not stored in DRAM
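The locating pointers can be sketched as a parent map over levels with path compression; the dictionary layout and the has_free callback are assumptions, only the pointer-and-compress idea comes from the slides.

```python
# Sketch of the level-locating pointers: each empty level points toward
# the nearest level above it that has a block to offer, and lookups
# compress paths as in Tarjan's union/find.

parent = {}            # level -> next level to try

def find(level, has_free):
    if has_free(level):
        return level
    nxt = find(parent.get(level, level + 1), has_free)
    parent[level] = nxt            # path compression: point straight there
    return nxt

free_blocks = {6: ["block@0"]}     # only level 6 (size 64) has a block
has_free = lambda lv: bool(free_blocks.get(lv))

assert find(4, has_free) == 6      # a size-16 request is steered to 64
assert parent[4] == 6              # the pointer now jumps there directly
```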

[Diagram: free lists for sizes 1–256, with level-locating pointers]

Alternative free-block finder

• Path-compression may be too complex for hardware
• Instead, track the largest available free block
• Tends to break up large blocks and favor formation of small ones

[Diagram: free lists for sizes 1–256; allocation always carves from the largest free block]

Fast return for malloc

• Want 16 bytes
• Zip to the 64 display
• WLOG we return the first part of that block immediately to the requestor
• Adjustment to the structures happens in parallel with the return

[Diagram: levels 16, 32, 64 during the fast return]

Improved IRAM allocator

• ~10 cycles fast return
• ~1000 cycles to recover, worst case
• Is this good enough?
  – Compare software implementation
    • ~1000 cycles worst case
    • ~600 cycles average on spec benchmarks
  – Hardware can be much faster
  – Depends on recover time

Do programs allow us to recover?

• Run of jack—JVM instructions between requests
• 56% of requests separated by at least 100 JVM instructions
• Assume 10x expansion, JVM to native code
• For the 56%, we return in 10 cycles
• Code motion might improve others

Min: 3   Median: 181   Max: 174053

Garbage Collection

• While allocators are needed for most modern languages, garbage collection is not universally accepted

• Generational and incremental approaches help most applications

• Embedded and real-time need assurances of bounded behavior

Why not garbage collect?

• Some programmers want ultimate control over storage
• Real-Time applications need bounded-time overhead
  – RT Java spec relegates allocation and collection to user control
  – Isn’t this a step back from Java?

Marking Phase—the problem

• To discover the dead objects, we use calculatus eliminatus
  – Find live objects
  – All others are dead
• Pointers from the stack to the heap make objects live
• These objects make other objects live
• Sweep all others away as dead
• Perhaps compact the heap

[Animation: stack roots mark heap objects live; unmarked objects are swept; the heap may then be compacted]

Problems with “mark” phase

• Takes an unbounded amount of time
• Can limit it using generational collection, but then it’s not clear what will get collected
• We seek an approach that spends a constant amount of time per program operation and collects objects continuously

Two Approaches

• Variation on reference counting
• Contaminated garbage collection [PLDI00]

Reference Counting

• An integer is associated with every object, summing
  – Stack references
  – Heap references
• Objects with a reference count of zero are dead

[Diagram: stack and heap objects annotated with reference counts; count-0 objects are dead]

Problems with Reference Counting

• Standard problem is that objects in cycles (and those touched by such objects) cannot be collected
• Contaminated gc will collect such objects
• Overhead of counting can be high
• Untyped stack complicates things

[Diagram: a heap cycle whose members keep nonzero counts even when unreachable]

The Untyped Stack

• The stack is a collection of untyped cells
• In JVM, safety is verified at class-load time
• No need to tag stack locations with what they contain
• Leads to imprecision in all gc methods

[Diagram: is a given stack cell a heap address or just an integer?]

Idea

• When a stack frame pops, all of its cells are dead
• Don’t worry about tracking cell pointers
• Instead, associate an object with the last stack frame that can reference the object

Reference Counting Approach

• s is zero or one, indicating none or at least one stack reference to the object
• h precisely reflects the number of heap references to the object
• If s+h=0 the object is dead

[Diagram: each object carries an s bit and an h count]
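The s+h scheme can be sketched together with a putfield-style write barrier; the class layout and field names are illustrative.

```python
# Sketch of the s+h scheme: s is a single bit (any stack reference?),
# h counts heap references exactly; an object with s + h == 0 is dead.

class Obj:
    def __init__(self):
        self.s = 0          # 0 or 1: at least one stack reference
        self.h = 0          # exact heap reference count
        self.fields = {}

    def dead(self):
        return self.s + self.h == 0

def putfield(obj, name, new):
    old = obj.fields.get(name)
    if old is not None:
        old.h -= 1          # old pointed-to object loses a heap reference
    if new is not None:
        new.h += 1          # new pointed-to object gains one
    obj.fields[name] = new

a, b, c = Obj(), Obj(), Obj()
a.s = 1                     # a is referenced from the stack
putfield(a, "next", b)      # b now has h == 1
putfield(a, "next", c)      # b loses its only reference
assert b.dead() and not c.dead() and not a.dead()
```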

Our treatment of stack activity

• Object is associated with the last-to-be-popped frame that can reference the object
• When that frame pops
  – If the object is returned, the receiving frame owns the object
  – Otherwise the object is dead

[Animation: an object’s s bit drops from 1 to 0 as its owning frame pops]

Our reference-counting implementation

• The objects associated with a frame are linked together
• When a stack frame pops, all of its cells no longer can point at anything
  – Objects on the popped frame’s list with no remaining references are dead and are collected
  – Objects still referenced remain live
• When an object’s heap count becomes zero, it is scheduled for deletion in its associated frame
• When that frame pops, all such objects are dead

[Animation: a frame pops; the objects linked to it are collected one by one as their counts reach zero]

Reference Counting

• Predictable, constant overhead for each JVM instruction
  – putfield decreases the count at the old pointed-to object, increases the count at the new pointed-to object
  – areturn associates the object with a stack frame if not already associated below
• How well does it do? We shall see!

Contaminated Garbage Collection

• Need to collect objects involved in reference cycles without resorting to marking live objects
• Idea
  – Associate each object with a stack frame such that when that frame returns, the object is known to be dead
  – Like escape analysis, but dynamic

Contaminated garbage collection

• Initially each object is associated with the frame in which it is instantiated
• When B references A, A becomes as live as B
• Now A, B, and C are as live as C
• Even though D is less live than C, it gets contaminated
• Should something reference D later, all will be affected
• Static finger of life: now all objects appear to live forever, even if E points away!

[Animation: objects A–E on the stack contaminate one another as references form]

Contaminated garbage collection

• Every object is a member of an equilive set
  – All objects in a set are scheduled for deallocation at the same time
  – Sets are maintained using Tarjan’s disjoint union/find algorithm
• Nearly constant amount of overhead per operation
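A union/find sketch of equilive sets follows; associating each set with the smaller (older, last-to-be-popped) frame depth on union is one reading of the contamination slides, and the data layout is illustrative.

```python
# Equilive sets sketch: disjoint-set union/find with path compression;
# each set carries the frame that must pop before its objects can die.
# Contamination unions two sets and keeps the longer-lived frame
# (here, the smaller frame depth).

parent, frame = {}, {}

def make(obj, depth):
    parent[obj] = obj
    frame[obj] = depth

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path compression
        x = parent[x]
    return x

def contaminate(a, b):                  # a references b (or vice versa)
    ra, rb = find(a), find(b)
    if ra != rb:
        parent[rb] = ra
        frame[ra] = min(frame[ra], frame[rb])

make("A", 1); make("B", 2); make("C", 3)
contaminate("B", "A")                   # A becomes as live as B's set
contaminate("C", "B")
assert find("A") == find("C")           # one equilive set
assert frame[find("A")] == 1            # it dies when frame 1 pops
```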

Contaminated GC

• Each equilive set is associated with a frame
• Suppose an object in one set references an object in another set (in either direction)
• Contamination! The sets are unioned
• When a frame pops, objects associated with it are dead

[Animation: equilive sets merge on cross-set references and die when their frame pops]

Summary of methods

Reference counting
• Can’t handle cycles
• Handles pointing at and then away

Contaminated GC
• Tolerates cycles
• Can’t track pointing and then pointing away

Both techniques:
• Incur cost at putfield, areturn
• (Nearly) constant overhead per operation

Implementation details

• SUN JDK 1.1 interpreter version
• Many subtle places where references are generated: String.intern(), ldc instruction, class loader, JNI
• Each gc method took about 3 months to implement
• Can run either method or both in concert
• Fairly optimized, more is possible

Spec benchmark effectiveness

[Chart: “Size 1 Absolute” results (0–450000) for compress, jess, raytrace, db, javac, mpegaudio, mtrt, jack, checkit; series None, RefCount, Both, CGC]

Spec benchmark effectiveness

[Chart: “size 10 Absolute” results (0–900000) for compress, jess, raytrace, db, javac, mpegaudio, mtrt, jack, checkit; series None, RefCount, Both, CGC]

Exactness of Equilive Sets

[Pie charts for raytrace, javac, mpegaudio, jack, db, jess: equilive-set sizes 1, 2, 3, 4, 5, 6–10, >10]

Distance to die in frames

[Pie charts for jess, raytrace, db, javac, mpegaudio, jack: distances 0, 1, 2, 3, 4, 5, >5]

Speed of CGC

[Chart: speedup of CGC on the SPEC benchmarks, values between 1.04 and 1.24; series: over JDK big heap, over JDK same heap]

Speedups of Mark-Free Approaches

[Chart: RefCount and CGC speedups (0.00–1.40) on compress, jess, raytrace, db, javac, mpegaudio, mtrt, jack]

Future Plans

• VHDL simulation of more efficient buddy allocator
• VHDL simulation of garbage collection methods
• Better buddy defragmentation
• Experiment with informed allocation
• Comparison/integration with other IRAM-based methods (with Krishna Kavi)

Informed Storage Management

• Evidence that programs allocate many objects of the same size

[Scatter: benchmark jack, 20% fragmentation; number of requests (log) vs. buddy block-size (log)]

[Scatter: benchmark raytrace, 12% fragmentation; number of requests (log) vs. buddy block-size (log)]

[Scatter: benchmark compress, 34% fragmentation; number of requests (log) vs. buddy block-size (log)]

Informed Storage Management

• Evidence that programs allocate many objects of the same size
• Not surprising, in Java: same type, same size
• In C and C++ programmers brew their own allocators to take advantage of this
• What can we do automatically?

Informed Storage Management

• Capture program malloc requests by phase
• Generate a .class file and put it in CLASSPATH
• Load the .class file and inform the allocator
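The capture step can be sketched as a per-phase histogram of request sizes; the phase length and profile layout are assumptions (the slides emit the profile as a generated .class file), and phases are keyed to allocation counts, not time.

```python
# Sketch of capturing malloc requests by phase: count block sizes per
# phase so a profile can inform the allocator on the next run.

from collections import Counter, defaultdict

PHASE_LENGTH = 4                       # allocations per phase (made up)
profile = defaultdict(Counter)         # phase -> Counter of block sizes
allocs = 0

def record(size):
    global allocs
    profile[allocs // PHASE_LENGTH][size] += 1
    allocs += 1

for size in [8, 8, 16, 8, 32, 32, 32, 8]:
    record(size)

assert profile[0] == Counter({8: 3, 16: 1})    # phase 0: mostly 8s
assert profile[1] == Counter({32: 3, 8: 1})    # phase 1: mostly 32s
```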

Different phases, different distributions

[Chart: for raytrace, the fraction of requests at block sizes 8–32, per phase (Phases 1–6)]

How long is a phase?

• Phases 1–3 are common to all programs: JVM startup
• Phases are keyed to allocations, not time, for portability

[Chart: allocations per phase for Phases 1–3 and Phases 4–6]

Questions?
