Effective and Inexpensive (Memory) Race Recording

Min Xu, Thesis Defense, 05/04/2006
Electrical and Computer Engineering Department, UW-Madison
Advisors: Mark Hill, Rastislav Bodik
Committee: Remzi Arpaci-Dusseau, Mikko Lipasti, Barton Miller, David Wood


Page 1: Effective and Inexpensive (Memory) Race Recording

Effective and Inexpensive (Memory) Race Recording

Min Xu

Thesis Defense

05/04/2006

Electrical and Computer Engineering Department, UW-Madison

Advisors: Mark Hill, Rastislav Bodik

Committee: Remzi Arpaci-Dusseau, Mikko Lipasti, Barton Miller, David Wood

Page 2: Effective and Inexpensive (Memory) Race Recording

Overview

Increasingly useful to replay multithreaded code
• Race recording: key to dealing with nondeterminism

A Case Study
• Long recording: 1 byte/kilo-instr
• Always-on recording: less than 2% overhead
• Low cost: 24 KB RAM/core
• Supports both SC & TSO (x86-like)

[Figure: the Race Recorder, labeled Effective and Inexpensive, with four properties: Long Recording, More Applicable, Low Overhead, Low Cost]

Page 3: Effective and Inexpensive (Memory) Race Recording

Thesis Contributions

[Figure: four contributions (Order-Value Hybrid, RTR Algorithm, Set/LRU Approximation, Coherence Piggyback) linked to four goals (Low Cost Hardware, Small Log Size, Low Runtime Overhead, SC & TSO Applicability) under the labels Effective and Inexpensive]

Page 4: Effective and Inexpensive (Memory) Race Recording

Outline

Motivation & Problem (5 slides)

An Effective and Inexpensive Race Recorder (21 slides)
• RTR Algorithm
• Set/LRU Approximation
• Coherence Piggyback
• Order-Value Hybrid

Evaluation Method & Results (6 slides)

Conclusion & My Other Research (3 slides)

Page 5: Effective and Inexpensive (Memory) Race Recording

Motivation & Problem

Page 6: Effective and Inexpensive (Memory) Race Recording

Multithreaded Debugging

% gcc hash.c
% a.out
Segmentation fault
%

% gdb a.out
gdb> run
Program received SIGSEGV.
In get() at hash.c:45
45 a = bucket->d;

% gdb a.out
gdb> run
Program exited normally.
gdb>

% gcc para-hash.c
% a.out
Segmentation fault
%

% gdb a.out log
gdb> run
Program received SIGSEGV.
In get() at para-hash.c:67
67 a = bucket->d;

% gcc para-hash.c
% a.out
Segmentation fault
Race recorded in “log”
%

Page 7: Effective and Inexpensive (Memory) Race Recording

Race Recording

[Figure: Thread I executes X = 1; X++; print(X) while Thread J executes X = X*5, whose position relative to Thread I's statements varies between runs. Panels: Original (X=6), Replay (X=10), and Recording with a Log (X=6).]
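The effect of the interleaving on the printed value can be reproduced with a small sketch. The `run` helper and the schedule lists below are hypothetical names introduced for illustration; they only model the four statements shown on the slide and are not part of the recorder itself.

```python
# Hypothetical model of the slide's example: thread I runs three steps,
# thread J runs one, and all of them touch the shared variable X.
def run(schedule):
    """Execute one step per schedule entry and return the value I prints."""
    state = {"X": 0, "printed": None}
    thread_i = [
        lambda s: s.update(X=1),               # X = 1
        lambda s: s.update(X=s["X"] + 1),      # X++
        lambda s: s.update(printed=s["X"]),    # print(X)
    ]
    thread_j = [
        lambda s: s.update(X=s["X"] * 5),      # X = X*5
    ]
    programs = {"I": iter(thread_i), "J": iter(thread_j)}
    for thread in schedule:
        next(programs[thread])(state)
    return state["printed"]

recorded = ["I", "J", "I", "I"]   # J's multiply lands between X = 1 and X++
unlogged = ["I", "I", "J", "I"]   # a different interleaving on replay

print(run(recorded))  # 6
print(run(unlogged))  # 10
```

Replaying with the recorded schedule yields 6 again, which is the role the log plays in the figure.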

Page 8: Effective and Inexpensive (Memory) Race Recording

Recording for Multithreaded Replay

Race Recording (the focus)
• Not an issue for a single thread
• Create the same general & data races

Checkpointing
• Provide a snapshot of the program state
• Many proposals (e.g., SafetyNet), not the focus

Input Recording
• Provide repeatable inputs
• Some proposals (e.g., part of FDR), not the focus

Page 9: Effective and Inexpensive (Memory) Race Recording

A Good Race Recorder

% gdb a.out log
gdb> run
Program received SIGSEGV.
In get() at para-hash.c:67
67 a = bucket->d;

% gcc para-hash.c
% a.out
Segmentation fault
Race recorded in “log”
%

• Long recording: small log
• Low runtime overhead
• Low cost
• Applicability

Page 10: Effective and Inexpensive (Memory) Race Recording

Desired & Existing Race Recorders

[Table: recorders compared on Recording Length, Applicability, Overhead, and Cost.
Desired recorder: small log size; applies to MP, racy code, SC, and TSO; negligible slowdown; little hardware.
Compared recorders: InstRply ’87, R&C ’90, Bacon ’91, Netzer ’93, Déjà Vu ’98, RecPlay ’00, JaRec ’04, Our Recorder.]

Page 11: Effective and Inexpensive (Memory) Race Recording

[Section divider: Small Log Size. Contributions shown: Order-Value Hybrid, Set/LRU Approximation, RTR Algorithm, Coherence Piggyback]

Page 12: Effective and Inexpensive (Memory) Race Recording

Problem Formulation

Reproduce exact same conflicts: no more, no less

[Figure: Recording and Replay panels for Thread I and Thread J, connected by a Log. The threads execute ld A, st B, st C, sub, ld B, add, st C, ld B, st A, st C, ld D, st D. Conflicts are drawn in red, dependences in black.]

Page 13: Effective and Inexpensive (Memory) Race Recording

Log All Conflicts

Detect conflicts, then write the log
• Assign IC (logical timestamps)
• But too many conflicts

[Figure: Replay of the Thread I / Thread J example from the Problem Formulation slide, with the instructions of each thread numbered 1 to 6]

Dependence Log (16 bytes per dependence)
• Log J: 2→3, 1→4, 3→5, 4→6
• Log I: 2→3
• Log Size: 5*16 = 80 bytes (10 integers)

Page 14: Effective and Inexpensive (Memory) Race Recording

Netzer’s Transitive Reduction

[Figure: Replay of the Thread I / Thread J example, instructions numbered 1 to 6 per thread, with the dependence 1→4 no longer logged]

TR Reduced Log
• Log J: 2→3, 3→5, 4→6
• Log I: 2→3
• Log Size: 64 bytes (8 integers)
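A transitive reduction of this kind can be sketched with per-thread vector clocks. This is a minimal illustration, not the thesis implementation: the function name, the tuple layout `(src_thread, src_ic, dst_thread, dst_ic)`, and the assumption that dependences arrive in recorded order with ICs starting at 1 are all choices made here.

```python
from bisect import bisect_right

def transitive_reduction(deps):
    """Drop dependences already implied by earlier logged ones.

    deps: (src_thread, src_ic, dst_thread, dst_ic) tuples in recorded order.
    """
    clock = {}     # thread -> {other thread: latest IC known to precede it}
    history = {}   # thread -> [(ic, clock snapshot)], ascending by ic
    logged = []
    for src, src_ic, dst, dst_ic in deps:
        known = clock.setdefault(dst, {})
        if known.get(src, 0) >= src_ic:
            continue                        # implied by an earlier dependence
        logged.append((src, src_ic, dst, dst_ic))
        # Inherit what the source knew at src_ic, plus the source event itself.
        snaps = history.get(src, [])
        pos = bisect_right([ic for ic, _ in snaps], src_ic)
        inherited = dict(snaps[pos - 1][1]) if pos else {}
        inherited[src] = src_ic
        for thread, ic in inherited.items():
            if thread != dst:
                known[thread] = max(known.get(thread, 0), ic)
        history.setdefault(dst, []).append((dst_ic, dict(known)))
    return logged

# The five dependences from the "Log All Conflicts" slide.
deps = [("I", 2, "J", 3), ("J", 2, "I", 3), ("I", 1, "J", 4),
        ("I", 3, "J", 5), ("I", 4, "J", 6)]
reduced = transitive_reduction(deps)
print(len(reduced))  # 4
```

On the slide's example the sketch drops only I:1→J:4, which is implied by I:2→J:3, leaving four dependences (8 integers).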

Page 15: Effective and Inexpensive (Memory) Race Recording

The Intuition of the New RTR Algorithm

[Figure: the dependences left after reduction, from I to J and from J to I, are grouped into vectors; the vectors regulate replay (RTR)]

Page 16: Effective and Inexpensive (Memory) Race Recording

Stricter Dependences to Aid Vectorization

[Figure: Replay of the Thread I / Thread J example, instructions numbered 1 to 6 per thread; a stricter dependence is introduced so that others can be reduced]

New Reduced Log
• Log J: 2→3, 4→5
• Log I: 2→3
• Log Size: 48 bytes (6 integers)

Page 17: Effective and Inexpensive (Memory) Race Recording

Compress Vectorized Dependencies

[Figure: Replay of the Thread I / Thread J example, instructions numbered 1 to 6 per thread, with the remaining dependences grouped into vector dependences]

Vectorized Log
• Log J: x=3,5, ∆=1
• Log I: x=3, ∆=1
• Log Size: 40 bytes (5 integers)

Reduce log size to KB/core/second
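The log sizes quoted for this running example follow from simple arithmetic. The sketch below assumes 8 bytes per logged integer, which is inferred from "10 integers = 80 bytes" rather than stated outright; the variable names are illustrative.

```python
BYTES_PER_INT = 8   # assumption inferred from the slides: 10 integers = 80 bytes

def log_bytes(num_integers):
    """Size in bytes of a log holding the given number of integers."""
    return num_integers * BYTES_PER_INT

all_conflicts = log_bytes(5 * 2)   # 5 dependences, 2 integers each
tr_reduced = log_bytes(4 * 2)      # transitive reduction drops one dependence
stricter = log_bytes(3 * 2)        # a stricter dependence replaces two
vectorized = log_bytes(3 + 2)      # Log J: (3, 5, delta=1); Log I: (3, delta=1)

print(all_conflicts, tr_reduced, stricter, vectorized)  # 80 64 48 40
```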

Page 18: Effective and Inexpensive (Memory) Race Recording

[Section divider: Low Runtime Overhead. Contributions shown: Order-Value Hybrid, Set/LRU Approximation, RTR Algorithm, Coherence Piggyback]

Page 19: Effective and Inexpensive (Memory) Race Recording

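Detect Conflicts

[Figure: Recording. Thread I: 1 ld A, 2 st B, 3 st C. Thread J: 1 add, 2 st C, 3 ld B, 4 st A. Each address keeps a writer and a set of readers, e.g., A.readers and A.writer.]

Bookkeeping per access
• I:1 ld A: A.readers.add(I, 1)
• I:2 st B: B.writer = (I, 2)
• J:2 st C: C.writer = (J, 2)
• I:3 st C: if (C.writer != I) log(WAW); foreach C.readers: if (reader != I) log(WAR); C.readers.clear(); C.writer = (I, 3)
• J:3 ld B: if (B.writer != J) log(RAW); B.readers.add(J, 3)

Expensive in software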
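The per-address bookkeeping on this slide can be written out as a small software detector. This is a sketch under assumptions: the class name, the log tuple layout, and the choice to keep one timestamp per reader are illustrative, and the slide's point is precisely that doing this in software on every access is expensive.

```python
from collections import defaultdict

class ConflictDetector:
    """Tracks the last writer and the readers of each address."""

    def __init__(self):
        self.writer = {}                    # addr -> (thread, ic)
        self.readers = defaultdict(dict)    # addr -> {thread: ic}
        self.log = []                       # (kind, src_thread, src_ic, dst_thread, dst_ic)

    def load(self, thread, ic, addr):
        last = self.writer.get(addr)
        if last and last[0] != thread:
            self.log.append(("RAW", last[0], last[1], thread, ic))
        self.readers[addr][thread] = ic

    def store(self, thread, ic, addr):
        last = self.writer.get(addr)
        if last and last[0] != thread:
            self.log.append(("WAW", last[0], last[1], thread, ic))
        for reader, reader_ic in self.readers[addr].items():
            if reader != thread:
                self.log.append(("WAR", reader, reader_ic, thread, ic))
        self.readers[addr].clear()
        self.writer[addr] = (thread, ic)

# The accesses from the slide, in one possible recorded order.
d = ConflictDetector()
d.load("I", 1, "A")
d.store("I", 2, "B")
d.store("J", 2, "C")
d.store("I", 3, "C")   # WAW with J:2
d.load("J", 3, "B")    # RAW with I:2
d.store("J", 4, "A")   # WAR with I:1
for entry in d.log:
    print(entry)
```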

Page 20: Effective and Inexpensive (Memory) Race Recording

Use Cache and Cache Coherence

Proc I
Tag  State  Data  Timestamp
A    S      …     1
B    M      …     4

Proc J
Tag  State  Data  Timestamp
A    S      …     3
B    I      …     2

The cache state and timestamps play the role of A.readers, A.writer, B.readers, and B.writer.

ld B: Get/S Request, then Data Response + Timestamp; RAW detected & logged

Detect conflicts in hardware with little runtime cost

Page 21: Effective and Inexpensive (Memory) Race Recording

Cache Evictions and Writebacks

Proc I
Tag  State  Data  Timestamp
A    S      …     1
B    M      …     4

Proc J
Tag  State  Data  Timestamp
A    S      …     3
B    I      …     2

Directory of A: Shared(I,J) Owner()

st A: Get/S, Inv, then Ack + Timestamp?; WAR detected & logged

[Animation steps also show the cache rows C M … 3 and M … 4]

OK with nonsilent eviction & directory eviction

Page 22: Effective and Inexpensive (Memory) Race Recording

Implement TR and RTR in Hardware

Ideal TR requires vector timestamps
• Too expensive
• New idea: Pairwise-TR (use scalar timestamps)
• Enables pairwise transitive reduction

Optimal RTR algorithm is likely expensive
• Implement a greedy RTR algorithm
• One-pass, online algorithm
• Keep a sliding window of vectorizable dependencies
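One way to read the pairwise idea is sketched below: each thread pair keeps a single scalar (the newest source IC already logged) instead of a full vector clock. The function name and tuple layout are assumptions made here, and the sketch is an interpretation of the bullet, not the hardware design itself.

```python
def pairwise_tr(deps):
    """Reduce dependences using one scalar per (source, destination) pair.

    deps: (src_thread, src_ic, dst_thread, dst_ic) tuples in recorded order,
    with ICs starting at 1.
    """
    newest_src = {}   # (src_thread, dst_thread) -> largest source IC already logged
    logged = []
    for src, src_ic, dst, dst_ic in deps:
        if newest_src.get((src, dst), 0) >= src_ic:
            continue   # an earlier logged dependence from src to dst already orders it
        newest_src[(src, dst)] = src_ic
        logged.append((src, src_ic, dst, dst_ic))
    return logged

two_threads = [("I", 2, "J", 3), ("J", 2, "I", 3), ("I", 1, "J", 4),
               ("I", 3, "J", 5), ("I", 4, "J", 6)]
print(len(pairwise_tr(two_threads)))    # 4

# With three threads, I:1 -> K:2 is implied through J, but a pairwise check
# cannot see that, so it stays in the log.
three_threads = [("I", 1, "J", 1), ("J", 2, "K", 1), ("I", 1, "K", 2)]
print(len(pairwise_tr(three_threads)))  # 3
```

The second example shows the tradeoff: scalar state is cheaper, at the price of missing reductions that pass through a third thread.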

Page 23: Effective and Inexpensive (Memory) Race Recording

Hardware Implementation

Cache
• Eviction/writeback: solved, more details later
• Directory protocols: solved
• Snooping protocols: partly solved
• Two-level coherence: not yet solved

Processor
• Out-of-order/prefetching: solved
• Unordered messages: solved
• Counter overflow: solved
• Thread migration: not yet solved

Page 24: Effective and Inexpensive (Memory) Race Recording

[Section divider: Low Cost Hardware. Contributions shown: Order-Value Hybrid, Set/LRU Approximation, RTR Algorithm, Coherence Piggyback]

Page 25: Effective and Inexpensive (Memory) Race Recording

Timestamp Approximation

One set of I’s $
Tag  State  Data  Timestamp
A    S      …     1
B    M      …     2
C    M      …     3

Directory of A: Shared(I)

[Figure: Recording. Thread I and Thread J, instructions numbered 1 to 3, executing ld A, st B, st C, add, st C, ld B, st A, ld D]

Use current IC of thread I

Correct, but more evictions → more logged conflicts

Page 26: Effective and Inexpensive (Memory) Race Recording

[Figure: Hardware Cost versus Log Size]

Page 27: Effective and Inexpensive (Memory) Race Recording

Set/LRU Approximation

One set of I’s $
Tag  State  Data  Timestamp
A    S      …     1
B    M      …     2
C    M      …     3

[Figure: Recording. Thread I and Thread J, instructions numbered 1 to 3, executing ld A, st B, st C, add, st C, ld B, st A, ld D]

Use current IC of thread I

LRU guarantees B’s TS > A’s TS

Set/LRU better preserves reducibility
Small $ → more misses but still small log
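A software model can make the LRU guarantee concrete. The sketch below is an assumption-laden illustration: the class, its method names, and the modulo set indexing are invented here. It relies on timestamps growing with each access, so the least recently used entry left in a set is at least as new as anything that set has evicted.

```python
from collections import OrderedDict

class SetLruTimestampMemory:
    """Set-associative timestamp store with LRU replacement (illustrative)."""

    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways
        self.sets = [OrderedDict() for _ in range(num_sets)]  # block -> timestamp

    def touch(self, block, ic):
        """Record an access; ic must not decrease between calls."""
        entries = self.sets[block % self.num_sets]
        entries.pop(block, None)
        entries[block] = ic                  # most recently used goes last
        if len(entries) > self.ways:
            entries.popitem(last=False)      # evict the least recently used

    def lookup(self, block):
        """Return (timestamp, exact). On a miss the timestamp is an upper bound."""
        entries = self.sets[block % self.num_sets]
        if block in entries:
            return entries[block], True
        if len(entries) < self.ways:
            return None, False               # set never overflowed: block never seen
        return next(iter(entries.values())), False   # LRU entry bounds evicted ones

mem = SetLruTimestampMemory(num_sets=1, ways=2)
A, B, C = 0, 1, 2
mem.touch(A, 1)
mem.touch(B, 2)
mem.touch(C, 3)          # evicts A
print(mem.lookup(A))     # (2, False)
```

In the example, A's true timestamp was 1; the approximation returns B's timestamp 2, which is tighter than falling back to the thread's current IC of 3.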

Page 28: Effective and Inexpensive (Memory) Race Recording

Hardware Cost of Timestamps

Coupled timestamp memory: overhead grows with cache size
• Not flexible
• 64B line + 64b (24b) timestamp → 12.5% (4.7%) overhead
• 192 KB for a 4MB L2

Need to modify cache

Coupled Timestamp Memory
Tag  State  Data  Timestamp
A    S      …     1
B    M      …     2

Page 29: Effective and Inexpensive (Memory) Race Recording

Decoupled Timestamp Memory

Decoupling → small timestamp memory (Set/LRU)
• e.g., 32-set, 64-way: 99% transitive reduction
• Timestamp memory: 24 KB

No need to modify cache

Coupled Timestamp Memory
Tag  State  Data  Timestamp
A    S      …     1
B    M      …     2

Cache
Tag  State  Data
A    S      …
B    M      …

Timestamp Memory
Tag  Timestamp
A    1
B    2

From 192 KB to 24 KB: 8x reduction
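The storage figures for coupled and decoupled timestamps can be checked with a few lines of arithmetic. All inputs come from the slides; the per-entry size of the decoupled memory is derived here, and the slides do not say how those bytes split between tag and timestamp.

```python
LINE_BYTES = 64
L2_BYTES = 4 * 1024 * 1024

def coupled_overhead(timestamp_bits):
    """Fraction of extra storage when every cache line carries a timestamp."""
    return timestamp_bits / (LINE_BYTES * 8)

l2_lines = L2_BYTES // LINE_BYTES               # 65536 lines
coupled_kb = l2_lines * 24 / 8 / 1024           # 24-bit timestamp per line

decoupled_kb = 24
decoupled_entries = 32 * 64                     # 32 sets x 64 ways
bytes_per_entry = decoupled_kb * 1024 / decoupled_entries

print(coupled_overhead(64))        # 0.125
print(coupled_overhead(24))        # 0.046875
print(coupled_kb)                  # 192.0
print(bytes_per_entry)             # 12.0
print(coupled_kb / decoupled_kb)   # 8.0
```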

Page 30: Effective and Inexpensive (Memory) Race Recording

[Section divider: SC & TSO Applicability. Contributions shown: Order-Value Hybrid, Set/LRU Approximation, RTR Algorithm, Coherence Piggyback]

Page 31: Effective and Inexpensive (Memory) Race Recording

Recording with Total Store Order (TSO)

Majority of existing MPs are non-SC

TSO is well defined, x86-like

[Figure: initially A=B=0. Thread I executes 1: st A,1 then 2: ld B; Thread J executes 1: st B,1 then 2: ld A. The SC interleavings of ld A, ld B, st A,1, st B,1 yield A=1 B=0, A=0 B=1, or A=1 B=1. TSO additionally allows A=0 B=0.]
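The extra TSO outcome can be confirmed by exhaustive search over a toy machine. This is a sketch: the `outcomes` function and its FIFO write-buffer model with store-to-load forwarding are simplifications chosen for illustration, not a full model of any real processor.

```python
def outcomes(programs, initial_memory, use_write_buffer):
    """Enumerate every reachable set of load results.

    programs: {thread: [("st", addr, val) | ("ld", addr), ...]}
    use_write_buffer=False models SC; True models a TSO-like FIFO store buffer.
    """
    results = set()
    threads = sorted(programs)

    def explore(pcs, buffers, memory, loads):
        progressed = False
        for t in threads:
            if buffers[t]:                      # drain the oldest buffered store
                progressed = True
                addr, val = buffers[t][0]
                explore(pcs, {**buffers, t: buffers[t][1:]},
                        {**memory, addr: val}, loads)
            if pcs[t] < len(programs[t]):       # execute the next instruction
                progressed = True
                op = programs[t][pcs[t]]
                next_pcs = {**pcs, t: pcs[t] + 1}
                if op[0] == "st":
                    _, addr, val = op
                    if use_write_buffer:
                        explore(next_pcs, {**buffers, t: buffers[t] + ((addr, val),)},
                                memory, loads)
                    else:
                        explore(next_pcs, buffers, {**memory, addr: val}, loads)
                else:
                    _, addr = op
                    # Forward from the thread's own buffer before reading memory.
                    pending = [v for a, v in buffers[t] if a == addr]
                    val = pending[-1] if pending else memory[addr]
                    explore(next_pcs, buffers, memory, loads + ((addr, val),))
        if not progressed:
            results.add(tuple(sorted(loads)))

    explore({t: 0 for t in threads}, {t: () for t in threads},
            dict(initial_memory), ())
    return results

programs = {"I": [("st", "A", 1), ("ld", "B")],
            "J": [("st", "B", 1), ("ld", "A")]}
sc = outcomes(programs, {"A": 0, "B": 0}, use_write_buffer=False)
tso = outcomes(programs, {"A": 0, "B": 0}, use_write_buffer=True)
print(sorted(tso - sc))  # [(('A', 0), ('B', 0))]
```

Each result pairs an address with the value its load observed, so the single extra TSO result corresponds to the slide's A=0 B=0 case.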

Page 32: Effective and Inexpensive (Memory) Race Recording

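TSO Execution

[Figure: the same example, initially A=B=0. Thread I: 1: st A,1; 2: ld B. Thread J: 1: st B,1; 2: ld A. The stores st A,1 and st B,1 sit in the write buffers (WrBuf) of I and J in front of the Memory System. Both loads see A=0 B=0; the memory system later holds A=1 B=1.]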

Page 33: Effective and Inexpensive (Memory) Race Recording

Order-Value-Hybrid Recording

[Figure: Recording and Replay panels for the same example, initially A=B=0. Thread I: 1: st A,1; 2: ld B. Thread J: 1: st B,1; 2: ld A. Stores pass through the write buffers (WrBuf) of I and J to the Memory System, which goes from A=0 B=0 to A=1 B=1.
Annotations during recording: Start Monitor A; Start Monitor B; A Changed!; Stop Monitor B; WAR omitted, value logged.
Annotation during replay: value used (A=0).]

Page 34: Effective and Inexpensive (Memory) Race Recording

Hybrid Recording with TR and RTR

Hybrid recording
• All loads get correct values
• Hardware similar to OoO SC [Gharachorloo et al. ’91]

Hybrid + TR & RTR
• TR will not use the omitted WAR in reduction
• RTR vectorizes dependencies more conservatively

Page 35: Effective and Inexpensive (Memory) Race Recording

Evaluation Method & Results

Page 36: Effective and Inexpensive (Memory) Race Recording

Put-it-together: Determinizer/CMP

[Figure: a CMP with Core 1 to Core 4, each with a TSM, sharing an L2 cache (L1 directory). Inside a core: L1_I$, L1_D$, TSM, IC, L1 coherence controller, Log, TR Reg, RTR Reg.]

Page 37: Effective and Inexpensive (Memory) Race Recording

Simulation Method

Commercial server hardware
• GEMS: http://www.cs.wisc.edu/gems
• Full-system (OS + application) executions
• 4-core CMP (sequentially consistent)
• 1-way in-order issue, 2 GHz
• 64KB I/D L1, 4MB L2, 64-byte lines, MOSI directory

Commercial server software
• Apache: static web serving
• SpecJBB: middleware
• OLTP: TPC-C like
• Zeus: static web serving

Page 38: Effective and Inexpensive (Memory) Race Recording

Log Size: 1 byte/kilo-instr

Well within the capability of current machines
• Long recording (days to months) needs improvement

[Figure: two bar charts over Apache, JBB, OLTP, Zeus, and AVG. Left: log size in byte/core/kilo-instr (axis 0.0 to 2.0). Right: log size in KB/core/s (axis 0 to 200).]

Page 39: Effective and Inexpensive (Memory) Race Recording

Runtime Overhead

[Figure: two bar charts comparing Baseline and With race recorder for Apache, JBB, OLTP, and Zeus (axes 0 to 100): Execution Time and Interconnection Msg. B/W.]

Our recorder can be “always-on”

Page 40: Effective and Inexpensive (Memory) Race Recording

Benefits of RTR and Set/LRU (Log Size)

[Figure: two bar charts of Log Size (axes 0 to 100) over Apache, JBB, OLTP, Zeus, and AVG. "Improvement by RTR" compares Pairwise-TR with Our RTR. "Effectiveness of Set/LRU" compares Perfect TSM with 24KB Set/LRU TSM.]

Page 41: Effective and Inexpensive (Memory) Race Recording

Why Do RTR and Set/LRU Work Well?

RTR
• Processors execute instructions at similar speeds
• Therefore, we can find “vectorizable” dependencies

Set/LRU
• Temporal locality makes the LRU timestamps old
• We only need to know if a timestamp is “old enough”

Page 42: Effective and Inexpensive (Memory) Race Recording

Sensitivity and Scalability

A design space of the timestamp memory (TSM)
• Size: smaller TSM -> larger log
• Read/write timestamps: should be used when TSM is large
• Partial timestamp: 24 bits are enough
• Associativity: higher is better for RTR

Scalability of the recorder
• Studied with modest processor counts (2p to 16p)
• Commercial workloads, not scientific workloads
• Log size increases slowly with the number of cores

Page 43: Effective and Inexpensive (Memory) Race Recording

Conclusion & My Other Research

Page 44: Effective and Inexpensive (Memory) Race Recording

Race Recording

Race recording → key to combat nondeterminism

My thesis → an effective & inexpensive recorder
• RTR algorithm → small log size
• Coherence piggyback → negligible slowdown
• Timestamp approximation → low hardware cost
• Order-value hybrid → support for SC & TSO

Future work
• Improve race recording algorithm
• Improve race recorder implementation
• Study race replay

Page 45: Effective and Inexpensive (Memory) Race Recording

Serializability Violation Detector [PLDI’05]

Like a race detector

No a priori annotation requirement
• “Critical sections” are inferred

Intended to detect bugs that “actually” happen
• Check for a 2-Phase-Locking condition

[Figure: a “critical section” over shared variables: Read in1, Read in2, Write out1, Write out2, with Write local and Read local]

Page 46: Effective and Inexpensive (Memory) Race Recording

Publications

FDR (ISCA’03)
• Adopted by UCSD BugNet (ISCA’05)

SVD (PLDI’05)
• Cited by Vaziri et al. (POPL’06)
• Influenced new data race definition

RTR, Set/LRU & Hybrid
• Submitted for publication

Page 47: Effective and Inexpensive (Memory) Race Recording

Thank you!

% gdb a.out log
gdb> run
Program received SIGSEGV.
In get() at para-hash.c:67
67 a = bucket->d;

% gcc para-hash.c
% a.out
Segmentation fault
Race recorded in “log”
%

Page 48: Effective and Inexpensive (Memory) Race Recording

Acknowledgements

Joint work with my advisors
• Mark Hill, Ras Bodik

Ph.D. Committee
• David Wood, Mikko Lipasti, Remzi Arpaci-Dusseau, Barton Miller

Multifacet Group
• Milo Martin, Dan Sorin, Carl Mauer, Brad Beckmann, Kevin Moore, Alaa Alameldeen, Mike Marty, Luke Yen

Affiliates & Companies
• Joe Emer, CJ Newburn, Peter Hsu, Bob Zak, Eric Bach, Gang Luo, Alex Chow, IBM, Intel, Microsoft, Sun

Page 49: Effective and Inexpensive (Memory) Race Recording

Deterministic Replay is Useful

Deterministic replay is logically recreating a program execution

Present applications
• Cyclic Debugging ([Pancake & Netzer ’93])
• Fault Tolerance (ExtraVirt [Lucchetti et al. ’05])
• Intrusion Analysis (ReVirt [Dunlap et al. ’02])

Future applications
• Data Recovery
• Replay-based Synchronization

Page 50: Effective and Inexpensive (Memory) Race Recording

Multicore and Multithreading

Multicore is common
• AMD X2
• IBM Power 5/6, Cell
• Intel Pentium D, Core Duo
• Sun SPARC T1

Multithreading is common
• Server: high throughput
• Scientific: high performance
• Desktop/embedded: low response time

Page 51: Effective and Inexpensive (Memory) Race Recording

Race Recording: Key to Determinism

Races: general race & data race [Netzer & Miller]
• Both cause nondeterminism
• Race recording can help, but

Existing race recorders are inadequate
• Some generate large logs
• Some have high runtime overhead
• Some have high hardware cost (space overhead)
• Support only sequential consistency

Need a better race recorder

Page 52: Effective and Inexpensive (Memory) Race Recording

Recording/Replay & Debugging

[Figure: Online Recorder: processors P1 to P4 run through Checkpoint A, Checkpoint B, and Checkpoint C, storing log A, log B, and log C, until a Crash dumps “core”. Deterministic Replayer: after the crash, read Checkpoint B and replay from logs B and C.]

Page 53: Effective and Inexpensive (Memory) Race Recording

Deterministic Replay & Fault Tolerance

Fault Recovery
• Replay after a failure

Fault Detection
• Replay then compare

(Courtesy of VMware)

Page 54: Effective and Inexpensive (Memory) Race Recording

Future: Record/Replay & Undo/Redo

VM as a software platform
• Ease software development
• Fine granularity in Undo and Redo

[Figure: Windows XP]

Page 55: Effective and Inexpensive (Memory) Race Recording

Future: Replay-based Synchronization

Three steps
• Coarse-grain sync. → fine-grain sync. → hardware sync.

Results: higher performance

Works only if static control flow & fixed data addresses
• DSP kernels

[Figure: Recording: ld A, st B, Unlock() in one thread; lock(), st A, ld B in the other. Replay: ld A, st B and st A, ld B ordered by the Log.]

Page 56: Effective and Inexpensive (Memory) Race Recording

Race Recording Related Work

Recorder | Mechanism | Log | Overhead
Bacon ’91 (Hardware) | Bus transactions | Large | Low
RecPlay ’00, JaRec ’04 | Lamport clocks | Small | Low (sync only)
R&C ’90, Déjà Vu ’98 | Scheduling | Small | Low (non-MP)
Bacon ’91 (Hardware) | Bus transaction groups | Large | Low
Instant Replay ’87 | Variable version | Large | High
Netzer ’93 | Vector clocks | Small | High

Total-order recorders: low replay parallelism
Partial-order recorders: high replay parallelism

Page 57: Effective and Inexpensive (Memory) Race Recording

Correctness of Order-Value-Hybrid

Removing WAR dependencies
• Say thread I reads and thread J writes
• Removing the WAR affects I’s read, not J’s write
• But, for every dependence removed, thread I reads the correct value from the value log
• Therefore, all reads get the correct value

Page 58: Effective and Inexpensive (Memory) Race Recording

TR and TSO

TR affects dependencies reduced by a WAR
• The WAR itself may later be removed during replay
• Solution: do not use a WAR in TR if the WAR can be removed
• Respond with a special flag when a loaded cache line is stolen

[Figure: Recording. Thread I and Thread J, instructions numbered 1 to 3, executing st A, st C, st B, st C, and at step 3 ld B and ld A. One dependence is marked "Must not be reduced".]

Page 59: Effective and Inexpensive (Memory) Race Recording

RTR and TSO

The sliding window may expose the ordered loads
• Shrink the sliding window to avoid it

[Figure: Recording. Thread I and Thread J, instructions numbered 1 to 4, executing st A, add, add, sub, then st B and ld A at step 3 and ld C and ld B at step 4. Labels: "ordered in write buffer", "ordered", "old win for j:3", "new win for j:3", "Not allowed by new window".]

Page 60: Effective and Inexpensive (Memory) Race Recording

Deadlock Avoidance of RTR

[Figure: Recording of the Thread I / Thread J example, instructions numbered 1 to 6 per thread (ld A, st B, st C, sub, ld B, add, st C, ld B, st A, st C, ld D, st D), with a Replay Cycle through i:4, j:1, j:2, i:3, i:4]

Avoid deadlock by adhering to an SC total order

Page 61: Effective and Inexpensive (Memory) Race Recording

Recording Race-free Executions

No data races

Only need to record synchronization race

Deterministic replay up until the first data race

Page 62: Effective and Inexpensive (Memory) Race Recording

Replay Parallelism

Replay performance depends on
(1) Number of synchronizations
(2) Extra wait incurred by the synchronizations

Page 63: Effective and Inexpensive (Memory) Race Recording

Directory Protocols

Add sticky states in the directory
• Retain states after writebacks
• Need extra acknowledgements

Or, add extra timestamp memory in the directory
• Helps to avoid extra acknowledgements

A tradeoff
• Sticky states can be cheaper
• But extra timestamp memory can be faster

Page 64: Effective and Inexpensive (Memory) Race Recording

Snooping Protocols

Key problem is the combined/implicit response
• Not a problem for AMD Hammer

Proc I
Tag  State  Data  Timestamp
A    S      …     1
B    M      …     4

Proc J
Tag  State  Data  Timestamp
A    S      …     3
B    I      …     2

[Figure labels: st A; Get/X; Pull Shared + Current IC; WAR detected & logged]

Page 65: Effective and Inexpensive (Memory) Race Recording

Nonsilent Evictions

Proc I
Tag  State  Data  Timestamp
A    S      …     1
B    M      …     4

Proc J
Tag  State  Data  Timestamp
A    S      …     3
B    I      …     2

Directory of A: Shared(J) Owner() StickyS(I,J)

[Figure labels: st A; Get/S; Ack + Timestamp; Timestamp Memory; Eviction; extra cache rows C M … 3 and M … 4]

Directory eviction: more false conflicts, like snooping

Page 66: Effective and Inexpensive (Memory) Race Recording

Out-of-Order & Hardware Prefetching

Speculative execution
• No IC assigned yet

Hardware prefetching
• No IC assigned

Key idea: receive observation
• Can associate a ld/st with the current commit instruction

Page 67: Effective and Inexpensive (Memory) Race Recording

Unordered Messages in Interconnect

Messages arrive out of order

Can affect reduction

But better to add a sequence number
• Reconstruct the message order
• Enable IC compression by sending deltas
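Both bullets can be combined in one small receiver: sequence numbers restore the send order, and once the order is known, each message only needs to carry the IC difference from its predecessor. The class and field names below are hypothetical, and the starting IC of 0 is an assumption made for the sketch.

```python
import heapq

class InOrderReceiver:
    """Reorders messages by sequence number and decodes IC deltas."""

    def __init__(self):
        self.next_seq = 0
        self.pending = []     # min-heap of (seq, ic_delta)
        self.last_ic = 0      # assumed starting IC

    def receive(self, seq, ic_delta):
        """Accept one message; return the ICs that became deliverable, in order."""
        heapq.heappush(self.pending, (seq, ic_delta))
        delivered = []
        while self.pending and self.pending[0][0] == self.next_seq:
            _, delta = heapq.heappop(self.pending)
            self.last_ic += delta
            delivered.append(self.last_ic)
            self.next_seq += 1
        return delivered

r = InOrderReceiver()
print(r.receive(1, 5))    # []  (message 0 has not arrived yet)
print(r.receive(0, 10))   # [10, 15]
print(r.receive(2, 1))    # [16]
```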

Page 68: Effective and Inexpensive (Memory) Race Recording

Integer Overflow

IC and timestamps may overflow

IC: make it 64 bits; it will not overflow for a long time

Timestamps: use approximation techniques
• MSB of IC + LSB of timestamps
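One plausible reading of "MSB of IC + LSB of timestamps" is that only the low bits of each timestamp are stored and the high bits are borrowed from the current IC. The sketch below follows that reading; the function name is invented here, and it assumes the original access happened at or before the current IC and less than 2^bits instructions ago.

```python
def reconstruct(partial, current_ic, bits):
    """Largest value <= current_ic whose low `bits` bits equal `partial`."""
    mask = (1 << bits) - 1
    candidate = (current_ic & ~mask) | partial
    if candidate > current_ic:
        candidate -= 1 << bits    # the low bits wrapped around since the access
    return candidate

print(hex(reconstruct(0x20, 0x1234, 8)))  # 0x1220
print(hex(reconstruct(0x40, 0x1234, 8)))  # 0x1140
```

If the real access is older than the window the stored bits can express, the reconstructed value is wrong, which is why the width of the partial timestamp matters.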

Page 69: Effective and Inexpensive (Memory) Race Recording

Varying TSM Size

[Figure: four plots (Apache, OLTP, SPECjbb, Zeus) of Log Bandwidth (MB/core/second, axis 0 to 3) versus Size of the Timestamp Memory (2 KB to 2048 KB). Each plot has curves 1TS-RTR, 1TS-TR, 2TS-RTR, and 2TS-TR. Configuration: 64 ways, full timestamps, Set/LRU.]

Page 70: Effective and Inexpensive (Memory) Race Recording

Varying Associativity

[Figure: four plots (Zeus, SPECjbb, OLTP, Apache) of Log Bandwidth (MB/core/second, log axis 0.01 to 10) versus Associativity of the Timestamp Memory (2 to 1024). Each plot has curves CurrentIC-RTR, CurrentIC-TR, SetLRU-TR, and SetLRU-RTR. Configuration: 64KB, full R/W timestamps.]

Page 71: Effective and Inexpensive (Memory) Race Recording

Varying Partial Timestamp Width

[Figure: four plots (Zeus, SPECjbb, OLTP, Apache) of Log Bandwidth (MB/core/second, log axis 0.01 to 10) versus Partial Timestamp Width (10 to 30). Each plot has curves TR and RTR. Configuration: 64 sets, 64 ways, Set/LRU.]

Page 72: Effective and Inexpensive (Memory) Race Recording

Log Size Scaling

[Figure: Log Size (MB/core/s, axis 0.0 to 1.0) versus Number of Cores (2, 4, 8, 16) for Apache, SPECjbb, OLTP, and Zeus]

Page 73: Effective and Inexpensive (Memory) Race Recording

In Retrospect …

What are you most proud of?
• RTR improves TR after 13 years

What would you do differently if doing it again?
• “Replaying me is deterministic” (just kidding)
• I wish I had focused on race recording earlier

What should the industry do?
• Implement the recorder as a VMM extension