
Post on 14-Jan-2016


Page 1: Aleksandar Milenković

Algorithms and Data Structures for Unobtrusive Real-time Compression of Instruction and Data Address Traces

Aleksandar Milenković

(collaborative work with Milena Milenković, IBM, and Martin Burtscher, Cornell University)

The LaCASA Laboratory

Electrical and Computer Engineering Department

The University of Alabama in Huntsville

Email: [email protected]

Web: http://www.ece.uah.edu/~milenka

http://www.ece.uah.edu/~lacasa

Page 2: Aleksandar Milenković

Outline

- Program Execution Traces: An Introduction
- Background and Motivation
- Techniques for Trace Compression
- Trace Compressor in Hardware
- Instruction Address Trace Compression
  - Stream Detection
  - Stream Caches
  - N-tuple Compression Using Tuple History Table
- Data Address Trace Compression
- Results
- Conclusions

Page 3: Aleksandar Milenković

Program Execution Traces: An Introduction

- What are they?
  - A stream of recorded events
  - Trace types
    - Basic block traces for control flow analysis
    - Address traces for cache studies (instruction and data addresses)
    - Instruction words for processor studies
    - Operands for arithmetic unit studies
- Who is using traces?
  - Computer architects for evaluation of new architectures
  - Computer analysts for workload characterization
  - Software developers for program tuning, optimization, and debugging
- What are trace issues?
  - Trace collection
  - Trace reduction
  - Trace processing

Page 4: Aleksandar Milenković

Program Execution Traces: An Introduction

#include <stdio.h>

int main(void) {
    int a[100], b[100], c[100];
    int s = 5, sum = 0, i = 0;

    // init arrays
    for (i = 0; i < 100; i++) {
        a[i] = 2;
        b[i] = 3;
    }

    // compute c[] and accumulate the sum
    for (i = 0; i < 100; i++) {
        c[i] = s*a[i] + b[i];
        sum = sum + c[i];
    }

    printf("sum = %d\n", sum);
    return 0;
}

.L11:   mov r1, ip, asl #2
        ldr r2, [r4, r1]
        ldr r3, [lr, r1]
        mla r0, r2, r8, r3
        add ip, ip, #1
        cmp ip, #99
        add r6, r6, r0
        str r0, [r5, r1]
        ble .L11

.L6:    mov r3, ip, asl #2
        str r4, [r5, r3]
        add ip, ip, #1
        cmp ip, #99
        str r1, [lr, r3]
        ble .L6

Page 5: Aleksandar Milenković

Program Execution Traces: An Introduction

for (i = 0; i < 100; i++) {
    c[i] = s*a[i] + b[i];
    sum = sum + c[i];
}

Dinero+ Execution Trace

Type  Instruction Address  Data Address  Instruction
2     0x020001f4                         mov r1, r12, lsl #2
0     0x020001f8           0xbfffbe24    ldr r2, [r4, r1]
0     0x020001fc           0xbfffbc94    ldr r3, [r14, r1]
2     0x02000200                         mla r0, r2, r8, r3
2     0x02000204                         add r12, r12, #1 (1 >>> 0)
2     0x02000208                         cmp r12, #99 (99 >>> 0)
2     0x0200020c                         add r6, r6, r0
1     0x02000210           0xbfffbb04    str r0, [r5, r1]
2     0x02000214                         ble 0x20001f4

Type: 0 = load (data read), 1 = store (data write), 2 = instruction without a data access

Page 7: Aleksandar Milenković

Problem: Traces Are Very Large

- Traces are difficult (expensive) to store, transfer, and use
- How large? An example of tracing
  - Collect instruction and data address traces for a program that runs for 2 minutes on a real machine
  - Assumptions
    - Single-core superscalar processor executing 2 instructions every clock cycle
    - 3 GHz clock rate; 64-bit addresses (8 bytes)
    - Load and store instructions make up 40% of all instructions
  - Trace size: 2*60 s * 3*10^9 * 2 * 1.4 * 8 = 7.3 TBytes (1 T = 2^40)
- That's not all
  - Multiple cores on a single chip
  - More detailed information needed (e.g., time stamps for when an event occurs)
- Need to compress traces
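The trace-size estimate can be reproduced with a few lines of Python. This is only a worked version of the arithmetic on the slide; the function and parameter names are illustrative and not part of the original material.

```python
def trace_size_bytes(seconds, clock_hz, ipc, mem_ref_fraction, addr_bytes):
    """Bytes needed to record every instruction and data address, uncompressed."""
    instructions = seconds * clock_hz * ipc
    # Each instruction contributes one instruction address; a fraction of
    # them (loads and stores) contributes one data address as well.
    addresses = instructions * (1 + mem_ref_fraction)
    return addresses * addr_bytes


# Slide assumptions: 2 minutes, 3 GHz, 2 instructions per cycle,
# 40% loads/stores, 8-byte (64-bit) addresses.
size = trace_size_bytes(2 * 60, 3 * 10**9, 2, 0.4, 8)
print(f"{size / 2**40:.1f} TBytes")  # prints "7.3 TBytes"
```

Changing `ipc` or the core count scales the result linearly, which is the point of the "That's not all" items.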

Page 8: Aleksandar Milenković

Problem: Debugging Is Far From Fun

- Traditional debugging
  - Stop execution and examine the CPU/memory state
  - When to stop? On every instruction? But we have trillions of them for minutes of execution time!
  - Stop on breakpoints to save time, but you may miss a critical state that leads to erroneous task behavior (you do not have the whole history)
  - Difficult, time-consuming, not fun, but you have to do it
- Even more problems
  - When you stop the processor, you perturb the interaction of that processor's task with other processors and I/O devices
  - Often, the very process of looking for a bug makes the bug disappear (we interfere with normal program execution)
  - Problems are amplified in multi-core processors (complex interactions between processors, synchronization)
- Need a cost-effective and unobtrusive tracing mechanism

Page 10: Aleksandar Milenković

Existing Solutions

- What are we looking for?
  - Effective reduction techniques: lossless, high compression ratio, fast decompression
- General-purpose compression algorithms
  - Ziv-Lempel (gzip)
  - Burrows-Wheeler transformation (bzip2)
  - Sequitur
- Trace-specific compression techniques
  - Are better tuned to exploit redundancy in traces
  - Better compression, faster, and can be combined with general-purpose compression algorithms
- Problem: they target software implementations, but we would like real-time, unobtrusive trace compression

Page 11: Aleksandar Milenković

Existing Solutions: Trace-Specific Compression Techniques

[Figure: taxonomy of lossless trace-specific compression techniques, split into instruction traces and instruction + data traces.]

Technique labels in the figure:
- Replacing an execution sequence with its identifier
- Acyclic path (WPP [Larus 1999], Time Stamped WPP [Zhang and Gupta 2001])
- N-tuple [Milenkovic, Milenkovic and Kulick 2003]
- Instruction (PDI [Johnson, Ha and Zaidi 2001])
- Graph with number of repetitions in nodes
- Control flow graph + trace of transitions
- Offset
- Offset + repetitions
- Link data addresses to dynamic basic block
- Link data addresses to loop
- Regenerate addresses
- Abstract execution
- Value predictor

Cited work in the figure:
- Mache [Samples 1989], LBTC [Luo and John 2004]
- QPT [Larus 1993]
- [Hamou-Lhadj and Lethbridge 2002]
- PDATS [Johnson, Ha and Zaidi 2001]
- [Pleszkun 1994], SBC [Milenkovic and Milenkovic 2003]
- [Elnozahy 1999], SIGMA [DeRose, et al. 2002]
- [Eggers, et al. 1990], [Larus 1993]
- VPC [Burtscher and Jeeradit 2003], TCGEN [Burtscher and Sam 2005]

Page 13: Aleksandar Milenković

Trace Compression in Hardware

- How does it work?
  - We propose a set of compression algorithms targeting on-the-fly compression of instruction and data address traces
- How much does it cost?
  - We strive to provide a good compression ratio while minimizing required chip area and the number of pins on the trace port
- Who is going to benefit from it?
  - Software developers who are debugging emerging SoCs (systems-on-a-chip) and multi-core (RISC, DSP) devices
  - Developers/performance analysts of real-time embedded systems
  - Maybe even more advanced uses
- Goals
  - Small on-chip area and small number of pins
  - Real-time compression (never stall the processor)
  - Achieve a good compression ratio

Page 14: Aleksandar Milenković

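Trace Compressor: System Overview

[Figure: system overview. The system under test contains the processor core, memory, and the trace compressor; a trace port connects it to an external trace unit for storing/processing (PC or intelligent drive). Inside the trace compressor, the processor core supplies the program counter, data address, and task switch signals. The stream cache (SC) produces the SCIT and SCMT traces, with a 2nd level compressor on the instruction side. The data address buffer and the data address stride cache (DASC) produce the DT and DMT traces, followed by a data repetitions stage. A block labeled DAPC also appears. All traces pass through the trace output controller to the external unit.]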

Page 16: Aleksandar Milenković

Instruction Address Trace Compression

- How does it work?
  - Detect instruction streams
    - Def.: An instruction stream is a sequential run of instructions, from the target of a taken branch to the first taken branch in the sequence
    - Our previous study showed that the number of unique streams in an application is fairly limited (ACM TOMACS'07)
    - The average number of instructions in an instruction stream is 12 for SPEC CPU2000 integer applications and 117 for SPEC CPU2000 floating-point applications (ACM TOMACS'07)
    - (S.SA, S.SL), the stream starting address and stream length, uniquely identifies an instruction stream
  - Replace an instruction stream with the corresponding stream cache index

Page 17: Aleksandar Milenković

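Stream Detector + Stream Cache

[Figure: stream detector and stream cache. The stream detector compares the PC with the previous PC (PPC); when PC - PPC != 4, it places the pair (SA, SL) into the instruction stream buffer. A stream descriptor S.SA & S.SL taken from the buffer is mapped by F(S.SA, S.SL) to a set (iSet) of the stream cache (SC), which has NSET sets and NWAY ways. Comparing the descriptor with the stored (SA, SL) entries yields hit/miss and iWay. On a hit, the index goes to the stream cache index trace (SCIT); on a miss, the reserved value '00…0' goes to SCIT and (SA, SL) goes to the stream cache miss trace (SCMT).]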

Page 18: Aleksandar Milenković

Detect and Compress an Instruction Stream

Detect a new instruction stream
    Get next PC;
    ndiff = PC - PPC;
    if (ndiff != 4 or SL == MaxS) {
        Place (SA & SL) into the instruction stream buffer;
        SL = 1;
        SA = PC;
    } else SL++;
    PPC = PC;

Compress instruction stream
    Get the next instruction stream record from the instruction stream buffer (S.SA, S.SL);
    Lookup in the stream cache with iSet = F(S.SA, S.SL);
    if (hit)
        Emit (iSet & iWay) to SCIT;
    else {
        Emit reserved value 0 to SCIT;
        Emit stream descriptor (S.SA, S.SL) to SCMT;
        Select an entry (iWay) in the iSet set to be replaced;
        Update stream cache entry: SC[iSet][iWay].Valid = 1;
            SC[iSet][iWay].SA = S.SA; SC[iSet][iWay].SL = S.SL;
    }
    Update stream cache replacement indicators;
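The two procedures above can be sketched in software as follows. This is a minimal model under stated assumptions, not the hardware design: the mapping function `set_index` stands in for F(S.SA, S.SL), LRU replacement is assumed because the slides do not name a policy, and way 0 of set 0 is left unused so that index 0 can act as the reserved miss value. A decompressor is included only to show that the traces are lossless.

```python
def detect_streams(pcs, max_sl=256):
    """Split a PC sequence into (SA, SL) stream descriptors."""
    streams, sa, sl, ppc = [], None, 0, None
    for pc in pcs:
        if sa is None:                       # first instruction opens a stream
            sa, sl = pc, 1
        elif pc - ppc != 4 or sl == max_sl:  # taken branch or length cap
            streams.append((sa, sl))
            sa, sl = pc, 1
        else:
            sl += 1
        ppc = pc
    if sa is not None:
        streams.append((sa, sl))             # flush the stream still open at the end
    return streams


class StreamCache:
    def __init__(self, n_sets=32, n_ways=4):
        self.n_sets, self.n_ways = n_sets, n_ways
        self.entries = [[None] * n_ways for _ in range(n_sets)]
        # Per-set way order, least recently used first (assumed LRU policy).
        self.lru = [list(range(n_ways)) for _ in range(n_sets)]
        self.lru[0].remove(0)                # keep index 0 reserved for misses

    def set_index(self, sa, sl):
        # Hypothetical F(S.SA, S.SL): XOR of address bits and length bits.
        return ((sa >> 6) ^ sl) % self.n_sets

    def lookup(self, sa, sl):
        """Return the SCIT index on a hit; on a miss fill an entry and return 0."""
        i_set = self.set_index(sa, sl)
        order = self.lru[i_set]
        for i_way in order:
            if self.entries[i_set][i_way] == (sa, sl):
                order.remove(i_way)
                order.append(i_way)          # mark as most recently used
                return i_set * self.n_ways + i_way
        if order:                            # set 0 of a 1-way cache has no usable way
            victim = order.pop(0)
            self.entries[i_set][victim] = (sa, sl)
            order.append(victim)
        return 0


def compress(streams, cache):
    scit, scmt = [], []
    for sa, sl in streams:
        index = cache.lookup(sa, sl)
        scit.append(index)
        if index == 0:
            scmt.append((sa, sl))
    return scit, scmt


def decompress(scit, scmt, cache):
    misses, streams = iter(scmt), []
    for index in scit:
        if index == 0:
            sa, sl = next(misses)
        else:
            i_set, i_way = divmod(index, cache.n_ways)
            sa, sl = cache.entries[i_set][i_way]
        cache.lookup(sa, sl)                 # replay the compressor's cache update
        streams.append((sa, sl))
    return streams
```

For example, three iterations of the nine-instruction loop from the trace example yield three identical streams: the first misses and goes to SCMT, the next two are replaced by the same nonzero SCIT index.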

Page 19: Aleksandar Milenković

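Instruction Trace Compression: An Analytical Model (General case with SCIT packing)

Definitions
- SL.Dyn: average stream length (dynamic)
- CR(SC.I): compression ratio for the instruction component
- N: number of instructions
- SC.Hit(N_SET, N_WAYS): stream cache hit rate with N_SET x N_WAYS entries
- A stream cache with N_SET x N_WAYS entries needs log2(N_SET x N_WAYS) bits for each SCIT component

Sizes
- 1 byte for stream length (streams are cut at 256)
- 4 bytes for stream starting address

$$CR(SC.I) = \frac{\text{Size}(\text{Itrace})}{\text{Size}(\text{SCIT}) + \text{Size}(\text{SCMT})}$$

$$\text{Size}(\text{Itrace}) = 4 \cdot N \ \text{bytes}$$

$$\text{Size}(\text{SCIT}) = \frac{N}{SL.Dyn} \cdot \frac{\log_2(N_{SET} \cdot N_{WAYS})}{8} \ \text{bytes}$$

$$\text{Size}(\text{SCMT}) = \frac{N}{SL.Dyn} \cdot (1 - SC.Hit) \cdot 5 \ \text{bytes}$$

$$CR(SC.I) = \frac{4 \cdot SL.Dyn}{\frac{1}{8}\log_2(N_{SET} \cdot N_{WAYS}) + 5 \cdot (1 - SC.Hit)}$$

Limit for a perfect hit rate:

$$\lim_{SC.Hit \to 1} CR(SC.I) = \frac{32 \cdot SL.Dyn}{\log_2(N_{SET} \cdot N_{WAYS})}$$

$$N_{SET} \cdot N_{WAYS} = 64: \quad \lim_{SC.Hit \to 1} CR(SC.I) = 5.33 \cdot SL.Dyn$$

$$N_{SET} \cdot N_{WAYS} = 128: \quad \lim_{SC.Hit \to 1} CR(SC.I) = 4.57 \cdot SL.Dyn$$

$$N_{SET} \cdot N_{WAYS} = 256: \quad \lim_{SC.Hit \to 1} CR(SC.I) = 4 \cdot SL.Dyn$$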
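The model is easy to evaluate numerically. The sketch below encodes the compression-ratio formula from this slide as a Python function; the function name and default arguments are illustrative choices, while the constants (4-byte address, 5-byte miss record, packed SCIT index) come from the slide.

```python
import math


def cr_sc_i(sl_dyn, hit_rate, n_sets, n_ways, addr_bytes=4, miss_bytes=5):
    """Compression ratio of the stream-cache stage for instruction traces."""
    # Packed SCIT index: log2(entries) bits per stream, expressed in bytes.
    index_bytes = math.log2(n_sets * n_ways) / 8
    # Per stream: the uncompressed trace holds sl_dyn addresses; the compressed
    # trace holds one index plus a 5-byte descriptor for every miss.
    return addr_bytes * sl_dyn / (index_bytes + miss_bytes * (1 - hit_rate))


# With a perfect hit rate and 256 entries the ratio is 4 * SL.Dyn.
print(cr_sc_i(sl_dyn=12, hit_rate=1.0, n_sets=64, n_ways=4))
# Hit rate dominates: compare a 90% hit rate with the same geometry.
print(cr_sc_i(sl_dyn=12, hit_rate=0.9, n_sets=64, n_ways=4))
```

The second call shows why the hit rate matters more than the index width: each miss costs 5 bytes, against 1 byte or less for a hit.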

Page 20: Aleksandar Milenković

Instruction Trace Compression: An Analytical Model (General case with SCIT packing)

[Figure: surface plot of Size(SC Itrace) in bits per instruction (axis range 0 to 50) as a function of SL.Dyn (1 to 21) and SC.HitRate (0.0 to 1.0).]

Page 21: Aleksandar Milenković

2nd Level Instruction Address Trace Compression

- Observation: a small number of streams exhibit very strong temporal locality
- Consequences
  - High stream cache hit rates
  - Size(SCIT) >> Size(SCMT)
  - There is a lot of redundancy in the SCIT stream
- How could we exploit this?
  - N-tuple compression using an N-tuple history table

Page 22: Aleksandar Milenković

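N-tuple Compression Using Tuple History Table

[Figure: N-tuple compression. Entries of the SCIT trace are collected in the N-tuple input buffer and compared against the entries (indices 1 to MaxT-1) of the N-tuple history table, which is organized as a FIFO. On a hit, the matching index is written to the TUPLE.HIT trace; on a miss, the reserved index '00…0' is written to the TUPLE.HIT trace and the N-tuple is written to the TUPLE.MISS trace.]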

Page 23: Aleksandar Milenković

N-tuple Compression Using Tuple History Table (THT)

Get the next SCIT;
if (N-tuple incoming stream buffer is full) {
    Lookup in the Tuple History Table (THT);
    if (hit) {
        // emit the first index found in the buffer
        Emit(index in the THT) to the Tuple.Hit trace;
    } else {
        Emit(0) to the Tuple.Hit trace;
        Emit(N-tuple) to the Tuple.Miss trace;
    }
    Update the Tuple History Table;
}
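A software sketch of the tuple history table is shown below. Several details are assumptions made for illustration: the table is modeled as a FIFO with the newest tuple at index 1, every tuple (hit or miss) is pushed into the table, and SCIT entries left over at the end that do not fill a whole tuple are returned unchanged. The defaults of 8 entries per tuple and 255 table entries follow the hardware complexity table later in the talk.

```python
from collections import deque


def tht_compress(scit, n=8, max_t=256):
    """Replace each N-tuple of SCIT indices by its position in the history table."""
    history = deque(maxlen=max_t - 1)      # index 0 is reserved, so MaxT-1 entries
    hit_trace, miss_trace = [], []
    usable = len(scit) - len(scit) % n     # whole tuples only
    for start in range(0, usable, n):
        tup = tuple(scit[start:start + n])
        try:
            # First (most recent) match; +1 because 0 marks a miss.
            hit_trace.append(history.index(tup) + 1)
        except ValueError:
            hit_trace.append(0)
            miss_trace.append(tup)
        history.appendleft(tup)            # oldest entry falls off the far end
    return hit_trace, miss_trace, list(scit[usable:])


def tht_decompress(hit_trace, miss_trace, tail, max_t=256):
    """Rebuild the SCIT sequence by replaying the same table updates."""
    history = deque(maxlen=max_t - 1)
    misses, out = iter(miss_trace), []
    for index in hit_trace:
        tup = next(misses) if index == 0 else history[index - 1]
        history.appendleft(tup)
        out.extend(tup)
    return out + list(tail)
```

A loop that keeps hitting the same few stream cache entries produces the same tuple over and over, so after the first miss each tuple shrinks to a single small index.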

Page 25: Aleksandar Milenković

Data Address Trace Compression

- More challenging task
  - Data addresses rarely stay constant during program execution
  - But they often have a regular stride
- The proposed approach exploits locality of memory-referencing instructions and regularity in data address strides
- Uses a new structure: the Data Address Stride Cache (DASC)

Page 26: Aleksandar Milenković

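Tagless Data Address Stride Cache (DASC)

[Figure: tagless data address stride cache. The PC is mapped by G(PC) to an index that selects one of N entries (0 to N-1), each holding an LDA field and a Stride field. The difference between DA and LDA (labeled LDA-DA in the figure) is compared with the stored stride to produce Stride.Hit. On a hit, '1' is written to DT (data trace); on a miss, '0' is written to DT and the data address is written to DMT (data miss trace).]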

Page 27: Aleksandar Milenković

Data Address Compression: Tagless DASC

// Compress data address stream
Get the next pair from data buffers (PC, DA);
Lookup in the data address stride cache: iSet = G(PC);
cStride = DA - DASC[iSet].LDA;
if (cStride == DASC[iSet].Stride) {
    Emit('1') to DT;    // 1-bit info
} else {
    Emit('0') to DT;
    Emit DA to DMT;
    DASC[iSet].Stride = lsb(cStride);
}
DASC[iSet].LDA = DA;
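The tagless DASC can be modeled in a few lines. In this sketch the index function `index` stands in for G(PC) and is a guess, and `lsb(cStride)` is modeled as keeping the low 8 bits as a signed value, which matches the 5 bytes per entry (4-byte address plus 1-byte stride) listed in the hardware complexity table. A decompressor that replays the same updates is included to show that the DT and DMT traces, together with the PCs recovered from the instruction trace, are enough to rebuild the addresses.

```python
def to_int8(x):
    """Keep the low 8 bits of x, interpreted as a signed value."""
    return ((x + 128) & 0xFF) - 128


class TaglessDASC:
    def __init__(self, n_entries=1024):
        self.n = n_entries
        self.lda = [0] * n_entries       # last data address per entry
        self.stride = [0] * n_entries    # last stride per entry (signed 8-bit)

    def index(self, pc):
        return (pc >> 2) % self.n        # hypothetical G(PC)

    def update(self, i, da, hit):
        if not hit:
            self.stride[i] = to_int8(da - self.lda[i])
        self.lda[i] = da


def dasc_compress(refs, dasc):
    """refs is a sequence of (PC, DA) pairs; returns the DT bits and DMT addresses."""
    dt, dmt = [], []
    for pc, da in refs:
        i = dasc.index(pc)
        hit = (da - dasc.lda[i]) == dasc.stride[i]
        dt.append(1 if hit else 0)
        if not hit:
            dmt.append(da)
        dasc.update(i, da, hit)
    return dt, dmt


def dasc_decompress(pcs, dt, dmt, dasc):
    misses, das = iter(dmt), []
    for pc, bit in zip(pcs, dt):
        i = dasc.index(pc)
        da = dasc.lda[i] + dasc.stride[i] if bit else next(misses)
        dasc.update(i, da, bool(bit))
        das.append(da)
    return das
```

Because a hit requires the full stride to equal the stored 8-bit value, strides outside -128..127 never hit in this model; they simply cost a DMT entry each time.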

Page 28: Aleksandar Milenković

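Tagless DASC Compression Ratio: An Analytical Model

Definitions
- N_memref: number of memory-referencing instructions
- DASC.AddressHit: DASC hit rate (for the tagless DASC, the fraction of references whose stride matches)
- Sizes: 4-byte data address; 1 bit (0.125 byte) per reference in DT

$$CR(SC.D) = \frac{\text{Size}(\text{Dtrace})}{\text{Size}(\text{DT}) + \text{Size}(\text{DMT})}$$

$$\text{Size}(\text{Dtrace}) = 4 \cdot N_{memref} \ \text{bytes}$$

$$\text{Size}(\text{DT}) + \text{Size}(\text{DMT}) = N_{memref} \cdot [(1 - DASC.AddressHit) \cdot 4 + 0.125] \ \text{bytes}$$

$$CR(SC.D) = \frac{1}{1.03125 - DASC.AddressHit}$$

$$\lim_{DASC.AddressHit \to 1} CR(SC.D) = \frac{1}{0.03125} = 32$$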
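As a quick numerical check of the model, the function below evaluates the compression ratio for a given hit rate. The function name and parameters are illustrative; the constants (4-byte address, 1 flag bit per reference) are the ones stated on the slide.

```python
def cr_dasc(hit_rate, addr_bytes=4, flag_bits=1):
    """Compression ratio of the tagless DASC stage for data address traces."""
    # Every reference costs one flag bit; every miss also costs a full address.
    compressed_per_ref = addr_bytes * (1 - hit_rate) + flag_bits / 8
    return addr_bytes / compressed_per_ref


for hit in (0.0, 0.5, 0.9, 1.0):
    print(f"hit rate {hit:.1f}: CR = {cr_dasc(hit):.2f}")
```

With a hit rate of 0 the ratio drops slightly below 1, since the flag bit is pure overhead; the upper bound of 32 is reached only when every reference hits.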

Page 29: Aleksandar Milenković

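2nd Level Data Address Trace Compression

[Figure: data repetition detector. Each DT byte is compared with the previous one (Prev.DT), and a counter (CNT) tracks repetitions. The outputs are the data header (DH) and the data repetition trace (DRT).]

// Detect data repetitions
Get next DT byte;
if (DT == Prev.DT) CNT++;
else {
    if (CNT == 0) {
        Emit Prev.DT to DRT;
        Emit '0' to DH;
    } else {
        Emit (Prev.DT, CNT) pair to DRT;
        Emit '1' to DH;
        CNT = 0;    // start counting the next run
    }
    Prev.DT = DT;
}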
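The repetition detector is a form of run-length encoding over DT bytes. The sketch below follows the pseudocode, with a few assumptions added so that it runs on a finite input: the first byte initializes Prev.DT, the last run is flushed at the end of the input, and CNT is not capped at a fixed width as a hardware counter would be.

```python
def _emit(prev, cnt, dh, drt):
    """Write one run: a lone byte (header 0) or a (byte, count) pair (header 1)."""
    if cnt == 0:
        dh.append(0)
        drt.append(prev)
    else:
        dh.append(1)
        drt.extend((prev, cnt))


def drt_compress(dt_bytes):
    dh, drt = [], []
    prev, cnt = None, 0                  # cnt counts repeats after the first byte
    for b in dt_bytes:
        if prev is None:
            prev = b
        elif b == prev:
            cnt += 1
        else:
            _emit(prev, cnt, dh, drt)
            prev, cnt = b, 0
    if prev is not None:
        _emit(prev, cnt, dh, drt)        # flush the final run
    return dh, drt


def drt_decompress(dh, drt):
    out, pos = [], 0
    for flag in dh:
        value = drt[pos]
        cnt = drt[pos + 1] if flag else 0
        pos += 2 if flag else 1
        out.extend([value] * (cnt + 1))
    return out
```

Long runs of 0xFF bytes, which arise when every reference in a stretch of the data trace is a stride hit, collapse to a single (byte, count) pair.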

Page 31: Aleksandar Milenković

Experimental Evaluation

- Goals
  - Assess the effectiveness of the proposed algorithms
  - Explore the feasibility of the proposed hardware implementations
- Workload
  - 16 MiBench benchmarks

IC: instruction count; NUS: number of unique streams; maxSL: maximum stream length; SL.Dyn: average dynamic stream length

Benchmark                  IC    NUS  maxSL  SL.Dyn
cjpeg             104,607,812   1636    239   10.89
djpeg              23,391,628   1324    206   21.81
lame            1,285,111,635   3410    252   27.81
tiff2bw           143,254,646   1058     43   12.79
tiff2rgba         151,691,275   1146     75   27.54
tiffmedian        541,260,067   1431     75   22.22
tiffdither        832,951,018   1831     51   12.57
mad               286,974,899   1659   1055   20.09
sha               140,885,982    495     62   15.15
bf_e              544,053,846    413    300    5.85
rijndael_e        319,977,971    542    254   18.94
ghostscript       708,090,638   6900    187    8.70
rsynth            824,942,227   1323    180   15.77
stringsearch        3,675,745    439     62    5.61
adpcm_c           732,513,651    347     71   54.63
gsm_d           1,299,270,245    845    401   11.07

Page 32: Aleksandar Milenković

SC Hit Rate (NSETS <= 16)

Stream Cache Hit Rate (%)

(sets x ways); n = log2(sets x ways); XOR mapping: S.SA<5+n:6> xor S.SL<n-1:0>

Benchmark          8x1     8x2     8x4     8x8    16x1    16x2    16x4    16x8
cjpeg            63.39   87.85   95.10   98.96   84.74   93.27   98.61   99.77
djpeg            74.99   86.98   90.46   96.26   78.82   90.29   94.84   98.72
lame             42.20   68.40   86.35   93.88   57.63   81.06   92.45   95.57
tiff2bw          96.77   96.82   96.86   97.26   96.80   96.93   97.52   99.04
tiff2rgba        69.97   92.36   92.80   96.07   70.68   93.18   95.78   99.97
tiffmedian       93.67   97.24   97.38   97.91   94.04   97.43   98.05   99.13
tiffdither       33.55   64.97   72.87   84.97   65.29   72.94   85.18   95.18
mad              25.12   63.12   78.33   96.68   56.46   75.38   96.49   97.92
sha              85.13   86.23   91.11   99.82   86.77   93.29   97.14   99.99
bf_e             22.86   54.05   97.24   99.17   58.05   77.17   99.25   99.98
rijndael_e       31.29   34.89   40.87   89.91   33.68   40.29   78.59   99.78
ghostscript       5.52   13.31   27.87   68.01   13.51   29.93   60.95   97.66
rsynth           34.62   61.45   82.22   90.50   55.58   79.09   89.31   95.13
stringsearch     61.51   64.94   72.66   82.03   64.38   72.30   80.96   96.86
adpcm_c          99.29   99.69  100.00  100.00   99.79  100.00  100.00  100.00
gsm_d            47.66   67.17   97.11   97.92   61.34   79.65   97.89   98.60
average          55.47   71.22   82.45   93.08   67.35   79.51   91.44   98.33
harmean          31.14   54.39   71.93   91.84   52.45   71.15   89.53   98.28

Page 33: Aleksandar Milenković

SC Hit Rate (NSETS >= 32)

Stream Cache Hit Rate (%)

(sets x ways); n = log2(sets x ways); XOR mapping: S.SA<5+n:6> xor S.SL<n-1:0>

Benchmark         32x1    32x2    32x4    32x8    64x1    64x2    64x4   128x1   128x2   256x1
cjpeg            89.31   97.34   99.34   99.94   90.98   98.13   99.56   92.22   98.64   92.47
djpeg            86.32   94.18   97.52   99.69   92.81   96.53   99.40   94.09   97.84   95.06
lame             71.17   89.63   95.40   96.55   78.05   94.05   96.38   80.25   95.02   81.43
tiff2bw          97.00   97.72   98.86   99.88   97.63   98.54   99.56   98.14   99.13   98.47
tiff2rgba        71.31   94.89   98.75   99.97   73.08   97.36   99.69   75.45   99.11   76.29
tiffmedian       95.08   98.07   98.96   99.82   98.04   98.79   99.72   98.39   99.34   98.70
tiffdither       73.82   84.85   94.36   98.43   79.55   91.50   97.64   86.45   95.26   86.61
mad              67.26   90.09   98.01   98.94   76.48   94.13   98.90   83.40   95.14   87.35
sha              91.12   96.59   99.99   99.99   92.79   98.25   99.99   96.63   99.96   96.65
bf_e             58.93   77.62   99.28   99.99   70.84   99.35   99.99   71.02   99.36   71.20
rijndael_e       40.88   67.23   99.71  100.00   58.90   80.57   99.92   67.34   96.24   72.22
ghostscript      26.18   56.64   91.29   99.22   47.02   81.14   99.05   62.24   88.75   65.11
rsynth           72.93   89.45   95.14   99.38   84.33   93.00   98.74   91.14   96.11   93.16
stringsearch     70.28   80.36   90.98   99.81   76.98   88.51   97.52   82.10   93.88   84.88
adpcm_c          99.79  100.00  100.00  100.00   99.80  100.00  100.00  100.00  100.00  100.00
gsm_d            72.40   97.89   98.57   99.44   74.77   98.46   99.30   75.09   99.00   76.09
average          73.99   88.28   97.26   99.44   80.75   94.27   99.08   84.62   97.05   85.98
harmean          65.60   85.57   97.08   99.43   77.98   93.57   99.06   83.52   96.82   85.17

Page 34: Aleksandar Milenković

SC Compression Ratio (NSETS <= 16)

Benchmark                    8x1     8x2     8x4     8x8    16x1    16x2    16x4    16x8
cjpeg                      19.75   39.33   50.07   54.32   34.49   45.29   53.16   49.13
djpeg                      53.67   75.81   79.17   93.13   55.97   78.59   86.58   92.94
lame                       34.08   53.49   85.10  105.36   42.49   70.78   98.68  101.47
tiff2bw                    95.41   77.64   65.45   57.69   77.54   65.73   58.54   55.45
tiff2rgba                  58.70  124.89  111.86  116.39   56.03  114.06  114.65  125.64
tiffmedian                128.52  139.29  117.62  104.02  111.40  117.97  104.89   96.80
tiffdither                 13.60   22.33   25.38   33.49   22.49   25.42   33.73   45.06
mad                        19.28   33.89   46.50   86.70   29.67   42.80   85.82   81.15
sha                        54.19   51.00   56.66   79.87   52.18   63.08   67.86   69.24
bf_e                        5.51    8.33   30.54   29.43    8.97   13.19   29.58   26.60
rijndael_e                 19.88   20.17   21.15   60.39   19.85   20.98   41.61   85.52
ghostscript                 6.83    7.20    8.23   14.82    7.22    8.43   12.88   35.09
rsynth                     17.31   25.98   41.65   51.48   23.18   37.76   49.10   56.39
stringsearch                9.76    9.96   11.26   13.61    9.84   11.16   13.18   21.74
adpcm_c                   532.10  423.73  349.56  291.32  427.95  349.56  291.31  249.71
gsm_d                      14.80   20.68   57.54   51.84   18.20   26.96   51.77   46.87
Average                    67.71   70.86   72.36   77.74   62.34   68.24   74.58   77.43
Harmonic mean              18.06   23.62   32.64   44.23   22.54   28.67   41.42   54.59
Weighted harmonic mean     16.33   22.15   34.40   47.07   21.10   28.02   44.12   57.43

Page 35: Aleksandar Milenković

SC Compression Ratio (NSETS > 16)

Benchmark                   32x1    32x2    32x4    32x8    64x1    64x2    64x4   128x1   128x2   256x1
cjpeg                      37.56   49.34   47.98   43.42   36.27   44.98   42.61   34.47   40.79   31.64
djpeg                      66.67   83.80   87.35   85.92   78.66   83.24   84.73   74.56   78.75   69.99
lame                       53.84   87.70  100.69   94.89   60.23   94.89   94.22   59.74   89.06   57.69
tiff2bw                    66.04   59.23   54.91   50.86   58.92   53.99   50.08   52.87   49.03   47.52
tiff2rgba                  53.49  109.55  117.53  110.00   52.56  109.38  108.47   52.39  105.44   50.41
tiffmedian                102.06  104.99   95.91   88.09  104.82   95.04   87.65   93.02   86.08   83.45
tiffdither                 26.00   33.36   43.45   46.61   28.37   38.68   44.97   32.38   40.64   30.12
mad                        35.12   63.77   81.52   75.42   41.24   67.99   75.28   46.59   63.90   48.66
sha                        56.69   65.83   69.24   60.59   54.57   62.98   60.59   58.09   60.47   51.91
bf_e                        8.70   12.46   25.57   23.29   10.55   25.68   23.29   10.03   22.58    9.55
rijndael_e                 21.15   31.72   85.17   75.74   27.01   41.02   75.44   30.20   63.78   31.71
ghostscript                 8.07   11.93   26.57   33.52   10.24   19.15   33.25   12.60   22.29   12.69
rsynth                     31.88   49.36   56.42   61.17   41.12   51.48   59.32   47.85   52.80   47.00
stringsearch               10.63   12.95   16.92   22.23   11.80   15.48   19.96   12.68   17.18   12.78
adpcm_c                   343.85  291.31  249.71  218.50  287.50  249.71  218.50  249.70  218.50  218.49
gsm_d                      22.09   51.75   46.79   43.07   22.01   46.52   42.78   20.89   42.17   20.17
Average                    58.99   69.94   75.36   70.83   57.87   68.76   70.07   55.50   65.84   51.49
Harmonic mean              24.68   35.11   50.03   51.34   28.19   43.90   50.05   29.55   44.82   28.68
Weighted harmonic mean     23.88   36.89   54.14   54.24   27.54   47.57   53.60   28.95   47.81   28.05

Page 36: Aleksandar Milenković

Findings about SC Size/Organization

- SC with 128 entries
  - CR(32x4) = 54.139, CR(16x8) = 57.427
  - 32x4 is a reasonable choice (call it MAX)
- SC with 256 entries
  - CR(64x4) = 53.6
- But even smaller SCs work very well
  - 64 entries: CR(8x8) = 47.068, CR(16x4) = 44.116
  - 16 entries: CR(8x2) = 22.145
- Associativity
  - Higher is better for very small SCs (direct mapped is not an option)
  - Less important for larger SCs

SC.Hit (%) by number of entries and ways

Entries    1 way  2 ways  4 ways  8 ways
8          55.47   59.67   61.06   59.54
16         67.35   71.22   74.58   73.60
32         73.99   79.51   82.45   82.82
64         80.75   88.28   91.44   93.08
128        84.62   94.27   97.26   98.33
256        85.98   97.05   99.08   99.08

CR(SC.I) by number of entries and ways

Entries    1 way  2 ways  4 ways  8 ways
8          16.33   17.59   16.99   15.79
16         21.10   22.15   27.81   26.61
32         23.88   28.02   34.40   33.96
64         27.54   36.89   44.12   47.07
128        28.95   47.57   54.14   57.43
256        28.05   47.81   53.60   54.24

Page 37: Aleksandar Milenković

SC + N-tuple Compression Ratio

Benchmark           SC.I   SC.I+NT      I.GZ      I.GZ      I.GZ     I.BZ2    I.GZGZ
Setting                                 DEF.      FAST      BEST      BEST      DEF.
cjpeg              47.98    147.56    109.58     54.53    124.45    341.96    265.66
djpeg              87.35    188.53     71.78     39.85     73.70    201.98    232.47
lame              100.68    158.10     60.46    128.53    333.88     87.61    174.24
tiff2bw            54.91    235.05    114.11     83.94    114.42    376.83    615.21
tiff2rgba         117.53    407.14    121.30     20.26    121.98    529.62   1292.74
tiffmedian         95.91    414.37    152.81     92.32    155.47    472.93   1017.48
tiffdither         43.45     65.48     91.09     46.35     99.84    170.88    147.09
mad                81.52    177.84     73.46     37.82     78.52     94.31    206.25
sha                69.24    440.35    211.43     54.42    221.75    656.53   4112.08
bf_e               25.57     98.46    170.38     40.95    182.25    352.02   4065.94
rijndael_e         85.17    454.63    143.82     12.56    150.62    141.77   2392.86
ghostscript        26.57     50.91    100.64     39.68    111.24    212.54    434.53
rsynth             56.42     91.83     46.71     30.61     48.02    143.22    191.22
stringsearch       16.92     24.22     82.06     32.34    100.63    202.47    132.76
adpcm_c           249.71   1583.96    233.12    107.34    233.63   1862.63  12764.68
gsm_d              46.79    174.57     85.37     59.22     87.17    165.58    507.06
TOTAL              54.14    125.90     87.45     47.24    112.91    171.97    321.58

Page 38: Aleksandar Milenković

DASC Compression

Benchmark         DASC    DASC    DASC    DASC    DASC    DASC    D.GZ    D.GZ    D.GZ   D.BZ2  D.GZGZ
Entries/setting     32      64     128     256     512    1024    DEF.    FAST    BEST
cjpeg             3.35    4.60    5.14    5.77    6.54    7.11    5.98    4.50    6.11   18.20    9.57
djpeg             2.81    3.57    4.28    4.96    5.22    5.29    4.22    3.78    4.22    8.62    4.92
lame              1.20    1.52    2.81    3.82    4.49    4.88    6.56    4.01    6.63    8.80    8.60
tiff2bw          76.31   78.04   84.28  105.04  128.84  134.23    2.14    2.55    2.10   14.28    3.07
tiff2rgba         5.98   79.81   91.24  107.49  127.05  139.57    2.10    2.79    2.09    4.06    4.03
tiffmedian        8.64    8.70    8.74    8.81    8.87    8.89    4.40    4.37    4.53   11.16    6.03
tiffdither        2.61    6.08    7.21    8.69    9.65   10.06    4.51    4.41    4.51    7.87    6.77
mad               1.30    1.59    1.96    2.07    2.35    2.64    4.08    3.60    4.22   13.47    6.97
sha               6.58    7.94    9.38   10.79   11.36   11.36   44.91    8.36   45.61  172.71  591.69
bf_e              1.58    1.95    2.38    2.61    2.75    2.91    7.58    4.86    7.83   16.35    9.08
rijndael_e        1.10    1.10    1.10    1.13    1.29    2.06    4.24    3.22    4.27    7.31    4.49
ghostscript       1.07    1.19    1.56    2.19    2.93    5.27   27.21   18.58   27.46   47.42   40.83
rsynth            1.22    1.36    1.76    3.81    8.30   32.43   24.44   21.46   25.27   57.40   43.88
stringsearch      1.80    2.04    2.70    4.13    4.44    5.16   11.12    8.57   11.23   15.03   11.47
adpcm_c           3.13    3.13    3.13    3.13    3.13    3.13    6.57    3.64    7.15   12.27   11.42
gsm_d             2.67    4.48   11.30   13.60   14.81   16.78   21.60   18.05   23.29   63.53   33.15
TOTAL             1.66    2.04    2.80    3.77    4.67    6.12    6.78    5.51    6.90   13.29    9.70

Page 39: Aleksandar Milenković

Hardware Complexity Estimation

- CPU model
  - In-order, XScale-like
  - Vary SC and DASC parameters
- SC and DASC timings
  - SC: hit latency = 1 cc, miss latency = 2 cc
  - DASC: Hit.Hit = 2 cc (address hit, stride hit), Hit.Miss = 3 cc (address hit, stride miss), Miss = 2 cc (address miss)
- To avoid any stalls
  - Instruction stream input buffer: MIN = 2 entries (will go up with a more aggressive CPU model)
  - Data address input buffer: MIN = 8 entries (will go up with a more aggressive CPU model)
- Results are relatively independent of SC and DASC organization

Page 40: Aleksandar Milenković

Hardware Complexity Estimation

Component                        Entries  Complexity     Bytes
Instruction stream buffer        2        2x5            10
Stream detector                  2        2x4            8
Stream cache                     32x4     128x5          640
N-tuple history buffer           255      255x8*(7/8)    1785
Data address buffer              8        8x8            64
Data address stride cache        1024     1024x5         5120
Data repetitions state machine   -        2              2

Page 42: Aleksandar Milenković

Conclusions

- Algorithms for instruction and data address trace compression that enable the following:
  - real-time trace compression
  - low complexity (small structures, small number of external pins)
  - excellent compression ratio
- Proposed mechanism
  - Stream caches + N-tuple for instruction traces
  - Data address stride cache + data repetitions for data address traces
- Analytical and simulation analysis focusing on
  - Compression ratio (bits/instruction)
  - Optimal sizing/organization of the structures
- Findings
  - The proposed base mechanism outperforms the FAST GZ software implementation with relatively small structures (32x4 SC, 1024x1 DASC)
  - It performs as well as the DEFAULT GZ software implementation when N-tuple and data repetitions are included