
1

Recursive Data Structure Profiling

Easwaran Raman, David I. August

Princeton University

2

Motivation

Huge processor-memory performance gap: latency > 100 cycles

Memory operations are a significant fraction of the operations in typical programs

In many applications, Recursive Data Structures (RDS) constitute a large fraction of memory usage

[Figure: CPU vs. DRAM performance by Year, 1980 to 2000, on a logarithmic scale from 1 to 1000]

3

Motivation

Techniques to minimize the performance impact of this gap: caching, prefetching, out-of-order execution

Not very successful for RDS

Difficult to statically determine many RDS properties

Accesses are irregular and usually lie in the critical path of execution

while (valid(node)) {
    // do something with node->data
    node = next(node);
}

Traversal Code

Short loop body prevents efficient OoO execution

[Figure: An RDS layout example, with nodes at addresses 0x1000, 0x2000, 0x3000 and 0x4000]

Non-contiguous layout results in irregular access patterns

4

Motivation

Linearization [Clark76, Luk99]

Speculation recovery costs outweigh the benefits if the next pointer field gets overwritten frequently

Information on the dynamic behavior of the entire RDS is important

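index = 0;
head = pos[index];
while (head) {
    foo(head);
    head = pos[++index];  // take the next node from the linearized array
    check(head);          // validate the speculated node
}

[Figure: list nodes at addresses 1008, 1012, 1004, 1016 and 1000, reached through the pos array, with head marking the current node. Placement of the nodes in the figure corresponds to their placement in memory.]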
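The linearized traversal on this slide can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the scheme from [Clark76] or [Luk99]: the node type and the functions linearize and sum_linearized are hypothetical names, and recovery here simply falls back to ordinary pointer chasing once the array disagrees with the real next pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical list node used only for this sketch. */
typedef struct node {
    int data;
    struct node *next;
} node;

/* Copy the list's nodes into pos[] in traversal order, NULL-terminated.
   Returns the number of nodes stored (at most cap - 1). */
size_t linearize(node *head, node **pos, size_t cap) {
    assert(pos != NULL && cap > 0);
    size_t n = 0;
    while (head != NULL && n + 1 < cap) {
        pos[n++] = head;
        head = head->next;
    }
    pos[n] = NULL;
    return n;
}

/* Sum the data fields, taking each next node from pos[] and checking it
   against the real next pointer. On the first mismatch, count one
   mis-speculation and finish the traversal by pointer chasing. */
long sum_linearized(node **pos, size_t *misses) {
    long sum = 0;
    size_t index = 0;
    node *head = pos[index];
    *misses = 0;
    while (head != NULL) {
        sum += head->data;
        node *predicted = pos[++index];
        if (predicted == head->next) {
            head = predicted;              /* speculation was correct */
        } else {
            (*misses)++;                   /* stale array: recover */
            for (head = head->next; head != NULL; head = head->next)
                sum += head->data;
        }
    }
    return sum;
}
```

If a next pointer is overwritten after linearize has run, each later traversal pays for the check and for the fallback, which is the situation where a count of next-pointer modifications would argue against linearizing that list.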

5

RDS Profile

RDS profiling gives a ‘logical’ understanding of runtime behavior

‘Application creates 100 trees’ instead of ‘application allocates 2MB in heap’

‘Linked list traversed 10 times’ instead of ‘Address 0x10004000 accessed 200 times’

Profile for linearization: next pointer field in list L is modified n times

6

node *tree_create() {
    node *n = (node *)malloc(…);
    …
    n->left = tree_create(…);
    n->right = tree_create(…);
}

RDS Discovery

Assign a unique id to each value returned by malloc and create a node labeled with that id

Connect nodes by a directed edge if both the address and the value of a store have valid ids

[Figure: Dynamic Shape Graph, with node 1 pointing to nodes 2 and 3]

C function for creating a tree

call malloc ; id = 1
mov r10 = r8
…
call tree_create
… call malloc ; id = 2
… mov r11 = r8
store r10[offset1] = r11 ; create 1->2
call tree_create
… call malloc ; id = 3
… mov r12 = r8
store r10[offset2] = r12 ; create 1->3

Execution trace in (pseudo) assembly
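The two discovery rules on this slide can be sketched as a pair of instrumentation hooks. This is only an illustration under stated assumptions: the dsg type, dsg_on_malloc, dsg_on_store and the fixed-size tables are hypothetical, addresses are passed as plain integers, free is ignored, and the linear scan stands in for whatever lookup structure a real profiler would use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_NODES 64
#define NO_ID 0            /* ids start at 1, as on the slide */

/* Hypothetical Dynamic Shape Graph: one node per allocation. */
typedef struct {
    uintptr_t base[MAX_NODES + 1];
    size_t size[MAX_NODES + 1];
    unsigned char edge[MAX_NODES + 1][MAX_NODES + 1];
    int count;
} dsg;

void dsg_init(dsg *g) {
    memset(g, 0, sizeof *g);
}

/* Called with the value returned by malloc: assign the next id. */
int dsg_on_malloc(dsg *g, uintptr_t base, size_t size) {
    assert(g->count < MAX_NODES);
    int id = ++g->count;
    g->base[id] = base;
    g->size[id] = size;
    return id;
}

/* Id of the allocation that contains addr, or NO_ID. */
int dsg_id_of(const dsg *g, uintptr_t addr) {
    for (int id = 1; id <= g->count; id++)
        if (addr >= g->base[id] && addr - g->base[id] < g->size[id])
            return id;
    return NO_ID;
}

/* Called for a store of value to addr: add a directed edge only if
   both the address and the stored value have valid ids.
   Returns 1 if an edge was created, 0 otherwise. */
int dsg_on_store(dsg *g, uintptr_t addr, uintptr_t value) {
    int from = dsg_id_of(g, addr);
    int to = dsg_id_of(g, value);
    if (from == NO_ID || to == NO_ID)
        return 0;
    g->edge[from][to] = 1;
    return 1;
}
```

Replaying the slide's trace through these hooks (three allocations, then stores of the second and third addresses into fields of the first) yields the edges 1->2 and 1->3, while a store of a non-pointer value creates no edge.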

7

RDS Discovery

Multiple RDS instances can be connected together in the DSG!

To separate them, we use properties of the static code

Use another graph called the Static Shape Graph (SSG)

array = malloc(…);

for (i=…) array[i] = create_tree(…);

[Figure: DSG in which array node 1 is connected to tree nodes 2, 3, 4, 5, 6, 7, …]

8

RDS Discovery

For every static call to malloc, create a node with a unique id in the Static Shape Graph (SSG)

If a store creates an edge, connect the corresponding static nodes

Check for strongly connected components (SCCs) in the SSG

Connect two dynamic nodes only if their corresponding static nodes are in the same SCC

[Figure: DSG with nodes 1, 2, 3, 4, 5, 6, 7]

[Figure: SSG with static nodes A and T]

Execution trace in (pseudo) assembly

call malloc ; id = 1
mov r20 = r8
…call malloc ; id = 2
…mov r10 = r8
……
…call tree_create
…… call malloc ; id = 3
…… mov r11 = r8
…store r10[offset1] = r11 ; create 2->3
…call tree_create
…… call malloc ; id = 4
…… mov r12 = r8
…store r10[offset2] = r12 ; create 2->4
store r20[0] = r10 ; create 1->2
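The same-SCC test on this slide can be sketched as below. The slides do not say which SCC algorithm the profiler uses; this sketch substitutes Warshall's transitive closure on a small matrix for a linear-time SCC algorithm, which is enough to answer "are these two static nodes in the same SCC?". The names ssg, ssg_add_edge, ssg_close and ssg_same_scc are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define MAX_SITES 16

/* Hypothetical Static Shape Graph: one node per static malloc call site.
   reach[i][j] starts as the edge matrix and becomes reachability
   after ssg_close(). */
typedef struct {
    unsigned char reach[MAX_SITES][MAX_SITES];
    int sites;
} ssg;

void ssg_init(ssg *g, int sites) {
    assert(sites > 0 && sites <= MAX_SITES);
    memset(g, 0, sizeof *g);
    g->sites = sites;
}

/* Record that some store linked an object from site `from` to one from `to`. */
void ssg_add_edge(ssg *g, int from, int to) {
    assert(from >= 0 && from < g->sites && to >= 0 && to < g->sites);
    g->reach[from][to] = 1;
}

/* Warshall's algorithm: compute the transitive closure in place. */
void ssg_close(ssg *g) {
    for (int k = 0; k < g->sites; k++)
        for (int i = 0; i < g->sites; i++)
            for (int j = 0; j < g->sites; j++)
                if (g->reach[i][k] && g->reach[k][j])
                    g->reach[i][j] = 1;
}

/* Two sites share an SCC if they are the same site or reach each other. */
int ssg_same_scc(const ssg *g, int a, int b) {
    return a == b || (g->reach[a][b] && g->reach[b][a]);
}
```

For the slide's example, the stores inside tree_create add the edge T->T and the store into the array adds A->T. T never reaches A, so the dynamic edge 1->2 is rejected and each tree stays a separate RDS instance, while the edges 2->3 and 2->4 are kept.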

9

Experimental setup

Uses Pin, a dynamic instrumentation tool for Itanium

Mapping between address ranges and dynamic ids is stored in an AVL tree

Most recent mapping is cached

A mix of benchmarks from SPEC, Olden and other pointer-intensive applications

Dynamic instruction count varies from a few million (ks) to over 300 billion (mesa)

All experiments run on a 900MHz Itanium 2 with 2 GB RAM running RH 7.1
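The address-to-id lookup with a cached most recent mapping can be sketched as follows. To keep the example short it substitutes a sorted array with binary search for the AVL tree mentioned on the slide; the one-entry cache follows the slide. The names range_map, map_insert and map_lookup are hypothetical, ranges are assumed not to overlap, and removal on free is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_RANGES 128

typedef struct {
    uintptr_t base;
    size_t size;
    int id;
} range;

/* Hypothetical map from address ranges to dynamic ids. */
typedef struct {
    range r[MAX_RANGES];   /* sorted by base, non-overlapping */
    int count;
    int last;              /* index of the most recent hit, or -1 */
} range_map;

void map_init(range_map *m) {
    m->count = 0;
    m->last = -1;
}

static int range_contains(const range *r, uintptr_t addr) {
    return addr >= r->base && addr - r->base < r->size;
}

/* Insert a range, keeping the array sorted by base address. */
void map_insert(range_map *m, uintptr_t base, size_t size, int id) {
    assert(m->count < MAX_RANGES);
    int i = m->count++;
    while (i > 0 && m->r[i - 1].base > base) {
        m->r[i] = m->r[i - 1];
        i--;
    }
    m->r[i].base = base;
    m->r[i].size = size;
    m->r[i].id = id;
    m->last = -1;          /* indices may have shifted: drop the cache */
}

/* Return the id whose range contains addr, or 0 if there is none.
   The cached entry is tried first, then a binary search. */
int map_lookup(range_map *m, uintptr_t addr) {
    if (m->last >= 0 && range_contains(&m->r[m->last], addr))
        return m->r[m->last].id;
    int lo = 0, hi = m->count - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (range_contains(&m->r[mid], addr)) {
            m->last = mid;
            return m->r[mid].id;
        }
        if (addr < m->r[mid].base)
            hi = mid - 1;
        else
            lo = mid + 1;
    }
    return 0;
}
```

The cache pays off when consecutive memory operations touch the same object, as when several fields of one node are read in a row; how often that holds for a given benchmark is not stated in the slides.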

10

Profiler Performance

Profile: RDS size, lifetime, access count

Memory: <16 MB for all but 3 applications

[Figure: Profiler Performance, showing Slowdown (0 to 40) for each Benchmark]

Baseline: Execution using Pin (~10 times slower than native)

11

RDS usage statistics

SCCs in static shape graph (RDS types): usually a few (<5) per benchmark, a maximum of 31 in parser

#RDS instances (connected components in DSG)

Exhibits a wide range (1 in mcf to around a million in parser)

Tend to be live for long if the program creates only a few of them

Sizes of RDS instances: vary from a single node self-loop (parser) to a few hundred thousand nodes (mcf, parser)

#pointer chasing loads: significant in many benchmarks

Applications show vast diversity in RDS usage

A good reason for profiling them!

12

[Figure: Temporal distribution, plotting # RDS instances (axis scales 0 to 50 and 0 to 14000) against Time (Normalized, 0 to 100) for perlbmk, ks, tree-puzzle, li, gcc, sqlite, parser and twolf]

13

[Figure: Cumulative distribution of RDS lifetimes, plotting % of RDS lifetime (0 to 100) against Time (Normalized, 0 to 100) for gcc, mcf, sqlite, parser, perlbmk, ks, tree-puzzle and twolf]

14

RDS Stability

Stability of an RDS: a notion of how 'array-like' an RDS is

Stability index: an attempt to quantify this notion

Identify the time instances (alteration points) when changes occur to the RDS structure (by stores that replace existing pointers)

Count the traversals between successive alteration points

Stability index = #intervals that account for ‘most’ of the traversals

Lower index means higher stability
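One way to read the stability index definition is sketched below. The slides leave 'most' unquantified, so the percent parameter is an assumption, as is the choice to count the intervals with the most traversals first; stability_index and its signature are hypothetical. The input is the number of traversals observed in each interval between successive alteration points.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator: larger traversal counts first. */
static int by_count_desc(const void *a, const void *b) {
    unsigned long x = *(const unsigned long *)a;
    unsigned long y = *(const unsigned long *)b;
    return (x < y) - (x > y);
}

/* Smallest number of intervals whose traversals add up to at least
   `percent` of all traversals. Returns 0 if there were no traversals
   and -1 if memory could not be allocated. */
int stability_index(const unsigned long *traversals, size_t n,
                    unsigned percent) {
    assert(percent >= 1 && percent <= 100);
    unsigned long long total = 0;
    for (size_t i = 0; i < n; i++)
        total += traversals[i];
    if (total == 0)
        return 0;

    unsigned long *sorted = malloc(n * sizeof *sorted);
    if (sorted == NULL)
        return -1;
    memcpy(sorted, traversals, n * sizeof *sorted);
    qsort(sorted, n, sizeof *sorted, by_count_desc);

    unsigned long long covered = 0;
    int index = 0;
    /* Integer comparison of covered / total against percent / 100. */
    while (covered * 100 < (unsigned long long)percent * total)
        covered += sorted[index++];

    free(sorted);
    return index;
}
```

Under this reading, an RDS whose traversals almost all fall into one interval gets an index of 1, whereas one whose traversals are spread evenly over many intervals gets an index close to the number of intervals, matching the slide's note that a lower index means higher stability.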

15

[Figure: Cumulative distribution of stability index, plotting % of accesses (0 to 100) against Stability Index (0 to 10) for vpr, gcc, mcf, sqlite, parser, perlbmk, ks, tree-puzzle and twolf]

16

Conclusion

Aggressive data structure level optimization techniques for RDS need profile information for improved performance

RDS profiling gives a better understanding of the runtime behavior of RDS

RDS usage varies widely across benchmarks

17

Extra Slides

18

RDS Profiling: Definitions

RDS type: the abstract form of the logical data structure that is manipulated by the program

Examples: list, binary tree, graph, etc.

Can be mutually recursive (nodes point to their incident edges and vice versa to form a graph)

RDS instance: a concrete realization of the RDS type

Example: the tree created in function foo, the list pointed to by the first entry of the hash table
