1
Recursive Data Structure Profiling
Easwaran Raman, David I. August
Princeton University
2
Motivation
Huge processor-memory performance gap
  Latency > 100 cycles
  Significant fraction of memory operations in typical programs
In many applications, Recursive Data Structures (RDS) constitute a large fraction of memory usage
[Figure: CPU vs. DRAM performance by year, 1980 to 2000, log scale from 1 to 1000]
3
Motivation
Techniques to minimize the performance impact of this gap
  Caching, prefetching, out-of-order execution
Not very successful for RDS
  Difficult to statically determine many RDS properties
  Accesses are irregular and usually lie on the critical path of execution
Traversal code:
while (valid(node)) {
    // do something with node->data
    node = next(node);
}
Short loop body prevents efficient out-of-order (OoO) execution
[Figure: An RDS layout example, with nodes at addresses 0x1000, 0x2000, 0x3000 and 0x4000]
Non-contiguous layout results in irregular access patterns
4
Motivation
Linearization [Clark76, Luk99]
  Speculation recovery costs outweigh the benefits if the next pointer field gets overwritten frequently
Information on the dynamic behavior of the entire RDS is important
Linearized traversal:
index = 0;
head = pos[index];
while (head) {
    foo(head);
    head = pos[++index];   // pre-increment, so the loop advances past pos[0]
    check(head);
}
[Figure: list nodes reached through head and the pos array, at addresses 1008, 1012, 1004, 1016 and 1000]
Placement of the nodes in the figure corresponds to their placement in memory
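A minimal C sketch of the linearized traversal above, under stated assumptions: the `node` type, the function name `sum_linearized`, and the recovery policy (stop speculating after the first miss and chase pointers for the rest of the list) are illustrative and not taken from the slides. `pos` is assumed to hold the nodes in list order and to end with a NULL entry.

```c
#include <stddef.h>

/* Hypothetical list node; the slides do not define one. */
typedef struct node {
    int data;
    struct node *next;
} node;

/*
 * Sum the list that starts at pos[0], speculating that pos[i + 1] is the
 * successor of pos[i]. pos must end with a NULL entry. *missed is set to 1
 * if the speculation failed at some point, 0 otherwise.
 */
int sum_linearized(node *const *pos, int *missed)
{
    size_t index = 0;
    int speculating = 1;
    int sum = 0;
    node *head = pos[index];

    *missed = 0;
    while (head != NULL) {
        sum += head->data;                    /* stands in for foo(head) */
        node *guess = speculating ? pos[++index] : head->next;
        if (speculating && guess != head->next) {
            /* stands in for check(head): the next field was overwritten */
            speculating = 0;
            *missed = 1;
            guess = head->next;               /* recover via the real pointer */
        }
        head = guess;
    }
    return sum;
}
```

With a list a -> b -> c and `pos = {&a, &b, &c, NULL}`, the call returns the sum with `*missed == 0`. After the next pointers are rewritten so the order becomes a -> c -> b, the same `pos` still gives the right sum, but only through the recovery path.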
5
RDS Profile
RDS profiling gives a ‘logical’ understanding of runtime behavior
  ‘Application creates 100 trees’ instead of ‘application allocates 2MB in heap’
  ‘Linked list traversed 10 times’ instead of ‘Address 0x10004000 accessed 200 times’
Profile for linearization: next pointer field in list L is modified n times
6
RDS Discovery
Assign a unique id to each value returned by malloc and create a node labeled by that id
Connect nodes by a directed edge if both the address and the value of a store have valid ids
C function for creating a tree:
node *tree_create() {
    node *n = (node *)malloc(…);
    …
    n->left = tree_create(…);
    n->right = tree_create(…);
}
Execution trace in (pseudo) assembly:
call malloc                   ; id = 1
mov r10 = r8
…
call tree_create
…
    call malloc               ; id = 2
    …
    mov r11 = r8
store r10[offset1] = r11      ; create 1->2
call tree_create
…
    call malloc               ; id = 3
    …
    mov r12 = r8
store r10[offset2] = r12      ; create 1->3
[Figure: Dynamic Shape Graph (DSG) with edges 1->2 and 1->3]
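The two discovery rules can be sketched in C as a pair of callbacks, one per malloc and one per store. This is an illustration only: the type and function names (`dsg`, `dsg_on_malloc`, `dsg_on_store`), the fixed capacities, and the linear lookup are assumptions, and freed blocks are not handled.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_NODES 64    /* hypothetical capacity */
#define MAX_EDGES 256   /* hypothetical capacity */

/* One tracked heap block: address range [start, start + size) and its id. */
typedef struct { uintptr_t start; size_t size; int id; } block;
typedef struct { int from, to; } edge;

typedef struct {
    block blocks[MAX_NODES];
    int nblocks;
    edge edges[MAX_EDGES];
    int nedges;
} dsg;

/* Return the id of the block containing addr, or 0 if there is none. */
int dsg_lookup(const dsg *g, uintptr_t addr)
{
    for (int i = 0; i < g->nblocks; i++)
        if (addr >= g->blocks[i].start &&
            addr - g->blocks[i].start < g->blocks[i].size)
            return g->blocks[i].id;
    return 0;
}

/* Rule 1: each value returned by malloc gets a fresh id (ids start at 1). */
int dsg_on_malloc(dsg *g, uintptr_t start, size_t size)
{
    if (g->nblocks == MAX_NODES)
        return 0;
    int id = g->nblocks + 1;
    g->blocks[g->nblocks++] = (block){ start, size, id };
    return id;
}

/* Rule 2: a store adds an edge only if both address and value have ids. */
int dsg_on_store(dsg *g, uintptr_t addr, uintptr_t value)
{
    int from = dsg_lookup(g, addr);
    int to = dsg_lookup(g, value);
    if (from == 0 || to == 0 || g->nedges == MAX_EDGES)
        return 0;
    g->edges[g->nedges++] = (edge){ from, to };
    return 1;
}
```

Replaying the trace above with made-up addresses (three 16-byte blocks at 0x1000, 0x2000 and 0x3000, then stores of the second and third block addresses into the first) produces the edges 1->2 and 1->3. A store of a non-pointer value such as 42 adds no edge.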
7
RDS Discovery
Multiple RDS instances can be connected together in the DSG!
To separate them, we use properties of the static code
  Use another graph called the Static Shape Graph (SSG)
array = malloc(…);
for (i=…)
    array[i] = create_tree(…);
[Figure: DSG with nodes 1, 2, 3, 4, 5, 6, 7, …]
8
RDS Discovery
For every static call to malloc, create a node with unique id in the Static Shape Graph (SSG)
If a store creates an edge, connect the corresponding static nodes
Check for SCCs in the SSG
Connect two dynamic nodes only if their corresponding static nodes are in the same SCC
[Figure: DSG with nodes 1, 2, 3, 4, 5, 6, 7]
[Figure: SSG with nodes A and T]
Execution trace in (pseudo) assembly:
call malloc                       ; id = 1
mov r20 = r8
… call malloc                     ; id = 2
… mov r10 = r8
… …
… call tree_create
… … call malloc                   ; id = 3
… … mov r11 = r8
… store r10[offset1] = r11        ; create 2->3
… call tree_create
… … call malloc                   ; id = 4
… … mov r12 = r8
… store r10[offset2] = r12        ; create 2->4
store r20[0] = r10                ; create 1->2
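A small C sketch of the SCC filter, assuming the SSG has only a handful of static malloc sites. The slides do not say how SCCs are computed. Here, Warshall's transitive closure is used as a stand-in: two sites are in the same SCC when each reaches the other. All names (`ssg`, `ssg_close`, `keep_dynamic_edge`, `site_of`) are hypothetical.

```c
#define MAX_SITES 16   /* hypothetical cap on static malloc sites */

typedef struct {
    unsigned char edge[MAX_SITES][MAX_SITES];   /* SSG adjacency matrix */
    unsigned char reach[MAX_SITES][MAX_SITES];  /* reflexive transitive closure */
    int nsites;
} ssg;

/* Record that a store linked a block from site `from` to one from site `to`. */
void ssg_add_edge(ssg *g, int from, int to)
{
    g->edge[from][to] = 1;
}

/* Warshall's algorithm: reach[i][j] is 1 when site j is reachable from i. */
void ssg_close(ssg *g)
{
    int n = g->nsites;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            g->reach[i][j] = (unsigned char)(g->edge[i][j] || i == j);
    for (int k = 0; k < n; k++)
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                if (g->reach[i][k] && g->reach[k][j])
                    g->reach[i][j] = 1;
}

/* Two sites share an SCC when each one reaches the other. */
int ssg_same_scc(const ssg *g, int a, int b)
{
    return g->reach[a][b] && g->reach[b][a];
}

/* Keep a dynamic edge only if both endpoints' sites are in the same SCC. */
int keep_dynamic_edge(const ssg *g, const int *site_of, int from, int to)
{
    return ssg_same_scc(g, site_of[from], site_of[to]);
}
```

For the trace above, take site 0 for the array (A) and site 1 for the tree (T), with SSG edges T->T and A->T. The dynamic edges 2->3 and 2->4 are kept, while 1->2 is dropped, so each tree remains its own RDS instance.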
9
Experimental setup
Uses Pin, a dynamic instrumentation tool for Itanium
  Mapping between address ranges and dynamic ids is stored in an AVL tree
  Most recent mapping is cached
A mix of benchmarks from SPEC, Olden and other pointer-intensive applications
  Dynamic instruction count varies from a few million (ks) to over 300 billion (mesa)
All experiments run on a 900MHz Itanium 2 with 2 GB RAM running RH 7.1
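To illustrate the address-to-id lookup with a one-entry cache, here is a hedged C sketch. It swaps the AVL tree for a sorted array with binary search, purely to stay short, so it shows the lookup and caching idea but not the profiler's actual structure. The names `range`, `range_map` and `range_map_find` are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* One mapping: address range [start, start + size) to a dynamic id. */
typedef struct { uintptr_t start; size_t size; int id; } range;

typedef struct {
    const range *ranges;       /* sorted by start, non-overlapping */
    size_t count;
    const range *last;         /* most recent successful lookup, or NULL */
    unsigned long cache_hits;  /* lookups answered by the cached mapping */
} range_map;

static int contains(const range *r, uintptr_t addr)
{
    return addr >= r->start && addr - r->start < r->size;
}

/* Return the id whose range contains addr, or 0 if no range does. */
int range_map_find(range_map *m, uintptr_t addr)
{
    if (m->last != NULL && contains(m->last, addr)) {
        m->cache_hits++;
        return m->last->id;
    }
    /* Binary search for the first range whose start is above addr. */
    size_t lo = 0, hi = m->count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (m->ranges[mid].start <= addr)
            lo = mid + 1;
        else
            hi = mid;
    }
    /* The only candidate is the range just before that position. */
    if (lo > 0 && contains(&m->ranges[lo - 1], addr)) {
        m->last = &m->ranges[lo - 1];
        return m->last->id;
    }
    return 0;
}
```

Consecutive accesses to the same block, which are common when a node's fields are read one after another, are answered by the cached entry without a search.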
10
Profiler Performance
Profile: RDS size, lifetime, access count
Memory: <16 MB for all but 3 applications
[Figure: Profiler Performance, slowdown (0 to 40) per benchmark]
Baseline: Execution using Pin (~10 times slower than native)
11
RDS usage statistics
SCCs in static shape graph (RDS types)
  Usually a few (<5) per benchmark, a maximum of 31 in parser
#RDS instances (connected components in DSG)
  Exhibits a wide range (1 in mcf to around a million in parser)
  Tend to be live for long if the program creates only a few of them
Sizes of RDS instances
  Vary from a single-node self-loop (parser) to a few hundred thousand nodes (mcf, parser)
#pointer chasing loads
  Significant in many benchmarks
Applications show vast diversity in RDS usage
  A good reason for profiling them!
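Counting RDS instances as connected components of the DSG, and measuring their sizes, can be sketched with a union-find structure. This is one possible way to do it, not something the slides describe. The names and the fixed capacity are assumptions, and nodes are numbered from 0 here for simplicity.

```c
#define MAX_DSG_NODES 1024   /* hypothetical capacity */

typedef struct {
    int parent[MAX_DSG_NODES];
    int size[MAX_DSG_NODES];   /* valid at component roots only */
    int n;
} components;

/* Start with n isolated nodes, each its own component. */
void comp_init(components *c, int n)
{
    c->n = n;
    for (int i = 0; i < n; i++) {
        c->parent[i] = i;
        c->size[i] = 1;
    }
}

/* Find the root of x's component, halving the path on the way up. */
int comp_find(components *c, int x)
{
    while (c->parent[x] != x) {
        c->parent[x] = c->parent[c->parent[x]];
        x = c->parent[x];
    }
    return x;
}

/* A DSG edge merges the components of its endpoints (direction ignored). */
void comp_add_edge(components *c, int a, int b)
{
    a = comp_find(c, a);
    b = comp_find(c, b);
    if (a == b)
        return;
    if (c->size[a] < c->size[b]) {
        int t = a; a = b; b = t;
    }
    c->parent[b] = a;
    c->size[a] += c->size[b];
}

/* Number of components, which is the number of RDS instances. */
int comp_count(components *c)
{
    int count = 0;
    for (int i = 0; i < c->n; i++)
        if (c->parent[i] == i)
            count++;
    return count;
}

/* Number of nodes in the instance that contains x. */
int comp_size_of(components *c, int x)
{
    return c->size[comp_find(c, x)];
}
```

With seven nodes and the edges of two three-node trees (1-2, 1-3, 4-5, 4-6), node 0 stays isolated: `comp_count` reports three instances, two of size 3 and one of size 1.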
12
Temporal distribution
[Figure: # RDS instances vs. Time (Normalized, 0 to 100); two vertical scales (0 to 50 and 0 to 14000); series: perlbmk, ks, tree-puzzle, li, gcc, sqlite, parser, twolf]
13
Cumulative distribution of RDS lifetimes
[Figure: % of RDS lifetime (0 to 100) vs. Time (Normalized, 0 to 100); series: gcc, mcf, sqlite, parser, perlbmk, ks, tree-puzzle, twolf]
14
RDS Stability
Stability of an RDS: a notion of how 'array-like' an RDS is
Stability index: an attempt to quantify this notion
  Identify the time instances (alteration points) when changes occur to the RDS structure (by stores that replace existing pointers)
  Count the traversals between successive alteration points
  Stability index = #intervals that account for ‘most’ of the traversals
  Lower index means higher stability
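One way to read the definition above, sketched in C: given the traversal count of each interval between alteration points, take the largest intervals first and count how many are needed to cover a chosen share of all traversals. The slides only say ‘most’, so the percentage parameter (90 in the usage note) is an assumption, as are the function name and the greedy largest-first reading.

```c
#include <stddef.h>
#include <stdlib.h>

/* qsort comparator: larger traversal counts first. */
static int by_count_desc(const void *a, const void *b)
{
    unsigned long long x = *(const unsigned long long *)a;
    unsigned long long y = *(const unsigned long long *)b;
    return (x < y) - (x > y);
}

/*
 * counts[i] is the number of traversals in the i-th interval between
 * successive alteration points. Returns the smallest number of intervals
 * that together hold at least `percent` percent of all traversals, or 0
 * if there were no traversals. Sorts counts in place.
 */
size_t stability_index(unsigned long long *counts, size_t n, unsigned percent)
{
    unsigned long long total = 0, covered = 0;
    size_t k = 0;

    for (size_t i = 0; i < n; i++)
        total += counts[i];
    if (total == 0)
        return 0;
    qsort(counts, n, sizeof counts[0], by_count_desc);
    while (k < n && covered * 100 < total * percent)
        covered += counts[k++];
    return k;
}
```

For counts {3, 95, 1, 1} and a 90 percent threshold the index is 1: a single interval holds nearly all traversals, so the structure behaves much like an array during that interval. For {25, 25, 25, 25} the index is 4, which indicates lower stability.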
15
Cumulative distribution of stability index
[Figure: % of accesses (0 to 100) vs. Stability Index (0 to 10); series: vpr, gcc, mcf, sqlite, parser, perlbmk, ks, tree-puzzle, twolf]
16
Conclusion
Aggressive data structure level optimization techniques for RDS need profile information for improved performance
RDS profiling gives a better understanding of the runtime behavior of RDS
RDS usage varies widely across benchmarks
17
Extra Slides
18
RDS Profiling: Definitions
RDS type: The abstract form of the logical data structure that is manipulated by the program
  Examples: list, binary tree, graph, etc.
  Can be mutually recursive (nodes point to their incident edges and vice versa to form a graph)
RDS instance: A concrete realization of the RDS type
  Example: the tree created in function foo, the list pointed to by the first entry of the hash table.
19