mesh layouts for block-based caches
DESCRIPTION
Mesh Layouts for Block-Based Caches. Sung-Eui Yoon Peter Lindstrom Lawrence Livermore National Laboratory. Goal. Provide cache-coherent layouts of meshes and graphs Derive metrics that measure cache-coherence of layouts Generality Simplicity Efficiency Accuracy. Cache-Coherent Metrics. - PowerPoint PPT PresentationTRANSCRIPT
Mesh Layouts for Block-Based Caches
Sung-Eui Yoon
Peter LindstromLawrence Livermore National Laboratory
2
Goal● Provide cache-coherent layouts of
meshes and graphs● Derive metrics that measure cache-
coherence of layouts● Generality● Simplicity● Efficiency● Accuracy
3
Cache-Coherent Metrics● Measure the expected number of
cache misses of a layout given block-based caches● Should correlate well with the observed
number of cache misses
● Cache-aware metrics● Measure cache-coherence given known
cache parameters (e.g., block size)● Cache-oblivious metrics
● Consider all possible cache parameters
4
Motivation
Accumulated growth rate
during 1993 – 2004(log scale)
Courtesy: Anselmo Lastra, http://www.hcibook.com/e3/online/moores-law/
● Lower growth rate of data access speed
1.5X
20X46X
130X
during 99 - 04
5
Memory Hierarchies and Block-Based Caches
CPU
Fast memory or cache
Slow memory
Blocktransfer
Disk
10-2 secAccess time: 10-7 sec10-8 sec
6
Main Contributions● Propose novel and practical cache-
aware and cache-oblivious metrics● Derive metrics given block-based caches● Propose efficient cache-coherent layout
constructions● Apply to different applications
7
Related Work● Computation reordering● Data layout optimization
8
Computational Reordering● Cache-aware [Coleman and McKinley
95, Vitter 01, Sen et al. 02]● Cache-oblivious [Frigo et al. 99, Arge
et al. 04]
Focus on specific problems such as sorting and linear algebra computations
9
Data Layout Optimization● Graph and matrix layout [Diaz et al. 02]
● Minimum linear arrangement (MLA), bandwidth, and wavefront, etc.
● Space-filling curves● [Sagan 94, Pascucci and Frank 01,
Lindstrom and Pascucci 01, Gopi and Eppstein 04]
● Rendering and processing sequences● [Deering 95, Hoppe 99, Bogomjakov and
Gotsman 02, Isenburg and Lindstrom 05]● Cache-oblivious mesh layout
● [Yoon et al. 05]
10
Outline● Computation models● Cache-aware and cache-oblivious
metrics● Results
11
Outline● Computation models● Cache-aware and cache-oblivious
metrics● Results
12
nc
General Framework of Layout Computation
na
nb ndInput directed graph, G (N, A)
Layout algorithm, φ
1D layout, φ(N)
……..na nd nc nb
Cache-coherent metric
13
nc
Two-Level I/O Model [Aggarwal and Vitter 88]
na
nb ndInput directed
graph
Layout algorithm
Cache
M cache blocks, whose size is B
1D layout with block size = 3
……..na nd nc nb
na nd nc nb
14
Graph Representation● Directed graph, G = (N, A)
● Represent access patterns between nodes
● Nodes, N● Data element ● (e.g., mesh vertex or mesh triangle)
● Directed arcs, A● Connects two nodes if they are accessed
sequentially
nc
na
nb nd
15
Weights of Nodes and Arcs● Indicate probabilities that each
element will be accessed
● Computed in an equilibrium status during infinite random walks● Assume that applications infinitely access
the data according to the input graph● Correspond to eigen-values of the
probability transition matrix
16
Cache-Coherence of a Layout given Block-Based Caches● Expected number of cache misses of a
layout● Probability accessing a node from
another node by traversing an arc● Conditional probability that we will have
a cache miss given the above access pattern
nc
na
nb nd
17
Specialization to Meshes● Expected number of cache misses of a
layout● Probability accessing a node from
another node by traversing an arc● Conditional probability that we will have
a cache miss given the above access pattern
nc
na
nb nd
nc
na
nb nd
An input mesh Implicitly derived graph
= constant
1. Two opposite directed arcs2. Uniform distribution to access adjacent nodes given a node
18
Outline● Computation models● Cache-aware and cache-oblivious
metrics● Results
19
Four Different Cases
Cache-aware casesingle cache block,
M=1
Cache-oblivious casesingle cache block,
M=1
Cache-aware casemultiple cache blocks,
M>1
Cache-oblivious casemultiple cache blocks,
M>1
20
Cache-Aware: Single Cache Block, M=1
nc
na
nb ndInput directed
graph
1D layout with block size = 3
……..na nd nc nb
Cache, whose block size is B
Straddling arcs
21
Cache-Aware: Multiple Cache Blocks, M>1
nc
na
nb ndInput directed
graph
1D layout with block boundary
……..na nd nc nb
Straddling arcsCache
22
Final Cache-Aware Metric● Counts the number of straddling arcs
of the layout given a block size B
Aji
BB jiSA ),(
|))()((|||
1
)(iB : block index containing the node, i
)(xS : Unit step function, 1 if x > 0 0 otherwise.
where
23
High Accuracy of Cache-Aware Metric
Linear correlation
[-1, 1]
Observed number of cache misses
With 5 cache blocks
With 25 cache blocks
Cache-aware metric 0.97 0.97Tested layouts:
Z-curve, Hilbert curve, H-order, minimum linear arrangement layout, βΩ-layout, geometric CO layout,(bi or uni) row-by-row, (bi or uni) diagnoal layouts
Z-curve on a uniform grid
Tested block size = 4KB
24
Four Different Cases
Cache-aware casesingle cache block,
M=1
Cache-oblivious casesingle cache block,
M=1
Cache-aware casemultiple cache blocks,
M>1
Cache-oblivious casemultiple cache blocks,
M>1
25
Cache-Oblivious: Single Cache Block, M=1
Cache
Does not assume a particular block size:Then, what are good representatives
for block sizes?
26
Two Possible Block Size Progressions● Arithmetic progression
● 1, 2, 3, 4, …
● Geometric progression● 20 , 21 , 22 , 23 , …● Well reflects current caching
architectures● E.g., L1: 32B, L2: 64B, Page: 4KB, etc.
27
Probability that an Arc is a Straddling Arc
……..na nd nc nb
Is an arc straddlinggiven a block size?
Computed as a probability as a function of arc length, l
Arc length, l, = 2
28
Two Cache-Oblivious Metrics● Arithmetic cache-oblivious metric,
● Geometric cache-oblivious metric,
Aji
ijA l),(
||1
Aji
ijA l),(
||1 )log(
log||
1
),(
A
Ajiijl
Arc length of arc (i, j)
Geometric mean of arc lengths
)(aCOM
)(gCOM
MLA metric,Arithmetic mean
29
Validation for Cache-Oblivious (CO) Metrics
● Geometric cache-oblivious metric● Practical and useful
Geometric CO layout
Arithmetic CO layout
97% of tested block sizes
The number of cache misseswhen M = 1(log scale)
73% of tested power-of-two block sizes
30
Correlations between Metrics and Observed Number of Cache Misses
Linear correlation[-1, 1]
Observed number of cache misses
With 1 cache block
With 5 cache blocks
Geometric CO metric 0.98 0.81
Arithmetic CO metric -0.19 -0.32
Tested block size = 4KB
Tested with 10 different layouts on a uniform grid
31
Efficient Layout Computation for Our Metrics
● Cache-aware layouts● Optimized with cache-aware metric given
a block size B● Computed from the graph partitioning
● Geometric cache-oblivious metric● Very efficient● Can be used in different layout methods
32
Layout Computation with Geometric Cache-Oblivious Metric
● Multi-level construction method● Partition an input mesh into k different
sets● Layout partitions based on our metric
…
1. Partition 2. Lay out
•Generalized layout method for unstructured meshes
33
Outline● Computation models● Cache-aware and cache-oblivious
metrics● Results
34
Applications● Isosurface extraction● View-dependent rendering
35
Iso-Surface Extraction
● Uses contour tree [van Kreveld et al. 97]● Runtime is dominated by the traversal of
iso-surface● Layout graph
● Use an input tetrahedral mesh
Spx model(140K vertices)
36
High Correlation with Number of Cache Misses
Linear correlation
[-1, 1]
Observed number of cache misses
With 1 cache block
With 10K cache blocks
Geometric CO metric 0.99 0.98
Tested block size = 4KB
Tested with 8 different layouts: our geometric CO, our cache-aware, breadth-first (and
depth-first) layouts, spectral [Juvan and Mohar 92], cache-oblivious mesh [Yoon et al. 05], Z-curve [Sagan 94], X-axis
sorted layouts
37
High Correlation with Runtime Performance
Linear correlation
[-1, 1]
First iso-surface
extraction time
Second iso-surface
extraction timeGeometric CO metric 0.94 0.94
Disk I/O time is major bottleneck
Memory access time is major bottleneck
38
Comparison with Other Layouts
0 0.5 1 1.5 2 2.5
Cache-aware layout
Geometric CO layout
Z-curve
COML
Depth-first layout
Spectral layout
Breadth-first layout
X-axis layout
The first iso-surface extraction time (sec)
8% - 77% improvement and
very close to the cache-aware
performance
39
View-Dependent Rendering● Layout vertices and triangles of
progressive meshes● Used in an efficient VDR system [Yoon et
al. 04]● Reduce misses in GPU vertex cache
40
Cache Miss Ratio on Bunny Model
GPU vertex cache miss
ratio
Vertex cache size
Hoppe [Hoppe 99]Theoretical lower bound
[Bar-Yehuda and Gotsman 96]
Universal rendering seq. [Bogomjakov and Gotsman 02]
GeometricCO layout
41
Cache Miss Ratio on Power Plant Model
GPU vertex cache miss
ratio
Vertex cache size
Z-curve
Hoppe’s renderingseq. [Hoppe 99]
Theoretical lower bound
[Bar-Yehuda and Gotsman 96]
COML [Yoon et al. 05]
GeometricCO layout
42
Conclusion● Novel cache-aware and cache-
oblivious metrics to evaluate layouts● Derived metrics based on two-level I/O
model● Improved the performance of applications
without modifying codes
OpenCCL, open source libraryhttp:://gamma.cs.unc.edu/COL/OpenCCL
43
Ongoing and Future Work● Derive a lower bound on our
geometric cache-oblivious metric● Employ mesh compression to further
reduce disk I/O accesses● Investigate efficient layout method for
deforming models● Apply to non-graphics applications
● e.g., shortest path or other graph computations
44
Cache-Efficient Layouts of Bounding Volume Hierarchies● Yoon and Manocha, Eurographics 06
Ray tracingCollision detection
45
Acknowledgements● Ajith Mascarenhas● Martin Isenburg● Dinesh Manocha● Fabio Bernardon, Joao Comba, and
Claudio Silva● For their unstructured tetrahedra
rendering program● Members of data analysis group in
LLNL● Anonymous reviewers
46
UCRL-PRES-225448
This work was performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract No. W-7405-ENG-48.
47
Additional slides
48
Cache-Coherence of Layouts● Well known heuristics for cache-
coherent layouts● Space-filling curves [Sagan 94]
● How can we compute better layouts?● Requires metrics measuring cache-
coherence of layouts
49
Main Results● Define cache-coherence of layout as:
● Expected number of cache misses during random walks of a graph given block-based caches
● Then, the exp. number of cache misses= Number of straddling arcs in a cache-
aware cache≈ Geometric mean of arc lengths in a cache-
oblivious case
50
Data Layout Optimization● Rendering sequences
● Triangle strips● [Deering 95, Hoppe 99, Bogomjakov and
Gotsman 02]● Processing sequences
● [Isenburg and Gumhold 03, Isenburg and Lindstrom 05]
Assume that access patternglobally follows the layout order
51
Data Layout Optimization● Space-filling curves
● [Sagan 94, Velho and Gomes 91, Pascucci and Frank 01, Lindstrom and Pascucci 01, Gopi and Eppstein 04]
Assume geometric regularity
52
Data Layout Optimization● Graph and matrix layout
● A survey [Diaz et al. 02]● Minimum linear arrangement (MLA)● Bandwidth● Profile● Wavefront, etc.
Does not necessarily produce good layouts for block-based caches
53
Cache-Aware Metric● Expected number of cache misses given a
block size B
Aji
BB iiSA ),(
|))()((|||
1
54
Correlation between Cache-Aware Metric and Observed Number of Cache Misses
Cache-aware metric
M = 5 M = 25Observed number of cache misses
Cache block size = 4KB
R2 = 0.97 R2 = 0.97
55
Evaluating Existing Layouts
● No known tight bound
● Compare against the best layout we can construct● Employ an efficient
sampling method
Is it close to the optimal
layout?
An existing layout, φ
Use it
Build a new one
56
Implementation● Modify our open source layout
computation codes, OpenCCL● Based on METIS graph partitioning library
[Karypis and Kumar 98]
● Processing speed● 15k triangles / second
57
Comparison with Cache-Oblivious Mesh Layouts (COML) [Yoon et al. 05]
● Two major improvements● Accuracy● Usability
58
Pros and Cons● Limitations
● Not directly applicable to dynamic models
● Pros● Generality● Can have benefits without modifying
underlying codes
59
Specialization to Meshes
● Assume an equally likelihood to access adjacent nodes given a node
nc
na
nb nd
nc
na
nb nd
An input mesh Implicitly derived graph
60
Two Possible Block Size Progressions● Arithmetic
progression● 1, 2, 3, 4, …
● Geometric progression● 20 , 21 , 22 , 23 , …● Well reflects current
caching architectures● E.g., L1: 32B, L2: 64B,
Page: 4KB, etc.
B
Pr(B)
B
Pr(B)Geometricdistribution
Uniform distribution
61
Cache Miss Ratio with Different Mesh Resolutions of Bunny Model
GPU vertex cache miss
ratio(case size: 32)
Mesh resolution
Hoppe’s rendering sequence[Hoppe 99]
COLg
62
Correlation between Cache-Aware Metric and Number of Cache Misses
M = 5 M = 25Observed number of cache misses
Cache block size = 4KB
R2 = 0.97 R2 = 0.97
63
Correlations with Observed Number of Cache Misses
Cache missesOne blk : Mult blks
ArithmeticCO metric
GeometricCO metric
R2 = 0.98 R2 = 0.81 Cache block size = 4KB
,
64
Comparison with Other Layouts0.99 0.98 0.94 0.94
X-axis
BFLSpectral layout
[Juvan and Mohar 92]DFLCOML
[Yoon et al. 05]Z-curveCOLgAware
Correlation:
Up to 2X speedup9% lower than cache-
aware layout
65
Comparison with Other Layouts● Compute eight different layouts
● Our geometric cache-oblivious layout● Our cache-aware layout● Breadth-first layout● Depth-first layout● Spectral layout [Juvan and Mohar 92]● Cache-oblivious mesh layout [Yoon et al.
05]● Z-curve [Sagan 94]● X-axis sorted layout
66
Our Layout of Bunny Model