mesh layouts for block-based caches

Mesh Layouts for Block-Based Caches Sung-Eui Yoon Peter Lindstrom Lawrence Livermore National Laboratory


Post on 25-Feb-2016


TRANSCRIPT

Page 1: Mesh Layouts  for Block-Based Caches

Mesh Layouts for Block-Based Caches

Sung-Eui Yoon

Peter Lindstrom

Lawrence Livermore National Laboratory

Page 2: Mesh Layouts  for Block-Based Caches

2

Goal
● Provide cache-coherent layouts of meshes and graphs
● Derive metrics that measure cache-coherence of layouts
    ● Generality
    ● Simplicity
    ● Efficiency
    ● Accuracy

Page 3: Mesh Layouts  for Block-Based Caches

3

Cache-Coherent Metrics
● Measure the expected number of cache misses of a layout given block-based caches
    ● Should correlate well with the observed number of cache misses
● Cache-aware metrics
    ● Measure cache-coherence given known cache parameters (e.g., block size)
● Cache-oblivious metrics
    ● Consider all possible cache parameters

Page 4: Mesh Layouts  for Block-Based Caches

4

Motivation
● Lower growth rate of data access speed
[Chart: accumulated growth rate during 1993 – 2004 (log scale); labeled values 1.5X, 20X, 46X, 130X; annotation "during 99 - 04". Courtesy: Anselmo Lastra, http://www.hcibook.com/e3/online/moores-law/]

Page 5: Mesh Layouts  for Block-Based Caches

5

Memory Hierarchies and Block-Based Caches
[Diagram: CPU, fast memory or cache, slow memory, and disk, connected by block transfers. Access times shown: 10^-8 sec, 10^-7 sec, 10^-2 sec]

Page 6: Mesh Layouts  for Block-Based Caches

6

Main Contributions
● Propose novel and practical cache-aware and cache-oblivious metrics
    ● Derive metrics given block-based caches
● Propose efficient cache-coherent layout constructions
● Apply to different applications

Page 7: Mesh Layouts  for Block-Based Caches

7

Related Work
● Computation reordering
● Data layout optimization

Page 8: Mesh Layouts  for Block-Based Caches

8

Computational Reordering
● Cache-aware [Coleman and McKinley 95, Vitter 01, Sen et al. 02]
● Cache-oblivious [Frigo et al. 99, Arge et al. 04]
Focus on specific problems such as sorting and linear algebra computations

Page 9: Mesh Layouts  for Block-Based Caches

9

Data Layout Optimization
● Graph and matrix layout [Diaz et al. 02]
    ● Minimum linear arrangement (MLA), bandwidth, wavefront, etc.
● Space-filling curves
    ● [Sagan 94, Pascucci and Frank 01, Lindstrom and Pascucci 01, Gopi and Eppstein 04]
● Rendering and processing sequences
    ● [Deering 95, Hoppe 99, Bogomjakov and Gotsman 02, Isenburg and Lindstrom 05]
● Cache-oblivious mesh layout
    ● [Yoon et al. 05]

Page 10: Mesh Layouts  for Block-Based Caches

10

Outline
● Computation models
● Cache-aware and cache-oblivious metrics
● Results

Page 11: Mesh Layouts  for Block-Based Caches

11

Outline
● Computation models
● Cache-aware and cache-oblivious metrics
● Results

Page 12: Mesh Layouts  for Block-Based Caches

12

General Framework of Layout Computation
[Diagram: input directed graph G = (N, A) with nodes na, nb, nc, nd → layout algorithm φ → 1D layout φ(N) (… na nd nc nb …) → cache-coherent metric]

Page 13: Mesh Layouts  for Block-Based Caches

13

Two-Level I/O Model [Aggarwal and Vitter 88]
[Diagram: input directed graph (nodes na, nb, nc, nd) → layout algorithm → 1D layout with block size = 3 (… na nd nc nb …) → cache holding M cache blocks, each of size B]
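The two-level model can be made concrete by counting misses for a sequence of node accesses. The following is a minimal sketch, not the authors' simulator: it assumes LRU replacement (the slide does not name a policy), and the function name, the example layout, and the access sequence are all hypothetical.

```python
from collections import OrderedDict

def count_cache_misses(accesses, layout, block_size, num_blocks):
    """Count cache misses for a sequence of node accesses.

    layout maps each node to its position in the 1D layout; the node at
    position p lives in block p // block_size. The cache holds num_blocks
    blocks (M) of block_size elements (B). LRU eviction is an assumption.
    """
    cache = OrderedDict()  # block index -> None, least recently used first
    misses = 0
    for node in accesses:
        block = layout[node] // block_size
        if block in cache:
            cache.move_to_end(block)       # refresh recency on a hit
        else:
            misses += 1
            cache[block] = None
            if len(cache) > num_blocks:
                cache.popitem(last=False)  # evict the least recently used block
    return misses

# Hypothetical example: the layout "na nd nc nb" with block size 3.
layout = {"na": 0, "nd": 1, "nc": 2, "nb": 3}
accesses = ["na", "nd", "nb", "nc", "na"]
print(count_cache_misses(accesses, layout, block_size=3, num_blocks=1))
print(count_cache_misses(accesses, layout, block_size=3, num_blocks=2))
```

With a single cache block, every move between the two blocks costs a miss; with two blocks, only the first touch of each block does.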

Page 14: Mesh Layouts  for Block-Based Caches

14

Graph Representation
● Directed graph, G = (N, A)
    ● Represent access patterns between nodes
● Nodes, N
    ● Data element (e.g., mesh vertex or mesh triangle)
● Directed arcs, A
    ● Connects two nodes if they are accessed sequentially
[Figure: example graph with nodes na, nb, nc, nd]

Page 15: Mesh Layouts  for Block-Based Caches

15

Weights of Nodes and Arcs
● Indicate probabilities that each element will be accessed
● Computed in an equilibrium state during infinite random walks
    ● Assume that applications infinitely access the data according to the input graph
    ● Correspond to the entries of the dominant eigenvector (eigenvalue 1) of the probability transition matrix
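The equilibrium weights can be approximated by repeatedly applying the transition probabilities. The sketch below is illustrative only: the function name and the three-node example are hypothetical, and it uses a "lazy" random walk (stay in place with probability 1/2), which has the same stationary distribution but also converges on periodic graphs.

```python
def stationary_distribution(transitions, num_nodes, iterations=200):
    """Approximate node access probabilities of an infinite random walk.

    transitions maps a directed arc (i, j) to Pr(j | i); the outgoing
    probabilities of every node are assumed to sum to 1.
    """
    p = [1.0 / num_nodes] * num_nodes
    for _ in range(iterations):
        stepped = [0.0] * num_nodes
        for (i, j), prob in transitions.items():
            stepped[j] += p[i] * prob
        # Lazy walk: average the old and the stepped distributions.
        p = [0.5 * old + 0.5 * new for old, new in zip(p, stepped)]
    return p

# Hypothetical path graph 0 - 1 - 2 with uniform choice among neighbours.
transitions = {(0, 1): 1.0, (1, 0): 0.5, (1, 2): 0.5, (2, 1): 1.0}
node_prob = stationary_distribution(transitions, 3)
# Arc weight = Pr(i) * Pr(j | i): the probability of traversing arc (i, j).
arc_prob = {arc: node_prob[arc[0]] * pr for arc, pr in transitions.items()}
print(node_prob, arc_prob)
```

In this example every arc ends up with the same weight, 1/|A|, which is the situation the mesh specialization relies on.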

Page 16: Mesh Layouts  for Block-Based Caches

16

Cache-Coherence of a Layout given Block-Based Caches
● Expected number of cache misses of a layout, computed from:
    ● Probability of accessing a node from another node by traversing an arc
    ● Conditional probability that we will have a cache miss given the above access pattern
[Figure: example graph with nodes na, nb, nc, nd]

Page 17: Mesh Layouts  for Block-Based Caches

17

Specialization to Meshes
● Expected number of cache misses of a layout, computed from:
    ● Probability of accessing a node from another node by traversing an arc (= constant for meshes)
    ● Conditional probability that we will have a cache miss given the above access pattern
[Figure: an input mesh and its implicitly derived graph, both on nodes na, nb, nc, nd]
Assumptions for the derived graph:
1. Two opposite directed arcs for each mesh edge
2. Uniform distribution to access adjacent nodes given a node
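A short sketch of how the implicit graph could be derived from a triangle list under the two assumptions above. The function names and the two-triangle mesh are hypothetical. For a connected mesh with uniform neighbour selection, a node's equilibrium probability is proportional to its degree, so each arc is traversed with the same probability 1/|A|.

```python
def mesh_to_arcs(triangles):
    """Return the set of directed arcs: two opposite arcs per mesh edge."""
    arcs = set()
    for a, b, c in triangles:
        for i, j in ((a, b), (b, c), (c, a)):
            arcs.add((i, j))
            arcs.add((j, i))
    return arcs

def node_probabilities(arcs):
    """Equilibrium access probability of each node: out-degree / |A|."""
    degree = {}
    for i, _ in arcs:
        degree[i] = degree.get(i, 0) + 1
    return {i: d / len(arcs) for i, d in degree.items()}

# Hypothetical mesh: two triangles sharing the edge (1, 2).
triangles = [(0, 1, 2), (1, 3, 2)]
arcs = mesh_to_arcs(triangles)
probs = node_probabilities(arcs)
degree = {i: round(p * len(arcs)) for i, p in probs.items()}
# Pr(i) * Pr(j | i) = (deg(i) / |A|) * (1 / deg(i)) = 1 / |A| for every arc.
arc_probs = {(i, j): probs[i] / degree[i] for i, j in arcs}
print(len(arcs), probs)
```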

Page 18: Mesh Layouts  for Block-Based Caches

18

Outline
● Computation models
● Cache-aware and cache-oblivious metrics
● Results

Page 19: Mesh Layouts  for Block-Based Caches

19

Four Different Cases
● Cache-aware case, single cache block, M=1
● Cache-oblivious case, single cache block, M=1
● Cache-aware case, multiple cache blocks, M>1
● Cache-oblivious case, multiple cache blocks, M>1

Page 20: Mesh Layouts  for Block-Based Caches

20

Cache-Aware: Single Cache Block, M=1
[Diagram: input directed graph (nodes na, nb, nc, nd) and its 1D layout with block size = 3 (… na nd nc nb …); a cache whose block size is B; straddling arcs marked]

Page 21: Mesh Layouts  for Block-Based Caches

21

Cache-Aware: Multiple Cache Blocks, M>1
[Diagram: input directed graph (nodes na, nb, nc, nd) and its 1D layout with block boundary (… na nd nc nb …); cache; straddling arcs marked]

Page 22: Mesh Layouts  for Block-Based Caches

22

Final Cache-Aware Metric
● Counts the number of straddling arcs of the layout given a block size B

$\frac{1}{|A|} \sum_{(i,j) \in A} S(|\varphi_B(i) - \varphi_B(j)|)$

where
$\varphi_B(i)$: block index containing the node, i
$S(x)$: unit step function, 1 if x > 0, 0 otherwise
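The metric on this slide translates almost directly into code. This is a minimal sketch, assuming the layout is given as a mapping from node to 1D position and that the block index of position p is p // B. The function name and the example arcs are hypothetical.

```python
def cache_aware_metric(arcs, layout, block_size):
    """Fraction of arcs whose two endpoints fall in different blocks."""
    straddling = sum(
        1
        for i, j in arcs
        if layout[i] // block_size != layout[j] // block_size
    )
    return straddling / len(arcs)

# Hypothetical example: layout "na nd nc nb" and four directed arcs.
layout = {"na": 0, "nd": 1, "nc": 2, "nb": 3}
arcs = [("na", "nb"), ("nb", "nc"), ("nc", "nd"), ("nd", "na")]
print(cache_aware_metric(arcs, layout, block_size=3))
```

With block size 3, node nb sits alone in the second block, so the two arcs touching nb straddle a block boundary and the other two do not.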

Page 23: Mesh Layouts  for Block-Based Caches

23

High Accuracy of Cache-Aware Metric
Linear correlation (range [-1, 1]) with the observed number of cache misses:

                        With 5 cache blocks    With 25 cache blocks
Cache-aware metric      0.97                   0.97

Tested layouts: Z-curve, Hilbert curve, H-order, minimum linear arrangement layout, βΩ-layout, geometric CO layout, (bi or uni) row-by-row, (bi or uni) diagonal layouts
Tested block size = 4KB
[Figure: Z-curve on a uniform grid]

Page 24: Mesh Layouts  for Block-Based Caches

24

Four Different Cases
● Cache-aware case, single cache block, M=1
● Cache-oblivious case, single cache block, M=1
● Cache-aware case, multiple cache blocks, M>1
● Cache-oblivious case, multiple cache blocks, M>1

Page 25: Mesh Layouts  for Block-Based Caches

25

Cache-Oblivious: Single Cache Block, M=1
[Diagram: cache]
Does not assume a particular block size. Then, what are good representatives for block sizes?

Page 26: Mesh Layouts  for Block-Based Caches

26

Two Possible Block Size Progressions
● Arithmetic progression
    ● 1, 2, 3, 4, …
● Geometric progression
    ● 2^0, 2^1, 2^2, 2^3, …
    ● Well reflects current caching architectures
    ● E.g., L1: 32B, L2: 64B, Page: 4KB, etc.

Page 27: Mesh Layouts  for Block-Based Caches

27

Probability that an Arc is a Straddling Arc
[Diagram: 1D layout … na nd nc nb … with an arc of length l = 2]
● Is an arc straddling given a block size?
● Computed as a probability as a function of arc length, l
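One way to see how this probability depends on arc length is to enumerate every possible alignment of the arc relative to the block boundaries. The sketch below assumes the alignment is uniformly distributed, which is an assumption made here for illustration. Under it the probability works out to min(l, B) / B. The function name is hypothetical.

```python
def straddle_probability(arc_length, block_size):
    """Fraction of block alignments for which an arc crosses a boundary."""
    crossings = 0
    for offset in range(block_size):
        start = offset               # position of one endpoint within its block
        end = offset + arc_length    # position of the other endpoint
        if start // block_size != end // block_size:
            crossings += 1
    return crossings / block_size

# The slide's example arc of length 2, tried with several block sizes.
for block_size in (1, 2, 3, 4, 8):
    print(block_size, straddle_probability(2, block_size))
```

Longer arcs straddle with higher probability, and any arc at least as long as a block always straddles.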

Page 28: Mesh Layouts  for Block-Based Caches

28

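Two Cache-Oblivious Metrics
● Arithmetic cache-oblivious metric (the MLA metric, the arithmetic mean of arc lengths):

$COM_a(\varphi) = \frac{1}{|A|} \sum_{(i,j) \in A} l_{ij}$

● Geometric cache-oblivious metric (the log of the geometric mean of arc lengths):

$COM_g(\varphi) = \log\left( \left( \prod_{(i,j) \in A} l_{ij} \right)^{1/|A|} \right) = \frac{1}{|A|} \sum_{(i,j) \in A} \log(l_{ij})$

where $l_{ij}$ is the arc length of arc (i, j)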
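Both cache-oblivious metrics depend only on arc lengths in the 1D layout. This is a minimal sketch, taking the arc length of (i, j) to be the distance between the two layout positions and using the natural logarithm. The base only scales the geometric metric and is an assumption here, as are the function names and the example arcs.

```python
import math

def arc_lengths(arcs, layout):
    """Arc length of (i, j): distance between the two layout positions."""
    return [abs(layout[i] - layout[j]) for i, j in arcs]

def arithmetic_co_metric(arcs, layout):
    """Arithmetic mean of arc lengths (the MLA metric)."""
    lengths = arc_lengths(arcs, layout)
    return sum(lengths) / len(lengths)

def geometric_co_metric(arcs, layout):
    """Log of the geometric mean of arc lengths, i.e. the mean log length."""
    lengths = arc_lengths(arcs, layout)
    return sum(math.log(length) for length in lengths) / len(lengths)

# Hypothetical example: layout "na nd nc nb" and four directed arcs.
layout = {"na": 0, "nd": 1, "nc": 2, "nb": 3}
arcs = [("na", "nb"), ("nb", "nc"), ("nc", "nd"), ("nd", "na")]
print(arithmetic_co_metric(arcs, layout))
print(geometric_co_metric(arcs, layout))
```

The geometric metric penalizes one long arc far less than the arithmetic metric does, since the long arc contributes log(3) rather than 3 in this example.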

Page 29: Mesh Layouts  for Block-Based Caches

29

Validation for Cache-Oblivious (CO) Metrics
● Geometric cache-oblivious metric
    ● Practical and useful
[Plot: the number of cache misses when M = 1 (log scale) for the geometric CO layout and the arithmetic CO layout; annotations: "97% of tested block sizes" and "73% of tested power-of-two block sizes"]

Page 30: Mesh Layouts  for Block-Based Caches

30

Correlations between Metrics and Observed Number of Cache Misses
Linear correlation (range [-1, 1]) with the observed number of cache misses:

                        With 1 cache block    With 5 cache blocks
Geometric CO metric     0.98                  0.81
Arithmetic CO metric    -0.19                 -0.32

Tested block size = 4KB
Tested with 10 different layouts on a uniform grid

Page 31: Mesh Layouts  for Block-Based Caches

31

Efficient Layout Computation for Our Metrics
● Cache-aware layouts
    ● Optimized with cache-aware metric given a block size B
    ● Computed from the graph partitioning
● Geometric cache-oblivious metric
    ● Very efficient
    ● Can be used in different layout methods

Page 32: Mesh Layouts  for Block-Based Caches

32

Layout Computation with Geometric Cache-Oblivious Metric
● Multi-level construction method
    ● Partition an input mesh into k different sets
    ● Lay out partitions based on our metric
[Figure: 1. Partition, 2. Lay out]
● Generalized layout method for unstructured meshes

Page 33: Mesh Layouts  for Block-Based Caches

33

Outline
● Computation models
● Cache-aware and cache-oblivious metrics
● Results

Page 34: Mesh Layouts  for Block-Based Caches

34

Applications
● Isosurface extraction
● View-dependent rendering

Page 35: Mesh Layouts  for Block-Based Caches

35

Iso-Surface Extraction
● Uses contour tree [van Kreveld et al. 97]
    ● Runtime is dominated by the traversal of iso-surface
● Layout graph
    ● Use an input tetrahedral mesh
[Figure: Spx model (140K vertices)]

Page 36: Mesh Layouts  for Block-Based Caches

36

High Correlation with Number of Cache Misses
Linear correlation (range [-1, 1]) with the observed number of cache misses:

                        With 1 cache block    With 10K cache blocks
Geometric CO metric     0.99                  0.98

Tested block size = 4KB
Tested with 8 different layouts: our geometric CO, our cache-aware, breadth-first (and depth-first) layouts, spectral [Juvan and Mohar 92], cache-oblivious mesh [Yoon et al. 05], Z-curve [Sagan 94], X-axis sorted layouts

Page 37: Mesh Layouts  for Block-Based Caches

37

High Correlation with Runtime Performance
Linear correlation (range [-1, 1]):

                        First iso-surface extraction time    Second iso-surface extraction time
Geometric CO metric     0.94                                 0.94

● First extraction: disk I/O time is major bottleneck
● Second extraction: memory access time is major bottleneck

Page 38: Mesh Layouts  for Block-Based Caches

38

Comparison with Other Layouts
[Bar chart: the first iso-surface extraction time (sec), axis from 0 to 2.5, for cache-aware layout, geometric CO layout, Z-curve, COML, depth-first layout, spectral layout, breadth-first layout, and X-axis layout]
8% - 77% improvement and very close to the cache-aware performance

Page 39: Mesh Layouts  for Block-Based Caches

39

View-Dependent Rendering
● Lay out vertices and triangles of progressive meshes
    ● Used in an efficient VDR system [Yoon et al. 04]
● Reduce misses in GPU vertex cache

Page 40: Mesh Layouts  for Block-Based Caches

40

Cache Miss Ratio on Bunny Model
[Plot: GPU vertex cache miss ratio vs. vertex cache size; curves: Hoppe [Hoppe 99], universal rendering seq. [Bogomjakov and Gotsman 02], geometric CO layout, and theoretical lower bound [Bar-Yehuda and Gotsman 96]]

Page 41: Mesh Layouts  for Block-Based Caches

41

Cache Miss Ratio on Power Plant Model
[Plot: GPU vertex cache miss ratio vs. vertex cache size; curves: Z-curve, Hoppe’s rendering seq. [Hoppe 99], COML [Yoon et al. 05], geometric CO layout, and theoretical lower bound [Bar-Yehuda and Gotsman 96]]

Page 42: Mesh Layouts  for Block-Based Caches

42

Conclusion
● Novel cache-aware and cache-oblivious metrics to evaluate layouts
    ● Derived metrics based on two-level I/O model
● Improved the performance of applications without modifying codes
OpenCCL, open source library
http://gamma.cs.unc.edu/COL/OpenCCL

Page 43: Mesh Layouts  for Block-Based Caches

43

Ongoing and Future Work
● Derive a lower bound on our geometric cache-oblivious metric
● Employ mesh compression to further reduce disk I/O accesses
● Investigate efficient layout method for deforming models
● Apply to non-graphics applications
    ● e.g., shortest path or other graph computations

Page 44: Mesh Layouts  for Block-Based Caches

44

Cache-Efficient Layouts of Bounding Volume Hierarchies
● Yoon and Manocha, Eurographics 06
[Figures: ray tracing; collision detection]

Page 45: Mesh Layouts  for Block-Based Caches

45

Acknowledgements
● Ajith Mascarenhas
● Martin Isenburg
● Dinesh Manocha
● Fabio Bernardon, Joao Comba, and Claudio Silva
    ● For their unstructured tetrahedra rendering program
● Members of data analysis group in LLNL
● Anonymous reviewers

Page 46: Mesh Layouts  for Block-Based Caches

46

UCRL-PRES-225448

This work was performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract No. W-7405-ENG-48.

Page 47: Mesh Layouts  for Block-Based Caches

47

Additional slides

Page 48: Mesh Layouts  for Block-Based Caches

48

Cache-Coherence of Layouts
● Well known heuristics for cache-coherent layouts
    ● Space-filling curves [Sagan 94]
● How can we compute better layouts?
    ● Requires metrics measuring cache-coherence of layouts

Page 49: Mesh Layouts  for Block-Based Caches

49

Main Results
● Define cache-coherence of layout as:
    ● Expected number of cache misses during random walks of a graph given block-based caches
● Then, the expected number of cache misses
    = Number of straddling arcs in a cache-aware case
    ≈ Geometric mean of arc lengths in a cache-oblivious case

Page 50: Mesh Layouts  for Block-Based Caches

50

Data Layout Optimization
● Rendering sequences
    ● Triangle strips
    ● [Deering 95, Hoppe 99, Bogomjakov and Gotsman 02]
● Processing sequences
    ● [Isenburg and Gumhold 03, Isenburg and Lindstrom 05]
Assume that access pattern globally follows the layout order

Page 51: Mesh Layouts  for Block-Based Caches

51

Data Layout Optimization
● Space-filling curves
    ● [Sagan 94, Velho and Gomes 91, Pascucci and Frank 01, Lindstrom and Pascucci 01, Gopi and Eppstein 04]
Assume geometric regularity

Page 52: Mesh Layouts  for Block-Based Caches

52

Data Layout Optimization
● Graph and matrix layout
    ● A survey [Diaz et al. 02]
    ● Minimum linear arrangement (MLA)
    ● Bandwidth
    ● Profile
    ● Wavefront, etc.
Do not necessarily produce good layouts for block-based caches

Page 53: Mesh Layouts  for Block-Based Caches

53

Cache-Aware Metric
● Expected number of cache misses given a block size B

$\frac{1}{|A|} \sum_{(i,j) \in A} S(|\varphi_B(i) - \varphi_B(j)|)$

Page 54: Mesh Layouts  for Block-Based Caches

54

Correlation between Cache-Aware Metric and Observed Number of Cache Misses
[Scatter plots: cache-aware metric vs. observed number of cache misses, for M = 5 (R² = 0.97) and M = 25 (R² = 0.97); cache block size = 4KB]

Page 55: Mesh Layouts  for Block-Based Caches

55

Evaluating Existing Layouts
● No known tight bound
● Compare against the best layout we can construct
    ● Employ an efficient sampling method
[Flowchart: for an existing layout, φ, ask "Is it close to the optimal layout?"; outcomes: use it, or build a new one]

Page 56: Mesh Layouts  for Block-Based Caches

56

Implementation
● Modify our open source layout computation codes, OpenCCL
    ● Based on METIS graph partitioning library [Karypis and Kumar 98]
● Processing speed
    ● 15k triangles / second

Page 57: Mesh Layouts  for Block-Based Caches

57

Comparison with Cache-Oblivious Mesh Layouts (COML) [Yoon et al. 05]
● Two major improvements
    ● Accuracy
    ● Usability

Page 58: Mesh Layouts  for Block-Based Caches

58

Pros and Cons
● Limitations
    ● Not directly applicable to dynamic models
● Pros
    ● Generality
    ● Can have benefits without modifying underlying codes

Page 59: Mesh Layouts  for Block-Based Caches

59

Specialization to Meshes
● Assume an equal likelihood of accessing each adjacent node given a node
[Figure: an input mesh and its implicitly derived graph, both on nodes na, nb, nc, nd]

Page 60: Mesh Layouts  for Block-Based Caches

60

Two Possible Block Size Progressions
● Arithmetic progression
    ● 1, 2, 3, 4, …
● Geometric progression
    ● 2^0, 2^1, 2^2, 2^3, …
    ● Well reflects current caching architectures
    ● E.g., L1: 32B, L2: 64B, Page: 4KB, etc.
[Plots of Pr(B) against block size B: uniform distribution and geometric distribution]

Page 61: Mesh Layouts  for Block-Based Caches

61

Cache Miss Ratio with Different Mesh Resolutions of Bunny Model
[Plot: GPU vertex cache miss ratio (cache size: 32) vs. mesh resolution; curves: Hoppe’s rendering sequence [Hoppe 99] and COLg]

Page 62: Mesh Layouts  for Block-Based Caches

62

Correlation between Cache-Aware Metric and Number of Cache Misses
[Scatter plots: cache-aware metric vs. observed number of cache misses, for M = 5 (R² = 0.97) and M = 25 (R² = 0.97); cache block size = 4KB]

Page 63: Mesh Layouts  for Block-Based Caches

63

Correlations with Observed Number of Cache Misses
[Scatter plots: arithmetic CO metric and geometric CO metric vs. cache misses, for one block and for multiple blocks; annotations: R² = 0.98, R² = 0.81; cache block size = 4KB]

Page 64: Mesh Layouts  for Block-Based Caches

64

Comparison with Other Layouts
[Chart comparing layouts: X-axis, BFL, spectral layout [Juvan and Mohar 92], DFL, COML [Yoon et al. 05], Z-curve, COLg, and cache-aware]
Correlation: 0.99, 0.98, 0.94, 0.94
● Up to 2X speedup
● 9% lower than cache-aware layout

Page 65: Mesh Layouts  for Block-Based Caches

65

Comparison with Other Layouts
● Compute eight different layouts
    ● Our geometric cache-oblivious layout
    ● Our cache-aware layout
    ● Breadth-first layout
    ● Depth-first layout
    ● Spectral layout [Juvan and Mohar 92]
    ● Cache-oblivious mesh layout [Yoon et al. 05]
    ● Z-curve [Sagan 94]
    ● X-axis sorted layout

Page 66: Mesh Layouts  for Block-Based Caches

66

Our Layout of Bunny Model