Register blocking performance (yelick/papers/ppsc99.pdf)


Post on 27-Apr-2020


Page 2:

[Figure: Register blocking performance. x-axis: columns in register block (1 to 12); y-axis: Mflops/sec. (10 to 55); legend: 1x, 2x, 3x, 4x, 5x, 6x, 7x, 8x, 9x, 10x, 11x, 12x.]

Page 3:

[Figure: Performance of register blocked code on Ultrasparc. x-axis: matrices (1 to 10); y-axis: Mflops/sec. (0 to 50); legend: Base performance, Performance of blocked code (chosen size), Performance of blocked code (best).]

Bar labels (block size with a parenthesized value), in extraction order: 2x2(1.23), 2x2(1.35), 2x1(1.00), 6x6(1.19), 3x3(1.06), 3x3(1.00), 3x3(1.00), 3x3(1.00), 3x3(1.00), 2x2(1.21), 2x2(1.21), 6x6(1.15), 6x2(1.13), 2x2(1.33), 2x1(1.17), 4x4(1.00), 4x2(1.00), 2x2(1.98), 2x1(1.71), 2x2(1.23).

matrix  size(N)  nonzeros  Application area
1       23560    484K      Airfoil eigenvalue calculation
2       41092    1.7M      2D PDE problem
3       30237    1.5M      Automobile frame stiffness matrix
4       13965    1.0M      FEM stiffness matrix
5       24696    1.8M      FEM stiffness matrix
6       52329    2.7M      Engine block stiffness matrix
7       54870    2.7M      Structure from shuttle rocket booster
8       4134     94K       Chemical process separation
9       62424    1.7M      Unstructured Euler solver
10      26068    177K      Device simulation

Page 4:

[Diagram: a sparse matrix and its storage arrays. Labels: R, C, cache, y, x, A. The matrix is drawn with nonzero entries (A) and zeros (0); entry indices shown include 03, 06, 14, 17, 21, 22, 25, 26, 30, 34, 42, 43, 46, 47, 51, 54, 61, 62, 65, 67, 72, 73, 74, 75. Storage arrays: value, col_idx, block_ptr, row_start.]

Array fragments as extracted: "03 21 22 30"; "00 3 1 2 6"; "06"; "4"; "14"; "0 52 2 4 6 11 19 24... ..."; "8 16".
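The array names in the storage diagram (value, col_idx, row_start) suggest a compressed sparse row style layout. The following is a minimal sketch, not the authors' code, of a plain sparse matrix-vector product next to a 2x2 register-blocked variant. The function names, the b_ prefixed arrays, the row-major tile layout, and the small example matrix are all assumptions made for illustration; block_ptr from the diagram is not modeled here.

```python
def spmv_csr(row_start, col_idx, value, x):
    """y = A*x with A stored row by row (CSR-style arrays)."""
    n_rows = len(row_start) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_start[i], row_start[i + 1]):
            acc += value[k] * x[col_idx[k]]
        y[i] = acc
    return y


def to_blocked_2x2(dense):
    """Build 2x2-blocked arrays from a dense list of lists.

    Every 2x2 tile holding at least one nonzero is stored in full
    (row-major), so some explicit zeros are filled in.
    Assumes both dimensions are even.
    """
    n_rows, n_cols = len(dense), len(dense[0])
    b_row_start, b_col_idx, b_value = [0], [], []
    for bi in range(0, n_rows, 2):
        for bj in range(0, n_cols, 2):
            tile = [dense[bi][bj], dense[bi][bj + 1],
                    dense[bi + 1][bj], dense[bi + 1][bj + 1]]
            if any(v != 0 for v in tile):
                b_col_idx.append(bj)      # first column of the tile
                b_value.extend(tile)
        b_row_start.append(len(b_col_idx))
    return b_row_start, b_col_idx, b_value


def spmv_blocked_2x2(b_row_start, b_col_idx, b_value, x):
    """y = A*x over 2x2 tiles; y0/y1 and x0/x1 stand in for registers."""
    n_block_rows = len(b_row_start) - 1
    y = [0.0] * (2 * n_block_rows)
    for bi in range(n_block_rows):
        y0 = y1 = 0.0
        for k in range(b_row_start[bi], b_row_start[bi + 1]):
            j = b_col_idx[k]
            x0, x1 = x[j], x[j + 1]       # loaded once, used for two rows
            v = 4 * k
            y0 += b_value[v] * x0 + b_value[v + 1] * x1
            y1 += b_value[v + 2] * x0 + b_value[v + 3] * x1
        y[2 * bi] = y0
        y[2 * bi + 1] = y1
    return y


# Hypothetical 4x4 example with 7 nonzeros.
dense = [[1, 2, 0, 0],
         [3, 4, 0, 0],
         [0, 0, 0, 5],
         [0, 6, 7, 0]]
x = [1, 2, 3, 4]

row_start = [0, 2, 4, 5, 7]
col_idx = [0, 1, 0, 1, 3, 1, 2]
value = [1, 2, 3, 4, 5, 6, 7]

b_row_start, b_col_idx, b_value = to_blocked_2x2(dense)
y_csr = spmv_csr(row_start, col_idx, value, x)
y_blk = spmv_blocked_2x2(b_row_start, b_col_idx, b_value, x)
fill = len(b_value) / len(value)   # stored entries per true nonzero
```

In this example three tiles are stored, so 12 values are kept for 7 nonzeros. The blocked loop trades that extra storage and arithmetic for one column index per tile and reuse of the loaded x entries.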

Page 5:

[Figure: Performance of Static Cache blocking on Random Matrices. x-axis: Density of Random Matrices (0.05 to 0.25); y-axis: Performance (0 to 20); legend: 128x128 block, 256x256 block, 512x512 block, 1024x1024 block, 2048x2048 block, 4096x4096 block, 8192x8192 block, 16384x16384 block, 32768x32768 block, 65536x65536 block.]
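The plot above compares square cache-block sizes on random matrices. As a rough sketch of what cache blocking does to the traversal order, the code below groups the nonzeros of a random matrix into block x block tiles and multiplies tile by tile, so that each tile touches only a bounded slice of x and y. All names here (cache_block, spmv_cache_blocked, random_matrix) are hypothetical, the coordinate-triple storage is a simplification, and integer values are used only so the results compare exactly. It does not measure performance.

```python
import random
from collections import defaultdict


def spmv_coo(entries, n_rows, x):
    """Reference y = A*x over (row, col, val) triples in their given order."""
    y = [0] * n_rows
    for i, j, v in entries:
        y[i] += v * x[j]
    return y


def cache_block(entries, block):
    """Group triples by the block x block tile they fall in."""
    tiles = defaultdict(list)
    for i, j, v in entries:
        tiles[(i // block, j // block)].append((i, j, v))
    return tiles


def spmv_cache_blocked(entries, n_rows, x, block):
    """y = A*x visiting one tile at a time, tiles in row-major order.

    Within a tile only x[j] for one column range and y[i] for one row
    range are touched, which is the locality cache blocking aims for.
    """
    y = [0] * n_rows
    tiles = cache_block(entries, block)
    for key in sorted(tiles):
        for i, j, v in tiles[key]:
            y[i] += v * x[j]
    return y


def random_matrix(n, density, seed=0):
    """n x n matrix with each entry present with probability `density`."""
    rng = random.Random(seed)
    return [(i, j, rng.randint(1, 9))
            for i in range(n) for j in range(n)
            if rng.random() < density]


n = 64
entries = random_matrix(n, density=0.1)
x = list(range(n))
y_ref = spmv_coo(entries, n, x)
y_by_block = {b: spmv_cache_blocked(entries, n, x, b) for b in (8, 16, 32)}
```

Every block size gives the same product as the reference loop; only the order in which nonzeros are visited changes.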

Page 6:

[Diagram: labels Proc 3, Proc 2, Proc 1, Proc 0; C1, C2, C4; S1, S2, S3, S4.]

Page 7:

[Figure: Performance of cache blocking on SMP: 1. x-axis: number of processors (1 to 8); y-axis: Mflops/proc. (0 to 35); legend: base performance, C1, C2, C3, C4.]

[Figure: Performance of cache blocking on SMP: 5. x-axis: number of processors (1 to 8); y-axis: Mflops/proc. (0 to 35); legend: base performance, C1, C2, C3, C4.]

Page 8:

[Figure: Performance of reordering before blocking on SMP: 2. x-axis: number of processors (1 to 8); y-axis: Mflops (5 to 40); legend: base performance, blocked performance, blocked non-reordered perf., blocked rcm-reordered perf., blocked hmetis-reordered perf.]

[Figure: Performance of reordering after blocking on SMP: 2. x-axis: number of processors (1 to 8); y-axis: Mflops (5 to 40); legend: base performance, blocked performance, blocked non-reordered perf., blocked rcm-reordered perf., blocked hmetis-reordered perf.]
