PERI: Auto-tuning Memory Intensive Kernels for Multicore

Samuel Williams1,2, Kaushik Datta1, Jonathan Carter2, Leonid Oliker1,2, John Shalf2, Katherine Yelick1,2, David Bailey2
1University of California, Berkeley   2Lawrence Berkeley National Laboratory
[email protected]


Page 1: Title slide (title, authors, and affiliations as above)

Page 2: Motivation

• Multicore is the de facto solution for increased peak performance for the next decade.
• However, given the diversity of architectures, multicore guarantees neither good scalability nor good (attained) performance.
• We need a solution that provides performance portability.

Page 3: What's a Memory Intensive Kernel?

Page 4: Arithmetic Intensity in HPC

• True arithmetic intensity (AI) ~ total flops / total DRAM bytes.
• Some HPC kernels have an arithmetic intensity that scales with problem size (increased temporal locality); on others it remains constant.
• Arithmetic intensity is ultimately limited by compulsory traffic.
• Arithmetic intensity is diminished by conflict or capacity misses.

[Figure: arithmetic intensity spectrum]
O(1): SpMV, BLAS1/2; stencils (PDEs); lattice methods
O(log(N)): FFTs
O(N): dense linear algebra (BLAS3); particle methods

Page 5: Memory Intensive

• Let us define a kernel as memory intensive when its arithmetic intensity is less than the machine's balance (flop:byte).
• Performance ~ Stream BW * Arithmetic Intensity.
• Technology allows peak flops to improve faster than bandwidth, so more and more kernels will be considered memory intensive.

Page 6: Outline

• Motivation
• Memory Intensive Kernels
• Multicore SMPs of Interest
• Software Optimizations
• Introduction to Auto-tuning
• Auto-tuning Memory Intensive Kernels:
  - Sparse Matrix-Vector Multiplication (SpMV)
  - Lattice-Boltzmann Magneto-Hydrodynamics (LBMHD)
  - Heat Equation Stencil (3D Laplacian)
• Summary

Page 7: Multicore SMPs of Interest (used throughout the rest of the talk)

Page 8: Multicore SMPs Used

[Figure: block diagrams of the four SMPs.
AMD Opteron 2356 (Barcelona): two quad-core sockets; 512KB victim cache per core; 2MB shared quasi-victim L3 (32 way) and SRI/crossbar per socket; HyperTransport links (4GB/s each direction); 2x64b memory controllers per socket; 10.66 GB/s to 667MHz DDR2 DIMMs.
Intel Xeon E5345 (Clovertown): two quad-core sockets; 4MB shared L2 per core pair; two FSBs (10.66 GB/s each) into a chipset with 4x64b controllers; 21.33 GB/s (read) / 10.66 GB/s (write) to 667MHz FBDIMMs.
Sun T2+ T5140 (Victoria Falls): two sockets of eight multithreaded SPARC cores; per-socket crossbar (179 GB/s / 90 GB/s); 4MB shared L2 (16 way, 64b interleaved); 4 coherency hubs; 2x128b controllers; 21.33 GB/s / 10.66 GB/s to 667MHz FBDIMMs; 8 x 6.4 GB/s inter-socket links (1 per hub per direction).
IBM QS20 Cell Blade: two Cell chips, each with a VMT PPE (512K L2) and eight SPEs (256K local store + MFC each) on the EIB ring network; XDR memory controllers; 25.6 GB/s to 512MB XDR DRAM per chip; BIF links (<20GB/s each direction).]

Page 9: Multicore SMPs Used

[Figure: the same four SMP block diagrams as on Page 8. Clovertown, Barcelona, and Victoria Falls have a conventional cache-based memory hierarchy; the Cell Blade has a disjoint local-store memory hierarchy.]

Page 10: Multicore SMPs Used

[Figure: the same four SMP block diagrams. The cache-based machines use a Pthreads implementation; the Cell's local stores use a libspe implementation.]

Page 11: Multicore SMPs Used

[Figure: the same four SMP block diagrams, highlighting the machines with multithreaded cores.]

Page 12: Multicore SMPs Used (threads)

[Figure: the same four SMP block diagrams, annotated with thread counts. Barcelona: 8 threads; Clovertown: 8 threads; Cell Blade: 16* threads; Victoria Falls: 128 threads. *SPEs only]

Page 13: Multicore SMPs Used (peak double precision flops)

[Figure: the same four SMP block diagrams, annotated with peak DP rates. Clovertown: 75 GFlop/s; Barcelona: 74 GFlop/s; Cell Blade: 29* GFlop/s; Victoria Falls: 19 GFlop/s. *SPEs only]

Page 14: Multicore SMPs Used (total DRAM bandwidth)

[Figure: the same four SMP block diagrams, annotated with aggregate DRAM bandwidth. Clovertown: 21 GB/s (read), 10 GB/s (write); Barcelona: 21 GB/s; Cell Blade: 51 GB/s; Victoria Falls: 42 GB/s (read), 21 GB/s (write)]

Page 15: Categorization of Software Optimizations

Page 16: Optimization Categorization

• Maximizing (attained) in-core performance
• Minimizing (total) memory traffic
• Maximizing (attained) memory bandwidth

Page 17: Optimization Categorization

Maximizing in-core performance:
• Exploit in-core parallelism (ILP, DLP, etc.)
• Good (enough) floating-point balance

Page 18: Optimization Categorization

Maximizing in-core performance:
• Exploit in-core parallelism (ILP, DLP, etc.)
• Good (enough) floating-point balance
• Techniques: unroll & jam, explicit SIMD, reordering, branch elimination (an unroll-and-jam sketch follows this slide).
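To make the first of these techniques concrete, here is a minimal, hypothetical unroll-and-jam sketch in C. It is not code from the talk: a small column-major dense matrix-vector kernel is assumed, its outer loop is unrolled by two and the bodies jammed together so two independent multiply-adds are in flight per inner iteration, exposing ILP.

/* Hypothetical sketch: unroll-and-jam applied to y += A*x for a
 * column-major n x n dense matrix A. Illustrates the generic
 * technique only; the talk's tuned kernels differ. */
void dgemv_unroll2(int n, const double *A, const double *x, double *y)
{
    for (int j = 0; j + 1 < n; j += 2) {       /* unroll j by 2 */
        double x0 = x[j], x1 = x[j + 1];
        for (int i = 0; i < n; i++) {
            /* two independent multiply-adds per iteration -> more ILP */
            y[i] += A[j * n + i] * x0 + A[(j + 1) * n + i] * x1;
        }
    }
    if (n & 1) {                               /* remainder column */
        double x0 = x[n - 1];
        for (int i = 0; i < n; i++)
            y[i] += A[(n - 1) * n + i] * x0;
    }
}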

Page 19: Optimization Categorization

Maximizing in-core performance:
• Exploit in-core parallelism (ILP, DLP, etc.)
• Good (enough) floating-point balance
• Techniques: unroll & jam, explicit SIMD, reordering, branch elimination.

Minimizing memory traffic; eliminate:
• Capacity misses
• Conflict misses
• Compulsory misses
• Write allocate behavior
• Techniques: cache blocking, array padding, data compression, streaming stores (a cache-blocking sketch follows this slide).

Maximizing memory bandwidth:
• Exploit NUMA
• Hide memory latency
• Satisfy Little's Law
• Techniques: memory affinity, SW prefetch, DMA lists, unit-stride streams, TLB blocking.
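As a concrete illustration of cache blocking, a minimal sketch in C (an assumed example, not the talk's code): a large out-of-place transpose is tiled so each pair of tiles fits in cache, eliminating the capacity misses a naive traversal would incur.

/* Hypothetical sketch: cache-blocked transpose. B is chosen so a
 * BxB tile of src plus a BxB tile of dst fit in cache together. */
#define B 64
void transpose_blocked(int n, const double *src, double *dst)
{
    for (int ii = 0; ii < n; ii += B)
        for (int jj = 0; jj < n; jj += B)
            for (int i = ii; i < ii + B && i < n; i++)
                for (int j = jj; j < jj + B && j < n; j++)
                    dst[j * n + i] = src[i * n + j]; /* unit stride on src */
}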

Page 20: Optimization Categorization (repeats the three categories and techniques from Page 19)

Page 21: Optimization Categorization (repeats the three categories and techniques from Page 19)

Page 22: Optimization Categorization

(Same three categories and techniques as on Page 19.)

Each optimization has a large parameter space. What are the optimal parameters?

Page 23: Introduction to Auto-tuning

Page 24: Out-of-the-box Code Problem

• Out-of-the-box code has (unintentional) assumptions on: cache sizes (>10MB), functional unit latencies (~1 cycle), etc.
• These assumptions may result in poor performance when they exceed the machine's characteristics.

Page 25: Auto-tuning?

• Provides performance portability across the existing breadth and evolution of microprocessors.
• The one-time, up-front productivity cost is amortized by the number of machines it is used on.
• Auto-tuning does not invent new optimizations; it automates the exploration of the optimization and parameter space.
• Two components: a parameterized code generator (we wrote ours in Perl) and an auto-tuning exploration benchmark (a combination of heuristics and exhaustive search); a sketch of such a search driver follows this slide.
• Can be extended with ISA-specific optimizations (e.g. DMA, SIMD).
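A minimal sketch in C of what the exploration half might look like (assumed, not the talk's actual Perl generator or benchmark; kernel_variant() is a hypothetical stand-in for one generated code variant): time each variant over a parameter sweep and keep the best.

#include <stdio.h>
#include <time.h>

/* Hypothetical: the code generator emitted one variant per
 * (unroll, block) parameter point, dispatched through here. */
extern void kernel_variant(int unroll, int block);

static double time_variant(int unroll, int block)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    kernel_variant(unroll, block);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
}

int main(void)
{
    double best = 1e30;
    int best_u = 0, best_b = 0;
    for (int u = 1; u <= 8; u *= 2)           /* unrolling sweep */
        for (int b = 16; b <= 256; b *= 2) {  /* blocking sweep  */
            double t = time_variant(u, b);
            if (t < best) { best = t; best_u = u; best_b = b; }
        }
    printf("best: unroll=%d block=%d (%.3f s)\n", best_u, best_b, best);
    return 0;
}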

Page 26: Auto-tuning Memory Intensive Kernels

• Sparse Matrix-Vector Multiplication (SpMV): SC'07
• Lattice-Boltzmann Magneto-Hydrodynamics (LBMHD): IPDPS'08
• Explicit Heat Equation (Stencil): SC'08

Page 27: Auto-tuning Sparse Matrix-Vector Multiplication (SpMV)

Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Supercomputing (SC), 2007.

Page 28: Sparse Matrix-Vector Multiplication

What's a sparse matrix?
• Most entries are 0.0.
• There is a performance advantage in only storing/operating on the nonzeros.
• Requires significant metadata to reconstruct the matrix structure.

What's SpMV?
• Evaluate y = Ax, where A is a sparse matrix and x and y are dense vectors.

Challenges:
• Very low arithmetic intensity (often < 0.166 flops/byte).
• Difficult to exploit ILP (bad for superscalar) and difficult to exploit DLP (bad for SIMD).

(a) algebra conceptualization: y = Ax
(b) CSR data structure: A.val[], A.rowStart[], A.col[]
(c) CSR reference code:

for (r = 0; r < A.rows; r++) {
  double y0 = 0.0;
  /* accumulate the nonzeros of row r */
  for (i = A.rowStart[r]; i < A.rowStart[r+1]; i++) {
    y0 += A.val[i] * x[A.col[i]];
  }
  y[r] = y0;
}

Page 29: The Dataset (matrices)

• Unlike dense BLAS, performance is dictated by sparsity.
• Suite of 14 matrices, all bigger than the caches of our SMPs.
• We'll also include a median performance number.

[Figure: spy plots of the 14 matrices (Dense, Protein, FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbor, QCD, FEM/Ship, Economics, Epidemiology, FEM/Accelerator, Circuit, webbase, LP), spanning a spectrum from a 2K x 2K dense matrix stored in sparse format, through well-structured matrices (sorted by nonzeros/row) and a poorly-structured hodgepodge, to extreme aspect ratio (linear programming)]

Page 30: SpMV Performance (simple parallelization)

• Out-of-the-box SpMV performance on a suite of 14 matrices.
• Scalability isn't great.
• Is this performance good?

[Chart legend: Naïve Pthreads; Naïve]

Page 31: Auto-tuned SpMV Performance (portable C)

• Fully auto-tuned SpMV performance across the suite of matrices.
• Why do some optimizations work better on some architectures?

[Chart legend: +Cache/LS/TLB Blocking; +Matrix Compression; +SW Prefetching; +NUMA/Affinity; Naïve Pthreads; Naïve]

Page 32: Auto-tuned SpMV Performance (architecture specific optimizations)

• Fully auto-tuned SpMV performance across the suite of matrices.
• Included an SPE/local store optimized version.
• Why do some optimizations work better on some architectures?

[Chart legend: +Cache/LS/TLB Blocking; +Matrix Compression; +SW Prefetching; +NUMA/Affinity; Naïve Pthreads; Naïve]

Page 33: Auto-tuned SpMV Performance (architecture specific optimizations)

• Fully auto-tuned SpMV performance across the suite of matrices, including the SPE/local store optimized version.
• Why do some optimizations work better on some architectures?
• Auto-tuning resulted in better performance, but did it result in good performance?

[Chart legend: +Cache/LS/TLB Blocking; +Matrix Compression; +SW Prefetching; +NUMA/Affinity; Naïve Pthreads; Naïve]

Page 34: Roofline Model

[Figure: empty roofline axes. Attainable GFlop/s (log scale, 1 to 128) vs. flop:DRAM byte ratio (log scale, 1/8 to 16)]

Page 35: Naïve Roofline Model

• Unrealistically optimistic model.
• Uses a hand-optimized Stream BW benchmark.

[Figure: roofline plots for the four SMPs (Intel Xeon E5345 (Clovertown), AMD Opteron 2356 (Barcelona), Sun T2+ T5140 (Victoria Falls), IBM QS20 Cell Blade), each bounded by peak DP and the hand-optimized Stream BW]

Gflop/s(AI) = min( peak Gflop/s, StreamBW * AI )
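The bound is easy to compute directly; a minimal sketch in C (the machine numbers below are placeholders in the spirit of the slides, not measurements):

#include <stdio.h>

/* attainable GFlop/s = min(peak GFlop/s, Stream BW (GB/s) * AI (flops/byte)) */
static double roofline(double peak_gflops, double stream_gbs, double ai)
{
    double bw_bound = stream_gbs * ai;
    return bw_bound < peak_gflops ? bw_bound : peak_gflops;
}

int main(void)
{
    /* e.g. a machine with ~74 GFlop/s peak and ~21 GB/s sustained BW */
    for (double ai = 0.0625; ai <= 8.0; ai *= 2.0)
        printf("AI=%6.4f -> %6.2f GFlop/s\n", ai, roofline(74.0, 21.0, ai));
    return 0;
}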

Page 36: Roofline model for SpMV

• Double precision roofline models.
• In-core optimizations 1..i; DRAM optimizations 1..j.
• FMA is inherent in SpMV (place it at the bottom).

[Figure: roofline plots for the four SMPs with in-core ceilings (peak DP; w/out SIMD; w/out ILP; w/out FMA or mul/add imbalance; 25% / 12% FP) and bandwidth ceilings (w/out SW prefetch; w/out NUMA; bank conflicts; dataset fits in snoop filter)]

GFlops_i,j(AI) = min( InCoreGFlops_i, StreamBW_j * AI )

Page 37: Roofline model for SpMV (overlay arithmetic intensity)

• Two unit-stride streams.
• Inherent FMA.
• No ILP, no DLP.
• FP is 12-25% of instructions.
• Naïve compulsory flop:byte < 0.166.
• No naïve SPE implementation.

[Figure: the same four roofline plots with SpMV's arithmetic intensity overlaid]

Page 38: Roofline model for SpMV (out-of-the-box parallel)

• Two unit-stride streams; inherent FMA; no ILP; no DLP; FP is 12-25%.
• Naïve compulsory flop:byte < 0.166.
• For simplicity: a dense matrix in sparse format.
• No naïve SPE implementation.

[Figure: the same four roofline plots with out-of-the-box parallel performance plotted]

Page 39: Roofline model for SpMV (NUMA & SW prefetch)

• Compulsory flop:byte ~0.166.
• Utilize all memory channels.
• No naïve SPE implementation.

[Figure: the same four roofline plots; NUMA and SW prefetch lift performance toward the bandwidth ceiling]

Page 40: Roofline model for SpMV (matrix compression)

• Inherent FMA.
• Register blocking improves ILP, DLP, the flop:byte ratio, and the FP% of instructions (a BCSR sketch follows this slide).

[Figure: the same four roofline plots with matrix compression applied]
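A hedged sketch of what register blocking means for SpMV, using 2x2 block CSR (BCSR), one common variant; the talk's auto-tuner searches many block shapes and its actual code differs. Each stored block contributes four multiply-adds with reused x entries (more ILP/DLP), and one column index covers four values (better flop:byte).

/* Hypothetical 2x2 BCSR SpMV sketch. nbrows: number of block rows;
 * browStart/bcol index 2x2 blocks; val stores each block's 4 entries
 * contiguously (row-major within the block). */
void spmv_bcsr2x2(int nbrows, const int *browStart, const int *bcol,
                  const double *val, const double *x, double *y)
{
    for (int br = 0; br < nbrows; br++) {
        double y0 = 0.0, y1 = 0.0;              /* two rows at once */
        for (int b = browStart[br]; b < browStart[br + 1]; b++) {
            const double *v = &val[4 * b];
            double x0 = x[bcol[b]], x1 = x[bcol[b] + 1];
            y0 += v[0] * x0 + v[1] * x1;        /* independent FMAs */
            y1 += v[2] * x0 + v[3] * x1;
        }
        y[2 * br]     = y0;
        y[2 * br + 1] = y1;
    }
}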

Page 41: Roofline model for SpMV (matrix compression)

• Inherent FMA.
• Register blocking improves ILP, DLP, the flop:byte ratio, and the FP% of instructions.
• Performance is bandwidth limited.

[Figure: the same four roofline plots; tuned SpMV reaches the bandwidth-limited roofline]

Page 42: Auto-tuning Lattice-Boltzmann Magneto-Hydrodynamics (LBMHD)

Samuel Williams, Jonathan Carter, Leonid Oliker, John Shalf, Katherine Yelick, "Lattice Boltzmann Simulation Optimization on Leading Multicore Platforms", International Parallel & Distributed Processing Symposium (IPDPS), 2008. Best Paper, Application Track.

Page 43: LBMHD

• Plasma turbulence simulation via the Lattice Boltzmann Method.
• Two distributions: a momentum distribution (27 scalar components) and a magnetic distribution (15 vector components).
• Three macroscopic quantities: density, momentum (vector), magnetic field (vector).
• Arithmetic intensity: must read 73 doubles and update 79 doubles per lattice update (1216 bytes), requiring about 1300 floating point operations; just over 1.0 flops/byte ideal (1300 / 1216 ≈ 1.07).
• Cache capacity requirements are independent of problem size.
• Two problem sizes: 64^3 (0.3 GB) and 128^3 (2.5 GB).
• Periodic boundary conditions.

[Figure: lattice diagrams of the momentum distribution (27 components), the magnetic distribution (15 components), and the macroscopic variables on +X/+Y/+Z axes]

Page 44: Initial LBMHD Performance

• Generally, scalability looks good.
• Scalability is good, but is performance good?
• *collision() only

[Chart legend: Naïve+NUMA]

Page 45: Auto-tuned LBMHD Performance (portable C)

• Auto-tuning avoids cache conflict and TLB capacity misses.
• Additionally, it exploits SIMD where the compiler doesn't.
• Includes an SPE/Local Store optimized version.
• *collision() only

[Chart legend: +Vectorization; +Padding; Naïve+NUMA]

Page 46: Auto-tuned LBMHD Performance (architecture specific optimizations)

• Auto-tuning avoids cache conflict and TLB capacity misses.
• Additionally, it exploits SIMD where the compiler doesn't.
• Includes an SPE/Local Store optimized version.
• *collision() only

[Chart legend: +Explicit SIMDization; +SW Prefetching; +Unrolling; +Vectorization; +Padding; Naïve+NUMA; +small pages]

Page 47: Roofline model for LBMHD

• Far more adds than multiplies (imbalance).
• Huge data sets.

[Figure: roofline plots for the four SMPs with LBMHD-relevant ceilings (peak DP; mul/add imbalance; w/out SIMD; w/out ILP; 25% / 12% FP; w/out SW prefetch; w/out NUMA; bank conflicts; dataset fits in snoop filter)]

Page 48: Roofline model for LBMHD (overlay arithmetic intensity)

• Far more adds than multiplies (imbalance).
• Essentially random access to memory.
• Flop:byte ratio ~0.7.
• NUMA allocation/access; little ILP; no DLP; high conflict misses.
• No naïve SPE implementation.

[Figure: the same four roofline plots with LBMHD's arithmetic intensity overlaid]

Page 49: Roofline model for LBMHD (out-of-the-box parallel performance)

• Far more adds than multiplies (imbalance); essentially random access to memory; flop:byte ratio ~0.7.
• NUMA allocation/access; little ILP; no DLP; high conflict misses.
• Peak Victoria Falls performance comes with 64 threads (out of 128) due to high conflict misses.
• No naïve SPE implementation.

[Figure: the same four roofline plots with out-of-the-box parallel performance plotted]

Page 50: Roofline model for LBMHD (Padding, Vectorization, Unrolling, Reordering, ...)

• Vectorize the code to eliminate TLB capacity misses.
• Ensures unit-stride access (the bottom bandwidth ceiling).
• Tune for the optimal vector length (VL).
• Clovertown is pinned to the lower BW ceiling.

[Figure: the same four roofline plots with the tuned portable-C performance plotted]

Page 51: Roofline model for LBMHD (SIMDization + cache bypass)

• Make SIMDization explicit.
• Technically, this swaps the ILP and SIMD ceilings.
• Use the cache bypass instruction movntpd (a streaming-store sketch follows this slide).
• Increases the flop:byte ratio to ~1.0 on x86/Cell.

[Figure: the same four roofline plots with SIMDized, cache-bypassing performance plotted]
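A minimal x86 sketch of the cache-bypass idea (an assumed example, not the talk's LBMHD code; requires SSE2 and 16-byte-aligned data): movntpd is emitted by the _mm_stream_pd intrinsic, writing results to DRAM without first reading the destination line into cache, which removes write-allocate traffic.

#include <emmintrin.h>  /* SSE2: _mm_stream_pd emits movntpd */

/* Scale src into dst, bypassing the cache on the stores.
 * src and dst must be 16-byte aligned; n must be even. */
void scale_stream(int n, double a, const double *src, double *dst)
{
    __m128d va = _mm_set1_pd(a);
    for (int i = 0; i < n; i += 2) {
        __m128d v = _mm_mul_pd(va, _mm_load_pd(&src[i]));
        _mm_stream_pd(&dst[i], v);  /* non-temporal store: no write allocate */
    }
    _mm_sfence();                   /* order the streaming stores */
}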

Page 52: Roofline model for LBMHD (SIMDization + cache bypass)

• Make SIMDization explicit; this swaps the ILP and SIMD ceilings.
• Use the cache bypass instruction movntpd; increases the flop:byte ratio to ~1.0 on x86/Cell.
• 3 out of 4 machines hit the Roofline.

[Figure: the same four roofline plots; tuned LBMHD reaches the roofline on 3 of the 4 machines]

Page 53: The Heat Equation Stencil

Kaushik Datta, Mark Murphy, Vasily Volkov, Samuel Williams, Jonathan Carter, Leonid Oliker, David Patterson, John Shalf, Katherine Yelick, "Stencil Computation Optimization and Autotuning on State-of-the-Art Multicore Architectures", Supercomputing (SC) (to appear), 2008.

Page 54: The Heat Equation Stencil

[Figure: the PDE grid and the 7-point stencil for the heat equation PDE: points (x±1,y,z), (x,y±1,z), (x,y,z±1), and (x,y,z)]

• Explicit heat equation (Laplacian, ~∇²F(x,y,z)) on a regular grid.
• Storage: one double per point in space.
• 7-point nearest-neighbor stencil.
• Must read every point from DRAM, perform 8 flops (a linear combination), and write every point back to DRAM.
• Just over 0.5 flops/byte (ideal); a reference loop follows this slide.
• Cache locality is important.
• Run one problem size: 256^3.
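A minimal reference sketch of the 7-point stencil in C (Jacobi-style, reading one grid and writing another; the coefficient names alpha and beta are placeholders, not from the talk):

/* next[x,y,z] = alpha*now[x,y,z] + beta*(sum of the 6 face neighbors):
 * 8 flops per point. Grid is (n+2)^3 with a one-point ghost layer. */
void heat_step(int n, double alpha, double beta,
               const double *now, double *next)
{
    const int nx = n + 2, plane = nx * nx;
    for (int z = 1; z <= n; z++)
        for (int y = 1; y <= n; y++)
            for (int x = 1; x <= n; x++) {
                int i = z * plane + y * nx + x;
                next[i] = alpha * now[i]
                        + beta * (now[i - 1]     + now[i + 1]
                                + now[i - nx]    + now[i + nx]
                                + now[i - plane] + now[i + plane]);
            }
}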

Page 55: Stencil Performance (out-of-the-box code)

• Expect performance to be between SpMV and LBMHD.
• Scalability is universally poor.
• Performance is poor.

[Chart legend: Naïve]

Page 56: Auto-tuned Stencil Performance (portable C)

• Proper NUMA management is essential on most architectures (a first-touch sketch follows this slide).
• Moreover, proper cache blocking is still essential even with MBs of cache.

[Chart legend: +SW Prefetching; +Unrolling; +Thread/Cache Blocking; +Padding; +NUMA; Naïve; +Collaborative Threading]
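A hedged sketch of the first-touch NUMA management this alludes to (using an OpenMP pragma for brevity; the talk's implementations are Pthreads-based): on Linux, a page is typically mapped to the socket whose thread first writes it, so initializing the grid with the same static thread decomposition used by the compute loop keeps each thread's data local.

#include <stdlib.h>

/* First-touch initialization: each thread touches (and thereby maps)
 * exactly the portion of the grid it will later compute on. */
double *alloc_grid_numa(size_t npoints)
{
    double *g = malloc(npoints * sizeof *g);
    if (!g) return NULL;
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)npoints; i++)
        g[i] = 0.0;  /* the writing thread's socket gets the page */
    return g;
}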

Page 57: Auto-tuned Stencil Performance (architecture specific optimizations)

• Cache bypass can significantly improve Barcelona performance.
• DMA, SIMD, and cache blocking were essential on Cell.

[Chart legend: +Explicit SIMDization; +SW Prefetching; +Unrolling; +Thread/Cache Blocking; +Padding; +NUMA; +Cache bypass / DMA; Naïve; +Collaborative Threading]

Page 58: Roofline model for Stencil (out-of-the-box code)

• Large datasets; 2 unit-stride streams; no NUMA.
• Little ILP; no DLP; far more adds than multiplies (imbalance).
• Ideal flop:byte ratio of 1/3.
• High locality requirements: capacity and conflict misses will severely impair the flop:byte ratio.

[Figure: roofline plots for the four SMPs with stencil-relevant ceilings]

Page 59: Roofline model for Stencil (out-of-the-box code)

• Same characteristics as above; no naïve SPE implementation.

[Figure: the same four roofline plots with out-of-the-box performance plotted]

Page 60: P A R A L L E L C O M P U T I N G L A B O R A T O R Y PERI...P A R A L L E L C O M P U T I N G L A B O R A T O R Y EECS Electrical Engineering and Computer Sciences 1 BERKELEY PAR

60

EECSElectrical Engineering and

Computer Sciences BERKELEY PAR LAB

Roofline model for Stencil(out-of-the-box code)

Large datasets 2 unit stride streams No NUMA Little ILP No DLP Far more adds than

multiplies (imbalance) Ideal flop:byte ratio 1/3 High locality requirements Capacity and conflict

misses will severely impairflop:byte ratio

1

2

1/16

flop:DRAM byte ratio

atta

inab

le G

flop/

s

4

8

16

32

64

128

1/8 1/4 1/2 1 2 4 81

2

1/16

flop:DRAM byte ratio

atta

inab

le G

flop/

s

4

8

16

32

64

128

1/8 1/4 1/2 1 2 4 8

1

2

1/16

flop:DRAM byte ratio

atta

inab

le G

flop/

s

4

8

16

32

64

128

1/8 1/4 1/2 1 2 4 81

2

1/16

flop:DRAM byte ratio

atta

inab

le G

flop/

s

4

8

16

32

64

128

1/8 1/4 1/2 1 2 4 8

25% FP

peak DP

12% FP

w/out S

W pr

efetch

w/out N

UMA

w/out FMA

peak DP

w/out ILP

w/out SIMD

w/out N

UMAba

nk co

nflict

s

peak DP

mul/add imbalance

datas

et fits

in sn

oop f

ilter

peak DP

w/out SIMD

w/out ILP

mul/add imbalance

w/out S

W pr

efetch

w/out N

UMA

Opteron 2356(Barcelona)

Intel Xeon E5345(Clovertown)

Sun T2+ T5140(Victoria Falls)

w/out SIMD

w/out ILP

No naïve SPEimplementation

IBM QS20Cell Blade

Page 61: Roofline model for Stencil (NUMA, cache blocking, unrolling, prefetch, …)

Cache blocking helps ensure the flop:byte ratio is as close as possible to 1/3 (the 1/3 is worked out below)
Clovertown has huge caches but is pinned to a lower bandwidth ceiling
Cache management is essential when capacity per thread is low

[Figure: the same four roofline diagrams, with the portable-C auto-tuned stencil plotted on each machine.]
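Where the ideal 1/3 comes from (and why the cache bypass on page 62 moves it toward 1/2): per double-precision point, the 7-point stencil performs 8 flops against one 8-byte read, one 8-byte store, and, on a write-allocate cache, one extra 8-byte fill of the store target:

    \mathrm{AI}_{\text{write-allocate}}
      = \frac{8\ \text{flops}}{(8+8+8)\ \text{bytes}} = \frac{1}{3},
    \qquad
    \mathrm{AI}_{\text{cache bypass}}
      = \frac{8\ \text{flops}}{(8+8)\ \text{bytes}} = \frac{1}{2}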

Page 62: Roofline model for Stencil (SIMDization + cache bypass)

Make SIMDization explicit
Technically, this swaps the ILP and SIMD ceilings
Use the cache-bypass instruction movntpd (see the SSE sketch below)
Increases the flop:byte ratio to ~0.5 on x86/Cell

[Figure: the same four roofline diagrams, with the SIMDized, cache-bypassing stencil plotted on each machine.]
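A minimal sketch of the inner loop with explicit SSE2 SIMD and the non-temporal store (_mm_stream_pd is the intrinsic that emits movntpd). The alignment assumptions and all names are illustrative; the actual auto-tuner generates many such variants:

    #include <emmintrin.h>   /* SSE2; _mm_stream_pd emits movntpd */

    /* One unit-stride row of the 7-point stencil, two points per step.
     * Assumes &old_g[c0] and &new_g[c0] are 16-byte aligned, N is even,
     * and nx is a multiple of 2.  The streaming store bypasses the
     * cache, eliminating the write-allocate fill (flop:byte 1/3 -> 1/2). */
    static void stencil_row_sse(const double *old_g, double *new_g,
                                long c0, long nx, long N,
                                __m128d valpha, __m128d vbeta)
    {
        const long NN = N * N;
        for (long i = 0; i < nx; i += 2) {
            long c = c0 + i;
            __m128d sum = _mm_add_pd(_mm_loadu_pd(&old_g[c - 1]),  /* +/-1 unaligned */
                                     _mm_loadu_pd(&old_g[c + 1]));
            sum = _mm_add_pd(sum, _mm_add_pd(_mm_load_pd(&old_g[c - N]),
                                             _mm_load_pd(&old_g[c + N])));
            sum = _mm_add_pd(sum, _mm_add_pd(_mm_load_pd(&old_g[c - NN]),
                                             _mm_load_pd(&old_g[c + NN])));
            __m128d r = _mm_add_pd(_mm_mul_pd(valpha, _mm_load_pd(&old_g[c])),
                                   _mm_mul_pd(vbeta, sum));
            _mm_stream_pd(&new_g[c], r);   /* cache-bypass (non-temporal) store */
        }
        _mm_sfence();   /* order streaming stores before any later reads */
    }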

Page 63: Roofline model for Stencil (SIMDization + cache bypass), continued

[The same four roofline diagrams, annotated: 3 out of 4 machines are basically on the roofline.]

Page 64: Summary

Page 65: Out-of-the-box Performance

Ideal productivity (just type ‘make’)
Kernels sorted by arithmetic intensity
Maximum performance with any concurrency
Note: Cell = 4 PPE threads (no SPEs)
Surprisingly, most architectures got similar performance

Page 66: Portable Performance

Portable (C only) auto-tuning
Sacrifice some productivity up front; amortize it through reuse
Dramatic increases in performance on Barcelona and Victoria Falls
Clovertown is increasingly memory bound
Cell = 4 PPE threads (no SPEs)

Page 67: Architecture Specific Performance

ISA-specific auto-tuning (reduced portability & productivity), made explicit:
  SPE parallelization with DMAs
  SIMDization (SSE + Cell)
  cache bypass
Cell gets all its performance from the SPEs (DP limits performance)
Barcelona gets half its performance from SIMDization

Page 68: Power Efficiency

Used a digital power meter to measure sustained power under load
Efficiency = Sustained Performance / Sustained Power
Victoria Falls’ power efficiency is severely limited by FBDIMM power
Despite Cell’s weak double precision, it delivers the highest power efficiency

Page 69: Summary (auto-tuning)

Auto-tuning provides portable performance
Auto-tuning with architecture (ISA) specific optimizations is essential on some machines
The roofline model qualifies performance
It also tells us which optimizations are important, and how much further improvement is possible

Page 70: Summary (architectures)

Clovertown: severely bandwidth limited
Barcelona: delivers good performance with architecture-specific auto-tuning; bandwidth limited after ISA optimizations
Victoria Falls: challenged by the interaction of TLP with shared caches; often limited by in-core performance
Cell Broadband Engine: performance comes entirely from architecture-specific auto-tuning; bandwidth limited on SpMV; DP FP limited on Stencil (slightly) and LBMHD (severely); delivers the best power efficiency

Page 71: Acknowledgements

Research supported by:
  Microsoft and Intel funding (Award #20080469)
  DOE Office of Science under contract number DE-AC02-05CH11231
  NSF contract CNS-0325873
Sun Microsystems - Niagara2 / Victoria Falls machines
AMD - access to quad-core Opterons (Barcelona)
Forschungszentrum Jülich - access to QS20 Cell blades
IBM - virtual loaner program for QS20 Cell blades

Page 72: Questions?

Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Supercomputing (SC), 2007.

Samuel Williams, Jonathan Carter, Leonid Oliker, John Shalf, Katherine Yelick, "Lattice Boltzmann Simulation Optimization on Leading Multicore Platforms", International Parallel & Distributed Processing Symposium (IPDPS), 2008. Best Paper, Application Track.

Kaushik Datta, Mark Murphy, Vasily Volkov, Samuel Williams, Jonathan Carter, Leonid Oliker, David Patterson, John Shalf, Katherine Yelick, "Stencil Computation Optimization and Autotuning on State-of-the-Art Multicore Architectures", Supercomputing (SC) (to appear), 2008.