(Mis)Understanding the NUMA Memory System Performance of Multithreaded Workloads

Zoltán Majó, Thomas R. Gross
Department of Computer Science, ETH Zurich, Switzerland


TRANSCRIPT

Page 1: (Mis) Understanding the NUMA  Memory  System Performance of  Multithreaded Workloads

(Mis)Understanding the NUMA Memory System Performance of Multithreaded Workloads

Zoltán Majó, Thomas R. Gross
Department of Computer Science, ETH Zurich, Switzerland

Page 2: (Mis) Understanding the NUMA  Memory  System Performance of  Multithreaded Workloads

2

NUMA-multicore memory system

[Figure: two-processor NUMA system. Each processor has its cores (0–7 on Processor 0, 8–15 on Processor 1), a shared last-level cache, a memory controller (MC) attached to local DRAM, and an interconnect (IC) link to the other processor.]

Access latencies observed by a thread T running on Processor 0:
LOCAL_CACHE: 38 cycles
REMOTE_CACHE: 186 cycles
LOCAL_DRAM: 190 cycles
REMOTE_DRAM: 310 cycles

All data based on experimental evaluation of Intel Nehalem (Hackenberg [MICRO ’09], Molka [PACT ‘09])

Page 3: (Mis) Understanding the NUMA  Memory  System Performance of  Multithreaded Workloads

3

Experimental setup

Three benchmark programs from PARSEC: streamcluster, ferret, and dedup
Input sizes grown to put more pressure on the memory system

Machine: Intel Westmere, 4 processors, 32 cores

Three execution scenarios:
  w/o NUMA: sequential
  w/o NUMA: parallel (8 cores / 1 processor)
  w/  NUMA: parallel (32 cores / 4 processors)

Page 4: (Mis) Understanding the NUMA  Memory  System Performance of  Multithreaded Workloads

4

Execution scenarios

[Figure: four-processor NUMA system (Processors 0–3, cores 0–31; each processor has a last-level cache, a memory controller with local DRAM, and an interconnect link), with threads placed on all 32 cores in the parallel scenario.]

Page 5: (Mis) Understanding the NUMA  Memory  System Performance of  Multithreaded Workloads

5

Parallel performance

[Figure: speedup over sequential execution for streamcluster, ferret, and dedup with 8 active cores (1 processor) and 32 active cores (4 processors); y-axis 0–30.]

Page 6: (Mis) Understanding the NUMA  Memory  System Performance of  Multithreaded Workloads

6

CPU cycle breakdown

[Figure: CPU cycles (x10^12) broken down into useful cycles, back-end stalls, and other stalls, for sequential, 8-core (1 processor), and 32-core (4 processor) execution.]

dedup: good scaling (26X)
streamcluster: poor scaling (11X)

Page 7: (Mis) Understanding the NUMA  Memory  System Performance of  Multithreaded Workloads

7

Outline

Introduction
Performance analysis
  Data locality
  Prefetcher effectiveness
Source-level optimizations
Performance evaluation
Conclusions

Page 8: (Mis) Understanding the NUMA  Memory  System Performance of  Multithreaded Workloads

8

Data locality

Page placement policy
  Commonly used policy: first-touch (default in Linux)

Measurement: data locality of the benchmarks

  Data locality = remote memory references / total memory references [%]

  Read transfers measured at the processor's uncore
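To make the metric concrete, here is a tiny illustrative computation in C (the counter values are made up; REMOTE_DRAM and REMOTE_CACHE transfers together form the remote share, matching the breakdown used on the following slides):

/* Illustrative only: compute the remote-access fraction from uncore
 * transfer counts. The numbers below are invented example values, not
 * measurements from the paper. */
#include <stdio.h>

int main(void) {
    unsigned long long local_dram = 500, local_cache = 100;   /* example */
    unsigned long long remote_dram = 300, remote_cache = 100; /* example */

    unsigned long long remote = remote_dram + remote_cache;
    unsigned long long total  = remote + local_dram + local_cache;

    printf("remote fraction: %.1f%%\n", 100.0 * remote / total);
    return 0;
}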

Page 9: (Mis) Understanding the NUMA  Memory  System Performance of  Multithreaded Workloads

9

NUMA-multicore memory system

[Figure: four-processor NUMA system; a thread T on Processor 0 may be served from the local last-level cache (LOCAL_CACHE), local DRAM (LOCAL_DRAM), a remote processor's cache (REMOTE_CACHE), or remote DRAM (REMOTE_DRAM).]

Page 10: (Mis) Understanding the NUMA  Memory  System Performance of  Multithreaded Workloads

10

Data locality

[Figure: fraction of total uncore transfers that are remote (REMOTE_DRAM and REMOTE_CACHE) for streamcluster, ferret, and dedup; y-axis 0–80%.]

Page 11: (Mis) Understanding the NUMA  Memory  System Performance of  Multithreaded Workloads

11

Inter-processor data sharing

[Figure: percentage of the sampled heap address space that consists of shared vs. non-shared heap pages, for streamcluster, ferret, and dedup; y-axis 0–100%.]

Cause of data sharing:
  streamcluster: data points to be clustered
  ferret and dedup: in-memory databases

Page 12: (Mis) Understanding the NUMA  Memory  System Performance of  Multithreaded Workloads

12

Prefetcher performance

Experiment: run each benchmark with the hardware prefetcher on/off and compare performance

[Figure: performance improvement relative to prefetching disabled for streamcluster, ferret, and dedup; y-axis 0–3%.]

Causes of prefetcher inefficiency:
  ferret and dedup: hash-based memory access patterns
  streamcluster: random shuffling

Page 13: (Mis) Understanding the NUMA  Memory  System Performance of  Multithreaded Workloads

13

streamcluster: random shuffling

while (input = read_data_points()) {

clusters = process(input);

}

Data points are randomly shuffled to increase the probability that each point is compared to each cluster.
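For context, a minimal sketch (illustrative C, not the benchmark's actual code) of a pointer-based shuffle: only the array of pointers is permuted, so the coordinate data stays where it was allocated and consecutive iterations touch non-contiguous memory, which defeats the hardware prefetcher. Point and DIM are made-up names.

/* Hypothetical pointer-based shuffle (Fisher-Yates over pointers). */
#include <stdlib.h>

#define DIM 64
typedef struct { float coord[DIM]; } Point;

/* Permute only the pointers; the Point structs themselves stay put,
 * so a pass over points[0..n-1] jumps around in memory. */
static void shuffle_pointers(Point **points, size_t n) {
    if (n < 2) return;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        Point *tmp = points[i];
        points[i] = points[j];
        points[j] = tmp;
    }
}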

Page 14: (Mis) Understanding the NUMA  Memory  System Performance of  Multithreaded Workloads

14

streamcluster: prefetcher effectiveness

[Figure: original data layout (before shuffling), showing threads T0 and T1, the points array (A–H), and the coordinate data it references in matching order.]

Page 15: (Mis) Understanding the NUMA  Memory  System Performance of  Multithreaded Workloads

15

streamcluster: prefetcher effectiveness

[Figure: data layout after the pointer-based shuffle: the points array is permuted, but the coordinate data stays in its original locations, so threads T0 and T1 access it in a scattered order.]

Page 16: (Mis) Understanding the NUMA  Memory  System Performance of  Multithreaded Workloads

16

streamcluster: prefetcher effectiveness

[Figure: the same layout after the pointer-based shuffle, showing only the accesses of thread T0.]

Page 17: (Mis) Understanding the NUMA  Memory  System Performance of  Multithreaded Workloads

17

Outline

Introduction
Performance analysis
  Data locality
  Prefetcher effectiveness
Source-level optimizations
  Prefetching
  Data locality
Performance evaluation
Conclusions

Page 18: (Mis) Understanding the NUMA  Memory  System Performance of  Multithreaded Workloads

18

streamcluster: Optimizing prefetching

Copy-based shuffle

Performance improvement over pointer-based shuffle:
  Westmere: 12%
  Nehalem: 60%

[Figure: data layout after a copy-based shuffle: the coordinate data itself is rearranged into the shuffled order (G B C H F E A D), so threads T0 and T1 again scan contiguous memory.]
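A rough sketch of the copy-based alternative (again illustrative C, not the benchmark's code): the coordinate records themselves are copied into a new array in shuffled order, so later passes stream through memory sequentially and the hardware prefetcher is effective again.

/* Hypothetical copy-based shuffle: produce a new, physically reordered
 * array of Point structs instead of permuting pointers.
 * Point/DIM are the same illustrative definitions as in the earlier sketch. */
#include <stdlib.h>

#define DIM 64
typedef struct { float coord[DIM]; } Point;

static Point *shuffle_by_copy(const Point *src, size_t n) {
    if (n == 0) return NULL;
    Point *dst = malloc(n * sizeof *dst);
    size_t *perm = malloc(n * sizeof *perm);
    if (!dst || !perm) { free(dst); free(perm); return NULL; }

    /* Build a random permutation of the indices. */
    for (size_t i = 0; i < n; i++) perm[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }

    /* Copy the records into shuffled order; subsequent scans of dst
     * are sequential in memory. */
    for (size_t i = 0; i < n; i++)
        dst[i] = src[perm[i]];

    free(perm);
    return dst;
}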

Page 19: (Mis) Understanding the NUMA  Memory  System Performance of  Multithreaded Workloads

19

Data locality optimizations

Control the mapping of data and computations:

1. Data placement
   Supported by numa_alloc(), move_pages()
   First-touch: also OK if the data is accessed at a single processor
   Interleaved page placement: reduces interconnect contention
   [Lachaize et al., USENIX ATC '12; Dashti et al., ASPLOS '13]

2. Computation scheduling
   Threads: affinity scheduling, supported by sched_setaffinity()
   Loop parallelism: rely on OpenMP static loop scheduling
   Pipeline parallelism: locality-aware task dispatch

(A short code sketch of these calls follows.)
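A minimal sketch of the two knobs (assuming Linux with libnuma; compile with -lnuma): place data on a chosen NUMA node and pin the thread that uses it to a CPU of that node. The node and CPU numbers are illustrative, not taken from the paper's setup.

#define _GNU_SOURCE
#include <numa.h>
#include <sched.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not supported on this system\n");
        return 1;
    }

    /* 1. Data placement: allocate the buffer on node 0. */
    size_t size = 64UL * 1024 * 1024;
    double *buf = numa_alloc_onnode(size, 0);
    if (!buf) { perror("numa_alloc_onnode"); return 1; }

    /* 2. Computation scheduling: pin this thread to CPU 0,
     *    assumed here to belong to node 0. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        perror("sched_setaffinity");

    /* Touch the buffer from node 0's CPU. */
    for (size_t i = 0; i < size / sizeof(double); i++)
        buf[i] = 0.0;

    numa_free(buf, size);
    return 0;
}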

Page 20: (Mis) Understanding the NUMA  Memory  System Performance of  Multithreaded Workloads

20

streamcluster

[Figure: after the copy-based shuffle, one half of the (shuffled) coordinate data is placed at Processor 0 and processed by thread T0 executing there; the other half is placed at Processor 1 and processed by T1 executing there.]

Page 21: (Mis) Understanding the NUMA  Memory  System Performance of  Multithreaded Workloads

ferret

21

[Figure: the ferret pipeline: Stage 1 Input, Stage 2 Segment, Stage 3 Extract, Stage 4 Index, Stage 5 Rank, Stage 6 Output, with an image database queried by the Index stage. The worker threads of Stage 4 execute on both Processor 0 and Processor 1.]

Page 22: (Mis) Understanding the NUMA  Memory  System Performance of  Multithreaded Workloads

22

ferret

[Figure: the modified ferret pipeline: the Index stage is duplicated into Index' and Index'', each executing at one processor (Processor 0 and Processor 1, respectively) and accessing the portion of the image database placed at that processor; the other stages are unchanged.]
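To make the locality-aware task dispatch idea concrete, here is a rough sketch (illustrative C, not ferret's actual code): each NUMA node gets its own queue for the Index stage, and a work item is enqueued on the queue of the node where its data resides. node_queue, enqueue(), and work_item are hypothetical names; the page-to-node query uses move_pages() with a NULL node list, which only reports the node currently backing each page.

/* Hypothetical locality-aware dispatch sketch (compile with -lnuma). */
#include <numaif.h>
#include <stdint.h>
#include <unistd.h>

typedef struct work_item { void *db_entry; } work_item;  /* data the task reads */
typedef struct node_queue node_queue;                     /* one queue per node */
void enqueue(node_queue *q, work_item *w);                /* assumed to exist */

/* Return the NUMA node backing the page that contains addr, or -1. */
static int node_of(void *addr) {
    long psz = sysconf(_SC_PAGESIZE);
    void *page = (void *)((uintptr_t)addr & ~((uintptr_t)psz - 1));
    int status = -1;
    if (move_pages(0 /* self */, 1, &page, NULL, &status, 0) != 0)
        return -1;
    return status;
}

/* Dispatch the task to the queue of the node holding its data, so worker
 * threads pinned to that node access the database locally. */
void dispatch(node_queue **queues, int nnodes, work_item *w) {
    int node = node_of(w->db_entry);
    if (node < 0 || node >= nnodes)
        node = 0;                      /* fall back to a default queue */
    enqueue(queues[node], w);
}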

Page 23: (Mis) Understanding the NUMA  Memory  System Performance of  Multithreaded Workloads

23

Performance evaluation

Two parameters with a major effect on NUMA performance:
  Data placement
  Schedule of computations

Execution scenario notation: schedule / placement

Scenario 1: default / FT
  Schedule: default
  Placement: first-touch (FT)

Page 24: (Mis) Understanding the NUMA  Memory  System Performance of  Multithreaded Workloads

24

default / FT

[Figure: four processors, each with its own DRAM; under the default schedule, threads run on all 32 cores of the four processors.]

Page 25: (Mis) Understanding the NUMA  Memory  System Performance of  Multithreaded Workloads

25

default / FT

[Figure: same diagram as the previous slide (animation step).]

Page 26: (Mis) Understanding the NUMA  Memory  System Performance of  Multithreaded Workloads

26

default / FT

[Figure: with first-touch placement, the data pages (D) end up concentrated in a single processor's DRAM, while threads on all four processors access them, largely through remote accesses.]

Page 27: (Mis) Understanding the NUMA  Memory  System Performance of  Multithreaded Workloads

27

Performance evaluation

Two parameters with a major effect on NUMA performance:
  Data placement
  Schedule of computations

Execution scenario notation: schedule / placement

Scenario 1: default / FT
  Schedule: default
  Placement: first-touch (FT)

Scenario 2: default / INTL (change placement)
  Schedule: default
  Placement: interleaved (INTL)
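Interleaved placement can be requested without code changes via numactl --interleave=all, or programmatically with libnuma. A minimal sketch (illustrative, not from the benchmarks; compile with -lnuma):

#include <numa.h>
#include <stddef.h>

/* Allocate a buffer whose pages are spread round-robin across all NUMA
 * nodes. Returns NULL if NUMA is unavailable or the allocation fails;
 * release the result with numa_free(ptr, nbytes). */
void *alloc_interleaved(size_t nbytes) {
    if (numa_available() < 0)
        return NULL;
    return numa_alloc_interleaved(nbytes);
}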

Page 28: (Mis) Understanding the NUMA  Memory  System Performance of  Multithreaded Workloads

28

default / INTL

[Figure: with interleaved placement, the data pages (D) are spread evenly across the DRAM of all four processors; the thread schedule is unchanged (default).]

Page 29: (Mis) Understanding the NUMA  Memory  System Performance of  Multithreaded Workloads

29

Performance evaluation

Two parameters with a major effect on NUMA performance:
  Data placement
  Schedule of computations

Execution scenario notation: schedule / placement

Scenario 1: default / FT
  Schedule: default
  Placement: first-touch (FT)

Scenario 2: default / INTL (change placement)
  Schedule: default
  Placement: interleaved (INTL)

Scenario 3: NUMA / INTL (change schedule)
  Schedule: NUMA-aware
  Placement: interleaved (INTL)

Page 30: (Mis) Understanding the NUMA  Memory  System Performance of  Multithreaded Workloads

30

NUMA / INTL

[Figure: the data pages (D) remain interleaved across the DRAM of all four processors; the threads are now scheduled in a NUMA-aware manner, grouped per processor.]

Page 31: (Mis) Understanding the NUMA  Memory  System Performance of  Multithreaded Workloads

31

Performance evaluation

Two parameters with a major effect on NUMA performance:
  Data placement
  Schedule of computations

Execution scenario notation: schedule / placement

Scenario 1: default / FT
  Schedule: default
  Placement: first-touch (FT)

Scenario 2: default / INTL (change placement)
  Schedule: default
  Placement: interleaved (INTL)

Scenario 3: NUMA / INTL (change schedule)
  Schedule: NUMA-aware
  Placement: interleaved (INTL)

Scenario 4: NUMA / NUMA (change placement)
  Schedule: NUMA-aware
  Placement: NUMA-aware (NA)

Page 32: (Mis) Understanding the NUMA  Memory  System Performance of  Multithreaded Workloads

32

NUMA / NUMA

[Figure: NUMA-aware schedule and NUMA-aware placement: each processor's threads work on the data pages (D) placed in that processor's DRAM, so most accesses are local.]

Page 33: (Mis) Understanding the NUMA  Memory  System Performance of  Multithreaded Workloads

33

Performance evaluation: ferret

[Figure: total uncore transfers (x10^9), broken down into REMOTE_DRAM, LOCAL_DRAM, REMOTE_CACHE, and LOCAL_CACHE, and performance improvement over default / FT (y-axis 0–70%), for the scenarios default / FT, default / INTL, NUMA / INTL, and NUMA / NUMA; this slide highlights default / FT.]

Page 34: (Mis) Understanding the NUMA  Memory  System Performance of  Multithreaded Workloads

34

Performance evaluation: ferret

[Figure: same chart as the previous slide, highlighting default / INTL.]

Page 35: (Mis) Understanding the NUMA  Memory  System Performance of  Multithreaded Workloads

35

Performance evaluation: ferret

[Figure: same chart, highlighting NUMA / INTL.]

Page 36: (Mis) Understanding the NUMA  Memory  System Performance of  Multithreaded Workloads

36

Performance evaluation: ferret

[Figure: same chart, highlighting NUMA / NUMA.]

Page 37: (Mis) Understanding the NUMA  Memory  System Performance of  Multithreaded Workloads

37

Performance evaluation (cont’d)

[Figure: performance improvement over default / FT for streamcluster (y-axis 0–250%) and dedup (y-axis 0–20%) under default / INTL, NUMA / INTL, and NUMA / NUMA.]

Page 38: (Mis) Understanding the NUMA  Memory  System Performance of  Multithreaded Workloads

38

Data locality optimizations: summary

Optimizing for data locality is better than merely avoiding interconnect contention
Interleaved placement is easy to control
Data locality: lack of tools for implementing the optimizations

Other options:
  Data replication
  Automatic data migration

Page 39: (Mis) Understanding the NUMA  Memory  System Performance of  Multithreaded Workloads

39

Performance evaluation: ferret

[Figure: total uncore transfers (x10^9), broken down into REMOTE_DRAM, LOCAL_DRAM, REMOTE_CACHE, and LOCAL_CACHE, and improvement over default / FT (y-axis 0–70%), now also including a default / replicated scenario alongside default / FT and NUMA / NUMA.]

Page 40: (Mis) Understanding the NUMA  Memory  System Performance of  Multithreaded Workloads

40

Conclusions

Details matter: prefetcher efficiency and data locality

Substantial improvements:
  streamcluster: 214%
  ferret: 59%
  dedup: 17%

Benchmarking on NUMA-multicores is far from easy
  Two aspects to consider: data placement and computation scheduling

Appreciate memory system details to avoid misconceptions
  Limited support for understanding hardware bottlenecks

Page 41: (Mis) Understanding the NUMA  Memory  System Performance of  Multithreaded Workloads

41

Thank you for your attention!