(Mis)Understanding the NUMA Memory System Performance of Multithreaded Workloads
(Mis)Understanding the NUMA Memory System Performance of Multithreaded Workloads
Zoltán Majó, Thomas R. Gross
Department of Computer Science, ETH Zurich, Switzerland
2
NUMA-multicore memory system
[Figure: two-processor NUMA machine. Each processor has eight cores, a shared last-level cache, a memory controller (MC) attached to local DRAM, and an interconnect (IC) link to the other processor. Access latencies for a thread T on Processor 0: LOCAL_CACHE: 38 cycles, LOCAL_DRAM: 190 cycles, REMOTE_CACHE: 186 cycles, REMOTE_DRAM: 310 cycles.]
All data based on experimental evaluation of Intel Nehalem (Hackenberg [MICRO '09], Molka [PACT '09])
3
Experimental setup
Three benchmark programs from PARSEC: streamcluster, ferret, and dedup. Input sizes were grown to put more pressure on the memory system.
Intel Westmere machine: 4 processors, 32 cores total.
Three execution scenarios:
- w/o NUMA: sequential
- w/o NUMA: parallel (8 cores / 1 processor)
- w/ NUMA: parallel (32 cores / 4 processors)
4
Execution scenarios
[Figure: the full four-processor, 32-core machine (cores 0-31, each processor with its own last-level cache, MC, IC, and DRAM), animated to show the three scenarios: a single thread T on one core, 8 threads on the cores of Processor 0, and 32 threads across all four processors.]
5
Parallel performance
[Chart: speedup over sequential for streamcluster, ferret, and dedup at 8 active cores (1 processor) and 32 active cores (4 processors); y-axis: speedup over sequential, 0-30.]
6
CPU cycle breakdown
[Charts: CPU cycles (x10^12), split into useful cycles, back-end stalls, and other stalls, for the sequential, 8-core (1 processor), and 32-core (4 processors) runs; shown both as absolute counts and normalized to 1.]
- dedup: good scaling (26X)
- streamcluster: poor scaling (11X)
7
Outline
- Introduction
- Performance analysis
  - Data locality
  - Prefetcher effectiveness
- Source-level optimizations
- Performance evaluation
- Conclusions
8
Data locality
Page placement policy: the commonly used policy is first-touch (the default in Linux), which places a page on the NUMA node of the thread that first accesses it.
Measurement: data locality of the benchmarks

    Data locality [%] = remote memory references / total memory references

(Read transfers measured at the processor's uncore.)
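To make first-touch concrete, here is a minimal sketch (not from the talk; assumes Linux with libnuma installed, build with gcc demo.c -lnuma) that first-touches a page and then asks the kernel where it landed, using the real move_pages() call in query mode:

    #include <numaif.h>    /* move_pages() */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        size_t page = (size_t)sysconf(_SC_PAGESIZE);
        char *buf;
        if (posix_memalign((void **)&buf, page, page) != 0)
            return 1;

        memset(buf, 0, page);      /* first touch: the kernel places the
                                      page on THIS thread's NUMA node */

        void *pages[1] = { buf };
        int status[1];
        /* nodes == NULL turns move_pages() into a placement query */
        if (move_pages(0, 1, pages, NULL, status, 0) == 0)
            printf("page resides on NUMA node %d\n", status[0]);

        free(buf);
        return 0;
    }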
9
NUMA-multicore memory system
[Figure: the four-processor NUMA machine again, illustrating the four classes of memory accesses for a thread T on Processor 0: LOCAL_CACHE and LOCAL_DRAM on its own processor, REMOTE_CACHE and REMOTE_DRAM on the other processors.]
10
Data locality
[Chart: fraction of total uncore transfers that are REMOTE_DRAM vs. REMOTE_CACHE for streamcluster, ferret, and dedup; y-axis: 0-80%.]
11
Inter-processor data sharing
[Chart: percentage of the sampled heap address space holding shared vs. non-shared heap pages for streamcluster, ferret, and dedup; y-axis: 0-100%.]
Cause of data sharing:
- streamcluster: the data points to be clustered
- ferret and dedup: in-memory databases
12
Prefetcher performance
Experiment: run each benchmark with the prefetcher on and off, then compare performance.
[Chart: performance improvement relative to prefetching disabled for streamcluster, ferret, and dedup; y-axis: 0-3%.]
Causes of prefetcher inefficiency:
- ferret and dedup: hash-based memory access patterns
- streamcluster: random shuffling
13
streamcluster: random shuffling
    while ((input = read_data_points())) {
        clusters = process(input);
    }
Randomly shuffle data points to increase the probability that each point is compared to each cluster.
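Why the shuffle defeats the prefetcher is easiest to see in code. The following is a hypothetical sketch (types and names assumed; not streamcluster's actual source) of a pointer-based shuffle: only the pointer array is permuted, the Point records stay where they were allocated, so a later walk over points[] touches memory in random order.

    #include <stdlib.h>

    typedef struct { float coord[64]; } Point;   /* record size is an assumption */

    void shuffle_pointers(Point **points, size_t n) {
        if (n < 2) return;
        for (size_t i = n - 1; i > 0; i--) {      /* Fisher-Yates */
            size_t j = (size_t)rand() % (i + 1);
            Point *tmp = points[i];
            points[i] = points[j];
            points[j] = tmp;
        }
    }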
14
streamcluster: prefetcher effectiveness
Original data layout (before shuffling)
[Figure: threads T0 and T1 traverse points A-H; the points array and the coordinate blocks are laid out contiguously in order A, B, C, D, E, F, G, H, so the traversal is sequential.]
15
streamcluster: prefetcher effectiveness
Data layout (after pointer-based shuffle)
[Figure: the points array now holds pointers in shuffled order while the coordinate blocks A-H stay in place, so threads T0 and T1 access the coordinates non-sequentially.]
16
streamcluster: prefetcher effectiveness
Data layout (after pointer-based shuffle)
[Figure: a single thread T0 walks the shuffled points array across all coordinate blocks A-H; the access stream is effectively random, so the hardware prefetcher cannot follow it.]
17
Outline
- Introduction
- Performance analysis
  - Data locality
  - Prefetcher effectiveness
- Source-level optimizations
  - Prefetching
  - Data locality
- Performance evaluation
- Conclusions
18
streamcluster: Optimizing prefetching
Copy-based shuffle; performance improvement over the pointer-based shuffle:
- Westmere: 12%
- Nehalem: 60%
[Figure: after the copy-based shuffle, the coordinate blocks themselves are reordered (G, B, C, H, F, E, A, D), so threads T0 and T1 again traverse memory sequentially.]
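The copy-based shuffle can be sketched with the same assumed types as before (again, not the actual source): the coordinate records themselves are permuted, so after the shuffle the points are stored contiguously in traversal order.

    #include <stdlib.h>

    typedef struct { float coord[64]; } Point;

    void shuffle_copies(Point *points, size_t n) {
        if (n < 2) return;
        for (size_t i = n - 1; i > 0; i--) {      /* Fisher-Yates */
            size_t j = (size_t)rand() % (i + 1);
            Point tmp = points[i];                /* copy whole records */
            points[i] = points[j];
            points[j] = tmp;
        }
    }

The copying is paid once per shuffle; every later pass over points[0..n-1] is a unit-stride stream the hardware prefetcher can follow, which is where the 12% (Westmere) and 60% (Nehalem) improvements come from.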
19
Data locality optimizations
Control the mapping of data and computations (a sketch follows this list):
1. Data placement
   - Supported by numa_alloc(), move_pages()
   - First-touch: also OK if data is accessed at a single processor
   - Interleaved page placement: reduces interconnect contention [Lachaize et al. USENIX ATC '12, Dashti et al. ASPLOS '13]
2. Computation scheduling
   - Threads: affinity scheduling, supported by sched_setaffinity()
   - Loop parallelism: rely on OpenMP static loop scheduling
   - Pipeline parallelism: locality-aware task dispatch
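A minimal sketch pairing the two mechanisms (helper names are assumptions, not the authors' code; assumes Linux with libnuma, build with gcc pin.c -lnuma):

    #define _GNU_SOURCE
    #include <sched.h>     /* sched_setaffinity(), CPU_SET */
    #include <stddef.h>
    #include <numa.h>      /* numa_alloc_onnode(), numa_free() */

    /* Computation scheduling: pin the calling thread to one core. */
    static void pin_to_cpu(int cpu) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        sched_setaffinity(0, sizeof(set), &set);   /* 0 = this thread */
    }

    /* Data placement: allocate pages directly on a given NUMA node;
     * release later with numa_free(ptr, bytes). */
    static void *alloc_on_node(int node, size_t bytes) {
        return numa_alloc_onnode(bytes, node);
    }

Pinning a worker to a core of Processor 0 and allocating its working set with alloc_on_node(0, ...) keeps that worker's accesses in LOCAL_DRAM.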
20
streamcluster
[Figure: the shuffled coordinate blocks are split in half; the first half is placed at Processor 0 and processed by T0 executing there, the second half is placed at Processor 1 and processed by T1 executing there.]
21
ferret
[Figure: ferret's pipeline: Stage 1 Input, Stage 2 Segment, Stage 3 Extract, Stage 4 Index, Stage 5 Rank, Stage 6 Output. The parallel Index stage queries the image database, with Index threads executing at both Processor 0 and Processor 1.]
22
[Figure: the locality-aware version duplicates the Index stage into Index' (executing at Processor 0) and Index'' (executing at Processor 1) and splits the image database into two partitions, placed at Processor 0 and Processor 1 respectively, so each replica searches only local data.]
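The dispatch policy in the figure can be made concrete with a hypothetical sketch (queue type and names assumed; not ferret's actual source): a query is enqueued at the Index replica whose database partition it will search, so lookups run where their data lives.

    #include <stddef.h>

    #define NODES 2                       /* Index' on node 0, Index'' on node 1 */
    #define QCAP  1024

    typedef struct {                      /* trivial queue, not thread-safe */
        void  *task[QCAP];
        size_t tail;
    } queue_t;

    static queue_t index_queue[NODES];    /* one queue per Index replica */

    /* The partition for this hash (and its pages) lives on node
     * hash % NODES; run the lookup where the data is. */
    void dispatch_query(void *query, size_t hash) {
        queue_t *q = &index_queue[hash % NODES];
        if (q->tail < QCAP)
            q->task[q->tail++] = query;
    }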
23
Performance evaluation
Two parameters with a major effect on NUMA performance: data placement and the schedule of computations.
Execution scenarios are named schedule / placement.
Scenario 1: default / FT (schedule: default, placement: first-touch (FT))
24
default / FT
[Figure, animated over slides 24-26: four processors, each with local DRAM; under the default schedule, threads T run on all 32 cores, and with first-touch placement the data pages (D) end up in the DRAMs of the processors whose threads touched them first.]
27
Performance evaluation
Two parameters with major effect on NUMA performance Data placement Schedule of computations
Execution scenario: schedule / placement
Scenario 2: default / INTLSchedule: defaultPlacement: interleaved (INTL)
Scenario 1: default / FTSchedule: defaultPlacement: first-touch (FT)
changeplacement
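Interleaved placement needs no source changes when requested externally (numactl --interleave=all ./app), but it can also be requested per allocation with libnuma. A minimal sketch, assuming Linux with libnuma:

    #include <numa.h>

    /* Pages of the buffer are placed round-robin across all nodes,
     * spreading traffic over every memory controller; release with
     * numa_free(ptr, bytes). */
    void *alloc_interleaved(size_t bytes) {
        if (numa_available() < 0)
            return NULL;                  /* kernel lacks NUMA support */
        return numa_alloc_interleaved(bytes);
    }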
28
default / FT → default / INTL
[Figure: same machine; with interleaved placement the data pages (D) are distributed round-robin across the four DRAMs, independent of which thread touched them first.]
29
Performance evaluation
Two parameters with major effect on NUMA performance Data placement Schedule of computations
Execution scenario: schedule / placement
Scenario 2: default / INTLSchedule: defaultPlacement: interleaved (INTL)
Scenario 1: default / FTSchedule: defaultPlacement: first-touch (FT)
changeplacement
Scenario 3: NUMA / INTLSchedule: NUMA-awarePlacement: interleaved (INTL)
changeschedule
30
default / INTL → NUMA / INTL
[Figure: threads are now scheduled NUMA-aware, grouped on the processors, while the data pages (D) remain interleaved across the four DRAMs.]
31
Performance evaluation
Two parameters with major effect on NUMA performance Data placement Schedule of computations
Execution scenario: schedule / placement
Scenario 2: default / INTLSchedule: defaultPlacement: interleaved (INTL)
Scenario 1: default / FTSchedule: defaultPlacement: first-touch (FT)
changeplacement
Scenario 3: NUMA / INTLSchedule: NUMA-awarePlacement: interleaved (INTL)
Scenario 4: NUMA / NUMASchedule: NUMA-awarePlacement: NUMA-aware (NA)
changeschedule
changeplacement
32
NUMA / NUMA
[Figure: with NUMA-aware placement, each processor's DRAM holds the data pages (D) of the threads scheduled on that processor, so accesses stay local.]
33
Performance evaluation: ferret
[Charts: total uncore transfers (x10^9), broken into REMOTE_DRAM, LOCAL_DRAM, REMOTE_CACHE, and LOCAL_CACHE, for the four scenarios default / FT, default / INTL, NUMA / INTL, and NUMA / NUMA (y-axis: 0-40,000), alongside the performance improvement over default / FT (y-axis: 0-70%). This frame highlights default / FT.]
34
Performance evaluation: ferret
[Same charts as the previous slide; this frame highlights default / INTL.]
35
Performance evaluation: ferret
[Same charts; this frame highlights NUMA / INTL.]
36
Performance evaluation: ferret
[Same charts; this frame highlights NUMA / NUMA.]
37
Performance evaluation (cont'd)
[Charts: performance improvement over default / FT for default / INTL, NUMA / INTL, and NUMA / NUMA; streamcluster (y-axis: 0-250%) and dedup (y-axis: 0-20%).]
38
Data locality optimizations: summary
- Improving data locality beats merely avoiding interconnect contention
- Interleaved placement is easy to control
- Data locality: there is a lack of tools for implementing the optimizations
Other options (one of which is sketched below):
- Data replication
- Automatic data migration
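Beyond automatic migration, a page can also be migrated explicitly. A minimal sketch (hypothetical helper; assumes Linux with libnuma, build with gcc migrate.c -lnuma) using the real move_pages() call, for example after the scheduler has moved a page's consumer thread to another processor:

    #include <numaif.h>

    int migrate_page(void *page_addr, int target_node) {
        void *pages[1] = { page_addr };
        int   nodes[1] = { target_node };
        int   status[1];
        if (move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE) == 0
            && status[0] == target_node)
            return 0;                     /* page now on the target node */
        return -1;                        /* failed; status[0] holds the reason */
    }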
39
Performance evaluation: ferret
[Charts: total uncore transfers (x10^9, broken into REMOTE_DRAM, LOCAL_DRAM, REMOTE_CACHE, LOCAL_CACHE) and improvement over default / FT (y-axis: 0-70%), now including a default / replicated scenario alongside default / FT and NUMA / NUMA.]
40
Conclusions
Details matter: prefetcher efficiency and data locality yield substantial improvements:

    streamcluster  214%
    ferret          59%
    dedup           17%

Benchmarking on NUMA-multicores is far from easy; two aspects to consider are data placement and computation scheduling.
Appreciate memory system details to avoid misconceptions; support for understanding hardware bottlenecks is still limited.
41
Thank you for your attention!