(Mis)Understanding the NUMA Memory System Performance of Multithreaded Workloads
(Mis)Understanding the NUMA Memory System Performance of Multithreaded Workloads
Zoltán Majó, Thomas R. Gross
Department of Computer Science, ETH Zurich, Switzerland
2
NUMA-multicore memory system
[Figure: two-processor NUMA machine. Each processor has eight cores, a shared last-level cache, a memory controller (MC) attached to local DRAM, and an interconnect (IC) link to the other processor. Access latencies for a thread T on Processor 0: LOCAL_CACHE: 38 cycles, LOCAL_DRAM: 190 cycles, REMOTE_CACHE: 186 cycles, REMOTE_DRAM: 310 cycles.]
All data based on experimental evaluation of Intel Nehalem (Hackenberg [MICRO '09], Molka [PACT '09])
3
Experimental setup
Three benchmark programs from PARSEC: streamcluster, ferret, and dedup. Input sizes were grown to put more pressure on the memory system.
Intel Westmere machine: 4 processors, 32 cores total.
Three execution scenarios:
- w/o NUMA: sequential
- w/o NUMA: parallel (8 cores / 1 processor)
- w/ NUMA: parallel (32 cores / 4 processors)
4
Execution scenarios
[Figure: the full four-processor, 32-core machine (cores 0-31, each processor with its own last-level cache, MC, IC, and DRAM), animated to show the three scenarios: a single thread T on one core, 8 threads on the cores of Processor 0, and 32 threads across all four processors.]
5
Parallel performance
[Chart: speedup over sequential for streamcluster, ferret, and dedup at 8 active cores (1 processor) and 32 active cores (4 processors); y-axis: speedup over sequential, 0-30.]
6
CPU cycle breakdown
[Charts: CPU cycles (x10^12), split into useful cycles, back-end stalls, and other stalls, for the sequential, 8-core (1 processor), and 32-core (4 processors) runs; shown both as absolute counts and normalized to 1.]
- dedup: good scaling (26X)
- streamcluster: poor scaling (11X)
7
Outline
- Introduction
- Performance analysis
  - Data locality
  - Prefetcher effectiveness
- Source-level optimizations
- Performance evaluation
- Conclusions
8
Data locality
Page placement policy: the commonly used policy is first-touch (the default in Linux), which places a page on the NUMA node of the thread that first accesses it.
Measurement: data locality of the benchmarks

    Data locality [%] = remote memory references / total memory references

(Read transfers measured at the processor's uncore.)
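To make first-touch concrete, here is a minimal sketch (not from the talk; assumes Linux with libnuma installed, build with gcc demo.c -lnuma) that first-touches a page and then asks the kernel where it landed, using the real move_pages() call in query mode:

    #include <numaif.h>    /* move_pages() */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        size_t page = (size_t)sysconf(_SC_PAGESIZE);
        char *buf;
        if (posix_memalign((void **)&buf, page, page) != 0)
            return 1;

        memset(buf, 0, page);      /* first touch: the kernel places the
                                      page on THIS thread's NUMA node */

        void *pages[1] = { buf };
        int status[1];
        /* nodes == NULL turns move_pages() into a placement query */
        if (move_pages(0, 1, pages, NULL, status, 0) == 0)
            printf("page resides on NUMA node %d\n", status[0]);

        free(buf);
        return 0;
    }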
9
NUMA-multicore memory system
[Figure: the four-processor NUMA machine again, illustrating the four classes of memory accesses for a thread T on Processor 0: LOCAL_CACHE and LOCAL_DRAM on its own processor, REMOTE_CACHE and REMOTE_DRAM on the other processors.]
10
Data locality
[Chart: fraction of total uncore transfers that are REMOTE_DRAM vs. REMOTE_CACHE for streamcluster, ferret, and dedup; y-axis: 0-80%.]
11
Inter-processor data sharing
[Chart: percentage of the sampled heap address space holding shared vs. non-shared heap pages for streamcluster, ferret, and dedup; y-axis: 0-100%.]
Cause of data sharing:
- streamcluster: the data points to be clustered
- ferret and dedup: in-memory databases
12
Prefetcher performance
Experiment: run each benchmark with the prefetcher on and off, then compare performance.
[Chart: performance improvement relative to prefetching disabled for streamcluster, ferret, and dedup; y-axis: 0-3%.]
Causes of prefetcher inefficiency:
- ferret and dedup: hash-based memory access patterns
- streamcluster: random shuffling
13
streamcluster: random shuffling
    while ((input = read_data_points())) {
        clusters = process(input);
    }
Randomly shuffle data points to increase the probability that each point is compared to each cluster.
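Why the shuffle defeats the prefetcher is easiest to see in code. The following is a hypothetical sketch (types and names assumed; not streamcluster's actual source) of a pointer-based shuffle: only the pointer array is permuted, the Point records stay where they were allocated, so a later walk over points[] touches memory in random order.

    #include <stdlib.h>

    typedef struct { float coord[64]; } Point;   /* record size is an assumption */

    void shuffle_pointers(Point **points, size_t n) {
        if (n < 2) return;
        for (size_t i = n - 1; i > 0; i--) {      /* Fisher-Yates */
            size_t j = (size_t)rand() % (i + 1);
            Point *tmp = points[i];
            points[i] = points[j];
            points[j] = tmp;
        }
    }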
14
streamcluster: prefetcher effectiveness
Original data layout (before shuffling)
[Figure: threads T0 and T1 traverse points A-H; the points array and the coordinate blocks are laid out contiguously in order A, B, C, D, E, F, G, H, so the traversal is sequential.]
15
streamcluster: prefetcher effectiveness
Data layout (after pointer-based shuffle)
[Figure: the points array now holds pointers in shuffled order while the coordinate blocks A-H stay in place, so threads T0 and T1 access the coordinates non-sequentially.]
16
streamcluster: prefetcher effectiveness
Data layout (after pointer-based shuffle)
[Figure: a single thread T0 walks the shuffled points array across all coordinate blocks A-H; the access stream is effectively random, so the hardware prefetcher cannot follow it.]
17
Outline
- Introduction
- Performance analysis
  - Data locality
  - Prefetcher effectiveness
- Source-level optimizations
  - Prefetching
  - Data locality
- Performance evaluation
- Conclusions
18
streamcluster: Optimizing prefetching
Copy-based shuffle; performance improvement over the pointer-based shuffle:
- Westmere: 12%
- Nehalem: 60%
[Figure: after the copy-based shuffle, the coordinate blocks themselves are reordered (G, B, C, H, F, E, A, D), so threads T0 and T1 again traverse memory sequentially.]
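The copy-based shuffle can be sketched with the same assumed types as before (again, not the actual source): the coordinate records themselves are permuted, so after the shuffle the points are stored contiguously in traversal order.

    #include <stdlib.h>

    typedef struct { float coord[64]; } Point;

    void shuffle_copies(Point *points, size_t n) {
        if (n < 2) return;
        for (size_t i = n - 1; i > 0; i--) {      /* Fisher-Yates */
            size_t j = (size_t)rand() % (i + 1);
            Point tmp = points[i];                /* copy whole records */
            points[i] = points[j];
            points[j] = tmp;
        }
    }

The copying is paid once per shuffle; every later pass over points[0..n-1] is a unit-stride stream the hardware prefetcher can follow, which is where the 12% (Westmere) and 60% (Nehalem) improvements come from.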
19
Data locality optimizations
Control the mapping of data and computations (a sketch follows this list):
1. Data placement
   - Supported by numa_alloc(), move_pages()
   - First-touch: also OK if data is accessed at a single processor
   - Interleaved page placement: reduces interconnect contention [Lachaize et al. USENIX ATC '12, Dashti et al. ASPLOS '13]
2. Computation scheduling
   - Threads: affinity scheduling, supported by sched_setaffinity()
   - Loop parallelism: rely on OpenMP static loop scheduling
   - Pipeline parallelism: locality-aware task dispatch
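A minimal sketch pairing the two mechanisms (helper names are assumptions, not the authors' code; assumes Linux with libnuma, build with gcc pin.c -lnuma):

    #define _GNU_SOURCE
    #include <sched.h>     /* sched_setaffinity(), CPU_SET */
    #include <stddef.h>
    #include <numa.h>      /* numa_alloc_onnode(), numa_free() */

    /* Computation scheduling: pin the calling thread to one core. */
    static void pin_to_cpu(int cpu) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        sched_setaffinity(0, sizeof(set), &set);   /* 0 = this thread */
    }

    /* Data placement: allocate pages directly on a given NUMA node;
     * release later with numa_free(ptr, bytes). */
    static void *alloc_on_node(int node, size_t bytes) {
        return numa_alloc_onnode(bytes, node);
    }

Pinning a worker to a core of Processor 0 and allocating its working set with alloc_on_node(0, ...) keeps that worker's accesses in LOCAL_DRAM.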
20
streamcluster
[Figure: the shuffled coordinate blocks are split in half; the first half is placed at Processor 0 and processed by T0 executing there, the second half is placed at Processor 1 and processed by T1 executing there.]
21
ferret
[Figure: ferret's pipeline: Stage 1 Input, Stage 2 Segment, Stage 3 Extract, Stage 4 Index, Stage 5 Rank, Stage 6 Output. The parallel Index stage queries the image database, with Index threads executing at both Processor 0 and Processor 1.]
22
[Figure: the locality-aware version duplicates the Index stage into Index' (executing at Processor 0) and Index'' (executing at Processor 1) and splits the image database into two partitions, placed at Processor 0 and Processor 1 respectively, so each replica searches only local data.]
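The dispatch policy in the figure can be made concrete with a hypothetical sketch (queue type and names assumed; not ferret's actual source): a query is enqueued at the Index replica whose database partition it will search, so lookups run where their data lives.

    #include <stddef.h>

    #define NODES 2                       /* Index' on node 0, Index'' on node 1 */
    #define QCAP  1024

    typedef struct {                      /* trivial queue, not thread-safe */
        void  *task[QCAP];
        size_t tail;
    } queue_t;

    static queue_t index_queue[NODES];    /* one queue per Index replica */

    /* The partition for this hash (and its pages) lives on node
     * hash % NODES; run the lookup where the data is. */
    void dispatch_query(void *query, size_t hash) {
        queue_t *q = &index_queue[hash % NODES];
        if (q->tail < QCAP)
            q->task[q->tail++] = query;
    }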
23
Performance evaluation
Two parameters with a major effect on NUMA performance: data placement and the schedule of computations.
Execution scenarios are named schedule / placement.
Scenario 1: default / FT (schedule: default, placement: first-touch (FT))
24
default / FT
[Figure, animated over slides 24-26: four processors, each with local DRAM; under the default schedule, threads T run on all 32 cores, and with first-touch placement the data pages (D) end up in the DRAMs of the processors whose threads touched them first.]
27
Performance evaluation
Two parameters with major effect on NUMA performance Data placement Schedule of computations
Execution scenario: schedule / placement
Scenario 2: default / INTLSchedule: defaultPlacement: interleaved (INTL)
Scenario 1: default / FTSchedule: defaultPlacement: first-touch (FT)
changeplacement
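Interleaved placement needs no source changes when requested externally (numactl --interleave=all ./app), but it can also be requested per allocation with libnuma. A minimal sketch, assuming Linux with libnuma:

    #include <numa.h>

    /* Pages of the buffer are placed round-robin across all nodes,
     * spreading traffic over every memory controller; release with
     * numa_free(ptr, bytes). */
    void *alloc_interleaved(size_t bytes) {
        if (numa_available() < 0)
            return NULL;                  /* kernel lacks NUMA support */
        return numa_alloc_interleaved(bytes);
    }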
28
default / FT → default / INTL
[Figure: same machine; with interleaved placement the data pages (D) are distributed round-robin across the four DRAMs, independent of which thread touched them first.]
29
Performance evaluation
Two parameters with major effect on NUMA performance Data placement Schedule of computations
Execution scenario: schedule / placement
Scenario 2: default / INTLSchedule: defaultPlacement: interleaved (INTL)
Scenario 1: default / FTSchedule: defaultPlacement: first-touch (FT)
changeplacement
Scenario 3: NUMA / INTLSchedule: NUMA-awarePlacement: interleaved (INTL)
changeschedule
30
default / INTL → NUMA / INTL
[Figure: threads are now scheduled NUMA-aware, grouped on the processors, while the data pages (D) remain interleaved across the four DRAMs.]
31
Performance evaluation
Two parameters with major effect on NUMA performance Data placement Schedule of computations
Execution scenario: schedule / placement
Scenario 2: default / INTLSchedule: defaultPlacement: interleaved (INTL)
Scenario 1: default / FTSchedule: defaultPlacement: first-touch (FT)
changeplacement
Scenario 3: NUMA / INTLSchedule: NUMA-awarePlacement: interleaved (INTL)
Scenario 4: NUMA / NUMASchedule: NUMA-awarePlacement: NUMA-aware (NA)
changeschedule
changeplacement
32
NUMA / NUMA
[Figure: with NUMA-aware placement, each processor's DRAM holds the data pages (D) of the threads scheduled on that processor, so accesses stay local.]
33
Performance evaluation: ferret
[Charts: total uncore transfers (x10^9), broken into REMOTE_DRAM, LOCAL_DRAM, REMOTE_CACHE, and LOCAL_CACHE, for the four scenarios default / FT, default / INTL, NUMA / INTL, and NUMA / NUMA (y-axis: 0-40,000), alongside the performance improvement over default / FT (y-axis: 0-70%). This frame highlights default / FT.]
34
Performance evaluation: ferret
[Same charts as the previous slide; this frame highlights default / INTL.]
35
Performance evaluation: ferret
[Same charts; this frame highlights NUMA / INTL.]
36
Performance evaluation: ferret
[Same charts; this frame highlights NUMA / NUMA.]
37
Performance evaluation (cont'd)
[Charts: performance improvement over default / FT for default / INTL, NUMA / INTL, and NUMA / NUMA; streamcluster (y-axis: 0-250%) and dedup (y-axis: 0-20%).]
38
Data locality optimizations: summary
- Improving data locality beats merely avoiding interconnect contention
- Interleaved placement is easy to control
- Data locality: there is a lack of tools for implementing the optimizations
Other options (one of which is sketched below):
- Data replication
- Automatic data migration
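Beyond automatic migration, a page can also be migrated explicitly. A minimal sketch (hypothetical helper; assumes Linux with libnuma, build with gcc migrate.c -lnuma) using the real move_pages() call, for example after the scheduler has moved a page's consumer thread to another processor:

    #include <numaif.h>

    int migrate_page(void *page_addr, int target_node) {
        void *pages[1] = { page_addr };
        int   nodes[1] = { target_node };
        int   status[1];
        if (move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE) == 0
            && status[0] == target_node)
            return 0;                     /* page now on the target node */
        return -1;                        /* failed; status[0] holds the reason */
    }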
39
Performance evaluation: ferret
[Charts: total uncore transfers (x10^9, broken into REMOTE_DRAM, LOCAL_DRAM, REMOTE_CACHE, LOCAL_CACHE) and improvement over default / FT (y-axis: 0-70%), now including a default / replicated scenario alongside default / FT and NUMA / NUMA.]
40
Conclusions
Details matter: prefetcher efficiency and data locality yield substantial improvements:

    streamcluster  214%
    ferret          59%
    dedup           17%

Benchmarking on NUMA-multicores is far from easy; two aspects to consider are data placement and computation scheduling.
Appreciate memory system details to avoid misconceptions; support for understanding hardware bottlenecks is still limited.
41
Thank you for your attention!