Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior
Yoongu Kim, Michael Papamichael, Onur Mutlu, Mor Harchol-Balter
Motivation
• Memory is a shared resource
• Threads' requests contend for memory
  – Degradation in single-thread performance
  – Can even lead to starvation
• How to schedule memory requests to increase both system throughput and fairness?
[Figure: four cores sharing one memory]
Previous Scheduling Algorithms are Biased
[Figure: Maximum Slowdown vs. Weighted Speedup for FRFCFS, STFM, PAR-BS, and ATLAS. Lower Maximum Slowdown means better fairness; higher Weighted Speedup means better system throughput. Some algorithms show a system throughput bias, others a fairness bias; none reaches the ideal corner.]
No previous memory scheduling algorithm provides both the best fairness and system throughput.
Why do Previous Algorithms Fail?
• Throughput biased approach: prioritize less memory-intensive threads (less intensive = higher priority)
  – Good for throughput
  – But the most intensive threads can starve → unfairness
• Fairness biased approach: threads take turns accessing memory
  – No thread starves
  – But less intensive threads are not prioritized → reduced throughput
A single policy for all threads is insufficient.
Insight: Achieving the Best of Both Worlds
• For throughput: prioritize memory-non-intensive threads
• For fairness:
  – Unfairness is caused by memory-intensive threads being prioritized over each other → shuffle threads
  – Memory-intensive threads have different vulnerability to interference → shuffle asymmetrically
Outline
• Motivation & Insights
• Overview
• Algorithm
• Bringing it All Together
• Evaluation
• Conclusion
Overview: Thread Cluster Memory Scheduling
1. Group threads into two clusters
2. Prioritize the non-intensive cluster
3. Different policies for each cluster
[Figure: threads in the system are split into a prioritized non-intensive cluster (memory-non-intensive threads, managed for throughput) and an intensive cluster (memory-intensive threads, managed for fairness)]
Algorithm
TCM Outline
1. Clustering
Clustering Threads
Step 1: Sort threads by MPKI (misses per kilo-instruction)
Step 2: Memory bandwidth usage αT divides the clusters
• T = total memory bandwidth usage; ClusterThreshold α < 10%
• The lowest-MPKI threads that together consume at most αT form the non-intensive cluster; the remaining, higher-MPKI threads form the intensive cluster
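As a sketch, the two clustering steps above might look like the following. The thread tuples, the `alpha` parameter name, and the prefix rule are illustrative assumptions, not the paper's hardware implementation:

```python
def cluster_threads(threads, alpha=0.03):
    """Split threads into (non-intensive, intensive) clusters.

    `threads` is a list of (thread_id, mpki, bandwidth_usage) tuples;
    `alpha` is the ClusterThreshold (the slides suggest values below 10%).
    """
    # Step 1: sort threads by MPKI, least intensive first
    ordered = sorted(threads, key=lambda t: t[1])

    # Step 2: the non-intensive cluster is the largest prefix of low-MPKI
    # threads whose combined bandwidth usage stays within alpha * T
    total_bw = sum(t[2] for t in ordered)
    budget = alpha * total_bw

    non_intensive, intensive = [], []
    used = 0.0
    for t in ordered:
        if not intensive and used + t[2] <= budget:
            non_intensive.append(t)
            used += t[2]
        else:
            intensive.append(t)
    return non_intensive, intensive
```

In this sketch a memory-light thread lands in the non-intensive cluster, while bandwidth-heavy threads fall into the intensive cluster once the αT budget is exhausted.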
TCM Outline
1. Clustering
2. Between Clusters
Prioritization Between Clusters
Prioritize the non-intensive cluster (non-intensive > intensive in priority):
• Increases system throughput
  – Non-intensive threads have greater potential for making progress
• Does not degrade fairness
  – Non-intensive threads are "light" and rarely interfere with intensive threads
TCM Outline
1. Clustering
2. Between Clusters
3. Non-Intensive Cluster (Throughput)
Non-Intensive Cluster
Prioritize threads according to MPKI (lowest MPKI = highest priority):
• Increases system throughput
  – The least intensive thread has the greatest potential for making progress in the processor
TCM Outline
1. Clustering
2. Between Clusters
3. Non-Intensive Cluster (Throughput)
4. Intensive Cluster (Fairness)
Intensive Cluster
Periodically shuffle the priority of threads:
• Increases fairness
• But is treating all threads equally good enough? Equal turns ≠ same slowdown
Case Study: A Tale of Two Threads
Two intensive threads contending:
1. random-access
2. streaming
Which is slowed down more easily?
[Figure: slowdown of each thread under each policy. When random-access is prioritized, the streaming thread slows down 7x while random-access stays at 1x; when streaming is prioritized, the random-access thread slows down 11x while streaming stays at 1x.]
The random-access thread is more easily slowed down.
Why are Threads Different?
• random-access thread: all requests can proceed in parallel across banks → high bank-level parallelism
• streaming thread: all requests target the same row → high row-buffer locality
• When the streaming thread keeps a row activated, the random-access thread's requests to that bank get stuck → the random-access thread is vulnerable to interference
TCM Outline
1. Clustering
2. Between Clusters
3. Non-Intensive Cluster (Throughput)
4. Intensive Cluster (Fairness)
Niceness
How to quantify the difference between threads?
• Bank-level parallelism → vulnerability to interference → high niceness
• Row-buffer locality → causes interference → low niceness
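One plausible rank-based formulation of the niceness idea above can be sketched as follows; the dictionary layout and the rank-difference formula are illustrative assumptions, not the exact hardware metric:

```python
def niceness(threads):
    """Compute a rank-based niceness score per thread.

    `threads` maps thread_id -> (blp, rbl), where blp is bank-level
    parallelism and rbl is row-buffer locality. High BLP (vulnerable to
    interference) raises niceness; high RBL (causes interference) lowers it.
    """
    by_blp = sorted(threads, key=lambda t: threads[t][0])  # low BLP first
    by_rbl = sorted(threads, key=lambda t: threads[t][1])  # low RBL first
    blp_rank = {t: r for r, t in enumerate(by_blp)}
    rbl_rank = {t: r for r, t in enumerate(by_rbl)}
    # Nicer threads rank high in BLP and low in RBL
    return {t: blp_rank[t] - rbl_rank[t] for t in threads}
```

With this scoring, a random-access-style thread (high BLP, low RBL) comes out nicer than a streaming-style thread (low BLP, high RBL), matching the case study.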
Shuffling: Round-Robin vs. Niceness-Aware
1. Round-Robin shuffling
[Figure: priority order of threads A-D over successive shuffle intervals; the most-prioritized thread rotates D, A, B, C, D]
  – GOOD: each thread is prioritized once
  – BAD: nice threads receive lots of interference
2. Niceness-Aware shuffling
[Figure: the most-prioritized thread rotates D, C, B, A, D, while nicer threads keep higher positions in between]
  – GOOD: each thread is prioritized once
  – GOOD: the least nice thread stays mostly deprioritized
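The contrast between the two shuffling schemes can be sketched as priority-order generators. This is a simplified illustration (the paper's actual insertion shuffle is more involved); the function names and list layout are assumptions:

```python
def round_robin_orders(threads, n):
    """Rotate the whole priority order (most-prioritized first) each
    shuffle interval, so every thread takes a turn at the top."""
    orders = []
    order = list(threads)
    for _ in range(n):
        orders.append(list(order))
        order = order[1:] + order[:1]
    return orders

def niceness_aware_orders(threads_by_niceness, n):
    """`threads_by_niceness` is sorted nicest first. Each interval a
    different thread is moved to the top, but the rest keep the nice-first
    order, so the least nice thread stays mostly deprioritized."""
    orders = []
    for i in range(n):
        top = threads_by_niceness[i % len(threads_by_niceness)]
        rest = [t for t in threads_by_niceness if t != top]
        orders.append([top] + rest)
    return orders
```

Over four intervals with threads A (nicest) through D (least nice), both schemes put every thread on top exactly once, but the niceness-aware variant leaves D at the bottom in three of the four intervals, while round-robin leaves it there only once.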
TCM Outline
1. Clustering
2. Between Clusters
3. Non-Intensive Cluster (Throughput)
4. Intensive Cluster (Fairness)
Bringing it All Together
Quantum-Based Operation
• During a quantum (~1M cycles), monitor each thread's behavior:
  1. Memory intensity
  2. Bank-level parallelism
  3. Row-buffer locality
• At the beginning of each quantum:
  – Perform clustering
  – Compute niceness of intensive threads
• Within a quantum, shuffle intensive-cluster priorities every shuffle interval (~1K cycles)
TCM Scheduling Algorithm
1. Highest-rank: requests from higher-ranked threads are prioritized
  • Non-intensive cluster > intensive cluster
  • Non-intensive cluster: lower intensity → higher rank
  • Intensive cluster: rank shuffling
2. Row-hit: row-buffer hit requests are prioritized
3. Oldest: older requests are prioritized
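The three rules above form a strict priority order, which can be sketched as a sort key over pending requests (lower tuple = higher priority). The request fields and helper names here are illustrative, not the controller's actual interface:

```python
def request_priority(req):
    """Sort key implementing TCM's three rules in order."""
    return (
        -req["thread_rank"],   # 1. Highest-rank: higher-ranked thread first
        not req["row_hit"],    # 2. Row-hit: row-buffer hits first (False < True)
        req["arrival_time"],   # 3. Oldest: earlier-arriving requests first
    )

def pick_next(requests):
    """Choose the next request to service from a list of pending ones."""
    return min(requests, key=request_priority)
```

Note the ordering is lexicographic: the row-hit rule only breaks ties between requests from equally ranked threads, and age only breaks the remaining ties.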
Implementation Costs
Required storage at the memory controller (24 cores):

Thread memory behavior | Storage
MPKI                   | ~0.2 Kbits
Bank-level parallelism | ~0.6 Kbits
Row-buffer locality    | ~2.9 Kbits
Total                  | < 4 Kbits

• No computation is on the critical path
Evaluation
Metrics & Methodology
• Metrics
  – System throughput: Weighted Speedup = Σ_i (IPC_i^shared / IPC_i^alone)
  – Unfairness: Maximum Slowdown = max_i (IPC_i^alone / IPC_i^shared)
• Methodology
  – Core model: 4 GHz processor, 128-entry instruction window, 512 KB/core L2 cache
  – Memory model: DDR2
  – 96 multiprogrammed SPEC CPU2006 workloads
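Both metrics above follow directly from per-thread IPC values; a minimal sketch, assuming parallel lists of alone-run and shared-run IPCs:

```python
def weighted_speedup(ipc_alone, ipc_shared):
    """System throughput: sum over threads of IPC_shared / IPC_alone."""
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))

def maximum_slowdown(ipc_alone, ipc_shared):
    """Unfairness: worst per-thread slowdown, max of IPC_alone / IPC_shared."""
    return max(a / s for a, s in zip(ipc_alone, ipc_shared))
```

For example, two threads that each run at half their standalone IPC give a Weighted Speedup of 1.0 and a Maximum Slowdown of 2.0.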
Previous Work
• FRFCFS [Rixner et al., ISCA '00]: prioritizes row-buffer hits
  – Thread-oblivious → low throughput & low fairness
• STFM [Mutlu et al., MICRO '07]: equalizes thread slowdowns
  – Non-intensive threads not prioritized → low throughput
• PAR-BS [Mutlu et al., ISCA '08]: prioritizes the oldest batch of requests while preserving bank-level parallelism
  – Non-intensive threads not always prioritized → low throughput
• ATLAS [Kim et al., HPCA '10]: prioritizes threads that have attained less memory service
  – Most intensive thread starves → low fairness
Results: Fairness vs. Throughput
Averaged over 96 workloads.
[Figure: Maximum Slowdown vs. Weighted Speedup for FRFCFS, STFM, PAR-BS, ATLAS, and TCM. TCM improves system throughput by 5% and fairness by 39% over ATLAS, and system throughput by 5% and fairness by 8% over PAR-BS.]
TCM provides the best fairness and system throughput.
Results: Fairness-Throughput Tradeoff
When each algorithm's configuration parameter is varied...
[Figure: Maximum Slowdown vs. Weighted Speedup curves swept by each algorithm's configuration parameter. Adjusting ClusterThreshold traces TCM's curve, which dominates the operating points of FRFCFS, STFM, PAR-BS, and ATLAS.]
TCM allows a robust fairness-throughput tradeoff.
Operating System Support
• ClusterThreshold is a tunable knob
  – OS can trade off between fairness and throughput
• Enforcing thread weights
  – OS assigns weights to threads
  – TCM enforces thread weights within each cluster
Conclusion
• No previous memory scheduling algorithm provides both high system throughput and fairness
  – Problem: they use a single policy for all threads
• TCM groups threads into two clusters
  1. Prioritize the non-intensive cluster → throughput
  2. Shuffle priorities in the intensive cluster → fairness
  3. Shuffling should favor nice threads → fairness
• TCM provides the best system throughput and fairness
THANK YOU
Thread Weight Support
• Even if the heaviest-weighted thread happens to be the most intensive thread...
  – It is not prioritized over the least intensive thread
Harmonic Speedup
[Figure: results replotted with Harmonic Speedup, an alternative system-throughput metric; better-fairness and better-throughput directions as in the earlier plots]
Shuffling Algorithm Comparison
• Niceness-Aware shuffling has a lower average and lower variance of maximum slowdown:

Shuffling Algorithm      | Round-Robin | Niceness-Aware
E(Maximum Slowdown)      | 5.58        | 4.84
VAR(Maximum Slowdown)    | 1.61        | 0.85
Sensitivity Results

ShuffleInterval (cycles) | 500  | 600  | 700  | 800
System Throughput        | 14.2 | 14.3 | 14.2 | 14.7
Maximum Slowdown         | 6.0  | 5.4  | 5.9  | 5.5

Number of Cores                   | 4   | 8    | 16   | 24   | 32
System Throughput (vs. ATLAS)     | 0%  | 3%   | 2%   | 1%   | 1%
Maximum Slowdown (vs. ATLAS)      | -4% | -30% | -29% | -30% | -41%