Illusionist: Transforming Lightweight Cores into Aggressive Cores on Demand
HPCA-19 February 27, 2013
Amin Ansari1, Shuguang Feng2, Shantanu Gupta3, Josep Torrellas1, and Scott Mahlke4
1 University of Illinois, Urbana-Champaign
2 Northrop Grumman Corp.
3 Intel Corp.
4 University of Michigan, Ann Arbor
Adapting to Application Demands
The number of threads to execute is not constant
o Many threads available: a system with many lightweight cores achieves better throughput
o Few threads available: a system with aggressive cores achieves better throughput
o Single-thread performance is always better with aggressive cores
Asymmetric Chip Multiprocessors (ACMPs):
o Adapt to the variability in the number of threads
o Limited in that there is no dynamic adaptation
To provide dynamic adaptation:
o We use core coupling
[Figure: Core1 and Core2; label: Performance]
Core Coupling
Typically configured as leader/follower cores, where the leader runs ahead and attempts to accelerate the follower
o Slipstream
o Master/slave Speculation
o Flea Flicker
o Dual-core Execution
o Paceline
o DIVA
The leader runs ahead by executing a “pruned” version of the application
The leader speculates on long-latency operations
The leader is aggressively frequency scaled (reduced safety margins)
A smaller follower core simplifies the design/verification of the leader core
Extending Core Coupling
[Figure: A 9 Core ACMP System: one aggressive core (AC), eight lightweight cores (LWCs); label: Hints]
[Figure: Throughput by Configuration: 9-core ACMP, 7 LWCs + coupled cores, Illusionist]
Illusionist vs Prior Work
[Figure: one aggressive core and eight lightweight cores; label: Hints]
Higher single-thread performance for all LWCs
o By using a single aggressive core
o Giving the appearance of 8 semi-aggressive cores
Illusionist vs Prior Work
[Figure: Master Slave Parallelization [Zilles’02]; columns: Master, Slave1, Slave2, Slave3; tasks: A', A, B', B, C', C; label: Hints]
Providing Hints for Many Cores
The original IPC of the aggressive core is ~2X that of a LWC
We want an AC to keep up with a large number of LWCs
o We need to substantially reduce the amount of work that the aggressive core does for each thread running on a LWC
We need to run a lower number of instructions per thread
o We distill the program that the aggressive core needs to run
o We limit the execution of the program to only the most fruitful parts
The main challenge here is to
o Preserve the effectiveness of the hints while removing instructions
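As a rough back-of-envelope check on this argument (an illustration, not a result from the talk): if the aggressive core retires instructions about 2X as fast as a LWC, and only a fraction of each thread's instructions survives distillation, then the number of LWCs it can stay ahead of is roughly the IPC ratio divided by that fraction. The sketch below assumes the distilled code runs at the same IPC as the original and ignores queueing and resynchronization overheads; the function name and the model are hypothetical.

```c
#include <assert.h>

/* Hypothetical back-of-envelope model, not taken from the talk.
 * ipc_ratio:     AC IPC divided by LWC IPC (the slide quotes ~2).
 * kept_fraction: fraction of each thread's dynamic instructions the AC
 *                still executes after distillation, in (0, 1].
 * Returns the largest whole number of LWCs the AC could keep up with,
 * assuming the distilled code runs at the same IPC as the original. */
int max_lwcs_supported(double ipc_ratio, double kept_fraction)
{
    assert(ipc_ratio > 0.0);
    assert(kept_fraction > 0.0 && kept_fraction <= 1.0);
    /* Serving one LWC thread costs the AC kept_fraction / ipc_ratio
     * of its time, so it can serve ipc_ratio / kept_fraction threads. */
    double lwcs = ipc_ratio / kept_fraction;
    return (int)lwcs; /* truncate: a partially served core does not count */
}
```

Under these assumptions, an IPC ratio of 2 with no distillation (kept_fraction = 1.0) supports 2 LWCs, while keeping a quarter of the instructions (kept_fraction = 0.25) supports 8.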
Program Distillation
Objective: reduce the size of the program while preserving the effectiveness of the original hints (branch prediction and cache hits)
Distillation techniques
o Aggressive instruction removal (on average, 77%)
  Remove instructions which do not contribute to hint generation
  Remove highly biased branches and their back slice
  Remove memory inst. accessing the same cache line
o Select the most promising program phases
  Predictor that uses performance counters
  Regression model based on IPC and on cache ($) and branch predictor (BP) miss rates
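One of the removal rules above, dropping memory instructions that access the same cache line, can be sketched as a filter over an address trace. This is a minimal illustration, assuming 64-byte lines and comparing each access only against the most recently kept one; the talk does not state the line size or how far back the real distiller looks, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES 64u /* assumed cache line size, not given in the talk */

/* Sets keep[i] = 1 for accesses that touch a different cache line than
 * the most recently kept access, and keep[i] = 0 for accesses that only
 * re-touch that line. Returns the number of kept accesses. */
size_t filter_same_line(const uint64_t *addr, size_t n, unsigned char *keep)
{
    assert(n == 0 || (addr != NULL && keep != NULL));
    size_t kept = 0;
    uint64_t last_line = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t line = addr[i] / LINE_BYTES;
        if (kept == 0 || line != last_line) {
            keep[i] = 1;      /* first touch of this line: still a useful hint */
            last_line = line;
            kept++;
        } else {
            keep[i] = 0;      /* same line as the last kept access: removable */
        }
    }
    return kept;
}
```

For a trace of byte addresses 0, 8, 56, 64, 72, 0, the filter keeps the accesses at positions 0, 3, and 5, since only those move to a different line than the previously kept access.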
Example of Instruction Removal
Benchmark: 179.art

Original code:

    if (high<=low) return;
    srand(10);
    for (i=low;i<high;i++) {
        for (j=0;j<numf1s;j++) {
            if (i%low) {
                tds[j][i] = tds[j][0];
                tds[j][i] = bus[j][0];
            } else {
                tds[j][i] = tds[j][1];
                tds[j][i] = bus[j][1];
            }
        }
    }
    for (i=low;i<high;i++) {
        for (j=0;j<numf1s;j++) {
            noise1 = (double)(rand()&0xffff);
            noise2 = noise1/(double)0xffff;
            tds[j][i] += noise2;
            bus[j][i] += noise2;
        }
    }
    …

Second loop, unrolled by a factor of four:

    for (i=low;i<high;i=i+4) {
        for (j=0;j<numf1s;j++) {
            noise1 = (double)(rand()&0xffff);
            noise2 = noise1/(double)0xffff;
            tds[j][i] += noise2;
            bus[j][i] += noise2;
        }
        for (j=0;j<numf1s;j++) {
            noise1 = (double)(rand()&0xffff);
            noise2 = noise1/(double)0xffff;
            tds[j][i+1] += noise2;
            bus[j][i+1] += noise2;
        }
        for (j=0;j<numf1s;j++) {
            noise1 = (double)(rand()&0xffff);
            noise2 = noise1/(double)0xffff;
            tds[j][i+2] += noise2;
            bus[j][i+2] += noise2;
        }
        for (j=0;j<numf1s;j++) {
            noise1 = (double)(rand()&0xffff);
            noise2 = noise1/(double)0xffff;
            tds[j][i+3] += noise2;
            bus[j][i+3] += noise2;
        }
    }

Distilled code:

    srand(10);
    for (i=low;i<high;i=i+4) {
        for (j=0;j<numf1s;j++) {
            tds[j][i] = tds[j][1];
            tds[j][i] = bus[j][1];
        }
    }
    for (i=low;i<high;i=i+4) {
        for (j=0;j<numf1s;j++) {
            tds[j][i] = noise2;
            bus[j][i] = noise2;
        }
    }
Hint Phases
[Figure: Performance(accelerated LWC) / Performance(original LWC), plotted over groups of 10K instructions]
If we can predict these phases without actually running the program on both lightweight and aggressive cores, we can limit the dual-core execution to only the most useful phases
Phase Prediction
Phase predictor:
o Does a decent job predicting the IPC trend
o Can sit in either the hypervisor or the operating system, and reads the performance counters while the threads are running
The aggressive core runs the thread that will benefit the most
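The selection step can be sketched as a linear regression over per-thread performance counters, followed by picking the thread with the highest predicted benefit. This is only an illustration of the idea: the struct layouts, function names, and coefficients are placeholders, since the talk does not give the trained model.

```c
#include <assert.h>
#include <stddef.h>

/* Per-thread counters sampled over the last interval (hypothetical layout). */
struct counters {
    double ipc;             /* instructions per cycle on the LWC */
    double cache_miss_rate; /* cache misses per access */
    double bp_miss_rate;    /* mispredictions per branch */
};

/* Placeholder linear model. Real coefficients would come from an
 * offline regression; none are given in the talk. */
struct model {
    double bias, w_ipc, w_cache, w_bp;
};

/* Predicted benefit of coupling this thread with the aggressive core. */
double predict_benefit(const struct model *m, const struct counters *c)
{
    return m->bias + m->w_ipc * c->ipc
         + m->w_cache * c->cache_miss_rate
         + m->w_bp * c->bp_miss_rate;
}

/* Returns the index of the thread with the highest predicted benefit.
 * Ties go to the lowest index. */
size_t pick_thread(const struct model *m, const struct counters *t, size_t n)
{
    assert(m != NULL && t != NULL && n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (predict_benefit(m, &t[i]) > predict_benefit(m, &t[best]))
            best = i;
    }
    return best;
}
```

A hypervisor or operating system hosting such a predictor would refill the counters array each interval and reassign the aggressive core whenever the chosen index changes.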
Illusionist: Core Coupling Architecture
[Figure: core coupling architecture. Labels: aggressive core (L1-Inst, L1-Data), lightweight core (L1-Inst, L1-Data), pipeline stages FET, DEC, REN, DIS, EXE, MEM, COM and FE, DE, RE, DI, EX, ME, CO, hint gathering, queue (tail, head), hint distribution, cache fingerprint, hint disabling, resynchronization signal and hint disabling information, memory hierarchy, shared L2 cache, read-only]
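The queue that sits between hint gathering and hint distribution, with its head and tail pointers, behaves like a bounded FIFO. A minimal single-threaded sketch follows. The hint record, the queue depth, the reject-when-full policy, and flushing on resynchronization are all assumptions made for illustration; the transcript does not specify the hardware format or these policies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HINT_QUEUE_SLOTS 8u /* assumed depth; must be a power of two */

enum hint_kind { HINT_BRANCH, HINT_CACHE }; /* hypothetical hint types */

struct hint {
    enum hint_kind kind;
    uint64_t pc;      /* instruction the hint refers to */
    uint64_t payload; /* e.g. branch outcome or data address */
};

struct hint_queue {
    struct hint slot[HINT_QUEUE_SLOTS];
    unsigned head; /* next hint the lightweight core consumes */
    unsigned tail; /* next free slot the aggressive core fills */
};

void hq_init(struct hint_queue *q) { q->head = q->tail = 0; }

/* Number of pending hints; correct across unsigned wraparound because
 * the slot count is a power of two. */
unsigned hq_count(const struct hint_queue *q) { return q->tail - q->head; }

/* Producer side (hint gathering). Returns 0 if the queue is full. */
int hq_push(struct hint_queue *q, struct hint h)
{
    if (hq_count(q) == HINT_QUEUE_SLOTS) return 0;
    q->slot[q->tail % HINT_QUEUE_SLOTS] = h;
    q->tail++;
    return 1;
}

/* Consumer side (hint distribution). Returns 0 if the queue is empty. */
int hq_pop(struct hint_queue *q, struct hint *out)
{
    assert(out != NULL);
    if (hq_count(q) == 0) return 0;
    *out = q->slot[q->head % HINT_QUEUE_SLOTS];
    q->head++;
    return 1;
}

/* Assumed resynchronization behavior: discard all pending hints. */
void hq_flush(struct hint_queue *q) { q->head = q->tail; }
```

Rejecting hints when the queue is full keeps the sketch simple; an alternative design would stall the producer until the consumer frees a slot.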
Illusionist System
[Figure: Illusionist system. Labels: L2 cache banks, data switch, aggressive core, hint gathering, queues, lightweight cores]
Experimental Methodology
Performance: heavily modified SimAlpha
o Instruction removal and phase-based program pruning
o SPEC-CPU-2K with SimPoint
Power: Wattch, HotLeakage, and CACTI
Area: Synopsys toolchain + 90nm TSMC
Instruction Type Breakdown
[Figure: instruction type breakdown per benchmark. b: before distillation; a: after distillation]
In most benchmarks, the breakdowns are similar.
Area-Neutral Comparison of Alternatives
[Figure: bar chart, normalized to All Aggressive Cores (y-axis 0 to 2). Metrics: System Throughput, Power, Average Single-Thread Performance, Total Energy. Configurations: All Aggressive Cores (ACs), 1 AC + 1 LWC, After Instruction Removal, After Phase-Based Pruning, All Lightweight Cores (LWCs). Arrow label: More Lightweight Cores. Annotations: 34%, 2X]
Conclusion
On-demand acceleration of lightweight cores
o Using a few aggressive cores
The aggressive core keeps up with many LWCs by
o Aggressive inst. removal with a minimal impact on the hints
o Phase-based program pruning based on hint effectiveness
Illusionist provides an interesting design point
o Compared to a CMP with only lightweight cores
  35% better single-thread performance per thread
o Compared to a CMP with only aggressive cores
  2X better system throughput
Comparison with Alternatives
[Figure: bar chart, normalized to All Aggressive Cores (y-axis 0 to 2). Metrics: System Throughput, Power, Average Single-Thread Performance, Total Energy. Configurations: All Aggressive Cores (ACs), 1 AC + 1 LWC, After Instruction Removal, After Phase-Based Pruning, All Lightweight Cores (LWCs). Arrow label: More Lightweight Cores (1, 6, 10). Number of available threads = 60% of the number of lightweight cores]
Comparison with Alternatives
[Figure: bar chart, normalized to All Aggressive Cores (y-axis 0 to 2). Metrics: System Throughput, Power, Average Single-Thread Performance, Total Energy. Configurations: All Aggressive Cores (ACs), 1 AC + 1 LWC, After Instruction Removal, After Phase-Based Pruning, All Lightweight Cores (LWCs). Arrow label: More Lightweight Cores (1, 6, 10). Number of available threads = 30% of the number of lightweight cores]