Post on 14-Dec-2015
International Symposium on Low Power Electronics and Design
Energy-Efficient Non-Minimal Path On-chip Interconnection Network
for Heterogeneous Systems
Jieming Yin, Pingqiang Zhou, Anup Holey, Sachin S. Sapatnekar, and Antonia Zhai
University of Minnesota – Twin Cities
Network-on-Chips
• Scalable
• Provides high bandwidth
• Leads to latency
• Leads to energy consumption
[Figure: eight cores, each paired with a router (R)]
Heterogeneous System
[Figure: four data-parallel cores and four super-scalar cores]
Only some routers are fully utilized
DVFS for Reducing NoC Energy
Dynamic Voltage and Frequency Scaling (DVFS)
• Router energy dominates
• DVFS reduces router energy, but adds delay
• Previous work applies DVFS conservatively
We need more aggressive DVFS
Limitations of Aggressive DVFS
• DVFS to reduce energy
• Limitations of aggressive DVFS
- Increases latency
- Reduces throughput
• Works for limited traffic patterns
[Figure: quadrant chart for Dynamic Voltage Frequency Scaling; axes: Latency (Sensitive / Insensitive) and Throughput (High / Low); regions labeled "Our Previous Work *" and "This Work"; annotations: Latency, Throughput, Contention]
* Zhou et al., NoC Frequency Scaling with Flexible-Pipeline Routers, ISLPED-2011
Flexible-Pipeline Routers
• Flexible pipeline reduces router pipeline delay
[Figure: two router pipelines with stages 1 to 4, each at Frequency = 0.5F, with intervals labeled T]
Exploiting DVFS Opportunity
[Figure: packet from Src1 to Dest1 over routers marked High, Mid, and Low utilization; (a) minimal path routing follows path 1, (b) non-minimal path routing follows path 1’]
Exploiting DVFS Opportunity (cont.)
• Dynamic Energy: $E_{Dyn} \propto V_{dd}^2$
• Static Energy: $E_{Sta} \propto V_{dd}$
• Clock Energy: $E_{Clk} \propto (Freq \cdot V_{dd}^2)$

               DVFS Parameters
Router Speed   Freq (GHz)   Vdd (V)   Normalized Energy
High           1.5          1.2       1.0
Mid            0.75         1.0       0.67
Low            0.375        0.8       0.49

Operating at Mid-frequency gets most benefit
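The three proportionalities on this slide can be turned into per-component scaling factors relative to the High operating point. The sketch below is only an illustration: the function and variable names are hypothetical, and because the slides do not say how the three components are weighted, it does not attempt to reproduce the Normalized Energy column.

```python
# DVFS operating points from the slide's table: name -> (freq in GHz, Vdd in V).
OPERATING_POINTS = {
    "High": (1.5, 1.2),
    "Mid": (0.75, 1.0),
    "Low": (0.375, 0.8),
}


def component_scaling(freq_ghz, vdd, ref_freq_ghz=1.5, ref_vdd=1.2):
    """Scale factors of each energy component relative to a reference point.

    Uses only the slide's proportionalities:
      dynamic ~ Vdd^2, static ~ Vdd, clock ~ Freq * Vdd^2.
    The reference defaults to the High operating point (an assumption
    made for this illustration).
    """
    v = vdd / ref_vdd
    f = freq_ghz / ref_freq_ghz
    return {"dynamic": v ** 2, "static": v, "clock": f * v ** 2}


if __name__ == "__main__":
    for name, (freq, vdd) in OPERATING_POINTS.items():
        scale = component_scaling(freq, vdd)
        print(name, {k: round(x, 3) for k, x in scale.items()})
```

Under these proportionalities, moving from High to Mid already removes roughly 30% of the dynamic component and about 65% of the clock component, which is consistent with the slide's emphasis on the Mid setting.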
Exploiting DVFS Opportunity (cont.)
[Figure: packet from Src1 to Dest1 over routers running at 100%, 50%, and 25% frequency; (a) minimal path routing follows path 1, (b) non-minimal path routing follows path 1’]
1. Performance
2. Dynamic Energy
3. Static Energy
More benefit with bigger network
Outline
• Introduction
• Non-minimal path selection
- Issue
- Solution
- Challenges
• Infrastructure (CPU+GPU)
• Results
• Conclusion
Non-minimal Path Routing
[Figure: Src-to-Dest paths over routers marked High, Mid, and Low utilization; (a) minimal path routing, (b) non-minimal path routing]
Too Close!
[Figure: Src-to-Dest paths over routers marked High, Mid, and Low utilization; (a) minimal path routing, (b) non-minimal path routing; annotated metrics: Performance, Static Energy, Dynamic Energy]
Too Aggressive!
[Figure: non-minimal path routing from Src1 to Dest1 over routers marked High, Mid, and Low utilization; annotated metrics: Static Energy, Dynamic Energy]
Dynamic Network Tuning
[Figure: flowchart in two parts, linked by busy information propagation.
Router: Initial State, Utilization Monitor, V/F Scaling.
Packet: Input, Slack == 1, Slack = 0, Dx>=3 || Dy>=3, Y/N branches leading to Least busy port or Min. path port, Output.]
How to determine Slack?
Busy Information Propagation
• Busy Metrics
- Buffer utilization
- Crossbar utilization
- Router utilization
• Propagation
- Regional congestion awareness [Grot et al. HPCA08]
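One possible reading of the packet side of the Dynamic Network Tuning flowchart is sketched below. It assumes that the Y branches of both tests (Slack == 1 and Dx>=3 || Dy>=3) lead to the least busy port and that every other case takes the minimal-path port. The flowchart's "Slack = 0" box is not modeled because its role cannot be recovered from the transcript. The function name, port labels, and busy values are hypothetical.

```python
def select_output_port(slack, dx, dy, min_path_port, port_busy):
    """Pick an output port for one packet at one router.

    slack         -- 1 if the packet can tolerate extra delay, else 0
    dx, dy        -- remaining hop distance to the destination in X and Y
    min_path_port -- port chosen by ordinary minimal-path routing
    port_busy     -- dict mapping candidate port -> busy metric
                     (lower means less busy); hypothetical representation
    """
    far_enough = abs(dx) >= 3 or abs(dy) >= 3
    if slack == 1 and far_enough and port_busy:
        # Detour: take whichever candidate port currently looks least busy.
        return min(port_busy, key=port_busy.get)
    # No slack, or source and destination too close: stay on the minimal path.
    return min_path_port


if __name__ == "__main__":
    busy = {"E": 0.9, "N": 0.2, "S": 0.5}
    print(select_output_port(1, 4, 0, "E", busy))  # detour allowed
    print(select_output_port(0, 4, 0, "E", busy))  # no slack
    print(select_output_port(1, 2, 2, "E", busy))  # too close
```

A real router would also have to restrict the candidate ports to keep routing deadlock-free and livelock-free; the sketch leaves that out.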
Regional Congestion Awareness
• Local data collection
• Propagation to neighboring routers
• Aggregation of local & non-local data
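The three steps above (collect locally, propagate to neighbors, aggregate local and non-local data) can be illustrated along a single row of routers. This is a minimal sketch, not the scheme from the cited work: the equal weighting of local and incoming values, the one-dimensional layout, and all names are assumptions made for illustration.

```python
def aggregate(local_busy, incoming_busy, local_weight=0.5):
    """Blend a router's own busy metric with the value received from
    its neighbor. The 50/50 default weighting is an assumption."""
    return local_weight * local_busy + (1.0 - local_weight) * incoming_busy


def propagate_along_row(local_values, local_weight=0.5):
    """Propagate busy information from the last router toward router 0.

    local_values[i] is router i's locally collected busy metric.
    Returns, for each router, the aggregated value it would forward to
    its lower-indexed neighbor.
    """
    forwarded = []
    incoming = None
    for local in reversed(local_values):
        if incoming is None:
            # The router at the end of the row has nothing to aggregate.
            incoming = local
        else:
            incoming = aggregate(local, incoming, local_weight)
        forwarded.append(incoming)
    forwarded.reverse()
    return forwarded


if __name__ == "__main__":
    # A busy router at the far end is still visible, attenuated, upstream.
    print(propagate_along_row([0.0, 0.0, 1.0]))
```

With equal weights, the influence of a distant busy router halves at each hop, so nearby congestion dominates the aggregated value.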
Slack in Applications
Slack of a packet: the number of cycles the packet can be delayed without affecting the overall execution time
[Figure: thread schedule timeline (Thread 0, Thread 1, Thread 2, ..., Thread n, Thread 0) marking Thread 0 read miss, Thread 0 ready, and Thread 0 schedule]
• CPU: Not necessarily, but assume NO slack
• GPU: Based on # of threads
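The CPU and GPU rules above can be written as a small helper that produces the binary slack flag a packet would carry. This is a sketch under stated assumptions: the threshold of 8 available warps is a made-up illustrative value (the slides only say that GPU slack is based on the number of threads, and a later backup slide mentions predicting it from available warps), and the function name is hypothetical.

```python
def packet_slack_flag(source, available_warps=0, min_warps_for_slack=8):
    """Return 1 if a packet is treated as having slack, else 0.

    source              -- "cpu" or "gpu"
    available_warps     -- warps the GPU core can still schedule while
                           this request is outstanding
    min_warps_for_slack -- hypothetical threshold, not from the slides
    """
    if source == "cpu":
        # The slides assume CPU packets have no slack.
        return 0
    if source == "gpu":
        # More runnable warps means the miss latency is easier to hide.
        return 1 if available_warps >= min_warps_for_slack else 0
    raise ValueError(f"unknown source: {source!r}")


if __name__ == "__main__":
    print(packet_slack_flag("cpu"))
    print(packet_slack_flag("gpu", available_warps=16))
    print(packet_slack_flag("gpu", available_warps=2))
```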
Tile-Based Multicore System
[Figure: tile-based layout; tiles labeled C, G, L2, and M (legend: CPU Core / GPU SM / L2 Cache / MC), each attached to a router (R), with MEM blocks at the memory controllers]
Benchmark
• Benchmarks
- CPU: afi, ammp, art, equake, kmeans, scalparc
- GPU: blackscholes, lps, lib, nn, bfs
• Evaluate ALL 30 CPU+GPU combinations
• For presentation purposes, classify
- CPU: 1) Memory-bound 2) Computation-bound (based on: L1 cache miss rate)
- GPU: 1) Latency-tolerant 2) Latency-intolerant (based on: slack cycles)
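The count of 30 follows from pairing each of the 6 CPU benchmarks with each of the 5 GPU benchmarks. A short sketch that enumerates the pairs (the variable and function names are illustrative):

```python
from itertools import product

# Benchmark names as listed on the slide.
CPU_BENCHMARKS = ["afi", "ammp", "art", "equake", "kmeans", "scalparc"]
GPU_BENCHMARKS = ["blackscholes", "lps", "lib", "nn", "bfs"]


def workload_pairs():
    """Every CPU+GPU combination, as (cpu, gpu) tuples."""
    return list(product(CPU_BENCHMARKS, GPU_BENCHMARKS))


if __name__ == "__main__":
    pairs = workload_pairs()
    print(len(pairs))
    for cpu, gpu in pairs[:3]:
        print(f"{cpu}+{gpu}")
```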
Benchmark Categorization
[Figure: quadrant chart; axes: Latency (Sensitive / Insensitive) and Throughput (High / Low)]
(I) memory-bound CPU + latency-tolerant GPU
(II) computation-bound CPU + latency-tolerant GPU
(III) memory-bound CPU + latency-intolerant GPU
(IV) computation-bound CPU + latency-intolerant GPU
Network Energy Saving
[Figure: bar chart; y-axis: Network Energy, 0.6 to 1; x-axis: Category I, Category II, Category III, Category IV; series: Baseline, DVFS, DVFS+Non-min]
Energy saving is significant on certain workloads
Performance Impact (CPU)
[Figure: bar chart; y-axis: Normalized IPC, 0.6 to 1; x-axis: Category I, Category II, Category III, Category IV; series: Baseline, DVFS, DVFS+Non-min]
[Figure: bar chart; y-axis: Normalized IPC, 0.9 to 1; x-axis: equake+LPS, art+NN, ammp+LIB; series: Baseline, DVFS, DVFS+Non-min]
Performance Impact (GPU)
[Figure: bar chart; y-axis: Normalized IPC, 0.6 to 1; x-axis: Category I, Category II, Category III, Category IV; series: Baseline, DVFS, DVFS+Non-min]
Performance penalty is minimal compared to DVFS
Conclusion
Non-minimal Path NoC
+ Balance on-chip workloads
+ Reduce NoC energy
Workload Mix
• High throughput
• Latency Insensitive
[Figure: quadrant chart; axes: Latency (Sensitive / Insensitive) and Throughput (High / Low)]
Given the diverse traffic patterns in heterogeneous systems, non-minimal routing should be deployed judiciously
Thank You!
Exploiting Slack in GPU
[Figure: System SpeedUp (y-axis, 0 to 1.2) vs. Delay of Scheduling in cycles (x-axis: 0, 5, 10, 15, 20, 25, 50, 100); series: BlackScholes, LPS, LIB, NN, BFS, RAY, MUM]
Exploiting Slack in GPU
Predict slack based on # of available warps
[Figure: Avail Warps (y-axis, 0 to 25) vs. Tolerable Delay Cycles (x-axis, 0 to 30); series: BlackScholes, LPS, LIB, NN, BFS, RAY, MUM]