international symposium on low power electronics and design energy-efficient non-minimal path...

International Symposium on Low Power Electronics and Design

Energy-Efficient Non-Minimal Path On-chip Interconnection Network for Heterogeneous Systems

Jieming Yin, Pingqiang Zhou, Anup Holey, Sachin S. Sapatnekar, and Antonia Zhai

University of Minnesota – Twin Cities

Network-on-Chips

Leads to latency

Leads to energy consumption

Scalable

Provides high bandwidth

Heterogeneous System

Data-parallel

Super-scalar

Only some routers are fully utilized

DVFS for Reducing NoC Energy

Dynamic Voltage and Frequency Scaling
• Router energy dominates
• DVFS reduces router energy, but increases delay
• Previous work is conservative in how aggressively it scales

We need more aggressive DVFS

Limitations of Aggressive DVFS

Dynamic Voltage

Frequency Scaling

Our Previous Work *

This Work

Latency Throughput

• DVFS to reduce energy
• Limitations of Aggressive DVFS
  – Increases latency
  – Reduces throughput

Works only for limited traffic patterns

Sensitive Insensitive

Latency

Contention

* Zhou et al., NoC Frequency Scaling with Flexible-Pipeline Routers, ISLPED-2011


Flexible-Pipeline Routers

Frequency = 0.5F

Flexible pipeline reduces router pipeline delay

Exploiting DVFS Opportunity

(a) Minimal path routing

High utilization

Mid utilization

Low utilization

Src1 Dest1

(b) Non-minimal path routing

Src1 Dest1

• Dynamic Energy: $E_{Dyn} \propto V_{dd}^2$

• Static Energy: $E_{Sta} \propto V_{dd}$

• Clock Energy: $E_{Clk} \propto Freq \cdot V_{dd}^2$

DVFS Parameters

Router Speed    Freq (GHz)    Vdd (V)    Normalized Energy
High            1.5           1.2        1.0
Mid             0.75          1.0        0.67
Low             0.375         0.8        0.49
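The proportionalities above can be applied to the three operating points to see how each energy component scales relative to the High setting. The sketch below is illustrative only: the function name and the choice of High as the reference are assumptions. The slides do not say how the three components are weighted, so it does not attempt to reproduce the Normalized Energy column.

```python
# Operating points copied from the DVFS table: (frequency in GHz, Vdd in volts).
OPERATING_POINTS = {
    "High": (1.5, 1.2),
    "Mid": (0.75, 1.0),
    "Low": (0.375, 0.8),
}


def scaling_factors(level, reference="High"):
    """Return per-component energy scaling relative to the reference level.

    Hypothetical helper: it only applies the three proportionalities from
    the slide; the relative weight of each component is not known.
    """
    freq, vdd = OPERATING_POINTS[level]
    ref_freq, ref_vdd = OPERATING_POINTS[reference]
    v = vdd / ref_vdd      # voltage ratio
    f = freq / ref_freq    # frequency ratio
    return {
        "dynamic": v ** 2,     # E_Dyn is proportional to Vdd^2
        "static": v,           # E_Sta is proportional to Vdd
        "clock": f * v ** 2,   # E_Clk is proportional to Freq * Vdd^2
    }


if __name__ == "__main__":
    for level in OPERATING_POINTS:
        factors = scaling_factors(level)
        print(level, {name: round(x, 3) for name, x in factors.items()})
```

Under these proportionalities, clock energy falls fastest because it scales with both frequency and the square of the voltage.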

Exploiting DVFS Opportunity (cont.)

Operating at Mid-frequency gets most benefit

100% frequency

50% frequency

25% frequency

Src1 Dest1

Exploiting DVFS Opportunity (cont.)

1. Performance

2. Dynamic Energy

3. Static Energy

More benefit with bigger network

Outline

• Introduction
• Non-minimal path selection
  - Issue
  - Solution
  - Challenges
• Infrastructure (CPU+GPU)
• Results
• Conclusion

Non-minimal Path Routing

High utilization

Mid utilization

Low utilization

Src Dest

(b) Non-minimal path routing

Src Dest

Too Close !

High utilization

Mid utilization

Low utilization

Src Dest

Src Dest

Performance

Static Energy

Dynamic Energy

Non-minimal path routing

Too Aggressive !

Src1 Dest1

High utilization

Mid utilization

Low utilization

Static Energy

Dynamic Energy

Dynamic Network Tuning

Packet:
• Slack = 0: output on the min. path port
• Slack == 1: if Dx>=3 || Dy>=3 (Y), output on the least busy port; otherwise output on the min. path port

Router:
• Initial State
• Utilization Monitor
• V/F Scaling

Busy information propagation
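One possible reading of the packet-side decision on this slide is sketched below. It assumes that a packet with slack may leave on the least busy port only when it is at least three hops from its destination in X or Y, and takes the minimal-path port otherwise. The function name, the `port_busy` map, and the idea that the caller supplies the candidate ports are assumptions for illustration, not details from the slides.

```python
def select_output_port(slack, dx, dy, min_path_port, port_busy, min_distance=3):
    """Pick an output port for a packet at one router.

    slack: 1 if the packet can tolerate extra delay, else 0.
    dx, dy: remaining hop distance to the destination in X and Y.
    min_path_port: port chosen by the minimal-path routing function.
    port_busy: hypothetical map from candidate port name to a busy metric
        (lower means less busy).
    """
    far_enough = abs(dx) >= min_distance or abs(dy) >= min_distance
    if slack == 1 and far_enough:
        # Detour allowed: take the least busy candidate port.
        return min(port_busy, key=port_busy.get)
    # No slack, or source and destination too close: stay on the minimal path.
    return min_path_port


if __name__ == "__main__":
    busy = {"E": 0.9, "N": 0.2, "S": 0.5}
    print(select_output_port(1, 4, 0, "E", busy))
    print(select_output_port(0, 4, 0, "E", busy))
```

The distance check reflects the "Too Close !" case on the earlier slide, where a detour between nearby routers is not worthwhile.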

How to determine Slack?

Busy Information Propagation
• Busy Metrics
  - Buffer utilization
  - Crossbar utilization
  - Router utilization
• Propagation
  - Regional congestion awareness [Grot et al. HPCA08]

Regional Congestion Awareness

• Local data collection
• Propagation to neighboring routers
• Aggregation of local & non-local data
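The three steps above can be sketched for a single row of routers. The equal weighting of local and non-local data, the function names, and the one-dimensional layout are all assumptions made for illustration; the slides do not give the aggregation formula.

```python
def aggregate(local, from_neighbor, local_weight=0.5):
    """Blend a router's own busy metric with the value received from a neighbor.

    local_weight is a hypothetical parameter; 0.5 weights both sources equally.
    """
    return local_weight * local + (1.0 - local_weight) * from_neighbor


def propagate_along_row(local_metrics, local_weight=0.5):
    """Propagate aggregated busy values from the last router toward the first.

    local_metrics[i] is the locally collected busy metric of router i.
    Returns the value each router would advertise to its upstream neighbor.
    """
    advertised = [0.0] * len(local_metrics)
    downstream = None  # the last router has no downstream neighbor
    for i in range(len(local_metrics) - 1, -1, -1):
        if downstream is None:
            advertised[i] = local_metrics[i]
        else:
            advertised[i] = aggregate(local_metrics[i], downstream, local_weight)
        downstream = advertised[i]
    return advertised


if __name__ == "__main__":
    print(propagate_along_row([0.2, 0.8, 0.4]))
```

With this weighting, congestion at a distant router still influences the advertised value, but its contribution halves with every hop.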

Slack in Applications

Slack of a packet: the number of cycles the packet can be delayed without affecting the overall execution time

Thread 0 Thread 1 Thread 2 Thread n Thread 0

Thread 0 read miss

Thread 0 ready

Thread 0 schedule

• CPU: Not necessarily, but assume NO slack
• GPU: Based on # of threads
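The two rules above can be sketched as a one-bit slack assignment. The threshold value and the function names are hypothetical. The slides say only that CPU packets are assumed to have no slack and that GPU slack depends on the number of threads (available warps) that can hide the delay.

```python
def predict_gpu_slack(available_warps, warp_threshold=4):
    """Return 1 if enough warps are ready to hide a delayed memory reply.

    warp_threshold is an illustrative assumption, not a value from the slides.
    """
    return 1 if available_warps >= warp_threshold else 0


def packet_slack(source_kind, available_warps=0, warp_threshold=4):
    """Assign a slack bit to a packet based on which kind of core issued it."""
    if source_kind == "CPU":
        # CPU packets are conservatively treated as having no slack.
        return 0
    if source_kind == "GPU":
        return predict_gpu_slack(available_warps, warp_threshold)
    raise ValueError(f"unknown source kind: {source_kind!r}")


if __name__ == "__main__":
    print(packet_slack("CPU"))
    print(packet_slack("GPU", available_warps=8))
    print(packet_slack("GPU", available_warps=1))
```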

Tile-Based Multicore System

CPU Core/GPU SM/L2 Cache/

C L2 C L2

G G G G

M L2 C L2

G G G G

Benchmark

• Benchmarks
  – CPU: afi, ammp, art, equake, kmeans, scalparc
  – GPU: blackscholes, lps, lib, nn, bfs
• Evaluate ALL 30 CPU+GPU combinations
• For presentation purposes, classify
  - CPU (based on L1 cache miss rate): 1) Memory-bound 2) Computation-bound
  - GPU (based on slack cycles): 1) Latency-tolerant 2) Latency-intolerant

Benchmark Categorization

Latency

(I) memory-bound CPU + latency-tolerant GPU

(II) computation-bound CPU + latency-tolerant GPU

(III) memory-bound CPU + latency-intolerant GPU

(IV) computation-bound CPU + latency-intolerant GPU
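As a quick check of the workload count, pairing the six CPU benchmarks with the five GPU benchmarks gives the 30 combinations mentioned in the slides. The sketch below also maps the two binary classifications to categories I to IV. The slides do not list which benchmark falls in which class, so the classification is passed in as booleans rather than guessed; all function names are illustrative.

```python
from itertools import product

# Benchmark names as they appear on the slide.
CPU_BENCHMARKS = ["afi", "ammp", "art", "equake", "kmeans", "scalparc"]
GPU_BENCHMARKS = ["blackscholes", "lps", "lib", "nn", "bfs"]


def all_combinations():
    """Return every (CPU benchmark, GPU benchmark) pair."""
    return list(product(CPU_BENCHMARKS, GPU_BENCHMARKS))


def category(cpu_memory_bound, gpu_latency_tolerant):
    """Map the two binary classifications to category I, II, III, or IV."""
    if gpu_latency_tolerant:
        return "I" if cpu_memory_bound else "II"
    return "III" if cpu_memory_bound else "IV"


if __name__ == "__main__":
    print(len(all_combinations()))
    print(category(cpu_memory_bound=True, gpu_latency_tolerant=False))
```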

Network Energy Saving

[Figure: bar chart over Categories I-IV; y-axis ticks from 0.6 to 1; legend: Baseline DVFS, DVFS+Non-min]

Energy saving is significant on certain workloads

Performance Impact (CPU)

[Figure: bar chart over Categories I-IV; y-axis ticks from 0.6 to 0.95; legend: Baseline DVFS, DVFS+Non-min]

Performance Impact (GPU)

[Figure: bar chart for equake+LPS, art+NN, ammp+LIB; y-axis ticks from 0.9 to 0.99; legend: Baseline DVFS, DVFS+Non-min]

[Figure: bar chart over Categories I-IV; y-axis ticks from 0.6 to 0.95; legend: Baseline DVFS, DVFS+Non-min]

Performance penalty is minimal compared to DVFS

Conclusion

Non-minimal Path NoC
+ Balance on-chip workloads
+ Reduce NoC energy

Workload Mix
• High throughput
• Latency insensitive

Given the diverse traffic patterns in heterogeneous systems, non-minimal routing should be deployed judiciously

Thank You!

Exploiting Slack in GPU

[Figure: curves for BlackScholes, LPS, LIB, NN, BFS, RAY, MUM; x-axis: Delay of Scheduling (cycles), ticks from 0 to 100; y-axis ticks from 0 to 0.8]

Predict slack based on # of available warps

Exploiting Slack in GPU

[Figure: BlackScholes, BFS, RAY; axis label: Tolerable Delay Cycles, ticks from 0 to 30]
