Composite Cores: Pushing Heterogeneity into a Core

University of Michigan, Electrical Engineering and Computer Science. Composite Cores: Pushing Heterogeneity into a Core. Andrew Lukefahr, Shruti Padmanabha, Reetuparna Das, Faissal M. Sleiman, Ronald Dreslinski, Thomas F. Wenisch, and Scott Mahlke. University of Michigan. Micro 45, May 8th 2012.


Page 1: Composite Cores: Pushing Heterogeneity into a Core

University of Michigan, Electrical Engineering and Computer Science

Composite Cores: Pushing Heterogeneity into a Core

Andrew Lukefahr, Shruti Padmanabha, Reetuparna Das, Faissal M. Sleiman, Ronald Dreslinski, Thomas F. Wenisch, and Scott Mahlke

University of Michigan

Micro 45, May 8th 2012

Page 2: Composite Cores: Pushing Heterogeneity into a Core


High Performance Cores

High performance cores waste energy on low performance phases

[Figure: performance and energy plotted over time]

High energy yields high performance

Low performance DOES NOT yield low energy

Page 3: Composite Cores: Pushing Heterogeneity into a Core


Core Energy Comparison

[Figure: core energy breakdown for Out-of-Order and In-Order cores; sources: Brooks, ISCA’00 and Dally, IEEE Computer’08]

• Out-of-Order contains performance enhancing hardware
– Not necessary for correctness

Do we always need the extra hardware?

Page 4: Composite Cores: Pushing Heterogeneity into a Core


Previous Solution: Heterogeneous Multicore

• 2+ Cores
• Same ISA, different implementations
– High performance, but more energy
– Energy efficient, but less performance
• Share memory at high level
– Share L2 cache (Kumar ‘04)
– Coherent L2 caches (ARM’s big.LITTLE)

• Operating System (or programmer) maps application to smallest core that provides needed performance

Page 5: Composite Cores: Pushing Heterogeneity into a Core

Current System Limitations

• Migration between cores incurs high overheads
– 20K cycles (ARM’s big.LITTLE)
• Sample-based schedulers
– Sample different cores' performances and then decide whether to reassign the application
– Assume stable performance within a phase
• Phase must be long to be recognized and exploited
– 100M-500M instructions in length

Do finer grained phases exist?
Can we exploit them?
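The link between migration cost and phase length can be made concrete by amortizing the migration over the instructions executed between switches. The following is a minimal sketch, not taken from the slides: the function name and the IPC of 1.0 are assumptions for illustration, while the 20K-cycle cost and the 100M-instruction phase length are the figures quoted above.

```python
def migration_overhead_fraction(instructions_between_switches, ipc, migration_cycles=20_000):
    """Fraction of all cycles spent migrating, assuming one switch per interval."""
    useful_cycles = instructions_between_switches / ipc
    return migration_cycles / (useful_cycles + migration_cycles)

# Assumed IPC of 1.0 (hypothetical); 20K-cycle migration cost from the slide.
coarse = migration_overhead_fraction(100_000_000, ipc=1.0)
fine = migration_overhead_fraction(1_000, ipc=1.0)
print(f"100M-instruction phases: {coarse:.4%} of cycles spent migrating")
print(f"1K-instruction phases:   {fine:.2%} of cycles spent migrating")
```

Under these assumptions the overhead is negligible for 100M-instruction phases but dominates execution when switching every 1K instructions, which is why a much cheaper switch is needed before fine-grained phases can be exploited.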

Page 6: Composite Cores: Pushing Heterogeneity into a Core


Performance Change in GCC

• Average IPC over a 1M instruction window (Quantum)
• Average IPC over 2K Quanta

[Figure, two panels: IPC (instructions/cycle, 0 to 3) of the Big Core and Little Core plotted against instructions (100K to 1M)]

Page 7: Composite Cores: Pushing Heterogeneity into a Core


Finer Quantum

• 20K instruction window from GCC
• Average IPC over 100 instruction quanta

What if we could map these to a Little Core?

[Figure: IPC (instructions/cycle, 0 to 3) of the Big Core and Little Core plotted against instructions (160K to 180K)]
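The effect of quantum length on what a scheduler can see may be illustrated with a small averaging routine. This is a sketch under stated assumptions: the function name and the synthetic trace are hypothetical, and cycles are attributed to individual instructions purely for illustration.

```python
def ipc_per_quantum(cycles_per_instruction, quantum=100):
    """Average IPC for each complete quantum of `quantum` committed instructions.

    cycles_per_instruction[i] is the number of cycles attributed to instruction i.
    """
    ipcs = []
    for start in range(0, len(cycles_per_instruction) - quantum + 1, quantum):
        cycles = sum(cycles_per_instruction[start:start + quantum])
        ipcs.append(quantum / cycles)
    return ipcs

# Hypothetical trace: 100 fast instructions (0.5 cycles each) followed by
# 100 slow ones (2 cycles each), for example during a burst of cache misses.
trace = [0.5] * 100 + [2.0] * 100
print(ipc_per_quantum(trace, quantum=100))  # [2.0, 0.5]
print(ipc_per_quantum(trace, quantum=200))  # [0.8]
```

With a 100-instruction quantum the low-IPC phase is visible on its own; with a 200-instruction quantum it is averaged away.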

Page 8: Composite Cores: Pushing Heterogeneity into a Core

Our Approach: Composite Cores

• Hypothesis: Exploiting fine-grained phases allows more opportunities to run on a Little core
• Problems
I. How to minimize switching overheads?
II. When to switch cores?
• Questions
I. How fine-grained should we go?
II. How much energy can we save?

Page 9: Composite Cores: Pushing Heterogeneity into a Core


Problem I: State Transfer

[Figure: pipelines and state of separate out-of-order and in-order cores (Fetch, Decode, Rename, O3 Execute, InO Execute, iCache, iTLB, Branch Pred, dCache, dTLB, RAT, Reg File), with state sizes annotated as 10s of KB, <1 KB and 10s of KB]

State transfer costs can be very high: ~20K cycles (ARM’s big.LITTLE)

Limits switching to coarse granularity: 100M instructions (Kumar’04)

Page 10: Composite Cores: Pushing Heterogeneity into a Core


Creating a Composite Core

[Figure: Composite Core block diagram: a Big uEngine and a Little uEngine coordinated by a Controller, with Fetch, iCache, iTLB, Branch Pred, dCache and dTLB drawn once for the combined core; other components shown are Decode, RAT, Reg File, O3 Execute, InO Execute, Load/Store Queue and Mem; annotated <1KB]

Only one uEngine active at a time

Page 11: Composite Cores: Pushing Heterogeneity into a Core

Hardware Sharing Overheads

• Big uEngine needs
– High fetch width
– Complex branch prediction
– Multiple outstanding data cache misses
• Little uEngine wants
– Low fetch width
– Simple branch prediction
– Single outstanding data cache miss
• Must build shared units for Big uEngine
– Over-provision for Little uEngine
• Assume clock gating for inactive uEngine
– Still has static leakage energy

Little pays ~8% energy overhead to use over-provisioned fetch + caches

Page 12: Composite Cores: Pushing Heterogeneity into a Core

Problem II: When to Switch

• Goal: Maximize time on the Little uEngine subject to maximum performance loss
• User-Configurable
• Traditional OS-based schedulers won’t work
– Decisions too frequent
– Needs to be made in hardware
• Traditional sampling-based approaches won’t work
– Performance not stable for long enough
– Frequent switching just to sample wastes cycles

Page 13: Composite Cores: Pushing Heterogeneity into a Core


What uEngine to Pick

• The threshold $\Delta CPI_{Threshold}$ is hard to determine a priori; it depends on the application
– Use a controller to learn appropriate value over time

[Figure: IPC (instructions/cycle, 0 to 3) vs. instructions (200K to 1M) for the Big Core, the Little Core and their Difference; the $\Delta CPI_{Threshold}$ level separates "Run on Big" regions from "Run on Little" regions]

Let user configure the target value

Page 14: Composite Cores: Pushing Heterogeneity into a Core


Reactive Online Controller

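[Figure: reactive online controller block diagram. The Big Model and Little Model supply $CPI_{big}$ and $CPI_{little}$ to the Switching Controller, which selects the Little uEngine when its condition is true and the Big uEngine when it is false. The Threshold Controller derives $\Delta CPI_{Threshold}$ from $CPI_{error}$, which is formed at a summing node from $CPI_{actual}$ ($\sum CPI_{Observed}$) and $CPI_{target}$ ($S * \sum CPI_{Big}$). User-Selected Performance is an input.]

Switching Controller: $\Delta CPI_{Threshold} + CPI_{little} \le CPI_{big}$

Threshold Controller: $\Delta CPI_{Threshold} = K_p CPI_{error} + K_i \sum CPI_{error}$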
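The two controller equations can be combined into a small software sketch. This is an illustration under stated assumptions, not the hardware design: the class and method names are hypothetical, the gains K_p = K_i = 1.0 are arbitrary, S = 1.05 is one possible reading of a 5% performance-loss target, and the slide does not show how CPI_error is signed. The sketch assumes CPI_error = CPI_actual - CPI_target, so that falling behind the target raises the threshold and pushes execution back to the Big uEngine.

```python
class ReactiveController:
    """PI threshold controller plus switching rule, following the slide's equations."""

    def __init__(self, slowdown, kp, ki):
        self.slowdown = slowdown      # S: allowed CPI relative to all-Big (assumed 1.05)
        self.kp, self.ki = kp, ki     # K_p and K_i gains (values are assumptions)
        self.sum_cpi_big = 0.0        # running sum of Big CPI
        self.sum_cpi_observed = 0.0   # running sum of observed CPI
        self.sum_error = 0.0          # integral term: running sum of CPI_error

    def threshold(self):
        cpi_target = self.slowdown * self.sum_cpi_big
        cpi_actual = self.sum_cpi_observed
        # Assumed sign: a positive error means execution is slower than the target.
        cpi_error = cpi_actual - cpi_target
        self.sum_error += cpi_error
        return self.kp * cpi_error + self.ki * self.sum_error

    def choose(self, cpi_big, cpi_little):
        """Pick the uEngine for the next quantum; call once per quantum."""
        delta = self.threshold()
        return "little" if delta + cpi_little <= cpi_big else "big"

    def record(self, cpi_big, cpi_observed):
        self.sum_cpi_big += cpi_big
        self.sum_cpi_observed += cpi_observed


def little_fraction(quanta, cpi_big=1.0, cpi_little=1.2):
    """Run the controller on a synthetic workload with constant CPIs."""
    ctrl = ReactiveController(slowdown=1.05, kp=1.0, ki=1.0)
    on_little = 0
    for _ in range(quanta):
        engine = ctrl.choose(cpi_big, cpi_little)
        observed = cpi_little if engine == "little" else cpi_big
        on_little += engine == "little"
        ctrl.record(cpi_big, observed)
    return on_little / quanta

print(little_fraction(1000))
```

On this synthetic workload the controller alternates between the two uEngines so that the accumulated CPI stays close to S times the all-Big CPI; real workloads would feed in per-quantum model estimates instead of constants.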

Page 15: Composite Cores: Pushing Heterogeneity into a Core


uEngine Modeling

while (flag) { foo(); flag = bar(); }

Little uEngine IPC: 1.66
Big uEngine IPC: ??? (estimated by the model: 2.15)

Collect metrics of active uEngine
• iL1, dL1 cache misses
• L2 cache misses
• Branch mispredicts
• ILP, MLP, CPI

Use a linear model to estimate inactive uEngine’s performance
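A linear model of this kind reduces to a weighted sum of the collected metrics plus a constant. The sketch below assumes the inputs named on the "Regression Coefficients" backup slide; the function name, the metric values and every weight are placeholders chosen for illustration and are not the paper's fitted coefficients.

```python
def estimate_inactive_cpi(metrics, weights, constant):
    """Linear model: constant + sum of weight * metric over all metrics."""
    return constant + sum(weights[name] * value for name, value in metrics.items())

# Hypothetical metrics for one quantum on the active (Little) uEngine.
little_metrics = {
    "l2_miss": 0.002,
    "l2_hit": 0.02,
    "branch_mispredicts": 0.01,
    "ilp": 2.0,
    "mlp": 1.2,
    "active_cpi": 1 / 1.66,   # Little IPC of 1.66, as on the slide
}
# Placeholder weights for illustration only; not the paper's fitted values.
little_to_big_weights = {
    "l2_miss": 50.0,
    "l2_hit": 1.0,
    "branch_mispredicts": 5.0,
    "ilp": -0.05,
    "mlp": -0.02,
    "active_cpi": 0.6,
}
big_cpi = estimate_inactive_cpi(little_metrics, little_to_big_weights, constant=0.1)
print(f"Estimated Big uEngine IPC: {1 / big_cpi:.2f}")
```

A second weight set would be needed for the opposite direction (Big active, Little estimated), since the two uEngines respond differently to the same events.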

Page 16: Composite Cores: Pushing Heterogeneity into a Core

Evaluation

Architectural Feature: Parameters
Big uEngine: 3 wide O3 @ 1.0GHz; 12 stage pipeline; 128 ROB entries; 128 entry register file
Little uEngine: 2 wide InOrder @ 1.0GHz; 8 stage pipeline; 32 entry register file
Memory System: 32 KB L1 i/d cache, 1 cycle access; 1MB L2 cache, 15 cycle access; 1GB Main Mem, 80 cycle access
Controller: 5% performance loss relative to all big core

Page 17: Composite Cores: Pushing Heterogeneity into a Core


Little Engine Utilization

[Figure: Little engine utilization (0% to 100%) vs. quantum length in instructions (100 to 10M) for astar, bzip2, gcc, gobmk, h264ref, hmmer, mcf, omnetpp, sjeng and the average; annotations: "Traditional OS-Based Quantum" and "Fine-Grained Quantum"]

• 3-Wide O3 (Big) vs. 2-Wide InOrder (Little)
• 5% performance loss relative to all Big

More time on little engine with same performance loss

Page 18: Composite Cores: Pushing Heterogeneity into a Core


Engine Switches

[Figure: switches per million instructions (0 to 5000) vs. quantum length in instructions (100 to 10M) for astar, bzip2, gcc, gobmk, h264ref, hmmer, mcf, omnetpp, sjeng and the average]

Need LOTS of switching to maximize utilization

~1 Switch / 2800 Instructions

~1 Switch / 306 Instructions
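The two annotations can be related to the plot's unit (switches per million instructions) with a one-line conversion. This is simple arithmetic on the slide's figures; the helper name is hypothetical.

```python
def switches_per_million(instructions_per_switch):
    """Convert 'one switch every N instructions' into switches per million instructions."""
    return 1_000_000 / instructions_per_switch

for n in (2800, 306):
    rate = switches_per_million(n)
    print(f"1 switch / {n} instructions = about {rate:.0f} switches per million instructions")
# 2800 -> about 357, 306 -> about 3268
```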

Page 19: Composite Cores: Pushing Heterogeneity into a Core


Performance Loss

[Figure: performance relative to Big (80% to 105%) vs. quantum length in instructions (100 to 10M) for astar, bzip2, gcc, gobmk, h264ref, hmmer, mcf, omnetpp, sjeng and the average; annotation: Composite Cores (Quantum Length = 1000)]

Switching overheads negligible until ~1000 instructions

Page 20: Composite Cores: Pushing Heterogeneity into a Core

Fine-Grained vs. Coarse-Grained

• Little uEngine’s average power 8% higher
– Due to shared hardware structures
• Fine-Grained can map 41% more instructions to the Little uEngine over Coarse-Grained
• Results in overall 27% decrease in average power over Coarse-Grained

Page 21: Composite Cores: Pushing Heterogeneity into a Core

Decision Techniques

1. Oracle: Knows both uEngines’ performance for all quanta
2. Perfect Past: Knows both uEngines’ past performance perfectly
3. Model: Knows only active uEngine’s past, models inactive uEngine using default weights

All models target 95% of the all-Big uEngine’s performance
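The difference between the Oracle and Perfect Past techniques can be sketched as a difference in which quantum's measurements feed the decision. This is an illustrative simplification: the function name and data are hypothetical, a fixed threshold stands in for the learned one, the rule is written as threshold + CPI_little <= CPI_big with a negative threshold, and "past" is assumed to mean the immediately preceding quantum.

```python
def little_quanta(cpi_big, cpi_little, threshold, technique):
    """Return the indices of quanta mapped to the Little uEngine.

    'oracle' decides quantum i from quantum i's own CPIs;
    'perfect_past' decides quantum i from quantum i-1's CPIs.
    """
    chosen = []
    for i in range(len(cpi_big)):
        if technique == "oracle":
            j = i
        elif technique == "perfect_past":
            j = i - 1
            if j < 0:
                continue  # no history yet: stay on Big
        else:
            raise ValueError(technique)
        if threshold + cpi_little[j] <= cpi_big[j]:
            chosen.append(i)
    return chosen

# Hypothetical workload: Little is nearly as fast as Big for two quanta,
# then Big pulls far ahead for two quanta.
cpi_big = [1.0, 1.0, 0.5, 0.5]
cpi_little = [1.1, 1.1, 1.0, 1.0]
print(little_quanta(cpi_big, cpi_little, -0.2, "oracle"))        # [0, 1]
print(little_quanta(cpi_big, cpi_little, -0.2, "perfect_past"))  # [1, 2]
```

In this example Perfect Past misses quantum 0 for lack of history and wrongly keeps quantum 2 on Little because it reacts to the phase change one quantum late. The Model technique would additionally replace the inactive uEngine's CPI with an estimate.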

Page 22: Composite Cores: Pushing Heterogeneity into a Core


Little Engine Utilization

[Figure: dynamic instructions on Little (0% to 100%) under the Oracle, Perfect Past and Model techniques for Astar, Bzip2, Gcc, GoBmk, H264ref, Hmmer, Mcf, OmnetPP, Sjeng and the Average]

High utilization for memory bound application

Issue width dominates computation bound

Maps 25% of the dynamic instructions onto the Little uEngine

Page 23: Composite Cores: Pushing Heterogeneity into a Core


Energy Savings

• Includes the overhead of shared hardware structures

[Figure: energy savings relative to Big (0% to 100%) under the Oracle, Perfect Past and Model techniques for Astar, Bzip2, Gcc, GoBmk, H264ref, Hmmer, Mcf, OmnetPP, Sjeng and the Average]

18% reduction in energy consumption

Page 24: Composite Cores: Pushing Heterogeneity into a Core


User-Configured Performance

[Figure: Utilization, Overall Performance and Energy Savings (0% to 100%) at user-configured performance-loss targets of 1%, 5%, 10% and 20%]

1% performance loss yields 4% energy savings

20% performance loss yields 44% energy savings
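The trade-off can be summarized as energy saved per percent of performance given up. The sketch below only divides figures stated in the slides (the 5% point is the 18% energy saving reported for the 5% performance-loss target); the variable names are hypothetical.

```python
# % performance loss -> % energy savings, as stated in the slides.
points = {1: 4, 5: 18, 20: 44}
ratios = {loss: savings / loss for loss, savings in points.items()}
for loss, ratio in ratios.items():
    print(f"{loss}% loss: {ratio:.1f}% energy saved per 1% performance lost")
# 1% -> 4.0, 5% -> 3.6, 20% -> 2.2
```

The ratio falls as the allowed loss grows, so each additional percent of performance buys less energy.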

Page 25: Composite Cores: Pushing Heterogeneity into a Core

More Details in the Paper

• Estimated uEngine area overheads
• uEngine model accuracy
• Switching timing diagram
• Hardware sharing overheads analysis

Page 26: Composite Cores: Pushing Heterogeneity into a Core

Conclusions

• Even high performance applications experience fine-grained phases of low throughput
– Map those to a more efficient core
• Composite Cores allows
– Fine-grained migration between cores
– Low overhead switching
• 18% energy savings by mapping 25% of the instructions to Little uEngine with a 5% performance loss

Questions?


Page 28: Composite Cores: Pushing Heterogeneity into a Core


Back Up

Page 29: Composite Cores: Pushing Heterogeneity into a Core

The DVFS Question

• Lower voltage is useful when:
– L2 Miss (stalled on commit)
• Little uArch is useful when:
– Stalled on L2 Miss (stalled at issue)
– Frequent branch mispredicts (shorter pipeline)
– Dependent Computation

http://www.arm.com/files/downloads/big_LITTLE_Final_Final.pdf

Page 30: Composite Cores: Pushing Heterogeneity into a Core


Sharing Overheads

[Figure: average power relative to the Big Core (0% to 110%) for the Big uEngine, Little Core and Little uEngine across astar, bzip2, gcc, gobmk, h264ref, hmmer, mcf, omnetpp, sjeng and the average]

Page 31: Composite Cores: Pushing Heterogeneity into a Core


Performance

5% performance loss

[Figure: performance relative to Big (90% to 103%) under the Oracle, Perfect Past and Model techniques for Astar, Bzip2, Gcc, GoBmk, H264ref, Hmmer, Mcf, OmnetPP, Sjeng and the Average]

Page 32: Composite Cores: Pushing Heterogeneity into a Core


Model Accuracy

[Figure, two panels (Little -> Big and Big -> Little): percent of quantums vs. percent deviation from actual (-100% to 100%); legend: Model, Average Performance]

Page 33: Composite Cores: Pushing Heterogeneity into a Core


Regression Coefficients

[Figure: relative coefficient magnitude (0% to 100%) for the Little -> Big and Big -> Little models; coefficients: L2 Miss, Branch Mispredicts, ILP, L2 Hit, MLP, Active uEngine Cycles, Constant]