the heterogeneous block architecturethe heterogeneous block architecture a flexible substrate for...

59
1 Carnegie Mellon University 2 Intel Corporation Chris Fallin 1 Chris Wilkerson 2 Onur Mutlu 1 The Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores

Upload: others

Post on 02-Jun-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

1 Carnegie Mellon University 2 Intel Corporation

Chris Fallin1 Chris Wilkerson2 Onur Mutlu1

The Heterogeneous Block Architecture

A Flexible Substrate for Building Energy-Efficient High-Performance Cores

Page 2: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

Executive Summary n  Problem: General purpose core design is a compromise between

performance and energy-efficiency n  Our Goal: Design a core that achieves high performance and high

energy efficiency at the same time n  Two New Observations: 1) Applications exhibit fine-grained

heterogeneity in regions of code, 2) A core can exploit this heterogeneity by grouping tens of instructions into atomic blocks and executing each block in the best-fit backend

n  Heterogeneous Block Architecture (HBA): 1) Forms atomic blocks of code, 2) Dynamically determines the best-fit (most efficient) backend for each block, 3) Specializes the block for that backend

n  Initial HBA Implementation: Chooses from Out-of-order, VLIW or In-order backends, with simple schedule stability and stall heuristics

n  Results: HBA provides higher efficiency than four previous designs at negligible performance loss; HBA enables new tradeoff points

2

Page 3: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

Backend  1  

Backend  2  

Backend  3  

Shared

 Fronten

d  

“Spe

cializa

7on”  

Picture of HBA

3

Out-­‐of-­‐order  

VLIW/IO  

Shared

 Fronten

d  

“Spe

cializa

7on”  

Out-­‐of-­‐order  

VLIW/IO  

Page 4: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

Talk Agenda n  Background and Motivation n  Two New Observations n  The Heterogeneous Block Architecture (HBA) n  An Initial HBA Design n  Experimental Evaluation n  Conclusions

4

Page 5: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

Competing Goals in Core Design n  High performance and high energy-efficiency

q  Difficult to achieve both at the same time

n  High performance: Sophisticated features to extract it q  Out-of-order execution (OoO), complex branch prediction,

wide instruction issue, … n  High energy efficiency: Use of only features that provide

the highest efficiency for each workload q  Adaptation of resources to different workload requirements

n  Today’s high-performance designs: Features may not yield high performance, but every piece of code pays for their energy penalty

5

Page 6: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

Principle: Exploiting Heterogeneity n  Past observation 1: Workloads have different characteristics

at coarse granularity (thousands to millions of instructions) q  Each workload or phase may require different features

n  Past observation 2: This heterogeneity can be exploited by switching “modes” of operation q  Separate out-of-order and in-order cores q  Separate out-of-order and in-order pipelines with shared

frontend

n  Past coarse-grained heterogeneous designs q  provide higher energy efficiency q  at slightly lower performance vs. a homogeneous OoO core

6

Page 7: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

Talk Agenda n  Background and Motivation n  Two New Observations n  The Heterogeneous Block Architecture (HBA) n  An Initial HBA Design n  Experimental Evaluation n  Conclusions

7

Page 8: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

Two Key Observations n  Fine-Grained Heterogeneity of Application Behavior

q  Small chunks (10s or 100s of instructions) of code have different execution characteristics n  Some instruction chunks always have the same instruction

schedule in an out-of-order processor n  Nearby chunks can have different schedules

n  Atomic Block Execution with Multiple Execution Backends

q  A core can exploit fine-grained heterogeneity with n  Having multiple execution backends n  Dividing the program into atomic blocks n  Executing each block in the best-fit backend

8

Page 9: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

Fine-Grained Heterogeneity n  Small chunks of code have different execution characteristics

9

Coarse-GrainedHeterogeneity

phase 1:regular�oating-point

{high ILP

stall

low ILP

phase 2:pointer-chasing

Fine-GrainedHeterogeneity

}

(1K-1M insns) (10-100 insns)

fadd r1,r2,r3fadd r4,r5,r6add r13,r14,8fmul r0,r1,r4......ld r1,(r13) ^-MISSES LLCadd r13,r14,8fadd r2,r1,r7fadd r4,r5,r6......fadd r4,r1,r2fadd r4,r4,r3fadd r4,r4,r7...

time (instructions)

}high ILP,stableinstructionschedule

} high ILP,varyinginstructionschedule

{} low ILP,

stableinstructionschedule

phase 3:...

Good fit for Wide VLIW

Good fit for Wide OoO

Good fit for Narrow In-order

Page 10: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

Instruction Schedule Stability Varies at Fine Grain

n  Observation 1: Some regions of code are scheduled in the same instruction scheduling order in an OoO engine across different dynamic instances

n  Observation 2: Stability of instruction schedules of (nearby) code regions can be very different

n  An experiment: q  Same-schedule chunk: A chunk of code that has the same

schedule it had in its previous dynamic instance q  Examined all chunks of <=16 instructions in 248 workloads q  Computed the fraction of “same schedule chunks” in each

workload

10

Page 11: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

Instruction Schedule Stability Varies at Fine Grain

11

0

0.2

0.4

0.6

0.8

1

0 50 100 150 200 250

Fra

ctio

n o

fD

yn

amic

Co

de

Ch

un

ks

Benchmark (Sorted by Y-axis Value)

different-schedule chunks

same-schedule chunks

1.  Many  chunks  have  the  same  schedule  in  their  previous  instances  à  Can  reuse  the  previous  instruc8on  scheduling  order  the  next  8me  2.  Stability  of  instruc8on  schedule  of  chunks  can  be  very  different  à  Need  a  core  that  can  dynamically  schedule  insts.  in  some  chunks  

3.  Temporally-­‐adjacent  chunks  can  have  different  schedule  stability  à  Need  a  core  that  can  switch  quickly  between  mul8ple  schedulers  

Page 12: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

Two Key Observations n  Fine-Grained Heterogeneity of Application Behavior

q  Small chunks (10s or 100s of instructions) of code have different execution characteristics n  Some instruction chunks always have the same schedule n  Nearby chunks can have different schedules

n  Atomic Block Execution with Multiple Execution Backends

q  A core can exploit fine-grained heterogeneity with n  Having multiple execution backends n  Dividing the program into atomic blocks n  Executing each block in the best-fit backend

12

Page 13: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

Atomic Block Execution w/ Multiple Backends

n  Fine grained heterogeneity can be exploited by a core that:

q  Has multiple (specialized) execution backends

q  Divides the program into atomic (all-or-none) blocks

q  Executes each atomic block in the best-fit backend

13

Atomicity  enables  specializa8on  of  a  block  for  a  backend  (can  freely  reorder/rewrite/eliminate  instruc8ons  in  an  atomic  block)  

Page 14: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

Talk Agenda n  Background and Motivation n  Two New Observations n  The Heterogeneous Block Architecture (HBA) n  An Initial HBA Design n  Experimental Evaluation n  Conclusions

14

Page 15: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

HBA: Principles n  Multiple different execution backends

q  Customized for different block execution requirements: OoO, VLIW, in-order, SIMD, …

q  Can be simultaneously active, executing different blocks q  Enables efficient exploitation of fine-grained heterogeneity

n  Block atomicity q  Either the entire block executes or none of it q  Enables specialization (reordering and rewriting) of

instructions freely within the block to fit a particular backend

n  Exploit stable block characteristics (instruction schedules) q  E.g., reuse schedule learned by the OoO backend in VLIW/IO q  Enables high efficiency by adapting to stable behavior

15

Page 16: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

HBA Operation at High Level

16

Backend  1  

Backend  2  

Backend  3  Shared

 Fronten

d  

“Spe

cializa

7on”  

Applica7on  

Block  1  

Block  2  

Block  3   Inst

ruct

ion

sequ

ence

Page 17: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

HBA Design: Three Components (I) n  Block formation and fetch n  Block sequencing and value communication n  Block execution

17

Page 18: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

HBA Design: Three Components (II)

18

Branch  predictor  

Block  Cache  

Block  Forma7on  

Block  Liveout  Rename  

Global  RF  

Subscrip7o

n  Matrix  

Sequ

encer  

Block  Execu7on  Unit  

Block  Execu7on  Unit  

Block  Execu7on  Unit  

Block  Execu7on  Unit  

Block Core Block Frontend

Block  Execu7on  Unit  

Block  Execu7on  Unit  

...

Block  ROB  &  Atomic  Re7re  

Shared  LSQ  L1D Cache

1. Frontend forms blocks dynamically 2. Blocks are dispatched and communicate via global registers

3. Heterogeneous block execution units execute different types of blocks in a specialized way

Page 19: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

Talk Agenda n  Background and Motivation n  Two New Observations n  The Heterogeneous Block Architecture (HBA) n  An Initial HBA Design n  Experimental Evaluation n  Conclusions

19

Page 20: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

An Example HBA Design: OoO/VLIW n  Block formation and fetch n  Block sequencing and value communication n  Block execution

n  Two types of backends q  OoO scheduling q  VLIW/in-order scheduling

20

Out-­‐of-­‐order  

VLIW/IO  

Shared

 Fronten

d  

“Spe

cializa

7on”  

Out-­‐of-­‐order  

VLIW/IO  

Page 21: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

Block Formation and Fetch n  Atomic blocks are microarchitectural and dynamically formed n  Block info cache (BIC) stores metadata about a formed block

q  Indexed with PC and branch path of a block (ala Trace Caches) q  Metadata info: backend type, instruction schedule, …

n  Frontend fetches instructions/metadata from I-cache/BIC n  If miss in BIC, instructions are sent to OoO backend n  Otherwise, buffer instructions until the entire block is fetched

n  The first time a block is executed on the OoO backend, its’ OoO version is formed (& OoO schedule/names stored in BIC) q  Max 16 instructions, ends in a hard-to-predict or indirect branch

21

Page 22: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

An Example HBA Design: OoO/VLIW n  Block formation and fetch n  Block sequencing and value communication n  Block execution

n  Two types of backends q  OoO scheduling q  VLIW/in-order scheduling

22

Out-­‐of-­‐order  

VLIW/IO  

Shared

 Fronten

d  

“Spe

cializa

7on”  

Out-­‐of-­‐order  

VLIW/IO  

Page 23: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

Block Sequencing and Communication n  Block Sequencing

q  Block based reorder buffer for in-order block sequencing q  All subsequent blocks squashed on a branch misprediction q  Current block squashed on an exception or intra-block

misprediction q  Single-instruction sequencing from non-optimized code upon

an exception in a block to reach the exception point

n  Value Communication q  Across blocks: via the Global Register File q  Within block: via the Local Register File q  Only liveins and liveouts to a block need to be renamed and

allocated global registers [Sprangle & Patt, MICRO 1994]

n  Reduces energy consumption of register file and bypass units

23

Page 24: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

An Example HBA Design: OoO/VLIW n  Block formation and fetch n  Block sequencing and value communication n  Block execution

n  Two types of backends q  OoO scheduling q  VLIW/in-order scheduling

24

Out-­‐of-­‐order  

VLIW/IO  

Shared

 Fronten

d  

“Spe

cializa

7on”  

Out-­‐of-­‐order  

VLIW/IO  

Page 25: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

Block Execution n  Each backend contains

q  Execution units, scheduler, local register file

n  Each backend receives q  A block specialized for it (at block dispatch time) q  Liveins for the executing block (as they become available)

n  A backend executes the block q  As specified by its logic and any information from the BIC

n  Each backend produces q  Liveouts to be written to the Global RegFile (as available) q  Branch misprediction, exception results, block completion

signal to be sent to the Block Sequencer

n  Memory operations are handled via a traditional LSQ 25

Page 26: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

An Example HBA Design: OoO/VLIW n  Block formation and fetch n  Block sequencing and value communication n  Block execution

n  Two types of backends q  OoO scheduling q  VLIW/in-order scheduling

26

Out-­‐of-­‐order  

VLIW/IO  

Shared

 Fronten

d  

“Spe

cializa

7on”  

Out-­‐of-­‐order  

VLIW/IO  

Page 27: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

OoO and VLIW Backends

27

n  Two customized backend types (physically, they are unified)

Out-of-order VLIW/In-order

•  Dynamic scheduling with matrix scheduler •  Renaming performed before the backend (block specialization logic) •  No renaming or matrix computation logic (done for previous instance of block) •  No need to maintain per-instruction order

•  Static scheduling •  Stall-on-use policy •  VLIW blocks specialized by OoO backends •  No per-instruction order

Page 28: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

OoO Backend

28

Block Dispatch

Liveins (reg, val)

Liveouts (reg, val)

Resolved Branches

Dispatch Queue

Scheduler (no CAM)

Block Completion

Generic Interface Dynamically-scheduled implementation

Selec7on

 logic  

Local  RF  

Execu7

on  Units  

Consum

er  wakeu

p  vectors   srcs

ready? srcs ready? srcs ready? srcs ready? srcs ready? srcs ready?

Liveout?  

Branch?  

Writeback bus Counter  

Livein bus

To shared EUs

Page 29: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

VLIW/In-Order Backend

29

Block Dispatch

Liveins (reg, val)

Liveouts (reg, val)

Resolved Branches

Dispatch Queue

Issue Logic

Block Completion

Generic Interface Statically-scheduled implementation

Ready-­‐bit  S

corebo

ard  

Local  RF  

Execu7

on  Units  

Liveout?  

Branch?  

Writeback bus Counter  

Livein bus

To shared EUs

Page 30: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

Backend Switching and Specialization

n  OoO to VLIW: If dynamic schedule is the same for N previous executions of the block on the OoO backend: q  Record the schedule in the Block Info Cache q  Mark the block to be executed on the VLIW backend

n  VLIW to OoO: If there are too many false stalls in the block on the VLIW backend q  Mark the block to be executed on the OoO backend

30

Page 31: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

Talk Agenda n  Background and Motivation n  Two New Observations n  The Heterogeneous Block Architecture (HBA) n  An Initial HBA Design n  Experimental Evaluation n  Conclusions

31

Page 32: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

Experimental Methodology n  Models

q  Cycle-accurate execution-driven x86-64 simulator – ISA remains unchanged q  Energy model with modified McPAT [Li+ MICRO’09]

– various leakage parameters investigated

n  Workloads q  184 different representative execution checkpoints q  SPEC CPU2006, Olden, MediaBench, Firefox, Ffmpeg, Adobe

Flash player, MySQL, lighttpd web server, LaTeX, Octave, an x86 simulator

32

Page 33: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

Simulation Setup

33

Page 34: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

Four Core Comparison Points n  Out-of-order core with large, monolithic backend

n  Clustered out-of-order core with split backend [Farkas+ MICRO’97]

q  Clusters of <scheduler, register file, execution units>

n  Coarse-grained heterogeneous design [Lukefahr+ MICRO’12]

q  Combines out-of-order and in-order cores q  Switches mode for coarse-grained intervals based on a

performance model (ideal in our results) q  Only one mode can be active: in-order or out-of-order

n  Coarse-grained heterogeneous design, with clustered OoO core

34

Page 35: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

Key Results: Power, EPI, Performance

n  HBA greatly reduces power and energy-per-instruction (>30%) n  HBA performance is within 1% of monolithic out-of-order core n  HBA’s performance higher than coarse-grained hetero. core n  HBA is the most power- and energy-efficient design among the

five different core configurations

35

Averaged across 184 workloads

Page 36: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

Energy Analysis (I)

n  HBA reduces energy/power by concurrently

q  Decoupling Execution Backends to Simplify the Core (Clustering)

q  Exploiting Block Atomicity to Simplify the Core (Atomicity)

q  Exploiting Heterogeneous Backends to Specialize (Heterogeneity)

n  What is the relative energy benefit of each when applied successively?

36

Page 37: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

Energy Analysis (II)

37

0

0.5

1

1.5

2

OoO

Clu

stered

Coarse

Coarse,

Clu

stered

HB

A/O

oO

HB

A

Ener

gy/I

nst

ruct

ion (

nJ)

FrontendRATROB

RS (Scheduler)RF

Exec (ALUs)Bypass Buses

LSQL1L2

+Clustering: 8%

+Atomicity: 17% +Heterogeneity: 6%

HBA  effec8vely  takes  advantage  of  clustering,  atomic  blocks,    and  heterogeneous  backends  to  reduce  energy  

Averaged across 184 workloads

Page 38: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

Comparison to Best Previous Design n  Coarse-grained heterogeneity + clustering the OoO core

38

0

0.5

1

1.5

2

OoO

Clu

stered

Coarse

Coarse,

Clu

stered

HB

A/O

oO

HB

A

Ener

gy/I

nst

ruct

ion (

nJ)

FrontendRATROB

RS (Scheduler)RF

Exec (ALUs)Bypass Buses

LSQL1L2

+Coarse-grained hetero: 9% +Clustering: 9%

HBA: 15% lower EPI

HBA  provides  higher  efficiency  (+15%)  at  higher  performance  (+2%)  than  coarse-­‐grained  heterogeneous  core  with  clustering  

Averaged across 184 workloads

Page 39: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

Per-Workload Results

n  HBA reduces energy on almost all workloads n  HBA can reduce or improve performance (analysis in paper)

39

0.5

1

1.5

2

0 50 100 150Rel

ativ

e vs.

Bas

elin

e

Benchmark (Sorted by Rel. HBA Performance)

IPCEPI

S-curve of all 184 workloads

Page 40: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

Power-Performance Tradeoff Space

n  HBA enables new design points in the tradeoff space

40

0.2

0.4

0.6

0.8

1

1.2

1.4

0.4 0.5 0.6 0.7 0.8 0.9 1 1.1 1.2 1.3

Rel

. A

vg. C

ore

Pow

er (

vs.

4-w

ide

OoO

)

Rel. IPC (vs. 4-wide OoO)

HBA(16,OoO)HBA(16)

HBA(8)HBA(4)

HBA(16,2-wide)

1-wide in-order2-wide in-order

256-ROB, 4-wide OoO, clustered

Coarse-grained (256-ROB)

HBA(64)

1024-ROB, 4-wide OoO, clustered

4-wide in-order

128-ROB, 4-wide OoO

256-ROB, 4-wide OoO

Coarse-grained (128-ROB)

More Efficient

In-OrderOut-of-OrderCoarse-grained HeteroHBA

Relative Performance (vs. Baseline OoO)

GOAL

Averaged across 184 workloads

Page 41: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

Cost of Heterogeneity n  Higher core area

q  Due to more backends q  Can be optimized with good design (unified backends) q  Core area cost increasingly small in an SoC

n  only 17% in Apple’s A7

q  Analyzed in the paper but our design is not optimized for area

n  Increased leakage power q  HBA benefits reduce with higher transistor leakage q  Analyzed in the paper

41

Page 42: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

More Results and Analyses in the Paper n  Performance bottlenecks n  Fraction of OoO vs. VLIW blocks n  Energy optimizations in VLIW mode n  Energy optimizations in general n  Area and leakage

n  Other and detailed results available in our technical report

Chris Fallin, Chris Wilkerson, and Onur Mutlu, "The Heterogeneous Block Architecture"

SAFARI Technical Report, TR-SAFARI-2014-001, Carnegie Mellon University, March 2014.

42

Page 43: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

Talk Agenda n  Background and Motivation n  Two New Observations n  The Heterogeneous Block Architecture (HBA) n  An Initial HBA Design n  Experimental Evaluation n  Conclusions

43

Page 44: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

Conclusions n  HBA is the first heterogeneous core substrate that enables

q  concurrent execution of fine-grained code blocks q  on the most-efficient backend for each block

n  Three characteristics of HBA q  Atomic block execution to exploit fine-grained heterogeneity q  Exploits stable instruction schedules to adapt code to backends q  Uses clustering, atomicity and heterogeneity to reduce energy

n  HBA greatly improves energy efficiency over four core designs

n  A flexible execution substrate for exploiting fine-grained heterogeneity in core design

44

Page 45: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

Building on This Work … n  HBA is a substrate for efficient core design

n  One can design different cores using this substrate: q  More backends of different types (CPU, SIMD, reconfigurable

logic, …) q  Better switching heuristics between backends q  More optimization of each backend q  Dynamic code optimizations specific to each backend q  …

45

Page 46: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

1 Carnegie Mellon University 2 Intel Corporation

Chris Fallin1 Chris Wilkerson2 Onur Mutlu1

The Heterogeneous Block Architecture

A Flexible Substrate for Building Energy-Efficient High-Performance Cores

Page 47: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

Why Atomic Blocks? n  Each block can be pre-specialized to its backend

(pre-rename for OoO, pre-schedule VLIW in hardware, reorder instructions, eliminate instructions)

n  Each backend can operate independently n  User-visible ISA can remain unchanged despite multiple

execution backends

47

Page 48: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

Nearby Chunks Have Different Behavior

48

0

0.2

0.4

0.6

1 8 16 24 32

Fre

qu

ency

Chunks Retired Before Chunk-Type Switch

Page 49: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

In-Order vs. Superscalar vs. OoO

49

0

0.2

0.4

0.6

0.8

1

In-O

rder

Superscalar

Out-o

f-Ord

er

Avg. P

erfo

rman

ce (

IPC

)

In-O

rder

Superscalar

Out-o

f-Ord

er

0

2

4

6

8

Avg. C

ore

Pow

er (

W)

Page 50: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

Baseline EPI Breakdown

50

0

1

2

2.5

1-w

ide

In-o

rder

4-w

ide

In-o

rder

4-w

ide

OoO

En

erg

y/I

nst

ruct

ion

(n

J)

FrontendRATROBRS (Scheduler)RFExec (ALUs)Bypass BusesLSQL1L2

Page 51: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

Retired Block Types (I)

51

0

0.2

0.4

0.6

0.8

1

Fra

ctio

n o

f R

etir

ed B

lock

s

Benchmark (Sorted by OoO Fraction)

OoO blocks

VLIW blocks (wide or narrow)

Page 52: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

Retired Block Types (II)

52

0

0.02

0.04

0.06

5 10 15 20 25 30

Spec

trum

Frequency ( / 64 block retires)

Frequency Spectrum of Retire-Order Block Types

Page 53: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

Retired Block Types (III)

53

0 0.1 0.2 0.3 0.4 0.5

OoO 50% OoO/VLIW VLIW

Fre

quen

cy

Average Block Type

Average Block Type: Histogram per Block

Page 54: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

BIC Size Sensitivity

54

0.8

0.9

1

64

128

256

512

1K

2K

4K

perfect

Rel

ativ

e IP

C

Block Info Cache Size (blocks of 16 uops)

default

Page 55: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

Livein-Liveout Statistics

55

0

0.1

0.2

0.3

0 -

- 1

2 -

- 3

4 -

- 5

6 -

- 7

8 -

- 9

10 -

- 11

12 -

- 13

14 -

- 15

16+

Fre

quen

cy

Liveins per Block

0

0.1

0.2

0.3

0 -

- 1

2 -

- 3

4 -

- 5

6 -

- 7

8 -

- 9

10

--

11

12

--

13

14

--

15

16

+

Fre

qu

ency

Liveouts per Block

Page 56: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

Auxiliary Data: Benefits of HBA

56

0

0.2

0.4

0.6

0.8

1

RAT ROB Global RFNorm

ali

zed A

ccesses Baseline

HBA

Page 57: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

Auxiliary Data: Benefits of HBA

57

0

0.2

0.4

0.6

0.8

1

Reads Writes

Norm

ali

zed R

AT

Accesses Baseline

HBA

0

0.2

0.4

0.6

0.8

1

Reads Writes

Norm

ali

zed R

OB

Accesses Baseline

HBA

Reads Writes

0

0.2

0.4

0.6

0.8

1

Norm

ali

zed R

F A

ccessesBaseline

HBA

Page 58: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

Some Specific Workloads

58

0.5

1

ffmpeg

473.astar

429.mcf

latex401.bzip2

graphchi

403.gcc

437.leslie3d

octave

456.hmm

er

450.soplex

433.milc

Rel

. vs.

Bas

elin

e

IPCEPI

Page 59: The Heterogeneous Block ArchitectureThe Heterogeneous Block Architecture A Flexible Substrate for Building Energy-Efficient High-Performance Cores . Executive Summary ! ... Talk Agenda

Related Works n  Forming and reusing instruction schedules [DIF, ISCA’97]

[Transmeta’00][Banerjia, IEEE TOC’98][Palomar, ICCD’09]

n  Coarse-grained heterogeneous cores [Lukefahr, MICRO’12]

n  Atomic blocks and block-structured ISA [Melvin&Patt, MICRO’88, IJPP’95]

n  Instruction prescheduling [Michaud&Seznec, HPCA’01]

n  Yoga [Villavieja+, HPS Tech Report’14]

59