


Lecture 20: Core Design

• Today: Innovations for ILP, TLP, power

• Sign up for class presentations


Clustering

[Figure: Reg-rename & Instr steer feeds two clusters, each with its own IQ, Regfile, and two functional units (F F)]

Original code:
r1 ← r2 + r3
r4 ← r1 + r2
r5 ← r6 + r7
r8 ← r1 + r5

Renamed code:
p21 ← p2 + p3
p22 ← p21 + p2
p42 ← p21
p41 ← p56 + p57
p43 ← p42 + p41

• 40 regs in each cluster

• r1 is mapped to p21 and p42 – will influence steering and instr commit – on average, only 8 replicated regs
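The renaming example above can be reproduced with a small sketch. The steering heuristic (prefer the cluster that already holds the most source operands, break ties toward the less loaded cluster), the free-list starting points, and the `ClusteredRenamer` name are assumptions made for illustration; the slide does not specify them.

```python
from dataclasses import dataclass, field


@dataclass
class ClusteredRenamer:
    # mapping[arch_reg] = {cluster: physical_reg}; a register replicated in
    # both clusters has two entries.
    mapping: dict
    # Hypothetical free-list heads, chosen so the output matches the slide.
    next_free: list = field(default_factory=lambda: [21, 41])
    # Number of instructions steered to each cluster so far.
    steered: list = field(default_factory=lambda: [0, 0])

    def _alloc(self, cluster):
        preg = self.next_free[cluster]
        self.next_free[cluster] += 1
        return preg

    def rename(self, dest, srcs):
        # Assumed heuristic: most local sources wins, ties go to the
        # cluster that has received fewer instructions.
        votes = [sum(c in self.mapping[s] for s in srcs) for c in (0, 1)]
        cluster = max((0, 1), key=lambda c: (votes[c], -self.steered[c]))
        emitted, operands = [], []
        for s in srcs:
            if cluster not in self.mapping[s]:
                # Source exists only in the other cluster: allocate a local
                # replica and emit an inter-cluster copy.
                remote = next(iter(self.mapping[s].values()))
                local = self._alloc(cluster)
                self.mapping[s][cluster] = local
                emitted.append(f"p{local} <- p{remote}")
            operands.append(self.mapping[s][cluster])
        preg = self._alloc(cluster)
        self.mapping[dest] = {cluster: preg}
        self.steered[cluster] += 1
        emitted.append(f"p{preg} <- " + " + ".join(f"p{o}" for o in operands))
        return emitted


renamer = ClusteredRenamer({"r2": {0: 2}, "r3": {0: 3}, "r6": {1: 56}, "r7": {1: 57}})
program = [("r1", ["r2", "r3"]), ("r4", ["r1", "r2"]),
           ("r5", ["r6", "r7"]), ("r8", ["r1", "r5"])]
renamed = [line for dest, srcs in program for line in renamer.rename(dest, srcs)]
```

After the run, `renamer.mapping["r1"]` holds one physical register per cluster, which is the replication the slide refers to.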


Recent Trends

• Not much change in structure capacities

• Not much change in cycle time

• Pipeline depths have become shorter (circuit delays have reduced); this is good for energy efficiency

• Optimal performance is observed at about 50 pipeline stages (we are currently at ~20 stages for energy reasons)

• Deep pipelines improve parallelism (helps if there’s ILP), but they also increase the gap between dependent instructions (hurts when there is little ILP)


ILP Limits (Wall, 1993)


Techniques for High ILP

• Better branch prediction and fetch (trace cache) → cascading branch predictors?

• More physical registers, ROB, issue queue, LSQ → two-level regfile/IQ?

• Higher issue width → clustering?

• Lower average cache hierarchy access time

• Memory dependence prediction

• Latency tolerance techniques: ILP, MLP, prefetch, runahead, multi-threading


2Bc-gskew Branch Predictor

[Figure: predictor tables BIM, G0, G1, and Meta, indexed by Address or Address+History, with a Vote stage and the final Pred]

44 KB; 2-cycle access; designed for the Alpha 21464


Rules

• On a correct prediction
  – if all agree, no update
  – if they disagree, strengthen correct preds and chooser

• On a misprediction
  – update chooser and recompute the prediction
  – on a correct prediction, strengthen correct preds
  – on a misprediction, update all preds
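The update rules above can be sketched as a toy predictor. This is a minimal illustration, not the Alpha design: the table size, the XOR-based index hashes, and the `TwoBcGskew` class name are assumptions, and the sketch takes the usual description of 2Bc-gskew in which BIM, G0, and G1 vote and Meta chooses between BIM and the vote.

```python
class TwoBcGskew:
    """Toy 2Bc-gskew: four tables of 2-bit saturating counters (0..3)."""

    def __init__(self, bits=10):
        self.mask = (1 << bits) - 1
        size = 1 << bits
        # A counter >= 2 means "taken" (for META: "use the vote").
        self.bim, self.g0 = [1] * size, [1] * size
        self.g1, self.meta = [1] * size, [1] * size
        self.history = 0

    def _indices(self, pc):
        h = self.history
        # Assumed hashes; the real design uses carefully skewed functions.
        return (pc & self.mask,
                (pc ^ h) & self.mask,
                (pc ^ (h << 1) ^ (h >> 3)) & self.mask,
                (pc ^ (h >> 1)) & self.mask)

    def _lookup(self, idx):
        ib, i0, i1, im = idx
        b, a0, a1 = self.bim[ib] >= 2, self.g0[i0] >= 2, self.g1[i1] >= 2
        vote = (b + a0 + a1) >= 2
        pred = vote if self.meta[im] >= 2 else b
        return b, a0, a1, vote, pred

    def predict(self, pc):
        return self._lookup(self._indices(pc))[4]

    @staticmethod
    def _bump(table, i, up):
        table[i] = min(3, table[i] + 1) if up else max(0, table[i] - 1)

    def update(self, pc, taken):
        idx = self._indices(pc)
        ib, i0, i1, im = idx
        b, a0, a1, vote, pred = self._lookup(idx)
        tables = ((self.bim, ib, b), (self.g0, i0, a0), (self.g1, i1, a1))

        def strengthen_correct():
            for table, i, p in tables:
                if p == taken:
                    self._bump(table, i, taken)

        def train_chooser():
            # The chooser only learns when BIM and the vote disagree.
            if b != vote:
                self._bump(self.meta, im, vote == taken)

        if pred == taken:
            if not (b == a0 == a1):      # all agree: no update
                strengthen_correct()
                train_chooser()
        else:
            train_chooser()
            new_pred = self._lookup(idx)[4]   # recompute with new chooser
            if new_pred == taken:
                strengthen_correct()
            else:
                for table, i, _ in tables:    # still wrong: update all
                    self._bump(table, i, taken)
        self.history = ((self.history << 1) | int(taken)) & self.mask


predictor = TwoBcGskew()
for _ in range(20):
    predictor.update(0x40, True)
```

The partial update keeps correct counters from being disturbed when the overall prediction was right, which is the point of the "if all agree, no update" rule.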


Impact of Mem-Dep Prediction

• In the perfect model, loads only wait for conflicting stores; in the naïve model, loads issue speculatively and must be squashed if a dependence is later discovered

From Chrysos and Emer, ISCA’98


Runahead (Mutlu et al., HPCA’03)

[Figure: pipeline with Trace Cache, Current Rename, Issue Q, Regfile (128), Checkpointed Regfile (32), Retired Rename, ROB, FUs, L1 D, and Runahead Cache]

When the oldest instruction is a cache miss, behave like it causes a context-switch:

• checkpoint the committed registers, rename table, return address stack, and branch history register

• assume a bogus value and start a new thread

• this thread cannot modify program state, but can prefetch


Memory Bottlenecks

• 128-entry window, real L2: 0.77 IPC

• 128-entry window, perfect L2: 1.69 IPC

• 2048-entry window, real L2: 1.15 IPC

• 2048-entry window, perfect L2: 2.02 IPC

• 128-entry window, real L2, runahead: 0.94 IPC
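To make the comparison concrete, the IPC figures above can be normalized to the 128-entry, real-L2 baseline. The snippet below is only arithmetic on the numbers from the slide; the variable names are illustrative.

```python
# IPC values copied from the slide.
ipc = {
    "128-entry, real L2": 0.77,
    "128-entry, perfect L2": 1.69,
    "2048-entry, real L2": 1.15,
    "2048-entry, perfect L2": 2.02,
    "128-entry, real L2, runahead": 0.94,
}

baseline = ipc["128-entry, real L2"]

# Speedup of every configuration over the baseline.
speedup = {config: value / baseline for config, value in ipc.items()}

# Fraction of the benefit of a 16x larger window that runahead recovers
# while keeping the 128-entry window.
gap_closed = (ipc["128-entry, real L2, runahead"] - baseline) / (
    ipc["2048-entry, real L2"] - baseline
)

for config, s in speedup.items():
    print(f"{config}: {s:.2f}x")
print(f"runahead closes {gap_closed:.0%} of the gap to the 2048-entry window")
```

Runahead gives roughly a 1.22x speedup, a bit under half of what the 2048-entry window achieves with the same L2.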


SMT Pipeline Structure

[Figure: four Front Ends (I-Cache, Bpred, Rename, ROB) feed one Execution Engine (Regs, IQ, FUs, DCache); the front-end is shown as private or shared, the exec engine as shared]

SMT maximizes utilization of the shared execution engine


SMT Fetch Policy

• Fetch policy has a major impact on throughput: depends on cache/bpred miss rates, dependences, etc.

• Commonly used policy: ICOUNT: every thread has an equal share of resources
  – faster threads will fetch more often: improves throughput
  – slow threads with dependences will not hoard resources
  – low probability of fetching wrong-path instructions
  – higher fairness
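A minimal sketch of ICOUNT, assuming the usual formulation: each cycle, fetch for the thread with the fewest instructions waiting in the front-end and issue queue. The function names, the fetch width, and the per-thread issue rates are hypothetical values chosen for the illustration.

```python
def icount_pick(in_flight):
    """Return the thread with the fewest in-flight (not yet issued) instrs.

    Ties go to the lower thread id (an arbitrary assumption).
    """
    return min(in_flight, key=lambda t: (in_flight[t], t))


def simulate(issue_rate, cycles=100, fetch_width=4):
    """Toy model: one thread fetches per cycle, then every thread issues
    up to issue_rate[t] of its waiting instructions."""
    in_flight = {t: 0 for t in issue_rate}
    fetched = {t: 0 for t in issue_rate}
    for _ in range(cycles):
        t = icount_pick(in_flight)
        in_flight[t] += fetch_width
        fetched[t] += fetch_width
        for u, rate in issue_rate.items():
            in_flight[u] -= min(in_flight[u], rate)
    return fetched


# Thread 0 drains its instructions three times faster than thread 1.
fetched = simulate({0: 3, 1: 1})
```

In this toy run the faster thread ends up with most of the fetch slots, while the slow thread's occupancy stays bounded because it only fetches when its count is the lowest.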


Area Effect of Multi-Threading

• The curve is linear for a while

• Multi-threading adds a 5-8% area overhead per thread (primary caches are included in the baseline)

From Davis et al., PACT 2005


Single Core IPC

4 bars correspond to 4 different L2 sizes

IPC range for different L1 sizes


Maximal Aggregate IPCs


Power/Energy Basics

• Energy = Power x time

• Power = Dynamic power + Leakage power

• Dynamic Power = α × C × V² × f
  – α: switching activity factor
  – C: capacitances being charged
  – V: voltage swing
  – f: processor frequency
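The three relations above can be evaluated for a single operating point. The numeric values below (activity factor, capacitance, voltage, frequency, leakage, run time) are made-up inputs for illustration, not measurements.

```python
def dynamic_power(alpha, capacitance_f, volts, freq_hz):
    """Dynamic power = activity factor x capacitance x V^2 x frequency."""
    return alpha * capacitance_f * volts ** 2 * freq_hz


def total_energy(dynamic_w, leakage_w, seconds):
    """Energy = (dynamic power + leakage power) x time."""
    return (dynamic_w + leakage_w) * seconds


# Hypothetical operating point: alpha = 0.2, C = 50 nF, V = 1.0 V, f = 2 GHz.
p_dyn = dynamic_power(alpha=0.2, capacitance_f=50e-9, volts=1.0, freq_hz=2e9)

# Hypothetical 5 W of leakage over a 2-second run.
energy_j = total_energy(p_dyn, leakage_w=5.0, seconds=2.0)
```

Because voltage enters as V², lowering V reduces dynamic power much faster than lowering α, C, or f by the same factor.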


Guidelines

• Dynamic frequency scaling (DFS) can impact power, but has little impact on energy

• Optimizing a single structure for power/energy is good for overall energy only if execution time is not increased

• A good metric for comparison: ED² (because DVFS is an alternative way to play with the E-D trade-off)

• Clock gating is commonly used to reduce dynamic energy; DFS is very cheap (few cycles); DVFS and power gating are more expensive (micro-seconds or tens of cycles, fewer margins, higher error rates)
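The first and third guidelines can be checked with a short calculation. This sketch assumes an idealized model: dynamic power only (leakage ignored), a fixed cycle count, and voltage that scales linearly with frequency under DVFS. The `run` helper and its default parameters are hypothetical.

```python
def run(cycles, volts, freq_hz, alpha=0.2, capacitance_f=50e-9):
    """Power, delay, energy, and E*D^2 for a fixed amount of work."""
    delay = cycles / freq_hz
    power = alpha * capacitance_f * volts ** 2 * freq_hz  # dynamic only
    energy = power * delay
    return {"power": power, "delay": delay,
            "energy": energy, "ed2": energy * delay ** 2}


base = run(cycles=1e9, volts=1.0, freq_hz=2e9)
dfs = run(cycles=1e9, volts=1.0, freq_hz=1e9)   # frequency halved only
dvfs = run(cycles=1e9, volts=0.5, freq_hz=1e9)  # voltage halved as well

for name, r in (("base", base), ("DFS", dfs), ("DVFS", dvfs)):
    print(f"{name:5s} P={r['power']:.2f} W  D={r['delay']:.2f} s  "
          f"E={r['energy']:.2f} J  ED^2={r['ed2']:.3f}")
```

Under these assumptions DFS halves power but leaves energy unchanged, since the run takes twice as long. DVFS cuts energy to a quarter at twice the delay, so E × D² is the same as the baseline. That invariance is why ED² is a fair way to compare designs that could otherwise be made to look better just by changing the DVFS setting.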


Criticality Metrics

• Criticality has many applications: performance and power; usually, more useful for power optimizations

• QOLD – instructions that are the oldest in the issueq are considered critical
  – can be extended to oldest-N
  – does not need a predictor
  – young instrs are possibly on mispredicted paths
  – young instruction latencies can be tolerated
  – older instrs are possibly holding up the window
  – older instructions have more dependents in the pipeline than younger instrs
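QOLD and its oldest-N extension need no predictor because the decision only depends on the current issue-queue contents. A minimal sketch, assuming each queued instruction carries a sequence number where a smaller number means older (the function name and representation are illustrative):

```python
def qold_critical(issue_queue, n=1):
    """Return the sequence numbers of the n oldest instructions waiting
    in the issue queue; these are the ones QOLD marks as critical."""
    return set(sorted(issue_queue)[:n])


# Sequence numbers of instructions currently waiting to issue.
waiting = [17, 5, 9, 12]
oldest_one = qold_critical(waiting)        # plain QOLD
oldest_two = qold_critical(waiting, n=2)   # oldest-N with N = 2
```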


Other Criticality Metrics

• QOLDDEP: Producing instructions for oldest in q

• ALOLD: Oldest instr in ROB

• FREED-N: Instr completion frees up at least N dependent instrs

• Wake-Up: Instr completion triggers a chain of wake-up operations

• Instruction types: cache misses, branch mispredictions, and instructions that feed them
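As an illustration of FREED-N, the sketch below counts how many waiting instructions become ready when a given instruction completes. It assumes "frees up" means the completing instruction was the last outstanding producer of the dependent; that reading, and the dictionary-based dependence graph, are assumptions rather than the definition used in the lecture.

```python
def freed_count(instr, producers, completed):
    """Count waiting instrs whose only outstanding producer is `instr`.

    producers: maps each waiting instruction to the set of instructions
               that produce its source operands.
    completed: set of instructions that have already produced results.
    """
    freed = 0
    for consumer, srcs in producers.items():
        if consumer in completed or instr not in srcs:
            continue
        if srcs - completed == {instr}:
            freed += 1
    return freed


def is_freed_n_critical(instr, producers, completed, n):
    return freed_count(instr, producers, completed) >= n


# Hypothetical dependence graph: i2, i3, i4 all consume i1's result.
deps = {"i2": {"i1"}, "i3": {"i0", "i1"}, "i4": {"i1"}, "i5": {"i9"}}
freed_after_i0 = freed_count("i1", deps, completed={"i0"})
freed_before_i0 = freed_count("i1", deps, completed=set())
```

With `i0` already complete, finishing `i1` wakes up three dependents; if `i0` is still outstanding, `i3` keeps waiting and only two are freed.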
