Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era
Andrew Hilton, University of Pennsylvania ([email protected])
Duke :: March 18, 2010
Multi-Core Architecture
Single-thread performance growth has diminished
• Clock frequency has hit an energy wall
• Instruction-level parallelism (ILP) has hit energy, memory, and idea walls
Future chips will be heterogeneous multi-cores
• Few high-performance out-of-order cores (Core i7) for serial code
• Many low-power in-order cores (Atom) for parallel code
[chip diagram: one Core i7 alongside eight Atom cores]
Multi-Core Performance
Obvious performance key: write more parallel software
Less obvious performance key: speed up existing cores
• Core i7? Keep the serial portion from becoming a bottleneck (Amdahl)
• Atoms? Parallelism is typically not elastic
Key constraint: energy
• Thermal limitations of the chip, cost of energy, cooling costs, …
“TurboBoost”
Existing technique: Dynamic Voltage and Frequency Scaling (DVFS)?
• Increase clock frequency (requires increasing voltage)
+ Simple
+ Applicable to both types of cores
- Not very energy-efficient (energy ≈ frequency²)
- Doesn't help "memory bound" programs (performance gain < frequency gain)
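The slide's two approximations (energy per unit of work ≈ f², and only the core/L1/L2 fraction of runtime speeds up with frequency) can be put into a small sketch; the function name and the core/memory time split are mine, chosen purely for illustration:

```python
def turbo_ed2(base_ghz, boost_ghz, core_frac):
    """Relative speedup and ED^2 of a frequency boost, under the
    slide's approximations: energy-per-unit-work ~ f^2, and only the
    core/L1/L2 fraction of runtime (core_frac) scales with frequency."""
    boost = boost_ghz / base_ghz
    delay = core_frac / boost + (1.0 - core_frac)        # relative runtime
    energy = core_frac * boost ** 2 + (1.0 - core_frac)  # relative energy
    return 1.0 / delay, energy * delay ** 2

# Compute-bound code: full 25% speedup at constant ED^2.
# A half memory-bound program: smaller speedup, worse ED^2.
ideal = turbo_ed2(3.2, 4.0, 1.0)
membound = turbo_ed2(3.2, 4.0, 0.5)
```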
Effectiveness of “TurboBoost”
[charts: speedup (higher is better) and ED² (lower is better)]
Example: TurboBoost 3.2 GHz → 4.0 GHz (25%)
• Ideal conditions: 25% speedup, constant Energy × Delay²
• Memory bound programs: far from ideal
“Memory Bound”
Main memory is slow relative to the core (~250 cycles)
Cache hierarchy makes most accesses fast
• "Memory bound" = many L3 misses
• … or in some cases many L2 misses
• … or, for in-order cores, many L1 misses
• Clock frequency ("TurboBoost") accelerates only the core/L1/L2
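The point that most accesses are fast yet a few L3 misses dominate can be seen with a standard average-memory-access-time calculation; the L2/L3/memory latencies below come from the deck's hierarchy figure, while the 3-cycle L1 latency and all the miss rates are made-up illustration values:

```python
def amat(l1_lat, l1_miss, l2_lat, l2_miss, l3_lat, l3_miss, mem_lat):
    """Average memory access time in cycles: each level's latency is
    paid only by accesses that miss in the level above it."""
    return l1_lat + l1_miss * (l2_lat + l2_miss * (l3_lat + l3_miss * mem_lat))

# With every access hitting the L1, AMAT is just the L1 latency;
# a 10% L1 miss rate with misses half-filtered at each level already
# quadruples the average, driven almost entirely by the 250-cycle memory.
fast = amat(3, 0.0, 10, 0.5, 40, 0.5, 250)
bound = amat(3, 0.1, 10, 0.5, 40, 0.5, 250)
```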
[diagram: memory hierarchy with per-core L1$ and L2$ (10 cycles), a shared L3$ (40 cycles), and main memory (250 cycles), shared by the Core i7 and the Atoms]
Goal: Help Memory Bound Programs
Wanted: a complementary technique to TurboBoost
Successful applicants should
• Help "memory bound" programs
• Be at least as energy efficient as TurboBoost (at worst, constant ED²)
• Work well with both out-of-order and in-order cores
Promising previous idea: latency tolerance
• Helps "memory bound" programs
My work: energy efficient latency tolerance for all cores
• Today: primarily out-of-order (BOLT) [HPCA'10]
Talk Outline
Introduction
Background: memory latency & latency tolerance
My work: energy efficient latency tolerance in BOLT
• Implementation aspects
• Runtime aspects
Other work and future plans
LLC (Last-Level Cache) Misses
What is this picture? Loads A & H miss the cache
This is an in-order processor
• Misses serialize: latencies add up and dominate performance
We want Miss-Level Parallelism (MLP): overlap A & H
[timeline: two serialized 250-cycle misses (not to scale)]
Miss-Level Parallelism (MLP)
One option: prefetching
• Requires predicting the address of H at A
Another option: out-of-order execution (Core i7)
• Requires a sufficiently large "window" to do this
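The payoff of MLP is easy to quantify with a toy model; the function below is my own sketch, assuming misses overlap perfectly up to some MLP degree:

```python
import math

def miss_stall_cycles(n_misses, miss_latency, mlp):
    """Cycles spent waiting on misses when up to `mlp` of them overlap.
    With mlp=1 the latencies serialize and add; with mlp=2 the two
    250-cycle misses from the slide cost only one miss latency."""
    return miss_latency * math.ceil(n_misses / mlp)

serialized = miss_stall_cycles(2, 250, 1)   # in-order: A then H
overlapped = miss_stall_cycles(2, 250, 2)   # MLP: A and H together
```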
[timeline: the two 250-cycle misses overlap]
Out-of-Order Execution & “Window”
Important "window" structures
• Register file (bounds in-flight instructions): 128 insns on Core i7
• Issue queue (bounds un-executed instructions): 36 on Core i7
• Sized to "tolerate" (keep the core busy for) ~30-cycle latencies
• To tolerate ~250 cycles, structures an order of magnitude bigger are needed
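The order-of-magnitude claim follows from a Little's-law style estimate; this framing (and the 4-wide issue rate) is mine, not the slide's:

```python
def window_needed(miss_latency, issue_rate):
    """Little's-law estimate: to keep a core issuing `issue_rate`
    instructions/cycle busy across a miss of `miss_latency` cycles,
    roughly latency * rate instructions must be in flight."""
    return miss_latency * issue_rate

# ~30-cycle latencies need ~120 in-flight instructions, close to the
# Core i7's 128 registers; ~250-cycle misses need roughly 8x more.
small = window_needed(30, 4)
large = window_needed(250, 4)
```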
Latency tolerance big idea: scale window virtually
[pipeline diagram: Fetch → Rename → Issue Queue → Register File/FU → D$, with a reorder buffer; load A is an LLC miss, B is completed, C and D are unexecuted]
Latency Tolerance
Prelude: add a slice buffer
• New structure (not in conventional processors)
• Can be relatively large: low bandwidth, not in the critical execution core
[pipeline diagram: slice buffer added alongside the reorder buffer]
Phase #1: long-latency cache miss → slice out
• Pseudo-execute: copy to the slice buffer, release the register & IQ slot
• Propagate "poison" to identify dependents
• Pseudo-execute them too
• Proceed under the miss
[pipeline diagram animation: miss A and its dependents drain into the slice buffer while later, independent instructions proceed]
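The slice-out phase above amounts to a forward pass over the instruction stream, propagating poison through register dependences; here is a toy model of that pass (the instruction encoding and register names are mine):

```python
def slice_out(instrs, missing_loads):
    """Toy slice-out: an instruction that reads a poisoned register is
    pseudo-executed into the slice buffer and poisons its own
    destination; independent instructions execute normally.
    instrs: (name, src_regs, dst_reg) triples in program order."""
    poisoned = set()
    slice_buf, executed = [], []
    for name, srcs, dst in instrs:
        if name in missing_loads or poisoned & set(srcs):
            slice_buf.append(name)        # pseudo-execute
            if dst:
                poisoned.add(dst)         # propagate poison
        else:
            executed.append(name)
            poisoned.discard(dst)         # dst now holds a good value
    return slice_buf, executed

# A misses; C and D depend (transitively) on it; B and E do not.
program = [("A", ["r1"], "r2"), ("B", ["r3"], "r4"),
           ("C", ["r2"], "r5"), ("D", ["r5"], "r6"),
           ("E", ["r4"], "r7")]
```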
Phase #2: cache miss return → slice in
• Allocate new registers
• Put back in the issue queue
• Re-execute the instruction
• Problems with sliced-in instructions (exceptions, mis-predictions)?
• Recover to the checkpoint (taken before A)
[pipeline diagram animation: returning miss A re-enters rename and the issue queue from the slice buffer; an exception on a sliced-in instruction triggers recovery to the checkpoint]
Slice Self Containment
Important for latency tolerance: self-contained slices
• A, D, & E have miss-independent inputs
• Capture these values during slice out
• This decouples the slice from the rest of the program
[dependence graph of instructions A–H showing the miss slice and its captured inputs]
Latency Tolerance
Latency tolerance example
• Slice out: the miss and its dependent instructions "grow" the window
• Slice in: after the miss returns
[timeline, energy ≈ number of boxes: Energy 1.5×, Delay 0.5×; combined into ED² (< 1.0 is good): ED² = 0.38×]
Previous Design: CFP
Prior design: Continual Flow Pipelines [Srinivasan'04]
• Obtains speedups, but also slowdowns
• Typically not energy efficient
[charts: speedup (higher is better) and ED² (lower is better) for CFP]
Energy-Efficient Latency Tolerance?
Efficient implementation
• Re-use existing structures when possible
• New structures must be simple and low-overhead
Runtime efficiency
• Minimize superfluous re-executions
Previous designs have not achieved (or considered) these
• Waiting Instruction Buffer [Lebeck'02]
• Continual Flow Pipeline [Srinivasan'04]
• Decoupled Kilo-Instruction Processor [Pericas'06,'07]
Sneak Preview: Final Results
This talk: my work on efficient latency tolerance
+ Improved performance
+ Performance robustness (do no harm)
+ Performance is energy efficient
[charts: speedup (higher is better) and ED² (lower is better)]
Talk Outline
Introduction
Background: memory latency & latency tolerance
My work: energy efficient latency tolerance in BOLT
• Implementation aspects
• Runtime aspects
Other work and future plans
Examination of the Problem
Problem with existing designs: register management
• Miss-dependent instructions free registers when they execute
• Actually, all instructions free registers when they execute
What's wrong with this?
• No instruction-level precise state → hurts on branch mispredictions
• Execution-order slice buffer → hard to re-rename & re-acquire registers
[pipeline diagram: execution-ordered slice buffer (A D H K L E) with checkpoints]
BOLT Register Management
Youngest instructions: kept in the re-order buffer
• Conventional, in-order register freeing
Miss-dependent instructions: in the slice buffer
• Execution-based register freeing
In-order speculative retirement stage
• Is the head of the ROB completed or poisoned?
• Release its registers
• Poisoned instructions enter the slice buffer
• Completed instructions are done and simply removed
[pipeline diagram animation: instructions retire speculatively from the ROB head; poisoned A moves to the slice buffer, completed instructions are removed]
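The speculative retirement loop above can be sketched directly; this is a toy model (the ROB encoding is mine), not BOLT's actual logic:

```python
def speculative_retire(rob):
    """Toy in-order speculative retirement: drain the ROB head while it
    is completed or poisoned; poisoned instructions move to the slice
    buffer, completed ones are simply removed. Retirement stops at the
    first unexecuted instruction. rob: list of (name, state) pairs."""
    slice_buf, removed = [], []
    while rob and rob[0][1] in ("completed", "poison"):
        name, state = rob.pop(0)
        (slice_buf if state == "poison" else removed).append(name)
    return slice_buf, removed

# Poisoned A goes to the slice buffer, completed B is removed,
# and retirement stops at unexecuted C (D stays behind it).
rob = [("A", "poison"), ("B", "completed"),
       ("C", "unexecuted"), ("D", "completed")]
```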
BOLT Register Management
Benefits of BOLT's register management
• Youngest instructions (ROB) get conventional recovery (do no harm)
• Program-order slice buffer allows re-use of SMT ("HyperThreading")
• Scale a single, conventionally sized register file
Register File Contribution #1: hybrid register management, the best of both worlds
Challenging part: two algorithms, one register file
• Note: two register files are not a good solution
[pipeline diagram: SMT threads and slice-buffer instructions sharing one register file]
Two Algorithms, One Register File
Conventional algorithm (ROB)
• In-order allocation/freeing from a circular queue
• Efficient squashing support by moving the queue pointer
Aggressive algorithm (slice instructions)
• Execution-driven reference counting scheme
How to combine these two algorithms?
• The execution-based algorithm uses reference counting
• Efficiently encode the conventional algorithm as reference counting
• Combine both into one reference count matrix
Register File Contribution #2: efficient implementation of the new hybrid algorithm
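The unifying idea is that any freeing discipline can be expressed as reference counting: a register is free when its count reaches zero. This toy free list is my own illustration of that idea, not BOLT's reference count matrix:

```python
class RefCountRegFile:
    """Toy reference-counted physical register free list: the ROB's
    in-order free and a slice reader's execution-driven free are both
    just count decrements; the register returns to the free list only
    when the last reference is released."""
    def __init__(self, n):
        self.counts = [0] * n
        self.free = list(range(n))

    def alloc(self):
        reg = self.free.pop()
        self.counts[reg] = 1          # one reference: the producer
        return reg

    def add_ref(self, reg):
        self.counts[reg] += 1         # e.g. a slice-buffer consumer

    def release(self, reg):
        self.counts[reg] -= 1
        if self.counts[reg] == 0:
            self.free.append(reg)     # last reference gone: truly free
```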
Management of Loads and Stores
Large window requires support for many loads and stores
• The window is effectively A–V now; what about the loads & stores?
• This could be an hour+ talk by itself… so just a small piece
Store to Load Dependences
Different from register state: cannot capture inputs
• Store → load dependences are determined by addresses
• Cannot "capture" them like registers
• Must be able to find the proper (older, matching) stores
• Must avoid younger matching stores ("write-after-read" hazards)
[diagram: a load searching among older stores A–E for the correct match]
[conventional store buffer, tail (younger) → head (older):
index:   86   85   84   83   82   81   80
address: 7B0  2AC  388  1B4  384  1AC  380
value:   90   78   ??   56   ??   34   12
poison:   0    0    1    0    1    0    0 ]
Conventional store queue/store buffer
• Holds stores in program order
• Loads search "associatively" (all entries in parallel)
• Doesn't scale to the sizes we need
For latency tolerance, we need…
• Poison (easy)
• A scalable way to search, accounting for age (hard)
Chained Store Buffer
Replace associative search with iterative, indexed search
• Overlay the store buffer with an address-based hash table
• Exploits the in-order nature of speculative retirement to build it
[chained store buffer, tail (younger) → head (older):
index:   86   85   84   83   82   81   80
address: 7B0  2AC  388  1B4  384  1AC  380
value:   90   78   ??   56   ??   34   12
poison:   0    0    1    0    1    0    0
link:    44   81    0   15    0   77    0
plus a 64-entry root table indexed by low address bits, e.g. AC → 85, B0 → 86, B4 → 83 ]
Loads follow the chain starting at the appropriate root table entry
• For example, a load to address 1AC: root AC → store 85 (address 2AC, no match) → link → store 81 (address 1AC): match, forward
Deferred loads ignore younger stores to avoid WAR hazards
• For example, a deferred load to address 1B4 …
• … whose immediately older store is 81 (noted when entering the pipeline)
• Root B4 → store 83 is younger than 81: ignore, follow its link → go to D$
+ Non-speculative search, scalable
+ Fast
• Most non-forwarding loads access the root table only
• Most forwarding loads find their store on the first hop
• Average number of excess hops < 0.05 with a 64-entry root table
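The root-table-plus-link structure behaves like a per-bucket linked list threaded through the store buffer. The sketch below is my toy model of that search (poison bits are not modeled, and the class/field names are mine); the addresses mirror the slide's example:

```python
class ChainedStoreBuffer:
    """Toy chained store buffer: stores arrive in program order, a root
    hash table (indexed by low address bits) points at the youngest
    store in each bucket, and each store links to the previous store in
    its bucket. A load walks the chain, skipping stores younger than
    itself, until it finds an address match."""
    def __init__(self, root_entries=64):
        self.root_entries = root_entries
        self.root = {}      # bucket -> index of youngest store
        self.stores = {}    # index -> (addr, value, link to older store)

    def _bucket(self, addr):
        return addr % self.root_entries

    def add_store(self, index, addr, value):
        b = self._bucket(addr)
        self.stores[index] = (addr, value, self.root.get(b))
        self.root[b] = index

    def load(self, addr, older_than):
        """Forwarded value from the youngest matching store older than
        index `older_than`, or None (meaning: go to the D$)."""
        idx = self.root.get(self._bucket(addr))
        while idx is not None:
            s_addr, value, link = self.stores[idx]
            if idx < older_than and s_addr == addr:
                return value
            idx = link
        return None

csb = ChainedStoreBuffer()
for idx, (addr, val) in enumerate(
        [(0x380, 0x12), (0x1AC, 0x34), (0x384, None), (0x1B4, 0x56),
         (0x388, None), (0x2AC, 0x78), (0x7B0, 0x90)], start=80):
    csb.add_store(idx, addr, val)
```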
BOLT: Implementation Recap
Three key implementation efficiencies in BOLT
1. Re-use of existing renaming hardware
2. Hybrid register management algorithm in a single register file
3. Efficient management of loads and stores
[pipeline diagram: BOLT datapath with slice buffer, checkpoint, and chained store buffer]
Experimental Evaluation
SPEC 2006 benchmarks
• Focus on memory bound programs (TurboBoost gets < 15%)
Performance: detailed cycle-level timing simulation of x86
• Baseline "Core i7" (includes prefetching)
Energy: re-execution overhead + new structures
• Estimate energy of new structures using CACTI-4.1 [Tarjan06]
CFP vs. BOLT
• Speedups: overall, 5% (CFP) → 11% (BOLT); MEM subset, 14% (CFP) → 18% (BOLT)
• Re-execution: increases due to more latency tolerance
• ED²: overall improvement
• Fewer and simpler new structures (lower energy)
• Increased re-executions typically correspond to higher performance
[charts: speedup (higher is better); re-execution overhead and ED² (lower is better)]
Talk Outline
Introduction
Background: memory latency & latency tolerance
My work: energy efficient latency tolerance in BOLT
• Implementation aspects
• Runtime aspects
Other work and future plans
Non-Blocking Latency Tolerance
Latency tolerance = non-blocking execution
• Re-execution should not block the pipeline either
• Suppose B & C miss (C depends on B)
• C should also not block the pipeline: reapply latency tolerance
Execution Inefficiency
Dynamic inefficiency: excessive multiple re-execution
• Observe: multiple re-execution comes from dependence on multiple loads
• Two possibilities: loads in parallel or loads in series
• Different approaches to each
Loads in Parallel
Example: accumulating sum
    for (i = 0; i < n; i++)
        total += array[i];
Assembly:
    loop:
        load [r1] -> r2
        add  r2 + r3 -> r3
        add  r1 + 4 -> r1
        bne  r1, r5, loop
[execution diagram: each iteration's load is independent of the others]
Loads in Parallel
[timeline: misses A–P overlap (MLP!); Energy 3.8×, Delay 0.4×, ED² 0.6×]
Goal: keep the performance, reduce the re-executions
Join Pruning
A's miss poisoned B… so A's return provides its antidote
B now executes correctly and provides the antidote to D
• D must capture this input
D is still poisoned by C, so it cannot provide an antidote
F is not receiving any antidote: no need to re-execute it
[dependence graph of instructions A–H illustrating antidote propagation]
Antidote Vector
BOLT filters re-execution using an antidote bit-vector
• Track (per logical register) whether an antidote is available
• Also works through store → load dependences (the poisoning store is known)
[pipeline diagram: antidote vector added alongside the slice buffer and chained store buffer]
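The antidote filter from the join-pruning example can be modeled as a single pass over the slice: re-dispatch an instruction only if one of its inputs just received an antidote, and let it pass the antidote on only if all its inputs are clean. This is my toy rendering (register and instruction names invented), not BOLT's hardware:

```python
def join_prune(slice_instrs, returned_regs):
    """Toy antidote filter. slice_instrs: (name, srcs, dst) in program
    order, every dst initially poisoned; returned_regs: registers whose
    values just arrived (the returning miss's antidote)."""
    poisoned = {dst for _, _, dst in slice_instrs}
    antidote = set(returned_regs)
    poisoned -= antidote
    redispatch = []
    for name, srcs, dst in slice_instrs:
        if antidote & set(srcs):              # receives an antidote
            redispatch.append(name)
            if not poisoned & set(srcs):      # all inputs clean:
                antidote.add(dst)             # pass the antidote on
                poisoned.discard(dst)
    return redispatch

# A's return cures B; D re-executes to capture B's value but stays
# poisoned by C; F receives no antidote and is pruned.
slice_instrs = [("B", ["rA"], "rB"), ("C", ["rM"], "rC"),
                ("D", ["rB", "rC"], "rD"), ("F", ["rD"], "rF")]
```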
Join Pruning
[timeline: join pruning keeps the overlap but skips useless re-executions; Energy 3.8× → 2.8×, Delay 0.4×, ED² 0.6× → 0.45×]
Join Pruning Performance
• Performance: strictly better (especially lbm)
• Execution overhead: strictly lower (especially lbm)
• ED²: overall improvements (again, especially lbm)
[charts: speedup (higher is better); execution overhead and ED² (lower is better)]
Loads in Series
Example: count the elements of a linked list
    while (node != NULL) {
        count++;
        node = node->next;
    }
Assembly:
    loop:
        load [r1] -> r1
        add  r2 + 1 -> r2
        bnz  r1, loop
[execution diagram: each load depends on the previous one]
Pointer Chasing
[timeline: the loads serialize, so latency tolerance buys no overlap; Energy 2.2×, Delay 1×, ED² 2.2×]
Dr! Dr! It hurts when I apply latency tolerance to pointer chasing…
So Don't Do It…
[timeline: without latency tolerance, Energy 1×, Delay 1×, ED² 1× (vs. Energy 2.2×, ED² 2.2× with it)]
Loads in Series
Not all dependent loads are bad
    for (int i = 0; i < n; i++)
        x += objects[i]->val;
Assembly:
    loop:
        load [r1] -> r2
        load [r2] -> r3
        add  r4, r3 -> r4
        add  r1, 4 -> r1
        bne  r1, r5, loop
Important: prune pointer chasing only
• Preserve general indirection
[execution diagram: the two loads in an iteration are dependent, but iterations are parallel with each other]
Pointer Chasing
How to distinguish the two?
• Pointer chasing: a load poisons younger instances of itself
• Benign indirection: poison comes from a different (static) load
• Loop induction → not a chain of poisoned loads
    loop1:
        load [r1] -> r1
        add  r2 + 1 -> r2
        bnz  r1, loop1
    loop2:
        load [r1] -> r2
        load [r2] -> r3
        add  r4, r3 -> r4
        add  r1, 4 -> r1
        bne  r1, r5, loop2
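The distinguishing rule reduces to comparing the poisoned load's PC with the poisoning load's PC; this classifier is a toy sketch of that rule (the event encoding and PCs are invented for illustration):

```python
def classify_poison(events):
    """Toy PC-based classifier: a load poisoned by an older dynamic
    instance of its own static load (same PC) is pointer chasing;
    poison arriving from a different static load is benign indirection.
    events: (consumer_pc, producer_pc) pairs recorded on poisoning."""
    chasing, benign = set(), set()
    for consumer_pc, producer_pc in events:
        (chasing if consumer_pc == producer_pc else benign).add(consumer_pc)
    return chasing, benign - chasing

# loop1's load poisons itself (pointer chasing); loop2's second load
# is poisoned by a different static load (benign indirection).
events = [(0x40, 0x40), (0x88, 0x80)]
```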
Extended Antidote Vector
Idea: extend poison information with the low bits of the PC
• Poison from the same PC → pointer chasing
One implementation: detect at execution
• Shuts pointer-chasing down immediately
• Complicates latency-critical execution structures
A better one: detect at re-dispatch (extend the antidotes)
• Learn the identity of the pointer-chasing PC and shut down future instances
[pipeline diagram: antidote vector extended with PC bits, alongside the slice buffer and chained store buffer]
Pointer Chasing Performance
• Speedups: same (good: no harm)
• Execution overhead: reduced (mcf 290% → 44%)
• ED²: overall improvement (mcf basically breaks even now)
[charts: speedup (higher is better); execution overhead and ED² (lower is better)]
BOLT vs. TurboBoost
BOLT is able to help performance where TurboBoost cannot…
…and more energy efficiently
BOLT + TurboBoost?
• Synergistic: BOLT "un-memory-bounds" programs
• BOLT + TurboBoost is still an ED² win!
[charts: speedup (higher is better) and ED² (lower is better)]
Partial Summary
Latency tolerance
• Scale the window virtually under long cache misses
• No good implementations + excessive overhead
• Potentially a good complement to TurboBoost
Energy-efficient latency tolerance
• Low-cost implementation: re-use SMT, registers & load/store structures
• Low runtime overhead: prune pointer chasing and "joins"
• Actually a good complement to TurboBoost
• Applicable to both in-order and out-of-order cores
iCFP: In-order Latency Tolerance
BOLT – (out-of-order core) + (in-order core) = iCFP [HPCA'09]
• Some details obviously differ due to the in-order pipeline
• Useful for a miss at any cache level (L1, L2, L3)
• Joint work with Santosh Nagarakatte
Other in-order latency-tolerant designs
• Sun's Rock "processor" [Chaudhry'09]
• Simple Latency Tolerant Processor [Nekkalapu'09]
[diagram: in-order pipeline (Fetch, two register files, FU, D$) with slice buffer, checkpoint, antidote vector, and chained store buffer]
Talk Outline
Introduction
Background: memory latency & latency tolerance
My work: energy efficient latency tolerance in BOLT
Other work and future plans
Other Work / Future Directions
Micro-architecture
• Control independence [ISCA'07]
• Plans for more work related to latency tolerance
• Store latency tolerance: possibly changes the out-of-order "sweet spot"
• In submission / in progress: energy efficient load/store data path
• Trident: reduce D$ accesses → improve energy + performance
• SMT-directory: reduce load-queue coherence searches in SMT
• Future work: register reference counting → register file gating
• Generally interested in performance and energy
Other Work / Future Directions
Simulation and workload methodologies
• Multi-programming workload methodology [MobS'09]
• Future plans include adapting these ideas to multi-threaded applications
• Generally interested in research on better simulation
Operating systems and security
• Operating-system-based security project for layered sandboxing
• Provides system calls to restrict the behavior of less trusted code
• Many future plans on this project, most involving hardware support
• Generally interested in how hardware can improve software
[ 98 ][ 98 ]
Sun’s Rock
Rock [Chaudhry'09] does in-order latency tolerance
• Slice buffer ("Deferral Queues") divided by multiple checkpoints
• Re-execution limited to the oldest region
• Values from slices are reintegrated into the main register file when the DQs empty
[diagram: in-order pipeline with two register files and a slice buffer divided by multiple checkpoints]
Unrolled Loops
What if the compiler unrolled a loop with pointer chasing?
• Still detectable, it just takes one detection per unrolled copy
    loop1:
        load [r1] -> r1
        bz   r1, endMidLoop
        load [r1] -> r1
        add  r2 + 2 -> r2
        bnz  r1, loop1
FIESTA
Experiments with multiple programs
• Simulation: cannot run the entire program (too slow)
• How do you do this?
Fixed workloads: run all programs for X million insns
Variable workloads: run both until the sum = X million insns
Two very different answers to one question…
Why is this? Which is the right answer?
Traditional Fixed-Workload
Single-program workload × N
• X insns (e.g. 5M/sample) from each program
• Workload composition is fixed across experiments
+ Direct comparisons between experiments
- Load imbalance: time spent executing only the slowest programs
[timeline: A: 5M and B: 5M instructions]
Variable-Workload
Multi-program execution defines the workload
• Execute all programs until some condition (e.g. total insns = 10M)
• Normalize to the single-program region defined by this execution
+ Eliminates load imbalance (by construction)
- Naturally oversamples programs which perform better
[timeline: A: 3M and B: 7M instructions]
Deconstructing Load Imbalance
Fixed-workload runs experience two forms of imbalance
Sample imbalance: different standalone runtimes
• Artifact of finite experiments
• Should be eliminated
• Easy: choose samples with the same standalone runtimes
Schedule imbalance: asymmetric ("unfair") contention
• Characteristic of concurrent execution
• Should be preserved and measured
FIESTA
FIESTA: Fixed-Instruction with Equal STAndalone runtimes
• Run single programs for C cycles, record the instruction count
• Build fixed workloads from the time-balanced samples
+ Eliminates sample imbalance
+ Remaining imbalance is schedule imbalance
[timeline: A: 5M and B: 7M instruction samples with equal standalone runtimes; the remaining difference is schedule imbalance]
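FIESTA's sample construction can be sketched in a few lines; this is my toy rendering (function name and IPC values invented), where running each program alone for the same cycle budget yields instruction counts that define the fixed workload:

```python
def fiesta_workload(standalone_ipc, budget_cycles):
    """FIESTA-style sample construction (toy): run each program alone
    for the same cycle budget C and record how many instructions it
    retires; those per-program counts define the fixed workload, so the
    samples are time-balanced by construction."""
    return {prog: round(ipc * budget_cycles)
            for prog, ipc in standalone_ipc.items()}

# A program retiring 1.0 IPC and one retiring 1.4 IPC, each run alone
# for 5M cycles, give the slide's A: 5M / B: 7M sample pair.
workload = fiesta_workload({"A": 1.0, "B": 1.4}, 5_000_000)
```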
Other Work: FIESTA
Experiments with multiple programs
• Simulation: cannot run the entire program (too slow)
• How do you do this?
Fixed workloads: run all programs for X million insns
Variable workloads: run both until the sum = X million insns
FIESTA [MobS'09]: create a-priori balanced samples
Joint work with Neeraj Eswaran
Other Work: Paladin
Large software systems: many components with different levels of trust
• No way to restrict the behavior of called modules
Paladin [in submission]: OS support for layered sandboxing
• New system calls to restrict system-call behavior
• Also ensures restrictions are only removed when the module returns
• Joint work with Jeff Vaughan
[diagram: trusted code calling a junior developer's module, a plugin, and a third-party library]