Accelerating Decoupled Look-ahead to Exploit Implicit Parallelism

Raj Parihar

Advisor: Prof. Michael C. Huang

March 22, 2013

Motivation

Despite the proliferation of multi-core, multi-threaded systems, single-thread performance is still an important design goal

Modern programs do not lack instruction level parallelism

Real challenge: exploit implicit parallelism without undue costs

One effective approach: Decoupled look-ahead

[Figure: IPC (log scale, 1 to 100) for bzip2, crafty, eon, gap, gcc, gzip, mcf, pbmk, twolf, vortex, vpr, and Gmean; legend entries 128, 512, 2K; a value of 107 is annotated]

Motivation: Decoupled Look-ahead

Decoupled look-ahead architecture targets:

Performance hurdles: branch mispredictions, cache misses, etc.

Exploration of parallelization opportunities, dependence information

Microarchitectural complexity, energy inefficiency through decoupling

The look-ahead thread can often become a new bottleneck

We explore techniques to accelerate the look-ahead thread:

Speculative parallelization: aptly suited due to increased parallelism in the look-ahead binary

Weak dependence: lack of correctness constraint allows weak instruction removal without affecting the quality of look-ahead

Outline

Motivation

Baseline decoupled look-ahead

Look-ahead thread acceleration

Speculative parallelization in look-ahead

Weak dependence removal in look-ahead

Experimental analysis

Summary

Baseline Decoupled Look-ahead

A binary parser is used to generate the skeleton from the original program

The skeleton runs on a separate core and:

Maintains its memory image in local L1, with no writeback to shared L2

Sends branch outcomes through a FIFO queue; also helps prefetching

A. Garg and M. Huang, “A Performance-Correctness Explicitly Decoupled Architecture”, MICRO-08
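As a rough illustration of the branch-outcome queue described on this slide, the Python sketch below models the look-ahead thread pushing resolved branch outcomes into a bounded FIFO that the main thread later consumes as hints. The class name `BranchOutcomeQueue`, its capacity, and its methods are hypothetical; the slides do not specify the queue's design.

```python
from collections import deque


class BranchOutcomeQueue:
    """Bounded FIFO carrying branch outcomes from look-ahead to main thread.

    Hypothetical model: the capacity and interface are illustrative only.
    """

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = deque()

    def push(self, taken):
        """Look-ahead side: returns False (stall) when the queue is full."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append(taken)
        return True

    def pop(self):
        """Main-thread side: returns the oldest outcome, or None if empty."""
        return self.entries.popleft() if self.entries else None


def run_demo():
    queue = BranchOutcomeQueue(capacity=2)
    # Look-ahead thread resolves three branches; the third push must stall.
    pushed = [queue.push(outcome) for outcome in (True, False, True)]
    # Main thread consumes outcomes in program order as branch hints.
    hints = [queue.pop(), queue.pop(), queue.pop()]
    return pushed, hints


if __name__ == "__main__":
    print(run_demo())
```

In this toy model a full queue makes `push` fail, which is one simple way a bounded queue could keep the look-ahead thread from running arbitrarily far ahead.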

Practical Advantages of Decoupled Look-ahead

The look-ahead thread is a self-reliant agent, completely independent of the main thread:

No need for quick spawning and register communication support

Low management overhead on main thread

Easier for run-time control to disable

Natural throttling mechanism to prevent run-away prefetching and cache pollution

Look-ahead thread size comparable to aggregation of short helper threads

Look-ahead: A New Bottleneck

Comparing four systems to discover new bottlenecks:

Single-thread, decoupled look-ahead, ideal, and look-ahead alone

Application categories:

Bottleneck removed or speed of look-ahead is not an issue

Look-ahead thread is the new bottleneck (right half)

[Figure: IPC (0 to 4) per application (apl, msa, wup, mgr, six, swim, fac, gal, gcc, gap, eon, fm3d, gzip, crft, vor, apsi, vpr, bzp2, eqk, amp, luc, art, pbmk, mcf, twlf) for four systems: Single-thread, Decoupled look-ahead, Ideal (Cache, Br), Look-ahead only]

[Figure annotations: 81%, 53%, 47%, 27%, 11%, 70%, 11%, 16%, 35%, 47%, 68%, 154%, 189%, 117%, 262%, 262%]

Unique Opportunities for Speculative Parallelization

Skeleton code offers more parallelism:

Certain dependencies removed during slicing for skeleton

Short-distance dependence chains become long-distance chains, suitable for TLP exploitation

Look-ahead is inherently error-tolerant:

Can ignore dependence violations

Little to no support needed, unlike in conventional TLS

Software Support

Dependence analysis:

Profile guided, coarse-grain at basic block level

Spawn and target points:

Basic blocks with a consistent dependence distance of more than a threshold of DMIN

Spawned thread executes from the target point

Loop-level parallelism is also exploited
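The DMIN criterion above can be sketched as a simple filter over profile data. This is only an illustration under stated assumptions: the profile format (a mapping from basic-block id to observed dependence distances, in basic blocks), the function name `select_target_blocks`, and the reading of "consistent" as "true for every profiled instance" are all hypothetical and not taken from the talk.

```python
def select_target_blocks(profile, dmin):
    """Return basic blocks whose profiled dependence distance always exceeds dmin.

    profile maps a basic-block id to the dependence distances (in basic
    blocks) observed for it during profiling. "Consistent" is interpreted
    here, as an assumption, as "true for every profiled instance".
    """
    return sorted(
        block
        for block, distances in profile.items()
        if distances and min(distances) > dmin
    )


# Hypothetical profile data, not taken from the talk.
EXAMPLE_PROFILE = {
    7: [22, 31, 18],   # always far from its producers: candidate
    12: [40, 3, 35],   # occasionally close: rejected
    20: [16, 16, 17],  # just above the threshold: candidate
    31: [],            # never profiled: rejected
}

if __name__ == "__main__":
    print(select_target_blocks(EXAMPLE_PROFILE, dmin=15))
```

A stricter or looser notion of consistency (for example, a percentile instead of the minimum) would change which blocks qualify; the minimum is used here only because it is the simplest choice.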

Parallelism Potential in Look-ahead Binary

Available parallelism for a 2-core/context system; DMIN = 15 BB

Skeleton exhibits significantly more BB-level parallelism (17%)

Loop-based FP applications exhibit more BB-level parallelism

Hardware and Runtime Support

Thread spawning and merging are very similar to regular thread spawning, except:

Spawned thread shares the same register and memory state

Spawning thread terminates at the target PC

Value communication:

Register-based, naturally through shared registers in SMT

Memory-based communication can be supported at different levels

Partial versioning in cache at line level

Speedup of Speculative Parallelization

14 applications in which the look-ahead thread is the bottleneck

Speedup of look-ahead systems over single-thread:

Decoupled look-ahead over single-thread baseline: 1.53x

Speculative parallel look-ahead over single-thread: 1.73x

Speculative look-ahead over decoupled look-ahead: 1.13x
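The three figures on this slide are mutually consistent: dividing the two speedups measured against the same single-thread baseline gives the speedup of one look-ahead system over the other. A small check, with a helper name (`relative_speedup`) chosen here purely for illustration:

```python
def relative_speedup(new_over_base, old_over_base):
    """Speedup of the new system over the old one, given both over a common baseline."""
    return new_over_base / old_over_base


if __name__ == "__main__":
    # Values quoted on the slide (speedups over the single-thread baseline).
    print(round(relative_speedup(1.73, 1.53), 2))
```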

Speculative Look-ahead vs Conventional TLS

Skeleton provides more opportunities for parallelization

Speculative look-ahead over decoupled LA baseline: 1.13x

Speculative main thread over single thread baseline: 1.07x

Motivation for Exploiting Weak Dependences

Not all instructions are equally important and critical

Examples of weak instructions:

Inconsequential adjustments

Load and store instructions that are (mostly) silent

Dynamic NOP instructions

Plenty of weak instructions are present in programs

Challenges involved:

Context-based, hard to identify and combine – much like Jenga
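To make the "(mostly) silent" store example concrete: a store is commonly called silent when it writes the value the location already holds. The sketch below replays a list of stores against a dictionary standing in for memory and reports the silent fraction. The function names and the trace are hypothetical, and the talk does not say how silence is measured in its framework.

```python
def is_silent_store(memory, address, value):
    """A store is silent if it writes the value already held at that address."""
    return memory.get(address) == value


def silent_store_fraction(memory, stores):
    """Replay (address, value) stores and return the fraction that were silent."""
    silent = 0
    for address, value in stores:
        if is_silent_store(memory, address, value):
            silent += 1
        memory[address] = value
    return silent / len(stores) if stores else 0.0


if __name__ == "__main__":
    # Hypothetical trace: two of the four stores rewrite an unchanged value.
    trace = [(0x10, 5), (0x10, 6), (0x10, 6), (0x20, 0)]
    print(silent_store_fraction({0x10: 5}, trace))
```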

Comparison of Weak and Strong Instructions

Static attributes of weak and strong instructions are remarkably similar:

Static attributes: opcode, number of inputs

The correlation coefficient of the two distributions is 0.96

Weakness has very poor correlation with static attributes

Hard to identify the weak instructions through static heuristics
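One way to read the 0.96 figure is as a Pearson correlation between two histograms, for example the share of weak and of strong instructions falling into each opcode class. The sketch below computes that coefficient for made-up histograms; the class names and fractions are hypothetical, and the slides do not state which correlation measure or binning was used.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical fractions of instructions per opcode class (not from the talk).
OPCODE_CLASSES = ["alu", "load", "store", "branch", "fp"]
WEAK = [0.40, 0.25, 0.10, 0.15, 0.10]
STRONG = [0.38, 0.27, 0.12, 0.14, 0.09]

if __name__ == "__main__":
    print(pearson(WEAK, STRONG))
```

A coefficient close to 1 means the two histograms have nearly the same shape, which is why an instruction's opcode class alone says little about whether it is weak.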

Genetic Algorithm based Framework

Genetic algorithm based framework to identify and eliminate weak instructions from the look-ahead skeleton:

Genetic evolution: procreation and natural selection

Chromosome creation and hybridization

Baseline look-ahead skeleton construction
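The general shape of such a framework can be sketched with a tiny genetic algorithm over bit-mask chromosomes, where bit i set to 0 means "remove instruction i from the skeleton". Everything specific here is an assumption made for illustration: the encoding, the single-point crossover, the one-bit mutation, the population size, and especially `toy_fitness`, which stands in for what would presumably be a simulated performance measurement.

```python
import random


def evolve(fitness, n_genes, pop_size=8, generations=10, seed=0):
    """Tiny genetic algorithm over bit-mask chromosomes.

    Bit i set to 0 means "remove instruction i from the skeleton".
    fitness(chromosome) returns a score to maximize.
    """
    rng = random.Random(seed)
    # Start from the baseline skeleton (nothing removed) plus random variants.
    population = [[1] * n_genes] + [
        [rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size - 1)
    ]
    for _ in range(generations):
        # Natural selection: keep the fitter half as parents.
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            mom, dad = rng.sample(parents, 2)
            cut = rng.randrange(1, n_genes)  # single-point crossover
            child = mom[:cut] + dad[cut:]
            flip = rng.randrange(n_genes)    # one-bit mutation
            child[flip] ^= 1
            children.append(child)
        population = parents + children
    return max(population, key=fitness)


# Hypothetical stand-in for simulation: instructions 1 and 4 are "weak"
# (removing them helps), every other removal hurts look-ahead quality.
WEAK_SET = {1, 4}


def toy_fitness(chromosome):
    score = 0
    for index, kept in enumerate(chromosome):
        if not kept:
            score += 1 if index in WEAK_SET else -3
    return score


if __name__ == "__main__":
    best = evolve(toy_fitness, n_genes=6)
    print(best, toy_fitness(best))
```

Because the fitter half of each generation is carried over unchanged, the best solution found never gets worse than the baseline skeleton that seeds the population.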

Heuristic Based Solutions

Heuristic based solutions are helpful to jump start the evolution

Superposition based chromosomes

Orthogonal subroutine based chromosomes

Progress of Genetic Evolution Process

Per generation progress compared to the final best solution

After 2 generations, more than half of the benefits are achieved
After 5 generations, at least 90% of benefits are achieved
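
The per-generation comparison can be expressed as the fraction of the final gain reached so far. The sketch below assumes progress is measured as (speedup_g - 1) / (speedup_final - 1); the slide does not give the exact metric, and the speedup series is made-up illustrative data, not a result from the talk.

```python
def fraction_of_final_benefit(speedups):
    """speedups[g] is the best speedup after generation g (index 0 = baseline)."""
    final_gain = speedups[-1] - 1.0
    if final_gain <= 0:
        raise ValueError("final solution shows no gain over the baseline")
    return [(s - 1.0) / final_gain for s in speedups]

# Made-up series for illustration only.
illustrative = [1.00, 1.05, 1.09, 1.12, 1.14, 1.15, 1.16]
progress = fraction_of_final_benefit(illustrative)
for generation, frac in enumerate(progress):
    print(generation, f"{frac:.0%}")
```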


Experimental Setup

Program/binary analysis tool: based on ALTO

Simulator: based on heavily modified SimpleScalar

SMT, look-ahead, and speculative parallelization support
True execution-driven simulation (faithful value modeling)

Genetic algorithm framework

Modeled as offline and online extension to the simulator

Microarchitectural config:


Speedup of Self-tuned Look-ahead

Applications in which the look-ahead thread is a bottleneck

Self-tuned, genetic algorithm based decoupled look-ahead

Speedup over baseline decoupled look-ahead: 1.16x
Speedup over single-thread baseline: 1.78x
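
If both figures are geometric means over the same set of applications (an assumption; the slide does not say how they are averaged), their ratio gives the speedup of the baseline decoupled look-ahead over single-thread execution:

```python
speedup_vs_single_thread = 1.78   # from the slide
speedup_vs_baseline_dla = 1.16    # from the slide

# Ratios of geometric means compose multiplicatively over the same application set.
baseline_dla_vs_single_thread = speedup_vs_single_thread / speedup_vs_baseline_dla
print(round(baseline_dla_vs_single_thread, 2))  # 1.53
```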


Comparison with Speculative Parallel Look-ahead

Self-tuned skeleton is used in the speculative parallel look-ahead

In some cases, self-tuned and speculative parallel look-ahead techniques are synergistic (ammp, art)


Ongoing and Future Explorations

Load balancing through skipping non-critical branches

Weak instruction classification and identification

Single-core version of decoupled look-ahead

Static heuristics to construct adaptive skeletons

Role of look-ahead to promote parallelization and accelerate the execution of interpreted programs


Summary

Decoupled look-ahead can uncover significant implicit parallelism

However, the look-ahead thread often becomes a new bottleneck

Fortunately, look-ahead lends itself to various optimizations:

Speculative parallelization is more beneficial in the look-ahead thread
Weak instructions can be removed w/o affecting look-ahead quality

An intelligent look-ahead technique is a promising solution in the era of flat frequency and modest microarchitecture scaling

Idle cores in a multicore environment will further strengthen the case for decoupled look-ahead adoption in mainstream systems


Backup Slides


Microthreads vs Decoupled Look-ahead

Lightweight Microthreads:

Decoupled Look-ahead:


Look-ahead Skeleton Construction


Performance Benefits of Decoupled Look-ahead


IPC of Speculative Parallelization


Speculative Parallelization: Cortex-A9 vs POWER5


Flexibility in Look-ahead Hardware Design

Comparison of regular (partial versioning) cache support with two other alternatives

No cache versioning support
Dependence violation detection and squash


Partial Recoveries and Spawns

Partial recoveries:

Breakdown of all the spawns:


Genetic Algorithm Evolution


Multi-instruction Gene Examples


Optimizations to Implementation

Fitness test optimizations

Sampling based fitness
Multi-instruction genes
Early termination of tests

GA framework optimizations

Hybridization of solutions
Adaptive mutation rate
Unique chromosomes
Fusion crossover operator
Elitism policy
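
As an illustration of early termination, the sketch below stops a fitness test as soon as a candidate skeleton triggers too many recoveries. The slide names the optimization without detail, so the recovery budget, the per-interval sampling, and the use of IPC as the fitness value are all assumptions, and the function name is hypothetical.

```python
def fitness_with_early_termination(intervals, max_recoveries):
    """Evaluate a candidate over sampled intervals.

    intervals: iterable of (instructions, cycles, recoveries) tuples.
    Returns (ipc, completed). The test is abandoned, with fitness 0.0,
    once the running recovery count exceeds max_recoveries.
    """
    instrs = cycles = recoveries = 0
    for n_instr, n_cycles, n_rec in intervals:
        instrs += n_instr
        cycles += n_cycles
        recoveries += n_rec
        if recoveries > max_recoveries:
            return 0.0, False   # rejected without finishing the test
    ipc = instrs / cycles if cycles else 0.0
    return ipc, True

# Made-up samples: the second candidate exceeds the budget in its second interval.
good = [(1000, 500, 0), (1000, 400, 1)]
bad = [(1000, 500, 3), (1000, 400, 9), (1000, 400, 0)]
print(fitness_with_early_termination(good, max_recoveries=5))
print(fitness_with_early_termination(bad, max_recoveries=5))   # (0.0, False)
```

Rejecting a clearly poor candidate after a fraction of the test saves simulation time that can be spent on more promising chromosomes.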


Superposition based Chromosomes


Recovery based Early Termination of Fitness Test


Weak Dependence: IPC and Instructions Removed

IPC comparison:

Instructions removed from baseline skeleton:


Sampling based Fitness Test


L2 Cache Sensitivity Study

Speedup for various L2 caches is quite stable

1.161x (1 MB), 1.154x (2 MB), and 1.152x (4 MB) L2 caches

Avg. speedups, shown in the figure, are relative to single-threaded execution with a 1 MB L2 cache


Single-Gene vs Heuristic based Chromosomes


Genetic Algorithm based Heuristics


Other Details

Energy reduction: 11% over baseline decoupled look-ahead

Reduced cache accesses, less stalling of main thread

On average, 10% of the dynamic instructions are removed from the baseline skeleton

Offline profiling and control software overhead

Offline profiling time: 2 to 20 seconds on the target machine
Online control software: 17 million instructions for whole evolution

Average extra recoveries: 3-4 per 100,000 instructions


Load Balancing in Look-ahead

Non-critical branches can be transformed to accelerate the look-ahead thread and achieve better load balancing

Non-critical branches are skipped in the look-ahead thread
Main thread executes such branches by itself w/o any help
We call these branches Do-It-Yourself or DIY branches
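
The division of work for DIY branches can be sketched as follows. This is a simplified model under stated assumptions: the look-ahead thread is assumed to pass branch outcomes to the main thread as an ordered stream of hints, the set of DIY branch PCs is given as input (the slide does not say how non-critical branches are identified), and all names are hypothetical.

```python
def split_branches(trace, diy_pcs):
    """trace: list of (pc, taken) in program order.

    Returns the hints the look-ahead thread would produce (non-DIY branches
    only) and the number of dynamic branches left to the main thread.
    """
    hints = [(pc, taken) for pc, taken in trace if pc not in diy_pcs]
    diy_count = sum(1 for pc, _ in trace if pc in diy_pcs)
    return hints, diy_count

def main_thread_direction(pc, diy_pcs, hint_iter, own_predictor):
    """DIY branches use the main thread's own predictor; others consume a hint."""
    if pc in diy_pcs:
        return own_predictor(pc)
    return next(hint_iter)[1]

trace = [(0x10, True), (0x20, False), (0x10, True), (0x30, True)]
diy = {0x10}                      # assumed to be classified as non-critical
hints, diy_count = split_branches(trace, diy)
hint_iter = iter(hints)
directions = [main_thread_direction(pc, diy, hint_iter, lambda pc: True)
              for pc, _ in trace]
print(hints, diy_count, directions)
```

In this model the look-ahead thread handles two of the four dynamic branches, which is the load-balancing effect the slide describes.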


Preliminary Speedup of Load Balancing

Load balancing through skipping of the non-critical branches

Max. speedup over decoupled look-ahead: 1.76x (art)
Avg. speedup over decoupled look-ahead: 1.12x (Gmean)
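
"Gmean" here is the geometric mean, the usual way to average speedup ratios. A minimal sketch of computing the maximum and the geometric mean follows; the per-application values and names are made up for illustration and are not results from the talk.

```python
from statistics import geometric_mean

# Made-up per-application speedups, for illustration only.
speedups = {"app_a": 1.00, "app_b": 1.44}

best_app = max(speedups, key=speedups.get)
gmean = geometric_mean(speedups.values())
print(best_app, speedups[best_app])   # app_b 1.44
print(round(gmean, 2))                # 1.2
```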
