Fine Grain Incremental Rescheduling Via Architectural Retiming
Soha Hassoun, Tufts University, Medford, MA
Thanks to: Carl Ebeling, University of Washington, Seattle, WA



Example
Problem: clock period is too large
[Figure: RAM with Write Address and Read Address inputs and an Offset]

Pipelining
Problems w/ consecutive dependent operations

Performance Bottleneck
Latency constrained paths
Latency = n
Approach: apply architectural retiming at the RT level

Architectural Retiming
Problem: too much work, too little time
[Figure: circuit with signals y_k, C, and D; a pipeline register is added together with a negative register N]
Negative register implementations: precomputation, prediction
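
One way to read the register pair on this slide: a pipeline register delays its input by one cycle, and a negative register supplies the value its input will have one cycle later, so the pair together leaves the signal unchanged. The following is a minimal Python sketch on finite signal traces. The helper names `delay` and `advance` are hypothetical and are not taken from the slides.

```python
def delay(trace, reset=None):
    """Pipeline register: the output at cycle t is the input at cycle t-1."""
    return [reset] + trace[:-1]

def advance(trace, unknown=None):
    """Negative register: the output at cycle t is the input at cycle t+1.
    The last cycle of a finite trace has no future value, so it is marked unknown."""
    return trace[1:] + [unknown]

y = [3, 1, 4, 1, 5]
restored = advance(delay(y))
# Every cycle except the final (unknown) one matches the original signal.
print(restored[:-1] == y[:-1])
```

A negative register cannot be built directly, which is why the talk implements it by precomputation or by prediction.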

Outline
• Precomputation: incremental rescheduling without resource constraints
• Prediction: incremental rescheduling with resource constraints
• Results

Precomputation Function
[Figure: circuit with functions f, g, h, signals y_k, x_i, C, D, and negative register N; f' replaces f and the negative register]

$$D^t = C^{t+1} = f(\ldots, x_i^{t+1}, \ldots)$$

$$x_i^{t+1} = x_i'^{\,t} = g(\ldots, y_k^t, \ldots)$$

$$D^t = f(\ldots, g(\ldots, y_k^t, \ldots), \ldots) = f'(\ldots, y_k^t, \ldots)$$
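
The derivation of f' can be checked on a toy circuit. The sketch below assumes two hypothetical single-input functions `f` and `g` (the slides leave them abstract) and a register x between them. It simulates the original circuit cycle by cycle and compares C at time t+1 against f'(y) evaluated at time t.

```python
def g(y):
    # Hypothetical logic feeding the register x_i.
    return 2 * y + 1

def f(x):
    # Hypothetical logic driving signal C.
    return x * x

def f_prime(y):
    # Precomputation function: f composed with g, with no register in between.
    return f(g(y))

def simulate_c(ys, x_reset=0):
    """Original circuit: C^t = f(x^t), with register update x^{t+1} = g(y^t)."""
    x, cs = x_reset, []
    for y in ys:
        cs.append(f(x))
        x = g(y)
    return cs

ys = [1, 2, 3, 4]
c = simulate_c(ys)
d = [f_prime(y) for y in ys]   # D^t = f'(y^t)
# D^t equals C^{t+1} wherever C^{t+1} falls inside the simulated window.
print(d[:-1] == c[1:])
```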

Incremental Rescheduling
[Figure: functions f, g, h, signal y_k, negative register N, and precomputation function f']
Original schedule:   Time n: g    Time n+1: f, h
With f':             Time n: f'   Time n+1: h

Precomputing With Register Arrays
[Figure: register array with Write Address, Write Data, Read Address, and Read Data ports; the output Out passes through negative register N to give F]

$$F^t = \mathit{Out}^{t+1} = \mathit{Array}^{t+1}[\mathit{ReadAddress}^{t+1}]$$

Synthesizing Bypass Paths
[Figure: register array with Write Address, Write Data, and Read Data ports; the Precomputed Read Address is compared (=?) with the Write Address to select between Write Data and the array's Read Data]
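
The bypass follows from expanding F^t = Array^{t+1}[Read Address^{t+1}]: the array at time t+1 is the array at time t with the current write applied, so the early read must forward Write Data whenever the precomputed read address matches the write address. The sketch below is a minimal model under stated assumptions: one write per cycle with no write enable, one read port, and hypothetical function names.

```python
def precomputed_read(array, write_addr, write_data, next_read_addr):
    """F^t = Array^{t+1}[ReadAddress^{t+1}], using only values available at time t."""
    if next_read_addr == write_addr:   # the "=?" comparator
        return write_data              # bypass: forward the data being written
    return array[next_read_addr]       # otherwise read the current contents

def reference(array, writes, reads):
    """Original behaviour: Out^t = Array^t[ReadAddress^t]; a write is visible next cycle."""
    mem, outs = list(array), []
    for (wa, wd), ra in zip(writes, reads):
        outs.append(mem[ra])
        mem[wa] = wd
    return outs

init = [10, 20, 30, 40]
writes = [(1, 99), (2, 77), (1, 55), (0, 11)]   # (write address, write data) per cycle
reads = [0, 1, 3, 1]                            # read address per cycle

out = reference(init, writes, reads)
mem, f_vals = list(init), []
for t in range(len(reads) - 1):
    wa, wd = writes[t]
    f_vals.append(precomputed_read(mem, wa, wd, reads[t + 1]))
    mem[wa] = wd
print(f_vals == out[1:])   # F^t matches Out^{t+1}
```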

Precomputing RAM Output
[Figure: RAM with negative register N]

Prediction
[Figure: circuit with labels D, C, f, g, i, Z, and negative register N]
What if we can't precompute, precomputation needs too many additional resources, or performance is unsatisfactory?
Predict C one cycle before its arrival

Schedule with Mispredictions
[Figure: schedule over cycles t-1, t, t+1 for C (values c1, c2) and H (values h1, h2), with registers R1 and R2. With the negative register, predicted values c1* and c2* are used early and verified against the actual values: c1* =? c1, c2* =? c2]
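
The predict-then-verify flow in this schedule can be sketched in a few lines. This is an illustrative model only: the predictor, the downstream function `h`, and the recovery policy (discard the speculative result and recompute with the actual value) are assumptions, not the circuit from the slides.

```python
def run_with_prediction(actual_c, predict, h):
    """Return (outputs, mispredictions).

    actual_c: values of C in arrival order
    predict:  hypothetical predictor, maps the previous actual value to a guess c*
    h:        downstream logic H that consumes C
    """
    outputs, misses, prev = [], 0, None
    for c in actual_c:
        guess = predict(prev)     # c* is available one cycle before c
        result = h(guess)         # H starts early on the predicted value
        if guess != c:            # verify: c* =? c
            misses += 1
            result = h(c)         # nullify the speculative result, redo with c
        outputs.append(result)
        prev = c
    return outputs, misses

actual = [0, 0, 1, 1, 1, 0]
same_as_last = lambda prev: 0 if prev is None else prev   # assumed predictor
outputs, misses = run_with_prediction(actual, same_as_last, lambda c: c + 10)
print(outputs, misses)
```

Whatever the predictor does, the outputs equal H applied to the actual values; prediction accuracy only affects how often the correction path runs.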

Synthesis Issues in Prediction
• Negative register as predicting FSM
  • use signal transition probabilities
  • incorporate don't care conditions
• Nullifying mispredictions: two correction strategies
  • As-Soon-As-Possible restoration
  • As-Late-As-Possible correction
• Add handshaking signals to coordinate with interface
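
The slides do not say how transition probabilities feed the predicting FSM. One simple reading, shown here as an assumption rather than the talk's method, is a table that maps each observed value to its most frequent successor in a sample trace.

```python
from collections import Counter, defaultdict

def build_predictor(trace):
    """For each observed value, pick the most frequent successor in the trace."""
    counts = defaultdict(Counter)
    for cur, nxt in zip(trace, trace[1:]):
        counts[cur][nxt] += 1
    return {cur: c.most_common(1)[0][0] for cur, c in counts.items()}

trace = [0, 0, 0, 1, 0, 0, 1, 1, 0, 0]   # hypothetical sample of the predicted signal
table = build_predictor(trace)
print(table)
```

In this trace a 0 is followed by 0 four times out of six and a 1 is followed by 0 two times out of three, so the table predicts 0 after either value.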

Related Work
Precomputation
• Bypass synthesis
• Lookahead [Kogge '81, …]
Prediction / Speculative Execution
• Most likely path, arbitrarily deep [Holtmann & Ernst '93, '95]
• Pre-execution [Radivojevic & Brewer '94]
• Possible multiple paths & arbitrarily deep [Lakshminarayana et al. '98]
Percolation scheduling [Potasman et al. '90]

Results
[Figure: bar chart of Speed up and Area Increase (vertical axis 0 to 2.5) for Seq, QC, GCD-prec, FA1, FA2, MIM, MIM-pred, GCD-pred]

Architectural Retiming
• Improves throughput while preserving functionality and sometimes latency
• Bridges the gap between HLS and logic optimizations
• Unifies several sequential optimizations
  • bypass synthesis
  • lookahead transformation
  • branch prediction
  • fine-grain cross register optimizations

Ph.D. Forum at DAC '99
• Goal: increase interaction between academia and industry
• Format: students present work at poster session at DAC; researchers give feedback
• Who's eligible? Students within 1 or 2 years of finishing Ph.D. thesis
www.cs.washington.edu/homes/soha/forum

The End

Precomputing in Single-Register Cycles
[Figure: original circuit with blocks A and B in a single-register cycle; negative register N added; after lookahead, blocks A' and B' added alongside A and B]
Lookahead: A(n) is a function of B(n-2)
[Kogge, '81], [Parhi & Messerschmitt, '89]
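
Lookahead is easiest to see on a first-order linear recurrence. The sketch below is a generic illustration, not the A/B circuit from the slide: it rewrites x(n) = a*x(n-1) + u(n) so that x(n) depends on x(n-2), which puts two registers in the feedback loop instead of one. The function names and the integer test data are assumptions.

```python
def direct(a, u, x0=0):
    """x(n) = a*x(n-1) + u(n): one register in the feedback loop."""
    xs, x = [], x0
    for un in u:
        x = a * x + un
        xs.append(x)
    return xs

def lookahead(a, u, x0=0):
    """x(n) = a*a*x(n-2) + a*u(n-1) + u(n): the loop now spans two registers."""
    xs = []
    for n, un in enumerate(u):
        if n == 0:
            x = a * x0 + un                  # first sample uses the one-step form
        elif n == 1:
            x = a * a * x0 + a * u[0] + un   # x(-1) is the reset value x0
        else:
            x = a * a * xs[n - 2] + a * u[n - 1] + un
        xs.append(x)
    return xs

u = [1, 2, 3, 4, 5]
print(direct(3, u) == lookahead(3, u))
```

The extra multiply and add (a*a and a*u(n-1)) sit outside the feedback loop, so they can be pipelined freely; that is the area cost paid for the shorter loop.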

Precomputing RAM Output
[Figure: RAM]

Speculative Execution
Scope and Depth
[Figure: nodes labeled c1 through c6]