-
ComputerArchitectureLecture5:CompilerTechniquesforILP&BranchPrediction(Chapter3)
ChihWeiLiuNationalChiao [email protected]
-
DataDependenceandParallelism
If2instructionsareparallel theycanbeexecutedsimultaneouslyinapipelinewithoutcausinganystalls(exceptthestructuralhazards)
theirexecutionordercanbeswapped If2instructionsaredependent
theymustbeexecutedinorder orpartiallyoverlapped.
Toexploitparallelismsoverinstructionsisequivalenttodeterminedependencesoverinstructions
CA-Lec5 [email protected]
Introduction
2
-
ExploitInstructionLevelParallelism
Twomainapproaches: Hardwarebaseddynamicapproaches
Hardwarelocatestheparallelisminruntime Usedinserveranddesktopprocessors(NotusedasextensivelyinPMPprocessors)
Superscalarprocessors:Pentium4,IBMPower,AMDOpteron
Compilerbasedstaticapproaches Softwarefindsparallelismatcompiletime UsedinDSPprocessors(Notassuccessfuloutsideofscientificapplications)
VLIWprocessors:Itanium2,ITRIPAC
CA-Lec5 [email protected]
Introduction
3
-
WhyILP?
MultiissueProcessor:twoormoreinstructionscanbeissued(orexecuted)inparallel ThegoalistomaximizeIPC(instructionpercycle) PipelineCPI=IdealpipelineCPI+Structuralstalls+Datahazardstalls+Controlstalls
Toreducetheimpactofdataandcontrolhazards BasicblockILP
CanwemakeCPIcloserto1? Ifwehavencyclelatency,thenweneedn1instructionsbetweenaproducinginstructionanditsuse
CA-Lec5 [email protected]
Introduction
4
-
BasicBlockILP BasicBlock(BB)ILPisquitesmall :
BB:astraightlinecodesequencewithnobranchesin excepttotheentryandnobranchesout exceptattheexit
averagedynamicbranchfrequency15%to25%=>4to7instructionsexecutebetweenapairofbranches
PlusinstructionsinBBlikelytodependoneachother WemustexploitILPacrossmultiplebasicblocks
Loopunrolling toexploitlooplevelparallelism
CA-Lec5 [email protected] 5
for (i=1; i
-
OvercometheDataDependence
Maintainingthedependencebutavoidingahazard schedulingthecodeinHW/SWapproach
Eliminatingadependencebytransformingthecode primarybysoftware
Dependencedetection byregisternames:simpler bymemorylocations:morecomplicated
Twoaddressesmayrefertothesamelocationbutlookquitedifferent(e.g.100(R4), 20(R6) maybeidentical)
Theeffectiveaddressofaload/storemaychangedfrominstructiontoinstruction(20(R4), 20(R4) maybedifferent)
CA-Lec5 [email protected] 6
-
(True)DataDependence
DataandControldependenciesareapropertyoftheprogram(application)
Datadependenceconveys: Possibilityofapipelinehazard Orderinwhichresultsmustbecalculated(i.e.programbehavior)
UpperboundonILP Dependenciesthatflowthroughmemorylocationsaredifficulttodetect Hardwarebaseddynamicapproachismoreattractive
CA-Lec5 [email protected]
Introduction
7
-
NameDependence Twoinstructionsusethesameregisterormemorylocation(i.e.thesamename)butnoflowofinformation Notatruedatadependence,butisaproblemwhenreorderinginstructionsorairregularlypipelineddatapath.
Antidependence:instructionjwritesaregisterormemorylocationthatinstructioni reads
Initialordering(i beforej)mustbepreserved Outputdependence:instructioni andinstructionjwritethesameregisterormemorylocation
Orderingmustbepreserved
Toresolve,userenamingtechniques
CA-Lec5 [email protected]
Introduction
8
-
RegisterRenamingandWAW/WAR
DIV.D F0, F2, F4 ADD.D F6, F0, F8 S.D F6, 0 (R1) SUB.D F8, F10, F14 MUL.D F6, F10, F8
DIV.D F0, F2, F4 ADD.D S, F0, F8 S.D S, 0 (R1) SUB.D T, F10, F14 MUL.D F6, F10, T
CA-Lec5 [email protected] 9
WAW: ADD.D/MUL.D
WAR: ADD.D/SUB.D, S.D/MUL.D
RAW: DIV.D/ADD.D, ADD.D/S.D SUB.D/MUL.D
Register renaming result
-
ControlDependence
ControlDependence Orderingofinstructioni withrespecttoabranchinstruction
Instructioncontroldependentonabranchcannotbemovedbeforethebranchsothatitsexecutionisnolongercontrollerbythebranch
Aninstructionnotcontroldependentonabranchcannotbemovedafterthebranchsothatitsexecutioniscontrolledbythebranch
CA-Lec5 [email protected]
Introduction
10
-
ControlDependenceIgnored
Themethodtopreservethecontroldependence BeusedinmostsimplepipelineCPUs Simplebutinefficient
Controldependenceisnotthecriticalpropertythatmustbepreserved Wemayexecuteinstructionthatshouldnothavebeenexecuted,
therebyviolatingthecontroldependence,ifwecandosowithoutaffectingthecorrectnessoftheprogram
Thetwopropertiescriticaltoprogramcorrectnessare Theexceptionbehavior Dataflow
CA-Lec5 [email protected] 11
-
ControlDependenceExamples
R1inORinstructiondependsonDADDUorDSUBU,reliedonthebranchistakenornot.
AssumeR4isntusedafterskip PossibletomoveDSUBUbeforethe
branch(thisviolatesthecontroldependencebutnotaffectsthedataflow)
CA-Lec5 [email protected]
Introduction
Example 1:DADDU R1,R2,R3BEQZ R4,LDSUBU R1,R1,R6
L: OR R7,R1,R8
Example 2:DADDU R1,R2,R3BEQZ R12,skipDSUBU R4,R5,R6DADDU R5,R4,R9
skip:OR R7,R8,R9
12
-
PreserveDataFlowBehavior
Thedataflowistheactualflowofdata amonginstructions Branchesmakedataflowdynamic Example
R1 valuedependsonthebranchistakenornot DSUBU cannotbemovedabovethebranch. Speculationshouldtakecarethisproblem
Programorder,thatdetermineswhichpredecessorwillactuallydeliveradatavaluetotheinstruction,shouldbeensuredbymaintainingthecontroldependences
CA-Lec5 [email protected] 04-13
-
CompilerTechniquesforExposingILP
Pipelinescheduling Separatedependentinstructionfromthesourceinstructionbythe
pipelinelatency(orinstructionlatency)ofthesourceinstruction Example:
for(i=999;i>=0;i=i1)x[i]=x[i]+s;
CA-Lec5 [email protected]
Com
piler Techniques
14
Loop: L.D F0,0(R1)ADD.D F4,F0,F2S.D F4,0(R1)DADDUI R1,R1,#-8BNE R1,R2,Loop
-
DataDependenceLoop: L.D F0, 0(R1) ;F0=array element
ADD.D F4, F0, F2 ;add scalar in F2
S.D F4, 0(R1) ;store result
DADDUI R1, R1, #-8 ;decrement pointer
BNE R1, R2, Loop ;branch R1!=R2
CA-Lec5 [email protected] 15
The arrows show the order that must be preserved for correct execution.
If two instructions are data dependent, they cannot execute simultaneously or be completely overlapped.
-
Step1:InsertPipelineStallsLoop: L.D F0,0(R1)
stallADD.D F4,F0,F2stallstallS.D F4,0(R1)DADDUI R1,R1,#8stall (assumeintegerloadlatencyis1)BNE R1,R2,Loop
CA-Lec5 [email protected]
Com
piler Techniques
16
9 C.C/ iteration
-
Step2:ReSchedulingScheduledcode:Loop: L.D F0,0(R1)
DADDUI R1,R1,#8ADD.D F4,F0,F2stallstallS.D F4,8(R1)BNE R1,R2,Loop
CA-Lec5 [email protected]
Com
piler Techniques
17
7 C.C/ iteration
-
Step3:LoopUnrolling Loopunrolling
Unrollbyafactorof4(assume#elementsisdivisibleby4) Eliminateunnecessaryinstructions
Loop: L.DF0,0(R1)ADD.DF4,F0,F2S.DF4,0(R1);dropDADDUI&BNEL.DF6,8(R1)ADD.DF8,F6,F2S.DF8,8(R1);dropDADDUI&BNEL.DF10,16(R1)ADD.DF12,F10,F2S.DF12,16(R1);dropDADDUI&BNEL.DF14,24(R1)ADD.DF16,F14,F2S.DF16,24(R1)DADDUIR1,R1,#32BNER1,R2,Loop
CA-Lec5 [email protected]
Com
piler Techniques
note:numberofliveregistersvs.originalloop
18
-
Step4:ReScheduletheUnrolledloop
Pipelinescheduletheunrolledloop:
Loop: L.DF0,0(R1)L.DF6,8(R1)L.DF10,16(R1)L.DF14,24(R1)ADD.DF4,F0,F2ADD.DF8,F6,F2ADD.DF12,F10,F2ADD.DF16,F14,F2S.DF4,0(R1)S.DF8,8(R1)DADDUIR1,R1,#32S.DF12,16(R1)S.DF16,8(R1)BNER1,R2,Loop
CA-Lec5 [email protected]
Com
piler Techniques
19
14 C.C/ 4 iterationsor 3.5 C.C/ iteration
-
ILPandDataDependencies
HW/SWmustpreserveprogramorder:orderinstructionswouldexecuteinifexecutedsequentiallyasdeterminedbyoriginalsourceprogram Dependencesareapropertyofprograms
Presenceofdependenceindicatespotential forahazard,butactualhazardandlengthofanystallispropertyofthepipeline
Importanceofthedatadependencies1)indicatesthepossibilityofahazard2)determinesorderinwhichresultsmustbecalculated3)setsanupperboundonhowmuchparallelismcanpossiblybeexploited
HW/SWgoal:exploitparallelismbypreservingprogramorderonlywhereitaffectstheoutcomeoftheprogram
CA-Lec5 [email protected] 20
-
UnrolledLoopDetail
Donotusuallyknowupperboundofloop Supposeitisn,andwewouldliketounrollthelooptomakek
copiesofthebody Insteadofasingleunrolledloop,wegenerateapairof
consecutiveloops: 1stexecutes(n mod k)timesandhasabodythatistheoriginalloop 2ndistheunrolledbodysurroundedbyanouterloopthatiterates
(n/k)times
Forlargevaluesofn,mostoftheexecutiontimewillbespentintheunrolledloop
CA-Lec5 [email protected]
Com
piler Techniques
21
-
5LoopUnrollingDecisions Requiresunderstandinghowoneinstructiondependsonanotherandhowthe
instructionscanbechangedorreorderedgiventhedependences:1. Determineloopunrollingusefulbyfindingthatloopiterationswereindependent
(exceptformaintenancecode)2. Usedifferentregisterstoavoidunnecessaryconstraintsforcedbyusingsame
registersfordifferentcomputations3. Eliminatetheextratestandbranchinstructions andadjustthelooptermination
anditerationcode4. Determinethatloadsandstoresinunrolledloopcanbeinterchanged by
observingthatloadsandstoresfromdifferentiterationsareindependent Transformationrequiresanalyzingmemoryaddressesandfindingthattheydonotrefer
tothesameaddress
5. Schedulethecode,preservinganydependencesneededtoyieldthesameresultastheoriginalcode
CA-Lec5 [email protected] 22
-
3LimitstoLoopUnrolling1. Decreaseinamountofoverheadamortizedwitheachextra
unrolling AmdahlsLaw
2. Growthincodesize Forlargerloops,concernitincreasestheinstructioncache
missrate3. Registerpressure(compilerlimitation):potentialshortfallin
registerscreatedbyaggressiveunrollingandscheduling Ifnotbepossibletoallocatealllivevaluestoregisters,
maylosesomeorallofitsadvantage Loopunrollingreducesimpactofbranchesonpipeline;
anotherwayisbranchprediction
CA-Lec5 [email protected] 23
-
GettingCPIbelow1 CPI 1ifissueonly1instructioneveryclockcycle Multipleissueprocessors comein3flavors:
1. staticallyscheduledsuperscalar processors,2. dynamicallyscheduledsuperscalarprocessors,and3. VLIW (verylonginstructionword)processors
2typesofsuperscalarprocessorsissuevaryingnumbersofinstructionsperclock useinorderexecutioniftheyarestaticallyscheduled,or outoforderexecutioniftheyaredynamicallyscheduled
VLIWprocessors,incontrast,issueafixednumberofinstructionsformattedeitherasonelargeinstructionorasafixedinstructionpacketwiththeparallelismamonginstructionsexplicitlyindicatedbytheinstruction(Intel/HPItanium)
CA-Lec5 [email protected] 24
-
BasicVLIW(VeryLongInstructionWord)
AVLIWusesmultiple,independentfunctionalunits AVLIWpackagesmultipleindependentoperationsintooneverylong
instruction Theburdenforchoosingandpackagingindependentoperationsfalls
oncompiler HWinasuperscalarmakestheseissuedecisionsisunnecessary
VLIWdependsonenoughparallelismforkeepingFUsbusy Loopunrollingandthencodescheduling Compilermayneedtodolocalschedulingandglobalscheduling
HereweconsideraVLIWprocessormighthaveinstructionsthatcontain5operations,including1integer(orbranch),2FP,and2memoryreferences DependontheavailableFUsandfrequencyofoperation
CA-Lec5 [email protected] 25
-
VLIW:VeryLargeInstructionWord
Eachinstruction hasexplicitcodingformultipleoperations InIA64,groupingcalledapacket InTransmeta,groupingcalledamolecule (withatoms asops)
Tradeoffinstructionspaceforsimpledecoding Thelonginstructionwordhasroomformanyoperations Bydefinition,alltheoperationsthecompilerputsinthelong
instructionwordareindependent=>executeinparallel E.g.,2integeroperations,2FPops,2Memoryrefs,1branch
16to24bitsperfield=>7*16or112bitsto7*24or168bitswide Needcompilingtechniquethatschedulesacrossseveralbranches
CA-Lec5 [email protected] 26
-
Recall:UnrolledLoopthatMinimizesStallsforScalar
CA-Lec5 [email protected] 27
1 Loop: L.D F0,0(R1)2 L.D F6,-8(R1)3 L.D F10,-16(R1)4 L.D F14,-24(R1)5 ADD.D F4,F0,F26 ADD.D F8,F6,F27 ADD.D F12,F10,F28 ADD.D F16,F14,F29 S.D 0(R1),F410 S.D -8(R1),F811 S.D -16(R1),F1212 DSUBUI R1,R1,#3213 BNEZ R1,LOOP14 S.D 8(R1),F16 ; 8-32 = -24
14 clock cycles, or 3.5 per iteration
L.D to ADD.D: 1 CycleADD.D to S.D: 2 Cycles
-
LoopUnrollinginVLIW
CA-Lec5 [email protected] 28
Unrolled 7 times to avoid delays7 results in 9 clocks, or 1.29 clocks per iteration 23 ops in 9 clocks, average 2.5 ops per clock, 50% efficiency Note: Need more registers in VLIW
-
VLIWProblems
Increaseincodesize Ambitiousloopunrolling Wheneverinstructionsarenotfull,theunusedFUstranslatetowaste
bitsintheinstructionencoding Aninstructionmayneedtobeleftcompletelyemptyifnooperationcanbescheduled
Cleverencodingorcompress/decompress
Binarycodecompatibility differentnumbersoffunctionalunitsandunitlatenciesrequire
differentversionsofthecode Needrecompilation Solution:Objectcodetranslationoremulation
CA-Lec5 [email protected] 29
-
Intel/HPIA64ExplicitlyParallelInstructionComputer(EPIC)
IA64:instructionsetarchitecture 12864bitintegerregs+12882bitfloatingpointregs
NotseparateregisterfilesperfunctionalunitasinoldVLIW Hardwarechecksdependencies
(interlocks=>binarycompatibilityovertime) Predicatedexecution(select1outof641bitflags)
=>40%fewermispredictions? Itanium wasfirstimplementation(2001)
Highlyparallelanddeeplypipelinedhardwareat800Mhz 6wide,10stagepipelineat800Mhzon0.18 process
Itanium2 isnameof2ndimplementation(2005) 6wide,8stagepipelineat1666Mhzon0.13 process Caches:32KBI,32KBD,128KBL2I,128KBL2D,9216KBL3
CA-Lec5 [email protected] 30
-
AnotherPossibility:SoftwarePipelining
Observation:ifiterationsfromloopsareindependent,thencangetmoreILPbytakinginstructionsfromdifferent iterations
Softwarepipelining:reorganizesloopssothateachiterationismadefrominstructionschosenfromdifferentiterationsoftheoriginalloop
Iteration 0 Iteration
1 Iteration 2 Iteration
3 Iteration 4
Software- pipelined iteration
-
ControlHazardAvoidance
ConsiderEffectsofIncreasingtheILP Controldependenciesrapidlybecomethelimitingfactor Theytendtonotgetoptimizedbythecompiler
Higherbranchfrequenciesresult Plusmultipleissue(morethanoneinstructions/sec)morecontrolinstructionspersec.
Controlstallpenaltieswillgoupasmachinesgofaster AmdahlsLawinaction again
BranchPrediction:helpsifcanbedoneforreasonablecost Staticbycompiler:appendixC
e.g.predictnottaken,delaybranch DynamicbyHW:thissection
CA-Lec5 [email protected] 32
-
DynamicBranchPrediction
Whydoespredictionwork? Underlyingalgorithmhasregularities Datathatisbeingoperatedonhasregularities Instructionsequencehasredundancies thatareartifactsofwaythat
humans/compilersthinkaboutproblems
Isdynamicbranchpredictionbetterthanstaticbranchprediction? Seemstobe Thereareasmallnumberofimportantbranchesinprogramswhich
havedynamicbehavior
CA-Lec5 [email protected] 33
-
DynamicBranchPrediction Thepredictorwilldependonthebehaviorofthebranchat
runtime Goals:
allowtheprocessortoresolvetheoutcomeofabranchearly,preventcontroldependencesfromcausingstalls
Effectivenessofabranchpredictionschemedependsnotonlyontheaccuracybutalsoonthecostofabranch BP_Performance=f (accuracy,costofmisprediction)
BranchHistoryTable(BHT) LowerbitsofPCaddressindextableof1bitvalues
Nopreciseaddresscheck justmatchthelowerbits Sayswhetherornotbranchtakenlasttime
CA-Lec5 [email protected] 34
-
1bitDynamicHardwarePrediction
CA-Lec5 [email protected] 35
predict taken predict not-taken
taken
not-taken
taken not-taken
Problem: Loop case
LOOP: LOAD R1, 100(R2)MUL R6, R6, R1SUBI R2, R2, #4BNEZ R2, LOOP
The steady-state prediction behavior will mispredict on the first and last loop iterations
-
BHTPrediction
CA-Lec5 [email protected] 36
If two branch instructions withthe same lower bits
Useful only for the target address is known before CC is decided
-
ProblemwiththeSimpleBHT
Aliasing Allbrancheswiththesameindex(lower)bitsreferencesameBHT
entry Hencetheymutuallypredicteachother Noguaranteethatapredictionisright.Butitmaynotmatteranyway
Avoidance Makethetablebigger OKsinceitsonlyasinglebitvector Thisisacommoncacheimprovementstrategyaswell
Othercachestrategiesmayalsoapply
Considerhowthisworksforloops Alwaysmispredicttwiceforeveryloop
Oneisunavoidablesincetheexitisalwaysasurprise Howeverpreviousexitwillalwayscauseamispredictionthefirsttryofeverynewloopentry
CA-Lec5 [email protected] 37
clear benefit is that its cheap and understandable
-
NbitPredictors
2bitcounterimplies4states Statistically2bitsgetsmostoftheadvantage
CA-Lec5 [email protected] 38
idea: improve on the loop entry problem
predict taken
predict not-taken
taken
not-takentaken
predict not-taken
predict taken
not-taken
not-taken
not-taken
taken
taken
Compiler could hint 11 init. on loop branches or it will go to 11 anyway in the 4th iteration
Only the loop exit causes a mispredict
11 10
01 00
A prediction must miss twice before it is changed
-
BHTAccuracy Mispredictbecauseeither:
Wrongguessforthatbranch(accuracy) Gotbranchhistoryofwrongbranchwhenindexthetable(size)
CA-Lec5 [email protected] 04-39
4K of BPB with 2-bit entries misprediction rates on SPEC89@IBM Power
-
ToIncreasetheBHTSize 4096aboutasgoodasinfinitetable Thehitrateofthebufferisclearlynotthelimitingfactorforanenough
largeBHTsize
CA-Lec5 [email protected] 04-40
-
Theworstcaseforthe2bitpredictor
if (aa==2)aa=0;
if (bb==2)bb=0;
if (aa != bb) {
CA-Lec5 [email protected] 41
DSUBUI R3, R1, #2BNEZ R3, L1 ;branch b1(aa!=2)DADDD R1, R0, R0 ;aa=0
L1: DSUBUI R3, R2, #2BNEZ R3, L2 ;branch b2(bb!=2)DADDD R2, R0, R0 ;bb=0
L2: DSUBU R3, R1, R2BEQZ R3, L3 ;branch b3(aa==bb)
if the first 2 untaken then the 3rd will always be taken
aa and bb are assigned to R1 and R2
-
ImprovePredictionStrategyByCorrelatingBranches
Considertheworstcaseforthe2bitpredictorif (aa==2) then aa=0;if (bb==2) then bb=0;if (aa != bb) then whatever
singlelevelpredictorscannevergetthiscase
Correlatingor2levelpredictors Thepredictorusesthebehaviorofotherbranch(es)tomakea
prediction Correlation=whathappenedonthelastbranch Predictor=whichwaytogo
CA-Lec5 [email protected] 42
if the first 2 fail then the 3rdwill always be taken
-
CorrelatingBranches Hypothesis:recentlyexecutedbranchesarecorrelated
Idea:recordmmostrecently executedbranchesastakenornottaken,andusethatpatterntoselecttheproperbranchhistorytable
Ingeneral,(m,n)predictor meansrecordlastmbranchestoselectbetween2m historytableseachwithnbitcounters Old2bitBHTisthena(0,2)predictor
GlobalBranchHistory:mbitshiftregisterkeepingT/NTstatusoflastmbranches
Eachentryintablehasmnbitpredictors
CA-Lec5 [email protected] 43
Two-level predictors
-
2Level(m,n)BHT Usethebehaviorofthelastmbranches tochoosefrom2m branchpredictors,eachofwhichisannbitpredictor forasinglebranch
Totalbitsforthe(m,n)BHTpredictionbuffer:
pbitsofbufferindex =2P bitBHT 2m banksofmemory selectedbytheglobalbranchhistory(whichisjustashiftregister) e.g.acolumnaddress
Usepbitsofthebranchaddresstoselectrow Getthenpredictorbits intheentrytomakethedecision
CA-Lec5 [email protected] 44
pm nbitsmemoryTotal 22__
-
(2,2)PredictorImplementation
CA-Lec5 [email protected] 45
4 banks = each with 32 2-bit predictor entries
p=5m=2n=2
5:32
-
ExampleofCorrelatingBranchPredictors
if (d==0)d = 1;
if (d==1)
CA-Lec5 [email protected] 46
BNEZ R1, L1 ;branch b1 (d!=0)DAAIU R1, R0, #1 ;d==0, so d=1
L1: DAAIU R3, R1, #-1BNEZ R3, L2 ;branch b2 (d!=1)
L2:
d is assigned to R1
-
ExampleofCorrelatingBranchPredictors(Cont.)initial
value of dd==0? b1 value of d
before b2d==1? b2
0 YES not taken 1 YES not taken1 NO taken 1 YES not taken2 NO taken 2 NO taken
CA-Lec5 [email protected] 47
d=? b1 prediction
b1 action New b1 prediction
b2 prediction
b2 action New b2 prediction
2 NT T T NT T T
0 T NT NT T NT NT
2 NT T T NT T T
0 T NT NT T NT NT
1-bit predictor initialized to NT
All the branches are mispredicted !!!
-
ExampleofCorrelatingBranchPredictors(Cont.)Prediction bits Prediction if last branch
not takenPrediction if last branch
takenNT/NT NT NTNT/T NT TT/NT T NTT/T T T
CA-Lec5 [email protected] 48
d=? b1 prediction
b1 action New b1 prediction
b2 prediction
b2 action New b2 prediction
2 NT/NT T T/NT NT/NT T NT/T0 T/NT NT T/NT NT/T NT NT/T2 T/NT T T/NT NT/T T NT/T0 T/NT NT T/NT NT/T NT NT/T
Use 1-bit correlation + 1-bit prediction with initialized to NT/NT(1,1) predictor
-
Comparison:AccuracyofDifferent2bitPredictors
CA-Lec5 [email protected] 04-49
8K bits
8K bits
-
TournamentPredictors Recallthatthecorrelatorisjustalocalpredictor Adaptivelycombinelocalandglobal predictors
Multiplepredictors Onebasedonglobalinformation:Resultsofrecentlyexecutedmbranches
Onebasedonlocalinformation:Resultsofpastexecutionsofthecurrentbranchinstruction
Selector tochoosewhichpredictorstouse E.g.:2bitsaturatingcounter,incrementedwheneverthepredictedpredictoriscorrectandtheotherpredictorisincorrect,anditisdecrementedinthereversesituation
Advantage Abilitytoselecttherightpredictorfortherightbranch
CA-Lec5 [email protected] 50
The most popular one
-
StateTransitionDiagramforATournamentPredictor
CA-Lec5 [email protected] 51
-
BranchPredictionPerformance
CA-Lec5 [email protected] 52
-
HighPerformanceInstructionDelivery
Foramultipleissueprocessor,predictingbrancheswellisnotenough
Deliverahighbandwidthinstructionstream isnecessary(e.g.,4~8instructions/cycle) Branchtargetbuffer Integratedinstructionfetchunit Indirectbranchbypredictingreturnaddress
CA-Lec5 [email protected] 53
-
BranchTargetBuffer/Cache
Toreducethebranchpenaltyfrom1cycleto0 NeedtoknowwhattheaddressisattheendofIF Buttheinstructionisnotevendecodedyet Sousetheinstructionaddressratherthanwaitfordecode
Ifpredictionworksthenpenaltygoesto0!
BTBIdea Cachetostoretakenbranches(noneedtostoreuntaken) AccesstheBTBduringIFstage Matchtagisinstructionaddress comparewithcurrentPC DatafieldisthepredictedPC
Maywanttoaddpredictorfield Toavoidthemispredict twiceoneveryloopphenomenon Addscomplexitysincewenowhavetotrackuntakenbranchesaswell
CA-Lec5 [email protected] 54
-
BTB Illustration
CA-Lec5 [email protected] 55
-
FlowchartforBTB
CA-Lec5 [email protected] 56
Send out the predicted PC
A branch?
-
PenaltiesUsingthisApproachfor5StageMIPS
Instruction in buffer
Prediction Actual Branch Penalty Cycles
Yes Taken Taken 0Yes Taken Not Taken 2No Taken 2No Not Taken 0
CA-Lec5 [email protected] 57
Note: Predict_wrong = 1 CC to update BTB + 1 CC to restart fetching Not found and taken = 2CC to update BTBNote: For complex pipeline design, the penalties may be higher
-
Example Givenpredictionaccuracy(forinst.inbuffer):90% Givenhitrateinbuffer(forbranchespredictedtoken):90% Assume60%ofthebranchesaretaken Determinethetotalbranchpenalty=?
Solution Probability(branchinbuffer,butactuallynottaken)=percentbuffer
hitrate percentincorrectprediction=90% 10%=0.09 Probability(branchnotinbuffer,butactuallytaken)=10% Hence,wehave2cycles (0.09+0.1)=0.38cycles
Comparingthedelaybranchwiththepenalty=0.5cycles/branch
CA-Lec5 [email protected] 58
-
IntegratedInstructionFetchUnits
Considerthefetchunitasaseparateautonomousunit,notapipelinestage
Functionsfortheintegratedinstructionfetchunit Branchprediction Prefetch
Todelivermultipleinstructionspercycle Instructionmemoryaccessandbuffering
mayrequireaccessingmultiplecachelines prefetchmayhidethelatencyformemoryaccess bufferingmaybenecessary
CA-Lec5 [email protected] 59
-
ReturnAddressPredictor
Indirectjump jumpswhosedestinationaddressvariesatruntime indirectprocedurecall,selectorcase,procedurereturn SPEC89benchmarks:85%ofindirectjumpsareprocedure
returns AccuracyofBTBforprocedurereturnsarelow
ifprocedureiscalledfrommanyplaces,andthecallsfromoneplacearenotclusteredintime
Useasmallbufferofreturnaddressesoperatingasastack Cachethemostrecentreturnaddresses Pushareturnaddressatacall,andpoponeoffatareturn Ifthecacheissufficientlarge(maxcalldepth) prefect
CA-Lec5 [email protected] 60
-
DynamicBranchPredictionSummary
Branchpredictionschemearelimitedby Predictionaccuracy Mispredictionpenalty
BranchHistoryTable:2bitsforloopaccuracy Correlation:Recentlyexecutedbranchescorrelatedwithnextbranch Tournamentpredictorstakeinsighttonextlevel,byusingmultiple
predictors usuallyonebasedonglobalinformationandonebasedonlocalinformation,
andcombiningthemwithaselector In2006,tournamentpredictorsusing 30Kbitsareinprocessorslikethe
Power5andPentium4 BranchTargetBuffer:includebranchaddress&prediction Reducepenaltyfurtherbyfetchinginstructionsfromboththepredicted
andunpredicteddirection Requiredualportedmemory,interleavedcache HWcost CachingaddressesorinstructionsfrommultiplepathinBTB
CA-Lec5 [email protected] 61