
Page 1: A Comprehensive Instruction Fetch Mechanism for a Processor Supporting Speculative Execution (hps.ece.utexas.edu/pub/yeh_micro25.pdf · 2010-06-28)

A Comprehensive Instruction Fetch Mechanism for a Processor Supporting Speculative Execution

Tse-Yu Yeh and Yale N. Patt

Department of Electrical Engineering and Computer Science The University of Michigan

Ann Arbor, Michigan 48109-2122

Abstract

A superscalar processor supporting speculative execution requires an instruction fetch mechanism that can provide instruction fetch addresses as nearly correct as possible and as soon as possible in order to reduce the likelihood of throwing away speculative work. In this paper we propose a comprehensive instruction fetch mechanism to satisfy that need. Implementation issues are identified, possible solutions and designs for resolving those issues are simulated, and the results of these simulations are presented. A metric for measuring the average penalty of executing a branch instruction is introduced and used to evaluate the performance of our instruction fetch mechanism. We achieve an average performance of 1.19 IPC on the original SPEC benchmarks in a machine which can ideally execute five instructions per cycle by using the proposed mechanism.

1 Introduction

The importance of instruction fetch becomes vital as deeply-pipelined superscalar machines become prevalent, due to the correspondingly larger loss of work caused by both branch mispredictions and the inordinate latency to fetch a new instruction stream. Higher prediction accuracy means less speculative work needs to be thrown away, resulting in increased machine performance. Decreased latency means the newly directed instruction stream can begin execution faster.

In this paper we propose a comprehensive instruction fetch mechanism for a superscalar processor supporting speculative execution. It consists of a conditional branch predictor, the Two-Level Adaptive Branch Predictor [1, 2, 3], a cache for storing branch target addresses, a return address stack for storing the return addresses of subroutine calls, and a pipeline which can generate one predicted instruction fetch address each cycle. There is no extra cycle delay between predictions of two consecutive instruction fetch addresses, even when these two consecutive addresses involve predictions for the same branch.

We introduce a new metric called the average Branch Execution Penalty (BEP) to evaluate the performance of instruction fetch mechanisms. BEP measures the average number of cycles lost due to executing a branch instruction. This is the number of wasted cycles between a branch instruction and the next legitimate instruction in the pipeline. The delivered machine performance in instructions per cycle (IPC) can then be derived from the ideal machine performance, branch probability, and the branch execution penalty.

We examine various design choices with respect to our comprehensive instruction fetch mechanism. We compare the use of separate structures for storing conditional and unconditional branch information versus the use of a combined structure. We compare the use of various static predictors to be used in the event of a branch prediction miss. Finally, we compare our proposed instruction fetch mechanism with other mechanisms which use different conditional branch predictors.

This paper is organized into six sections. Section two introduces our comprehensive instruction fetch mechanism. Section three describes the average branch execution penalty metric which we use to evaluate the effectiveness of different instruction fetch mechanisms. Section four discusses the simulation models and traces used in this study. Section five reports the simulation results and our analysis. Section six contains some concluding remarks.

2 Instruction Fetch Mechanism Design

Approximately 20 percent of dynamic instructions are branch instructions, as shown in the trace analysis in [3]. Since multiple instructions are issued each cycle, it is likely that at least one branch is issued every cycle in a wide-issue machine. Most branches cannot be resolved as soon as they are fetched. Delayed branches are not an acceptable solution because there are too many delay slots to fill. Thus, an effective comprehensive instruction fetch mechanism must deal with rapid prediction of branch target addresses.

129 0-8186-3175-9/92 $3.00 © 1992 IEEE

In general, there are four classes of branch instructions: conditional, immediate unconditional, indirect unconditional, and return. The return instruction distinguishes itself from the unconditional branch instruction by its semantics: a return instruction is paired with a function call. The prediction mechanisms for these four classes of branches and the comprehensive design of the instruction fetch mechanism to handle them are described below.

2.1 Conditional Branch Prediction

We chose to use the Two-Level Adaptive branch predictor [1, 2, 3] and a branch target buffer for conditional branch predictions. The Two-Level Adaptive branch predictor achieves substantially higher accuracy in predicting branch paths than other dynamic conditional branch prediction schemes. The branch target buffer reduces the delay in providing instruction fetch addresses. Each branch target buffer entry stores both the branch target address and the starting address of the fall-through basic block. A basic block is a sequence of consecutive instructions having exactly one entry point and exactly one exit point. A branch prediction is correct if both its direction and its target address are correct.

Machines using a static branch predictor without a branch target buffer must wait for the target address to be generated if the branch is predicted taken, but not if the branch is predicted not taken. Trace scheduling [17] and Superblock scheduling [7] can reduce this delay by rearranging the instructions so that the fall-through path is the more likely branch path. The percentage of taken conditional branches can be reduced from 62 percent to approximately 50 percent by using trace scheduling. Still, about 50 percent of the instruction fetches following branches have to be delayed until the instruction fetch addresses are calculated. A branch target buffer eliminates that delay.

2.2 Immediate and Indirect Unconditional Branch Prediction

Unconditional branch target addresses are predicted by using a branch target buffer in our instruction fetch mechanism. When an unconditional branch misses in the branch target buffer, the target address of an immediate unconditional branch instruction is calculated by adding an offset to the address of the branch instruction. The target address of an indirect branch instruction is calculated by previous instructions. The delays for calculating the target addresses of these two types of unconditional branches are different. The target address of an immediate branch can be calculated immediately after the branch is decoded. However, the target address of an indirect branch has to wait until the register value is calculated.

2.3 Return Instruction Prediction

Return instruction target addresses are predicted by using a return address stack. A return instruction has the same execution penalty as an indirect unconditional branch if no prediction mechanism is provided. Since a return is always paired with a function call, a simple return address stack (RAS) is very effective for predicting the return address. A return address is pushed onto the return address stack when a function call is encountered, and popped when a return instruction occurs. In order to incur no delay between instruction fetches, an entry is allocated in the branch target buffer for the return instruction to store branch-type information. When the return is fetched, the type information identifies the presence of a return instruction, and the top of the return address stack is selected to be the next fetch address. Without an entry in the branch target buffer, the instruction is not known to be a return instruction until after it is decoded. Since the depth of function calls is variable, we store the return addresses of the most-recent function calls and discard older ones when the stack overflows.
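The push/pop discipline just described can be sketched in a few lines. This is an illustrative model, not the hardware design; the only assumption beyond the text is representing the bounded stack with a deque that silently drops its oldest entry on overflow.

```python
from collections import deque

class ReturnAddressStack:
    """Illustrative model of a bounded return address stack (RAS).
    Push on a function call, pop on a return; when the stack
    overflows, the oldest return address is discarded."""

    def __init__(self, depth=32):
        # deque(maxlen=...) drops the oldest element when full,
        # matching the discard-older-entries overflow policy.
        self.entries = deque(maxlen=depth)

    def on_call(self, return_address):
        self.entries.append(return_address)

    def on_return(self):
        # The predicted target of a return is the top of the stack.
        # An empty stack (entries lost to overflow) yields no prediction.
        return self.entries.pop() if self.entries else None

ras = ReturnAddressStack(depth=2)
ras.on_call(0x100)
ras.on_call(0x200)
ras.on_call(0x300)           # overflow: 0x100 is discarded
print(hex(ras.on_return()))  # 0x300
print(hex(ras.on_return()))  # 0x200
print(ras.on_return())       # None: 0x100 was lost, so this return mispredicts
```

The tiny depth of 2 is used only to show the overflow behavior; the paper's design uses a 32-entry stack.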

2.4 Comprehensive Design

Figure 1 shows the comprehensive design of the instruction fetch mechanism. The branch history table (BHT) used in the Two-Level Adaptive branch predictor is merged with the branch target buffer (BTB). Conditional branches, unconditional branches, and return instructions all access the branch target buffer by using the starting addresses of the basic blocks in which they occur. Each entry of the merged branch target buffer (shown in Figure 1) contains a valid bit, address tag, target address, fall-through basic block address, branch path prediction bit, branch history bits, and branch-type bits.
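As a data-structure sketch, one entry of the merged BTB might be modeled as below. The field names follow the list above, but the types and the string encoding of the branch-type bits are illustrative choices, not the paper's encoding.

```python
from dataclasses import dataclass

@dataclass
class BTBEntry:
    """Illustrative model of one merged BTB/BHT entry (Section 2.4).
    Field widths are implementation-dependent and not specified here."""
    valid: bool                # valid bit
    tag: int                   # address tag from the basic block's starting address
    target_address: int        # branch target address
    fall_through_address: int  # starting address of the fall-through basic block
    prediction: bool           # branch path prediction bit (prefetched; see below)
    history: int               # branch history bits for the Two-Level predictor
    branch_type: str           # branch-type bits: 'cond', 'uncond', or 'return'

entry = BTBEntry(valid=True, tag=0x4000, target_address=0x5000,
                 fall_through_address=0x4010, prediction=True,
                 history=0b111111111111, branch_type='cond')
print(entry.branch_type)  # cond
```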

The address used to access the I-cache is also used to access the merged branch target buffer. These accesses are done at the same time, allowing the instruction fetch mechanism to fetch (up to) one basic block


Figure 1: A comprehensive instruction fetch mechanism using Two-Level Adaptive branch prediction, a branch target buffer, and a return address stack.

Figure 2: Pipeline timing diagram of the instruction fetch mechanism using a Two-Level Adaptive branch predictor.

each cycle. Since there is at most one branch instruction in each basic block, the accessing address is used to index into the branch target buffer.

When an accessing address hits in the branch target buffer, the information stored in the entry is known right after the access. The branch-type bits can be used to choose the prediction source before the branch is decoded. If the branch is conditional, the next instruction fetch address is either the fall-through address or the target address, depending on the branch path prediction bit. If the branch is a return instruction, the next address is obtained from the top of the return address stack. If the branch is unconditional, the next address is the target address.
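The hit-path selection above amounts to a small multiplexer keyed by the branch-type bits. A sketch follows, using an entry object with illustrative field names (`branch_type`, `prediction`, `target_address`, `fall_through_address`) and the RAS top as inputs; none of these names come from the paper.

```python
from types import SimpleNamespace

def next_fetch_address(entry, ras_top):
    """Select the next instruction fetch address on a BTB hit.
    Field names on `entry` are illustrative, not from the paper."""
    if entry.branch_type == 'cond':
        # Conditional: follow the branch path prediction bit.
        return entry.target_address if entry.prediction else entry.fall_through_address
    if entry.branch_type == 'return':
        # Return: use the address on top of the return address stack.
        return ras_top
    # Unconditional (immediate or indirect): use the stored target.
    return entry.target_address

e = SimpleNamespace(branch_type='cond', prediction=False,
                    target_address=0x5000, fall_through_address=0x4010)
print(hex(next_fetch_address(e, ras_top=0x7000)))  # 0x4010
```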

When an accessing address misses in the branch target buffer, we continue fetching instructions sequentially. If a branch instruction is discovered after being decoded, an entry in the branch target buffer is allocated for that branch, indexed by the accessing address. If the branch is conditional, a default static branch predictor is used to predict the branch path by using static information. If the branch is a return, the address on the top of the return address stack is predicted as its return address. If the branch is an immediate unconditional branch, its target address can be calculated by adding the offset to the branch's instruction address. If the branch is an indirect branch, its target address is not available until the source register value is calculated.

Speculative updates of branch history with predictions were suggested in [3] to provide the Two-Level Adaptive conditional branch predictor with the most-recent branch history. When a branch misprediction occurs, the branch history is updated with incorrect information. However, the predictions made during the speculative work down the wrong path of the mispredicted branch, before the correct instructions are fetched, do not affect the machine performance, because that work is thrown away when the misprediction is repaired. Pattern table updates, on the other hand, are delayed until the branch results become ready. We simulate the update delay in this study to show that the delay has negligible effect on performance. In addition, lookahead prediction was also suggested in [3] to make predicting a branch in one cycle possible. An extra prediction bit in the branch target buffer entry is used to store the prefetched prediction from the pattern history table, obtained by using the speculatively-updated branch history. Therefore, the prediction for the next execution requires only one table access.

An example of the pipeline design of the instruction fetch mechanism with our Two-Level Adaptive Branch Predictor is shown in Figure 2. The branch history update is done speculatively, and the pattern history update is delayed until the branch is resolved and successfully retired. In order to make predictions for the same branch in back-to-back cycles, the prefetched prediction from the pattern history table needs to be able to bypass the branch history table to serve as the prediction for the next execution. This can be done by comparing the accessing address of the previous cycle with the accessing address of the current cycle. If the accessing addresses match, the prefetched prediction from the pattern history table is used without any delay. This


bypassing mechanism is important for executing tight loops. In a superscalar machine the entire iteration is often issued in one cycle, so predicting for the same branch in back-to-back cycles occurs frequently. If the loop executes only a few times before it exits, it is vital to predict the loop-exiting branch correctly, which can be achieved by providing the predictor with the most-recent branch history in time.
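The address comparison described above can be sketched as a one-line select. Here `prefetched_pred` stands for the prediction computed last cycle from the pattern history table using the speculatively-updated history, and `btb_pred` for the possibly stale prediction bit read from the BTB this cycle; both names are illustrative.

```python
def select_prediction(curr_addr, prev_addr, prefetched_pred, btb_pred):
    """Back-to-back prediction bypass (illustrative sketch).
    If the same basic block is fetched in consecutive cycles, the
    prediction bit stored in the BTB has not been refreshed yet, so
    the prediction prefetched from the pattern history table last
    cycle bypasses it."""
    return prefetched_pred if curr_addr == prev_addr else btb_pred

# Tight loop: same accessing address two cycles in a row.
print(select_prediction(0x4000, 0x4000, prefetched_pred=False, btb_pred=True))  # False
# Different block: use the BTB's stored prediction bit.
print(select_prediction(0x4000, 0x3000, prefetched_pred=False, btb_pred=True))  # True
```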

3 Branch Execution Penalty (BEP) and Delivered Machine Performance

Branch prediction accuracy is not used as a metric for performance comparisons between different instruction fetch mechanisms because it reflects only the efficiency of the part which predicts the path of the branch. There are different delays associated with correct branch path predictions depending on whether or not the branch target buffer is hit. Therefore, additional factors such as the BTB access time, the instruction access time, the number of cycles needed for instruction decoding, and the average number of cycles wasted for an incorrect branch prediction need to be considered.

To evaluate the performance closely, we have introduced the notion of a branch execution penalty (BEP), which we define as the number of wasted cycles between a branch instruction and the next legitimate instruction in the pipeline. The ultimate goal of designing an instruction fetch mechanism is to achieve a zero-cycle penalty for every branch, such that instructions can be fetched every cycle without idling and no prefetched instructions get discarded. This would require the correct fetch address to be known every cycle.

The meaning of each parameter in the evaluation model is as follows:

C_BR_Resolve: Average number of cycles for resolving a branch, measured from the cycle after a branch is fetched until the branch is resolved.

C_Effective_Addr: Number of cycles for generating the target address of a conditional branch or an immediate unconditional branch after the branch is fetched. (Returns and indirect branches require results from previous instructions; that delay is therefore accounted for in C_BR_Resolve and is not included in C_Effective_Addr.)

C_I_Cache_Access: Number of cycles for accessing the instruction cache, measured from the time the instruction fetch address is ready to the time the fetched instructions are ready to enter the machine pipe.

C_BTB_Access: Number of cycles for accessing the BTB, measured from when the accessing address is ready until the prediction and the target address of the branch are both ready.

C_Incorr_Pred: Average number of cycles wasted for an incorrect branch prediction, measured from the cycle after the start of a branch in the machine pipe until the cycle before the start of the correct instructions in the machine pipe when an incorrect branch prediction occurs. Therefore, C_Incorr_Pred = C_BR_Resolve + C_I_Cache_Access - 1.

P(BR_Cond_Br): Probability that a branch is a conditional branch.

P(BR_Return): Probability that a branch is a return.

P(BR_Imm_Br): Probability that a branch is an immediate branch.

P(BR_Ind_Br): Probability that a branch is an indirect branch.

P(CBP_Corr, BTB_Miss): Probability that a conditional branch is predicted correctly and the BTB is missed.

P(CBP_Corr, BTB_Hit): Probability that a conditional branch is predicted correctly and the BTB is hit.

P(CBP_Corr): Probability that a conditional branch is predicted correctly. Therefore, P(CBP_Corr) = P(CBP_Corr, BTB_Miss) + P(CBP_Corr, BTB_Hit).

P(CBR_Taken | CBP_Corr, BTB_Miss): Probability that a conditional branch is taken, given that the branch is correctly predicted and the BTB is missed.

P(CBR_Taken | CBP_Corr, BTB_Hit): Probability that a conditional branch is taken, given that the branch is correctly predicted and the BTB is hit.

P(RAS_Corr, BTB_Miss): Probability that a return address is predicted correctly by the RAS and the BTB is missed.

P(RAS_Corr, BTB_Hit): Probability that a return address is predicted correctly by the RAS and the BTB is hit.

P(RAS_Corr): Probability that a return address is predicted correctly by the RAS. Therefore, P(RAS_Corr) = P(RAS_Corr, BTB_Miss) + P(RAS_Corr, BTB_Hit).

P(BTB_Hit | BR_Imm_Br): Probability that an immediate branch hits in the BTB.

P(BTB_Hit | BR_Ind_Br): Probability that an indirect branch hits in the BTB.

C_Effective_Addr, C_I_Cache_Access, and C_BTB_Access depend on the machine implementation. C_BR_Resolve and C_Incorr_Pred depend not only on the machine implementation but also on the program behavior. The value of each probability described above can be derived from the instruction fetch mechanism simulations. Those probabilities are important in calculating the exact delay in deciding the correct instruction fetch address after a branch under different cases.

The penalties in the comprehensive instruction


Table 1: Execution penalties of conditional branches, returns, immediate unconditional branches, and indirect unconditional branches under different cases.

fetch mechanism for different cases of next fetch address predictions are listed in Table 1. Those cases are divided into groups of conditional branch execution penalty, return execution penalty, immediate unconditional branch execution penalty, and indirect unconditional branch execution penalty. Note that the branch target buffer is considered to be hit for an indirect branch target address prediction only when the target address stored in the entry is correct.

3.1 Average Branch Execution Penalty for Designs Using a Dynamic Branch Predictor with a Branch Target Buffer

The four different types of branches are weighted to derive the average number of penalty cycles for executing a branch instruction in the proposed instruction fetch mechanism. The equation is shown as follows:

Average Branch Execution Penalty
= P(BR_Cond_Br) × Avg Cond_Branch Execution Penalty +
  P(BR_Return) × Avg Return Execution Penalty +
  P(BR_Imm_Br) × Avg Imm_Branch Execution Penalty +
  P(BR_Ind_Br) × Avg Ind_Branch Execution Penalty    (1)

= P(BR_Cond_Br) × P(CBP_Corr, CBR_Not_Taken) × 0 +
  (P(BR_Cond_Br) × P(CBP_Corr, CBR_Taken, BTB_Hit) +
   P(BR_Return) × P(RAS_Corr, BTB_Hit) +
   P(BTB_Hit, BR_Imm_Br) + P(BTB_Hit, BR_Ind_Br)) × (C_BTB_Access - 1) +
  (P(BR_Cond_Br) × P(CBP_Corr, CBR_Taken, BTB_Miss) +
   P(BR_Return) × P(RAS_Corr, BTB_Miss) +
   P(BTB_Miss, BR_Imm_Br)) × (C_Effective_Addr + C_I_Cache_Access - 1) +
  (P(BR_Cond_Br) × (1 - P(CBP_Corr)) + P(BR_Return) × (1 - P(RAS_Corr)) +
   P(BTB_Miss, BR_Ind_Br)) × C_Incorr_Pred    (2)
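Equation (2) is straightforward to evaluate numerically. The sketch below plugs made-up probabilities and latencies into the formula purely to show the mechanics; none of these numbers come from the paper's measurements, and the dictionary keys are illustrative names for the quantities defined in Section 3.

```python
def average_bep(p, c):
    """Average branch execution penalty per Equation (2).
    `p` maps illustrative names to the probabilities of Section 3
    (joint probabilities such as P(BTB_Hit, BR_Imm_Br) are passed
    directly); `c` holds the cycle counts."""
    c_incorr_pred = c['br_resolve'] + c['icache'] - 1  # C_Incorr_Pred
    btb_hit = (p['cond'] * p['cbp_corr_taken_hit']
               + p['ret'] * p['ras_corr_hit']
               + p['btb_hit_imm'] + p['btb_hit_ind'])
    btb_miss = (p['cond'] * p['cbp_corr_taken_miss']
                + p['ret'] * p['ras_corr_miss']
                + p['btb_miss_imm'])
    mispred = (p['cond'] * (1 - p['cbp_corr'])
               + p['ret'] * (1 - p['ras_corr'])
               + p['btb_miss_ind'])
    return (btb_hit * (c['btb'] - 1)
            + btb_miss * (c['eff_addr'] + c['icache'] - 1)
            + mispred * c_incorr_pred)

# Made-up example values, purely to exercise the formula:
probs = dict(cond=0.80, ret=0.07,
             cbp_corr=0.97, cbp_corr_taken_hit=0.60, cbp_corr_taken_miss=0.01,
             ras_corr=0.99, ras_corr_hit=0.98, ras_corr_miss=0.01,
             btb_hit_imm=0.120, btb_hit_ind=0.005,
             btb_miss_imm=0.005, btb_miss_ind=0.005)
cycles = dict(btb=1, icache=1, eff_addr=1, br_resolve=4)
print(average_bep(probs, cycles))
```

With a one-cycle BTB access (C_BTB_Access = 1), the first term vanishes, so only BTB misses and mispredictions contribute to the penalty.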

3.2 Average Branch Execution Penalty for Designs Using Static Branch Predictors without a Branch Target Buffer

Since no BTB mechanism is used in the designs using static branch predictors, the average branch execution penalty is calculated differently.

In these designs, when a branch instruction is encountered, the instructions in the fall-through path are always prefetched before the branch is resolved. Therefore, for conditional branches, there is no penalty if the prediction of the branch is not taken and is correct. If the prediction made by the static predictor is taken and is correct, then the prefetched fall-through instructions are discarded. As a result, the branch execution penalty for this case is C_Effective_Addr + C_I_Cache_Access - 1 cycles. When a conditional branch prediction is incorrect, the time after the branch starts in the machine pipe until the time before the correct subsequent instruction starts in the machine pipe is the execution penalty for the mispredicted branch. For return predictions, a return address stack is supported; however, the return instruction is not detected until the instruction is decoded, so the return execution penalty when a return address is predicted correctly by the RAS is C_Effective_Addr + C_I_Cache_Access - 1 cycles. For immediate branches, their target addresses are known after the instructions are decoded, so the execution penalty is C_Effective_Addr + C_I_Cache_Access - 1 cycles. Indirect branches, on the other hand, have to wait for the target addresses generated by their previous instructions, so the execution penalty is C_BR_Resolve + C_I_Cache_Access - 1 cycles, which is comparable to the branch misprediction penalty. Since the machines modeled in this study have issue bandwidth higher than the average basic block size, delayed branching is not applied. The


equation for calculating the average branch execution penalty is shown as follows:

Average Branch Execution Penalty
= P(BR_Cond_Br) × Avg Cond_Branch Execution Penalty +
  P(BR_Return) × Avg Return Execution Penalty +
  P(BR_Imm_Br) × Avg Imm_Branch Execution Penalty +
  P(BR_Ind_Br) × Avg Ind_Branch Execution Penalty    (3)

= P(BR_Cond_Br) × P(CBP_Corr, CBR_Not_Taken) × 0 +
  (P(BR_Cond_Br) × P(CBP_Corr, CBR_Taken) +
   P(BR_Return) × P(RAS_Corr) + P(BR_Imm_Br)) ×
  (C_Effective_Addr + C_I_Cache_Access - 1) +
  (P(BR_Cond_Br) × (1 - P(CBP_Corr)) + P(BR_Return) × (1 - P(RAS_Corr)) +
   P(BR_Ind_Br)) × C_Incorr_Pred    (4)

3.3 Delivered Machine Performance in Instructions per Cycle (IPC)

After the average BEP per branch is known, the delivered machine performance in IPC of a machine using a practical instruction fetch mechanism can be derived from the ideal IPC (IIPC). The ideal IPC is the performance of a machine when an omniscient instruction fetch mechanism is used. Let us assume the total number of dynamic instructions executed is N and the probability that an instruction is a branch is P(BR), which is equal to P(BR_Cond_Br) + P(BR_Return) + P(BR_Imm_Br) + P(BR_Ind_Br). The IIPC and P(BR) depend greatly on the programs and the superscalar machine architecture. This performance model considers only the branch execution effect. The delivered IPC can be derived as follows:

Delivered IPC
= N / Number of Cycles Spent in the Execution
= N / (N/IIPC + BEP × N × P(BR))
= IIPC / (1 + BEP × P(BR) × IIPC)    (5)
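Equation (5) is easy to apply numerically. The sketch below uses the paper's ideal width of five together with placeholder values for BEP and P(BR); those two numbers are made up for illustration, not measured results.

```python
def delivered_ipc(iipc, bep, p_br):
    """Delivered IPC per Equation (5): IIPC / (1 + BEP * P(BR) * IIPC)."""
    return iipc / (1 + bep * p_br * iipc)

# Ideal 5-wide machine; assume 0.5 penalty cycles per branch and a
# 20 percent dynamic branch frequency (placeholder values).
print(delivered_ipc(iipc=5.0, bep=0.5, p_br=0.2))  # 5 / 1.5, about 3.33
```

Note how quickly BEP erodes the ideal width: even half a wasted cycle per branch costs a third of the machine's peak throughput at this branch frequency.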

Benchmark         Number of Static    Benchmark         Number of Static
                  Cond. Branches                        Cond. Branches
eqntott (eqn)     407                 doduc (dod)       1073
espresso (esp)    1009                fpppp (fpp)       643
gcc (gcc)         8044                matrix300 (mat)   199
li (li)           499                 spice2g6 (spi)    1266
                                      tomcatv (tom)     296

Table 2: Number of static conditional branches in each benchmark.

Nine benchmarks from the original SPEC benchmark suite are used in this branch prediction study. Five are floating point benchmarks and four are integer benchmarks. The floating point benchmarks include doduc, fpppp, matrix300, spice2g6, and tomcatv, and the integer ones include eqntott, espresso, gcc, and li. Nasa7 is not included because it takes too long to capture the branch behavior of all seven kernels. The workload is the same as used in [3]. The profiling of the benchmarks is done by using a separate training data set from the one used in testing. The number of static branches in the trace when each program is executed with the test data set is listed in Table 2.

4 Simulation Model

Figure 3: Distribution of dynamic branch instructions.

Trace-driven simulations were used in this study. A Motorola 88100 instruction-level simulator is used for generating instruction traces. The instruction and address traces are fed into the branch prediction simulator, which decodes instructions, predicts branches, and verifies the predictions against the branch results to collect statistics for branch prediction accuracy.

4.1 Description of Traces

In the traces generated with the testing data sets, about 24 percent of the dynamic instructions for the integer benchmarks and about 5 percent of the dynamic instructions for the floating point benchmarks are branch instructions. Figure 3 shows that about 80 percent of the dynamic branch instructions are conditional branches; therefore, the prediction mechanism for conditional branches is the most important one among the prediction mechanisms for different classes


of branches.

4.2 Characterization of Simulated Instruction Fetch Mechanisms

Three instruction fetch mechanisms, PAg, JS, and PROF, are simulated in this study. The three mechanisms differ with respect to the conditional branch predictors and the use of branch target buffers (BTB). PAg is our proposed mechanism, which uses a combined 512-entry BTB, a 12-bit per-address Two-Level Adaptive conditional branch predictor, and a 32-entry return address stack (RAS). Its structure is shown in Figure 1. The combined BTB stores address and prediction information of conditional branches, unconditional branches, and returns. JS uses a combined 512-entry BTB, J. Smith's 2-bit up-down saturating counter scheme [18] for conditional branch prediction, and a 32-entry RAS. PROF is an example of an instruction fetch mechanism which uses a static conditional branch predictor requiring no hardware support. The target addresses of taken conditional branches and immediate unconditional branches are not available until they are calculated from the decoded instructions. PROF still uses a 32-entry RAS, but the return address cannot be used until the return is decoded.

For the PAg and JS mechanisms, which use BTB's, a BTB entry is allocated for every branch whose accessing address misses in the buffer. The BTB uses a least-recently-used (LRU) replacement algorithm and its associativity is four. The RAS discards older addresses when it overflows.

For the PAg mechanism, the 12-bit per-address Two-Level Adaptive conditional predictor achieves 97 percent average conditional branch prediction accuracy over the nine original SPEC benchmarks with the least hardware cost among the variations of Two-Level Adaptive branch predictors. The detailed study was presented in [3].

Since our simulation results show more taken than not-taken branches, a history register in the branch history table is initialized to all 1's when an entry is allocated. The counters in the pattern history table entries are also initialized to 2 at the beginning of execution, so a branch is more likely to be predicted as taken. In addition, various static branch prediction schemes used to predict branches that miss in the branch target buffer were also simulated.
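The predictor and its initialization policy can be sketched as follows. This is an illustrative model of the per-address Two-Level Adaptive (PAg) scheme under the stated parameters (12-bit histories, counters initialized to 2, histories initialized to all 1's); the dictionary-based BHT is our simplification:

```python
# Sketch of a PAg predictor: per-address 12-bit history registers (BHT)
# indexing one shared pattern history table (PHT) of 2-bit counters.

HIST_BITS = 12

class PAgPredictor:
    def __init__(self):
        self.bht = {}                                  # pc -> 12-bit history
        self.pht = [2] * (1 << HIST_BITS)              # counters start at 2 ("weakly taken")

    def predict(self, pc):
        # a newly allocated history register is initialized to all 1's,
        # biasing the first predictions toward taken
        hist = self.bht.setdefault(pc, (1 << HIST_BITS) - 1)
        return self.pht[hist] >= 2                     # counter >= 2 predicts taken

    def update(self, pc, taken):
        hist = self.bht.setdefault(pc, (1 << HIST_BITS) - 1)
        if taken:
            self.pht[hist] = min(3, self.pht[hist] + 1)
        else:
            self.pht[hist] = max(0, self.pht[hist] - 1)
        # shift the outcome into this branch's history register
        self.bht[pc] = ((hist << 1) | int(taken)) & ((1 << HIST_BITS) - 1)
```

A fresh branch is predicted taken, as the initialization intends, while a consistently not-taken branch is learned after its history and counters adjust.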

5 Simulation Results

5.1 Return Address Size


Figure 4: Correct prediction rates of return address stacks of different sizes.

Figure 4 shows the prediction accuracy of return address stacks (RAS) of different sizes. The prediction accuracy achieved when the BTB hits and when it misses is shown in two different regions of each column. An eight-entry RAS achieves about 100 percent prediction accuracy for most of the programs. However, for call-intensive programs like li, a 32-entry RAS shows about a 5 percent and 1.5 percent increase in accuracy over the 8-entry and 16-entry RAS's, respectively. Since a 32-entry RAS does not cost much in hardware, we assume a 32-entry RAS in our design.
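A bounded return address stack with the overflow behavior stated in Section 4 (discard the oldest entry) can be sketched in a few lines; the class and method names are our own:

```python
# Sketch of a fixed-size return address stack.  On overflow the oldest
# entry is dropped, so a deep call chain only loses predictions for its
# outermost returns.
from collections import deque

class ReturnAddressStack:
    def __init__(self, entries=32):
        self.stack = deque(maxlen=entries)   # deque drops the oldest on append past maxlen

    def push(self, return_addr):             # on a predicted call
        self.stack.append(return_addr)

    def pop(self):                           # on a predicted return
        return self.stack.pop() if self.stack else None
```

With an 8-entry stack and 40 nested calls, only the innermost 8 returns predict correctly, which is why call-intensive programs like li benefit from the larger 32-entry stack.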

5.2 Branch Target Buffer Configuration

Figure 5 compares the hit/miss rates of the four classes of branches in four branch target buffer (BTB) configurations: a combined 512-entry BTB, a combined 1024-entry BTB, a 512-entry BTB with a separate 32-entry BTB for unconditional branches (UBTB), and a 512-entry BTB with a separate 128-entry UBTB. All the BTB's are 4-way set-associative and all the UBTB's are fully-associative.

The combined 512-entry BTB achieves close to a 100 percent hit rate for most programs except gcc and li, because gcc has many static conditional branches and li has many static unconditional branches. The combined 512-entry BTB is used as the base configuration. The results show that the 1024-entry BTB improves the hit rates of every class of branches. The 512-entry BTB with a 32-entry UBTB improves the hit rate of conditional branches, but not of unconditional branches, because a 32-entry UBTB is small compared to the free entries available in the combined 512-entry BTB. When a separate 128-entry UBTB is used, the hit rate of unconditional branches improves to close to that achieved by the 1024-entry combined BTB. If a 1024-entry combined BTB is too expensive to implement, the 512-entry BTB with a separate 128-entry UBTB is a good candidate for achieving similar performance. In our design, we use a combined 512-entry BTB to reduce the implementation complexity.

Figure 5: Hit/miss rates of the four classes of branches in different configurations of branch target buffers.

5.3 Pattern History Table Update Delay


Figure 6: Effects of pattern history update delay on prediction accuracy.

To show the effect of pattern history table (PHT) update delays on conditional branch prediction accuracy, delays of one branch, five branches, and fifteen branches were simulated. The delays are counted in branches instead of machine cycles, because the exact cycle counts depend greatly on the machine architecture in which the instruction fetch mechanism is used. The results shown in Figure 6 suggest that the prediction accuracy degradation caused by PHT update delays is negligible. Either a history pattern does not reappear within a short time or, if it does reappear, the prediction should be the same because a periodic history pattern is being followed.
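The update-delay experiment can be made concrete with a small sketch. This is an illustrative simplification, not the authors' simulator: a single periodic branch runs through a small PHT, the history is updated speculatively right away, and each counter update is held back for a fixed number of subsequent branches:

```python
# Sketch: measure prediction accuracy when PHT updates lag by `delay`
# branches while the history register is updated immediately.
from collections import deque

def run(outcomes, hist_bits=4, delay=0):
    mask = (1 << hist_bits) - 1
    hist = mask                                # history initialized to all 1's
    pht = [2] * (1 << hist_bits)
    pending, correct = deque(), 0
    for taken in outcomes:
        pred = pht[hist] >= 2
        correct += (pred == taken)
        pending.append((hist, taken))
        # history is updated speculatively right away ...
        hist = ((hist << 1) | int(taken)) & mask
        # ... while each PHT update is applied only after `delay` branches
        while len(pending) > delay:
            h, t = pending.popleft()
            pht[h] = min(3, pht[h] + 1) if t else max(0, pht[h] - 1)
    return correct / len(outcomes)

# a periodic taken/taken/not-taken loop branch
trace = [True, True, False] * 2000
```

Because the periodic pattern maps each history value to a fixed outcome, delaying the counter updates only postpones convergence slightly, mirroring the negligible degradation seen in Figure 6.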

5.4 Predictions for Branches Which Cause Branch Target Buffer Misses

Figure 7: Prediction accuracy of different static branch prediction schemes for predicting branches causing branch target buffer misses.

When a dynamic branch predictor is used and an accessing address misses in the BTB, there is no branch history information available for making predictions. In this situation, several static branch prediction schemes can help make the predictions. The efficiency of the static branch predictors used when the BTB misses is vital for programs which cause many BTB misses, such as gcc, transaction processing programs, and operating system code. We evaluated five schemes: always taken (TK), always fall-through (FT), backward taken/forward not-taken (BTFN), using the last prediction (LP), and Profiling (PF). The striped areas in Figure 7 show their prediction accuracy for branches which cause misses in the 512-entry combined BTB. There are no striped areas for eqntott, espresso, fpppp, matrix300, spice2g6, and tomcatv, because their BTB miss rates are extremely low. However, as seen from gcc, li, and doduc, the profiling scheme is the most effective. Although the always-taken scheme tends to give better prediction accuracy than the always fall-through scheme, it cannot take advantage of prefetching sequential instructions, so its performance actually tends to be worse.
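The five fallback schemes can each be stated in one line. The sketch below is an illustrative rendering (profiling is shown as a precomputed per-branch taken-probability table, which is our own framing):

```python
# Static prediction for a branch that missed in the BTB, under the
# five schemes compared in Figure 7.

def static_predict(scheme, pc, target, last=None, profile=None):
    """Return True for predicted-taken under the named fallback scheme."""
    if scheme == "TK":                    # always taken
        return True
    if scheme == "FT":                    # always fall through
        return False
    if scheme == "BTFN":                  # backward taken / forward not-taken
        return target < pc
    if scheme == "LP":                    # repeat the last observed outcome
        return bool(last)
    if scheme == "PF":                    # profiling: majority outcome from a training run
        return profile.get(pc, 0.0) > 0.5
    raise ValueError(scheme)
```

BTFN encodes the loop heuristic: a backward branch (target below the branch address) is usually a loop closing and therefore predicted taken.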

5.5 Branch Execution Penalty and De- livered Machine Performance

Figure 8: Branch execution penalty of different in- struction fetch mechanism designs.

Figure 8 compares the branch execution penalties (BEP) of the three instruction fetch mechanisms, PAg, JS, and PROF, under three different incorrect branch prediction penalties. In calculating the BEP's, we made some assumptions: generating the effective target address takes two cycles (C_Effective_Addr = 2), the instruction cache access takes one cycle (C_ICache = 1), and the BTB access takes one cycle (C_BTB = 1). These timing assumptions are achievable with current processor technologies. The incorrect branch prediction penalty is varied among 6, 10, and 14 cycles to show its effect on the efficiency of an instruction fetch mechanism. The BEP's of PAg and JS are calculated using Equation 2. The BEP of PROF is calculated using Equation 4.
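Equations 2 and 4 are defined earlier in the paper and not reproduced in this excerpt, but the shape of the calculation can be sketched as an expected value over branch-outcome classes. In the sketch below, only C_EFFECTIVE_ADDR = 2 and the 6-cycle misprediction penalty come from the text; the class probabilities are made-up illustrative numbers:

```python
# Sketch: BEP as the expected number of lost cycles per executed branch.

C_EFFECTIVE_ADDR = 2    # cycles to compute a branch target in the pipeline (from the text)
C_MISPREDICT = 6        # cycles lost on an incorrect prediction (from the text)

def branch_execution_penalty(cases):
    """cases: list of (probability, penalty_cycles) covering all branch outcomes."""
    assert abs(sum(p for p, _ in cases) - 1.0) < 1e-9
    return sum(p * c for p, c in cases)

# e.g. a BTB-based mechanism: correctly predicted BTB hits cost nothing,
# BTB misses pay the address-generation latency, mispredictions pay the
# full penalty (probabilities below are illustrative only)
bep = branch_execution_penalty([
    (0.95, 0),                  # BTB hit, correct prediction
    (0.02, C_EFFECTIVE_ADDR),   # BTB miss, correct static prediction
    (0.03, C_MISPREDICT),       # incorrect prediction
])
```

The weighting makes the trade-off explicit: a mechanism with high prediction accuracy concentrates the probability mass on the zero-cost case.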

When the incorrect branch prediction penalty is 6 cycles, the average BEP of PAg over the nine programs is about 57 percent of that of JS and about 12 percent of that of PROF. As the incorrect branch prediction penalty increases, the relative ratio between PAg and JS stays about the same. The ratio between PAg and PROF, however, increases because a larger incorrect prediction penalty affects mechanisms using a dynamic branch predictor and a BTB more. PROF suffers from the lack of a BTB, which can be seen from the BEP of spice2g6. Because there are a substantial number of indirect branches in the program and PROF must take the full penalty to execute those branches, its BEP is high even though its conditional branch prediction accuracy is high.

Figure 9: Delivered machine performance of a 5 ideal IPC machine with different instruction fetch mechanisms.

Figure 9 shows the delivered machine performance (in IPC) in the machine which can execute five in- structions per cycle ideally (IIPC=5). The delivered machine performance is translated from the branch execution penalties shown in Figure 8 and the branch probability in each program by using Equation 5. The IIPC is set to five, because the average basic block size is five instructions in our traces [3] and our issue mech- anism is limited to issue at most one basic block per cycle.
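Equation 5 is likewise defined earlier in the paper; one translation consistent with the quantities named here (ideal IPC, branch probability, BEP) treats each branch's BEP cycles as extra cycles added on top of the ideal per-instruction cost. The branch probability of 0.2 below is an assumption derived from the stated five-instruction average basic block, not a measured value:

```python
# Sketch: translate a branch execution penalty into delivered IPC.

def delivered_ipc(ideal_ipc, branch_prob, bep):
    # cycles per instruction = ideal CPI + expected branch-penalty cycles
    cpi = 1.0 / ideal_ipc + branch_prob * bep
    return 1.0 / cpi

ipc = delivered_ipc(ideal_ipc=5.0, branch_prob=0.2, bep=0.218)
```

With IIPC = 5, one branch per five instructions, and the 0.218-cycle BEP reported later for PAg, this sketch yields roughly 4.1 IPC, in the neighborhood of the averages reported below.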

The delivered performance of PAg and JS on floating point programs is close to the ideal performance due to their high conditional prediction accuracy, high return address stack (RAS) prediction accuracy, high BTB hit rates, and low branch probability. PROF loses considerable performance on spice2g6 from not having a BTB to predict indirect branches; however, PROF performs well on fpppp. There are not many branches in that program, so the effect of not having a BTB is minimized.

The delivered performance of integer programs is lower than that of floating point programs. Compared to floating point programs, integer programs execute more branches, among which there are more unpredictable conditional branches, returns, and unconditional branches. In the average integer performance (the harmonic mean of all the integer program performance), PAg achieves speedups of 1.23, 1.33, and 1.40 over JS when the incorrect branch prediction penalties are 6, 10, and 14 cycles. When compared to PROF, PAg achieves speedups of 2.42, 2.37, and 2.34 for those same penalties. The speedup of PAg over PROF decreases as the incorrect prediction penalty gets larger because a large penalty degrades the performance of a dynamic branch predictor with a BTB more than it does a static branch predictor. Therefore, it is very important for a mechanism which uses a dynamic branch predictor to have high prediction accuracy, because a mechanism with higher prediction accuracy can tolerate a higher incorrect prediction penalty. The BTB also plays an important role in generating effective fetch addresses for branches.

The overall average performance is about 4.19 IPC (3.61 IPC for the average integer performance and 4.82 IPC for the average floating point performance) for PAg when the incorrect branch prediction penalty equals 6 cycles. The performance decreases to 3.48 IPC (2.67 IPC for integer performance and 4.61 IPC for floating point performance) when the penalty increases to 14 cycles. The performance difference is about 0.7 IPC (0.9 IPC for integer performance and 0.2 IPC for floating point performance) across that penalty change. Hence, it is also important to keep the incorrect branch prediction penalty low in a superscalar machine design.

6 Concluding Remarks

In this paper we have proposed an integrated in- struction fetch mechanism supporting speculative ex- ecution in a superscalar processor. The integrated in- struction fetch mechanism consists of a Two-Level Ad- aptive conditional branch predictor, a target address cache and a return address stack. Our objective was to incur no delay between instruction fetches and to provide correct instruction fetch addresses.

We introduced a new metric, the Branch Execution Penalty (BEP), to evaluate the performance of various instruction issue mechanisms. The metric measures the average number of cycles lost due to executing a branch instruction. These cycles include both the cycles lost in generating the next instruction fetch address and the cycles lost in fetching wrong instructions.

Our simulations show that if the average branch misprediction penalty is 6 cycles, the average branch execution penalty of the mechanism using the Two-Level Adaptive branch predictor is only 57 percent of that of the mechanism using J. Smith's 2-bit saturating up-down counter scheme, and only about 20 percent of that of the mechanism using static Profiling with no branch target buffer support. We have identified several issues associated with the implementation of a Two-Level Adaptive branch predictor; solutions have been provided and simulated.

Since the Two-Level Adaptive branch predictor requires two levels of branch history information to make predictions, a large number of branch history table misses could result in serious prediction accuracy degradation. We studied the effects of various default schemes when a branch history table miss occurs, and found Profiling to be the most effective.

A pipelined design of the instruction issue mechanism using Two-Level Adaptive Branch Prediction is proposed to satisfy the requirements of high-performance processors. By updating the branch history speculatively with the predictions and delaying the pattern history update until the branch result is ready, the mechanism can predict with the most recent branch history even for the same branch instruction in back-to-back cycles. Our simulations show that delaying the pattern history table update until the branch is resolved and successfully retired has a negligible effect on performance. In addition, by using one extra bit to store the lookahead prediction for the Two-Level Adaptive Branch Prediction scheme using per-address branch history, and by splitting the BHT access for a prediction and the PHT access for a lookahead prediction into two cycles, the prediction of a branch can be made within one cycle. Therefore, in our design one branch prediction can be made every cycle, even if the same branch instruction occurs in back-to-back cycles.
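The lookahead arrangement above can be sketched as follows. This is an illustrative, non-cycle-accurate model built on our own PAg simplification: each BHT entry carries one extra bit holding the lookahead prediction, so the prediction itself needs only the BHT read, while the PHT read that computes the *next* lookahead overlaps the following fetch:

```python
# Sketch: one prediction per cycle via a stored lookahead bit.

HIST_BITS = 12
MASK = (1 << HIST_BITS) - 1

class LookaheadPAg:
    def __init__(self):
        self.pht = [2] * (1 << HIST_BITS)    # 2-bit counters, initialized to 2
        self.bht = {}                        # pc -> [history, lookahead-prediction bit]

    def _entry(self, pc):
        if pc not in self.bht:
            hist = MASK                      # history initialized to all 1's
            self.bht[pc] = [hist, self.pht[hist] >= 2]
        return self.bht[pc]

    def predict(self, pc):
        """Cycle 1: the BHT read alone yields the prediction (the stored
        lookahead bit); the history is then updated speculatively with it,
        and the PHT read for the next lookahead overlaps the next fetch."""
        entry = self._entry(pc)
        pred = entry[1]
        entry[0] = ((entry[0] << 1) | int(pred)) & MASK
        entry[1] = self.pht[entry[0]] >= 2   # cycle 2: lookahead for next time
        return pred
```

Because each call consumes a prediction already sitting in the BHT entry, back-to-back calls for the same branch each return in "one cycle" of this model, which is the behavior the pipelining is designed to achieve.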

We compared the use of a branch target buffer (BTB) for conditional branches together with a separate branch target buffer (UBTB) for unconditional branches versus a combined BTB for both types of branch instructions. We show that a separate 128-entry fully-associative UBTB for unconditional branches can reduce the access conflicts in the buffer and achieves performance close to that achieved by a 1024-entry combined BTB.

The instruction issue mechanism which uses a PAg scheme with 12-bit branch history registers, a 512-entry 4-way set-associative combined BTB, and a 32-entry return address stack achieves an average of about 0.218 cycles of delay for each branch executed. This assumes that generating the effective target address takes two cycles, the incorrect branch prediction penalty is 6 cycles, the instruction cache access takes one cycle, and the BTB access takes one cycle. This design can greatly improve the performance of a deeply-pipelined, wide-issue superscalar processor. In a machine which can achieve 5 ideal IPC on the SPEC integer programs, the average speedups of our proposed issue mechanism are 1.23 (ranging from 1.06 to 1.51 for individual programs) over the mechanism using J. Smith's 2-bit saturating up-down counter scheme and 2.41 (ranging from 1.71 to 2.95) over the mechanism using static Profiling without BTB support. This again assumes a branch misprediction penalty of 6 cycles. When the branch misprediction penalty is 14 cycles, the speedups change to 1.39 (ranging from 1.09 to 1.92) and 2.34 (ranging from 1.48 to 3.18), respectively.

Acknowledgement

This paper is one result of ongoing computer architecture research at the University of Michigan. The support of Intel, Motorola, NCR, Hewlett-Packard, Hal, and Scientific and Engineering Software is greatly appreciated. In addition, the authors wish to acknowledge with gratitude the other HPS research group members for the stimulating environment they provide and, in particular, for their comments and suggestions on this work. We are also grateful to Intel and Motorola Corporation for technical and financial support, and to NCR Corporation for the gift of an NCR Tower, Model No. 32, which was very useful in our work.

References

[1] T-Y. Yeh and Y.N. Patt, "Two-Level Adaptive Branch Prediction," Proceedings of the 24th ACM/IEEE International Symposium and Workshop on Microarchitecture, (Nov. 1991), pp. 51-61.
[2] T-Y. Yeh and Y.N. Patt, "Two-Level Adaptive Branch Prediction," Technical Report CSE-TR-117-91, Computer Science and Engineering Division, Department of EECS, The University of Michigan, (Nov. 1991).
[3] T-Y. Yeh and Y.N. Patt, "Alternative Implementations of Two-Level Adaptive Branch Prediction," Proceedings of the 19th International Symposium on Computer Architecture, (May 1992), pp. 124-134.
[4] M. Butler, T-Y. Yeh, Y.N. Patt, M. Alsup, H. Scales, and M. Shebanow, "Instruction Level Parallelism is Greater Than Two," Proceedings of the 18th International Symposium on Computer Architecture, (May 1991), pp. 276-286.
[5] D.R. Kaeli and P.G. Emma, "Branch History Table Prediction of Moving Target Branches Due to Subroutine Returns," Proceedings of the 18th International Symposium on Computer Architecture, (May 1991), pp. 34-42.
[6] Motorola Inc., "MC88100 User's Manual," Phoenix, Arizona, (March 13, 1989).
[7] W.W. Hwu, T.M. Conte, and P.P. Chang, "Comparing Software and Hardware Schemes for Reducing the Cost of Branches," Proceedings of the 16th International Symposium on Computer Architecture, (May 1989).
[8] N.P. Jouppi and D. Wall, "Available Instruction-Level Parallelism for Superscalar and Superpipelined Machines," Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems, (April 1989), pp. 272-282.
[9] D.J. Lilja, "Reducing the Branch Penalty in Pipelined Processors," IEEE Computer, (July 1988), pp. 47-55.
[10] W.W. Hwu and Y.N. Patt, "Checkpoint Repair for Out-of-order Execution Machines," IEEE Transactions on Computers, (December 1987), pp. 1496-1514.
[11] P.G. Emma and E.S. Davidson, "Characterization of Branch and Data Dependencies in Programs for Evaluating Pipeline Performance," IEEE Transactions on Computers, (July 1987), pp. 859-876.
[12] J.A. DeRosa and H.M. Levy, "An Evaluation of Branch Architectures," Proceedings of the 14th International Symposium on Computer Architecture, (June 1987), pp. 10-16.
[13] D.R. Ditzel and H.R. McLellan, "Branch Folding in the CRISP Microprocessor: Reducing Branch Delay to Zero," Proceedings of the 14th International Symposium on Computer Architecture, (June 1987), pp. 2-9.
[14] S. McFarling and J. Hennessy, "Reducing the Cost of Branches," Proceedings of the 13th International Symposium on Computer Architecture, (1986), pp. 396-403.
[15] J. Lee and A.J. Smith, "Branch Prediction Strategies and Branch Target Buffer Design," IEEE Computer, (January 1984), pp. 6-22.
[16] T.R. Gross and J. Hennessy, "Optimizing Delayed Branches," Proceedings of the 15th Annual Workshop on Microprogramming, (Oct. 1982), pp. 114-120.
[17] J. Fisher, "Trace Scheduling: A Technique for Global Microcode Compaction," IEEE Transactions on Computers, (July 1981), C-30, pp. 478-490.
[18] J.E. Smith, "A Study of Branch Prediction Strategies," Proceedings of the 8th International Symposium on Computer Architecture, (May 1981), pp. 135-148.
