
Gerald, Huifang, Kahn, Mei, & Yuchi

CS152 Lab 7 Report
By QWX

Abstract

The purpose of this lab assignment is to add several sub-projects to our existing one-memory processor: 8-stage deep pipelining, branch prediction, a victim cache, a stream buffer, and a write buffer.

Division of Labor:

8-stage deep pipelining: Yuchi & Gerald
Branch predictor: Yuchi, Gerald, & Mei
Victim cache: Kahn
Stream buffer: Huifang
Write buffer: Mei

Detailed Strategy:

1. Implementation

[Block diagram: the 8-stage pipelined processor, with an instruction cache backed by a stream buffer, a data cache backed by a victim cache and a write buffer, and an arbiter between the caches and DRAM.]


1.1 Deep Pipelining Processor

Our 8-stage pipeline is broken into 2 IF, 1 ID, 2 EX, 2 MEM, and 1 WB stages:

IF1: the PC calculation and the first stage of the I-cache access
IF2: the second stage of the I-cache access, the branch prediction, and the main controller
ID: instruction decode and the forwarding and stall control signal generators
EX1: the only EX stage for logic operations and the first stage for add, subtract, and shift operations
EX2: the second stage for add, subtract, and shift operations
MEM1: the lw/sw address calculation
MEM2: the D-cache access
WB: the register write-back

The main functional units in each stage are:

IF1: PC adder, J mux, PC-src mux, I-cache, and stream buffer
IF2: I-cache, extender, branch adder, main controller, jr forwarding mux, and branch predictor
ID: register file, forwarding and stall controllers
EX1: two 16-bit adders, two 16-bit subtractors, one 16-bit enabled shifter, rs and rt forwarding muxes, Rt mux, register-destination mux, and branch decision logic gates
EX2: one 16-bit adder, one 16-bit subtractor, one 16-bit enabled shifter, and SLT logic gates
MEM1: address adder and address forwarding mux
MEM2: sw forwarding mux, mem-to-reg mux, D-cache, write buffer, and victim cache
WB: register file

The critical path in lab 5 (slt followed by branch forwarding) ran through the EX pipeline register, the ALU source mux, the ALU, the SLT logic, the forwarding muxes in the ID stage, the comparator, and the J mux back in the IF stage. To break this path, we first moved the forwarding muxes, together with the branch decision logic, back to EX1; since the branch outcome is available there, this is feasible under the MIPS one-delay-slot convention. To preserve the one delay slot for jumps, the main controller has to sit in the second stage, IF2, where the jump decision is made. After these changes, the ALU became our critical path. To break up the ALU and minimize our cycle time, we implemented each ALU function individually: AND, OR, XOR, and the comparator are implemented at the gate level, and 16-bit adders, subtractors, and shifters written in VHDL replace the 32-bit ones to reduce delay.
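The report does not spell out how the 16-bit units span EX1 and EX2; a minimal sketch of one plausible wiring (an assumption on our part, with invented entity and signal names) adds the low 16 bits in EX1, registers the carry, and finishes the high 16 bits in EX2:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity split_add is
  port (
    clk  : in  std_logic;
    a, b : in  std_logic_vector(31 downto 0);
    sum  : out std_logic_vector(31 downto 0));  -- valid one cycle after a, b
end split_add;

architecture two_stage of split_add is
  signal lo_q           : unsigned(15 downto 0);  -- EX1/EX2 register: low sum
  signal carry_q        : unsigned(0 downto 0);   -- carry out of the low half
  signal a_hi_q, b_hi_q : unsigned(15 downto 0);
begin
  process (clk)
    variable lo : unsigned(16 downto 0);
  begin
    if rising_edge(clk) then
      -- EX1: 16-bit add of the low halves, keeping the carry
      lo := resize(unsigned(a(15 downto 0)), 17) +
            resize(unsigned(b(15 downto 0)), 17);
      lo_q    <= lo(15 downto 0);
      carry_q <= lo(16 downto 16);
      a_hi_q  <= unsigned(a(31 downto 16));
      b_hi_q  <= unsigned(b(31 downto 16));
    end if;
  end process;

  -- EX2: 16-bit add of the high halves plus the saved carry
  sum <= std_logic_vector((a_hi_q + b_hi_q + carry_q) & lo_q);
end two_stage;

Splitting the adder this way roughly halves its delay per stage, which is consistent with the halved cycle time reported in the Results section.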

In terms of forwarding, because two more stages sit between ID and WB, results need to be forwarded over at most four stages. Thus, five-input forwarding muxes are used, located in ID (for jr), EX1 (for the ALU), MEM1 (for the address calculation), and MEM2 (for sw).

As for stalls, at most three stalls are needed for a lw followed by a dependent ALU instruction. One stall is needed for an instruction whose result completes in EX2 followed by a dependent ALU instruction, and one stall is needed for a lw followed by an address-dependent lw/sw instruction.
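A condensed behavioral sketch of this stall check is shown below. All entity and port names are hypothetical (the actual generator is Stall.vhd, listed in Appendix II); the point is that re-evaluating one combinational equation every cycle naturally yields the 3/1/1 stall counts above.

library ieee;
use ieee.std_logic_1164.all;

entity stall_unit is
  port (
    lw_in_ex1, lw_in_ex2, lw_in_mem1 : in  std_logic;  -- a lw occupies that stage
    ex2_result_in_ex1                : in  std_logic;  -- add/sub/shift/slt now in EX1
    ex1_dst, ex2_dst, mem1_dst       : in  std_logic_vector(4 downto 0);
    id_rs, id_rt                     : in  std_logic_vector(4 downto 0);
    id_needs_ops_in_ex1              : in  std_logic;  -- ALU-type consumer in ID
    id_is_addr_dependent             : in  std_logic;  -- lw/sw consumer (rs is the base)
    stall                            : out std_logic);
end stall_unit;

architecture rtl of stall_unit is
  signal raw_ex1, raw_ex2, raw_mem1 : boolean;  -- destination matches an ID source
begin
  raw_ex1  <= ex1_dst  = id_rs or ex1_dst  = id_rt;
  raw_ex2  <= ex2_dst  = id_rs or ex2_dst  = id_rt;
  raw_mem1 <= mem1_dst = id_rs or mem1_dst = id_rt;

  -- An ALU consumer waits until the lw clears MEM2 (the condition holds for
  -- three consecutive cycles = three stalls); an address-dependent lw/sw only
  -- waits while the producing lw is in EX1 (one stall); a result computed in
  -- EX2 costs a dependent ALU instruction one stall.
  stall <= '1' when
             (id_needs_ops_in_ex1 = '1' and
               ((lw_in_ex1  = '1' and raw_ex1) or
                (lw_in_ex2  = '1' and raw_ex2) or
                (lw_in_mem1 = '1' and raw_mem1) or
                (ex2_result_in_ex1 = '1' and raw_ex1))) or
             (id_is_addr_dependent = '1' and lw_in_ex1 = '1' and raw_ex1)
           else '0';
end rtl;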


1.2 Branch Predictor

In our 8-stage deep pipeline, the branch decision is made in EX1, the fourth stage, so without a branch predictor a branch instruction would need three delay slots. To comply with the MIPS convention of one delay slot, the pipeline has to flush two instructions whenever a branch is taken. We decided to use a two-bit branch predictor to increase the rate of correct predictions. The FSM of the predictor consists of four states: strong taken, weak taken, weak not taken, and strong not taken.

[Figure: state transition diagram of the two-bit predictor.]

We reserve a two-bit entry for each instruction to record its branch history, selected by the lower bits of the PC; the prediction table has 256 entries. We set the initial value of each entry to "weak taken", so that a loop branch is mispredicted only once.

The branch predictor works as follows (a behavioral sketch follows the list).

1. When a branch instruction enters the IF2 stage, the branch predictor outputs a predicted value for the branch, which is passed to the IF1 stage to select the next PC.
2. When the branch instruction and the predicted value reach the EX1 stage, the real branch decision is made by a comparator. If the decision does not agree with the predicted value, a flush signal flushes the IF1 and IF2 stages, and the correct PC is selected as the next PC.
3. The branch decision in the EX1 stage is passed back to the branch predictor in the IF2 stage to update the predictor state according to the state transition diagram.
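Below is a behavioral sketch of this scheme: a 256-entry table of two-bit saturating counters, indexed by low PC bits and initialised to "weak taken". Port names are our assumptions for illustration; the actual design is Branch_predictor.vhd (Appendix II).

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity branch_predictor is
  port (
    clk         : in  std_logic;
    lookup_pc   : in  std_logic_vector(31 downto 0);  -- branch PC in IF2
    predict_tkn : out std_logic;                      -- to the IF1 next-PC mux
    update_en   : in  std_logic;                      -- branch resolved in EX1
    update_pc   : in  std_logic_vector(31 downto 0);
    taken       : in  std_logic);                     -- the real decision
end branch_predictor;

architecture rtl of branch_predictor is
  -- "00" strong not taken, "01" weak not taken, "10" weak taken, "11" strong taken
  type table_t is array (0 to 255) of unsigned(1 downto 0);
  signal bpt : table_t := (others => "10");  -- initialise to "weak taken"
begin
  -- index with low PC bits (instructions are word aligned, so skip bits 1..0);
  -- the counter's top bit is the prediction
  predict_tkn <= bpt(to_integer(unsigned(lookup_pc(9 downto 2))))(1);

  process (clk)
    variable i : integer range 0 to 255;
    variable c : unsigned(1 downto 0);
  begin
    if rising_edge(clk) then
      if update_en = '1' then
        i := to_integer(unsigned(update_pc(9 downto 2)));
        c := bpt(i);
        if taken = '1' and c /= "11" then
          bpt(i) <= c + 1;   -- saturate toward strong taken
        elsif taken = '0' and c /= "00" then
          bpt(i) <= c - 1;   -- saturate toward strong not taken
        end if;
      end if;
    end if;
  end process;
end rtl;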


The block diagram of our branch predictor is as follows.

[Figure: branch predictor block diagram spanning IF1, IF2, ID, and EX1. The predictor (BP) in IF2 checks "is branch?" and issues a prediction that updates the PC in IF1; the comparator in EX1 produces the real branch decision, which updates the branch prediction table (BPT) and flushes IF1 and IF2 on a mispredict.]


1.3 Stream Buffer (schematic in Viewdraw)

[Figure: stream buffer placement between the instruction cache and DRAM. On a miss, DRAM sends the required first 4 words to the instruction cache and the second 4 words to the stream buffer.]


The stream buffer connects the memory and the instruction cache as indicated in the diagram above, extending the bandwidth of the interface between them. On each instruction fetch from memory, the DRAM sends the required 4 words to the instruction cache and the following 4 words to the stream buffer. The contents of the stream buffer are copied to the instruction cache when an instruction misses in the cache but hits in the stream buffer. When fetching instructions, the pipeline is stalled only during data transfers between the DRAM and the instruction cache.
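A behavioral sketch of the hit path is shown below for a single 4-word block. The names and the one-block organisation are assumptions (the actual controller is "stream buffer control.vhd" in Appendix II, and the prefetching variant described later keeps two such blocks).

library ieee;
use ieee.std_logic_1164.all;

entity stream_buf is
  port (
    clk, reset  : in  std_logic;
    icache_miss : in  std_logic;
    miss_addr   : in  std_logic_vector(31 downto 4);   -- 4-word block address
    sb_hit      : out std_logic;                       -- block found in buffer
    sb_data     : out std_logic_vector(127 downto 0);  -- 4 words to the I-cache
    fill_en     : in  std_logic;                       -- DRAM delivers a block
    fill_addr   : in  std_logic_vector(31 downto 4);
    fill_data   : in  std_logic_vector(127 downto 0));
end stream_buf;

architecture rtl of stream_buf is
  signal tag   : std_logic_vector(31 downto 4);
  signal data  : std_logic_vector(127 downto 0);
  signal valid : std_logic := '0';
  signal hit_i : std_logic;
begin
  hit_i   <= '1' when valid = '1' and icache_miss = '1' and miss_addr = tag
             else '0';
  sb_hit  <= hit_i;   -- a hit suppresses the DRAM request: the block is
  sb_data <= data;    -- copied straight into the I-cache

  process (clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        valid <= '0';
      elsif fill_en = '1' then   -- the "following 4 words" arriving from DRAM
        tag   <= fill_addr;
        data  <= fill_data;
        valid <= '1';
      elsif hit_i = '1' then     -- invalidate after the copy so a stale
        valid <= '0';            -- block is never supplied twice
      end if;
    end if;
  end process;
end rtl;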

In the final version, the stream buffer holds 4 words. This choice reflects the performance trade-off between reducing the refill penalty, which it does in about half of the instruction misses, and increasing the stalls caused by conflicting instruction and data requests to the memory. The 4-word buffer was chosen over the 8-word buffer suggested in the provided specs because a smaller stream buffer is expected to perform better on a small test program. The details of this consideration are in the trade-off section.

The stream buffer was originally designed and implemented with an instruction prefetch feature and a buffer size of 8 words. With this feature, 4 words are still sent to the stream buffer during instruction fetching; in addition, when the stream buffer controller finds that one of its two 4-word blocks is empty and the pipeline is running without a memory access, it sends a request to the arbiter to fetch 4 more words from memory. This prefetching has lower priority than the instruction and data caches and is killed whenever either cache makes a request. The prefetch feature was disabled during final processor debugging because it requires much more complicated coordination in the memory system and debugging time was limited. Given another chance to improve the processor, we would finish this feature, and we believe it would help performance.

1.4 Victim Cache

The victim cache serves as a cache for the data cache. When dirty data are written back to DRAM, they are also written to the victim cache for later reference; data which are not dirty likewise get written to the victim cache before they are overwritten in the data cache. Since accessing the victim cache is much faster than accessing DRAM, the victim cache helps improve the performance of the processor.

In our design, the victim cache consists of the block cell and the victim cache controller. The block cell is composed of 16 memory cells for four 4-word cache lines and 4 memory cells for the tags; each cell stores 1 word (4 bytes) of data. The row decoder selects which cache line is being read or written, and the column decoder selects a word within that line. Four comparators compare the input tag with the tags of the cache lines; their outputs (HIT0-3) are OR-ed together to determine a hit or miss in the victim cache. The HIT signal stops the arbiter from making a memory read request. The victim cache controller reads the HIT0-3 signals and uses a FIFO algorithm to determine which cache line to replace, as sketched below.
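This sketch shows the parallel tag compare and the FIFO replacement pointer. Entity and port names, and the tag width, are assumptions (the actual controller is victim_cache_cntrl.vhd in Appendix II).

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity victim_ctrl is
  port (
    clk        : in  std_logic;
    lookup_en  : in  std_logic;
    lookup_tag : in  std_logic_vector(27 downto 0);  -- block address tag
    hit        : out std_logic;                      -- stops the arbiter's DRAM read
    hit_line   : out unsigned(1 downto 0);
    alloc_en   : in  std_logic;                      -- a line evicted from the D-cache
    alloc_tag  : in  std_logic_vector(27 downto 0);
    alloc_line : out unsigned(1 downto 0));          -- which of the 4 lines to write
end victim_ctrl;

architecture rtl of victim_ctrl is
  type tag_arr is array (0 to 3) of std_logic_vector(27 downto 0);
  signal tags     : tag_arr;
  signal valid    : std_logic_vector(3 downto 0) := (others => '0');
  signal fifo_ptr : unsigned(1 downto 0) := "00";  -- oldest line, next to replace
begin
  -- HIT0-3: all four tags compared in parallel, OR-ed into one hit signal
  process (lookup_en, lookup_tag, tags, valid)
  begin
    hit      <= '0';
    hit_line <= "00";
    for i in 0 to 3 loop
      if lookup_en = '1' and valid(i) = '1' and tags(i) = lookup_tag then
        hit      <= '1';
        hit_line <= to_unsigned(i, 2);
      end if;
    end loop;
  end process;

  alloc_line <= fifo_ptr;

  process (clk)
  begin
    if rising_edge(clk) then
      if alloc_en = '1' then
        tags(to_integer(fifo_ptr))  <= alloc_tag;
        valid(to_integer(fifo_ptr)) <= '1';
        fifo_ptr <= fifo_ptr + 1;   -- wraps modulo 4: FIFO replacement
      end if;
    end if;
  end process;
end rtl;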


1.5 Write Buffer

We decided to add a write buffer between the data cache and memory. Without a write buffer, each write miss stalls the pipeline to fetch four words from memory into the data cache, per our write-back policy. With a write buffer, the pipeline writes the word being stored into the write buffer, which writes it back to memory afterwards; the pipeline can thus proceed without stalling, as long as the data cache is not accessed during the write-back.

The write buffer holds five words: one for the word being stored and four for the dirty cache line being written back. The write buffer itself is quite simple, just five word registers and two address registers plus a couple of muxes; it is, however, controlled by the data cache controller, and this greatly complicates the controller design.

The revised data cache controller works as follows.

1. Not-dirty read miss: stall the pipeline, read 4 words from memory, transfer the replaced cache line to the victim cache during the wait, send one word to the processor, then release the pipeline. (10-cycle penalty)
2. Dirty read miss: stall the pipeline, send a read request to memory, transfer the dirty cache line to the write buffer and victim cache during the wait, write the new cache line, send one word to the processor, release the pipeline, then write back from the write buffer to memory. (10-cycle penalty)
3. Not-dirty write miss: write the word being stored into the write buffer, send a read request to memory, transfer the replaced cache line to the victim cache during the wait, write the new cache line, then write back from the write buffer to memory. (no penalty!)
4. Dirty write miss: write the word being stored into the write buffer, send a read request to memory, transfer the replaced cache line to the write buffer and victim cache during the wait, write the new cache line, then write back from the write buffer to memory. (no penalty!)

The state transition diagram for a "sw" with a dirty cache miss steps through the following states (a compact FSM sketch in VHDL follows):

1. Write the "sw" word into the write buffer.
2. Write the dirty cache line into the write buffer.
3. Load 4 words from memory into the cache.
4. Write the stored word into the cache.
5. Write the buffered cache line to memory.
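The sketch below captures just this dirty-sw-miss sequence; state and signal names are invented for illustration (the real controller is Cache_ctrl.vhd, Appendix II).

library ieee;
use ieee.std_logic_1164.all;

entity sw_dirty_miss_fsm is
  port (
    clk, reset     : in  std_logic;
    sw_dirty_miss  : in  std_logic;   -- decoded: sw, tag mismatch, line dirty
    mem_line_ready : in  std_logic;   -- the 4 requested words have arrived
    mem_write_done : in  std_logic;   -- write buffer finished draining
    draining       : out std_logic);  -- background write-back in progress
end sw_dirty_miss_fsm;

architecture rtl of sw_dirty_miss_fsm is
  type state_t is (IDLE, SW_TO_WBUF, LINE_TO_WBUF, WAIT_FILL,
                   STORE_TO_CACHE, DRAIN_WBUF);
  signal state : state_t := IDLE;
begin
  draining <= '1' when state = DRAIN_WBUF else '0';

  process (clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        state <= IDLE;
      else
        case state is
          when IDLE =>                    -- the memory read request also goes
            if sw_dirty_miss = '1' then   -- out here, so the fill overlaps
              state <= SW_TO_WBUF;        -- the next two states
            end if;
          when SW_TO_WBUF   => state <= LINE_TO_WBUF;  -- stored word -> buffer
          when LINE_TO_WBUF => state <= WAIT_FILL;     -- dirty line -> buffer
                                                       -- (and victim cache)
          when WAIT_FILL =>                            -- 4 words -> cache
            if mem_line_ready = '1' then
              state <= STORE_TO_CACHE;
            end if;
          when STORE_TO_CACHE => state <= DRAIN_WBUF;  -- merge the stored word
          when DRAIN_WBUF =>                           -- buffer -> memory in the
            if mem_write_done = '1' then               -- background; the pipeline
              state <= IDLE;                           -- is never stalled
            end if;
        end case;
      end if;
    end if;
  end process;
end rtl;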


2 Trade-offs

2.1 Deep Pipelining Processor

There is always a big trade-off between performance and design complexity, and the deeply pipelined processor is no exception. The major trade-off within the 8-stage processor itself is between stalls and forwarding: to minimize the number of stalls, more forwarding is required, and thus more functional units. Four major trade-offs appear in our processor design.

1. 3 stalls versus 1 stall for a lw followed by an address-dependent lw/sw instruction. To avoid three stalls, we moved the address calculation into the MEM stage, at the cost of one additional adder and one forwarding mux for the address operand rs.

2. 3 stalls versus 0 stalls for a lw followed by a sw. To avoid unnecessary stalls in this case, we added a forwarding mux in the MEM2 stage for the rt of the sw.

3. 1 stall versus 0 stalls for an add, subtract, shift, or slt operation followed by a dependent logic operation. To avoid this stall, we implemented the logic operations at the gate level, making their results available in EX1.

4. 1 stall versus 0 stalls for an add, subtract, shift, or slt operation followed by an address-dependent lw/sw instruction. By moving the address calculation into MEM1, as in the first case, we essentially eliminated this penalty.

Moreover, for jr, we forward the dependent result in the earliest possible stage, i.e., the ID stage. Yet even with this forwarding, we still need to flush one instruction in order to meet the MIPS convention of one jump delay slot.

2.2 Branch Predictor

We use a large number of prediction table entries to reduce the chance of instructions aliasing onto the same entry. The trickiest part of the branch predictor design is coordinating with pipeline stalls and nearby jump instructions. The predicted value must be blocked while the pipeline is stalled. For the combination branch + some instruction + jump, the branch decision from EX1 and the jump decision from IF2 arrive at IF1 to select the next PC in the same cycle; the next PC must be decided by the branch instruction, since it comes before the jump. This selection priority is sketched below.
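A sketch of that next-PC priority, with assumed names: the EX1 branch redirect outranks the IF2 jump, which outranks the IF2 prediction, and a stall blocks the predicted value as described above.

library ieee;
use ieee.std_logic_1164.all;

entity next_pc_sel is
  port (
    mispredict  : in  std_logic;                      -- from the EX1 comparator
    redirect_pc : in  std_logic_vector(31 downto 0);  -- corrected branch PC
    is_jump_if2 : in  std_logic;
    jump_target : in  std_logic_vector(31 downto 0);
    pred_taken  : in  std_logic;                      -- from the IF2 predictor
    pred_target : in  std_logic_vector(31 downto 0);
    stall       : in  std_logic;
    pc_plus_4   : in  std_logic_vector(31 downto 0);
    next_pc     : out std_logic_vector(31 downto 0));
end next_pc_sel;

architecture rtl of next_pc_sel is
begin
  next_pc <= redirect_pc when mispredict = '1' else   -- branch decided first
             jump_target when is_jump_if2 = '1' else
             pred_target when pred_taken = '1' and stall = '0' else
             pc_plus_4;
end rtl;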

2.3 Stream Buffer

The performance the stream buffer can gain for the pipelined processor involves a trade-off between reducing the refill penalty, in about half of the instruction misses, and increasing the stalls caused by conflicting instruction and data requests to the memory. The stream buffer gains less for small application programs because they incur few instruction misses. The simulation results for a 4-word stream buffer are as follows:

Application            Program size   Performance   With stream       w/o stream
                       (bytes)        gain          buffer (cycles)   buffer (cycles)
Project Test Program   4 k            3.4%          2664              2951
Lab6 Mystery Program   3 k            -1.9%         11083             10877
Lab5 Mystery Program   10 k           8%            2689              2445


Here we can see that for a program of small size, the stream buffer can even increase the execution time. With this in mind, we chose a 4-word stream buffer: the number of stalls the stream buffer incurs grows with its size, which would lead to even worse performance on a small program.

2.4 Victim Cache

In order to improve the performance of the pipelined processor, we chose to implement the victim cache. By the principle of locality, data replaced in the data cache may be referenced again in the near future, so these data are stored temporarily in the victim cache for later reference. We hoped the victim cache would reduce data cache misses and hence speed up our processor.

Instead of putting the victim cache between the data cache and the arbiter, we chose to put it beside the data cache and design it as an add-on module. This decision has three benefits. First, the victim cache can be used in our previous design without modifying the data cache or the arbiter; this helped in the testing phase because we could test the victim cache in our 5-stage pipeline (from lab6) before testing it with our 8-stage pipeline processor. Second, the victim cache can be enabled or disabled without affecting the operation of the pipeline (even though it may affect its performance); this helped us narrow down whether a bug was in the victim cache or in another part of the design. Finally, we could measure the victim cache's effect on the pipeline as the difference in performance between runs with the victim cache enabled and disabled.

2.5 Write Buffer

We designed a 5-word write buffer instead of the 4-word one in the spec so that the pipeline does not stall on dirty write misses, where the number of words transferred is five. For dirty read misses, we release the stall right after the data cache gets the cache line from memory, before the replaced data is written back; this cuts the dirty-read-miss penalty from 19 cycles to 10. For memory reads, since we must wait 5 cycles between sending the request and receiving the first data, we "steal" 4 of those waiting cycles to transfer a cache line from the data cache to the write buffer. The performance gain of the write buffer is fairly large: for a dirty write miss, the number of stalled cycles drops from 19 to zero, and for a not-dirty write miss, 9 cycles are saved. For a typical program in which stores are 15% of instructions, the gain in CPI is 2.7, assuming every stored word misses dirty.
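As a worked check of that figure (our reconstruction; the report does not show the arithmetic), the CPI gain is the store frequency times the cycles saved per store miss:

\Delta\mathrm{CPI} = f_{\mathrm{sw}} \times \text{cycles saved} = 0.15 \times 18 = 2.7

The quoted 2.7 follows if the per-miss saving is taken as 18 cycles; with the full 19-cycle figure above, it would be 0.15 \times 19 = 2.85.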



3. Verification

This project comprises five components. The victim cache, write buffer, and stream buffer are parts of the cache system, while the branch predictor is closely tied to the deeply pipelined processor. Thus, in the first stage, each component was tested independently; in the second stage, verification was divided into two groups: the three cache features were added to the previous 5-stage pipeline and tested one after another, while the branch predictor was tested with the 8-stage deep pipeline. Finally, the enhanced cache system was added to the pipeline with the branch predictor.

In short, the verification of this project was divided into groups of dependent components and carried out from the basic level up the hierarchy. We used the lab5, lab6, and project mystery programs as a graded sequence of test programs, together with input-vector command files and some simple test programs.

3.1 Deep Pipelining Processor

To make sure each version of the processor worked (the original 5-stage pipeline, the 7-stage pipeline, the 8-stage pipeline, and the 8-stage pipeline with caches), we tested the whole processor using critical test files we wrote ourselves (essentially the corner cases we came up with) and the lab5 mystery test file. When testing the 8-stage pipeline, we had to add delay slots to all test files: three delay slots for every branch instruction and two for the jump-register instruction. Although testing the pipelined processor was much more complex, we followed the same testing rules throughout. Whenever we found an instruction that did not execute as expected, we used both the schematic and the waveforms to trace it, checking whether we got the expected forwarding results, computational results, and control signals in each stage, where applicable.

When we received new components from the members designing the branch predictor, stream buffer, victim cache, and write buffer, we added them to the 8-stage pipelined processor one by one instead of combining them all at once. This approach added complexity to the processor gradually, which made testing relatively easy. Only after one component passed testing was the next added in.

We found the "break" command very convenient when debugging the processor: it takes us directly to the point of interest.

3.2 Branch Predictor

To make sure the branch prediction and flush mechanisms work, we separated the testing process into two phases. In phase one, we disabled the predictor, so every branch is predicted not taken; we also ran a working 5-stage pipeline on the same test program and compared the two, especially checking that the PC and instruction sequences matched. In phase two, we added the predictor and compared the two results again.


3.3 Stream Buffer

The stream buffer was first tested with input vectors written in a command file, then incorporated into the pipelined processor and tested with the mystery programs. Since it is functionally independent of the other features in the project, further testing with those features included caused no trouble with the stream buffer.

3.4 Victim Cache

We performed the following verification to make sure the victim cache functions as designed:

- Since the victim cache communicates with the data cache controller and the arbiter, we tested the victim cache with a fake data cache controller and a fake arbiter, modified from the VHDL files of the real cache controller and arbiter. The purpose of this test was to verify the control signals between the victim cache, the cache controller, and the arbiter.
- Testing the victim cache with our working 5-stage pipeline from lab6.
- Testing the victim cache in our newly designed 8-stage pipeline.

The results from the lab5_mystery.s, lab6_mystery.s, and partial_sum.s programs show that the victim cache works with both our 5-stage and 8-stage pipeline processors.

3.5 Write Buffer

To make sure the revised cache controller works, we first used a simple test program to check that "lw" and "sw" go through all the states of the designed FSM. Next, we added the write buffer into the pipeline and compared the cache contents with the correct results.

Results:

1. Performance Analysis

1.1 Deep Pipelining Processor

In comparison with the 5-stage pipeline, the number of cycles increases because of the extra stalls. Yet we have reduced our cycle time to approximately half of the original.

Cycle time

                Five-Stage Processor   Eight-Stage Processor
Two memories    43 ns                  20.5 ns
Two caches      52 ns                  30 ns


The number of cycles with the lab5 mystery program

                5-Stage Processor   8-Stage Processor        8-Stage Processor (with victim
                                    (w/o branch predictor)   cache, stream & write buffer)
Two memories    1200 cycles         1390 cycles              -
Two caches      2700 cycles         6500 cycles              4600 cycles

1.2 Branch Predictor

We ran lab6_mystery.s with our branch prediction. Since the quick sort has only five branches, and most of the branch decisions are taken, the performance gain from our branch prediction is large: compared to running without the prediction mechanism, we got a gain of approximately 15 percent.

1.3 Stream Buffer

The stream buffer achieves a good performance gain for large programs, but a small and even negative gain for small ones. This is because few instruction misses occur when executing small programs, and the extra stalls introduced by transfers between the stream buffer and the arbiter hurt performance. The comparison of running different programs before and after adding the stream buffer is as follows:

Application            Performance   With stream       w/o stream
                       gain          buffer (cycles)   buffer (cycles)
Project Test Program   3.4%          2664              2951
Lab6 Mystery Program   -1.9%         11083             10877
Lab5 Mystery Program   10%           1589              1645

1.4 Victim Cache

The results of our experiments on the performance of the victim cache, using the lab6_mystery.s program and our 5-stage pipeline processor, are shown in the following table:

Test condition: lab6_mystery.s, 5-stage pipeline, only dirty data written to the victim cache

Number of hits   Without victim cache   With victim cache   Speedup
15               10,006 cycles          10,142 cycles       -1.36%

In our experiment, the victim cache made the pipeline performance worse because:

- The cycles saved from the 15 hits do not compensate for the other overhead the victim cache adds to the pipeline, such as wait cycles.
- The victim cache causes more read misses in the data cache, as observed from the simulation waveform. This can be explained as follows: the data cache uses a random algorithm to determine which data to replace, but our hardware can only generate a pseudo-random sequence, i.e., the sequence is the same for every simulation. Read hits in the victim cache shift this sequence. For example, suppose that at clock cycle 100 the number generated is 0, so data X in the lower half of the block is replaced while data Y in the upper half stays in the data cache; then we get no read miss if Y is referenced later. If instead there is a read hit in the victim cache before cycle 100, the replacement decision happens earlier (a victim cache hit saves some cycles), say at cycle 95, where the number generated is 1; data Y in the upper half is replaced, and when Y is referenced later we take a read miss in the data cache.

The result does not present the true effect of the victim cache on the performance of the pipelined processor:

- One test is not enough to measure the performance of the victim cache.
- There are very few hits in the victim cache. The 15 hits in roughly 10,000 cycles count only dirty data written back to DRAM; in this test, data that are not dirty are not written to the victim cache. There would be more victim cache hits if clean data were also written to it.
- The algorithm the data cache uses to determine which data to replace depends on a sequence of 0s and 1s, which in turn depends on when read hits occur in the victim cache (an example of such a generator is sketched below).
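The pseudo-random replacement bit described above typically comes from an LFSR. This 4-bit example is an assumption for illustration, not the actual generator in our data cache; it shows why sampling the bit one cycle earlier (after a victim cache hit) changes every later replacement decision.

library ieee;
use ieee.std_logic_1164.all;

entity lfsr4 is
  port (
    clk     : in  std_logic;
    advance : in  std_logic;   -- pulses once per replacement decision
    rnd_bit : out std_logic);  -- selects which half of the set to evict
end lfsr4;

architecture rtl of lfsr4 is
  signal r : std_logic_vector(3 downto 0) := "1001";  -- any non-zero seed
begin
  rnd_bit <= r(0);

  process (clk)
  begin
    if rising_edge(clk) then
      if advance = '1' then
        -- primitive feedback: steps through all 15 non-zero states
        r <= (r(1) xor r(0)) & r(3 downto 1);
      end if;
    end if;
  end process;
end rtl;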

1.5 Write Buffer

We ran lab6_mystery.s with the write buffer. Since we save many cycles on store and load words, the performance gain is considerable. One problem is that if a new memory access arrives before the previous procedure ends, the pipeline must be stalled; since reads and writes in the test program are frequent, the stalls cannot be eliminated completely. We got a 15% performance gain with the write buffer.

2. Critical path analysis:

The critical path in the 8-stage pipelined processor with I-cache and D-cache (any instruction followed by a dependent branch):

EX1 pipeline register + forwarding mux + branch comparator + branch logic gate + PCsrc mux + J mux
= 3 + 3.5 + 5 + 5 + 2.5 + 1.5 = 20.5 ns

The critical path in the 8-stage pipelined processor with I-cache and D-cache (lw followed by jr):

MEM2 pipeline register + D-cache access time + mem-to-reg mux + jr forwarding mux + J mux
= 3 + 19.5 + 1.5 + 3.5 + 2.5 = 30 ns


Conclusion:

In this project we aimed for a deeply pipelined processor with four performance-enhancing features. In the end, the pipeline works well with the three memory-system features, while the branch predictor proved more complicated than we expected.

In this project we again experienced a situation similar to lab6: features were designed ambitiously for performance without enough consideration of the time and effort needed for testing and interface debugging. The most important lesson we learned from this class is that a good testing methodology is essential to a good design, and that budgeting design complexity against the goal and the available design time is critical to getting a good design implemented on time.

We would like to sincerely thank the professor and the TAs for their help during this class. We really appreciate the experience of processor design throughout the semester. What we learned was not only the computer architecture course material, but also the skills of designing and testing a complete system.

Appendix I (schematics or block diagrams):

Deep Pipelining: 8_stage
Branch Predictor: Branch_predictor.1
Stream Buffer: stream buffer.l
Victim Cache: Victim Cache block diagrams.doc
Write Buffer: writebuffer

Appendix II (VHDL files): The following VHDL source code covers the components added or changed in lab7.

Deep Pipelining: Ex1_forwarding.vhd, Mem1_forwarding.vhd, Mem2_forwarding.vhd, Stall.vhd

Branch Predictor: Branch_predictor.vhd

Stream Buffer: stream buffer control.vhd

Victim Cache: victim_cache_cntrl.vhd, arbitor71.vhd, fake_arbitor.vhd, fake_cache_cntrl.vhd

Write Buffer: Cache_ctrl.vhd

Appendix III (diagnostic programs):

Deep Pipelining: Pipe.cmd, Pipe2.cmd, Pipe3.cmd

Branch Predictor: Branch.cmd

Stream Buffer: stream buffer.cmd, lab5 mystery program, lab6 mystery program, partial sum program

Victim Cache: victim_cache.cmd, v_d_a.cmd

Write Buffer: Wirte_buffer.cmd

Appendix IV (online notebooks):

Gerald: Gerald_notebook.txt
Huifang: Huifang_notebook.doc
Kahn: Kahn_notebook.doc
Mei: Mei_notebook.txt
Yuchi: Yuchi_notebook.txt