cs 152 final projectkubitron/courses/cs15… · web viewthe cpi on the quicksort program was 3691...

CS 152 FINAL PROJECT

The “Bigmouth” Processor

The Team: James Chien, Thomas Lee, Danh Nguyen, Joe Suh

The Prof: Kubi

The TA: Victor Wen

The Date: December 8, 1999

About the Processor

The “Bigmouth Processor” is a 4-stage pipelined processor which implements a subset of the MIPS instruction set. In addition to the basic functionality, the datapath features a 32-bit multiplier/divider unit. The processor also has an optimized memory system which features a 64-bit memory bus, fully associative cache lookup using content addressable memory, a 4-line victim cache and a FIFO write buffer. In addition, the processor has a TLB which supports a virtual memory system with 64-word pages. The “Bigmouth Processor” team is also in the process of designing a 7-stage pipeline. Through the tracer test, we have determined that most of the instructions work in simple cases, but due to our limited time schedule the deep pipelined processor remains in testing.

Diagram of Processor

Performance Summary

Our processor runs at a minimum cycle time of 50 ns (20 MHz).
The CPI on the Quicksort program was 3691 cycles / 1596 instructions = 2.31.
From Lab 6 to Lab 7, the number of cycles dropped by 17% on the Lab6_mystery program.

Lab 6 vs. Lab 7 processor on a number of different programs.

[Figure: processor datapath diagram. Recoverable labels: Instruction Memory, Register File, ALU, Data Memory, Mux, and the IF/ID, ID/EX and EX/MEM pipeline registers.]

Processor Features

Control

The Main Control unit is a VHDL component which accepts an instruction during the Instruction Decode Stage of our pipeline and outputs the proper signals depending on the instruction. There is also a hazard detection unit which detects data hazards in the pipeline as well as necessary forwarding logic to handle these cases.

Datapath

Currently, our working processor has a 5-stage pipeline similar to the design in Patterson and Hennessy, but it lacks an IX/MEM register to write to, so memory writes take place in the same cycle in which the next instruction is being fetched.

Our team was also working on a 7-stage “Not So Deep” pipeline, which had entered the testing phase when we ran out of time. That pipeline has been tested and works for some of the less hazardous cases, but is not yet fully functioning.

Memory System

Fully Associative Cache with Content Addressable Memory

The cache is fully associative, meaning that a new entry can replace any old entry, as chosen by the replacement policy (LRU in our case). Although this improved the hit rate of the cache, it may also have increased the cycle time, because we had to add several comparators to compare the incoming address with all the cache tags.
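As a rough illustration of the lookup and LRU replacement described above, the following Python sketch models a fully associative cache in software. The class name, the `num_lines` parameter and the ordered dictionary are assumptions made for illustration only; in the actual hardware all tag comparisons happen in parallel in the CAM.

```python
from collections import OrderedDict

class FullyAssociativeCache:
    """Toy model: any block may live in any line; LRU picks the victim."""

    def __init__(self, num_lines):
        self.num_lines = num_lines      # hypothetical parameter, not from the report
        self.lines = OrderedDict()      # tag -> data, least recently used first

    def access(self, tag, load_from_memory):
        """Return (hit, data, evicted_tag) for one access."""
        if tag in self.lines:
            self.lines.move_to_end(tag)             # mark as most recently used
            return True, self.lines[tag], None
        evicted = None
        if len(self.lines) >= self.num_lines:
            evicted, _ = self.lines.popitem(last=False)  # drop the LRU entry
        data = load_from_memory(tag)
        self.lines[tag] = data
        return False, data, evicted
```

The returned `evicted_tag` is where a victim cache, if present, would pick up the displaced block.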

64 Bit Bus

We exploited spatial locality by incorporating a 64-bit bus. On a cache miss, we load 2 words of data and put them both in the same block (of 2 words). This created a major (unforeseen) problem. A write only writes to one memory location, but it also updates the tag for BOTH words in the block. Therefore the other word in the block is no longer correct (with the new tag). We had to correct this problem by introducing two additional sets of registers to keep track of which “odd” and “even” words in the cache are valid.
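The odd/even valid-bit fix can be sketched as follows. This is one reading of the description above, not the team's VHDL: the class and method names are hypothetical, and the assumption is that a write which changes the shared tag leaves only the written word valid.

```python
class TwoWordBlock:
    """One cache block holding an even and an odd word under a shared tag."""

    def __init__(self):
        self.tag = None
        self.words = [0, 0]            # index 0 = even word, 1 = odd word
        self.valid = [False, False]    # one valid bit per word

    def fill(self, tag, even_word, odd_word):
        """Miss fill over the 64-bit bus: both words arrive together."""
        self.tag = tag
        self.words = [even_word, odd_word]
        self.valid = [True, True]

    def write(self, tag, word_index, value):
        """Store one word; a tag change invalidates the other word."""
        if tag != self.tag:
            self.tag = tag
            self.valid = [False, False]
        self.words[word_index] = value
        self.valid[word_index] = True

    def read(self, tag, word_index):
        """Return the word on a hit, or None on a miss."""
        if tag == self.tag and self.valid[word_index]:
            return self.words[word_index]
        return None
```

Without the per-word valid bits, a read of the untouched word after such a write would match the new tag and return stale data.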

Victim Cache

We have two victim caches (one for instructions and one for data). Each has 4 lines and holds a total of 8 words. When the main cache is full, it bumps a block of its data out to the victim cache. On a subsequent memory access, if there is a main cache miss, the victim cache is also checked to see if one of its tags matches the address. If it does, the data is output from the victim cache.
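A minimal software sketch of this behaviour is shown below. The 4-line size comes from the text; the FIFO replacement inside the victim cache and all names are assumptions, since the report does not say how a full victim cache chooses what to drop.

```python
from collections import OrderedDict

VICTIM_LINES = 4  # from the text: each victim cache has 4 lines

class VictimCache:
    def __init__(self):
        self.lines = OrderedDict()  # tag -> block data, oldest first

    def insert(self, tag, block):
        """Accept a block bumped out of the main cache."""
        if tag in self.lines:
            del self.lines[tag]
        elif len(self.lines) >= VICTIM_LINES:
            self.lines.popitem(last=False)  # assumed FIFO replacement
        self.lines[tag] = block

    def lookup(self, tag):
        """Checked only after a main-cache miss; None means go to DRAM."""
        return self.lines.get(tag)
```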

Write Buffer

The write buffer was used to enhance our write-through policy. Before we had a write buffer, we had to stall the pipeline on every sw instruction to write through to both the cache and the memory. Now, with the write buffer, we can write the sw data into a FIFO buffer and leave it there until the memory is ready to write the data in. Thus the pipeline need not be stalled, and the store’s 3 cycles can be performed when the memory is ready. Our write buffer was 4 words in size, and the pipeline has to stall when all 4 entries are filled. The write buffer was most efficient for programs that do many stores in a loop, especially if there are other instructions, such as R-format instructions, between the sw’s so that the write buffer can drain and doesn’t fill up (e.g. merge sort and quick sort).
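The FIFO behaviour described above can be modelled roughly as follows. The depth of 4 is from the text; the class, the method names and the dictionary standing in for main memory are assumptions for illustration.

```python
from collections import deque

class WriteBuffer:
    def __init__(self, depth=4):
        self.depth = depth          # 4 entries, as in the report
        self.entries = deque()      # pending (address, value) stores, oldest first

    def store(self, address, value):
        """Return True if accepted, False if the pipeline must stall."""
        if len(self.entries) >= self.depth:
            return False
        self.entries.append((address, value))
        return True

    def drain_one(self, memory):
        """Called when memory is ready: retire the oldest pending store."""
        if self.entries:
            address, value = self.entries.popleft()
            memory[address] = value
```

A loop that issues one sw followed by a few R-format instructions gives `drain_one` time to run between stores, which is why such programs rarely see `store` return False.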

Translation-Lookaside Buffer (TLB)

The TLB holds up to 8 entries. During instruction fetch it takes in the top 20 bits of the virtual address and uses 5 of these bits to check if the page exists in main memory. If the page has not been accessed recently and does not have an entry in the TLB, the datapath stalls to fetch the correct entry from the page table. This results in a penalty of about 4 cycles to fetch the TLB entry from main memory. However, it is certainly worth the penalty compared to not having a TLB and having to access memory between every instruction to test if the page was in memory. In fact, on the lab_6 mystery program, the TLB had a hit rate of over 99% after 3000 cycles. To simplify things, we decided to implement a page table with a maximum of 32 pages, which meant using 5 bits for our virtual page number. The bottom 6 bits of the virtual address make up the page offset, so we support pages of 64 words. The translation is also a direct linear mapping, so the virtual page number corresponds linearly to the physical page number (the operating system would usually take care of this). We included the TLB for instruction accesses, but it could be used for data accesses as well; for that case it holds a dirty bit and a reference bit and uses a pseudo-random page replacement algorithm which can send kicked-out pages to be written to memory.
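The address split and the miss penalty can be sketched as below. The 5-bit page number, 6-bit offset, 8 entries, linear mapping and roughly 4-cycle penalty come from the text; treating addresses as word addresses, the oldest-first eviction and all names are assumptions made to keep the example short.

```python
OFFSET_BITS = 6    # 64-word pages
VPN_BITS = 5       # at most 32 pages
TLB_ENTRIES = 8
MISS_PENALTY = 4   # approximate cycles to fetch an entry from the page table

page_table = list(range(1 << VPN_BITS))  # linear mapping: page n -> frame n
tlb = {}  # vpn -> physical page number (dicts keep insertion order)

def translate(vaddr):
    """Return (physical_address, stall_cycles) for a word address."""
    vpn = (vaddr >> OFFSET_BITS) & ((1 << VPN_BITS) - 1)
    offset = vaddr & ((1 << OFFSET_BITS) - 1)
    stall = 0
    if vpn not in tlb:
        stall = MISS_PENALTY
        if len(tlb) >= TLB_ENTRIES:
            del tlb[next(iter(tlb))]  # evict oldest entry (assumed policy)
        tlb[vpn] = page_table[vpn]
    return (tlb[vpn] << OFFSET_BITS) | offset, stall
```

Because consecutive instructions usually fall in the same 64-word page, only the first fetch from each page pays the penalty, which is consistent with the high hit rate reported above.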

VERY IMPORTANT NOTE: TO MAKE THE DATAPATH (with tlb) WORK, THE PROGRAM MUST HAVE FOR ITS FIRST 32 INSTRUCTIONS THE NUMBERS (0, 1, 2, …. 31). (SO THE ACTUAL PROGRAM STARTS ON LINE 33). THIS IS NECESSARY BECAUSE THAT’S WHERE THE PAGE TABLE WOULD BE FILLED.

Extra Stuff

The Monitor module tracks the number of cycles that pass. There are also a number of hit/miss counters that record the hit/miss rates for the cache, write buffer and TLB.

Performance Summary

Top 3 Critical Paths

Path A: On a memory access (during either lw or sw), we need a 50 ns clock because the DRAM request unit uses 50 ns cycles. We did this because we wanted to keep RAS and CAS simple. We could have shrunk this time by giving the DRAM request unit more states that use shorter clock cycles instead.

Path B: The next runner-up is an instruction, such as a bne branch, that requires forwarding from the MEM stage. Forwarding from the MEM stage must go through the comparator in the TLB (5 ns), then through the comparator in the main cache (4 ns), then some logic (including a mux and a tristate buffer inside the cache to output the data, 5 ns). Then the data needs to go through another mux (1.5 ns) and another mux (2.5 ns) that is selected by the forwarding unit. Finally, the output data goes through an adder (6 ns) (assuming the branch is taken). That new address needs to go to another mux (2.5 ns) controlled by the branch determination unit to see if the branch is taken. So all together, this critical path needs about 26.5 ns. We realize that we could probably cut this down by calculating the branch in the IF stage (and save 6 ns from the adder). But since this isn’t our real critical path (memory is), we didn’t feel obligated to change it.
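The Path B total can be checked by summing the stage delays quoted above. The short labels are paraphrases for illustration; the delays and the 50 ns clock are the figures given in the text.

```python
# Delays in nanoseconds along Path B, in the order the signal travels.
path_b_ns = [
    ("TLB comparator", 5.0),
    ("main cache comparator", 4.0),
    ("cache output logic (mux + tristate)", 5.0),
    ("mux", 1.5),
    ("forwarding mux", 2.5),
    ("branch target adder", 6.0),
    ("branch select mux", 2.5),
]

total_ns = sum(delay for _, delay in path_b_ns)
clock_ns = 50.0                      # cycle time set by the DRAM request unit
slack_ns = clock_ns - total_ns       # unused time in each cycle on this path

print(f"Path B: {total_ns} ns, slack against the clock: {slack_ns} ns")
```

The sum is 26.5 ns, leaving 23.5 ns of slack, which supports the conclusion that memory rather than this logic path sets the cycle time.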

Path C: The last contestant for critical path is very similar to Path B, except that the forwarded data comes from the EXE stage. The path therefore goes through the ALU in the EXE stage (10 ns) and 2 muxes (5 ns total), and the rest is the same as in Path B. This is a total of 26 ns.

As can be seen from our 3 worst paths, the memory latencies are the bottlenecks of our pipeline.

Performance Analysis

The 2 main performance enhancements over lab 6 were the implementations of the victim cache and the write buffer. Both reduced the total number of cycles compared with the lab 6 processor.

The following is a table of the number of cycles needed for certain programs to run with both the lab 6 processor and the memory-enhanced lab 7 processor:

Test Program       Cycles for Lab 6 Processor   Cycles for Lab 7 (with victim cache and write buffer)
joetest.s          303                          280
lab 5 mystery      503                          501
lab 6 mystery      4498                         3691
lab 7 mystery      16100                        13298
harderstoretest    1790                         1758

As can be seen from our data, the lab 7 enhancements reduced the number of cycles by more than 17% for both the quicksort and the mergesort programs. The other programs also improved. The amount of improvement depends on the number of memory access operations performed by the program. This is why the lab 5 mystery program didn’t improve much: it simply performs various random, fairly independent instructions which don’t require a good caching system.
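The percentage reductions follow directly from the cycle counts in the table. The sketch below recomputes them; the dictionary layout and function name are illustrative choices.

```python
# (Lab 6 cycles, Lab 7 cycles) for each test program, from the table above.
cycles = {
    "joetest.s": (303, 280),
    "lab 5 mystery": (503, 501),
    "lab 6 mystery": (4498, 3691),
    "lab 7 mystery": (16100, 13298),
    "harderstoretest": (1790, 1758),
}

def reduction_percent(lab6, lab7):
    """Percentage of Lab 6 cycles saved by the Lab 7 processor."""
    return 100.0 * (lab6 - lab7) / lab6

for name, (lab6, lab7) in cycles.items():
    print(f"{name}: {reduction_percent(lab6, lab7):.1f}% fewer cycles")
```

The two sort programs (lab 6 and lab 7 mystery) come out at about 17.9% and 17.4%, while lab 5 mystery improves by well under 1%.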

Cycle time for both processors was 50 ns (20 MHz clock). We weren’t surprised that there was no improvement, because we had already implemented the CAM-based cache in lab 6 and had already built some of our components as schematics instead of VHDL where possible.

Since CPI = # of cycles / # of instructions, the CPI for each benchmark program can be calculated once the total number of instructions a program executes is known. This was done by creating a hit counter associated with the instruction cache and running it with each program:

Joetest.s = 69 instructions
Lab 6 mystery program = 1596 instructions
Lab 7 mystery program = 5889 instructions
Harderstoretest.s = 293 instructions

Test Program       CPI for Lab 6   CPI for Lab 7 (with victim cache and write buffer)
Joetest.s          4.39            4.058
lab 6 mystery      2.8183          2.313
lab 7 mystery      2.734           2.2581
Harderstoretest    6.109           6

Clearly, the average number of cycles per instruction was lowered thanks to the victim cache and the write buffer.
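The CPI figures can be reproduced from the cycle and instruction counts reported above. The data layout and function name below are illustrative.

```python
# program -> (instructions executed, Lab 6 cycles, Lab 7 cycles)
benchmarks = {
    "Joetest.s": (69, 303, 280),
    "lab 6 mystery": (1596, 4498, 3691),
    "lab 7 mystery": (5889, 16100, 13298),
    "Harderstoretest": (293, 1790, 1758),
}

def cpi(cycle_count, instruction_count):
    """Cycles per instruction."""
    return cycle_count / instruction_count

for name, (instrs, lab6, lab7) in benchmarks.items():
    print(f"{name}: Lab 6 CPI {cpi(lab6, instrs):.3f}, Lab 7 CPI {cpi(lab7, instrs):.3f}")
```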

Execution time = CPI * Instruction Count * cycle time = # of cycles * cycle time
Execution time = # of cycles * 50ns

The following illustrates the execution time difference between the benchmark programs run on Lab 6 and Lab 7 processors:

Test Program       Exec time for Lab 6 processor   Exec time for Lab 7 processor
joetest.s          15.15us                         14us
lab 5 mystery      25.15us                         25.05us
lab 6 mystery      224.9us                         184.55us
lab 7 mystery      .805ms                          .6649ms
Harderstoretest    89.5us                          87.9us
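These execution times are simply the cycle counts multiplied by the 50 ns cycle time. A small sketch of the conversion, with an illustrative function name:

```python
CYCLE_TIME_NS = 50  # both processors run at 20 MHz

def exec_time_us(cycle_count):
    """Execution time in microseconds for a given number of cycles."""
    return cycle_count * CYCLE_TIME_NS / 1000

# Lab 6 mystery (quicksort): 4498 cycles on Lab 6, 3691 cycles on Lab 7.
print(exec_time_us(4498), exec_time_us(3691))
```

Since the cycle time did not change between the two labs, the speedup in execution time equals the reduction in cycle count.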

Why Performance Improved

The victim cache reduces the number of processor cycles by reducing the number of conflict misses. The write buffer eliminates many of the stalls that the processor previously underwent by storing all pending sw instructions in the 4-word write buffer.

Testing Philosophy

These are some of the testing strategies we used to ensure our processor functioned correctly.

1) Individually testing each component without including it in the datapath: This helped to isolate problems because we could make sure each piece of the processor was functioning correctly. Also, by tracing through the VHDL code for individual components one line at a time, and looking at the signal output of Vwaves, we were able to check that each component conformed to the correct timing and performed the right logic.

2) Using Joetest, our tracer program: Joetest is a program which loads into memory and then tests each MIPS instruction individually. This helped us isolate which particular instructions were causing problems and modify our datapath or control signals accordingly. Similar to Joetest, Joetest2 was a program written specifically to make sure that the combinations of instructions that might cause data hazards were handled correctly in our processor.

3) Looking at the contents of memory: After running a program like lab6_mystery program or lab5_mystery program, you can always get a good idea of whether the processor is doing the right thing by matching the contents of memory with what you expect the memory to output.

4) Using VHDL monitors to follow signals: We also used some modules written in VHDL which periodically checked the signals of other VHDL components and signalled an error if one occurred.

5) Looking at the current state of the processor in Viewdraw: Using this strategy combined with a debug program like Joetest helps to determine and isolate the cause of problems in our design.

6) Running lots of code: We wrote lots of different programs such as a multiplication test as well as Lab4_mystery, Lab5_mystery and Lab6_mystery programs. Since these older programs also ran correctly on our processor, we have strong confidence that our processor works.

Appendix Includes

Block Diagrams and Schematics: At the back

Exhibits
1-4. Main Datapath pg (1-5)
5. Cache System with CAM
6. Victim Cache
7. Write Buffer
8. TLB
9. Multiplier Unit
10. Deep Pipelined Datapath (1-5) (work still in progress)

PS File Contains

Test Programs: Cmd files and .S files in ps file
VHDL Code: also in ps file
References:
For Victim Cache: Kubi’s notes, Lecture #19
For TLB: Patterson and Hennessy pg 580-602
Online Logs: Thomas Lee, James Chien, Joe Suh

Exhibit 1: Main Datapath Fetch Stages

Exhibit 2: Main Datapath Decode Stages

Exhibit 3: Main Datapath Execute Stage

Exhibit 4: Memory Stages

Exhibit 5: 64 bit Cache System

Exhibit 6: Victim Cache

Exhibit 7: Write Buffer

Exhibit 8: TLB

Multiplier and Divider
