Team Antelope Final Presentation
What doesn’t kill you, makes you stronger
"The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at or repair."
James Zirkle, John Lange
Peter Johnson, Chris
Processor Overview
Despite all my rage I'm still just a rat in a cage --Bullet With Butterfly Wings
• 5 stage pipeline
• 10 nanosecond clock
• 128 bit memory
• Split Caches
– Write back policy
• CLZ and Multiply simplified to 1 clock cycle
• MicroSequencer used to handle complex operations
Who did what
• James – Register File, Integration
• Jack – Cache, Memory, ALU
• Peter – Shifter, hazard detection unit
• Chris – Multiplier, CLZ, interrupts
ALU
“Quidquid latine dictum sit, altum videtur”
• Handles all 16 data processing instructions
• Determines PSR flag values
• 4 bit carry look ahead units, combined into 16 blocks
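A minimal sketch of a 4-bit carry lookahead unit and one way to chain such units into a 32-bit adder. The names `cla4` and `add32` are hypothetical, and chaining eight units by rippling the block carry is an assumption; the slide does not say how the units are combined.

```python
def cla4(a, b, cin):
    """Add two 4-bit values with carry lookahead; returns (sum, carry_out)."""
    g = [(a >> i) & (b >> i) & 1 for i in range(4)]      # generate bits
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]    # propagate bits
    c = [cin & 1]
    for i in range(4):
        # Written as a recurrence here; hardware flattens each carry into a
        # two-level expression of g, p and cin so all carries are ready together.
        c.append(g[i] | (p[i] & c[i]))
    s = sum((p[i] ^ c[i]) << i for i in range(4))
    return s, c[4]


def add32(a, b, cin=0):
    """Chain eight 4-bit units into a 32-bit adder (block layout is an assumption)."""
    result, carry = 0, cin
    for blk in range(8):
        s, carry = cla4((a >> (4 * blk)) & 0xF, (b >> (4 * blk)) & 0xF, carry)
        result |= s << (4 * blk)
    return result, carry
```

The final carry out of `add32` is the value that would feed the C flag of the PSR.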
Shifter
• 32 bit Barrel Shifter
• Logical Shift Left/Right, Arithmetic Shift Right, Rotate Right, Rotate Right Extended
• Special Cases (LSR #0 encodes LSR #32, etc)
• Generates result by combining individual bit shifters
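The shift types and special cases listed above can be sketched behaviourally as follows. This models the ARM immediate-shift encodings only (shift amount 0 to 31); the function name `shift_imm` and the string shift names are hypothetical, and the code says nothing about how the team's hardware is structured.

```python
MASK = 0xFFFFFFFF


def shift_imm(value, kind, amount, carry_in):
    """Return (result, carry_out) for an ARM immediate shift; amount is 0..31."""
    value &= MASK
    if kind == "LSL":
        if amount == 0:
            return value, carry_in               # LSL #0: value and carry unchanged
        return (value << amount) & MASK, (value >> (32 - amount)) & 1
    if kind == "LSR":
        n = amount or 32                         # LSR #0 encodes LSR #32
        return (value >> n) & MASK, (value >> (n - 1)) & 1
    if kind == "ASR":
        n = amount or 32                         # ASR #0 encodes ASR #32
        signed = value - (1 << 32) if value & 0x80000000 else value
        return (signed >> n) & MASK, (signed >> (n - 1)) & 1
    if kind == "ROR":
        if amount == 0:                          # ROR #0 encodes RRX
            return ((carry_in << 31) | (value >> 1)) & MASK, value & 1
        rotated = ((value >> amount) | (value << (32 - amount))) & MASK
        return rotated, (value >> (amount - 1)) & 1
    raise ValueError(kind)
```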
Barrel Shifter
Result propagated through bit shifters
Added 32-bit Shifters
16 Bit-Right Shifter
32-Bit Barrel Shifter
Carry In / Carry Out
-Carry in only used in RRX (rotate right extended) operations
-Carry out always computed, even though not needed in rotate operations
Carry Out Logic: Two Options
• Separate logic computes Cout early using input and shift amount
Pros:
-Cout signal ready much earlier, no need for propagation
-Simpler bit shifter designs
Cons:
-Many more gates needed
Carry Out Logic: Two Options
• Individual bit shifters compute and propagate Cout signal
Pros:
-Simpler overall design
-Fewer logic gates
Cons:
-Takes longer for Cout to be ready (propagation delay)
-More complicated bit shifters
Carry out: Conclusion
• Went ahead and implemented Cout logic in the bit shifters
-Don’t really need the signal to be ready any earlier than the rest of the shifter output, especially not at the additional gate cost
-Each shifter computes Cout for its own shift amount and passes it on, or leaves Cout alone if it is disabled
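A rough sketch of the chosen option for a logical shift right: each fixed-width stage either shifts and produces its own Cout, or passes both the value and the incoming Cout through when disabled. The stage order (16, 8, 4, 2, 1) and the names `lsr_stage` and `barrel_lsr` are assumptions made for illustration.

```python
MASK = 0xFFFFFFFF


def lsr_stage(value, cout, width, enabled):
    """One fixed-width logical-right bit shifter."""
    if not enabled:
        return value, cout                        # disabled: leave value and Cout alone
    return value >> width, (value >> (width - 1)) & 1


def barrel_lsr(value, amount, carry_in=0):
    """Shift right by 0..31 by chaining the stages selected by the amount bits."""
    value, cout = value & MASK, carry_in
    for width in (16, 8, 4, 2, 1):
        value, cout = lsr_stage(value, cout, width, bool(amount & width))
    return value, cout
```

The last enabled stage sees the input already shifted by the earlier stages, so the bit it drops is bit `amount - 1` of the original input, which is the architectural carry out.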
Complete Shifter
Multiplier (MUL/MLA)
• 32 additions in parallel
• Logarithmic time result
• 2^5 = 32, so time equals 5 adds
• Multiply w/accumulate inserted at the end with a multiplexor
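One way to picture the logarithmic-time claim is a pairwise adder tree over the 32 partial products, sketched below. The function name `tree_multiply` and the returned level count are illustrative assumptions; the real unit's adder arrangement is not described on the slide.

```python
MASK = 0xFFFFFFFF


def tree_multiply(a, b, acc=0, accumulate=False):
    """Return (low 32 bits of result, number of adder levels used)."""
    # One partial product per multiplier bit: a shifted left by i, or zero.
    terms = [((a << i) & MASK) if (b >> i) & 1 else 0 for i in range(32)]
    levels = 0
    while len(terms) > 1:
        # Each level adds neighbouring pairs; in hardware these adds run in parallel.
        terms = [(terms[i] + terms[i + 1]) & MASK for i in range(0, len(terms), 2)]
        levels += 1
    product = terms[0]
    # MLA: a multiplexer selects the accumulated value at the end.
    result = (product + acc) & MASK if accumulate else product
    return result, levels
```

With 32 partial products the list halves five times, which matches the "5 adds" figure above.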
Count leading zeros (CLZ)
• Output equals number of leading zeros on the input (Ex: 00010110 -> 00000011)
• First step: 00010110 -> 00011111
• Then, add one: 00011111 -> 00100000
• Lastly, convert to binary. With a 32-bit input, output will have a 6-digit maximum.
• Timing: Only four gate delays.
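The three steps above can be sketched in software as follows. The OR-and-shift loop used for the first step is an assumption (the slide shows only the result of that step), and the loop is sequential, so it says nothing about the gate delays of the hardware version.

```python
def clz(x, width=32):
    """Count leading zeros of x, treated as a width-bit value."""
    x &= (1 << width) - 1
    # Step 1: copy the highest set bit into every lower position.
    shift = 1
    while shift < width:
        x |= x >> shift
        shift <<= 1
    # Step 2: add one, leaving a single set bit just above the run of ones.
    one_hot = x + 1
    # Step 3: convert the position of that bit to a binary count.
    return width - (one_hot.bit_length() - 1)
```

For the 8-bit example, `clz(0b00010110, width=8)` passes through `00011111` and `00100000` and returns 3. An all-zero 32-bit input returns 32, which is why the output needs six bits.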
Register File
• 37 Total Registers
• Different modes select between different registers.
• Registers r0-r7 and the PC (r15) are common to all modes
• PSR Mode bits select between different register banks
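A small sketch of how mode bits could select a register bank, assuming the standard ARM banking layout (r8-r14 banked for FIQ, r13-r14 banked for the other exception modes). The mode names and the `physical_register` helper are hypothetical; the slides do not show the team's actual selection logic.

```python
# Registers that have a private copy in each mode (standard ARM layout, assumed here).
BANKED = {
    "fiq": range(8, 15),    # r8-r14
    "irq": range(13, 15),   # r13-r14
    "svc": range(13, 15),
    "abt": range(13, 15),
    "und": range(13, 15),
}
MODES = ["usr", "fiq", "irq", "svc", "abt", "und"]


def physical_register(mode, r):
    """Map (mode, r0..r15) to the name of the physical register it selects."""
    if r in BANKED.get(mode, ()):
        return f"r{r}_{mode}"
    return f"r{r}"


def all_registers():
    """Every physical register, including the CPSR and the five SPSRs."""
    general = {physical_register(m, r) for m in MODES for r in range(16)}
    status = {"cpsr"} | {f"spsr_{m}" for m in MODES if m != "usr"}
    return general | status
```

Under this layout the count works out to 31 general registers plus 6 status registers, which matches the 37 total stated above.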
Register File, Continued
• 3 normal (r0-r15) register outputs.
• 1 input that can access r0-r15
• An input and an output dedicated to the PC
• An input and an output dedicated to the SPSR
Pipeline Design and Component Integration
“The manual for a Ferrari 250 states that replacing the timing chain is a five-step process. Step one is the simple (?) instruction: ‘Invert motor on bench.’”
Pipeline Selection
• Selected a 5 stage pipeline design
– Fetch: Instruction is retrieved from memory
– Decode: Instruction is processed, control signals sent
– Execute: ALU, Shift, Multiply and CLZ operations
– Memory: Data cache/memory access
– Writeback: Results are written back to the register file
Fetch->Decode->Execute->Memory->Writeback
Advantages
• Breaks datapath into logical operational blocks.
• Slower stages can be broken up to increase the clock speed.
• Results in higher throughput
Disadvantages
• More time consuming to implement.
• Data hazards appear, so must implement forwarding and stalls in certain circumstances. This further complicates the design.
Fetch
Decode
Execute
Memory
Writeback
Pipelined Datapath Construction
“Purpose—to drive you to insanity”
• Implemented simple single stage datapath first.
• Used D flip-flops to break up the datapath into the 5 different stages.
• Added memory and cache.
– Stall the pipeline by holding the clock.
Fetch Stage
• Consists of Instruction Cache
• Runs almost every cycle.
• Stalled independently of the rest of the stages while the Sequencer is running.
Execute Stage
• Contains:
– Shifter
– ALU
– Multiplier
– CLZ unit
– Conditional Execution unit
– PSR and PSR control
Memory
• Contains the interface to Data Cache
Writeback
• Writes back to registers
Decode Stage, Continued
• Stage contains:
– Register File
– Sequencer
– Branching logic
• 32 bit shift extender
• 32 bit full adder
• PC is output from the register file straight into the Instruction Cache address
Decode Stage
• Modular design, each instruction type has one module that is connected to a mux
• PLA takes instruction and outputs a 4 bit select signal that selects between all modules.
• Control is contained in a 32 bit bus that is piped through the entire processor.
Current Processor Implementation
Hazards
Read after Write:
1. FETCH DEC EXEC DATA WB
2. STALL FETCH DEC EXEC DATA WB
Hazards
Branch:
1. FETCH DEC EXEC DATA WB
2. FETCH DEC EXEC DATA WB
3. FETCH DEC EXEC DATA WB
4. FETCH DEC EXEC DATA WB (Branch Target)
Hazard Checking Logic
Checks whether Rd (destination register) is read by either of the next 2 instructions
Data Forwarding
(Diagram: FETCH, DECODE, EXEC, DATA BUFFER, WRITEBACK; forwarded values: Result, Data)
Overview
(Diagram: CONDITION EVALUATE, CPSR, Flags)
Interrupt Handler
Component must handle the following seven cases:
1. Reset (Highest Priority)
2. Data Abort
3. FIQ
4. IRQ
5. Prefetch Abort
6. Undefined Instruction
7. Software Interrupt (SWI) (Lowest Priority)
Implementation
• One ROM file handles memory addresses.
• 3-bit input leads to 32-bit address for PC.
• Second ROM file handles CPSR alterations.
• 4-bit input leads to lower 8 bits of CPSR.
• Priorities of the interrupts are handled with CLZ functionality.
• Lastly, no interrupts leads to “Active = 0”.
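A minimal sketch of priority selection with a leading-zero count, assuming the pending interrupts are packed into a 7-bit vector with Reset in the top bit. The table uses the standard ARM exception vector addresses; whether the team's ROM holds exactly these values, and the names used here, are assumptions.

```python
# Priority order from the list above, highest first.
ORDER = ["reset", "data_abort", "fiq", "irq", "prefetch_abort", "undefined", "swi"]
# Standard ARM exception vector addresses (ROM contents are assumed, not from the slides).
VECTOR = {"reset": 0x00, "undefined": 0x04, "swi": 0x08, "prefetch_abort": 0x0C,
          "data_abort": 0x10, "irq": 0x18, "fiq": 0x1C}


def select_interrupt(pending):
    """pending: set of interrupt names. Returns (active, index, new_pc)."""
    width = len(ORDER)
    bits = 0
    for i, name in enumerate(ORDER):
        if name in pending:
            bits |= 1 << (width - 1 - i)      # highest priority sits in the top bit
    if bits == 0:
        return 0, None, None                  # no interrupts: Active = 0
    index = width - bits.bit_length()         # leading-zero count of the vector
    return 1, index, VECTOR[ORDER[index]]
```

The index fits in three bits, which is consistent with the 3-bit ROM input described above.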
Memory
"Memory is like an orgasm. It's a lot better if you don't have to fake it."
-- Seymour Cray
• 128 bit wide Main Memory
• 32 bit Split cache system
– Data and Instruction
• Data Cache operates with Write Back Policy
• 2 State Machines in charge of Memory Control
Main Memory Control
It wasn't very sporting, but what the hell.
- Chuck Yeager on shooting down a landing Me-262
• Simulates memory latency with a delay component
• Implemented with a state machine
– Enters a wait state while holding for memory to finish
• Operation order:
– Data first, Instruction second
• Signals when data is valid, and when operation is finished
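The behaviour described above can be sketched as a cycle-by-cycle trace. The state names (`WAIT`, `VALID`, `DONE`), the fixed `latency` parameter, and the function name are hypothetical stand-ins for the delay component and the two output signals.

```python
def serve_requests(data_req, instr_req, latency=3):
    """Return a per-cycle trace of (state, request being serviced)."""
    # Data is always served before instruction when both are pending.
    queue = [name for name, req in (("data", data_req), ("instr", instr_req)) if req]
    trace = []
    for name in queue:
        for _ in range(latency):
            trace.append(("WAIT", name))      # hold while memory finishes
        trace.append(("VALID", name))         # data-valid signal for this request
    trace.append(("DONE", None))              # operation-finished signal
    return trace
```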
Memory State Machine
Caches
“I'm just here for moral support. Ignore the gun.”
• 128 bit lines separated into 32 bit blocks
• Hits determined by using high address bits, as well as a valid bit
• Write strategy uses Dirty bit to signal when to write to memory
• On reset valid and dirty bits are cleared
• Can operate in 128, 32, and 8 bit modes
– Necessary for memory and processor interface
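A simplified sketch of the hit check and write-back behaviour, assuming a direct-mapped cache with byte addresses and word-sized accesses only. The class names, the number of lines, and the use of the full line address as the tag are illustrative assumptions; the slides do not give the cache's size or associativity.

```python
class Line:
    def __init__(self):
        self.valid = False               # cleared on reset
        self.dirty = False               # cleared on reset
        self.tag = 0
        self.words = [0, 0, 0, 0]        # one 128-bit line as four 32-bit blocks


class WriteBackCache:
    def __init__(self, memory, num_lines=4):     # num_lines is an assumption
        self.memory = memory                     # dict: line address -> list of 4 words
        self.num_lines = num_lines
        self.lines = [Line() for _ in range(num_lines)]

    def _lookup(self, addr):
        word = (addr >> 2) & 0x3                 # which 32-bit block in the line
        tag = addr >> 4                          # high address bits (full line address)
        line = self.lines[tag % self.num_lines]
        if not (line.valid and line.tag == tag):             # miss
            if line.valid and line.dirty:
                self.memory[line.tag] = list(line.words)     # write back the old line
            line.words = list(self.memory.get(tag, [0, 0, 0, 0]))
            line.valid, line.dirty, line.tag = True, False, tag
        return line, word

    def read(self, addr):
        line, word = self._lookup(addr)
        return line.words[word]

    def write(self, addr, value):
        line, word = self._lookup(addr)
        line.words[word] = value & 0xFFFFFFFF
        line.dirty = True                        # memory is updated only on eviction
```

A write touches only the cache; main memory sees the new value when a conflicting address later evicts the dirty line.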
Cache Reset
"A day without killing... is like a day without sunshine"
-John Wayne
• Cache reset controlled by two signals
– RESET and MEM_CLEAR
• When MEM_CLEAR is pulsed a sequencer is engaged
– Adder attached to a flip-flop
– Cycles through addresses, setting values to 0
– Asserts pipeline hold signal while running
• RESET clears all the state machines back to initial state
Memory System Control
"He spoke, I had no clue, it was a mutual relationship."
• Implemented with a state machine
• Interfaces I-Cache, D-Cache, Main Memory, and Pipeline
• During operation, pipeline hold signal is asserted
• Autonomous operation, requires no special datapath control
• Took so much time that it made my girlfriend jealous
Memory Control Overview
Memory Control FSMs
Interrupts
“The nice thing about standards is that there are so many of them to choose from.”
Sequencer
• Built to handle complex operations
– Interrupts, block load/store
• Is basically a clocked ROM file.
– Has a start address and a start signal
– Runs through a sequence of instructions in the ROM file until sequence signals it is done.
– One instruction per cycle is injected into instruction stream, Fetch stage is stalled.
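A behavioural sketch of such a clocked ROM, assuming each ROM entry carries the instruction to inject plus a "done" flag marking the end of its sequence. The class, the entry format, and the method names are hypothetical.

```python
class Sequencer:
    def __init__(self, rom):
        self.rom = rom        # list of (instruction, done_flag) entries
        self.addr = None      # None means the sequencer is idle

    def start(self, start_addr):
        """Start signal: begin stepping from the given ROM address."""
        self.addr = start_addr

    def clock(self):
        """Advance one cycle. Returns (injected_instruction, fetch_stalled)."""
        if self.addr is None:
            return None, False                   # idle: Fetch runs normally
        instruction, done = self.rom[self.addr]
        self.addr = None if done else self.addr + 1
        return instruction, True                 # Fetch is stalled while running
```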
Instructions: Data Processing, Multiply and CLZ
• These instructions move linearly through the pipeline, and don’t require stalls as they are all single cycle in our implementation.
• Present some data hazard problems, but hazard detection and forwarding logic maintains linear execution.
Branch
• On decode, branch immediately adds the PC to the shifted offset and updates the PC.
• No stall necessary, since PC is updated before the next instruction is fetched.
• Branch w/link has r14 updated when branch finishes moving through the entire pipeline.
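The target calculation can be sketched as below, using the ARM branch encoding: a 24-bit signed offset that is sign-extended, shifted left by two, and added to the PC. The function name is hypothetical, and `pc` stands for whatever PC value the decode stage sees.

```python
def branch_target(pc, instruction):
    """Return the new PC for an ARM B/BL instruction word."""
    offset = instruction & 0x00FFFFFF            # 24-bit signed word offset
    if offset & 0x00800000:
        offset -= 1 << 24                        # sign-extend
    return (pc + (offset << 2)) & 0xFFFFFFFF     # shift left 2, then add to the PC
```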
LDR, STR
• Used asynchronous logic to make LDR and STR single cycle. During the first part of the clock cycle, the updated base register is written, the writeback register is changed, and the value is loaded from memory into that register.
• Simplifies load and store logic greatly.
Multicycle Instructions
• Multiple Register Transfer
• Swap
• Implemented with our sequencer:
– Each of these instructions translates into a sequence of single cycle instructions. These instructions are modified to correspond with the specific multicycle instruction.
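As an illustration of that translation, the sketch below expands the register list of an increment-after load multiple into single loads, lowest register at the lowest address. The function name and the `(rd, offset)` output format are hypothetical; the other addressing modes and the base write-back are left out.

```python
def expand_ldm(register_list):
    """Expand a 16-bit LDM register list into (rd, byte_offset) single loads."""
    ops = []
    offset = 0
    for r in range(16):                  # lowest-numbered register first
        if (register_list >> r) & 1:
            ops.append((r, offset))      # stands for: LDR r<rd>, [base, #offset]
            offset += 4
    return ops
```

Each tuple corresponds to one single-cycle instruction that the sequencer would inject into the instruction stream.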
Where are we now?
"Time commitment--eternity." --CTEC
• All 5 stages and Memory/Cache integrated.
• Data Processing, Multiply, CLZ, Shifting, Load, Store, Branch, MRS, MSR
• Not yet fully functional:
– Load/Store Multiple
– Swap
– Conditional execution (in regards to branch)
– Interrupts