
Team Antelope Final Presentation

What doesn’t kill you, makes you stronger

"The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at or repair."

James Zirkle

John Lange

Peter Johnson

Chris

Processor Overview

Despite all my rage I'm still just a rat in a cage --Bullet With Butterfly Wings

• 5 stage pipeline

• 10 nanosecond clock

• 128 bit memory

• Split Caches

– Write back policy

• CLZ and Multiply simplified to 1 clock cycle

• MicroSequencer used to handle complex operations

Who did what

• James – Register File, Integration

• Jack – Cache, Memory, ALU

• Peter – Shifter, hazard detection unit

• Chris – Multiplier, CLZ, interrupts

ALU

“Quidquid latine dictum sit, altum videtur”

• Handles all 16 data processing instructions

• Determines PSR flag values

• 4 bit carry look ahead units, combined into 16 blocks

Shifter

• 32 bit Barrel Shifter

• Logical Shift Left/Right, Arithmetic Shift Right, Rotate Right, Rotate Right Extended

• Special Cases (LSR #0 encodes LSR #32, etc)

• Generates result by combining individual bit shifters
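The special cases can be sketched as a small decode step in front of the shifter. This is a minimal sketch that follows the standard ARM immediate-shift encoding; the slides only name the LSR #0 case, so the other cases and the function name `decode_imm_shift` are assumptions.

```python
def decode_imm_shift(shift_type, imm5):
    """Map an ARM immediate shift field to the (operation, amount) actually performed."""
    if imm5 != 0:
        return shift_type, imm5      # ordinary case: shift by the encoded amount
    if shift_type == "LSL":
        return "LSL", 0              # LSL #0 means no shift
    if shift_type in ("LSR", "ASR"):
        return shift_type, 32        # LSR #0 / ASR #0 encode a shift by 32
    return "RRX", 1                  # ROR #0 encodes Rotate Right Extended
```

With this step in place the shifter itself only ever sees a real operation and amount.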

Barrel Shifter

Result propagated through bit shifters

Added 32-bit Shifters

16-Bit Right Shifter

32-Bit Barrel Shifter

Carry In / Carry Out

-Carry in only used in RRX (rotate right extended) operations

-Carry out always computed, even though not needed in rotate operations

Carry Out Logic: Two Options

• Separate logic computes Cout early using input and shift amount

Pros:

-Cout signal ready much earlier, no need for propagation

-Simpler bit shifter designs

Cons:

-Many more gates needed

Carry Out Logic: Two Options

• Individual bit shifters compute and propagate Cout signal

Pros:

-Simpler overall design

-Fewer logic gates

Cons:

-Takes longer for Cout to be ready (propagation delay)

-More complicated bit shifters

Carry out: Conclusion

• Went ahead and implemented Cout logic in the bit shifters

-Don’t really need the signal to be ready any earlier than the rest of the shifter output, especially not at the additional gate cost

-Each shifter computes Cout for its own shift amount and passes it on, or leaves Cout alone if it is disabled
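The chosen option can be sketched for the logical-shift-right case. This is a behavioural sketch, not the team's gate-level design: the stage sizes (16, 8, 4, 2, 1) and the names `lsr_stage` and `barrel_lsr` are assumptions.

```python
MASK32 = 0xFFFFFFFF

def lsr_stage(value, cout, amount, enabled):
    """One fixed-amount right shifter: shift and update Cout, or pass both through."""
    if not enabled:
        return value, cout                   # disabled stage leaves Cout alone
    new_cout = (value >> (amount - 1)) & 1   # last bit this stage shifts out
    return (value >> amount) & MASK32, new_cout

def barrel_lsr(value, shift, carry_in=0):
    """LSR by 0..31 built from chained 16/8/4/2/1-bit stages."""
    cout = carry_in
    for bit in (4, 3, 2, 1, 0):
        value, cout = lsr_stage(value, cout, 1 << bit, (shift >> bit) & 1)
    return value, cout
```

The last enabled stage always shifts out the final bit of the total shift, so the Cout that leaves the chain is the correct one without any separate logic.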

Complete Shifter

Multiplier (MUL/MLA)

• 32 additions in parallel

• Logarithmic time result

• 2^5 = 32, so time equals 5 adds

• Multiply w/accumulate inserted at the end with a multiplexor
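The logarithmic timing can be illustrated with an adder tree over 32 partial products. This is a sketch under assumptions: the slides do not show the internal structure, and `mul_tree` and its parameters are hypothetical names.

```python
MASK32 = 0xFFFFFFFF

def mul_tree(a, b, acc=0, accumulate=False):
    """32-bit multiply via a pairwise adder tree; returns (result, tree_levels)."""
    # One partial product per bit of b
    terms = [((a << i) & MASK32) if (b >> i) & 1 else 0 for i in range(32)]
    levels = 0
    while len(terms) > 1:
        # All additions within one level would run in parallel in hardware
        terms = [(terms[i] + terms[i + 1]) & MASK32 for i in range(0, len(terms), 2)]
        levels += 1
    product = terms[0]
    # Multiplexor at the end selects MLA (add accumulator) or plain MUL
    result = (product + acc) & MASK32 if accumulate else product
    return result, levels
```

Halving 32 terms down to one takes five levels, which matches the "5 adds" figure.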

Count leading zeros (CLZ)

• Output equals number of leading zeros on the input (Ex: 00010110 -> 00000011)

• First step: 00010110 -> 00011111

• Then, add one: 00011111 -> 00100000

• Lastly, convert to binary. With a 32-bit input, output will have a 6-digit maximum.

• Timing: Only four gate delays.
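The three steps can be sketched in software. The loop below stands in for the gate network, so it says nothing about the four-gate-delay timing; the function name `clz` and the `width` parameter are assumptions.

```python
def clz(x, width=32):
    """Count leading zeros by smearing, adding one, and converting the one-hot result."""
    s = x & ((1 << width) - 1)
    shift = 1
    while shift < width:
        s |= s >> shift          # step 1: copy the highest set bit into every lower bit
        shift <<= 1
    one_hot = s + 1              # step 2: a single set bit just above the highest input bit
    position = one_hot.bit_length() - 1
    return width - position      # step 3: convert the one-hot position to a binary count
```

For the slide's 8-bit example, `clz(0b00010110, width=8)` gives 3, and an all-zero 32-bit input gives 32, which is why the output needs 6 bits.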

Register File

• 37 Total Registers

• Different modes select between different registers.

• Registers r0-r7 and the PC (r15) are common to all modes

• PSR Mode bits select between different register banks
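One way to picture the bank selection, assuming the standard ARM banking scheme (FIQ banks r8-r14, the other exception modes bank r13-r14, and each exception mode has its own SPSR). The slides do not spell out the banking, so the tables and function names here are assumptions.

```python
# ARM PSR mode-bit encodings
MODES = {"usr": 0b10000, "fiq": 0b10001, "irq": 0b10010,
         "svc": 0b10011, "abt": 0b10111, "und": 0b11011}

# Registers each exception mode replaces with its own copy
BANKED = {"fiq": range(8, 15), "irq": range(13, 15), "svc": range(13, 15),
          "abt": range(13, 15), "und": range(13, 15)}

def physical_register(mode_bits, n):
    """Name of the physical register behind architectural register rn in a given mode."""
    for name, bits in MODES.items():
        if bits == mode_bits and n in BANKED.get(name, ()):
            return f"r{n}_{name}"
    return f"r{n}"   # shared register (always the case for r0-r7 and r15)

def all_physical_registers():
    """Every distinct physical register, including the status registers."""
    general = {physical_register(bits, n) for bits in MODES.values() for n in range(16)}
    status = {"cpsr"} | {f"spsr_{name}" for name in BANKED}
    return general | status
```

Under these assumptions the count comes to 31 general registers plus 6 status registers, matching the 37 total.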

Register File, Continued

• 3 normal (r0-r15) register outputs.

• 1 input that can access r0-r15

• An input and an output dedicated to the PC

• An input and an output dedicated to the SPSR

Pipeline Design and Component Integration

“The manual for a Ferrari 250 states that replacing the timing chain is a five-step process. Step one is the simple (?) instruction: ‘Invert motor on bench.’”

Pipeline Selection

• Selected a 5 stage pipeline design

– Fetch: Instruction is retrieved from memory

– Decode: Instruction is processed, control signals sent

– Execute: ALU, Shift, Multiply and CLZ operations

– Memory: Data cache/memory access

– Writeback: Results are written back to the register file

Fetch->Decode->Execute->Memory->Writeback

Advantages

• Breaks datapath into logical operational blocks.

• Slower stages can be broken up to increase the clock speed.

• Results in higher throughput

Disadvantages

• More time consuming to implement.

• Data hazards appear, so must implement forwarding and stalls in certain circumstances. This further complicates the design.

Pipeline diagram: Fetch, Decode, Execute, Memory, Writeback

Pipelined Datapath Construction

“Purpose—to drive you to insanity”

• Implemented simple single stage datapath first.

• Used D flip-flops to break up the datapath into the 5 different stages.

• Added memory and cache.

– Stall the pipeline by holding the clock.

Fetch Stage

• Consists of Instruction Cache

• Runs almost every cycle.

• Stalled independently of the rest of the stages while the Sequencer is running.

Execute Stage

• Contains:

– Shifter

– ALU

– Multiplier

– CLZ unit

– Conditional Execution unit

– PSR and PSR control

Memory

• Contains the interface to Data Cache

Writeback

• Writes back to registers

Decode Stage, Continued

• Stage contains:

– Register File

– Sequencer

– Branching logic

• 32 bit shift extender

• 32 bit full adder

• PC is output from the register file straight into the Instruction Cache address

Decode Stage

• Modular design, each instruction type has one module that is connected to a mux

• PLA takes instruction and outputs a 4 bit select signal that selects between all modules.

• Control is contained in a 32 bit bus that is piped through the entire processor.

Current Processor Implementation

Hazards

Read after Write:

FETCH DEC EXEC DATA WB


1.

2. STALL

Hazards

Branch:

FETCH DEC EXEC DATA WB (Branch Target)

Hazard Checking Logic

Checks whether Rd (destination register) is read by either of the next 2 instructions
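The check can be sketched as a comparison between one instruction's destination and the source registers of the two that follow it. The `Instr` record and `raw_hazards` function are hypothetical names for illustration; the real unit compares register fields taken from the pipeline.

```python
from collections import namedtuple

# rd: destination register number (or None); sources: register numbers read
Instr = namedtuple("Instr", "rd sources")

def raw_hazards(window):
    """window[0] produces a value; return the distances (1 or 2) of later readers."""
    producer = window[0]
    if producer.rd is None:
        return []
    return [dist for dist, later in enumerate(window[1:3], start=1)
            if producer.rd in later.sources]
```

A result of `[1]` or `[2]` tells the forwarding logic which pipeline stage holds the value that the reader needs.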

Data Forwarding

FETCH

DECODE

EXEC

DATA BUFFER

WRITEBACK

Result

Data

Overview

CONDITION EVALUATE

CPSR

Flags

Interrupt Handler

Component must handle the following seven cases:

1. Reset (Highest Priority)

2. Data Abort

3. FIQ

4. IRQ

5. Prefetch Abort

6. Undefined Instruction

7. Software Interrupt (SWI) (Lowest Priority)

Implementation

• One ROM file handles memory addresses.

• 3-bit input leads to 32-bit address for PC.

• Second ROM file handles CPSR alterations.

• 4-bit input leads to lower 8 bits of CPSR.

• Priorities of the interrupts are handled with CLZ functionality.

• Lastly, no interrupts leads to “Active = 0”.
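The priority selection can be sketched with a request vector whose most significant bit is the highest-priority source, so that counting leading zeros yields the winner. The ROM contents below are the standard ARM exception vector addresses, which is an assumption about this design; the CPSR ROM is omitted, and the names `SOURCES`, `VECTOR_ROM` and `select_interrupt` are hypothetical.

```python
# Highest priority first, matching the list of seven cases
SOURCES = ["reset", "data_abort", "fiq", "irq",
           "prefetch_abort", "undefined", "swi"]

# Standard ARM exception vector addresses (assumed ROM contents)
VECTOR_ROM = {"reset": 0x00, "undefined": 0x04, "swi": 0x08,
              "prefetch_abort": 0x0C, "data_abort": 0x10,
              "irq": 0x18, "fiq": 0x1C}

def select_interrupt(requests):
    """requests: 7-bit vector, bit 6 = reset ... bit 0 = swi.

    Returns (active, index, pc), where index is the 3-bit ROM input.
    """
    requests &= 0x7F
    if requests == 0:
        return 0, None, None            # no interrupts: Active = 0
    index = 7 - requests.bit_length()   # count leading zeros of the 7-bit vector
    return 1, index, VECTOR_ROM[SOURCES[index]]
```

When several requests arrive together, only the leftmost set bit matters, which is exactly what a CLZ unit reports.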

Memory

"Memory is like an orgasm. It's a lot better if you don't have to fake it."

-- Seymour Cray

• 128 bit wide Main Memory

• 32 bit Split cache system

– Data and Instruction

• Data Cache operates with Write Back Policy

• 2 State Machines in charge of Memory Control

Main Memory Control

It wasn't very sporting, but what the hell. - Chuck Yeager on shooting down a landing Me-262

• Simulates memory latency with a delay component

• Implemented with a state machine

– Enters a wait state while holding for memory to finish

• Operation order:

– Data first, Instruction second

• Signals when data is valid, and when operation is finished

Memory State Machine

Caches

“I'm just here for moral support. Ignore the gun.”

• 128 bit lines separated into 32 bit blocks

• Hits determined by using high address bits, as well as a valid bit

• Write strategy uses Dirty bit to signal when to write to memory

• On reset valid and dirty bits are cleared

• Can operate in 128, 32, and 8 bit modes

– Necessary for memory and processor interface
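The hit test and dirty-bit write-back can be sketched as follows. The slides do not give the cache organisation, so the direct-mapped layout, the line count, the dictionary used as main memory, and all names are assumptions; only 32-bit accesses are modelled.

```python
class WriteBackCache:
    """Direct-mapped cache with 128-bit lines held as four 32-bit words."""

    def __init__(self, num_lines=4):
        self.num_lines = num_lines
        self.reset()

    def reset(self):
        # On reset, valid and dirty bits are cleared
        self.lines = [{"valid": False, "dirty": False, "tag": 0, "words": [0] * 4}
                      for _ in range(self.num_lines)]

    def _split(self, addr):
        line_number = addr >> 4                       # 16 bytes per line
        return line_number // self.num_lines, line_number % self.num_lines, (addr >> 2) & 3

    def _fill(self, addr, memory):
        tag, index, _ = self._split(addr)
        line = self.lines[index]
        hit = line["valid"] and line["tag"] == tag    # high address bits plus valid bit
        if not hit:
            if line["valid"] and line["dirty"]:
                # Dirty line is written to memory only when it is evicted
                memory[line["tag"] * self.num_lines + index] = list(line["words"])
            line.update(valid=True, dirty=False, tag=tag,
                        words=list(memory.get(addr >> 4, [0] * 4)))
        return line, hit

    def read(self, addr, memory):
        line, hit = self._fill(addr, memory)
        return line["words"][self._split(addr)[2]], hit

    def write(self, addr, value, memory):
        line, hit = self._fill(addr, memory)
        line["words"][self._split(addr)[2]] = value & 0xFFFFFFFF
        line["dirty"] = True
        return hit
```

Here `memory` maps a line number (address divided by 16) to a list of four words, standing in for the 128 bit wide main memory.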

Cache Reset

"A day without killing... is like a day without sunshine"

-John Wayne

• Cache reset controlled by two signals

– RESET and MEM_CLEAR

• When MEM_CLEAR is pulsed a sequencer is engaged

– Adder attached to a flip-flop

– Cycles through addresses, setting values to 0

– Asserts pipeline hold signal while running

• RESET clears all the state machines back to initial state

Memory System Control

"He spoke, I had no clue, it was a mutual relationship."

• Implemented with a state machine

• Interfaces I-Cache, D-Cache, Main Memory, and Pipeline

• During operation, pipeline hold signal is asserted

• Autonomous operation, requires no special datapath control

• Took so much time, that it made my girlfriend jealous

Memory Control Overview

Memory Control FSMs

Interrupts

“The nice thing about standards is that there are so many of them to choose from.”

Sequencer

• Built to handle complex operations

– Interrupts, block load/store

• Is basically a clocked ROM file.

– Has a start address and a start signal

– Runs through a sequence of instructions in the ROM file until sequence signals it is done.

– One instruction per cycle is injected into instruction stream, Fetch stage is stalled.
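The sequencer's behaviour can be sketched as a generator that steps through a ROM, one entry per clock. The ROM layout (an instruction paired with a done flag) and its contents are hypothetical; the slides do not show the actual encoding.

```python
def run_sequence(rom, start_address):
    """Yield one ROM instruction per clock until an entry marks the sequence done."""
    address = start_address
    while True:
        instruction, done = rom[address]
        yield instruction        # injected into the instruction stream this cycle
        if done:
            return               # sequence signals completion; Fetch may resume
        address += 1

# Hypothetical ROM: two sequences, each ending with done=True
ROM = {
    0: ("op_a0", False),
    1: ("op_a1", False),
    2: ("op_a2", True),
    3: ("op_b0", False),
    4: ("op_b1", True),
}
```

For example, `list(run_sequence(ROM, 3))` steps through the second sequence only, and Fetch stays stalled for those two cycles.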

Instructions: Data Processing, Multiply and CLZ

• These instructions move linearly through the pipeline, and don’t require stalls as they are all single cycle in our implementation.

• Present some data hazard problems, but hazard detection and forwarding logic maintains linear execution.

Branch

• On decode, branch immediately adds the PC to the shifted offset and updates the PC.

• No stall necessary, since PC is updated before the next instruction is fetched.

• Branch w/link has r14 updated when branch finishes moving through the entire pipeline.
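The target calculation can be sketched as a sign extension, a shift, and an add. This assumes the standard ARM branch format with a 24-bit word offset; `pc` is whatever value the hardware feeds the adder (in ARM the PC reads 8 bytes ahead of the branch), and `branch_target` is a hypothetical name.

```python
def branch_target(pc, imm24):
    """Add the sign-extended, left-shifted 24-bit branch offset to the PC."""
    offset = imm24 & 0xFFFFFF
    if offset & 0x800000:
        offset -= 1 << 24                   # sign-extend the 24-bit field
    return (pc + (offset << 2)) & 0xFFFFFFFF   # word offset becomes a byte offset
```

An offset of 0xFFFFFE is -2 words, so a branch whose PC value is 0x1008 lands on 0x1000.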

LDR, STR

• Used asynchronous logic to make LDR and STR single cycle. During the first part of the clock cycle, the updated base register is written, the writeback register is changed, and the value is loaded from memory into that register.

• Simplifies load and store logic greatly.

Multicycle Instructions

• Multiple Register Transfer

• Swap

• Implemented with our sequencer:

– Each of these instructions translates into a sequence of single cycle instructions. These instructions are modified to correspond with the specific multicycle instruction.
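As an illustration of the translation, a load-multiple can be expanded into one single-cycle load per listed register. This is a sketch of the idea only, assuming increment-after addressing with the lowest register at the lowest address; the slides do not show the actual micro-instructions, and `expand_ldm` is a hypothetical name.

```python
def expand_ldm(base, reg_list):
    """Expand a load-multiple into single-register loads (increment-after order)."""
    regs = [r for r in range(16) if (reg_list >> r) & 1]   # lowest register first
    return [f"LDR r{r}, [r{base}, #{4 * i}]" for i, r in enumerate(regs)]
```

Each generated load is the kind of template instruction the sequencer would specialise with the register numbers from the original instruction.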

Where are we now?

"Time commitment--eternity." --CTEC

• All 5 stages and Memory/Cache integrated.

• Data Processing, Multiply, CLZ, Shifting, Load, Store, Branch, MRS, MSR

• Not yet fully functional:

– Load/Store Multiple

– Swap

– Conditional execution (in regards to branch)

– Interrupts
