
Presented By Şahin DELİPINAR

Simon Moore, Peter Robinson, Steve Wilcox

Computer Laboratory, University of Cambridge

December 15, 1995

Rotary Pipeline Processors


OUTLINE

• Abstract
• Introduction
• Rotary Pipeline Concept
• Implementation Issues
• Simulation
• Relation to other approaches
• Conclusions


ABSTRACT

• The rotary pipeline processor is a new architecture for superscalar computing

• Registers flow around the pipeline

• Performance is limited only by data rates

• Operation is paced by self-timed logic rather than by a global clock


INTRODUCTION

• Most current designs use parallel pipelines to execute multiple instructions...

• Synchronization problems decrease performance in pipelines

• In a rotary pipeline, instructions are dispatched to the ALUs from the center of the pipeline. Data circulates clockwise and is processed by the ALUs and memory access units


ROTARY PIPELINE CONCEPT

• Overview :

- A rotary pipeline rotates the registers past the function units arranged around the ring. When a register reaches a function unit that needs it, the value is used and the result is reloaded into the pipeline

- Unused registers are not locked and continue to rotate

- ALU operations occur in parallel
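The overview above can be illustrated with a small behavioural sketch. It models only the data flow (registers passing each function unit in ring order), not the self-timed circuit or the parallel timing. The names `Instr` and `rotate` and the register names are assumptions made for illustration, not part of the original design.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Instr:
    """Hypothetical instruction record: dest = op(src_a, src_b)."""
    dest: str
    src_a: str
    src_b: str
    op: Callable[[int, int], int]


def rotate(registers: Dict[str, int],
           stages: List[Optional[Instr]]) -> Dict[str, int]:
    """Pass the register set once around the ring, one function unit at a time."""
    regs = dict(registers)              # values travelling around the ring
    for instr in stages:                # one entry per function unit, in ring order
        if instr is None:
            continue                    # nothing issued here: registers pass through
        result = instr.op(regs[instr.src_a], regs[instr.src_b])
        regs[instr.dest] = result       # result is reloaded into the rotating set
    return regs


regs = {"r0": 1, "r1": 2, "r2": 0, "r3": 0}
stages = [
    Instr("r2", "r0", "r1", operator.add),   # r2 = r0 + r1
    None,                                    # idle unit
    Instr("r3", "r2", "r2", operator.mul),   # r3 = r2 * r2, uses the new r2
]
out = rotate(regs, stages)
```

Registers that no unit touches (`r0` and `r1` here) simply keep rotating with their old values, which is the "not locked" behaviour described above.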


ROTARY PIPELINE CONCEPT (Cont’d)

• Basic Pipeline Construction :

A set of flip-flops is used to select which registers will be used and which will be left to continue around the pipeline.



• Adding a Register File :

If the rotary pipeline is large and there are many registers, a multiported register file is used to store the registers that are waiting

Figure 3


• Rotary Bus Allocation :

Registers are allocated to buses on a first-come, first-served basis. If instructions are independent, they continue to travel. When a register is used by only one unit, the number of buses increases (Figure 4)



• Instruction Issue :

- Sequential instructions are sent in the same direction, so overlapping and register dependencies are resolved
- If an instruction cannot be processed by a function unit, a NOP is simply issued, which decreases performance
- Dynamic instruction reordering:
  - Assume a Load is followed by an Add and the first unit is an ALU...
  - Only a 3% performance gain is obtained
- A mispredicted branch decreases performance
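The NOP behaviour of in-order issue can be sketched as follows. This is a simplified model under stated assumptions: the unit names, the `(kind, name)` instruction tuples and the function `issue_in_order` are hypothetical, and dynamic reordering is deliberately left out.

```python
from typing import List, Tuple


def issue_in_order(program: List[Tuple[str, str]],
                   units: List[str]) -> List[Tuple[str, str]]:
    """Issue instructions strictly in program order to successive units.

    Each instruction is a (kind, name) pair. A unit whose kind does not
    match the next instruction receives a NOP instead.
    """
    for kind, name in program:
        if kind not in units:
            raise ValueError(f"no function unit can execute {name!r}")
    slots = []
    pending = list(program)
    position = 0
    while pending:
        unit = units[position % len(units)]   # next unit around the ring
        kind, name = pending[0]
        if kind == unit:
            slots.append((unit, name))        # unit matches: issue it
            pending.pop(0)
        else:
            slots.append((unit, "NOP"))       # unit cannot process it: wasted slot
        position += 1
    return slots


# The case from the slide: a Load followed by an Add, with an ALU as first unit.
slots = issue_in_order([("MEM", "load"), ("ALU", "add")], ["ALU", "MEM"])
```

Here the first ALU slot is filled with a NOP because the Load must go to the memory unit first, which is the lost slot that reordering tries to recover.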



Because the rotary pipeline is data driven, instruction ordering is not so important. Instructions complete out of order. Figure 4...




• CONDITIONAL EXECUTION :

Conditional execution of arithmetic and logical instructions may be handled by extra control logic at each ALU. This logic controls the writing of results to the rotary pipeline by controlling the output switch network.
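A minimal sketch of this idea, assuming a flags dictionary and a predicate standing in for the control logic (both hypothetical): the ALU always evaluates, and the condition only decides whether the result is switched onto the pipeline.

```python
import operator
from typing import Callable, Dict


def conditional_alu(regs: Dict[str, int],
                    flags: Dict[str, bool],
                    dest: str, src_a: str, src_b: str,
                    op: Callable[[int, int], int],
                    condition: Callable[[Dict[str, bool]], bool]) -> Dict[str, int]:
    """Evaluate op unconditionally; write the result only if the condition holds."""
    result = op(regs[src_a], regs[src_b])   # the ALU evaluates regardless
    out = dict(regs)
    if condition(flags):                    # control logic gates the output switch
        out[dest] = result                  # result written to the rotating registers
    return out                              # otherwise the old value passes through


regs = {"r0": 4, "r1": 6, "r2": 0}
is_zero = lambda f: f["Z"]                  # stand-in for an "execute if equal" condition
taken = conditional_alu(regs, {"Z": True}, "r2", "r0", "r1", operator.add, is_zero)
skipped = conditional_alu(regs, {"Z": False}, "r2", "r0", "r1", operator.add, is_zero)
```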


• BRANCHES:

Branches always have an adverse effect on pipeline performance. Unconditional branches are easy to handle because they can be predicted before the operation begins, but conditional branches depend on the outcome of the execution stage and are difficult to handle. This can be addressed with speculative execution.




• SPECULATIVE EXECUTION:

- If an execution is marked as speculative, it can be revoked.

- If the register file is used… (results not written to reg.)

- If a larger register file is used… (Temp. Reg. Files)

- If a larger rotary pipeline is used… (Flip-flops)
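The general idea of revocable results can be sketched as below. This is only one possible reading of the temporary-register variant; the class `SpeculativeRegisters` and its methods are hypothetical and do not describe the actual hardware mechanism.

```python
from typing import Dict


class SpeculativeRegisters:
    """Committed registers plus temporary storage for speculative results."""

    def __init__(self, committed: Dict[str, int]) -> None:
        self.committed = dict(committed)
        self.temp: Dict[str, int] = {}      # speculative values, not yet architectural

    def write(self, reg: str, value: int, speculative: bool = False) -> None:
        if speculative:
            self.temp[reg] = value          # held aside so it can be revoked
        else:
            self.committed[reg] = value

    def read(self, reg: str) -> int:
        # Speculative instructions see the newest speculative value, if any.
        return self.temp.get(reg, self.committed[reg])

    def commit(self) -> None:
        """Branch resolved as predicted: speculative results become permanent."""
        self.committed.update(self.temp)
        self.temp.clear()

    def revoke(self) -> None:
        """Branch mispredicted: discard every speculative result."""
        self.temp.clear()


spec = SpeculativeRegisters({"r0": 1})
spec.write("r0", 5, speculative=True)
seen_while_speculating = spec.read("r0")
spec.revoke()
seen_after_revoke = spec.read("r0")
```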


IMPLEMENTATION

• Data encoding and completion detection :

- Completion of evaluation for a logic block can be determined by:

1. Embedding the completion signal within the data

2. Localised timing using matched delays


IMPLEMENTATION (Cont’d)

• The completion signal is embedded within the data by using a 1-of-4 encoding; the coding scheme is shown in Figure 5. Bundled data, by contrast, uses binary encoding

• The matched-delay method is subject to variation from thermal effects and manufacturing tolerances

Figure 5
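As a rough illustration of how a 1-of-4 code carries its own completion signal: each pair of bits travels on four wires, exactly one of which is asserted once the data is valid, while all wires low means "no data yet". The wire ordering below is an assumption and may differ from the coding scheme in Figure 5; the function names are hypothetical.

```python
from typing import List, Tuple

Group = Tuple[int, int, int, int]
EMPTY: Group = (0, 0, 0, 0)                 # cleared state: no wire asserted


def encode_1of4(value: int, width_bits: int) -> List[Group]:
    """Encode value as one 4-wire group per pair of bits, least significant first."""
    if width_bits % 2:
        raise ValueError("width must be an even number of bits")
    groups = []
    for shift in range(0, width_bits, 2):
        pair = (value >> shift) & 0b11
        groups.append(tuple(1 if wire == pair else 0 for wire in range(4)))
    return groups


def is_complete(groups: List[Group]) -> bool:
    """Data is complete when every group has exactly one wire asserted."""
    return all(sum(group) == 1 for group in groups)


def decode_1of4(groups: List[Group]) -> int:
    if not is_complete(groups):
        raise ValueError("data not yet complete")
    value = 0
    for index, group in enumerate(groups):
        value |= group.index(1) << (2 * index)
    return value


word = encode_1of4(0b1101, 4)
partial = [word[0], EMPTY]                  # second group has not arrived yet
```

No separate "valid" wire or matched delay is needed in this scheme, since completion follows from the data wires themselves.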



• Using Dynamic Logic :

- Dynamic logic and inverted 1-of-4 encoded data dovetail nicely, because precharging the logic depends upon the 1-of-4 encoded inputs being cleared before evaluation.

- Completion detection can be simplified by using AND gates instead of C-elements in the circuit.

Figure 6



• Outline of a Stage in the Pipeline :

Banks of transistors are used to download/upload data to the registers

Figure 7



• Controlling The Pipeline :

Each stage of the pipeline passes through the following states:

- Empty : the ALU is precharged and the flip-flops are reset
- Waiting for data : precharge and reset are released
- Latching data : SR flip-flops store the results
- Precharge : after the data is latched, ALU precharge commences
- Reset : once the next stage signals completion, the latches of this stage may be reset
- Empty : the cycle is complete
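The state cycle above can be written down as a small state machine. This is a behavioural sketch only; the input names `data_valid` and `next_stage_done` are assumptions standing in for the local completion signals.

```python
from enum import Enum


class Stage(Enum):
    EMPTY = "empty"
    WAITING = "waiting for data"
    LATCHING = "latching data"
    PRECHARGE = "precharge"
    RESET = "reset"


def step(state: Stage, data_valid: bool = False,
         next_stage_done: bool = False) -> Stage:
    """Advance one pipeline stage by one control transition."""
    if state is Stage.EMPTY:
        return Stage.WAITING                # precharge and reset are released
    if state is Stage.WAITING:
        # Stay here until the incoming data is complete.
        return Stage.LATCHING if data_valid else Stage.WAITING
    if state is Stage.LATCHING:
        return Stage.PRECHARGE              # results stored, ALU may precharge
    if state is Stage.PRECHARGE:
        # Latches may only be reset once the next stage has signalled completion.
        return Stage.RESET if next_stage_done else Stage.PRECHARGE
    return Stage.EMPTY                      # RESET -> EMPTY completes the cycle


trace = [Stage.EMPTY]
for inputs in ({}, {}, {"data_valid": True}, {}, {}, {"next_stage_done": True}, {}):
    trace.append(step(trace[-1], **inputs))
```

The two conditional transitions are the only points where a stage waits on its neighbours, so no global clock is involved in the model.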



Figure 8


SIMULATION

• Instruction Set Choice :

ARM instructions are used for convenience of comparison with existing clocked designs.

Characteristics of the instruction set:

1. Conditionals: every instruction can be conditionally executed
2. PC: the program counter is one of the general purpose registers and may be written to, thereby causing a branch
3. Load and store multiple: a single instruction can load or store several registers


SIMULATION (Cont’d)

• Initial Results :

The ARM instruction set and only the store and compress benchmarks are used to test performance

- First, ALU, memory access and branch units are included
- A number of ALU units are then added..
- Dynamic instruction reordering increased performance by 3%
- Branch prediction and a larger register file increased performance (Figure 9)
- But memory accesses soon limit performance


Figure 9


RELATION TO OTHER APPROACHES

• Data transfer capability between stages: in the RP, data is passed through latches between pipeline stages. The rotary pipeline is better than clocked designs, where data only becomes available after each clock period

• Amulet is a single processor in which latches are transparent to data while the pipeline is refilling

• CFPP: as data travels along the pipeline, register values filter down, and at the end of the cycle the operands are gathered at the very beginning of the pipeline

RP differs from other superscalar processors by avoiding global communication


CONCLUSIONS

• Rotary pipelines are self-timed structures that allow multiple instructions to be executed at the same time

Variations:
1. Passing complete registers..
2. Passing only active registers…

• In rotary pipelines, the structure emphasizes performance rather than size and low power.

• RPs have fewer buses compared to other superscalar processors

• Suitable for self-timed circuits but not for clocked implementations


Questions?
