1 pipeline and vector processing chapter # 9. 2 contents parallel processing pipelining arithmetic...

1

PIPELINE AND VECTOR PROCESSING

CHAPTER # 9

2

CONTENTS

Parallel Processing
Pipelining
Arithmetic Pipeline
Instruction Pipeline
RISC Pipeline
Vector Processing
Array Processors

3

Figure 9-1

Processor with multiple functional units

Functional units connected to the processor registers:
Adder-subtractor
Integer multiply
Logic unit
Shift unit
Incrementer
Floating-point add-subtract
Floating-point multiply
Floating-point divide

The processor registers connect to memory.

4

Classification by instruction stream and data stream (Flynn's classification):

Single instruction stream, single data stream (SISD).

Single instruction stream, multiple data stream (SIMD).

Multiple instruction stream, single data stream (MISD).

Multiple instruction stream, multiple data stream (MIMD).

5

Figure 9-2

Example of Pipelining.

Ai * Bi + Ci

R1 ← Ai, R2 ← Bi            Input Ai and Bi
R3 ← R1 * R2, R4 ← Ci       Multiply and input Ci
R5 ← R3 + R4                Add Ci to product

Data path: R1, R2 → Multiplier → R3; Ci → R4; R3, R4 → Adder → R5

6

Clock pulse   Segment 1      Segment 2        Segment 3
number        R1     R2      R3       R4      R5
1             A1     B1      ----     ----    ----
2             A2     B2      A1*B1    C1      ----
3             A3     B3      A2*B2    C2      A1*B1+C1
4             A4     B4      A3*B3    C3      A2*B2+C2
5             A5     B5      A4*B4    C4      A3*B3+C3
6             A6     B6      A5*B5    C5      A4*B4+C4
7             A7     B7      A6*B6    C6      A5*B5+C5
8             ----   ----    A7*B7    C7      A6*B6+C6
9             ----   ----    ----     ----    A7*B7+C7

Table 9-1

Content of registers in pipeline example.
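
The register contents of Table 9-1 can be reproduced with a small simulation. The following is only a sketch: the function name `simulate_pipeline` and the use of `None` for an empty register are illustrative choices, not part of the slides.

```python
def simulate_pipeline(a, b, c):
    """Simulate the three-segment pipeline computing a[i] * b[i] + c[i].

    Returns one tuple (R1, R2, R3, R4, R5) per clock pulse.
    None stands for an empty register (shown as ---- in Table 9-1).
    """
    n = len(a)
    r1 = r2 = r3 = r4 = r5 = None
    rows = []
    for clock in range(n + 2):  # n items need n + 2 pulses with 3 segments
        # All registers load on the same clock pulse, so the new values
        # are computed from the old ones, last segment first.
        r5 = r3 + r4 if r3 is not None else None
        r3 = r1 * r2 if r1 is not None else None
        r4 = c[clock - 1] if 1 <= clock <= n else None
        r1 = a[clock] if clock < n else None
        r2 = b[clock] if clock < n else None
        rows.append((r1, r2, r3, r4, r5))
    return rows


if __name__ == "__main__":
    for pulse, row in enumerate(simulate_pipeline([1, 2, 3], [4, 5, 6], [7, 8, 9]), 1):
        print(pulse, row)
```

With seven input triples the function returns nine rows, matching the nine clock pulses of the table.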

7

Figure 9-3

Four segment pipeline.

Input → S1 → R1 → S2 → R2 → S3 → R3 → S4 → R4

The clock drives registers R1 through R4.

8

Figure 9-4

Space-time diagram for pipeline.

Clock cycles:  1   2   3   4   5   6   7   8   9
Segment 1:     T1  T2  T3  T4  T5  T6
Segment 2:         T1  T2  T3  T4  T5  T6
Segment 3:             T1  T2  T3  T4  T5  T6
Segment 4:                 T1  T2  T3  T4  T5  T6
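
The space-time diagram of Figure 9-4 (four segments, six tasks, nine clock cycles) follows from the fact that k segments finish n tasks in k + n - 1 cycles. The sketch below builds the diagram and evaluates the usual speedup ratio S = n*tn / ((k + n - 1)*tp). The names `space_time` and `speedup` and the symbols tn (time per task without the pipeline) and tp (pipeline clock period) are assumptions introduced here for illustration; they do not appear on the slide.

```python
def space_time(k, n):
    """Return a k-row grid for a k-segment pipeline running n tasks.

    grid[s][t] is the task number occupying segment s + 1 during
    clock cycle t + 1, or 0 if the segment is idle.
    """
    cycles = k + n - 1
    grid = [[0] * cycles for _ in range(k)]
    for task in range(n):
        for seg in range(k):
            # Task number task + 1 enters segment seg + 1 one cycle
            # after it entered the previous segment.
            grid[seg][task + seg] = task + 1
    return grid


def speedup(k, n, tn, tp):
    """Ratio of non-pipelined time (n * tn) to pipelined time."""
    return (n * tn) / ((k + n - 1) * tp)


if __name__ == "__main__":
    for seg, row in enumerate(space_time(4, 6), 1):
        print(seg, " ".join(f"T{t}" if t else "--" for t in row))
```

As n grows, the ratio approaches tn / tp, so under the further assumption tn = k*tp the speedup tends toward the number of segments.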

9

Figure 9-5

Multiple functional units in parallel.

Ii   → P1
Ii+1 → P2
Ii+2 → P3
Ii+3 → P4

10

Arithmetic Pipeline

1. Compare the exponents.
2. Align the mantissas.
3. Add or subtract the mantissas.
4. Normalize the result.

11

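Figure 9-6

Pipeline for floating-point addition and subtraction.

Inputs: exponents a and b, mantissas A and B, each latched in an input register (R).
Segment 1: Compare exponents by subtraction (produces the difference).
Segment 2: Choose exponent; align mantissas.
Segment 3: Add or subtract mantissas.
Segment 4: Adjust exponent; normalize result.
A register (R) holds the intermediate results between successive segments.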
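
The four suboperations of the arithmetic pipeline can be traced in software. This is a minimal sketch, not the hardware design: it assumes decimal numbers written as (mantissa, exponent) pairs with value mantissa * 10**exponent, uses exact fractions to avoid rounding, and the function name `fp_add` and the sample operands are illustrative.

```python
from fractions import Fraction


def fp_add(x, y, base=10):
    """Add two numbers given as (mantissa, exponent) pairs.

    Each value equals mantissa * base**exponent. Pass a negative
    mantissa to subtract.
    """
    (ma, ea), (mb, eb) = x, y
    # Segment 1: compare the exponents by subtraction.
    diff = ea - eb
    # Segment 2: choose the larger exponent and align the other
    # mantissa by shifting it right |diff| positions.
    if diff >= 0:
        exp, mb = ea, mb / base ** diff
    else:
        exp, ma = eb, ma / base ** (-diff)
    # Segment 3: add the mantissas.
    m = ma + mb
    # Segment 4: normalize so that 1/base <= |m| < 1, adjusting the exponent.
    if m == 0:
        return Fraction(0), 0
    while abs(m) >= 1:
        m, exp = m / base, exp + 1
    while abs(m) < Fraction(1, base):
        m, exp = m * base, exp - 1
    return m, exp


if __name__ == "__main__":
    x = (Fraction(9504, 10000), 3)  # 0.9504 * 10**3
    y = (Fraction(8200, 10000), 2)  # 0.8200 * 10**2
    mantissa, exponent = fp_add(x, y)
    print(float(mantissa), exponent)
```

In a pipeline each of the four commented steps would run in its own segment, so a new pair of operands could enter every clock cycle.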

12

Instruction Pipeline

1. Fetch the instruction from memory.
2. Decode the instruction.
3. Calculate the effective address.
4. Fetch the operands from memory.
5. Execute the instruction.
6. Store the result in the proper place.

13

Figure 9-7

Four-segment CPU pipeline.

Segment 1: Fetch instruction from memory
Segment 2: Decode instruction and calculate effective address
    Branch? yes: update PC and empty pipe; no: continue to segment 3
Segment 3: Fetch operand from memory
Segment 4: Execute instruction
    Interrupt? yes: interrupt handling, then update PC and empty pipe; no: continue with the next instruction

14

FI is the segment that fetches an instruction.

DA is the segment that decodes the instruction and calculates the effective address.

FO is the segment that fetches the operand.

EX is the segment that executes the instruction.

Segments and their purpose.

15

Step:             1   2   3   4   5   6   7   8   9   10  11  12  13
Instruction 1     FI  DA  FO  EX
            2         FI  DA  FO  EX
   (Branch) 3             FI  DA  FO  EX
            4                 FI  --  --  FI  DA  FO  EX
            5                     --  --  --  FI  DA  FO  EX
            6                                 FI  DA  FO  EX
            7                                     FI  DA  FO  EX

Figure 9-8

Timing of instruction pipeline.
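
The timing of Figure 9-8 can be generated by a short scheduling routine. The sketch below assumes, as in the figure, that a taken branch discards the fetch of the next instruction and that fetching resumes only after the branch has finished its EX segment. The function name `schedule` and its `branch` parameter are hypothetical.

```python
SEGMENTS = ["FI", "DA", "FO", "EX"]


def schedule(n_instr, branch=None):
    """Return {instruction: {step: segment}} for a four-segment pipeline.

    If `branch` is given, that instruction is treated as a taken branch:
    the instruction after it is fetched once, discarded, and fetched
    again after the branch completes EX.
    """
    k = len(SEGMENTS)
    table = {}
    for i in range(1, n_instr + 1):
        start = i  # without conflicts, instruction i enters FI at step i
        row = {}
        if branch is not None and i > branch:
            start = i + (k - 1)  # pipeline refills k - 1 steps later
            if i == branch + 1:
                row[i] = "FI"  # fetch that is thrown away
        for offset, name in enumerate(SEGMENTS):
            row[start + offset] = name
        table[i] = row
    return table


if __name__ == "__main__":
    for instr, row in schedule(7, branch=3).items():
        print(instr, sorted(row.items()))
```

With seven instructions and a branch at instruction 3, the last EX falls in step 13, three steps later than in the conflict-free case.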

16

Pipeline Conflicts

1. Resource conflicts
2. Data dependency conflicts
3. Branch difficulties

17

Three-segment instruction pipeline

I: Instruction fetch
A: ALU operation
E: Execute instruction

18

Delayed Load

1. LOAD:   R1 ← M[address 1]
2. LOAD:   R2 ← M[address 2]
3. ADD:    R3 ← R1 + R2
4. STORE:  M[address 3] ← R3

19

Pipeline timing with data conflict

Clock cycles:     1  2  3  4  5  6
1. Load R1        I  A  E
2. Load R2           I  A  E
3. Add R1+R2            I  A  E
4. Store R3                I  A  E

Pipeline timing with delayed load

Clock cycles:     1  2  3  4  5  6  7
1. Load R1        I  A  E
2. Load R2           I  A  E
3. No-operation         I  A  E
4. Add R1+R2               I  A  E
5. Store R3                   I  A  E

Figure 9-9

Three segment pipeline timing.
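
The delayed-load fix shown in Figure 9-9 is a mechanical rewrite that a compiler can apply. The sketch below assumes a simplified instruction representation, a tuple (operation, destination, sources), and inserts a no-operation whenever the instruction right after a load reads the register being loaded. The name `insert_delayed_loads` and the tuple layout are illustrative only.

```python
def insert_delayed_loads(program):
    """Insert a NOP after each LOAD whose result is used immediately.

    `program` is a list of (op, dest, sources) tuples. The load delivers
    its data in segment E, in the same cycle in which the following
    instruction would use it in segment A, so that instruction must be
    pushed back by one cycle.
    """
    out = []
    for idx, (op, dest, sources) in enumerate(program):
        out.append((op, dest, sources))
        following = program[idx + 1] if idx + 1 < len(program) else None
        if op == "LOAD" and following is not None and dest in following[2]:
            out.append(("NOP", None, ()))
    return out


if __name__ == "__main__":
    program = [
        ("LOAD", "R1", ("M[address 1]",)),
        ("LOAD", "R2", ("M[address 2]",)),
        ("ADD", "R3", ("R1", "R2")),
        ("STORE", "M[address 3]", ("R3",)),
    ]
    for instruction in insert_delayed_loads(program):
        print(instruction)
```

For the four-instruction example the only insertion is between the second load and the add, which gives the five-instruction sequence of the delayed-load timing.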

20

Figure 9-10

Examples of delayed branch.

Using no-operation instructions

Clock cycles:         1  2  3  4  5  6  7  8  9  10
1. Load               I  A  E
2. Increment             I  A  E
3. Add                      I  A  E
4. Subtract                    I  A  E
5. Branch to X                    I  A  E
6. No-operation                      I  A  E
7. No-operation                         I  A  E
8. Instruction in X                        I  A  E

21

Figure 9-10

Examples of delayed branch.

Rearranging the instructions

Clock cycles:         1  2  3  4  5  6  7  8
1. Load               I  A  E
2. Increment             I  A  E
3. Branch to X              I  A  E
4. Add                         I  A  E
5. Subtract                       I  A  E
6. Instruction in X                  I  A  E

22

Applications of Vector Processing

Long-range weather forecasting
Petroleum explorations
Seismic data analysis
Medical diagnosis
Aerodynamics and space flight simulations

23

Figure 9-11

Instruction format for vector processor

Operation code | Base address source 1 | Base address source 2 | Base address destination | Vector length

24

Figure 9-12

Pipeline for calculating an inner product

Source A, Source B → Multiplier pipeline → Adder pipeline
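
One way to picture how the multiplier and adder pipelines of Figure 9-12 cooperate is to let each product join one of several running partial sums, one per adder segment, and to combine the partial sums at the end. The sketch below assumes a four-segment adder; that segment count, the function name `inner_product`, and the parameter `adder_segments` are assumptions, not details given on the slide.

```python
def inner_product(a, b, adder_segments=4):
    """Compute sum(a[i] * b[i]) as a pipelined adder might accumulate it.

    Product i is added to partial sum i mod adder_segments, because the
    adder output that re-enters the pipeline belongs to the product that
    entered adder_segments steps earlier.
    Returns (total, partial_sums).
    """
    partial = [0] * adder_segments
    for i, (x, y) in enumerate(zip(a, b)):
        partial[i % adder_segments] += x * y
    # Final step: the partial sums are added together.
    return sum(partial), partial


if __name__ == "__main__":
    total, partial = inner_product([1, 2, 3, 4, 5, 6, 7, 8], [1] * 8)
    print(total, partial)
```

The interleaving only changes the order of the additions, so for exact arithmetic the total equals the ordinary inner product.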

25

Figure 9-13

Multiple module memory organization

Four memory modules, each with its own address register (AR), memory array, and data register (DR).
The address bus connects to every AR; every DR connects to the data bus.
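
With a multiple module memory such as Figure 9-13, consecutive addresses are commonly interleaved across the modules so that the modules can work on successive accesses at overlapping times. The sketch below assumes low-order interleaving with four modules: the remainder of the address selects the module and the quotient selects the word inside it. The function name `interleave` is illustrative.

```python
def interleave(address, modules=4):
    """Map a memory address to (module number, location within module).

    Assumes low-order interleaving: consecutive addresses fall in
    consecutive modules. When `modules` is a power of 2 this is the
    same as using the least significant address bits as the module
    number and the remaining bits as the location.
    """
    return address % modules, address // modules


if __name__ == "__main__":
    for address in range(8):
        print(address, interleave(address))
```

A vector stored at consecutive addresses therefore touches all four modules in turn before returning to the first one.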

26

Types of Array Processors

Attached Array Processor
SIMD Array Processor

27

Figure 9-14

Attached Array Processor with host computer

General-purpose computer ↔ Input-output interface ↔ Attached array processor
Main memory (host side) ↔ High-speed memory-to-memory bus ↔ Local memory (array processor side)

28

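Figure 9-15

SIMD array processor organization

Master control unit, connected to main memory and to the processing elements.
Processing elements PE1, PE2, PE3, ..., PEn, each with its own local memory M1, M2, M3, ..., Mn.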
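
The organization of Figure 9-15 can be illustrated with a vector addition in which every processing element applies the same operation to operands held in its own local memory. This is a sequential sketch of a parallel machine: the loop stands in for PEs that would all act at once, and the names `simd_add`, `a`, `b`, and `c` are hypothetical.

```python
def simd_add(local_memories):
    """Apply the same add instruction in every processing element.

    `local_memories` is a list with one dict per PE (its local memory
    Mi), each holding operands under the keys "a" and "b". The sum is
    stored back into the same local memory under "c".
    """
    for memory in local_memories:  # conceptually simultaneous in all PEs
        memory["c"] = memory["a"] + memory["b"]
    return [memory["c"] for memory in local_memories]


if __name__ == "__main__":
    memories = [{"a": i, "b": 10 * i} for i in range(1, 5)]
    print(simd_add(memories))
```

The master control unit corresponds to the single line of code inside the loop: one instruction, broadcast to all PEs, each working on different data.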
