
Page 1: Vector Computers

VECTOR COMPUTERS

Page 2: Vector Computers

VECTOR COMPUTERS

• Vector Instruction Types

• Memory Access Schemes

• Vector Task Scheduling

Vector Instruction Types

Characterization of vector instructions for register-based pipelined vector machines.

Page 3: Vector Computers

VECTOR COMPUTERS

(1) Vector-vector Instruction

Mappings defining vector-vector instructions:

f1: Vj → Vi
f2: Vj × Vk → Vi

[Figure: the Vj and Vk registers feed a functional unit whose result is written to the Vi register]

Page 4: Vector Computers

VECTOR COMPUTERS

Vector-vector Instruction

Example:

for f1: V1 = sin(V2), and

for f2: V3 = V1 + V2, where Vi denotes a vector register, for i = 1, 2, 3
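As a concrete illustration (not part of the original slides), a minimal Python sketch of the register-level semantics of f1 and f2; the parameter VL stands in for the vector-length register, and the function names simply follow the slide's labels:

```python
import math

# Sketch of the two vector-vector instruction forms; VL models the
# vector-length register, and the operations are illustrative choices.
def f1(Vj, VL, op=math.sin):
    # f1: Vj -> Vi, a unary elementwise operation (e.g., V1 = sin(V2))
    return [op(Vj[k]) for k in range(VL)]

def f2(Vj, Vk, VL, op=lambda a, b: a + b):
    # f2: Vj x Vk -> Vi, a binary elementwise operation (e.g., V3 = V1 + V2)
    return [op(Vj[k], Vk[k]) for k in range(VL)]

V2 = [0.0, 1.0, 2.0]
V1 = f1(V2, VL=3)        # elementwise sine
V3 = f2(V1, V2, VL=3)    # elementwise addition
```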

Page 5: Vector Computers

VECTOR COMPUTERS

(2) Vector-scalar Instruction

f3: s × Vk → Vi

[Figure: the scalar register Sj and the vector register Vk feed a functional unit whose result is written to the Vi register]

Page 6: Vector Computers

VECTOR COMPUTERS

(3) Vector-memory Instruction

f4: M → V (Vector Load)
f5: V → M (Vector Store)

[Figure: the Vi register is connected to memory by memory paths; a vector load moves data from memory into Vi, and a vector store moves data from Vi back to memory]

Page 7: Vector Computers

VECTOR COMPUTERS

(4) Vector Reduction Instructions

f6: Vi → Sj (e.g., maximum, minimum, sum, mean)
f7: Vi × Vj → Sk (e.g., dot product)

(5) Gather and Scatter Instructions (to gather from / scatter to random locations throughout memory)

f8: M → V1 × V0 (Gather)

Gather: an operation that fetches from memory the nonzero elements of a sparse vector, using indices that are themselves indexed.

Page 8: Vector Computers

VECTOR COMPUTERS

f9: V1 × V0 → M (Scatter)

Scatter: the opposite of gather; it stores the elements of a vector into memory at locations given by an index vector, producing a sparse vector whose nonzero entries are indexed.

Page 9: Vector Computers

VECTOR COMPUTERS

Gather Instruction

[Figure: gather with VL register = 4 (vector length) and base memory address = 100.
V0 register (index): 4, 2, 7, 0
Memory contents at addresses 100–107: 200, 300, 400, 500, 600, 700, 100, 250
Address = base address + index; e.g., address = 100 + 4 = 104, so the first element loaded is M[104] = 600.
V1 register (data) after the gather: 600, 400, 250, 200]
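A minimal Python sketch of the gather semantics (not from the original slides), reproducing the figure's data; the gather function is an illustrative model that represents memory as a dictionary:

```python
def gather(memory, base, V0, VL):
    # Gather: V1[k] = memory[base + V0[k]] for k = 0 .. VL-1
    return [memory[base + V0[k]] for k in range(VL)]

# Data from the figure: memory at addresses 100-107 and the index register V0
memory = {100: 200, 101: 300, 102: 400, 103: 500,
          104: 600, 105: 700, 106: 100, 107: 250}
V1 = gather(memory, base=100, V0=[4, 2, 7, 0], VL=4)
print(V1)  # [600, 400, 250, 200], matching the V1 data register in the figure
```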

Page 10: Vector Computers

VECTOR COMPUTERS

Scatter Instruction

[Figure: scatter with VL register = 4 (vector length) and base memory address = 100.
V0 register (index): 4, 2, 7, 0
V1 register (data): 200, 300, 450, 500
Address = base address + index; e.g., address = 100 + 4 = 104, so the first element is stored at M[104].
Memory contents at addresses 100–107 after the scatter: 500, 300, 300, 500, 200, 700, 100, 450 (locations 101, 103, 105, and 106 keep their previous contents)]
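The matching scatter sketch (again illustrative, using the figure's registers); the initial contents of the four overwritten locations are not shown in the figure, so placeholder zeros are used:

```python
def scatter(memory, base, V0, V1, VL):
    # Scatter: memory[base + V0[k]] = V1[k] for k = 0 .. VL-1
    for k in range(VL):
        memory[base + V0[k]] = V1[k]

# Locations 101, 103, 105, 106 keep their prior contents, per the figure;
# the pre-store values of the written locations are placeholder zeros.
memory = {100: 0, 101: 300, 102: 0, 103: 500,
          104: 0, 105: 700, 106: 100, 107: 0}
scatter(memory, base=100, V0=[4, 2, 7, 0], V1=[200, 300, 450, 500], VL=4)
print([memory[a] for a in range(100, 108)])
# [500, 300, 300, 500, 200, 700, 100, 450], matching the figure
```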

Page 11: Vector Computers

VECTOR COMPUTERS

(6) Masking Instructions

A masking instruction uses a mask vector to compress a vector into a shorter index vector, or to expand it into a longer one.

f10: V0 × Vm → V1

Page 12: Vector Computers

VECTOR COMPUTERS (Masking)

[Figure: compressing a long vector into a short index vector (used in the Cray Y-MP).
V0 register (tested), indices 00–11: 0, -1, 0, 5, -15, 0, 0, 24, -7, 13, 0, -17
VM register (mask): 010110011101, with 1 for a nonzero element of V0 and 0 for a zero
V1 register (result) receives the indices of the nonzero elements: 01, 03, 04, 07, …
VL register: 4]
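A Python sketch of the compress operation in the figure (a simplified model for illustration, not the Cray Y-MP instruction itself): it derives the mask VM from the tested register V0 and emits the index vector:

```python
def compress_indices(V0, VL):
    # VM[i] = 1 for a nonzero element of V0, 0 for a zero element
    VM = [1 if V0[i] != 0 else 0 for i in range(VL)]
    # V1 receives the indices of the nonzero elements (the compressed vector)
    V1 = [i for i in range(VL) if VM[i] == 1]
    return VM, V1

V0 = [0, -1, 0, 5, -15, 0, 0, 24, -7, 13, 0, -17]  # tested register, from the figure
VM, V1 = compress_indices(V0, VL=12)
print("".join(map(str, VM)))  # 010110011101, matching the VM register
print(V1)                     # [1, 3, 4, 7, 8, 9, 11]; the figure lists 01, 03, 04, 07, ...
```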

Page 13: Vector Computers

VECTOR ACCESS MEMORY SCHEMES

Simultaneous (S) Access Memory Organization

• Low-order interleaved memory.

• All memory modules are accessed simultaneously, in a synchronized manner.

• A single access returns m consecutive words, one from each of the m memory modules.

• The high-order (n − a) address bits select the same word offset in every module.

Page 14: Vector Computers

VECTOR ACCESS MEMORY SCHEMES

• At the end of each memory cycle, m = 2^a consecutive words are latched in the data buffers simultaneously.

• The low-order a address bits are then used to select one of the m latched words during each minor cycle.

Page 15: Vector Computers

VECTOR ACCESS MEMORY SCHEMES

[Figure: S-access organization. The high-order (n − a) address bits drive modules 0 through m − 1 in parallel (read/write). During the fetch cycle, one word per module is latched into the data latches; during the access cycle, the low-order a address bits steer the latched words through a multiplexer, one single-word access per minor cycle]

Page 16: Vector Computers

Timing diagram for the S-access configuration

[Figure: modules M0 through Mm−1 carry out Fetch 1, Fetch 2, Fetch 3, … in lockstep; each access cycle (delivering m words) overlaps the next fetch cycle. Major cycle θ and minor cycle τ are related by θ = m·τ]
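A sketch of the S-access address split (illustrative, not from the slides), assuming the m = 2^a low-order interleave described above; the toy memory and parameter values are assumptions:

```python
def s_access_fetch(memory, addr, a):
    # One S-access fetch: the high-order (n - a) bits select the same word
    # offset in every module, latching m = 2**a consecutive words at once.
    m = 1 << a
    offset = addr >> a
    return [memory[(offset << a) | mod] for mod in range(m)]

memory = list(range(64))                      # toy memory: word i holds value i
print(s_access_fetch(memory, addr=13, a=2))   # [12, 13, 14, 15]: one word per module
```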

Page 17: Vector Computers

VECTOR ACCESS MEMORY SCHEMES

• What is the degree of interleaving?

• Degree of interleaving = m.

• How many cycles are required to fetch m consecutive words?

• Number of cycles required to fetch m consecutive words = 2 (one fetch cycle plus one access cycle).

• Applications of S-access memory:

– Whenever a block of data is to be fetched.

Page 18: Vector Computers

VECTOR ACCESS MEMORY SCHEMES

– S-access is ideal for accessing a vector of data elements.

– For prefetching sequential instructions for a pipelined processor.

– For accessing a block of information for a pipelined processor with a cache.

NOTE

• For non-sequential access, memory performance deteriorates.

Page 19: Vector Computers

VECTOR ACCESS MEMORY SCHEMES

• For non-sequential access:

– Exploit concurrency by providing an address latch for each memory module.

– The effective address hold time, ta, is then less than the memory cycle time.

– The group of m memory modules can thus be multiplexed on an internal memory address bus; such a group is called a bank or a line.

Page 20: Vector Computers

C (Concurrent) Access Memory Organization

Number of memory modules: m = 2^a

Number of words in each module: w = 2^b

Total memory capacity: m·w = 2^(a+b) words
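A sketch of the C-access address decomposition under low-order interleaving (illustrative, consistent with the m = 2^a, w = 2^b layout above):

```python
def c_access_decode(addr, a):
    # Low-order interleaving: module = addr mod m, word = addr div m, m = 2**a
    m = 1 << a
    module = addr & (m - 1)   # low-order a bits select the module
    word = addr >> a          # high-order b bits select the word within the module
    return module, word

# With m = 8 modules, consecutive addresses land in consecutive modules,
# which is what allows them to be accessed concurrently:
for addr in range(10):
    print(addr, c_access_decode(addr, a=3))
```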

Page 21: Vector Computers

C (Concurrent) Access Memory Organization

[Figure: C-access organization with low-order interleaving. Module M0 holds words 0, m, …, m(w−1); module M1 holds words 1, m+1, …, m(w−1)+1; …; module Mm−1 holds words m−1, 2m−1, …, mw−1. The memory address splits into a most-significant b-bit word address, held in the Word Address Buffer (WAB), and an a-bit module field decoded to select a module; each module has its own Module Address Buffer (MAB) and Memory Data Buffer (MDB) on a shared data bus]

Page 22: Vector Computers

C (Concurrent) Access Memory Organization

[Figure: timing diagram for accesses to consecutive addresses. Accesses 1, 2, 3, …, M, M+1, M+2 start in modules M0, M1, …, Mm−1 in turn, staggered by the address cycle time ta; each access takes the memory-access time Ta, so once the pipeline is full, output words 1, 2, 3, …, M, M+1, M+2 emerge one per ta]

Page 23: Vector Computers

C (Concurrent) Access Memory Organization

Memory-access time = Ta (major cycle)

Address cycle time: ta = Ta / M

Example:

Vector of s elements: V[0 : s − 1]

Access every other element of V (i.e., skip distance = 2).

Element V(i) is stored in module i (mod M), for 0 ≤ i ≤ s − 1.

After the initial access, one element is delivered every 2·ta seconds.

Page 24: Vector Computers

Timing diagram, skip distance d = 2

Let the number of memory modules be M = 8.

[Figure: only modules M0, M2, M4, and M6 are accessed; after the first access time Ta, the elements V[i] for i = 0, 2, 4, 6, 8, 10, 12, 14 are output one every 2·ta. Timing diagram for accessing the elements of V[0 : s − 1]]
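A small sketch of the delivery schedule for such a strided stream (an illustration, assuming element i resides in module i mod M and, as the slide states, one element emerges every d·ta after the initial access; Ta = 8 time units is an assumed value):

```python
def delivery_times(M, Ta, d, count):
    # Element i lives in module i mod M; after the first access completes
    # (taking Ta), one element is delivered every d * ta, where ta = Ta / M.
    ta = Ta / M
    return [(i * d, Ta + i * d * ta) for i in range(count)]

# M = 8, skip distance d = 2: V[0], V[2], ..., V[14], one every 2*ta
for index, t in delivery_times(M=8, Ta=8.0, d=2, count=8):
    print(f"V[{index}] delivered at t = {t}")
```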

Page 25: Vector Computers

C/S Access Memory Organization

The C/S access memory organization is a combination of the S-access and C-access schemes.

The modules are organized as a two-dimensional array.

Application: multiple-pipeline processors.

Page 26: Vector Computers

Multiple Vector Task Dispatching

Architecture of a typical vector processor with multiple functional units

[Figure: a high-speed main memory is connected to an Instruction Processing Unit (IPU), a Vector Access Controller (VAC), and a Vector Instruction Controller (VIC). The scalar processor holds the scalar registers and functional pipes 1 through p; the vector processor holds the vector registers and functional pipes 1 through m]

Page 27: Vector Computers

Multiple Vector Task Dispatching

IPU: fetches and decodes both scalar and vector instructions.

Scalar instructions are dispatched to the scalar processor.

VIC: receives vector instructions from the IPU.

Supervises the execution of vector instructions (VIs).

Page 28: Vector Computers

Multiple Vector Task Dispatching (VIC)

Decodes VIs

Calculates effective vector-operand addresses

Sets up VAC and Vector Processor

Monitors the execution of VIs

Partitions a vector task

Schedules different instructions to different functional pipelines

Page 29: Vector Computers

Multiple Vector Task Dispatching

VAC: fetches vector operands.

Scheduling a Vector Task

Time required to complete the execution of a single vector task = t0 + τ

where

t0 = pipeline overhead time due to startup and flush delays,

τ = tl · L = production delay,

tl = average latency between two operand pairs.
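The single-task timing model as a one-line sketch (the numeric values below are assumptions for illustration, not from the slides):

```python
def vector_task_time(t0, tl, L):
    # Execution time of a single vector task: t0 + tau, with tau = tl * L
    return t0 + tl * L

# e.g., overhead t0 = 10 cycles, one result per cycle (tl = 1), length L = 64:
print(vector_task_time(t0=10, tl=1, L=64))  # 74 cycles
```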

Page 30: Vector Computers

Multiple Vector Task Dispatching

Scheduling a Vector Task

L = vector length

Typically, t0 ≫ tl.

Objective: given a vector task system, schedule the tasks among m identical pipelines so that the total execution time is minimized.

Page 31: Vector Computers

Multiple Vector Task Dispatching

Assume

an equal overhead time t0 for all vector tasks.

A vector task system is characterized by a triple:

V = (T, <, τ)

where

Page 32: Vector Computers

Multiple Vector Task Dispatching

1. T ≡ {T1, T2, T3, …, Tn} is a set of n vector tasks.

2. < ≡ a partial ordering relation specifying the precedence relationships among the tasks in T.

3. τ: T → R+ ≡ a time function defining the production delay τ(Ti) for each Ti in T.

Page 33: Vector Computers

Multiple Vector Task Dispatching

Let us denote τ(Ti) ≡ τi for i = 1, 2, 3, …, n.

Number of pipelines = m, i.e., the set of vector pipelines is P = {P1, P2, …, Pm}.

Set of possible time intervals ≡ R²

Utilization of a pipeline Pi within the interval [x, y] ≡ Pi(x, y)

Page 34: Vector Computers

Multiple Vector Task Dispatching

Resource Space

The set of all possible pipeline-utilization patterns is called the resource space:

≡ the Cartesian product P × R² = {Pi(x, y) | Pi ∈ P and (x, y) ∈ R²}

Parallel schedule f for a vector task system V = (T, <, τ):

Page 35: Vector Computers

Multiple Vector Task Dispatching

f: T → 2^(P×R²)

where 2^(P×R²) is the power set of the resource space P × R².

Typically, the mapping for each Ti ∈ T is

f(Ti) = {Pi1(x1, y1), Pi2(x2, y2), …, Pip(xp, yp)}

Page 36: Vector Computers

Multiple Vector Task Dispatching

Partitioning of task Ti

i.e., the task Ti is subdivided into p subtasks,

{Tij | j = 1, 2, …, p} = {Ti1, Ti2, …, Tip},

and subtask Tij is executed by pipeline Pij, for each j = 1, 2, …, p.

Page 37: Vector Computers

Multiple Vector Task Dispatching

Conditions for multi-pipeline operation:

1. (a) yj − xj > t0 for all intervals [xj, yj], j = 1, 2, …, p.

(b) Total production delay:

τi = Σ_{j=1}^{p} (yj − xj − t0)

2. If Pij = Pil for j ≠ l, then [xj, yj] ∩ [xl, yl] = Φ

[i.e., each pipeline performs only one subtask at a time]
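These conditions can be checked mechanically. Below is a minimal Python sketch (an illustration, not from the slides) that verifies conditions 1(a), 1(b), and 2 for a candidate schedule; the encoding of f(Ti) as (pipeline, x, y) tuples is an assumed representation:

```python
def check_schedule(f, t0, tau):
    # f maps each task name to a list of (pipeline, x, y) interval assignments;
    # tau maps each task name to its total production delay tau_i.
    for task, parts in f.items():
        assert all(y - x > t0 for (_, x, y) in parts)               # condition 1(a)
        assert sum(y - x - t0 for (_, x, y) in parts) == tau[task]  # condition 1(b)
    busy = {}
    for task, parts in f.items():
        for (p, x, y) in parts:
            for (u, v) in busy.get(p, []):
                # condition 2: intervals on the same pipeline must not overlap
                assert min(y, v) <= max(x, u)
            busy.setdefault(p, []).append((x, y))
    return True
```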

Page 38: Vector Computers

Multiple Vector Task Dispatching

Finish time for each vector task Ti:

F(Ti) = max {y1, y2, …, yp}

Finish time for a parallel schedule of an n-task system:

ω ≡ max {F(T1), F(T2), …, F(Tn)}

A good parallel schedule minimizes ω.

Page 39: Vector Computers

Multiple Vector Task Dispatching

Example:

T = {T1, T2, T3, T4} with production delays τ1 = 10, τ2 = 2, τ3 = 6, τ4 = 2, and overhead t0 = 1.

Task graph: T1 (τ1 = 10) precedes T2 (τ2 = 2) and T3 (τ3 = 6); T4 (τ4 = 2) has no predecessor.

Schedule T = {T1, T2, T3, T4} on two (m = 2) pipelines.

Page 40: Vector Computers

Multiple Vector Task Dispatching

Solution:

Partition the vector tasks that have large production delays.

T1 and T3, having delays of 10 and 6 respectively, can be partitioned.

Partition the vector tasks so as to optimize pipeline utilization.

Partition (1): vector task T1 into T11 and T12, with τ11 = 7 and τ12 = 3.

Page 41: Vector Computers

Multiple Vector Task Dispatching

(2): vector task T3 into T31 and T32, with τ31 = 4 and τ32 = 2.

[Figure: Gantt chart over time 0–14.
P1: t0 + T11 (τ11 = 7) in [0, 8]; t0 + T31 (τ31 = 4) in [8, 13]; idle in [13, 14].
P2: t0 + T4 (τ4 = 2) in [0, 3]; t0 + T12 (τ12 = 3) in [3, 7]; idle in [7, 8]; t0 + T2 (τ2 = 2) in [8, 11]; t0 + T32 (τ32 = 2) in [11, 14]]

Page 42: Vector Computers

Multiple Vector Task Dispatching

Mappings:

f(T1) = {P1(0, 8), P2(3, 7)} with S(T1) = 0 and F(T1) = 8

f(T2) = {P2(8, 11)} with S(T2) = 8 and F(T2) = 11

f(T3) = {P1(8, 13), P2(11, 14)} with S(T3) = 8 and F(T3) = 14

f(T4) = {P2(0, 3)} with S(T4) = 0 and F(T4) = 3

Therefore, the parallel schedule f has finish time ω = 14 = F(T3).
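As a check (illustrative, using the check_schedule sketch given after the conditions above), this example schedule satisfies all three conditions and yields ω = 14:

```python
# The example schedule: t0 = 1; tau_1 = 10, tau_2 = 2, tau_3 = 6, tau_4 = 2
f = {"T1": [("P1", 0, 8), ("P2", 3, 7)],
     "T2": [("P2", 8, 11)],
     "T3": [("P1", 8, 13), ("P2", 11, 14)],
     "T4": [("P2", 0, 3)]}
tau = {"T1": 10, "T2": 2, "T3": 6, "T4": 2}
check_schedule(f, t0=1, tau=tau)  # conditions 1(a), 1(b), and 2 all hold
omega = max(y for parts in f.values() for (_, _, y) in parts)
print(omega)  # 14 = F(T3)
```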

Page 43: Vector Computers

Multiple Vector Task Dispatching

Formal statement of the multiple-pipeline scheduling problem:

Given (i) a vector task system V, (ii) a vector computer with (iii) m identical pipelines, and (iv) a deadline D, does there exist a parallel schedule f for V with finish time ω such that ω ≤ D?

[It is a feasibility problem.]

Desirable algorithm: a heuristic scheduling algorithm.