VECTOR COMPUTERS
• Vector Instruction Types
• Memory Access Schemes
• Vector Task Scheduling
Vector Instruction Types
Characterization of vector instructions for register-based pipelined vector machines:
(1) Vector-vector Instruction
Mappings defining vector-vector instructions:
f1: Vj → Vi
f2: Vj + Vk → Vi
[Figure: the Vj and Vk registers feed a functional unit whose output goes to the Vi register.]
Vector-vector Instruction
Example:
for f1: V1 = sin(V2), and
for f2: V3 = V1 + V2, where Vi is a vector register for i = 1, 2, 3
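The semantics of f1 and f2 can be sketched with ordinary Python lists standing in for vector registers (the names V1, V2, V3 mirror the example above; the elementwise loops model what the pipeline does per operand pair):

```python
import math

# V2 is a vector register holding three sample values.
V2 = [0.0, math.pi / 2, math.pi]

# f1: V1 = sin(V2), applied elementwise over the whole vector.
V1 = [math.sin(x) for x in V2]

# f2: V3 = V1 + V2, an elementwise vector-vector add.
V3 = [a + b for a, b in zip(V1, V2)]
```

A single vector instruction thus replaces an explicit scalar loop over all elements.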
(2) Vector-scalar Instruction
f3: s × Vk → Vi
[Figure: the scalar register Sj and the vector register Vk feed a functional unit whose output goes to the Vi register.]
(3) Vector-memory Instruction
f4: M → V (Vector Load)
f5: V → M (Vector Store)
[Figure: the Vi register is connected to memory through a memory path in each direction, one for vector load and one for vector store.]
(4) Vector reduction instructions
f6: Vi → Sj; e.g., Max, Min, Sum, Mean
f7: Vi × Vj → Sk; e.g., Dot product
(5) Gather and Scatter instructions (to gather/scatter data randomly throughout memory)
f8: M → V1 × V0 (Gather)
Gather: an operation that fetches from memory the nonzero elements of a sparse vector, using indices that themselves are indexed.
f9: V1 × V0 → M (Scatter)
Scatter: the opposite of Gather; it stores a vector into memory as a sparse vector whose nonzero entries are indexed.
Gather Instruction (example)

VL Register = 4 (vector length); Memory Address (Base) = 100
Effective address = Base Address + Index (e.g., 100 + 4 = 104)

V0 (Index)   V1 (Data)    Memory contents
    4           600       Address  Data
    2           400         100     200
    7           250         101     300
    0           200         102     400
                            103     500
                            104     600
                            105     700
                            106     100
                            107     250

Each element V1[i] is loaded from address Base + V0[i]; e.g., V1[0] = memory[100 + 4] = memory[104] = 600.
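The gather example above can be sketched in Python; the names gather, memory, v0, v1 are illustrative, not part of any real vector ISA, and memory is modeled as a dictionary from address to word:

```python
def gather(memory, base, index_reg, vl):
    """f8: fetch vl elements, data[i] = memory[base + index_reg[i]]."""
    return [memory[base + index_reg[i]] for i in range(vl)]

# Memory contents at addresses 100..107, as in the figure.
memory = {100: 200, 101: 300, 102: 400, 103: 500,
          104: 600, 105: 700, 106: 100, 107: 250}

v0 = [4, 2, 7, 0]              # V0 index register
v1 = gather(memory, 100, v0, vl=4)
print(v1)                      # [600, 400, 250, 200], matching V1 in the figure
```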
Scatter Instruction (example)

VL Register = 4 (vector length); Memory Address (Base) = 100
Effective address = Base Address + Index (e.g., 100 + 4 = 104)

V0 (Index)   V1 (Data)    Memory contents (after the store)
    4           200       Address  Data
    2           300         100     500
    7           450         101     300
    0           500         102     300
                            103     500
                            104     200
                            105     700
                            106     100
                            107     450

Each element V1[i] is stored to address Base + V0[i]; e.g., V1[0] = 200 is stored at 100 + 4 = 104.
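The scatter example can be sketched the same way; the figure only shows memory after the store, so the prior values at the four target addresses are placeholders (marked as 0 below):

```python
def scatter(memory, base, index_reg, data_reg, vl):
    """f9: store vl elements, memory[base + index_reg[i]] = data_reg[i]."""
    for i in range(vl):
        memory[base + index_reg[i]] = data_reg[i]

# 0 stands in for unknown pre-store contents at the addresses being written.
memory = {100: 0, 101: 300, 102: 0, 103: 500,
          104: 0, 105: 700, 106: 100, 107: 0}

v0 = [4, 2, 7, 0]            # V0 index register
v1 = [200, 300, 450, 500]    # V1 data register
scatter(memory, 100, v0, v1, vl=4)
print(memory[104], memory[102], memory[107], memory[100])  # 200 300 450 500
```

The untouched addresses (101, 103, 105, 106) keep their old values, as in the figure.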
(6) Masking Instructions
A mask vector is used to compress or expand a vector into a shorter or longer index vector, respectively.
f10: V0 × Vm → V1
Masking (example)

VL Register = 4; VM Register = 010110011101
(VM bit = 1 for a nonzero element of V0, 0 for a zero element)

V0 (Tested)          V1 (Result: indices of nonzero elements)
Index  Value
 00      0              01
 01     -1              03
 02      0              04
 03      5              07
 04    -15
 05      0
 06      0
 07     24
 08     -7
 09     13
 10      0
 11    -17

Used for compressing a long vector into a short index vector; used in the Cray Y-MP.
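The compress step can be sketched as follows; the names v0, vm, v1 mirror the registers in the example, and the list comprehensions stand in for the hardware mask logic:

```python
# V0 (tested): the 12-element vector from the example.
v0 = [0, -1, 0, 5, -15, 0, 0, 24, -7, 13, 0, -17]

# VM: mask bit is 1 for nonzero, 0 for zero in V0.
vm = [1 if x != 0 else 0 for x in v0]

# V1: the compressed index vector, holding indices of nonzero elements.
v1 = [i for i, bit in enumerate(vm) if bit]

print(''.join(map(str, vm)))   # 010110011101, the VM register in the figure
print(v1[:4])                  # [1, 3, 4, 7], the V1 entries shown in the figure
```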
VECTOR ACCESS MEMORY SCHEMES
Simultaneous (S) Access Memory Organization
• Low-order interleaved memory.
• All memory modules are accessed simultaneously in a synchronized manner.
• A single access returns 'm' consecutive words simultaneously from 'm' memory modules.
• The high-order (n-a) bits select the same word offset from each module.
• At the end of each memory cycle, m = 2^a consecutive words are latched into the data buffers simultaneously.
• The low-order 'a' bits then select one of the 'm' latched words, one word per minor cycle.
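The address decomposition above can be sketched in a few lines; the value a = 3 (so m = 8 modules) is an illustrative assumption, not from the slides:

```python
a = 3                                # low-order bits: m = 2**a = 8 modules

def split(addr):
    """Split an address into (word offset, latched-word select)."""
    offset = addr >> a               # high-order (n-a) bits: same offset in every module
    module = addr & ((1 << a) - 1)   # low-order a bits: which latched word
    return offset, module

# One fetch latches 8 consecutive words (the same offset from all 8 modules):
# addresses 40..47 share offset 5 and differ only in the low-order field.
assert [split(x) for x in range(40, 48)] == [(5, mod) for mod in range(8)]
```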
[Figure: S-access configuration. The (n-a) high-order address bits drive modules 0 to m-1 in parallel under a common read/write signal; each module's output is held in a data latch, and a multiplexer driven by the 'a' low-order address bits delivers a single word per minor cycle.]

[Timing diagram for the S-access configuration: fetch cycles Fetch1, Fetch2, Fetch3, … proceed in parallel across modules M0 to Mm-1; each fetch delivers m words, followed by m single-word access (minor) cycles. Major cycle θ = m·τ, where τ is the minor cycle.]
• What is the degree of interleaving?
  Degree of interleaving = m.
• How many cycles are required to fetch m consecutive words?
  Two cycles.
• Applications of S-Access Memory:
  – Whenever a block of data is to be fetched.
  – S-access is ideal for accessing a vector of data elements.
  – Prefetching sequential instructions for a pipelined processor.
  – Accessing a block of information for a pipelined processor with a cache.
NOTE
• For non-sequential access, memory performance deteriorates.
• For non-sequential access, exploit concurrency by providing an address latch for each memory module. If the effective address hold time ta is less than the memory cycle time, the group of m memory modules can be multiplexed on an internal memory address bus; such a group is called a bank or a line.
C (Concurrent) Access Memory Organization
Number of memory modules: m = 2^a
Number of words in each module: w = 2^b
Total memory capacity: m·w = 2^(a+b) words
[Figure: C-access memory organization. The memory address splits into a most-significant word field ('b' bits, held in the Word Address Buffer, WAB) and a least-significant module field ('a' bits, fed to an address decoder that selects one of modules M0 to Mm-1). Each module has its own Module Address Buffer (MAB) and Memory Data Buffer (MDB) on a shared data bus; module Mi holds words i, m+i, 2m+i, …, m(w-1)+i.]
[Timing diagram for accesses to consecutive addresses: accesses 1, 2, 3, …, M, M+1, M+2, … start in modules M0, M1, M2, …, Mm-1 in staggered fashion, one new access every ta; each access completes after the memory-access time Ta, so output words emerge at times 1, 2, 3, …, M, M+1, M+2, ….]
C (Concurrent) Access Memory Organization
Address cycle time: ta = Ta/M
Memory-access time: Ta (major cycle)

Example:
Vector of s elements, V[0 : s-1].
Access every other element of V (i.e., skip distance d = 2).
Element V[i] is stored in module i (mod M), for 0 ≤ i ≤ s-1.
After the initial access, sequential elements are delivered at a rate of one every 2·ta seconds.
[Timing diagram for accessing the elements of V[0 : s-1] with skip distance d = 2 and M = 8 memory modules: only modules M0, M2, M4, M6 are used; output words V[0], V[2], V[4], …, V[14] emerge at times 0, 2, 4, …, 14 (in units of ta), each access taking the major cycle Ta.]
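The module traffic in that diagram can be sketched directly from the placement rule V[i] → module i mod M; the variable names are illustrative:

```python
# With skip distance d, successive accesses touch element indices 0, d, 2d, ...
# so the sequence of modules visited is (i*d) mod M.
M, d = 8, 2
visited = [(i * d) % M for i in range(8)]
print(visited)   # [0, 2, 4, 6, 0, 2, 4, 6]: only the even modules are used

# Each module is revisited only after M // gcd(M, d) = 4 accesses; at one
# access every 2*ta seconds, that leaves 8*ta = M*ta = Ta between hits on
# the same module, so the stream proceeds without module conflicts.
```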
C/S Access Memory Organization
• A combination of the S-access and C-access schemes.
• The modules are organized in a two-dimensional array.
• Application: multiple-pipeline processors.
Multiple Vector Task Dispatching
[Figure: architecture of a typical vector processor with multiple functional units. A high-speed main memory connects to an Instruction Processing Unit (IPU), a Vector Access Controller (VAC), and a Vector Instruction Controller (VIC). The scalar processor holds scalar registers feeding pipelines Pipe 1 … Pipe p; the vector processor holds vector registers feeding pipelines Pipe 1 … Pipe m.]
IPU: Fetches and decodes scalar and vector instructions; scalar instructions are dispatched to the scalar processor.
VIC: Receives vector instructions (VIs) from the IPU and supervises their execution.
VIC functions:
• Decodes VIs
• Calculates effective vector-operand addresses
• Sets up the VAC and the vector processor
• Monitors the execution of VIs
• Partitions a vector task
• Schedules different instructions to different functional pipelines
VAC: Fetches vector operands.
Scheduling Vector Tasks
Time required to complete the execution of a single vector task = t0 + τ
where
t0 = pipeline overhead time due to startup and flush delays,
τ = tl·L = production delay,
tl = average latency between two operand pairs,
L = vector length.
Typically, t0 >> tl.
Objective: Given a task, schedule the vector task among ‘m’ identical pipelines such that the total execution time is minimized.
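The single-pipeline cost model above can be sketched numerically; the values t0 = 10 and tl = 1 (in clock cycles) are made-up illustrative numbers, not from the slides:

```python
def vector_task_time(t0, tl, L):
    """Total time for one vector task: startup/flush overhead t0
    plus the production delay tau = tl * L."""
    return t0 + tl * L

# Because t0 >> tl, short vectors are dominated by overhead while
# long vectors amortize it:
print(vector_task_time(10, 1, 4))     # 14: overhead is most of the cost
print(vector_task_time(10, 1, 1000))  # 1010: overhead is about 1%
```

This is why partitioning a task across m pipelines pays its overhead t0 once per subtask, a trade-off the scheduling conditions below make explicit.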
Assume an equal overhead time t0 for all vector tasks.
A vector task system is characterized by a triple:
V = (T, <, τ)
where
1. T ≡ {T1, T2, T3,…, Tn} is a set of ‘n’ vector tasks.
2. < ≡ partial ordering relation, specifying the precedence relationship among the tasks in the set T
3. τ: T → R+ ≡ a time function defining the production delay τ(Ti) for each Ti in T.
Let us denote τ(Ti) ≡ τi for i = 1, 2, 3, …, n.
Number of pipelines = m, i.e., the set of vector pipelines is P = {P1, P2, …, Pm}.
Set of possible time intervals ≡ R².
Utilization of a pipeline Pi within the interval [x, y] is denoted Pi(x, y).
Resource Space
The set of all possible pipeline-utilization patterns is called the Resource Space:
P × R² = {Pi(x, y) | Pi ∈ P and (x, y) ∈ R²}
A parallel schedule 'f' for a vector task system V = (T, <, τ) is a mapping
f: T → 2^(P×R²)
where 2^(P×R²) is the power set of the Resource Space P × R².
Typically, the mapping for each Ti ∈ T is
f(Ti) = {Pi1(x1, y1), Pi2(x2, y2), …, Pip(xp, yp)}
Partitioning of task Ti:
the task Ti is subdivided into p subtasks, {Tij | j = 1, 2, …, p} = {Ti1, Ti2, …, Tip}, and subtask Tij is executed by pipeline Pij for each j = 1, 2, …, p.
Conditions for multi-pipeline operation:
1. (a) yj − xj > t0 for all intervals [xj, yj], j = 1, 2, …, p
   (b) Total production delay:
       τi = Σ_{j=1}^{p} (yj − xj − t0)
2. If Pij = Pil, then [xj, yj] ∩ [xl, yl] = Φ
   (i.e., each pipeline performs only one subtask at a time).
Finish time of each vector task Ti:
F(Ti) = max {y1, y2, …, yp}
Finish time of a parallel schedule for an n-task system:
ω ≡ max {F(T1), F(T2), …, F(Tn)}
A good parallel schedule minimizes ω.
Example:
T = {T1, T2, T3, T4}, with overhead t0 = 1 and production delays τ1 = 10, τ2 = 2, τ3 = 6, τ4 = 2.
[Task graph: T1 (τ1 = 10) and T4 (τ4 = 2) on top, T2 (τ2 = 2) and T3 (τ3 = 6) below, with precedence edges running downward; t0 = 1.]
Schedule T = {T1, T2, T3, T4} on two (m = 2) pipelines.
Solution:
Partition the vector tasks having large production delays: T1 and T3, with delays of 10 and 6 respectively, can be partitioned.
Partition the vector tasks so as to optimize pipeline utilization.
Partition (1): vector task T1 into T11 and T12, with τ11 = 7 and τ12 = 3.
Multiple Vector Task Dispatching
(2) Vector task T3 into T31 and T32 with τ31 = 4 and τ32 = 2.
0 1 2 3 4 5 6 7 8 9 10 11 12
Τ0 Τ11 = 7 Τ0 Τ31 = 4
Τ0 Τ4 = 2 Τ0 Τ12 = 3 Τ0 Τ2 = 2 Τ0 Τ32 = 2
13
14
idle
P1
P2
Mappings:
f(T1) = {P1(0, 8), P2(3, 7)} with S(T1) = 0 and F(T1) = 8
f(T2) = {P2(8, 11)} with S(T2) = 8 and F(T2) = 11
f(T3) = {P1(8, 13), P2(11, 14)} with S(T3) = 8 and F(T3) = 14
f(T4) = {P2(0, 3)} with S(T4) = 0 and F(T4) = 3
Therefore, the parallel schedule 'f' has finish time ω = 14 = F(T3).
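The example schedule can be checked mechanically against conditions 1(a) and 1(b); the dictionaries f and tau simply transcribe the mappings and production delays from the example:

```python
t0 = 1
tau = {'T1': 10, 'T2': 2, 'T3': 6, 'T4': 2}
f = {'T1': [('P1', 0, 8), ('P2', 3, 7)],
     'T2': [('P2', 8, 11)],
     'T3': [('P1', 8, 13), ('P2', 11, 14)],
     'T4': [('P2', 0, 3)]}

for task, intervals in f.items():
    # Condition 1(a): every interval is long enough to cover the overhead.
    assert all(y - x > t0 for _, x, y in intervals)
    # Condition 1(b): the useful time summed over subtasks equals tau_i.
    assert sum(y - x - t0 for _, x, y in intervals) == tau[task]

# Condition 2 also holds: P1's intervals (0,8),(8,13) and P2's
# (0,3),(3,7),(8,11),(11,14) never overlap.

finish = {task: max(y for _, _, y in ivs) for task, ivs in f.items()}
omega = max(finish.values())
print(omega)   # 14, the finish time F(T3) of the parallel schedule
```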
Formal statement of the multiple-pipeline scheduling problem:
Given (i) a vector task system V, (ii) a vector computer with (iii) m identical pipelines, and (iv) a deadline D, does there exist a parallel schedule 'f' for V with finish time ω such that ω ≤ D?
[It is a feasibility problem.]
A heuristic scheduling algorithm is desirable.