VECTOR COMPUTERS
• Vector Instruction Types
• Memory Access Schemes
• Vector Task Scheduling
Vector Instruction Types
Characterization of vector instructions for register-based pipelined vector machines:
(1) Vector-vector Instruction
Mappings defining vector-vector instructions:
f1: Vj → Vi
f2: Vj + Vk → Vi
[Figure: the Vj and Vk registers feed a functional unit whose output goes to the Vi register.]
Vector-vector Instruction
Example:
for f1: V1 = sin(V2), and
for f2: V3 = V1 + V2, where Vi is a vector register for i = 1, 2, 3
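The semantics of f1 and f2 can be sketched with ordinary Python lists standing in for vector registers (the names V1, V2, V3 mirror the example above; the elementwise loops model what the pipeline does per operand pair):

```python
import math

# V2 is a vector register holding three sample values.
V2 = [0.0, math.pi / 2, math.pi]

# f1: V1 = sin(V2), applied elementwise over the whole vector.
V1 = [math.sin(x) for x in V2]

# f2: V3 = V1 + V2, an elementwise vector-vector add.
V3 = [a + b for a, b in zip(V1, V2)]
```

A single vector instruction thus replaces an explicit scalar loop over all elements.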
(2) Vector-scalar Instruction
f3: s × Vk → Vi
[Figure: the scalar register Sj and the vector register Vk feed a functional unit whose output goes to the Vi register.]
(3) Vector-memory Instruction
f4: M → V (Vector Load)
f5: V → M (Vector Store)
[Figure: the Vi register is connected to memory through a memory path in each direction, one for vector load and one for vector store.]
(4) Vector reduction instructions
f6: Vi → Sj; e.g., Max, Min, Sum, Mean
f7: Vi × Vj → Sk; e.g., Dot product
(5) Gather and Scatter instructions (to gather/scatter data randomly throughout memory)
f8: M → V1 × V0 (Gather)
Gather: an operation that fetches from memory the nonzero elements of a sparse vector, using indices that themselves are indexed.
f9: V1 × V0 → M (Scatter)
Scatter: the opposite of Gather; it stores a vector into memory as a sparse vector whose nonzero entries are indexed.
Gather Instruction (example)

VL Register = 4 (vector length); Memory Address (Base) = 100
Effective address = Base Address + Index (e.g., 100 + 4 = 104)

V0 (Index)   V1 (Data)    Memory contents
    4           600       Address  Data
    2           400         100     200
    7           250         101     300
    0           200         102     400
                            103     500
                            104     600
                            105     700
                            106     100
                            107     250

Each element V1[i] is loaded from address Base + V0[i]; e.g., V1[0] = memory[100 + 4] = memory[104] = 600.
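The gather example above can be sketched in Python; the names gather, memory, v0, v1 are illustrative, not part of any real vector ISA, and memory is modeled as a dictionary from address to word:

```python
def gather(memory, base, index_reg, vl):
    """f8: fetch vl elements, data[i] = memory[base + index_reg[i]]."""
    return [memory[base + index_reg[i]] for i in range(vl)]

# Memory contents at addresses 100..107, as in the figure.
memory = {100: 200, 101: 300, 102: 400, 103: 500,
          104: 600, 105: 700, 106: 100, 107: 250}

v0 = [4, 2, 7, 0]              # V0 index register
v1 = gather(memory, 100, v0, vl=4)
print(v1)                      # [600, 400, 250, 200], matching V1 in the figure
```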
Scatter Instruction (example)

VL Register = 4 (vector length); Memory Address (Base) = 100
Effective address = Base Address + Index (e.g., 100 + 4 = 104)

V0 (Index)   V1 (Data)    Memory contents (after the store)
    4           200       Address  Data
    2           300         100     500
    7           450         101     300
    0           500         102     300
                            103     500
                            104     200
                            105     700
                            106     100
                            107     450

Each element V1[i] is stored to address Base + V0[i]; e.g., V1[0] = 200 is stored at 100 + 4 = 104.
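The scatter example can be sketched the same way; the figure only shows memory after the store, so the prior values at the four target addresses are placeholders (marked as 0 below):

```python
def scatter(memory, base, index_reg, data_reg, vl):
    """f9: store vl elements, memory[base + index_reg[i]] = data_reg[i]."""
    for i in range(vl):
        memory[base + index_reg[i]] = data_reg[i]

# 0 stands in for unknown pre-store contents at the addresses being written.
memory = {100: 0, 101: 300, 102: 0, 103: 500,
          104: 0, 105: 700, 106: 100, 107: 0}

v0 = [4, 2, 7, 0]            # V0 index register
v1 = [200, 300, 450, 500]    # V1 data register
scatter(memory, 100, v0, v1, vl=4)
print(memory[104], memory[102], memory[107], memory[100])  # 200 300 450 500
```

The untouched addresses (101, 103, 105, 106) keep their old values, as in the figure.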
(6) Masking Instructions
A mask vector is used to compress or expand a vector into a shorter or longer index vector, respectively.
f10: V0 × Vm → V1
Masking (example)

VL Register = 4; VM Register = 010110011101
(VM bit = 1 for a nonzero element of V0, 0 for a zero element)

V0 (Tested)          V1 (Result: indices of nonzero elements)
Index  Value
 00      0              01
 01     -1              03
 02      0              04
 03      5              07
 04    -15
 05      0
 06      0
 07     24
 08     -7
 09     13
 10      0
 11    -17

Used for compressing a long vector into a short index vector; used in the Cray Y-MP.
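The compress step can be sketched as follows; the names v0, vm, v1 mirror the registers in the example, and the list comprehensions stand in for the hardware mask logic:

```python
# V0 (tested): the 12-element vector from the example.
v0 = [0, -1, 0, 5, -15, 0, 0, 24, -7, 13, 0, -17]

# VM: mask bit is 1 for nonzero, 0 for zero in V0.
vm = [1 if x != 0 else 0 for x in v0]

# V1: the compressed index vector, holding indices of nonzero elements.
v1 = [i for i, bit in enumerate(vm) if bit]

print(''.join(map(str, vm)))   # 010110011101, the VM register in the figure
print(v1[:4])                  # [1, 3, 4, 7], the V1 entries shown in the figure
```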
VECTOR ACCESS MEMORY SCHEMES
Simultaneous (S) Access Memory Organization
• Low-order interleaved memory.
• All memory modules are accessed simultaneously in a synchronized manner.
• A single access returns 'm' consecutive words simultaneously from 'm' memory modules.
• The high-order (n-a) bits select the same word offset from each module.
• At the end of each memory cycle, m = 2^a consecutive words are latched into the data buffers simultaneously.
• The low-order 'a' bits then select one of the 'm' latched words, one word per minor cycle.
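The address decomposition above can be sketched in a few lines; the value a = 3 (so m = 8 modules) is an illustrative assumption, not from the slides:

```python
a = 3                                # low-order bits: m = 2**a = 8 modules

def split(addr):
    """Split an address into (word offset, latched-word select)."""
    offset = addr >> a               # high-order (n-a) bits: same offset in every module
    module = addr & ((1 << a) - 1)   # low-order a bits: which latched word
    return offset, module

# One fetch latches 8 consecutive words (the same offset from all 8 modules):
# addresses 40..47 share offset 5 and differ only in the low-order field.
assert [split(x) for x in range(40, 48)] == [(5, mod) for mod in range(8)]
```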
[Figure: S-access configuration. The (n-a) high-order address bits drive modules 0 to m-1 in parallel under a common read/write signal; each module's output is held in a data latch, and a multiplexer driven by the 'a' low-order address bits delivers a single word per minor cycle.]

[Timing diagram for the S-access configuration: fetch cycles Fetch1, Fetch2, Fetch3, … proceed in parallel across modules M0 to Mm-1; each fetch delivers m words, followed by m single-word access (minor) cycles. Major cycle θ = m·τ, where τ is the minor cycle.]
• What is the degree of interleaving?
  Degree of interleaving = m.
• How many cycles are required to fetch m consecutive words?
  Two cycles.
• Applications of S-Access Memory:
  – Whenever a block of data is to be fetched.
  – S-access is ideal for accessing a vector of data elements.
  – Prefetching sequential instructions for a pipelined processor.
  – Accessing a block of information for a pipelined processor with a cache.
NOTE
• For non-sequential access, memory performance deteriorates.
• For non-sequential access, exploit concurrency by providing an address latch for each memory module. If the effective address hold time ta is less than the memory cycle time, the group of m memory modules can be multiplexed on an internal memory address bus; such a group is called a bank or a line.
C (Concurrent) Access Memory Organization
Number of memory modules: m = 2^a
Number of words in each module: w = 2^b
Total memory capacity: m·w = 2^(a+b) words
[Figure: C-access memory organization. The memory address splits into a most-significant word field ('b' bits, held in the Word Address Buffer, WAB) and a least-significant module field ('a' bits, fed to an address decoder that selects one of modules M0 to Mm-1). Each module has its own Module Address Buffer (MAB) and Memory Data Buffer (MDB) on a shared data bus; module Mi holds words i, m+i, 2m+i, …, m(w-1)+i.]
[Timing diagram for accesses to consecutive addresses: accesses 1, 2, 3, …, M, M+1, M+2, … start in modules M0, M1, M2, …, Mm-1 in staggered fashion, one new access every ta; each access completes after the memory-access time Ta, so output words emerge at times 1, 2, 3, …, M, M+1, M+2, ….]
C (Concurrent) Access Memory Organization
Address cycle time: ta = Ta/M
Memory-access time: Ta (major cycle)

Example:
Vector of s elements, V[0 : s-1].
Access every other element of V (i.e., skip distance d = 2).
Element V[i] is stored in module i (mod M), for 0 ≤ i ≤ s-1.
After the initial access, sequential elements are delivered at a rate of one every 2·ta seconds.
[Timing diagram for accessing the elements of V[0 : s-1] with skip distance d = 2 and M = 8 memory modules: only modules M0, M2, M4, M6 are used; output words V[0], V[2], V[4], …, V[14] emerge at times 0, 2, 4, …, 14 (in units of ta), each access taking the major cycle Ta.]
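The module traffic in that diagram can be sketched directly from the placement rule V[i] → module i mod M; the variable names are illustrative:

```python
# With skip distance d, successive accesses touch element indices 0, d, 2d, ...
# so the sequence of modules visited is (i*d) mod M.
M, d = 8, 2
visited = [(i * d) % M for i in range(8)]
print(visited)   # [0, 2, 4, 6, 0, 2, 4, 6]: only the even modules are used

# Each module is revisited only after M // gcd(M, d) = 4 accesses; at one
# access every 2*ta seconds, that leaves 8*ta = M*ta = Ta between hits on
# the same module, so the stream proceeds without module conflicts.
```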
C/S Access Memory Organization
• A combination of the S-access and C-access schemes.
• The modules are organized in a two-dimensional array.
• Application: multiple-pipeline processors.
Multiple Vector Task Dispatching
[Figure: architecture of a typical vector processor with multiple functional units. A high-speed main memory connects to an Instruction Processing Unit (IPU), a Vector Access Controller (VAC), and a Vector Instruction Controller (VIC). The scalar processor holds scalar registers feeding pipelines Pipe 1 … Pipe p; the vector processor holds vector registers feeding pipelines Pipe 1 … Pipe m.]
IPU: Fetches and decodes scalar and vector instructions; scalar instructions are dispatched to the scalar processor.
VIC: Receives vector instructions (VIs) from the IPU and supervises their execution.
VIC functions:
• Decodes VIs
• Calculates effective vector-operand addresses
• Sets up the VAC and the vector processor
• Monitors the execution of VIs
• Partitions a vector task
• Schedules different instructions to different functional pipelines
VAC: Fetches vector operands.
Scheduling Vector Tasks
Time required to complete the execution of a single vector task = t0 + τ
where
t0 = pipeline overhead time due to startup and flush delays,
τ = tl·L = production delay,
tl = average latency between two operand pairs,
L = vector length.
Typically, t0 >> tl.
Objective: Given a task, schedule the vector task among ‘m’ identical pipelines such that the total execution time is minimized.
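The single-pipeline cost model above can be sketched numerically; the values t0 = 10 and tl = 1 (in clock cycles) are made-up illustrative numbers, not from the slides:

```python
def vector_task_time(t0, tl, L):
    """Total time for one vector task: startup/flush overhead t0
    plus the production delay tau = tl * L."""
    return t0 + tl * L

# Because t0 >> tl, short vectors are dominated by overhead while
# long vectors amortize it:
print(vector_task_time(10, 1, 4))     # 14: overhead is most of the cost
print(vector_task_time(10, 1, 1000))  # 1010: overhead is about 1%
```

This is why partitioning a task across m pipelines pays its overhead t0 once per subtask, a trade-off the scheduling conditions below make explicit.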
Assume an equal overhead time t0 for all vector tasks.
A vector task system is characterized by a triple:
V = (T, <, τ)
where
1. T ≡ {T1, T2, T3,…, Tn} is a set of ‘n’ vector tasks.
2. < ≡ partial ordering relation, specifying the precedence relationship among the tasks in the set T
3. τ: T → R+ ≡ a time function defining the production delay τ(Ti) for each Ti in T.
Let us denote τ(Ti) ≡ τi for i = 1, 2, 3, …, n.
Number of pipelines = m, i.e., the set of vector pipelines is P = {P1, P2, …, Pm}.
Set of possible time intervals ≡ R².
Utilization of a pipeline Pi within the interval [x, y] is denoted Pi(x, y).
Resource Space
The set of all possible pipeline-utilization patterns is called the Resource Space:
P × R² = {Pi(x, y) | Pi ∈ P and (x, y) ∈ R²}
A parallel schedule 'f' for a vector task system V = (T, <, τ) is a mapping
f: T → 2^(P×R²)
where 2^(P×R²) is the power set of the Resource Space P × R².
Typically, the mapping for each Ti ∈ T is
f(Ti) = {Pi1(x1, y1), Pi2(x2, y2), …, Pip(xp, yp)}
Partitioning of task Ti:
the task Ti is subdivided into p subtasks, {Tij | j = 1, 2, …, p} = {Ti1, Ti2, …, Tip}, and subtask Tij is executed by pipeline Pij for each j = 1, 2, …, p.
Conditions for multi-pipeline operation:
1. (a) yj − xj > t0 for all intervals [xj, yj], j = 1, 2, …, p
   (b) Total production delay:
       τi = Σ_{j=1}^{p} (yj − xj − t0)
2. If Pij = Pil, then [xj, yj] ∩ [xl, yl] = Φ
   (i.e., each pipeline performs only one subtask at a time).
Finish time of each vector task Ti:
F(Ti) = max {y1, y2, …, yp}
Finish time of a parallel schedule for an n-task system:
ω ≡ max {F(T1), F(T2), …, F(Tn)}
A good parallel schedule minimizes ω.
Example:
T = {T1, T2, T3, T4}, with overhead t0 = 1 and production delays τ1 = 10, τ2 = 2, τ3 = 6, τ4 = 2.
[Task graph: T1 (τ1 = 10) and T4 (τ4 = 2) on top, T2 (τ2 = 2) and T3 (τ3 = 6) below, with precedence edges running downward; t0 = 1.]
Schedule T = {T1, T2, T3, T4} on two (m = 2) pipelines.
Solution:
Partition the vector tasks having large production delays: T1 and T3, with delays of 10 and 6 respectively, can be partitioned.
Partition the vector tasks so as to optimize pipeline utilization.
Partition (1): vector task T1 into T11 and T12, with τ11 = 7 and τ12 = 3.
Multiple Vector Task Dispatching
(2) Vector task T3 into T31 and T32 with τ31 = 4 and τ32 = 2.
0 1 2 3 4 5 6 7 8 9 10 11 12
Τ0 Τ11 = 7 Τ0 Τ31 = 4
Τ0 Τ4 = 2 Τ0 Τ12 = 3 Τ0 Τ2 = 2 Τ0 Τ32 = 2
13
14
idle
P1
P2
Mappings:
f(T1) = {P1(0, 8), P2(3, 7)} with S(T1) = 0 and F(T1) = 8
f(T2) = {P2(8, 11)} with S(T2) = 8 and F(T2) = 11
f(T3) = {P1(8, 13), P2(11, 14)} with S(T3) = 8 and F(T3) = 14
f(T4) = {P2(0, 3)} with S(T4) = 0 and F(T4) = 3
Therefore, the parallel schedule 'f' has finish time ω = 14 = F(T3).
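The example schedule can be checked mechanically against conditions 1(a) and 1(b); the dictionaries f and tau simply transcribe the mappings and production delays from the example:

```python
t0 = 1
tau = {'T1': 10, 'T2': 2, 'T3': 6, 'T4': 2}
f = {'T1': [('P1', 0, 8), ('P2', 3, 7)],
     'T2': [('P2', 8, 11)],
     'T3': [('P1', 8, 13), ('P2', 11, 14)],
     'T4': [('P2', 0, 3)]}

for task, intervals in f.items():
    # Condition 1(a): every interval is long enough to cover the overhead.
    assert all(y - x > t0 for _, x, y in intervals)
    # Condition 1(b): the useful time summed over subtasks equals tau_i.
    assert sum(y - x - t0 for _, x, y in intervals) == tau[task]

# Condition 2 also holds: P1's intervals (0,8),(8,13) and P2's
# (0,3),(3,7),(8,11),(11,14) never overlap.

finish = {task: max(y for _, _, y in ivs) for task, ivs in f.items()}
omega = max(finish.values())
print(omega)   # 14, the finish time F(T3) of the parallel schedule
```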
Formal statement of the multiple-pipeline scheduling problem:
Given (i) a vector task system V, (ii) a vector computer with (iii) m identical pipelines, and (iv) a deadline D, does there exist a parallel schedule 'f' for V with finish time ω such that ω ≤ D?
[It is a feasibility problem.]
A heuristic scheduling algorithm is desirable.