Datorteknik F1 bild 1
Higher Level Parallelism
The PRAM Model
Vector Processors
Flynn Classification
Connection Machine CM-2 (SIMD)
Communication Networks
Memory Architectures
Synchronization
Datorteknik F1 bild 2
Amdahl’s Law
The performance gain from speeding up some operations is limited by the fraction of the time the (faster) operations are used.

Speedup = Original T / Improved T
Speedup = Improved Performance / Original Performance
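A quick worked example (numbers chosen for illustration, not from the slide): if the sped-up operations account for 80% of the original time and become 4 times faster, then

 Improved T = (0.2 + 0.8/4) * Original T = 0.4 * Original T
 Speedup = Original T / Improved T = 1 / 0.4 = 2.5

and no matter how fast those operations become, the speedup can never exceed 1 / 0.2 = 5.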
Datorteknik F1 bild 3
PRAM MODEL
All processors share the same memory space
CRCW – concurrent read, concurrent write
 – resolution function on collision (first/or/largest/error)
CREW – concurrent read, exclusive write
EREW – exclusive read, exclusive write
Datorteknik F1 bild 4
PRAM Algorithm
Same program/algorithm in all processors
Each processor also has local memory/registers
Example: search for one value in an array using p processors
 – Array size m
 – p = m

Array: 3 2 5 7 2 5 1 6
Search for the value 2 in the array
Datorteknik F1 bild 5
Search CRCW p=m
step 1: concurrent read A – the same memory location (holding the key 2) is accessed by all processors
 P1 P2 P3 P4 P5 P6 P7 P8
 A:  2  2  2  2  2  2  2  2

step 2: read B – a different memory address for each processor
 P1 P2 P3 P4 P5 P6 P7 P8
 A:  2  2  2  2  2  2  2  2
 B:  3  2  5  7  2  5  1  6
Datorteknik F1 bild 6
Search CRCW p=m
step 3: concurrent write – write 1 if A = B, else 0
 P1 P2 P3 P4 P5 P6 P7 P8
 A:  2  2  2  2  2  2  2  2
 B:  3  2  5  7  2  5  1  6
All processors write to the same location; we use "or" resolution:
 1: value found
 0: value not found
Result: 1 (the value 2 is in the array)

Complexity
• All operations are performed in constant time
• Count only the cost of communication steps
• In this case the number of steps is independent of m (given enough processors)
• Search is done in constant time, O(1), for CRCW and p = m
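A minimal serial C sketch of this CRCW search (my illustration, not from the slides): each "processor" compares its element with the key, and the concurrent write with "or" resolution is modelled by OR-ing all per-processor results into one shared cell.

#include <stdio.h>

#define M 8 /* array size, p = m processors */

int main(void) {
    int B[M] = {3, 2, 5, 7, 2, 5, 1, 6}; /* step 2: one element per processor */
    int A = 2;                           /* step 1: all processors read the key */
    int found = 0;                       /* the shared result cell */

    /* step 3: every "processor" writes 1 if A == B; collisions
       are resolved with "or", so one matching write is enough   */
    for (int p = 0; p < M; p++)
        found |= (A == B[p]);

    printf("%d\n", found); /* 1: value found, 0: value not found */
    return 0;
}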
Datorteknik F1 bild 7
Search CREW p=m
step 3: each processor computes 1 if A = B, else 0
 P1 P2 P3 P4 P5 P6 P7 P8
 A:  2  2  2  2  2  2  2  2
 B:  3  2  5  7  2  5  1  6
     0  1  0  0  1  0  0  0

step 4 (repeated): combine the partial results pairwise
 step 4.1: read A (one partial result)
 step 4.2: read B (another partial result)
 step 4.3: compute A or B

The partial results are combined in a binary tree: P1..P4 combine
the eight results into four, P1 and P2 combine those into two,
and P1 computes the final result (1).
The same processors can be reused in the next step!
This takes log2 m steps.

Complexity
• We need log2 m steps to "collect" the result
• Operations are done in constant time
• O(log2 m) complexity
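A serial C sketch of the CREW collection phase (my illustration; the pairing schedule is one of several possible): after the compare step, the active half of the "processors" each OR a partner's result into their own, so the result is collected in log2 m steps.

#include <stdio.h>

#define M 8

int main(void) {
    int r[M] = {0, 1, 0, 0, 1, 0, 0, 0}; /* per-processor results of step 3 */

    /* step 4 repeated: processor p combines r[p] and r[p + half];
       the active half shrinks each round -> log2(M) = 3 steps     */
    for (int half = M / 2; half >= 1; half /= 2)
        for (int p = 0; p < half; p++)
            r[p] = r[p] | r[p + half];

    printf("%d\n", r[0]); /* final result collected in P1 */
    return 0;
}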
Datorteknik F1 bild 8
Search EREW p=m
In EREW the key must first be distributed, since only one processor
may read a memory cell in each step:

 P1 holds the value 2
 P1 -> P2                  (2 processors have it)
 P1, P2 -> P3, P4          (4 processors have it)
 P1..P4 -> P5..P8          (8 processors have it)

This takes log2 m steps.

It takes log2 m steps to distribute the value – more complex?
NO, the algorithm is still in O(log2 m), only the constant differs.
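A matching C sketch of the EREW distribution phase (my illustration): in each round, every processor that already has the key copies it to exactly one other processor, so no cell is read or written concurrently and the number of holders doubles each step.

#include <stdio.h>

#define M 8

int main(void) {
    int key[M] = {2, 0, 0, 0, 0, 0, 0, 0}; /* only P1 has the value */

    /* recursive doubling: processors 0..have-1 each copy the key
       to one new processor -> log2(M) = 3 rounds                  */
    for (int have = 1; have < M; have *= 2)
        for (int p = 0; p < have; p++)
            key[p + have] = key[p]; /* exclusive read, exclusive write */

    for (int p = 0; p < M; p++)
        printf("%d ", key[p]); /* all processors now hold 2 */
    printf("\n");
    return 0;
}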
Datorteknik F1 bild 9
PRAM a Theoretical Model
CRCW
 – Very elegant
 – Not of much practical use (too hard to implement)
CREW
 – This model can be used to develop algorithms for parallel computers, e.g. our search example:
   p = 1 (a single processor): checking all elements gives O(m)
   p = m (m processors): complexity O(log2 m), not O(1)
 – From our example we conclude that even in theory we do not get an m-times "speedup" using m processors
THAT IS ONE BIG PROBLEM WITH PARALLEL COMPUTERS
Datorteknik F1 bild 10
Parallelism so far
By pipelining, several instructions (at different stages) are executed simultaneously
 – Pipeline depth limited by hazards
SuperScalar designs provide parallel execution units
 – Limited by instruction and machine level parallelism
 – VLIW might improve over hardware instruction issuing
All are limited by the instruction fetch mechanism
 – Called the FLYNN BOTTLENECK
 – Only a very limited number of instructions can be fetched each cycle
 – That makes vector operations ineffective
Datorteknik F1 bild 11
Vector Processors
Taking pipelining to its limits for vector operations
 – Sometimes referred to as a SuperPipeline
The same operation is performed on a vector of data
 – No data dependencies in the vector data
 – Ex: add two vectors
Solves the FLYNN BOTTLENECK problem
 – A loop over a vector can be issued by a single instruction
Proven to be very effective for scientific calculations
 – CRAY-1, CRAY-2, CRAY-XMP, CRAY-YMP
Datorteknik F1 bild 12
Vector Processor (CRAY-1 like)
Block diagram:
 – Main memory
 – Vector load/store unit
 – Vector registers
 – Scalar registers (like the MIPS register file)
 – SuperPipelined arithmetic units: FP add/subtract, FP multiply, FP divide, integer, logical
Datorteknik F1 bild 13
Vector Operations
Fully pipelined
 – CPI = 1: we produce one result each cycle when the pipe is full
Pipeline latency
 – Startup cost = pipeline depth
   Vector add             6 cycles
   Vector multiplication  6 cycles
   Vector divide         20 cycles
   Vector load           12 cycles (depends on memory hierarchy)
Sustained rate
 – Time/element for a collection of related vector operations
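A back-of-the-envelope example (vector length assumed for illustration): with the 6-cycle add latency above and one result per cycle once the pipe is full, a 64-element vector add takes about 6 + 64 = 70 cycles, i.e. roughly 1.09 cycles per element, approaching CPI = 1 as vectors grow longer.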
Datorteknik F1 bild 14
Vector Processor Design
Vector length control
 – VLR register (Maximum Vector Length, MVL)
 – Strip mining in software (a vector longer than MVL causes a loop), see the sketch after this list
Stride
 – How to lay out vectors and matrices in memory, such that memory banks can be accessed without collision
Vector chaining
 – Forwarding between vector registers (minimizes latency)
Vector mask register (Boolean valued)
 – Conditional writeback (if 0, no writeback)
 – Sparse matrices and conditional execution
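A minimal C sketch of strip mining (my illustration; the MVL value is assumed): a loop over n elements is split into chunks of at most MVL elements, each of which the hardware can handle as a single vector operation.

#define MVL 64 /* assumed maximum vector length */

/* Add two vectors of length n in strips of at most MVL elements.
   Each inner loop corresponds to one hardware vector operation
   with the vector length register (VLR) set to len.              */
void vadd(double *c, const double *a, const double *b, int n) {
    for (int i = 0; i < n; i += MVL) {
        int len = (n - i < MVL) ? (n - i) : MVL; /* VLR = len */
        for (int j = 0; j < len; j++)            /* one vector op */
            c[i + j] = a[i + j] + b[i + j];
    }
}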
Datorteknik F1 bild 15
Programming
By use of language constructs, the compiler is able to utilize the vector functions
FORTRAN is widely used for scientific calculations
 – Built-in matrix and vector functions/commands
LINPACK
 – A library of optimized linear algebra functions
 – Often used as a benchmark (but does it tell the whole truth?)
Some more (implicit) vectorization is possible with advanced compilers
Datorteknik F1 bild 16
Flynn Classification
SISD (Single Instruction, Single Data)
 – The MIPS, and even the vector processor
SIMD (Single Instruction, Multiple Data)
 – Each instruction activates several execution units in parallel
MISD (Multiple Instruction, Single Data)
 – The VLIW architecture might be considered, but… MISD is a seldom used classification
MIMD (Multiple Instruction, Multiple Data)
 – Multiprocessor architectures
 – Multicomputers (communicating over a LAN), sometimes treated as a separate class of architectures
Datorteknik F1 bild 17
Communication
Bus
 • Total Bandwidth = Link Bandwidth
 • Bisection Bandwidth = Link Bandwidth
Ring
 • Total Bandwidth = P * Link Bandwidth
 • Bisection Bandwidth = 2 * Link Bandwidth
Fully Connected
 • Total Bandwidth = (P * (P-1)/2) * Link Bandwidth
 • Bisection Bandwidth = (P/2)^2 * Link Bandwidth
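A quick check with P = 8 (my arithmetic, using the formulas above): a ring has 8 links in total but only 2 crossing any bisection, while a fully connected network has 8*7/2 = 28 links in total and (8/2)^2 = 16 crossing the bisection, which shows why full connectivity scales so poorly in link cost.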
Datorteknik F1 bild 18
MultiStage Networks
Crossbar switch (P1..P4)
 – Example: P1 to P2 and P3, P2 to P4, P3 to P1 – all at the same time

Omega network (P1..P8)
 – log2 P switch stages
 – Example: P1 to P6, but P2 to P8 is not possible at the same time
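Rough cost comparison for P = 8 (my arithmetic, assuming 2x2 switching elements): an Omega network needs log2 8 = 3 stages of 8/2 = 4 switches each, i.e. 12 switches, against 8 * 8 = 64 crosspoints for a full crossbar; the saving is paid for with blocking, as in the P2-to-P8 example above.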
Datorteknik F1 bild 19
Connection Machines CM-2 (SIMD)
Block diagram: blocks of 16k 1-bit CPUs with 512 FPAs each,
attached to a front end (SISD), a sequencer, and a Data Vault (disk array).

 – 1024 chips and 512 FPAs per block
 – 16 fully connected 1-bit CPUs on each chip
 – Each CPU has 3 1-bit registers and 64 kbit of memory
 – The figure's 3-cube illustrates the topology; the CM-2 uses a
   12-cube for communication between the chips
Datorteknik F1 bild 20
SIMD Programming, Parallel sum

sum=0;
for (i=0; i<65536; i=i+1)       /* loop over 65k elements */
  sum = sum + A[Pn,i];          /* Pn is the processor number */

limit=8192; half=limit;         /* collect sum from 8192 processors */
repeat
  half = half/2;                /* split into sender/receiver */
  if (Pn >= half && Pn < limit) send(Pn-half, sum);
  if (Pn < half) sum = sum + receive();
  limit = half;
until (half == 1);              /* final sum */
Trace with four processors P0..P3 (R = received value):
 half=2, limit=4: P2 does send(0,sum), P3 does send(1,sum);
                  P0 and P1 each do sum = sum + R
 half=1, limit=2: P1 does send(0,sum); P0 does sum = sum + R
 P0 now holds the final sum
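A serial C sketch of the collection phase (my illustration; the slide's send/receive pair is modelled by a direct array access, and the partial sums are stand-in values):

#include <stdio.h>

#define P 8192

int main(void) {
    static long sum[P];
    for (int p = 0; p < P; p++) sum[p] = p; /* stand-in partial sums */

    /* each round, the upper half "sends" to the lower half */
    for (int half = P / 2; half >= 1; half /= 2)
        for (int p = 0; p < half; p++)
            sum[p] += sum[p + half]; /* receive() + add */

    printf("%ld\n", sum[0]); /* final sum, here 0+1+...+8191 */
    return 0;
}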
Datorteknik F1 bild 21
SIMD vs MIMD

SIMD
 – Single instruction (one PC)
 – All processors perform the same work (synchronized)
 – Conditional execution (case/if etc.): each processor holds an enable bit

MIMD
 – Each processor has its own PC
 – Possible to run different programs, BUT
 – All may run the same program (SPMD, Single Program Multiple Data)
   Use MIMD-style programming for conditional execution
   Use SIMD-style programming for synchronized actions
Datorteknik F1 bild 22
Memory Architectures for MIMD
 – Centralized
   We use a single bus for all main memory
   Uniform memory access (after passing the local cache)
 – Distributed
   The sought address might be hosted by another processor
   Non-uniform memory access (dynamic "find" time)
   The extreme: a cache-only memory
 – Shared
   All processors share the same address space
   Memory can be used for communication
 – Private
   Each processor has a unique address space
   Communication must be done by "message passing"
Datorteknik F1 bild 23
Shared Bus MIMD
Block diagram: several processors, each with its own cache and
snoop tag, on a shared bus with MEMORY and I/O. Usually 2-32 processors.

Cache Coherency Protocol
• Write invalidate: the first write to address A causes all other cached copies of A to be invalidated
• Write update: on a write to address A, all cached copies of A are updated (high bus activity)
• On a cache read miss when using write-back (WB) caches, either:
 – the cache holding the valid data writes it to memory, or
 – the cache holding the valid data writes it directly to the cache requesting the data
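A toy C model of write invalidate for one address (my illustration, not a real protocol implementation): every cache keeps a valid bit for A, and a write by one processor invalidates all other cached copies.

#include <stdio.h>

#define NCACHES 4

int valid[NCACHES]; /* does cache c hold a valid copy of A? */
int memory_A;       /* the shared location A                */

/* snooping write invalidate: the writer keeps its copy,
   every other cache drops (invalidates) its copy of A   */
void write_A(int writer, int value) {
    memory_A = value;
    for (int c = 0; c < NCACHES; c++)
        valid[c] = (c == writer);
}

int main(void) {
    for (int c = 0; c < NCACHES; c++) valid[c] = 1; /* all share A */
    write_A(2, 42); /* processor 2 writes A */
    for (int c = 0; c < NCACHES; c++)
        printf("cache %d valid: %d\n", c, valid[c]);
    return 0;
}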
Datorteknik F1 bild 24
Synchronization

When using shared data we need to ensure that only one processor
can access the data while updating it.
We need an atomic operation for TEST&SET.

Both processors run the same lock code:

loop: TEST&SET A.lock
      beq A.go, loop
      update A
      clear A.lock

Processor 1 gets the lock (A.go), updates the shared data, and
finally clears the lock (A.lock).
Processor 2 spin-waits until the lock is released, then updates
the shared data and releases the lock.
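A present-day C11 sketch of the same spin-lock idea (my illustration; the standard atomic_flag plays the role of A.lock, and the data/value names are assumed):

#include <stdatomic.h>
#include <stdio.h>

atomic_flag lock = ATOMIC_FLAG_INIT; /* plays the role of A.lock */
int shared_data;                     /* the shared data A        */

void update_shared(int value) {
    /* TEST&SET: atomically set the flag and return its old value;
       spin while another processor already holds the lock         */
    while (atomic_flag_test_and_set(&lock))
        ;                        /* spin-wait, the "beq ... loop" above */
    shared_data = value;         /* update A     */
    atomic_flag_clear(&lock);    /* clear A.lock */
}

int main(void) {
    update_shared(42);
    printf("%d\n", shared_data);
    return 0;
}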