Datorteknik F1 bild 1 Higher Level Parallelism The PRAM Model Vector Processors Flynn Classification Connection Machine CM-2 (SIMD) Communication Networks Memory Architectures Synchronization

Upload: weston-knoop

Post on 02-Apr-2015


TRANSCRIPT

Datorteknik F1 bild 1

Higher Level Parallelism

– The PRAM Model
– Vector Processors
– Flynn Classification
– Connection Machine CM-2 (SIMD)
– Communication Networks
– Memory Architectures
– Synchronization

Datorteknik F1 bild 2

Amdahl’s Law

The performance gain by speeding up some operations is limited by the fraction of the time these (faster) operations are used

Speedup = Original T / Improved T

Speedup = Improved Performance / Original Performance
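The slide states the law in words. A minimal sketch of the usual closed form follows, assuming a fraction of the original time is sped up by a given factor; the names `fraction_enhanced` and `speedup_enhanced` are illustrative and do not come from the slides.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup = Original T / Improved T, with Original T normalized to 1.

    fraction_enhanced: share of the original time that benefits (0..1), assumed name.
    speedup_enhanced:  how much faster that share becomes, assumed name.
    """
    improved_time = (1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced
    return 1.0 / improved_time


if __name__ == "__main__":
    # Even an enormous speedup of 90% of the work is capped by the remaining 10%.
    print(amdahl_speedup(0.9, 1e12))
```

The part that is not sped up bounds the result: with 90% of the time enhanced, the overall speedup stays below 10 however fast the enhanced part becomes.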

Datorteknik F1 bild 3

PRAM MODEL

All processors share the same memory space

CRCW
– concurrent read, concurrent write
– resolution function on collision (first/or/largest/error)

CREW
– concurrent read, exclusive write

EREW
– exclusive read, exclusive write

Datorteknik F1 bild 4

PRAM Algorithm

Same Program/Algorithm in All Processors

Each Processor also has local memory/registers

Example: search for one value in an array
– Using p processors
– Array size m
– p=m

Array: 3 2 5 7 2 5 1 6

Search for the value 2 in the array

Datorteknik F1 bild 5

Search CRCW p=m

Array: 3 2 5 7 2 5 1 6, value sought: 2

step1: concurrent read A; the same memory is accessed by all processors

     P1 P2 P3 P4 P5 P6 P7 P8
A:    2  2  2  2  2  2  2  2

step2: read B; different memory addresses for each processor

     P1 P2 P3 P4 P5 P6 P7 P8
A:    2  2  2  2  2  2  2  2
B:    3  2  5  7  2  5  1  6

Datorteknik F1 bild 6

Search CRCW p=m

step3: concurrent write; write 1 if A=B else 0

     P1 P2 P3 P4 P5 P6 P7 P8
A:    2  2  2  2  2  2  2  2
B:    3  2  5  7  2  5  1  6

Result: 1

We use “or” resolution
1: Value found
0: Value not found

Complexity
• All operations performed in constant time
• Count only the cost of communication steps
• In this case the number of steps is independent of m (if enough processors)
• Search is done in constant time O(1) for CRCW and p=m
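The three CRCW steps can be sketched as a sequential simulation. This is only an illustration under the slide's assumption p=m; the function name `crcw_search` and the step counter are assumptions, and the list comprehensions stand in for work that all processors would do at the same time.

```python
def crcw_search(array, value):
    """Simulate the CRCW PRAM search with one processor per array element."""
    m = len(array)
    # step 1: concurrent read, every processor reads the same cell A
    a = [value for _ in range(m)]
    # step 2: each processor p reads its own cell B[p]
    b = [array[p] for p in range(m)]
    # step 3: concurrent write of 1 if A == B else 0, combined by "or" resolution
    found = 0
    for p in range(m):
        found |= 1 if a[p] == b[p] else 0
    steps = 3  # independent of m, given enough processors
    return found, steps


if __name__ == "__main__":
    print(crcw_search([3, 2, 5, 7, 2, 5, 1, 6], 2))
```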

Datorteknik F1 bild 7

Search CREW p=m

step3: compute 1 if A=B else 0

     P1 P2 P3 P4 P5 P6 P7 P8
A:    2  2  2  2  2  2  2  2
B:    3  2  5  7  2  5  1  6
A=B:  0  1  0  0  1  0  0  0

step4.1: read A
step4.2: read B
step4.3: compute A or B

[Figure: reduction tree over log2 m steps]
Round 1 (P1 P2 P3 P4): 1 0 1 0
Round 2 (P1 P2):       1 1
Round 3 (P1):          1

Same processors can be reused in the next step!

Complexity
• We need log2 m steps to “collect” the result
• Operations done in constant time
• O(log2 m) complexity
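The "collect" phase can be sketched as a pairwise OR reduction in which each cell is written by exactly one processor per round. This is a sequential simulation, not a parallel program; the name `crew_search`, the padding of odd-length rounds with 0, and the non-empty input are assumptions made for the sketch.

```python
def crew_search(array, value):
    """Simulate the CREW search: compare in parallel, then OR-reduce in rounds."""
    # step 3: every processor computes 1 if A == B else 0
    flags = [1 if x == value else 0 for x in array]
    rounds = 0
    # step 4: in each round, processor i ORs two neighbouring results
    while len(flags) > 1:
        if len(flags) % 2:
            flags.append(0)  # pad so that every processor has a pair (assumption)
        flags = [flags[2 * i] | flags[2 * i + 1] for i in range(len(flags) // 2)]
        rounds += 1
    return flags[0], rounds


if __name__ == "__main__":
    print(crew_search([3, 2, 5, 7, 2, 5, 1, 6], 2))
```

For the 8-element example the reduction needs 3 rounds, which is log2 8.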

Datorteknik F1 bild 8

Search EREW p=m

[Figure: broadcast tree. The value 2 starts in P1 and is copied to P1 P2, then to P1 P2 P3 P4, then to P1 P2 P3 P4 P5 P6 P7 P8; log2 m steps in total]

It takes log2 m steps to distribute the value, more complex?

NO, the algorithm is still in O(log2 m), only the constant differs
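The distribution of the search value under exclusive reads can be sketched as a doubling broadcast: in each step every processor that already holds a copy passes it to one processor that does not. The name `erew_broadcast` and the list of per-processor cells are assumptions for this sequential simulation.

```python
def erew_broadcast(value, m):
    """Copy `value` to m private cells so that no cell is read or written twice per step."""
    cells = [value] + [None] * (m - 1)
    have = 1   # processors 0..have-1 already hold a copy
    steps = 0
    while have < m:
        # processor p reads only cells[p] and writes only cells[have + p]
        for p in range(min(have, m - have)):
            cells[have + p] = cells[p]
        have = min(2 * have, m)
        steps += 1
    return cells, steps


if __name__ == "__main__":
    print(erew_broadcast(2, 8))
```

With m=8 the value reaches all cells after 3 steps, so the broadcast adds a second log2 m phase in front of the reduction.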

Datorteknik F1 bild 9

PRAM a Theoretical Model

CRCW
– Very elegant
– Not of much practical use (too hard to implement)

CREW
– This model can be used to develop algorithms for parallel computers, e.g. our search example
   p=1 (a single processor): checking all elements gives O(m)
   p=m (m processors): complexity O(log2 m), not O(1)
– From our example we conclude that even in theory we do not get an m-times “speedup” using m processors

THAT IS ONE BIG PROBLEM WITH PARALLEL COMPUTERS
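To put numbers on the conclusion, the sketch below compares m sequential steps with log2 m parallel steps. It assumes unit cost per step and ignores constant factors; `search_speedup` and the efficiency measure (speedup divided by processor count) are illustrative names, not taken from the slides.

```python
import math


def search_speedup(m):
    """Idealized speedup of the CREW search with p = m processors (m > 1)."""
    sequential_steps = m             # p = 1: check every element
    parallel_steps = math.log2(m)    # p = m: log2 m reduction steps
    speedup = sequential_steps / parallel_steps
    efficiency = speedup / m         # fraction of the ideal m-times speedup
    return speedup, efficiency


if __name__ == "__main__":
    for m in (8, 1024, 65536):
        print(m, search_speedup(m))
```

Under these assumptions 1024 processors give a speedup of about 102, far from 1024.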

Datorteknik F1 bild 10

Parallelism so far

By pipelining, several instructions (at different stages) are executed simultaneously
– Pipeline depth limited by hazards

SuperScalar designs provide parallel execution units
– Limited by instruction and machine level parallelism
– VLIW might improve over hardware instruction issuing

All limited by the instruction fetch mechanism
– Called the FLYNN BOTTLENECK
– Only a very limited number of instructions can be fetched each cycle
– That makes vector operations ineffective

Datorteknik F1 bild 11

Vector Processors

Taking pipelining to its limits for vector operations
– Sometimes referred to as a SuperPipeline

The same operation is performed on a vector of data
– No data dependencies in the vector data
– Ex, add two vectors

Solves the FLYNN BOTTLENECK problem
– A loop over a vector can be issued by a single instruction

Proven to be very effective for scientific calculations
– CRAY-1, CRAY-2, CRAY-XMP, CRAY-YMP

Datorteknik F1 bild 12

Vector Processor (CRAY-1 like)

[Block diagram]
MAIN MEMORY
Vector load/store
Vector registers
Scalar registers (like MIPS reg file)
SuperPipelined arithmetic units:
– FP add/subtract
– FP multiply
– FP divide
– Integer
– Logical

Datorteknik F1 bild 13

Vector Operations

Fully Pipelined
– CPI = 1, we produce one result each cycle when the pipe is full

Pipeline Latency
– Startup cost = pipeline depth
   Vector Add             6 cycles
   Vector Multiplication  6 cycles
   Vector Divide         20 cycles
   Vector Load           12 cycles (depends on memory hierarchy)

Sustained rate
– Time/element for a collection of related vector operations
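A simple way to combine startup cost and the one-result-per-cycle rate is sketched below. It assumes a single, unchained vector operation takes startup + n cycles for n elements (some texts count startup + n - 1); the startup values are those listed on the slide, while `vector_cycles` and `cycles_per_element` are illustrative names.

```python
# Startup costs in cycles, as listed on the slide.
STARTUP = {"add": 6, "multiply": 6, "divide": 20, "load": 12}


def vector_cycles(op, n):
    """Cycles for one vector operation on n elements: fill the pipe, then 1 result/cycle."""
    return STARTUP[op] + n


def cycles_per_element(op, n):
    """Average time per element; approaches 1 as the vector gets longer."""
    return vector_cycles(op, n) / n


if __name__ == "__main__":
    for n in (4, 64, 1024):
        print(n, vector_cycles("add", n), cycles_per_element("add", n))
```

Short vectors are dominated by the startup cost, which is why long vectors are needed to approach CPI = 1.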

Datorteknik F1 bild 14

Vector Processor Design

Vector length control
– VLR register (Maximum Vector Length, MVL)
– Strip mining in software (vector > MVL causes a loop)

Stride
– How to lay out vectors and matrices in memory such that memory banks can be accessed without collision

Vector Chaining
– Forwarding between vector registers (minimize latency)

Vector Mask Register (Boolean valued)
– Conditional writeback (if 0, no writeback)
– Sparse matrices and conditional execution
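Three of these mechanisms can be sketched in a few lines: strip mining, masked writeback, and the effect of stride on memory banks. All names are illustrative, MVL = 64 is an assumed value, and the bank model assumes simple low-order interleaving (bank = address mod number of banks).

```python
from math import gcd

MVL = 64  # assumed maximum vector length of the hardware


def vector_add(a, b, dest, mask):
    """One 'vector instruction' on at most MVL elements with conditional writeback."""
    assert len(a) == len(b) == len(dest) == len(mask) <= MVL
    # mask bit 0 means no writeback: the old destination value is kept
    return [x + y if m else d for x, y, d, m in zip(a, b, dest, mask)]


def strip_mined_add(x, y, dest=None, mask=None):
    """Add vectors of any length by looping over strips of at most MVL elements."""
    n = len(x)
    dest = [0] * n if dest is None else dest
    mask = [1] * n if mask is None else mask
    out, strips = [], 0
    for start in range(0, n, MVL):
        end = start + min(MVL, n - start)  # the strip length plays the role of VLR
        out.extend(vector_add(x[start:end], y[start:end], dest[start:end], mask[start:end]))
        strips += 1
    return out, strips


def banks_touched(stride, n_banks):
    """Number of distinct banks hit by a strided access under low-order interleaving."""
    return n_banks // gcd(stride, n_banks)


if __name__ == "__main__":
    print(strip_mined_add(list(range(150)), [1] * 150)[1])  # number of strips
    print(banks_touched(1, 8), banks_touched(8, 8))
```

A stride that shares a large factor with the number of banks concentrates the accesses on few banks, which is the collision the layout has to avoid.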

Datorteknik F1 bild 15

Programming

By use of language constructs the compiler is able to utilize the vector functions

FORTRAN is widely used for scientific calculations

– built in matrix and vector functions/commands

LINPACK
– A library of optimized linear algebra functions
– Often used as a benchmark (but does it tell the whole truth?)

Some more (implicit) vectorization is possible with advanced compilers

Datorteknik F1 bild 16

Flynn Classification

SISD (Single Instruction, Single Data)
– The MIPS, and even the Vector Processor

SIMD (Single Instruction, Multiple Data)
– Each instruction activates several execution units in parallel

MISD (Multiple Instruction, Single Data)
– The VLIW architecture might be considered, but MISD is a seldom used classification

MIMD (Multiple Instruction, Multiple Data)
– Multiprocessor architectures
– Multicomputers (communicating over a LAN), sometimes treated as a separate class of architectures

Datorteknik F1 bild 17

Communication

Bus
• Total Bandwidth = Link Bandwidth
• Bisection Bandwidth = Link Bandwidth

Ring
• Total Bandwidth = P * Link Bandwidth
• Bisection Bandwidth = 2 * Link Bandwidth

Fully Connected
• Total Bandwidth = (P * (P-1))/2 * Link Bandwidth
• Bisection Bandwidth = (P/2)^2 * Link Bandwidth
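The formulas for the three topologies can be evaluated side by side with a small helper. The function name `bandwidths`, the topology strings, and the default link bandwidth of 1.0 are assumptions for this sketch; the fully connected bisection term is read as (P/2) squared.

```python
def bandwidths(topology, p, link_bw=1.0):
    """Return (total bandwidth, bisection bandwidth) for P processors."""
    if topology == "bus":
        return link_bw, link_bw
    if topology == "ring":
        return p * link_bw, 2 * link_bw
    if topology == "fully_connected":
        # P*(P-1)/2 links in total; (P/2)^2 links cross any even split
        return p * (p - 1) / 2 * link_bw, (p / 2) ** 2 * link_bw
    raise ValueError("unknown topology: " + topology)


if __name__ == "__main__":
    for name in ("bus", "ring", "fully_connected"):
        print(name, bandwidths(name, 8))
```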

Datorteknik F1 bild 18

MultiStage Networks

Crossbar Switch
[Figure: crossbar connecting P1 P2 P3 P4]
– P1 to P2, P3
– P2 to P4
– P3 to P1

Omega Network
[Figure: Omega network connecting P1 to P8; the figure is labeled log2 P]
– P1 to P6, but P2 to P8 not possible at the same time
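Blocking in a multistage network can be sketched with textbook destination-tag routing: each of the log2 P stages applies a perfect shuffle and then a 2x2 switch that selects its output from one bit of the destination address. This is an assumed wiring with nodes numbered from 0; the figure on the slide may be wired or numbered differently, so the pairs tested here are not the ones named on the slide. `omega_path` and `conflicts` are illustrative names.

```python
def omega_path(src, dst, n_bits):
    """Output line used after each stage when routing src -> dst (nodes numbered from 0)."""
    mask = (1 << n_bits) - 1
    pos = src
    path = []
    for stage in range(n_bits):
        # perfect shuffle: rotate the line number left by one bit
        pos = ((pos << 1) | (pos >> (n_bits - 1))) & mask
        # the 2x2 switch picks its upper/lower output from the next destination bit
        dst_bit = (dst >> (n_bits - 1 - stage)) & 1
        pos = (pos & ~1) | dst_bit
        path.append(pos)
    return path


def conflicts(messages, n_bits=3):
    """True if two messages need the same switch output in the same stage."""
    used = set()
    for src, dst in messages:
        for stage, line in enumerate(omega_path(src, dst, n_bits)):
            if (stage, line) in used:
                return True
            used.add((stage, line))
    return False


if __name__ == "__main__":
    print(omega_path(0, 5, 3))
    print(conflicts([(0, 0), (4, 1)]))
```

In this model, node 0 to node 0 and node 4 to node 1 both need the same output of the first-stage switch, so they cannot be routed at the same time, while a crossbar could serve both.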

Datorteknik F1 bild 19

Connection Machines CM-2 (SIMD)

[Block diagram: front end (SISD), sequencer, blocks of 16k 1-bit CPUs each paired with 512 FPAs, Data Vault (Disk Array), 3-cube]

CM-2 uses a 12-cube for communication between the chips

1024 chips, 512 FPAs

16 1-bit fully connected CPUs on each chip

Each CPU has 3 1-bit registers and 64 k-bit memory

Datorteknik F1 bild 20

SIMD Programming, Parallel sum

sum = 0;
for (i = 0; i < 65536; i = i + 1)      /* Loop over 65k elements */
    sum = sum + A[Pn,i];               /* Pn is the processor number */

limit = 8192; half = limit;            /* Collect sum from 8192 processors */
repeat
    half = half/2;                     /* Split into sender/receiver */
    if (Pn >= half && Pn < limit) send(Pn - half, sum);
    if (Pn < half) sum = sum + receive();
    limit = half;
until (half == 1);                     /* final sum */

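[Figure: reduction tree. In each round the processors between half and limit send their sum to the lower half, which computes sum = sum + R; limit and half shrink each round until processor 0 holds the final sum]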
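The collection phase can be checked with a sequential simulation. The sketch assumes a power-of-two processor count and that each sender in the upper half passes its partial sum to the processor `half` positions below it; `parallel_sum` and the list of per-processor sums are illustrative stand-ins for the real send/receive calls.

```python
def parallel_sum(local_sums):
    """Simulate the tree reduction; local_sums[pn] is the partial sum held by processor pn."""
    p = len(local_sums)
    assert p > 0 and (p & (p - 1)) == 0, "sketch assumes a power-of-two processor count"
    sums = list(local_sums)
    limit = p
    half = limit
    rounds = 0
    while half > 1:
        half //= 2                        # split into senders and receivers
        for pn in range(half, limit):     # senders: half <= pn < limit
            sums[pn - half] += sums[pn]   # receiver pn - half adds the received value
        limit = half
        rounds += 1
    return sums[0], rounds                # processor 0 holds the final sum


if __name__ == "__main__":
    print(parallel_sum([1, 2, 3, 4, 5, 6, 7, 8]))
```

With 8192 processors the same loop finishes in 13 rounds, since 2^13 = 8192.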

Datorteknik F1 bild 21

SIMD vs MIMD

SIMD
– Single Instruction (one PC)
– All processors perform the same work (synchronized)
– Conditional execution (case/if etc.): each processor holds an enable bit

MIMD
– Each processor has a PC
– Possible to run different programs, BUT
– All may run the same program (SPMD, Single Program ...)
   Use MIMD style programming for conditional execution
   Use SIMD style programming for synchronized actions
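Conditional execution with enable bits can be sketched as follows: the single instruction stream steps through both branches, and each processor commits a result only while its enable bit is set. The function `simd_if_else` and its arguments are illustrative; the list comprehensions stand in for lock-step hardware.

```python
def simd_if_else(values, predicate, then_op, else_op):
    """Run if/else over all processors using one instruction stream and enable bits."""
    # every processor evaluates the condition and sets its enable bit
    enable = [bool(predicate(v)) for v in values]
    result = list(values)
    # pass 1: the 'then' branch is broadcast; only enabled processors commit
    result = [then_op(v) if e else v for v, e in zip(result, enable)]
    # pass 2: enable bits are inverted and the 'else' branch is broadcast
    result = [else_op(v) if not e else v for v, e in zip(result, enable)]
    passes = 2  # both branches cost time, even if few processors take one of them
    return result, passes


if __name__ == "__main__":
    print(simd_if_else([1, 2, 3, 4], lambda v: v % 2 == 0, lambda v: v // 2, lambda v: 3 * v + 1))
```

Because both branches are always stepped through, heavily branching code suits the MIMD style better, as the slide suggests.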

Datorteknik F1 bild 22

Memory Architectures for MIMD

– Centralized
   We use a single bus for all main memory
   Uniform memory access (after passing the local cache)

– Distributed
   The sought address might be hosted by another processor
   Non-uniform memory access (dynamic “find” time)
   The extreme: a cache-only memory

– Shared
   All processors share the same address space
   Memory can be used for communication

– Private
   All processors have a unique address space
   Communication must be done by “message passing”

Datorteknik F1 bild 23

Shared Bus MIMD

[Block diagram: processors, each with a cache and snoop tags, attached to a shared bus together with MEMORY and I/O; usually 2-32 processors]

Cache Coherency Protocol
• Write Invalidate: the first write to address A causes all other cached references of A to be invalidated
• Write Update: on a write to address A all cached references of A are updated (high bus activity)
• On a cache read miss when using WB (write-back) caches, either
   • the cache holding the valid data writes to memory, or
   • the cache holding the valid data writes directly to the cache requiring the data
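Write invalidate can be sketched with a toy snooping bus. To stay short, the sketch assumes write-through caches, so memory always holds the valid data and the write-back read-miss cases are not modeled; the classes `Bus` and `Cache` and their methods are illustrative.

```python
class Bus:
    """Shared bus with main memory; every cache snoops invalidations on it."""

    def __init__(self):
        self.memory = {}
        self.caches = []

    def invalidate(self, addr, writer):
        # all caches except the writer drop their copy of addr
        for cache in self.caches:
            if cache is not writer:
                cache.lines.pop(addr, None)


class Cache:
    def __init__(self, bus):
        self.bus = bus
        self.lines = {}  # addr -> value for valid lines only
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:  # read miss: fetch from memory
            self.lines[addr] = self.bus.memory.get(addr, 0)
        return self.lines[addr]

    def write(self, addr, value):
        self.bus.invalidate(addr, self)  # write invalidate
        self.lines[addr] = value
        self.bus.memory[addr] = value    # write-through (simplifying assumption)


if __name__ == "__main__":
    bus = Bus()
    c0, c1 = Cache(bus), Cache(bus)
    c0.read("A"); c1.read("A")
    c0.write("A", 5)
    print("A" in c1.lines, c1.read("A"))
```

After the write by the first cache, the second cache no longer holds A and has to miss on its next read, which is how it sees the new value.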

Datorteknik F1 bild 24

Synchronization

When using shared data we need to see to it that only one processor can access the data when updating

We need an atomic operation for TEST&SET

Processor 1 and Processor 2 both run:

loop: TEST&SET A.lock
      beq A.go loop
      update A
      clear A.lock

Processor 1 gets the lock (A.go), updates the shared data, and finally clears the lock (A.lock)

Processor 2 spin-waits until the lock is released, then updates the shared data and releases the lock
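The spin-wait loop can be sketched with two threads. Python has no TEST&SET instruction, so the sketch emulates its atomicity with a `threading.Lock` inside `test_and_set`; the class `TestAndSetLock`, the function `run`, and the shared counter are illustrative.

```python
import threading
import time


class TestAndSetLock:
    """Spin lock built on an emulated atomic TEST&SET."""

    def __init__(self):
        self._flag = False
        self._atomic = threading.Lock()  # stands in for the hardware's atomicity

    def test_and_set(self):
        with self._atomic:
            old = self._flag
            self._flag = True
            return old  # False means the caller obtained the lock

    def acquire(self):
        while self.test_and_set():  # spin-wait until the lock was free
            time.sleep(0)           # yield so that the lock holder can make progress

    def release(self):
        self._flag = False          # clear the lock


def run(n_threads=2, n_updates=500):
    lock = TestAndSetLock()
    shared = {"A": 0}

    def worker():
        for _ in range(n_updates):
            lock.acquire()
            shared["A"] += 1        # update the shared data
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["A"]


if __name__ == "__main__":
    print(run())
```

Every update happens while the lock is held, so no increment is lost and the final count equals the number of threads times the number of updates.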