SE-292 High Performance Computing: Intro. to Concurrent Programming & Parallel Architecture
TRANSCRIPT
SE-292 High Performance Computing
Intro. to Concurrent Programming & Parallel Architecture
R. Govindarajan
govind@serc
2
PARALLEL ARCHITECTURE
Parallel machine: a computer system with more than one processor. The processors interact to solve a problem; special parallel machines are designed to make this interaction overhead less.
Questions: What about multicores? What about a network of machines?
Yes, but the time involved in interaction (communication) might be high, as such a system is designed assuming that the machines are more or less independent.
3
Classification of Parallel Machines
Flynn's classification: in terms of the number of instruction streams and data streams.
Instruction stream: path to instruction memory (PC)
Data stream: path to data memory
SISD: single instruction stream, single data stream
SIMD: single instruction stream, multiple data streams
MIMD: multiple instruction streams, multiple data streams
4
PARALLEL PROGRAMMING
Recall Flynn's classification: SIMD, MIMD.
On the programming side: SPMD, MPMD.
Shared memory: programming with threads.
Message passing: programming with messages.
Speedup = T_sequential / T_parallel
5
How Much Speedup is Possible?
Amdahl's Law: let s be the fraction of the sequential execution time of a given program that cannot be parallelized.
Assume that the program is parallelized so that the remaining part (1 - s) is perfectly divided to run in parallel across n processors.
Speedup(n) = 1 / (s + (1 - s)/n)
As n → ∞, Speedup(n) → 1/s.
The maximum speedup achievable is limited by the sequential fraction of the sequential program.
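Amdahl's Law is easy to check numerically. A minimal sketch (the function name and the sample values are illustrative, not from the slides):

```python
def amdahl_speedup(s, n):
    """Speedup on n processors when fraction s of the sequential
    execution time cannot be parallelized."""
    return 1.0 / (s + (1.0 - s) / n)

# Even with only 5% sequential work, speedup saturates near 1/s = 20.
print(amdahl_speedup(0.05, 10))     # ~6.9
print(amdahl_speedup(0.05, 100))    # ~16.8
print(amdahl_speedup(0.05, 10**6))  # ~20.0
```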
6
Understanding Amdahl's Law
Example: compute min(|A[i,j] - B[i,j]|) over n×n matrices, in two phases of n² work each.
[Figure: concurrency vs. time profiles. (a) Serial: both phases run at concurrency 1, for a total time of n² + n². (b) Naïve parallel: the first phase runs at concurrency p in time n²/p, but the second phase remains serial and takes n². (c) Parallel: both phases run at concurrency p, in time n²/p each.]
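The three profiles in the figure can be compared numerically. A sketch under the figure's cost model (each phase costs n² serially; the function names are my own):

```python
def speedup_naive(n, p):
    """Profile (b): only the first phase is parallelized."""
    t_serial = n * n + n * n       # profile (a): both phases serial
    t_naive = n * n / p + n * n    # second phase still serial
    return t_serial / t_naive

def speedup_full(n, p):
    """Profile (c): both phases are parallelized."""
    t_serial = n * n + n * n
    t_full = 2 * n * n / p
    return t_serial / t_full

n, p = 1000, 100
print(speedup_naive(n, p))  # ~1.98: half the work stayed serial
print(speedup_full(n, p))   # 100.0: ideal speedup of p
```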
7
Classification 2: Shared Memory vs. Message Passing
Shared memory machine: the n processors share a physical address space; communication can be done through this shared memory.
The alternative is sometimes referred to as a message passing machine or a distributed memory machine.
[Figure: (top) shared memory: processors P..P connected through an interconnect to a common main memory; (bottom) message passing: processors P..P, each with its own memory M, connected through an interconnect.]
8
Shared Memory Machines
The shared memory could itself be distributed among the processor nodes: each processor might have some portion of the shared physical address space that is physically close to it and therefore accessible in less time.
Terms: shared vs. private; local vs. remote; centralized vs. distributed shared; NUMA (Non-Uniform Memory Access) vs. UMA (Uniform Memory Access) architecture.
© RG@SERC,IISc 9
Shared Memory Architecture
[Figure: (left) centralized shared memory, Uniform Memory Access (UMA): processors with caches (P, $) connected through a network to shared memory modules M. (right) distributed shared memory, Non-Uniform Memory Access (NUMA): nodes, each with a processor P, cache $, and local memory M, connected through a network.]
10
MultiCore Structure
[Figure: a multicore chip with eight cores C0-C7. Each core has a private L1 cache (L1$); each pair of cores (C0/C2, C4/C6, C1/C3, C5/C7) shares an L2 cache; all L2 caches connect to a common memory.]
11
NUMA Architecture
[Figure: two sockets connected by QPI links. Each socket holds four cores (C0/C2/C4/C6 and C1/C3/C5/C7), each core with private L1$ and L2$, a shared L3 cache per socket, and an integrated memory controller (IMC) attached to that socket's local memory.]
12
Distributed Memory Architecture
[Figure: nodes, each with a processor P, cache $, and private memory M, connected through a network.]
Message Passing Architecture: memory is private to each node; processes communicate by messages. Such a machine is also called a cluster.
13
Parallel Architecture: Interconnections
Indirect interconnects: nodes are connected to an interconnection medium, not directly to each other. Examples: shared bus, multiple bus, crossbar, multistage interconnection network (MIN).
Direct interconnects: nodes are connected directly to each other.
Topology: linear, ring, star, mesh, torus, hypercube.
Routing techniques: how the route taken by a message from source to destination is decided.
14
Indirect Interconnects
[Figure: a shared bus, a multiple bus, a crossbar switch, and a multistage interconnection network built from 2x2 crossbars.]
15
Direct Interconnect Topologies
[Figure: linear array, ring, star, 2D mesh, torus, and hypercube (binary n-cube) for n = 2 and n = 3.]
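In a binary n-cube, two nodes are neighbours exactly when their labels differ in one bit, so a message can be routed by correcting the differing bits one dimension at a time (dimension-order routing). A sketch, with node labels as integers (not from the slides):

```python
def hypercube_route(src, dst):
    """Node sequence from src to dst, fixing differing address
    bits from the least-significant dimension upward."""
    path = [src]
    node = src
    bit = 0
    diff = src ^ dst
    while diff:
        if diff & 1:
            node ^= 1 << bit   # hop across this dimension
            path.append(node)
        diff >>= 1
        bit += 1
    return path

# Distance equals the Hamming distance of the two labels.
print(hypercube_route(0b000, 0b101))  # [0, 1, 5]: two hops
```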
16
Shared Memory Architecture: Caches
[Figure: memory holds X = 0. P1 and P2 both read X, caching X = 0. P1 then writes X = 1, updating its own cache and memory. When P2 reads X again, it gets a cache hit on the stale value X = 0: wrong data!]
17
Cache Coherence Problem
If each processor in a shared memory multiprocessor has a data cache, there is a potential data consistency problem, the cache coherence problem: a shared variable is modified in one private cache while stale copies remain in others.
Objective: processes shouldn't read 'stale' data.
Solutions:
Hardware: cache coherence mechanisms
Software: compiler-assisted cache coherence
18
Example: Write Once Protocol
Assumption: a shared bus interconnect where all cache controllers monitor all bus activity; this is called snooping.
Since there is only one operation on the bus at a time, cache controllers can be built to take corrective action and enforce coherence across the caches. Corrective action could involve updating or invalidating a cache block.
19
Invalidation Based Cache Coherence
[Figure: memory holds X = 0. P1 and P2 both read X, caching X = 0. When P1 writes X = 1, an Invalidate is broadcast on the bus and P2's copy is invalidated. P2's next read of X misses in its cache and fetches the up-to-date value X = 1.]
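The sequence in the figure can be traced with a toy model of snooping invalidation (a sketch only; real protocols track per-line states such as MSI/MESI, and all names here are my own):

```python
class Bus:
    """Shared bus: connects caches to memory and carries invalidates."""
    def __init__(self):
        self.caches = []
        self.memory = {"X": 0}

    def attach(self, cache):
        self.caches.append(cache)

    def invalidate_others(self, writer, addr):
        for c in self.caches:
            if c is not writer:
                c.lines.pop(addr, None)  # snooped: drop any copy

class Cache:
    def __init__(self, bus):
        self.lines = {}
        self.bus = bus
        bus.attach(self)

    def read(self, addr):
        if addr not in self.lines:       # miss: fetch from memory
            self.lines[addr] = self.bus.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.bus.invalidate_others(self, addr)  # invalidate on bus
        self.lines[addr] = value
        self.bus.memory[addr] = value           # write-through

bus = Bus()
p1, p2 = Cache(bus), Cache(bus)
p1.read("X"); p2.read("X")   # both cache X = 0
p1.write("X", 1)             # invalidates P2's copy
print(p2.read("X"))          # miss, refetch: prints 1, not stale 0
```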
20
Snoopy Cache Coherence and Locks
[Figure: one processor holds lock L (in its critical section) while the others spin, repeatedly executing Test&Set(L). Each Test&Set writes L = 1, invalidating the copies in the other caches, so the spinning processors generate continuous bus traffic. When the holder releases L (writes L = 0), one waiting processor's Test&Set finally succeeds.]
21
Space of Parallel Computing
Programming models: what the programmer uses in coding applications; they specify synchronization and communication.
Programming models: shared address space, message passing, data parallel.
Parallel architecture:
Shared memory: centralized shared memory (UMA), distributed shared memory (NUMA)
Distributed memory: a.k.a. message passing; e.g., clusters
22
Memory Consistency Model
The memory consistency model defines the order in which memory operations will appear to execute ⇒ What value can a read return?
It is a contract between application software and the system ⇒ it affects ease of programming and performance.
23
Implicit Memory Model
Sequential consistency (SC) [Lamport]: the result of an execution appears as if
• operations from different processors executed in some sequential (interleaved) order, and
• the memory operations of each process appear in program order.
[Figure: processors P1, P2, P3, ..., Pn all accessing a single shared MEMORY.]
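SC can be made concrete by enumerating all interleavings of a classic two-thread litmus test (the variable and register names are my own). Under SC, at least one thread must observe the other's write, so the outcome (r1, r2) = (0, 0) never occurs:

```python
# Thread 1: x = 1; r1 = y        Thread 2: y = 1; r2 = x
T1 = [("write", "x"), ("read", "y", "r1")]
T2 = [("write", "y"), ("read", "x", "r2")]

def merges(a, b):
    """All interleavings of a and b that preserve each thread's
    program order (the SC requirement)."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in merges(a[1:], b):
        yield [a[0]] + rest
    for rest in merges(a, b[1:]):
        yield [b[0]] + rest

outcomes = set()
for order in merges(T1, T2):
    mem = {"x": 0, "y": 0}
    regs = {}
    for op in order:
        if op[0] == "write":
            mem[op[1]] = 1
        else:
            regs[op[2]] = mem[op[1]]
    outcomes.add((regs["r1"], regs["r2"]))

print(sorted(outcomes))  # [(0, 1), (1, 0), (1, 1)] -- never (0, 0)
```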
24
Distributed Memory Architecture
[Figure: as on slide 12, processing nodes ("Proc. Node"), each with a processor P, cache $, and private memory M, connected through a network.]
Message Passing Architecture: memory is private to each node; processes communicate by messages.
25
NUMA Architecture
[Figure: the two-socket NUMA node from slide 11, now with an IO-Hub and a network interface card (NIC) attached alongside the two memories, so the node can be connected into a cluster.]
26
Using MultiCores to Build Cluster
[Figure: four multicore nodes (Node 0-Node 3), each with its own memory and NIC, connected through a network switch.]
Multi-core Workshop 27
Using MultiCores to Build Cluster
[Figure: the same four-node cluster; one node performs Send A(1:N), sending array A through the network switch to another node.]