

Outline for Intro to Multiprocessing

• Motivation & Applications

• Theory, History, & A Generic Parallel Machine

• Programming Models
– Shared Memory, Message Passing, Data Parallel

• Issues in Programming Models
– Function: naming, operations, & ordering

– Performance: latency, bandwidth, etc.

• Examples of Commercial Machines


Applications: Science and Engineering

• Examples
– Weather prediction

– Evolution of galaxies

– Oil reservoir simulation

– Automobile crash tests

– Drug development

– VLSI CAD

– Nuclear bomb simulation

• Typically model physical systems or phenomena

• Problems are 2D or 3D

• Usually require “number crunching”

• Involve “true” parallelism


So Why Is MP Research Disappearing?

• Where is MP research presented?
– ISCA = International Symposium on Computer Architecture

– ASPLOS = Arch. Support for Programming Languages and OS

– HPCA = High Performance Computer Architecture

– SPAA = Symposium on Parallel Algorithms and Architecture

– ICS = International Conference on Supercomputing

– PACT = Parallel Architectures and Compilation Techniques

– Etc.

• Why is the number of MP papers at ISCA falling?

• Read “The Rise and Fall of MP Research”
– Mark D. Hill and Ravi Rajwar


Outline

• Motivation & Applications

• Theory, History, & A Generic Parallel Machine

• Programming Models

• Issues in Programming Models

• Examples of Commercial Machines


In Theory

• Sequential
– Time to sum n numbers? O(n)

– Time to sort n numbers? O(n log n)

– What model? RAM

• Parallel
– Time to sum? Tree for O(log n)

– Time to sort? Non-trivially O(log n)

– What model?

» PRAM [Fortune & Wyllie, STOC ’78]

» P processors in lock-step

» One memory (e.g., CREW for concurrent read, exclusive write)

[Diagram: lock-step PRAM processors doing compare/exchange steps between neighbors (1<->2, 3<->4, then 2<->3)]
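The O(log n) tree sum can be sketched in a few lines; here is a minimal Python sketch that simulates the lock-step PRAM rounds sequentially (the function name and structure are mine, not from the slides):

```python
def tree_sum(values):
    """Sum n numbers in O(log n) lock-step rounds (simulated sequentially):
    each round combines adjacent pairs, as if by n/2 PRAM processors."""
    vals = list(values)
    while len(vals) > 1:
        # One round: element i pairs with i+1; a leftover element passes through.
        vals = [vals[i] + vals[i + 1] if i + 1 < len(vals) else vals[i]
                for i in range(0, len(vals), 2)]
    return vals[0]
```

With 8 inputs this takes 3 rounds instead of 7 sequential additions, which is where the O(log n) bound comes from.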


But in Practice, How Do You

• Name a datum (e.g., all say a[2])?

• Communicate values?

• Coordinate and synchronize?

• Select processing node size (few-bit ALU to a PC)?

• Select number of nodes in system?


Historical View

[Diagram: three machine organizations, each built from Processor (P), Memory (M), and I/O blocks]

• Join at: I/O (Network) → Program with: Message Passing

• Join at: Memory → Program with: Shared Memory

• Join at: Processor → Program with: Data Parallel, i.e., Single-Instruction Multiple-Data (SIMD), (Dataflow/Systolic)


Historical View, cont.

• Machine → Programming Model
– Join at network, so program with message passing model
– Join at memory, so program with shared memory model
– Join at processor, so program with SIMD or data parallel model

• Programming Model → Machine
– Message-passing programs on message-passing machine
– Shared-memory programs on shared-memory machine
– SIMD/data-parallel programs on SIMD/data-parallel machine

• But
– Isn’t hardware basically the same? Processors, memory, & I/O?
– Convergence! Why not have a generic parallel machine & program with the model that fits the problem?


A Generic Parallel Machine

• Separation of programming models from architectures

• All models require communication

• Node with processor(s), memory, communication assist

[Diagram: Node 0 through Node 3, each containing Processor (P), cache ($), Memory (Mem), and Communication Assist (CA), connected by an Interconnect]


Outline

• Motivation & Applications

• Theory, History, & A Generic Parallel Machine

• Programming Models
– Shared Memory

– Message Passing

– Data Parallel

• Issues in Programming Models

• Examples of Commercial Machines


Programming Model

• How does the programmer view the system?

• Shared memory: processors execute instructions and communicate by reading/writing a globally shared memory

• Message passing: processors execute instructions and communicate by explicitly sending messages

• Data parallel: processors execute the same instruction at the same time, but on different data


Today’s Parallel Computer Architecture

• Extension of traditional computer architecture to support communication and cooperation

– Communications architecture

[Diagram: layered view of the communication architecture]

• User level: Programming Model (Multiprogramming, Shared Memory, Message Passing, Data Parallel)

• Communication Abstraction: Libraries and Compilers

• System level: Operating System Support

• Hardware/Software Boundary

• Communication Hardware

• Physical Communication Medium


Simple Problem

for i = 1 to N

A[i] = (A[i] + B[i]) * C[i]

sum = sum + A[i]

• How do I make this parallel?


Simple Problem

for i = 1 to N

A[i] = (A[i] + B[i]) * C[i]

sum = sum + A[i]

• Split the loops
» Independent iterations

for i = 1 to N

A[i] = (A[i] + B[i]) * C[i]

for i = 1 to N

sum = sum + A[i]


Programming Model

• Historically, Programming Model == Architecture

• Thread(s) of control that operate on data

• Provides a communication abstraction that is a contract between hardware and software (a la ISA)

Current Programming Models

1) Shared Memory

2) Message Passing

3) Data Parallel (Shared Address Space)


Global Shared Physical Address Space

• Communication, sharing, and synchronization with loads/stores on shared variables

• Must map virtual pages to physical page frames

• Consider OS support for good mapping

[Diagram: each processor P0 … Pn issues loads and stores into its virtual address space, which has a private portion and a shared portion; the shared portions of all processors map to common physical addresses in the machine physical address space, while each processor’s private portion (P0 private, P1 private, P2 private, …, Pn private) maps to its own frames]


Page Mapping in Shared Memory MP

[Diagram: Node 0 through Node 3 on an interconnect, each with Processor (P), cache ($), Memory (Mem), and Network Interface (NI); physical page frames are distributed so Node 0 holds addresses 0 to N-1, Node 1 holds N to 2N-1, Node 2 holds 2N to 3N-1, and Node 3 holds 3N to 4N-1]

Keep private data and frequently used shared data on same node as computation

Load by Processor 0 to address N+3 goes to Node 1
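The address-to-node routing in that example is just integer division. A minimal sketch (the function name and the per-node size value are illustrative assumptions, not any machine's real interface):

```python
def home_node(addr, N):
    """Home node of a physical address under block distribution:
    Node 0 holds addresses 0..N-1, Node 1 holds N..2N-1, and so on."""
    return addr // N

# The slide's example: a load by Processor 0 to address N+3 lands on Node 1.
N = 4096   # per-node memory size; an arbitrary illustrative value
```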


Return of The Simple Problem

private int i,my_start,my_end,mynode;

shared float A[N], B[N], C[N], sum;

for i = my_start to my_end

A[i] = (A[i] + B[i]) * C[i]

GLOBAL_SYNCH;

if (mynode == 0)

for i = 1 to N

sum = sum + A[i]

• Can run this on any shared memory machine
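As a rough illustration of this shared-memory pattern, here is a Python sketch with threads standing in for processors and a Barrier playing the role of GLOBAL_SYNCH; N, P, and the array contents are made up for the example:

```python
import threading

# Shared data, visible to every "processor" (thread).
N, P = 8, 2
A = [float(i) for i in range(N)]
B = [1.0] * N
C = [2.0] * N
result = {"sum": 0.0}
barrier = threading.Barrier(P)      # stands in for GLOBAL_SYNCH

def worker(mynode):
    # Each thread updates its own contiguous chunk of the shared array A.
    my_start = mynode * (N // P)
    my_end = my_start + (N // P)
    for i in range(my_start, my_end):
        A[i] = (A[i] + B[i]) * C[i]
    barrier.wait()                  # all writes to A complete before summing
    if mynode == 0:                 # node 0 alone forms the global sum
        result["sum"] = sum(A)

threads = [threading.Thread(target=worker, args=(p,)) for p in range(P)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Communication here is implicit: workers just read and write the shared arrays, and the barrier supplies the only explicit synchronization.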


Message Passing Architectures

• Cannot directly access memory on another node

• IBM SP-2, Intel Paragon

• Cluster of workstations

[Diagram: Node 0 through Node 3 on an interconnect, each with Processor (P), cache ($), Memory (Mem), and Communication Assist (CA); every node has its own private address range 0 to N-1]


Message Passing Programming Model

• User-level send/receive abstraction
– local buffer (x,y), process (P,Q), and tag (t)

– naming and synchronization

[Diagram: Process P executes “Send x, Q, t”; Process Q executes “Recv y, P, t”; the two are matched on process and tag, and the data is copied from address x in P’s local process address space to address y in Q’s]


The Simple Problem Again

int i, my_start, my_end, mynode;
float A[N/P], B[N/P], C[N/P], sum;

for i = 1 to N/P
  A[i] = (A[i] + B[i]) * C[i]
  sum = sum + A[i]

if (mynode != 0)
  send(sum, 0);

if (mynode == 0)
  for i = 1 to P-1
    recv(tmp, i)
    sum = sum + tmp

• Send/Recv communicates and synchronizes

• P processors


Separation of Architecture from Model

• At the lowest level, shared memory sends messages
– HW is specialized to expedite read/write messages

• What programming model / abstraction is supported at user level?

• Can I have shared-memory abstraction on message passing HW?

• Can I have message passing abstraction on shared memory HW?

• Some 1990s research machines integrated both
– MIT Alewife, Wisconsin Tempest/Typhoon, Stanford FLASH


Data Parallel

• Programming Model
– Operations are performed on each element of a large (regular) data structure in a single step

– Arithmetic, global data transfer

• Processor is logically associated with each data element

– SIMD architecture: Single instruction, multiple data

• Early architectures mirrored programming model
– Many bit-serial processors

• Today we have FP units and caches on microprocessors

• Can support data parallel model on shared memory or message passing architecture


The Simple Problem Strikes Back

Assuming we have N processors

A = (A + B) * C

sum = global_sum (A)

• Language supports array assignment

• Special HW support for global operations

• CM-2 bit-serial

• CM-5 32-bit SPARC processors
– Message Passing and Data Parallel models

– Special control network
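NumPy offers exactly this whole-array style on a single machine; a sketch of the slide's two lines, with made-up data values (assumes NumPy is available):

```python
import numpy as np

# One logical processor per element: whole-array operations in one "step".
A = np.arange(8, dtype=float)
B = np.ones(8)
C = np.full(8, 2.0)

A = (A + B) * C       # element-wise array assignment: A = (A + B) * C
total = A.sum()       # global_sum(A): a reduction across all elements
```

The array assignment expresses the parallelism without any explicit loop, which is what the language-level support on this slide buys you.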


Aside - Single Program, Multiple Data

• SPMD Programming Model
– Each processor executes the same program on different data
– Many Splash2 benchmarks are in SPMD model

for each molecule at this processor {
  simulate interactions with myproc+1 and myproc-1;
}

• Not connected to SIMD architectures
– Not lockstep instructions
– Could execute different instructions on different processors


Review: Separation of Model and Architecture

• Shared Memory
– Single shared address space
– Communicate, synchronize using load / store
– Can support message passing

• Message Passing
– Send / Receive
– Communication + synchronization
– Can support shared memory

• Data Parallel
– Lock-step execution on regular data structures
– Often requires global operations (sum, max, min...)
– Can support on either shared memory or message passing


Review: A Generic Parallel Machine

• Separation of programming models from architectures

• All models require communication

• Node with processor(s), memory, communication assist

[Diagram: Node 0 through Node 3, each containing Processor (P), cache ($), Memory (Mem), and Communication Assist (CA), connected by an Interconnect]


Outline

• Motivation & Applications

• Theory, History, & A Generic Parallel Machine

• Programming Models

• Issues in Programming Models
– Function: naming, operations, & ordering

– Performance: latency, bandwidth, etc.

• Examples of Commercial Machines


Programming Model Design Issues

• Naming: How is communicated data and/or partner node referenced?

• Operations: What operations are allowed on named data?

• Ordering: How can producers and consumers of data coordinate their activities?

• Performance
– Latency: How long does it take to communicate in a protected fashion?

– Bandwidth: How much data can be communicated per second? How many operations per second?


Issue: Naming

• Single Global Linear-Address-Space (shared memory)

• Multiple Local Address/Name Spaces (message passing)

• Naming strategy affects
– Programmer / Software

– Performance

– Design complexity


Issue: Operations

• Uniprocessor RISC
– Ld/St and atomic operations on memory
– Arithmetic on registers

• Shared Memory Multiprocessor
– Ld/St and atomic operations on local/global memory
– Arithmetic on registers

• Message Passing Multiprocessor
– Send/receive on local memory
– Broadcast

• Data Parallel
– Ld/St
– Global operations (add, max, etc.)


Issue: Ordering

• Uniprocessor
– Programmer sees order as program order
– Out-of-order execution
– Write buffers
– Important to maintain dependencies

• Multiprocessor
– What is order among several threads accessing shared data?
– What effect does this have on performance?
– What if implicit order is insufficient?
– Memory consistency model specifies rules for ordering


Issue: Order/Synchronization

• Coordination mainly takes three forms:
– Mutual exclusion (e.g., spin-locks)

– Event notification

» Point-to-point (e.g., producer-consumer)

» Global (e.g., end of phase indication, all or subset of processes)

– Global operations (e.g., sum)

• Issues:
– Synchronization name space (entire address space or portion)

– Granularity (per byte, per word, ... => overhead)

– Low latency, low serialization (hot spots)

– Variety of approaches

» Test&set, compare&swap, LoadLocked-StoreConditional

» Full/Empty bits and traps

» Queue-based locks, fetch&op with combining

» Scans
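As one concrete instance of the test&set family above, here is a Python spin-lock sketch; the non-blocking `Lock.acquire(False)` stands in for the hardware's atomic test&set, and the class and variable names are mine:

```python
import threading

class SpinLock:
    """Mutual exclusion built on an atomic test&set-style primitive."""
    def __init__(self):
        self._flag = threading.Lock()   # stands in for the hardware flag

    def acquire(self):
        # Spin: retry the non-blocking "test&set" until it succeeds.
        while not self._flag.acquire(False):
            pass

    def release(self):
        self._flag.release()

# Usage: protect a shared counter against concurrent increments.
lock = SpinLock()
count = 0

def bump(n):
    global count
    for _ in range(n):
        lock.acquire()
        count += 1          # critical section
        lock.release()

threads = [threading.Thread(target=bump, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Spinning wastes cycles while waiting, which is why contended locks become the "hot spots" this slide warns about; queue-based locks are one answer.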


Performance Issue: Latency

• Must deal with latency when using fast processors

• Options:
– Reduce frequency of long latency events

» Algorithmic changes, computation and data distribution

– Reduce latency

» Cache shared data, network interface design, network design

– Tolerate latency

» Message passing overlaps computation with communication (program controlled)

» Shared memory overlaps access completion and computation using consistency model and prefetching


Performance Issue: Bandwidth

• Private and global bandwidth requirements

• Private bandwidth requirements can be supported by:
– Distributing main memory among PEs

– Application changes, local caches, memory system design

• Global bandwidth requirements can be supported by:
– Scalable interconnection network technology

– Distributed main memory and caches

– Efficient network interfaces

– Avoiding contention (hot spots) through application changes


Cost of Communication

Cost = Frequency x (Overhead + Latency + Xfer size/BW - Overlap)

• Frequency = number of communications per unit work

– Algorithm, placement, replication, bulk data transfer

• Overhead = processor cycles spent initiating or handling communication
– Protection checks, status, buffer mgmt, copies, events

• Latency = time to move bits from source to dest
– Communication assist, topology, routing, congestion

• Transfer time = time through bottleneck
– Comm assist, links, congestion

• Overlap = portion overlapped with useful work
– Comm assist, comm operations, processor design
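Plugging hypothetical numbers into the cost model makes the terms concrete; every value below is invented, purely to exercise the formula:

```python
# Hypothetical inputs to: Cost = Frequency x (Overhead + Latency + Xfer/BW - Overlap)
frequency = 1000           # communications per unit of work
overhead  = 1.0e-6         # seconds of processor time per communication
latency   = 5.0e-6         # seconds from source to destination
xfer_size = 4096           # bytes per transfer
bandwidth = 100e6          # bytes/second through the bottleneck
overlap   = 2.0e-6         # seconds hidden behind useful computation

cost = frequency * (overhead + latency + xfer_size / bandwidth - overlap)
```

With these numbers the transfer term (about 41 microseconds per message) dominates, so bulk transfers or a faster bottleneck would help far more than shaving per-message overhead.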


Outline

• Motivation & Applications

• Theory, History, & A Generic Parallel Machine

• Programming Models

• Issues in Programming Models

• Examples of Commercial Machines


Example: Intel Pentium Pro Quad

– All coherence and multiprocessing glue in processor module

– Highly integrated, targeted at high volume

– Low latency and bandwidth

[Diagram: four P-Pro modules, each with CPU, interrupt controller, 256-KB L2 $, and bus interface, on the P-Pro bus (64-bit data, 36-bit address, 66 MHz); a memory controller and MIU connect to 1-, 2-, or 4-way interleaved DRAM, and two PCI bridges connect to PCI buses with I/O cards]


Example: SUN Enterprise (pre-E10K)

– 16 cards of either type: processors + memory, or I/O
– All memory accessed over bus, so symmetric
– Higher bandwidth, higher latency bus

[Diagram: CPU/mem cards (two UltraSPARC processors, each with $ and 2 $, bus interface, and a memory controller) and I/O cards (bus interface/switch with SBus slots, 2 FiberChannel, 100bT, SCSI) on the Gigaplane bus (256-bit data, 41-bit address, 83 MHz)]


Example: Cray T3E

– Scale up to 1024 processors, 480 MB/s links
– Memory controller generates comm. request for non-local references
– No hardware mechanism for coherence (SGI Origin etc. provide this)

[Diagram: T3E node with Processor (P), cache ($), Memory (Mem), and a combined memory controller / network interface, attached to a switch with X, Y, and Z links and external I/O]


Example: Intel Paragon

[Diagram: Intel Paragon node with two i860 processors (each with L1 $), a DMA engine, driver, NI, and memory controller to 4-way interleaved DRAM, on a 64-bit, 50 MHz memory bus; nodes attach to every switch of a 2D grid network with 8-bit, 175 MHz, bidirectional links. Pictured: Sandia’s Intel Paragon XP/S-based supercomputer]


Example: IBM SP-2

– Made out of essentially complete RS6000 workstations

– Network interface integrated in I/O bus (bw limited by I/O bus)

[Diagram: IBM SP-2 node with Power 2 CPU, L2 $, and memory controller to 4-way interleaved DRAM on the memory bus; a NIC (i860, NI, DMA, DRAM) sits on the MicroChannel I/O bus; nodes are connected by a general interconnection network formed from 8-port switches]


Summary

• Motivation & Applications

• Theory, History, & A Generic Parallel Machine

• Programming Models

• Issues in Programming Models

• Examples of Commercial Machines