
Page 1:

ECE5610/CSC6220

Introduction to Parallel and Distributed Computing

Instructor: Dr. Song Jiang

The ECE Department

[email protected]

http://www.ece.eng.wayne.edu/~sjiang/ECE5610-fall-14/ECE5610.htm

Lecture: Monday/Wednesday 6:00pm --- 7:20pm

0318 STAT

Office hours: Monday/Wednesday 3pm --- 4pm

Engineering Building, Room 3150

Page 2:

Outline

• Introduction
  – What is parallel computing?
  – Why should you care?
• Course administration
  – Course coverage
  – Workload and grading
• Inevitability of parallel computing
  – Application demands
  – Technology and architecture trends
  – Economics
• Convergence of parallel architecture
  – Shared address space, message passing, data parallel, data flow
• A generic parallel architecture

Page 3:

What is a Parallel Computer?

"A parallel computer is a collection of processing elements that can communicate and cooperate to solve large problems fast" – Almasi/Gottlieb

• "communicate and cooperate"
  – Node and interconnect architecture
  – Problem partitioning and orchestration
• "large problems fast"
  – Programming model
  – Match of model and architecture
• Focus of this course:
  – Parallel architecture
  – Parallel programming models
  – Interaction between models and architecture

Page 4:

What is a Parallel Computer? (cont'd)

Some broad issues:

• Resource allocation:
  – how large a collection?
  – how powerful are the elements?
• Data access, communication, and synchronization:
  – how are data transmitted between processors?
  – how do the elements cooperate and communicate?
  – what are the abstractions and primitives for cooperation?
• Performance and scalability:
  – how does it all translate into performance?
  – how does it scale?

Page 5:

Why Study Parallel Computing

• Inevitability of parallel computing
  – Fueled by application demand for performance
    • Scientific: weather forecasting, pharmaceutical design, and genomics
    • Commercial: OLTP, search engines, decision support, data mining
    • Scalable web servers
  – Enabled by technology and architecture trends
    • Limits to sequential CPU, memory, and storage performance
      – Parallelism is an effective way of utilizing the growing number of transistors
    • Low incremental cost of supporting parallelism
• Convergence of parallel computer organizations
  – Driven by technology constraints and economies of scale
    • Laptops and supercomputers share the same building blocks
  – Growing consensus on fundamental principles and design tradeoffs

Page 6:

Why Study Parallel Computing (cont'd)

Parallel computing is ubiquitous:

• Multithreading
  – Simultaneous multithreading (SMT), a.k.a. hyper-threading
    • e.g., Intel® Pentium 4 Xeon
• Chip Multiprocessor (CMP), a.k.a. multi-core processor
  – Intel® Core™ Duo, Xbox 360 (three cores, each with SMT), AMD quad-core Opteron
  – IBM Cell processor with as many as 9 cores, used in the Sony PlayStation 3, Toshiba HDTVs, and the IBM Roadrunner HPC system
• Symmetric Multiprocessor (SMP), a.k.a. shared-memory multiprocessor
  – e.g., Intel® Pentium Pro Quad; motherboards with multiple sockets
• Cluster-based supercomputers
  – IBM BlueGene/L (65,536 modified PowerPC 440 chips, each with two cores)
  – IBM Roadrunner (6,562 dual-core AMD Opteron® chips and 12,240 Cell chips)

Page 7:

Course Coverage

• Parallel architectures
  Q: Which are the dominant architectures?
  A: Small-scale shared memory (SMPs), large-scale distributed memory
• Programming models
  Q: How do we program these architectures?
  A: Message-passing and shared-memory models
• Programming for performance
  Q: How are programming models mapped to the underlying architecture, and how can this mapping be exploited for performance?

Page 8:

Course Administration

• Course prerequisites

• Course textbooks

• Class attendance

• Required work and grading policy

• Late policy

• Extra credit

• Academic honesty

(see details on the syllabus)

Page 9:

Outline

• Introduction
  – What is parallel computing?
  – Why should you care?
• Course administration
  – Course coverage
  – Workload and grading
• Inevitability of parallel computing
  – Application demands
  – Technology and architecture trends
  – Economics
• Convergence of parallel architecture
  – Shared address space, message passing, data parallel, data flow, systolic
  – A generic parallel architecture

Page 10:

Inevitability of Parallel Computing

• Application demands
  – Our insatiable need for computing cycles in challenging applications
• Technology trends
  – Number of transistors on a chip growing rapidly
  – Clock rates expected to go up only slowly
• Architecture trends
  – Instruction-level parallelism valuable but limited
  – Coarser-level parallelism, as in multiprocessors, the most viable approach
• Economics
  – Low incremental cost of supporting parallelism

Page 11:

Application Demands: Scientific Computing

Large parallel machines are a mainstay in many industries:

• Petroleum: reservoir analysis
• Automotive: crash simulation, combustion efficiency
• Aeronautics: airflow analysis, structural mechanics, electromagnetism
• Computer-aided design
• Pharmaceuticals: molecular modeling
• Visualization: entertainment, architecture
• Financial modeling: yield and derivative analysis

(Figure caption: 2,300 CPU years (2.8 GHz Intel Xeon) at a rate of approximately one hour per frame.)

Page 12:

Simulation: The Third Pillar of Science

Traditional scientific and engineering paradigm:
1) Do theory or paper design.
2) Perform experiments or build the system.

Limitations:
– Too difficult: building large wind tunnels.
– Too expensive: building a throw-away passenger jet.
– Too slow: waiting for climate or galactic evolution.
– Too dangerous: weapons, drug design, climate experimentation.

Computational science paradigm:
3) Use high-performance computer systems to simulate the phenomenon, based on known physical laws and efficient numerical methods.

Page 13:

Challenging Computation Examples

Science
• Global climate modeling
• Astrophysical modeling
• Biology: genomics, protein folding, drug design
• Computational chemistry
• Computational materials science and nanoscience

Engineering
• Crash simulation
• Semiconductor design
• Earthquake and structural modeling
• Computational fluid dynamics (airplane design)

Business
• Financial and economic modeling

Defense
• Nuclear weapons: test by simulation
• Cryptography

Page 14:

Units of Measure in HPC

High Performance Computing (HPC) units are:
• Flop/s: floating-point operations per second
• Bytes: size of data

Typical sizes are millions, billions, trillions…

Mega: Mflop/s = 10^6 flop/sec   Mbyte = 10^6 bytes (also 2^20 = 1,048,576)
Giga: Gflop/s = 10^9 flop/sec   Gbyte = 10^9 bytes (also 2^30 = 1,073,741,824)
Tera: Tflop/s = 10^12 flop/sec  Tbyte = 10^12 bytes (also 2^40 = 1,099,511,627,776)
Peta: Pflop/s = 10^15 flop/sec  Pbyte = 10^15 bytes (also 2^50 = 1,125,899,906,842,624)
Exa:  Eflop/s = 10^18 flop/sec  Ebyte = 10^18 bytes

Page 15:

Global Climate Modeling Problem

Problem is to compute:

f(latitude, longitude, elevation, time) → temperature, pressure, humidity, wind velocity

Approach:
• Discretize the domain, e.g., a measurement point every 1 km
• Devise an algorithm to predict the weather at time t+1 given time t

Source: http://www.epm.ornl.gov/chammp/chammp.html

Page 16:

Example: Numerical Climate Modeling at NASA

• Weather forecasting over the US landmass: 3000 x 3000 x 11 miles
• Assuming 0.1-mile cubic elements → 10^11 cells
• Assuming a 2-day prediction at 30-minute steps → 100 steps in the time scale
• Computation: partial differential equations and a finite-element approach
• A single element's computation takes 100 flops
• Total number of flops: 10^11 x 100 x 100 = 10^15 (i.e., one peta-flop)
• Supposed uniprocessor power: 10^9 flops/sec (giga-flops)
• It takes 10^6 seconds, or 280 hours. (Forecast nine days late!)
• 1000 processors at 10% efficiency → around 3 hours
• IBM Roadrunner → 1 second ?!
• State-of-the-art models require integration of atmosphere, ocean, sea-ice, and land models, and more; models demanding even more computational resources will be applied.

Page 17:

High Resolution Climate Modeling on NERSC-3 – P. Duffy, et al., LLNL

Page 18:

Commercial Computing

• Parallelism benefits many applications:
  – Database and web servers for online transaction processing
  – Decision support
  – Data mining and data warehousing
  – Financial modeling
• Scale not necessarily as large, but more widely used
• Computational power determines the scale of business that can be handled

Page 19:

Outline

• Introduction
  – What is parallel computing?
  – Why should you care?
• Course administration
  – Course coverage
  – Workload and grading
• Inevitability of parallel computing
  – Application demands
  – Technology and architecture trends
  – Economics
• Convergence of parallel architecture
  – Shared address space, message passing, data parallel, data flow, systolic
  – A generic parallel architecture

Page 20:

Tunnel Vision by Experts

"I think there is a world market for maybe five computers."
– Thomas Watson, chairman of IBM, 1943.

"There is no reason for any individual to have a computer in their home."
– Ken Olson, president and founder of Digital Equipment Corporation, 1977.

"640K [of memory] ought to be enough for anybody."
– Bill Gates, chairman of Microsoft, 1981.

Page 21:

Technology Trends: µ-processor Capacity

Moore's Law: Microprocessors have become smaller, denser, and more powerful.

Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months → "Moore's Law"

Slide source: Jack Dongarra

Page 22:

Technology Trends: Transistor Count

• ~40% more functions can be performed by a CPU per year

[Chart: transistors per chip, 1970–2005, on a log scale from 1,000 to 100,000,000, for the i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, and R10000]

Page 23:

Technology Trends: Clock Rate

• 30% per year → today's PC is yesterday's supercomputer

[Chart: clock rate (MHz), 1970–2005, on a log scale from 0.1 to 1,000, for the i4004, i8008, i8080, i8086, i80286, i80386, Pentium 100, and R10000]

Page 24:

Technology Trends

[Chart: relative performance of supercomputers, mainframes, minicomputers, and microprocessors, 1965–1995, on a log scale from 0.1 to 100]

• Microprocessors exhibit astonishing progress!
• The natural building blocks for parallel computers are also state-of-the-art microprocessors.

Page 25:

Architecture Trend: Role of Architecture

Clock rate increases 30% per year, while overall CPU performance increases 50% to 100% per year.

Where does the rest come from?
→ Parallelism is likely to contribute more to performance improvements.

Page 26:

Architectural Trends

The greatest trend in VLSI is an increase in the exploited parallelism:

• Up to 1985: bit-level parallelism: 4-bit → 8-bit → 16-bit
  – slows after 32 bits
  – adoption of 64-bit
• Mid 80s to mid 90s: instruction-level parallelism (ILP)
  – pipelining and simple instruction sets (RISC)
  – on-chip caches and functional units → superscalar execution
  – greater sophistication: out-of-order execution, speculation
• Nowadays:
  – hyper-threading
  – multi-core

Page 27:

Phases in VLSI Generation

[Chart: transistors per chip (1,000 to 100,000,000, log scale) vs. year (1970–2005) for the i4004 through the Pentium and R10000, annotated with three phases: bit-level parallelism, instruction-level parallelism, and thread-level parallelism]

Page 28:

ILP Ideal Potential

• Limited parallelism is inherent in one stream of instructions
  – Pentium Pro: 3 instructions per cycle
  – PowerPC 604: 4 instructions per cycle
• Need to look across threads for more parallelism

[Charts: fraction of total cycles (%) vs. number of instructions issued (0 to 6+), and speedup (0 to 3) vs. instructions issued per cycle (0 to 15)]

Page 29:

Architectural Trends: Parallel Computers

Number of processors in fully configured commercial shared-memory systems:

[Chart: number of processors (0–70) vs. year (1984–1998) for commercial shared-memory systems, including the Sequent B8000 and B2100, Symmetry 21 and 81, SS690MP 120/140, SGI PowerSeries, Challenge, and PowerChallenge/XL, CRAY CS6400, AS8400, HP K400, Sun SC2000, E6000, and E10000]

Page 30:

Technology Trend for Memory and Disk

• Divergence between memory capacity and speed is increasingly pronounced
  – Capacity increased by 1000X from 1980–95, speed only 2X
  – Larger memories are slower, while processors get faster → "memory wall"
    – Need to transfer more data in parallel
    – Need deeper cache hierarchies
    – Parallelism helps hide memory latency
• Parallelism within memory systems too
  – New designs fetch many bits within the memory chip, followed by fast pipelined transfer across a narrower interface

Page 31:

Technology Trends: Unbalanced System Improvements

[Chart: latencies of SRAM (cache), DRAM, and disk, measured in CPU cycles, 1980–2000; disk seek time grows from roughly 87,000 cycles in 1980 to roughly 5,000,000 cycles in 2000]

The disks in 2000 are more than 57 times "slower" (in CPU cycles) than their ancestors in 1980.

→ Redundant Array of Inexpensive Disks (RAID)

Page 32:

Why Parallel Computing: Economics

• Commodity means cost-effectiveness
  – Development cost ($5–100M) amortized over volumes of millions
  – Building blocks offer significant cost-performance benefits
• Multiprocessors are being pushed by software vendors (e.g., database) as well as hardware vendors
  – Standardization by Intel makes small, bus-based SMPs a commodity
  – Multiprocessing on the desktop (laptop) is a reality
• Example: How do economics affect platforms for scientific computing?
  – Large-scale cluster systems replace vector supercomputers
  – A supercomputer and a desktop share the same building blocks

Page 33:

Supercomputers

Page 34:

Supercomputers

TOP 500 Supercomputing Sites

released in June 2014

Page 35:

Supercomputers

Parallel Computing Today

IBM BlueGene/L

Japanese Earth Simulator machine

Page 36:

Supercomputers

TOP 500 Supercomputing Sites

released in June 2014

Page 37:

Evolution of Architectural Models

• Historically (1970s – early 1990s), each parallel machine was unique, along with its programming model and language

  Architecture = programming model + communication abstraction + machine organization

• Throw away software and start over with each new kind of machine
  – Dead Supercomputer Society: http://www.paralogos.com/DeadSuper/
• Nowadays we separate the programming model from the underlying parallel machine architecture
  – 3 or 4 dominant programming models
    • Dominant: shared address space, message passing, data parallel
    • Others: data flow, systolic arrays

Page 38:

Programming Models for Various Architectures

• Programming models specify communication and synchronization
  – Multiprogramming: no communication/synchronization
  – Shared address space: like a bulletin board
  – Message passing: like phone calls
  – Data parallel: more regimented, global actions on data
• Communication abstraction: primitives for implementing the model
  – Plays a role like the instruction set in a uniprocessor computer
  – Supported by hardware, by the OS, or by user-level software
• Programming models are the abstraction presented to programmers
  – Write portable, correct code that runs on many machines
  – Writing fast code requires tuning for the architecture
    – Not always worth it: sometimes programmer time is more precious

Page 39:

Aspects of a Parallel Programming Model

• Control
  – How is parallelism created?
  – In what order should operations take place?
  – How are different threads of control synchronized?
• Naming
  – What data is private vs. shared?
  – How is shared data accessed?
• Operations
  – What operations are atomic?
• Cost
  – How do we account for the cost of operations?

Page 40:

Programming Models: Shared Address Space

[Figure: virtual address spaces for a collection of processes P0…Pn communicating via shared addresses; each process has a private portion and a shared portion of its address space, with the shared portions mapped to common physical addresses, so a store by one process is visible to loads by the others]

• Programming model
  – Process: a virtual address space plus one or more threads of control
  – Portions of the address spaces of processes are shared
  – Writes to a shared address are visible to all threads (in other processes as well)
• Natural extension of the uniprocessor model:
  – conventional memory operations for communication
  – special atomic operations for synchronization

Page 41:

SAS Machine Architecture

• Motivation: programming convenience
  – Location transparency:
    • Any processor can directly reference any shared memory location
    • Communication occurs implicitly as a result of loads and stores
  – Extended from time-sharing on uniprocessors
    – Processes can run on different processors
    – Improved throughput on multiprogrammed workloads
• Communication hardware is also a natural extension of the uniprocessor
  – Adding processors is similar to adding memory modules or I/O controllers

Page 42:

SAS Machine Architecture (cont'd)

One representative architecture: the SMP.

• Used to mean Symmetric MultiProcessor: all CPUs had equal capabilities in every area, e.g., in terms of I/O as well as memory access
• Evolved to mean Shared Memory Processor: non-message-passing machines (included crossbar- as well as bus-based systems)
• Now it tends to refer to bus-based shared-memory machines
  – Small scale: < 32 processors typically

[Figure: processors P1…Pn connected through a network to a shared memory]

Page 43:

Example: Intel Pentium Pro Quad

• All coherence and multiprocessing logic glued into the processor module
• Highly integrated, targeted at high volume
• Low latency and high bandwidth

[Figure: four P-Pro modules (each with CPU, 256-KB L2 cache, interrupt controller, and bus interface) on a shared P-Pro bus (64-bit data, 36-bit address, 66 MHz); a memory controller and MIU drive 1-, 2-, or 4-way interleaved DRAM, and two PCI bridges connect PCI buses with PCI I/O cards]

Page 44:

Scaling Up: More SAS Machine Architectures

• Dance-hall:
  – Problem: interconnect cost (crossbar) or bandwidth (bus)
  – Solution: a scalable interconnection network
    • Bandwidth is scalable
    • Latencies to memory are uniform, but uniformly large (Uniform Memory Access, UMA)
  – Caching is key: coherence problem

[Figure: "dance hall" organization (processors with caches on one side of the network, memory modules on the other) vs. distributed shared memory (a memory module attached to each processor/cache node)]

Page 45:

Scaling Up: More SAS Machine Architectures (cont'd)

• Distributed shared memory (DSM), or non-uniform memory access (NUMA):
  – Non-uniform access time to data in local memory vs. remote memory
  – Caching of non-local data is key
    • Coherence cost

[Figure: the same "dance hall" vs. distributed-shared-memory organizations; in DSM each processor/cache pair has a local memory and reaches remote memories over the network]

Page 46:

Example: SUN Enterprise

• 16 cards of either type: processors + memory, or I/O
• All memory is accessed over the bus, so the machine is symmetric
• Higher-bandwidth, higher-latency bus

[Figure: Gigaplane bus (256-bit data, 41-bit address, 83 MHz) connecting CPU/memory cards (each with two CPUs, their L2 caches, and a memory controller) and I/O cards (SBUS slots, 2 FiberChannel, 100bT, SCSI)]

Page 47:

Example: Cray T3E

• Scales up to 1024 processors, 480 MB/s links
• The memory controller generates communication requests for nonlocal references
• No hardware mechanism for coherence (SGI Origin etc. provide this)

[Figure: each node has a processor with cache, a memory controller with network interface (NI), and local memory, attached to a switch in a 3D (X, Y, Z) network; external I/O attaches to the network]

Page 48:

Programming Models: Message Passing

• Programming model
  – Directly access only the private address space (local memory); communicate via explicit messages
    • Send specifies the data in a buffer to transmit to the receiving process
    • Recv specifies the sending process and a buffer to receive the data
  – In the simplest form, the send/recv match achieves pair-wise synchronization
• The model is separated from basic hardware operations
  – Library or OS support for copying, buffer management, protection
  – Potentially high overhead: large messages needed to amortize the cost

[Figure: process P issues "Send X, Q, t" naming local address X; process Q issues "Receive Y, P, t"; the matching message is copied from P's local address space into Q's local address Y]

Page 49:

Message Passing Architectures

• A complete processing node (computer) is the building block, including I/O
  – Communication via explicit I/O operations
  – Processor/memory/I/O form a processing node that cannot directly access another processor's memory
• Each node has a network interface (NI) for communication and synchronization

[Figure: nodes P1…Pn, each with local memory and an NI, connected by an interconnect]

Page 50:

DSM vs. Message Passing

• High-level block diagrams are similar
• Both are programming paradigms that, in theory, can be supported on various parallel architectures
• Implications of DSM and MP for architectures:
  – Fine-grained hardware support for DSM
  – For MP, communication is integrated at the I/O level and needn't be integrated into the memory system
  – MP can be implemented as middleware (a library)
  – MP has better scalability
• MP machines are easier to build than scalable shared address space machines

[Figure: nodes P1…Pn, each with local memory and an NI, connected by an interconnect]

Page 51:

Example: IBM SP-2

• Each node is essentially a complete RS/6000 workstation
• The network interface is integrated into the I/O bus (bandwidth limited by the I/O bus)

[Figure: IBM SP-2 node with a Power 2 CPU, L2 cache, and memory controller with 4-way interleaved DRAM on the memory bus; the NIC (with i860 processor and DMA) sits on the MicroChannel I/O bus; nodes connect to a general interconnection network formed from 8-port switches]

Page 52:

Example: Intel Paragon

[Figure: Intel Paragon node with two i860 processors (each with L1 cache), a memory controller, and 4-way interleaved DRAM on a 64-bit, 50 MHz memory bus; an NI with DMA attaches each node to a switch in a 2D grid network with 8-bit, 175 MHz, bidirectional links]

Sandia's Intel Paragon XP/S-based supercomputer

Page 53

Toward Architectural Convergence

• Convergence in hardware organization

- Tighter NI integration for MP

- Hardware shared address space (SAS) passes messages at a lower level

- Clusters of workstations/SMPs have become the most popular architecture for parallel systems

• Programming models remain distinct, but organizations are converging

- Nodes connected by a general network and communication assists

- Implementations are also converging, at least in high-end machines

Page 54

Programming Model: Data Parallel

• Operations are performed in parallel on each element of a data structure

• Logically a single thread of control (a sequential program)

• Conceptually, a processor is associated with each data element

• Coordination is implicit; statements execute synchronously

• Example:

float x[100];
for (i = 0; i < 100; i++)
    x[i] = x[i] + 1;

becomes the single data-parallel statement

x = x + 1;

Page 55

Programming Model: Data Parallel

• Architectural model:

- A control processor issues instructions

- An array of many simple, cheap processors (processing elements, PEs), each with a little memory

- An interconnection network that broadcasts data to the PEs, supports communication among them, and provides efficient synchronization

• Motivation:

- Give up flexibility (different instructions on different processors) to allow a much larger number of processors

- Target a limited scope of applications

• Applications:

- Finite differences, linear algebra

- Document searching, graphics, image processing, ...

Page 56

A Case of DP: Vector Machine

• Vector machine:

- Multiple functional units

- All performing the same operation

- Instructions may specify very wide parallelism (e.g., 64-way), but the hardware executes only a subset of the elements in parallel at a time

• Historically important, but overtaken by MPPs in the 1990s

• Re-emerging in recent years:

- At a large scale in the Earth Simulator (NEC SX-6) and Cray X1

- At a small scale in SIMD media extensions to microprocessors

  - SSE, SSE2 (Streaming SIMD Extensions; Intel: Pentium/IA-64)

  - AltiVec (IBM/Motorola/Apple: PowerPC)

  - VIS (Sun: SPARC)

• Enabling technique: the compiler does some of the difficult work of finding the parallelism

[Figure: an example vector instruction: vr3 = vr1 + vr2; logically performs as many adds in parallel as there are vector elements]

Page 57

Flynn's Taxonomy

A classification of computer architectures based on the

number of streams of instructions and data:

• Single instruction/single data stream (SISD)

- a sequential computer

• Multiple instruction/single data stream (MISD)

- unusual

• Single instruction/multiple data streams (SIMD)

- e.g., a vector processor

• Multiple instruction/multiple data streams (MIMD)

- multiple autonomous processors simultaneously executing different instructions on different data

• The programming model converges to SPMD (single program, multiple data)

Page 58

Clusters have Arrived

Page 59

What’s a Cluster?

• A collection of independent computer systems working together as if they were a single system.

• Coupled through a scalable, high-bandwidth, low-latency interconnect.

Page 60

Clusters of SMPs

SMPs are the fastest commodity machines, so use them as building blocks for a larger machine connected by a network.

Common name:

• CLUMP = CLUster of SMPs

What is the right programming model?

• Treat the machine as "flat" and always use message passing, even within an SMP (simple, but ignores an important part of the memory hierarchy).

• Use shared memory within an SMP, but message passing between SMPs.

Page 61

Convergence: Generic Parallel Architecture

• A generic modern multiprocessor

• Node: processor(s) and memory, plus a communication assist

- Network interface and communication controller

• Scalable network

• Convergence allows lots of innovation, now within the same framework

- integration of the assist within the node, what operations it supports, how efficiently, ...

[Figure: generic parallel architecture: nodes, each with processor(s) P, cache ($), memory, and a communication assist (CA), connected by a scalable network]

Page 62

Lecture Summary

• Parallel computing: a parallel computer is a collection of processing elements that communicate and cooperate to solve large problems fast

• Parallel computing has become central and mainstream

- Application demands

- Technology and architecture trends

- Economics

• Convergence in parallel architecture

- initially: close coupling of programming model and architecture

  - shared address space, message passing, data parallel

- now: separation and identification of the dominant models and architectures

  - programming models: shared address space, message passing, and data parallel

  - architectures: small-scale shared memory, large-scale distributed memory, large-scale SMP clusters