Introduction to Parallel and Distributed Computing


DESCRIPTION

Introduction Parallel Computer Memory Architectures Parallel Programming Models Design Parallel Programs Distributed Systems

TRANSCRIPT

Page 1: Introduction to Parallel and Distributed Computing

Introduction to Parallel Computing

2013.10.6

Sayed Chhattan Shah, PhD

Senior Researcher

Electronics and Telecommunications Research Institute, Korea

Page 2: Introduction to Parallel and Distributed Computing

Acknowledgements

Blaise Barney, Lawrence Livermore National Laboratory, "Introduction to Parallel Computing"

Jim Demmel, Parallel Computing Lab at UC Berkeley, “Parallel Computing”

Page 3: Introduction to Parallel and Distributed Computing


Outline

Introduction

Parallel Computer Memory Architectures

Shared Memory

Distributed Memory

Hybrid Distributed-Shared Memory

Parallel Programming Models

Shared Memory Model

Threads Model

Message Passing Model

Data Parallel Model

Design Parallel Programs

Page 4: Introduction to Parallel and Distributed Computing

Overview

Page 5: Introduction to Parallel and Distributed Computing

What is Parallel Computing?

Traditionally, software has been written for serial computation

To be run on a single computer having a single Central Processing Unit

A problem is broken into a discrete series of instructions

Instructions are executed one after another

Only one instruction may execute at any moment in time

Page 6: Introduction to Parallel and Distributed Computing


What is Parallel Computing?

Page 7: Introduction to Parallel and Distributed Computing

What is Parallel Computing?

Parallel computing is the simultaneous use of multiple compute resources to solve a computational problem

Accomplished by breaking the problem into independent parts so that each processing element can execute its part of the algorithm simultaneously with the others

Page 8: Introduction to Parallel and Distributed Computing


What is Parallel Computing?

Page 9: Introduction to Parallel and Distributed Computing

What is Parallel Computing?

The computational problem should be solved in less time with multiple compute resources than with a single compute resource

The compute resources might be:

A single computer with multiple processors

Several networked computers

A combination of both

Page 10: Introduction to Parallel and Distributed Computing

What is Parallel Computing?

Lawrence Livermore National Laboratory (LLNL) parallel computer: each compute node is a multi-processor parallel computer

Multiple compute nodes are networked together with an InfiniBand network

Page 11: Introduction to Parallel and Distributed Computing

What is Parallel Computing?

The Real World is Massively Parallel

In the natural world, many complex, interrelated events are happening at the same time, yet within a temporal sequence

Compared to serial computing, parallel computing is much better suited for modeling, simulating and understanding complex, real-world phenomena

Page 12: Introduction to Parallel and Distributed Computing


What is Parallel Computing?

Page 13: Introduction to Parallel and Distributed Computing

Uses for Parallel Computing

Science and Engineering

Historically, parallel computing has been used to model difficult problems in many areas of science and engineering

• Atmosphere, Earth, Environment, Physics, Bioscience, Chemistry, Mechanical Engineering, Electrical Engineering, Circuit Design, Microelectronics, Defense, Weapons

Page 14: Introduction to Parallel and Distributed Computing

Uses for Parallel Computing

Industrial and Commercial

Today, commercial applications provide an equal or greater driving force in the development of faster computers

• Data mining, Oil exploration, Web search engines, Medical imaging and diagnosis, Pharmaceutical design, Financial and economic modeling, Advanced graphics and virtual reality, Collaborative work environments

Page 15: Introduction to Parallel and Distributed Computing

Why Use Parallel Computing?

Save time and money

In theory, throwing more resources at a task will shorten its time to completion, with potential cost savings

Parallel computers can be built from cheap, commodity components

Page 16: Introduction to Parallel and Distributed Computing

Why Use Parallel Computing?

Solve larger problems

Many problems are so large and complex that it is impractical or impossible to solve them on a single computer, especially given limited computer memory

• en.wikipedia.org/wiki/Grand_Challenge problems requiring PetaFLOPS and PetaBytes of computing resources

Web search engines and databases processing millions of transactions per second

Page 17: Introduction to Parallel and Distributed Computing

Why Use Parallel Computing?

Provide concurrency

A single compute resource can only do one thing at a time. Multiple computing resources can be doing many things simultaneously

• For example, the Access Grid (www.accessgrid.org) provides a global collaboration network where people from around the world can meet and conduct work "virtually"

Page 18: Introduction to Parallel and Distributed Computing

Why Use Parallel Computing?

Use of non-local resources

Using compute resources on a wide area network, or even the Internet, when local compute resources are insufficient

• SETI@home (setiathome.berkeley.edu) has over 1.3 million users and 3.4 million computers in nearly every country in the world. Source: www.boincsynergy.com/stats/ (June 2013)

• Folding@home (folding.stanford.edu) uses over 320,000 computers globally (June 2013)

Page 19: Introduction to Parallel and Distributed Computing

Why Use Parallel Computing?

Limits to serial computing

Transmission speeds
• The speed of a serial computer is directly dependent upon how fast data can move through hardware
• Absolute limits are the speed of light and the transmission limit of copper wire
• Increasing speeds necessitate increasing proximity of processing elements

Limits to miniaturization
• Processor technology is allowing an increasing number of transistors to be placed on a chip
• However, even with molecular or atomic-level components, a limit will be reached on how small components can be

Page 20: Introduction to Parallel and Distributed Computing

Why Use Parallel Computing?

Limits to serial computing

Economic limitations
• It is increasingly expensive to make a single processor faster
• Using a larger number of moderately fast commodity processors to achieve the same or better performance is less expensive

Current computer architectures are increasingly relying upon hardware-level parallelism to improve performance
• Multiple execution units
• Pipelined instructions
• Multi-core

Page 21: Introduction to Parallel and Distributed Computing

The Future

Trends indicated by ever faster networks, distributed systems, and multi-processor computer architectures clearly show that parallelism is the future of computing

Page 22: Introduction to Parallel and Distributed Computing

The Future

There has been a greater than 1000x increase in supercomputer performance, with no end currently in sight

The race is already on for Exascale Computing!

1 exaFLOPS = 10^18 floating point operations per second

Page 23: Introduction to Parallel and Distributed Computing


Who is Using Parallel Computing?

Page 24: Introduction to Parallel and Distributed Computing


Who is Using Parallel Computing?

Page 25: Introduction to Parallel and Distributed Computing

Concepts and Terminology

Page 26: Introduction to Parallel and Distributed Computing

von Neumann Architecture

Named after the Hungarian mathematician John von Neumann, who first authored the general requirements for an electronic computer in his 1945 papers

Since then, virtually all computers have followed this basic design

Comprises four main components:

Memory (RAM) is used to store both program instructions and data

Control unit fetches instructions or data from memory, decodes the instructions, and then sequentially coordinates operations to accomplish the programmed task

Arithmetic unit performs basic arithmetic operations

Input/Output is the interface to the human operator

Page 27: Introduction to Parallel and Distributed Computing

Flynn's Classical Taxonomy

There are different ways to classify parallel computers

A widely used classification is called Flynn's Taxonomy

Based upon the number of concurrent instruction and data streams available in the architecture

Page 28: Introduction to Parallel and Distributed Computing

Flynn's Classical Taxonomy

Single Instruction, Single Data (SISD)

A sequential computer which exploits no parallelism in either the instruction or data streams

• Single Instruction: Only one instruction stream is being acted on by the CPU during any one clock cycle

• Single Data: Only one data stream is being used as input during any one clock cycle

• Can have concurrent processing characteristics: pipelined execution

Page 29: Introduction to Parallel and Distributed Computing

Flynn's Classical Taxonomy

Examples of SISD: older generation mainframes, minicomputers, workstations, and modern-day PCs

Page 30: Introduction to Parallel and Distributed Computing

Flynn's Classical Taxonomy

Single Instruction, Multiple Data (SIMD)

A computer which exploits multiple data streams against a single instruction stream to perform operations

• Single Instruction: All processing units execute the same instruction at any given clock cycle

• Multiple Data: Each processing unit can operate on a different data element

Page 31: Introduction to Parallel and Distributed Computing

Flynn's Classical Taxonomy

Single Instruction, Multiple Data (SIMD)

A vector processor implements an instruction set containing instructions that operate on 1D arrays of data called vectors

Page 32: Introduction to Parallel and Distributed Computing

Flynn's Classical Taxonomy

Multiple Instruction, Single Data (MISD)

Multiple Instruction: Each processing unit operates on the data independently via separate instruction streams

Single Data: A single data stream is fed into multiple processing units

Some conceivable uses might be:

• Multiple cryptography algorithms attempting to crack a single coded message

Page 33: Introduction to Parallel and Distributed Computing

Flynn's Classical Taxonomy

Multiple Instruction, Multiple Data (MIMD)

Multiple autonomous processors simultaneously executing different instructions on different data

• Multiple Instruction: Every processor may be executing a different instruction stream

• Multiple Data: Every processor may be working with a different data stream

Page 34: Introduction to Parallel and Distributed Computing

Flynn's Classical Taxonomy

Most current supercomputers, networked parallel computer clusters, grids, clouds, and multi-core PCs

Many MIMD architectures also include SIMD execution sub-components

Page 35: Introduction to Parallel and Distributed Computing

Some General Parallel Terminology

Parallel computing: Using a parallel computer to solve single problems faster

Parallel computer: A multiple-processor or multi-core system supporting parallel programming

Parallel programming: Programming in a language that supports concurrency explicitly

Supercomputing or High Performance Computing: Using the world's fastest and largest computers to solve large problems

Page 36: Introduction to Parallel and Distributed Computing

Some General Parallel Terminology

Task: A logically discrete section of computational work

A task is typically a program or program-like set of instructions that is executed by a processor

A parallel program consists of multiple tasks running on multiple processors

Shared Memory: From a hardware point of view, describes a computer architecture where all processors have direct access to common physical memory

In a programming sense, it describes a model where all parallel tasks have the same "picture" of memory and can directly address and access the same logical memory locations regardless of where the physical memory actually exists

Page 37: Introduction to Parallel and Distributed Computing

Some General Parallel Terminology

Symmetric Multi-Processor (SMP): A hardware architecture where multiple processors share a single address space and access to all resources; shared memory computing

Distributed Memory: In hardware, refers to network-based memory access for physical memory that is not common

As a programming model, tasks can only logically "see" local machine memory and must use communications to access memory on other machines where other tasks are executing

Page 38: Introduction to Parallel and Distributed Computing

Some General Parallel Terminology

Communications: Parallel tasks typically need to exchange data. There are several ways this can be accomplished, such as through a shared memory bus or over a network; however, the actual event of data exchange is commonly referred to as communications regardless of the method employed

Synchronization: Coordination of parallel tasks in real time, very often associated with communications

Often implemented by establishing a synchronization point within an application where a task may not proceed further until another task(s) reaches the same or logically equivalent point

Synchronization usually involves waiting by at least one task, and can therefore cause a parallel application's wall-clock execution time to increase

Page 39: Introduction to Parallel and Distributed Computing

Some General Parallel Terminology

Parallel Overhead: The amount of time required to coordinate parallel tasks, as opposed to doing useful work

Parallel overhead can include factors such as:
• Task start-up time
• Synchronizations
• Data communications
• Software overhead imposed by parallel languages, libraries, operating system, etc.
• Task termination time

Massively Parallel: Refers to the hardware that comprises a given parallel system - having many processors

The meaning of "many" keeps increasing, but currently the largest parallel computers are comprised of processors numbering in the hundreds of thousands

Page 40: Introduction to Parallel and Distributed Computing

Some General Parallel Terminology

Embarrassingly Parallel: Solving many similar but independent tasks simultaneously; little to no need for coordination between the tasks

Scalability: A parallel system's ability to demonstrate a proportionate increase in parallel speedup with the addition of more resources

Page 41: Introduction to Parallel and Distributed Computing

Some General Parallel Terminology

Multi-core processor: A single computing component with two or more independent central processing units ("cores")

Page 42: Introduction to Parallel and Distributed Computing

Limits and Costs of Parallel Programming

Potential program speedup is defined by the fraction of code (P) that can be parallelized:

speedup = 1 / (1 - P)

If none of the code can be parallelized, P = 0 and the speedup = 1 (no speedup)

If 50% of the code can be parallelized, the maximum speedup = 2, meaning the code will run twice as fast

Page 43: Introduction to Parallel and Distributed Computing

Limits and Costs of Parallel Programming

Introducing the number of processors N performing the parallel fraction of work P, the relationship (Amdahl's law) can be modeled by:

speedup = 1 / ((1 - P) + P / N)

It soon becomes obvious that there are limits to the scalability of parallelism: as N grows large, the speedup approaches 1 / (1 - P), no matter how many processors are added
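To make the formula concrete, here is a minimal C sketch (illustrative, not part of the original slides) that tabulates the predicted speedup for a few parallel fractions P and processor counts N; the chosen values are arbitrary:

    #include <stdio.h>

    /* Amdahl's law: predicted speedup for parallel fraction p on n processors */
    static double amdahl_speedup(double p, int n)
    {
        return 1.0 / ((1.0 - p) + p / (double)n);
    }

    int main(void)
    {
        const double fractions[] = {0.50, 0.90, 0.95, 0.99};
        const int    procs[]     = {10, 100, 1000, 10000};

        printf("%8s", "N \\ P");
        for (int j = 0; j < 4; j++)
            printf("%10.2f", fractions[j]);
        printf("\n");

        for (int i = 0; i < 4; i++) {
            printf("%8d", procs[i]);
            for (int j = 0; j < 4; j++)
                printf("%10.2f", amdahl_speedup(fractions[j], procs[i]));
            printf("\n");
        }
        return 0;
    }

Running it shows the plateau: even with 10,000 processors, a 95% parallel code speeds up by less than a factor of 20.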

Page 44: Introduction to Parallel and Distributed Computing


Limits and Costs of Parallel Programming

Page 45: Introduction to Parallel and Distributed Computing

Some General Parallel Terminology

Complexity: Parallel applications are much more complex than corresponding serial applications

• Not only do you have multiple instruction streams executing at the same time, but you also have data flowing between them

The costs of complexity are measured in programmer time in virtually every aspect of the software development cycle:
• Design
• Coding
• Debugging
• Tuning
• Maintenance

Page 46: Introduction to Parallel and Distributed Computing

Some General Parallel Terminology

Portability: Portability issues with parallel programs are not as serious as in years past, due to standardization in several APIs such as MPI, POSIX threads, and OpenMP

All of the usual portability issues associated with serial programs apply to parallel programs

• Hardware architectures are characteristically highly variable and can affect portability

Page 47: Introduction to Parallel and Distributed Computing

Some General Parallel Terminology

Resource Requirements: The amount of memory required can be greater for parallel codes than serial codes, due to the need to replicate data and for overheads associated with parallel support libraries and subsystems

For short-running parallel programs, there can actually be a decrease in performance compared to a similar serial implementation

• The overhead costs associated with setting up the parallel environment, task creation, communications and task termination can comprise a significant portion of the total execution time for short runs

Page 48: Introduction to Parallel and Distributed Computing

Some General Parallel Terminology

Scalability: The ability of a parallel program's performance to scale is a result of a number of interrelated factors. Simply adding more processors is rarely the answer

The algorithm may have inherent limits to scalability

Hardware factors play a significant role in scalability. Examples:
• Memory-CPU bus bandwidth on an SMP machine
• Communications network bandwidth
• Amount of memory available on any given machine or set of machines
• Processor clock speed

Page 49: Introduction to Parallel and Distributed Computing

Parallel Computer Memory Architectures

Page 50: Introduction to Parallel and Distributed Computing

Shared Memory

All processors access all memory as a global address space

Multiple processors can operate independently but share the same memory resources

Changes in a memory location effected by one processor are visible to all other processors

Page 51: Introduction to Parallel and Distributed Computing

Shared Memory

Shared memory machines are classified as UMA and NUMA, based upon memory access times

Uniform Memory Access (UMA)

• Most commonly represented today by Symmetric Multiprocessor (SMP) machines

• Identical processors

• Equal access and access times to memory

• Sometimes called CC-UMA - Cache Coherent UMA

Cache coherent means that if one processor updates a location in shared memory, all the other processors know about the update. Cache coherency is accomplished at the hardware level

Page 52: Introduction to Parallel and Distributed Computing

Shared Memory

Non-Uniform Memory Access (NUMA)

• Often made by physically linking two or more SMPs

• One SMP can directly access memory of another SMP

• Not all processors have equal access time to all memories

• Memory access across the link is slower

• If cache coherency is maintained, then it may also be called CC-NUMA - Cache Coherent NUMA

Page 53: Introduction to Parallel and Distributed Computing

Shared Memory

Advantages

Global address space provides a user-friendly programming perspective to memory

Data sharing between tasks is both fast and uniform due to the proximity of memory to CPUs

Disadvantages

The primary disadvantage is the lack of scalability between memory and CPUs

• Adding more CPUs can geometrically increase traffic on the shared memory-CPU path, and for cache coherent systems, geometrically increase traffic associated with cache/memory management

Programmer responsibility for synchronization constructs that ensure "correct" access of global memory

Page 54: Introduction to Parallel and Distributed Computing

Distributed Memory

Processors have their own local memory

Changes to a processor's local memory have no effect on the memory of other processors

When a processor needs access to data in another processor, it is usually the task of the programmer to explicitly define how and when data is communicated

Synchronization between tasks is likewise the programmer's responsibility

The network "fabric" used for data transfer varies widely, though it can be as simple as Ethernet

Page 55: Introduction to Parallel and Distributed Computing

Distributed Memory

Advantages

Memory is scalable with the number of processors
• Increase the number of processors and the size of memory increases proportionately

Each processor can rapidly access its own memory without interference and without the overhead incurred with trying to maintain global cache coherency

Cost effectiveness: can use commodity, off-the-shelf processors and networking

Disadvantages

The programmer is responsible for many of the details associated with data communication between processors

Non-uniform memory access times - data residing on a remote node takes longer to access than node-local data

Page 56: Introduction to Parallel and Distributed Computing

Hybrid Distributed-Shared Memory

The largest and fastest computers in the world today employ both shared and distributed memory architectures

The shared memory component can be a shared memory machine or graphics processing units

The distributed memory component is the networking of multiple shared memory or GPU machines

Page 57: Introduction to Parallel and Distributed Computing

Hybrid Distributed-Shared Memory

Advantages and Disadvantages

Whatever is common to both shared and distributed memory architectures

Increased scalability is an important advantage

Increased programmer complexity is an important disadvantage

Page 58: Introduction to Parallel and Distributed Computing

Parallel Programming Models

Page 59: Introduction to Parallel and Distributed Computing

Parallel Programming Model

A programming model provides an abstract view of the computing system

Abstraction above hardware and memory architectures

The value of a programming model is usually judged on its generality:

• how well a range of different problems can be expressed and

• how well they execute on a range of different architectures

The implementation of a programming model can take several forms, such as libraries invoked from traditional sequential languages, language extensions, or complete new execution models

Page 60: Introduction to Parallel and Distributed Computing

Parallel Programming Model

Parallel programming models in common use:
Shared Memory (without threads)
Threads
Distributed Memory / Message Passing
Data Parallel
Hybrid
Single Program Multiple Data (SPMD)
Multiple Program Multiple Data (MPMD)

These models are NOT specific to a particular type of machine or memory architecture

Any of these models can be implemented on any underlying hardware

Page 61: Introduction to Parallel and Distributed Computing

Parallel Programming Model

SHARED memory model on a DISTRIBUTED memory machine

Machine memory was physically distributed across networked machines, but appeared to the user as a single shared memory (global address space)

This approach is referred to as virtual shared memory

DISTRIBUTED memory model on a SHARED memory machine

The SGI Origin 2000 employed the CC-NUMA type of shared memory architecture, where every task has direct access to a global address space spread across all machines

However, the ability to send and receive messages using MPI, as is commonly done over a network of distributed memory machines, was implemented and commonly used

Page 62: Introduction to Parallel and Distributed Computing

Shared Memory Model - Without Threads

Tasks share a common address space

Efficient means of passing data between programs

Various mechanisms such as locks / semaphores may be used to control access to the shared memory

Programmer's point of view
• The notion of data "ownership" is lacking, so there is no need to specify explicitly the communication of data between tasks
• Program development can often be simplified

Disadvantage in terms of performance

It becomes more difficult to understand and manage data locality:
• Keeping data local to the processor that works on it conserves memory accesses, cache refreshes and bus traffic that occurs when multiple processors use the same data

Page 63: Introduction to Parallel and Distributed Computing

Shared Memory Model - Without Threads

Implementations

Native compilers or hardware translate user program variables into actual memory addresses, which are global

• On stand-alone shared memory machines, this is straightforward

• On distributed shared memory machines, memory is physically distributed across a network of machines, but made global through specialized hardware and software

Page 64: Introduction to Parallel and Distributed Computing

Threads Model

A type of shared memory programming model

A single "heavy weight" process can have multiple "light weight", concurrent execution paths

The main program a.out is scheduled by the native OS

a.out loads and acquires all of the necessary system and user resources to run. This is the "heavy weight" process

a.out performs some serial work, and then creates a number of tasks (threads) that can be scheduled and run by the operating system concurrently

Page 65: Introduction to Parallel and Distributed Computing

Threads Model

Each thread has local data, but also shares the entire resources of a.out

This saves the overhead associated with replicating a program's resources for each thread ("light weight"). Each thread also benefits from a global memory view because it shares the memory space of a.out

Threads communicate with each other through global memory (updating address locations). This requires synchronization constructs to ensure that more than one thread is not updating the same global address at any time

Threads can come and go, but a.out remains present to provide the necessary shared resources until the application has completed

Page 66: Introduction to Parallel and Distributed Computing

Threads Model

Implementations

• POSIX Threads
Library based; requires parallel coding
C Language only
Commonly referred to as Pthreads
Most hardware vendors now offer Pthreads in addition to their proprietary threads implementations
Very explicit parallelism; requires significant programmer attention to detail

• OpenMP
Compiler directive based; can use serial code
Portable / multi-platform, including Unix and Windows platforms
Available in C/C++ and Fortran implementations
Can be very easy and simple to use
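As a concrete illustration of the directive-based approach, the following minimal OpenMP sketch in C (not from the original slides; the array size and loop body are arbitrary) parallelizes a serial loop with a single pragma; it assumes an OpenMP-enabled compiler, e.g. gcc -fopenmp:

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    int main(void)
    {
        static double a[N];

        /* The directive asks the compiler to split the loop iterations
           among a team of threads; each thread handles its own share. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * i;

        printf("a[N-1] = %.1f, computed with up to %d threads\n",
               a[N - 1], omp_get_max_threads());
        return 0;
    }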

Page 67: Introduction to Parallel and Distributed Computing

Distributed Memory / Message Passing Model

A set of tasks that use their own local memory during computation

Multiple tasks can reside on the same physical machine and/or across an arbitrary number of machines

Tasks exchange data through communications by sending and receiving messages

Data transfer usually requires cooperative operations to be performed by each process. For example, a send operation must have a matching receive operation

Page 68: Introduction to Parallel and Distributed Computing

Distributed Memory / Message Passing Model

Implementations

From a programming perspective, message passing implementations usually comprise a library of subroutines

Calls to these subroutines are embedded in source code

MPI specifications are available on the web at http://www.mpi-forum.org/docs/

MPI implementations exist for virtually all popular parallel computing platforms
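A minimal MPI sketch in C (illustrative, not the slides' code) showing the cooperative send/receive pairing described above; it assumes an installed MPI implementation and two processes, e.g. compiled with mpicc and run with mpirun -np 2:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            /* a send must be matched by a receive on the other task */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("task 1 received %d from task 0\n", value);
        }

        MPI_Finalize();
        return 0;
    }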

Page 69: Introduction to Parallel and Distributed Computing

Data Parallel Model

Address space is treated globally

A set of tasks works collectively on the same data structure; however, each task works on a different partition of the same data structure

On shared memory architectures, all tasks may have access to the data structure through global memory

On distributed memory architectures, the data structure is split up and resides as "chunks" in the local memory of each task

Page 70: Introduction to Parallel and Distributed Computing

Data Parallel Model

Implementations

Unified Parallel C (UPC): an extension to the C programming language for SPMD parallel programming. Compiler dependent. More information: http://upc.lbl.gov/

Global Arrays: provides a shared memory style programming environment in the context of distributed array data structures. Public domain library with C and Fortran77 bindings. More information: http://www.emsl.pnl.gov/docs/global/

X10: a PGAS-based parallel programming language being developed by IBM at the Thomas J. Watson Research Center. More information: http://x10-lang.org/

Page 71: Introduction to Parallel and Distributed Computing


Hybrid Model

A hybrid model combines more than one of the previously described programming models
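A common instance of the hybrid model combines MPI between nodes with OpenMP threads inside each node. The sketch below is only illustrative (not from the slides) and assumes both an MPI library and OpenMP support are available:

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, nthreads = 1;
        MPI_Init(&argc, &argv);                 /* message passing between nodes   */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        #pragma omp parallel                    /* shared memory threads in a node */
        {
            #pragma omp single
            nthreads = omp_get_num_threads();
        }

        printf("MPI task %d runs %d OpenMP threads\n", rank, nthreads);
        MPI_Finalize();
        return 0;
    }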

Page 72: Introduction to Parallel and Distributed Computing

Single Program Multiple Data

A high-level programming model that can be built upon any combination of the previously mentioned parallel programming models

SINGLE PROGRAM: All tasks execute their copy of the same program simultaneously. This program can be threads, message passing, data parallel or hybrid

MULTIPLE DATA: All tasks may use different data

Page 73: Introduction to Parallel and Distributed Computing

Multiple Program Multiple Data

A high-level programming model that can be built upon any combination of the previously mentioned parallel programming models

MULTIPLE PROGRAM: Tasks may execute different programs simultaneously. The programs can be threads, message passing, data parallel or hybrid

MULTIPLE DATA: All tasks may use different data

Page 74: Introduction to Parallel and Distributed Computing

Designing Parallel Programs

Page 75: Introduction to Parallel and Distributed Computing

Understand the Problem and the Program

Determine whether the problem can be parallelized

Calculate the potential energy for each of several thousand independent conformations of a molecule. When done, find the minimum energy conformation

This problem can be solved in parallel

• Each of the molecular conformations is independently determinable

• The calculation of the minimum energy conformation is also a parallelizable problem

Page 76: Introduction to Parallel and Distributed Computing

Understand the Problem and the Program

Calculation of the Fibonacci series (0,1,1,2,3,5,8,13,21,...) by use of the formula F(n) = F(n-1) + F(n-2)

This problem cannot be solved in parallel

• Calculation of the Fibonacci sequence as shown would entail dependent calculations rather than independent ones

• The calculation of the F(n) value uses those of both F(n-1) and F(n-2). These three terms cannot be calculated independently and therefore not in parallel

Page 77: Introduction to Parallel and Distributed Computing

Understand the Problem and the Program

Identify the program's hotspots

Know where most of the real work is being done

• The majority of scientific and technical programs usually accomplish most of their work in a few places

• Profilers and performance analysis tools can help here

Focus on parallelizing the hotspots and ignore those sections of the program that account for little CPU usage

Page 78: Introduction to Parallel and Distributed Computing

Understand the Problem and the Program

Identify bottlenecks in the program

Are there areas that are disproportionately slow, or cause parallelizable work to halt or be deferred?

• For example, I/O is usually something that slows a program down

It may be possible to restructure the program or use a different algorithm to reduce or eliminate unnecessarily slow areas

Identify inhibitors to parallelism

One common class of inhibitor is data dependence

Page 79: Introduction to Parallel and Distributed Computing

Partitioning

Break the problem into discrete "chunks" of work that can be distributed to multiple tasks

Two basic ways to partition computational work among parallel tasks:

Domain decomposition
• Data associated with the problem is decomposed
• Each parallel task then works on a portion of the data

Page 80: Introduction to Parallel and Distributed Computing

Partitioning

Functional decomposition

• The problem is decomposed according to the work that must be done

• Each task then performs a portion of the overall work

• Functional decomposition lends itself well to problems that can be split into different tasks

Page 81: Introduction to Parallel and Distributed Computing

Partitioning

Ecosystem Modeling

Each program calculates the population of a given group

• Each group's growth depends on that of its neighbors

• As time progresses, each process calculates its current state, then exchanges information with the neighbor populations

• All tasks then progress to calculate the state at the next time step

Page 82: Introduction to Parallel and Distributed Computing

Partitioning

Climate Modeling

Each model component can be thought of as a separate task

Arrows represent exchanges of data between components during computation:

• the atmosphere model generates wind velocity data that are used by the ocean model, the ocean model generates sea surface temperature data that are used by the atmosphere model, and so on

Page 83: Introduction to Parallel and Distributed Computing

Communications

Some types of problems can be decomposed and executed in parallel with virtually no need for tasks to share data

For example, imagine an image processing operation where every pixel in a black and white image needs to have its color reversed

The image data can easily be distributed to multiple tasks that then act independently of each other to do their portion of the work

Most parallel applications require tasks to share data with each other

For example, a 3-D heat diffusion problem requires a task to know the temperatures calculated by the tasks that have neighboring data

Changes to neighboring data have a direct effect on that task's data

Page 84: Introduction to Parallel and Distributed Computing

Communications

Factors to consider when designing your program's inter-task communications

Cost of communications
• Inter-task communication virtually always implies overhead
• Communications frequently require some type of synchronization between tasks
• Competing communication traffic can saturate the available network bandwidth, further aggravating performance problems

Latency vs. Bandwidth
• Latency is the time it takes to send a message from point A to point B
• Bandwidth is the amount of data that can be communicated per unit of time
• Sending many small messages can cause latency to dominate communication overheads
• Often it is more efficient to package small messages into a larger message, thus increasing the effective communications bandwidth

Page 85: Introduction to Parallel and Distributed Computing

Communications

Visibility of communications
• With the Message Passing Model, communications are explicit and generally quite visible and under the control of the programmer
• With the Data Parallel Model, communications often occur transparently to the programmer, particularly on distributed memory architectures
• The programmer may not even be able to know exactly how inter-task communications are being accomplished

Synchronous vs. asynchronous communications
• Synchronous communications are often referred to as blocking communications, since other work must wait until the communications have completed
• Asynchronous communications allow tasks to transfer data independently from one another
• Interleaving computation with communication is the single greatest benefit of using asynchronous communications

Page 86: Introduction to Parallel and Distributed Computing

Communications

Scope of communications
• Knowing which tasks must communicate with each other is critical during the design stage of a parallel code
• Point-to-point - involves two tasks, with one task acting as the sender/producer of data and the other acting as the receiver/consumer
• Collective - involves data sharing between more than two tasks, which are often specified as being members of a common group
• Some common variations of collective communications are broadcast, scatter, gather and reduction

Page 87: Introduction to Parallel and Distributed Computing

Communications

Efficiency of communications

Which implementation for a given model should be used?
• Using the Message Passing Model as an example, one MPI implementation may be faster on a given hardware platform than another

What type of communication operations should be used?
• Asynchronous communication operations can improve overall program performance

Page 88: Introduction to Parallel and Distributed Computing

Communications

Overhead and Complexity

Page 89: Introduction to Parallel and Distributed Computing

Synchronization

Types of Synchronization

Barrier
• Each task performs its work until it reaches the barrier
• When the last task reaches the barrier, all tasks are synchronized

Lock
• Typically used to protect access to global data or a code section
• Only one task at a time may use the lock

Synchronous communication operations
• When a task performs a communication operation, some form of coordination is required with the other task participating in the communication
• For example, before a task can perform a send operation, it must first receive an acknowledgment from the receiving task that it is OK to send
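For illustration, a minimal Pthreads sketch (not from the original slides; the thread count and iteration count are arbitrary) uses a lock to protect a shared counter and a barrier so that no thread proceeds until all have finished; it assumes a POSIX system and compilation with -pthread:

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    static long counter = 0;                        /* shared global data */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_barrier_t barrier;

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);              /* only one task may hold the lock */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        pthread_barrier_wait(&barrier);             /* wait until all tasks arrive here */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        pthread_barrier_init(&barrier, NULL, NTHREADS);
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(&t[i], NULL);
        printf("counter = %ld\n", counter);         /* expect NTHREADS * 100000 */
        pthread_barrier_destroy(&barrier);
        return 0;
    }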

Page 90: Introduction to Parallel and Distributed Computing

Data Dependencies

A dependence exists between program statements when the order of statement execution affects the results of the program

Dependencies are important to parallel programming because they are one of the primary inhibitors to parallelism

Loop-carried data dependence: the value of A(J-1) must be computed before the value of A(J); therefore A(J) exhibits a data dependency on A(J-1)

How to Handle Data Dependencies

Distributed memory architectures - communicate required data at synchronization points

Shared memory architectures - synchronize read/write operations between tasks
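For illustration (the slide's original loop listing is not reproduced here), a C loop of the following shape carries the dependence described above, so its iterations cannot simply be divided among tasks:

    #include <stdio.h>

    #define N 8

    int main(void)
    {
        double a[N] = {1.0};                 /* a[0] seeds the chain */

        /* Loop-carried dependence: iteration j uses the value produced by
           iteration j-1, so these iterations cannot run in parallel as written. */
        for (int j = 1; j < N; j++)
            a[j] = a[j - 1] * 2.0;

        for (int j = 0; j < N; j++)
            printf("a[%d] = %.1f\n", j, a[j]);
        return 0;
    }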

Page 91: Introduction to Parallel and Distributed Computing

Load Balancing

The practice of distributing approximately equal amounts of work among tasks

Important to parallel programs for performance reasons

For example, if all tasks are subject to a barrier synchronization point, the slowest task will determine the overall performance

Page 92: Introduction to Parallel and Distributed Computing

Load Balancing

How to Achieve Load Balance

Equally partition the work each task receives
• For array operations where each task performs similar work, evenly distribute the data set among the tasks
• For loop iterations where the work done in each iteration is similar, evenly distribute the iterations across the tasks
• If a heterogeneous mix of machines is used, use a performance analysis tool to detect any load imbalances and adjust the work accordingly

Use dynamic work assignment
• When the amount of work each task will perform is intentionally variable, or is unable to be predicted, it may be helpful to use a scheduler / task pool approach: as each task finishes its work, it queues to get a new piece of work
• It may become necessary to design an algorithm which detects and handles load imbalances as they occur dynamically within the code
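One concrete form of dynamic work assignment is OpenMP's dynamic schedule, sketched below (illustrative, not from the slides): iterations are handed out in chunks at run time, so faster threads automatically pick up more work. The work() function and the chunk size are arbitrary assumptions:

    #include <stdio.h>
    #include <unistd.h>
    #include <omp.h>

    #define N 64

    /* hypothetical work unit whose cost varies from iteration to iteration */
    static double work(int i)
    {
        usleep((i % 8) * 1000);              /* simulate uneven execution time */
        return (double)i;
    }

    int main(void)
    {
        double total = 0.0;

        /* iterations are dealt out in chunks of 4 at run time:
           whichever thread finishes first requests the next chunk */
        #pragma omp parallel for schedule(dynamic, 4) reduction(+:total)
        for (int i = 0; i < N; i++)
            total += work(i);

        printf("total = %.1f\n", total);
        return 0;
    }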

Page 93: Introduction to Parallel and Distributed Computing

Granularity

The number of tasks into which a problem is decomposed determines its granularity

Fine-grain Parallelism - decomposition into a large number of tasks
• Individual tasks are relatively small in terms of code size and execution time
• Data is transferred among processors frequently
• Facilitates load balancing
• Implies high communication overhead

Coarse-grain Parallelism - decomposition into a small number of tasks
• Individual tasks are relatively large in terms of code size and execution time
• Implies more opportunity for performance increase
• Harder to load balance efficiently

Page 94: Introduction to Parallel and Distributed Computing

Parallel Examples: Array Processing

This example demonstrates calculations on 2-dimensional array elements, with the computation on each array element being independent of the other array elements

The serial program calculates one element at a time in sequential order

Page 95: Introduction to Parallel and Distributed Computing

Parallel Examples: Array Processing

Array Processing - Parallel Solution 1

Independent calculation of array elements ensures there is no need for communication between tasks

Each task executes the portion of the loop corresponding to the data it owns, as sketched below
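A sketch of this owner-computes pattern in C (illustrative; rank, ntasks and the per-element function are assumptions, not slide code): each SPMD task derives the block of rows it owns and loops only over that block, with a serial driver standing in for the concurrent task copies:

    #include <stdio.h>

    #define N 16                              /* illustrative array dimension */

    /* hypothetical per-element computation, independent of other elements */
    static double element(int i, int j) { return (double)i * j; }

    /* each of ntasks SPMD tasks calls this with its own rank */
    static void compute_my_block(int rank, int ntasks, double a[N][N])
    {
        int chunk = N / ntasks;
        int first = rank * chunk;
        int last  = (rank == ntasks - 1) ? N : first + chunk;

        for (int i = first; i < last; i++)    /* only the rows this task owns */
            for (int j = 0; j < N; j++)
                a[i][j] = element(i, j);
    }

    int main(void)
    {
        static double a[N][N];
        int ntasks = 4;

        /* serial driver standing in for ntasks concurrent SPMD copies */
        for (int rank = 0; rank < ntasks; rank++)
            compute_my_block(rank, ntasks, a);

        printf("a[N-1][N-1] = %.1f\n", a[N - 1][N - 1]);
        return 0;
    }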

Page 96: Introduction to Parallel and Distributed Computing

Parallel Examples: Array Processing

One Possible Solution

Implement as a Single Program Multiple Data model

• Master process initializes the array, sends info to worker processes and receives results

• Worker process receives info, performs its share of the computation and sends results to the master

• Static load balancing
• Each task has a fixed amount of work to do
• There may be significant idle time for faster or more lightly loaded processors - the slowest task determines the overall performance

Page 97: Introduction to Parallel and Distributed Computing

Parallel Examples: Array Processing

Pool of Tasks Scheme

Two types of processes are employed

• Master Process
• Holds the pool of tasks for worker processes to do
• Sends a worker a task when requested
• Collects results from workers

• Worker Process
• Gets a task from the master process
• Performs the computation
• Sends results to the master

• Worker processes do not know before runtime which portion of the array they will handle or how many tasks they will perform

• Dynamic load balancing occurs at run time: the faster tasks will get more work to do
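The pool-of-tasks scheme maps naturally onto MPI. The following C sketch is illustrative only (not the slides' code); the tag values, NTASKS and do_work() are arbitrary, and it assumes at least two MPI processes and at least as many tasks as workers:

    #include <mpi.h>
    #include <stdio.h>

    #define NTASKS   100      /* illustrative number of work units */
    #define TAG_WORK 1
    #define TAG_STOP 2

    /* hypothetical work function: "compute" one unit of work */
    static int do_work(int task) { return task * task; }

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {              /* master: holds the pool of tasks */
            int next = 0, received = 0, result;
            MPI_Status st;
            /* prime every worker with one task (assumes NTASKS >= size - 1) */
            for (int w = 1; w < size; w++) {
                MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
                next++;
            }
            /* hand out the rest as results come back: faster workers get more */
            while (received < NTASKS) {
                MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD, &st);
                received++;
                if (next < NTASKS) {
                    MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                             MPI_COMM_WORLD);
                    next++;
                } else {
                    MPI_Send(&next, 0, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                             MPI_COMM_WORLD);
                }
            }
            printf("master collected %d results\n", received);
        } else {                      /* worker: receive, compute, return */
            int task, result;
            MPI_Status st;
            for (;;) {
                MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
                if (st.MPI_TAG == TAG_STOP)
                    break;
                result = do_work(task);
                MPI_Send(&result, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
            }
        }

        MPI_Finalize();
        return 0;
    }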

Page 98: Introduction to Parallel and Distributed Computing

Classes of Parallel Computers

Multicore computing
A multicore processor is a processor that includes multiple execution units ("cores") on the same chip

Symmetric multiprocessing
A computer system with multiple identical processors that share memory and connect via a bus

Distributed system
A software system in which components located on networked computers communicate and coordinate their actions by passing messages

An important goal and challenge of distributed systems is location transparency

Page 99: Introduction to Parallel and Distributed Computing

Classes of Parallel Computers

Cluster computing
A cluster is a group of loosely coupled computers that work together closely, so that in some respects they can be regarded as a single computer

Massively parallel processing
A massively parallel processor (MPP) is a single computer with many networked processors
MPPs have many of the same characteristics as clusters, but MPPs have specialized interconnect networks

Grid computing
Compared to clusters, grids tend to be more loosely coupled, heterogeneous, and geographically dispersed

Page 100: Introduction to Parallel and Distributed Computing

Classes of Parallel Computers

Exascale computing
A computer system capable of at least one exaFLOPS
• 10^18 floating point operations per second

Volunteer computing
Computer owners donate their computing resources to one or more projects

Peer-to-peer network
A type of decentralized and distributed network architecture in which individual nodes in the network act as both suppliers and consumers of resources

Page 101: Introduction to Parallel and Distributed Computing

Distributed Systems

Page 102: Introduction to Parallel and Distributed Computing

Distributed Systems

A collection of independent computers that appear to the users of the system as a single computer

A collection of autonomous computers, connected through a network and distribution middleware, which enables the computers to coordinate their activities and to share the resources of the system, so that users perceive the system as a single, integrated computing facility

Page 103: Introduction to Parallel and Distributed Computing


Example - Internet

Page 104: Introduction to Parallel and Distributed Computing


Example - ATM

Page 105: Introduction to Parallel and Distributed Computing


Example - Mobile devices in a Distributed System

Page 106: Introduction to Parallel and Distributed Computing


Why Distributed Systems?

Page 107: Introduction to Parallel and Distributed Computing

Basic Problems and Challenges

Transparency

Scalability

Fault tolerance

Concurrency

Openness

These challenges can also be seen as the goals or desired properties of a distributed system

Page 108: Introduction to Parallel and Distributed Computing

Transparency

Concealment from the user and the application programmer of the separation of the components of a distributed system

Access transparency - local and remote resources are accessed in the same way

Location transparency - users are unaware of the location of resources

Migration transparency - resources can migrate without a name change

Replication transparency - users are unaware of the existence of multiple copies of resources

Failure transparency - users are unaware of the failure of individual components

Concurrency transparency - users are unaware of sharing resources with others

Page 109: Introduction to Parallel and Distributed Computing

Scalability

Addition of users and resources without suffering a noticeable loss of performance or increase in administrative complexity

Adding users and resources causes a system to grow:

Size - growth with regard to the number of users or resources
• The system may become overloaded
• May increase administration cost

Geography - growth with regard to geography or the distance between nodes
• Greater communication delays

Administration - increase in administrative cost

Page 110: Introduction to Parallel and Distributed Computing

Openness

Whether the system can be extended in various ways without disrupting existing systems and services

Hardware extensions
• adding peripherals, memory, communication interfaces

Software extensions
• operating system features
• communication protocols

Openness is supported by:
Public interfaces
Standardized communication protocols

Page 111: Introduction to Parallel and Distributed Computing

Concurrency

In a single system, several processes are interleaved

In distributed systems, there are many systems with one or more processors

Many users simultaneously invoke commands or applications, and access and update shared data

Mutual exclusion
Synchronization
No global clock

Page 112: Introduction to Parallel and Distributed Computing

Fault Tolerance

Hardware, software and networks fail

Distributed systems must maintain availability even at low levels of hardware, software and network reliability

Fault tolerance is achieved by
Recovery
Redundancy

Issues
Detecting failures
Masking failures
Recovery from failures
Redundancy

Page 113: Introduction to Parallel and Distributed Computing


Fault tolerance

Omission and Arbitrary Failures

Page 114: Introduction to Parallel and Distributed Computing

Design Requirements

Performance issues
Responsiveness
Throughput
Load sharing, load balancing

Quality of service
Correctness
Reliability, availability, fault tolerance
Security
Performance
Adaptability