Parallel and Distributed Computing

References:
• Introduction to Parallel Computing, Second Edition. Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar. Addison Wesley, ISBN 0-201-64865-2, January 16, 2003.
• Tutorials about cluster, grid, and cloud computing (internet sites)


Page 2: Parallel and Distributed Computing

The main objective of this chapter is to survey innovations in parallel computing with respect to architectures.

Chapter 1: Introduction to Parallel Computing

Page 3: Parallel and Distributed Computing

von Neumann Architecture
• Common machine model for over 40 years
• Stored-program concept
• CPU executes a stored program: a sequence of read and write operations on the memory (RAM)
• Order of operations is sequential

Page 4: Parallel and Distributed Computing

A More Detailed Architecture Based on the von Neumann Model

Page 5: Parallel and Distributed Computing

Old von Neumann Computer

Page 6: Parallel and Distributed Computing

CISC von Neumann Computer
• CISC (Complex Instruction Set Computer) with a single bus system
• The Harvard (RISC) architecture utilizes two buses: a separate data bus and address bus
• RISC (Reduced Instruction Set Computer)
• Both are SISD machines: Single Instruction stream, Single Data stream

Page 7: Parallel and Distributed Computing

John von Neumann
• December 28, 1903 – February 8, 1957
• Hungarian mathematician
• Received his Ph.D. at 23

Page 8: Parallel and Distributed Computing

Motivations for Parallel Computing

• Fundamental limits on single processor speed

• Disparity between CPU & memory speed – the performance mismatch problem

• Distributed data communications

• Need for very large scale computing platforms

Page 9: Parallel and Distributed Computing

Fundamental Limits – Cycle Speed
• Cray 1: 12 ns (1975)
• Cray 2: 6 ns (1986)
• Cray T-90: 2 ns (1997)
• Intel PC: 1 ns (2000)
• Today's PC: 0.3 ns, 2006 (P4)
• Speed of light: 30 cm in 1 ns (worked out below)
• Electrical signals travel about 10 times slower
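A rough check using only the figures on this slide: in one 1 ns clock period, light covers

  d = c × t = (3 × 10^10 cm/s) × (10^-9 s) = 30 cm,

and an electrical signal moving about 10 times slower covers only about 3 cm. At 0.3 ns per cycle that shrinks to roughly 1 cm, so a signal cannot even cross a large processor or board within one cycle. This is one of the fundamental limits on single-processor speed.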

Page 10: Parallel and Distributed Computing

Moore's Law
• Moore's observation in 1965: the number of transistors per square inch on integrated circuits had doubled every year.
• Moore's revised observation in 1975: the pace slowed down a bit, but density had doubled approximately every 18 months.
• How about the future? (Will the price of computing power keep falling by half every 18 months?)

Page 11: Parallel and Distributed Computing

Moore’s Law – Held for Now

Page 12: Parallel and Distributed Computing

CPU and Memory Speeds

• In 20 years, CPU speed (clock rate) has increased by a factor of 1000
• DRAM speed has increased only by a factor smaller than 4
• How can data be fed fast enough to keep the CPU busy? (see the estimate below)
• CPU cycle time: 1–2 ns
• DRAM access time: 50–60 ns
• Cache access time: ~10 ns
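Using these numbers, a single main-memory access costs roughly

  60 ns / (1 ns per cycle) ≈ 60 CPU cycles,

versus about 10 cycles for a cache hit, so a processor that misses the cache often spends most of its time waiting. This is the mismatch that the memory hierarchy and data locality on the next slide try to hide.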

Page 13: Parallel and Distributed Computing

Memory Access and CPU Speed

Page 14: Parallel and Distributed Computing

CPU, Memory, and Disk Speed

Page 15: Parallel and Distributed Computing

Possible Solutions
• A hierarchy of successively faster memory devices (multilevel caches)
• Locality of data reference (data locality) – see the loop-ordering sketch after this list
• Efficient programming can be an issue
• Parallel systems may provide:
  1. a larger aggregate cache
  2. higher aggregate bandwidth to the memory system
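A minimal C sketch of the data-locality point above (an illustration, not part of the slides; the array size and function names are arbitrary): summing a row-major 2-D array row by row touches consecutive addresses and reuses each fetched cache line, while summing it column by column strides through memory and misses the cache on almost every access, even though both loops do the same arithmetic.

#include <stdio.h>
#include <stdlib.h>

#define N 2048

/* Row-major traversal: consecutive iterations touch adjacent addresses,
   so each cache line brought in from memory is used completely. */
static double sum_row_major(double (*a)[N]) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-major traversal of the same array: consecutive iterations are
   N*sizeof(double) bytes apart, so most accesses miss the cache. */
static double sum_col_major(double (*a)[N]) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

int main(void) {
    double (*a)[N] = malloc(sizeof(double[N][N]));
    if (a == NULL) return 1;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 1.0;
    printf("%.0f %.0f\n", sum_row_major(a), sum_col_major(a));
    free(a);
    return 0;
}

Both functions return the same value; timing them (e.g. with clock()) shows how much locality alone changes the effective memory bandwidth.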

Page 16: Parallel and Distributed Computing

Distributed Data Communications

• Data may be collected and stored at different locations

• It is expensive to bring them to a central location for processing

• Many computing assignments may be inherently parallel

• Privacy issues in data mining and other large scale commercial database manipulations

Page 17: Parallel and Distributed Computing

Distributed Data Communications

Page 18: Parallel and Distributed Computing

Why Use Parallel Computing

• Save time – wall clock time – many processors work together

• Solve larger problems – larger than one processor’s CPU and memory can handle

• Provide concurrency – do multiple things at the same time: online access to databases, search engine

• Google's 4,000-server PC farm is one of the largest clusters in the world

Page 19: Parallel and Distributed Computing

Other Reasons for Parallel Computing

• Taking advantage of non-local resources – using computing resources on a wide-area network, or even the internet (grid & cloud computing) – remote access to resources

• Cost savings – using multiple “cheap” computing resources instead of a high-end CPU

• Overcoming memory constraints – for large problems, using memories of multiple computers may overcome the memory constraint obstacle

Page 20: Parallel and Distributed Computing

Need for Large Scale Modeling

• Weather forecasting

• Ocean modeling

• Oil reservoir simulations

• Car and airplane manufacture

• Semiconductor simulation

• Pollution tracking

• Large commercial databases

• Aerospace (NASA microgravity modeling)

Page 21: Parallel and Distributed Computing

Issues in Parallel Computing (main issues of the course)

• Design of parallel computers
• Design of efficient parallel algorithms
• Methods for evaluating parallel algorithms
• Parallel computer languages
• Parallel programming tools
• Portable parallel programs
• Automatic programming of parallel computers
• Education in the parallel computing philosophy

Page 22: Parallel and Distributed Computing

PARALLEL ARCHITECTURES
• A parallel computer can be characterized as a system where multiple processing elements cooperate in executing one or more tasks.
• The numerous existing parallel architectures and their different approaches require some kind of classification.

Flynn's Taxonomy
• The design of a computer is characterized by:
  – its flow (or stream) of instructions, and
  – its flow (or stream) of data.
• Flynn's taxonomy classifies computers according to the multiplicity of the instruction and data flows.

Page 23: Parallel and Distributed Computing

PARALLEL ARCHITECTURES
SISD (Single Instruction Single Data) architecture
• Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle
• Single data: only one data stream is being used as input during any one clock cycle
• Deterministic execution
• Corresponds to the conventional sequential computer – the oldest and, even today, the most common type of computer
• Examples: older-generation mainframes, minicomputers, and workstations; most modern-day PCs

Page 24: Parallel and Distributed Computing

PARALLEL ARCHITECTURES
MISD (Multiple Instruction Single Data)
• A single data stream is fed into multiple processing units.
• Each processing unit operates on the data independently via independent instruction streams.
• Few machines of this class have ever existed, e.g. the experimental Carnegie-Mellon C.mmp computer (1971).
• Some uses might be:
  – multiple frequency filters operating on a single signal stream
  – multiple cryptography algorithms attempting to crack a single coded message
• Although MISD does not seem meaningful on its own, pipeline architectures, as found in all modern processors, can be considered MISD.

Page 25: Parallel and Distributed Computing

Parallel & Distributed Computing: HARDWARE PLATFORMS
• Three principal classes of parallel computers used today:
  – SIMD machines
  – MIMD machines
    • shared-memory multiprocessors
    • distributed-memory multicomputers
  – Networked workstations

Taxonomy of Parallel and Distributed Computers:
• Parallel computers
  – SIMD machines
  – Shared memory (multiprocessors)
  – Distributed memory (multi-computers)
• Distributed computers
  – Networked workstations

Page 26: Parallel and Distributed Computing

SIMD Machines
• The main characteristic of SIMD machines is that all processors must execute the same instruction (on different data elements) at any instant in the program's execution.
• These machines execute in "lock-step", synchronous to a global clock.
• All processors must complete execution of the current instruction before any is allowed to proceed to the next instruction.

Page 27: Parallel and Distributed Computing

SIMD Machines
• SIMD machines typically contain more, but simpler, processors (Processing Elements, PEs).
• Best suited for specialized problems characterized by a high degree of regularity, such as graphics/image processing.
• Two varieties, processor arrays and vector pipelines:
  – Processor arrays: Connection Machine CM-2, MasPar MP-1 & MP-2, ILLIAC IV
  – Vector pipelines: IBM 9000, Cray X-MP, Y-MP & C90, Fujitsu VP, NEC SX-2, Hitachi S820, ETA10
• Most modern computers, particularly those with graphics processing units (GPUs), employ SIMD instructions and execution units.

Page 28: Parallel and Distributed Computing

Processor Arrays – ILLIAC IV

Page 29: Parallel and Distributed Computing
Page 30: Parallel and Distributed Computing

VECTOR PROCESSOR vs. ARRAY PROCESSOR
• A vector processor is a computer with built-in instructions that perform multiple calculations on a vector (a one-dimensional array) of data.
• A vector processor is used to solve the same or similar problems as an array processor.
• A vector processor passes a vector of data to its functional units (see the sketch below).
• An array processor passes each element of a vector to a different arithmetic unit.
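A small C sketch of the kind of loop both machine classes target (illustrative only; the saxpy name and sizes are arbitrary, and the OpenMP simd pragma is merely a hint to use the host CPU's SIMD units): a vector processor would stream the whole loop through its functional units as a few vector instructions, while an array processor would hand element i to processing element i.

#include <stdio.h>

#define N 1024

/* y = a*x + y over whole vectors (one-dimensional arrays). */
static void saxpy(float a, const float *x, float *y, int n) {
    #pragma omp simd            /* ask the compiler to vectorize, if supported */
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int main(void) {
    float x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }
    saxpy(3.0f, x, y, N);
    printf("y[0] = %.1f\n", y[0]);   /* expect 5.0 */
    return 0;
}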

Page 31: Parallel and Distributed Computing

MIMD Machines
• The most common type of parallel computer; most modern computers fall into this category.
• Every Processing Element (PE) has its own Control Unit (CU).
• Examples: most current supercomputers, networked parallel computer clusters and "grids", multi-processor SMP computers, multi-core PCs.

Page 32: Parallel and Distributed Computing

MIMD Machines

• PEs operate independently of each other and execute independent instructions on different data streams

• Execution can be deterministic or non-deterministic.
• Non-deterministic: multiple ways of processing the same input, without any specification of which one will be taken to arrive at the outcome.

Page 33: Parallel and Distributed Computing

MIMD Machines
• A parallel execution of a global task (i.e., the collaboration of PEs) is achieved through synchronization (synchronous or asynchronous) and data exchange between the PEs via the interconnection network.
• An MIMD architecture can simulate an SIMD architecture by executing the same program on all the processors; this is called SPMD (Single Program Multiple Data) mode – see the sketch below.
  – deterministic or non-deterministic execution?!
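A minimal SPMD sketch using MPI (an assumed library choice; the slide does not name one): every process runs the same program, and the rank it obtains at run time selects which part of the data it works on.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I?  */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes?  */

    /* Same program everywhere; each rank sums its own slice of 0..99. */
    int chunk = 100 / size;
    int lo = rank * chunk;
    int hi = (rank == size - 1) ? 100 : lo + chunk;
    long local = 0;
    for (int i = lo; i < hi; i++) local += i;

    long total = 0;
    MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("sum = %ld\n", total);   /* expect 4950 */

    MPI_Finalize();
    return 0;
}

Run with, for example, mpirun -np 4 ./spmd_sum; whether the overall execution behaves deterministically depends on how the ranks interact.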

Page 34: Parallel and Distributed Computing

Memory Architectures
• For both the design and the programming model of a parallel system, the memory organization is a very important issue – one not considered by Flynn's taxonomy.
• The memory organization of a parallel system can be divided into two aspects:
  – the location, and
  – the access policy of the memory.
• Regarding location, memory is either centralized or distributed with the processors.
• Systems with a common memory, distributed or not, are called shared-memory machines: all processors have full access to the memory.
• Systems where there is no such shared memory are called distributed-memory machines: processors have to use explicit means of communication, like message passing.

Page 35: Parallel and Distributed Computing

Memory Architectures
• With these two aspects of memory organization, two common memory organizations:
  – Centralized Memory Multiprocessor
    • a common memory – Symmetric Multiprocessor (SMP)
  – Distributed Memory Multiprocessor
    • Shared Memory (Distributed Shared Memory) – Distributed Shared Multiprocessor (DSM)
    • Message Passing – Multi-computer

Page 36: Parallel and Distributed Computing

Shared Memory Multi-Processor Systems
• Two different types of shared-memory multi-processor systems:
  – Symmetric Multi-Processors (SMPs)
  – Distributed Shared Memory (DSM)
• In both cases every processor can read and write to any portion of the system's memory (shared variables) – see the sketch below.
• Cache coherence protocols are needed in both cases.
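A minimal shared-variable sketch using POSIX threads (an assumed API; the slide does not prescribe one): every thread reads and writes the same counter in the shared address space, and a mutex keeps the concurrent updates correct. The hardware's cache coherence protocol is what keeps each processor's cached copy of that counter consistent.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NITERS   100000

/* One variable in the shared address space, visible to every thread. */
static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *work(void *arg) {
    (void)arg;
    for (int i = 0; i < NITERS; i++) {
        pthread_mutex_lock(&lock);    /* serialize the read-modify-write */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, work, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld (expected %d)\n", counter, NTHREADS * NITERS);
    return 0;
}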

Page 37: Parallel and Distributed Computing

Shared Memory Multi-Processor Systems
• SMP system: a single pool of memory, which is equally fast for all processors – referred to as Uniform Memory Access (UMA).
• DSM system: multiple pools of memory; the latency to access memory depends on the relative position of the processor and memory – referred to as cache-coherent Non-Uniform Memory Access (NUMA).
  – Each processor has local memory (the lowest latency), while everything else is classified as remote memory and is slower to access.

Page 38: Parallel and Distributed Computing

Symmetric Multi-Processors (SMPs)
• A shared bus provides a single point of arbitration.
• Cache coherency is relatively straightforward: when a processor needs to broadcast a cache-coherency message, it simply sends that information over the bus, and all other processors can receive it.
• This is the model that Intel has pursued since the advent of the Pentium Pro, and it continues today with the entire Xeon line.

Page 39: Parallel and Distributed Computing

Distributed Shared Memory (DSM)
• Scales more effectively because local memory can be accessed rapidly.
• Point-to-point interconnects are a natural fit, because the bandwidth grows proportionally to the number of processors and the amount of memory in the system.
• The biggest downside of distributed memory is that it only works well if the operating system is "NUMA-aware" and can efficiently place memory and processes.
  – An intelligent OS would place data and schedule processes such that each processor only accesses local memory and never needs remote data (see the first-touch sketch below).
• Within DSM, a variety of different topologies can be used: crossbar, fat tree, torus, ring, hypercube, etc. Moreover, these topologies can be combined into hybrid topologies: a tree of rings, a crossbar of meshes, etc.
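One common way good placement happens in practice is first-touch page allocation: many operating systems place a physical page in the memory of the processor whose thread first writes it. The OpenMP sketch below is an illustration under that assumption (first-touch is an OS policy, not guaranteed everywhere): the array is initialized with the same thread layout that later reads it, so on a NUMA/DSM machine each thread mostly touches local, low-latency memory.

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N 10000000L

int main(void) {
    double *a = malloc(N * sizeof *a);
    if (a == NULL) return 1;

    /* First touch: each thread writes its own block, so on a first-touch
       OS the corresponding pages land in that thread's local memory. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 1.0;

    /* Same static schedule later: each thread mostly reads the pages it
       touched first, i.e. local rather than remote memory. */
    double sum = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (long i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %.0f\n", sum);
    free(a);
    return 0;
}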

Page 40: Parallel and Distributed Computing

Memory Hierarchies of Shared Memory Multi-Processor Systems

Page 41: Parallel and Distributed Computing

Distributed-Memory Multicomputers
• Multi-computers do not support shared variables. Rather, all communication between processors must occur via message passing (see the sketch below).
• Message-passing libraries are provided to send and receive messages between processors.
  – Examples: the Cray T3D, NCube/Ten, and Intel Paragon.
• Large multicomputers may contain hundreds of processors.
  – No need for a cache coherence protocol.
• Network Of Workstations (NOW) {cluster}:
  – Each processor is a stand-alone computer
  – A LAN is used to connect the computers
  – Homogeneous & heterogeneous
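A minimal point-to-point message-passing sketch using MPI (one widely used message-passing library; the slide mentions such libraries only generically): since there are no shared variables, the only way rank 1 can see rank 0's data is to receive it as a message.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int data[4] = {0, 0, 0, 0};
    if (rank == 0) {
        for (int i = 0; i < 4; i++) data[i] = i * i;
        /* No shared memory: rank 1 only sees this array via a message. */
        MPI_Send(data, 4, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(data, 4, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received: %d %d %d %d\n",
               data[0], data[1], data[2], data[3]);
    }

    MPI_Finalize();
    return 0;
}

Run with at least two processes, e.g. mpirun -np 2 ./msg.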

Page 42: Parallel and Distributed Computing

Distinction between Shared-Memory Multiprocessors and Multicomputers
• The distinction matters because:
  – The common address space provided by shared-memory machines allows global data structures, referenced by more than one processor, to be used.
  – The distributed address space provided by distributed-memory machines allows only local data structures.
  – Many memory-management techniques may be used in shared-memory machines.

Page 43: Parallel and Distributed Computing

“Parallel” Computing
• Traditional supercomputers
  – SIMD, MIMD, pipelines
  – Tightly coupled shared memory (SMP, DSM)
  – Loosely coupled distributed memory (multicomputer)
  – Bus-level connections
  – Expensive to buy and to maintain
    • Very high starting cost: expensive hardware, expensive software
    • High maintenance
    • Expensive to upgrade
• Cooperating networks of computers