Arquitectura de Computadores II (Computer Architecture II)
Paulo Marques
Departamento de Eng. Informática, Universidade de Coimbra
pmarques@dei.uc.pt
2004/2005
6. Multiprocessing
6.1. Introduction
Important Note
The presentation of this part of the material is largely based on an international course taught at DEI in September 2003 on “Cluster Computing and Parallel Programming”. The original slides can be found at:
http://eden.dei.uc.pt/~pmarques/courses/best2003/pmarques_best.pdf
In addition to those materials, the main sources are Chapter 6 of [CAQA] and Chapter 9 of “Computer Organization and Design”.
Motivation
I have a program that takes 7 days to execute, which is far too long for practical use. How do I make it run in 1 day?
- Work smarter! (i.e. find better algorithms)
- Work faster! (i.e. buy a faster processor/memory/machine)
- Work harder! (i.e. add more processors!!!)
Motivation
We are interested in the last approach: add more processors!
(We don’t care about being too smart or spending too much $$$ on bigger, faster machines!)
Why?
- It may not be feasible to find better algorithms
- Normally, faster, bigger machines are very expensive
- There are lots of computers available in any institution (especially at night)
- There are computer centers from which you can buy parallel machine time
- Adding more processors enables you not only to run things faster, but also to run bigger problems
Motivation
“Adding more processors enables you not only to run things faster, but to run bigger problems”?!
“9 women cannot have a baby in 1 month, but they can have 9 babies in 9 months”
This is called the Gustafson-Barsis law (informally)
What the Gustafson-Barsis law tells us is that when the size of the problem grows, there is normally more parallelism available
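To put a number on the informal statement: the Gustafson-Barsis law is usually written as scaled speedup = n - s(n - 1), where s is the fraction of the parallel run spent in serial code. That formula is not on the slides, so the sketch below assumes this standard form; the function name is only illustrative.

```python
def gustafson_speedup(n: int, s: float) -> float:
    """Scaled speedup on n processors when a fraction s of the
    parallel run time is serial (standard Gustafson-Barsis form,
    assumed here; it is not stated in the slides)."""
    if n < 1 or not 0.0 <= s <= 1.0:
        raise ValueError("need n >= 1 and 0 <= s <= 1")
    return n - s * (n - 1)


# The problem grows with the machine: 9 processors do roughly
# 9 times the work in the same time, minus the serial share.
for n in (1, 9, 30):
    print(n, round(gustafson_speedup(n, 0.05), 2))
```

With s = 0.05 the scaled speedup stays close to n, which is the "9 babies in 9 months" situation: more work done in the same time, not the same work done faster.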
6. Multiprocessing
6.2. Machine Architecture
von Neumann Architecture
- Based on the fetch-decode-execute cycle
- The computer executes a single sequence of instructions that act on data. Both program and data are stored in memory.
[Figure: a flow of instructions (A, B, C) acting on data]
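The fetch-decode-execute cycle can be sketched as a toy interpreter. The four-opcode instruction set (`LOAD`, `ADD`, `STORE`, `HALT`) and the single accumulator are hypothetical choices made for this illustration, not a real machine.

```python
def run(memory):
    """Run a toy stored-program machine until HALT; return memory."""
    pc = 0    # program counter
    acc = 0   # accumulator register
    while True:
        opcode, operand = memory[pc]   # fetch
        pc += 1
        if opcode == "LOAD":           # decode + execute
            acc = memory[operand]
        elif opcode == "ADD":
            acc += memory[operand]
        elif opcode == "STORE":
            memory[operand] = acc
        elif opcode == "HALT":
            return memory
        else:
            raise ValueError(f"unknown opcode {opcode!r}")


# Program (cells 0-3) and data (cells 4-6) share one memory.
memory = [
    ("LOAD", 4),    # acc = memory[4]
    ("ADD", 5),     # acc += memory[5]
    ("STORE", 6),   # memory[6] = acc
    ("HALT", None),
    2, 3, 0,
]
print(run(memory)[6])  # prints 5
```

Note that every instruction fetch and every data access goes through the same `memory` list, which is the property the rest of this section builds on.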
Flynn's Taxonomy
Classifies computers according to:
- The number of execution flows
- The number of data flows

| Number of execution flows | Single data flow | Multiple data flows |
|---|---|---|
| Single | SISD (Single-Instruction, Single-Data) | SIMD (Single-Instruction, Multiple-Data) |
| Multiple | MISD (Multiple-Instruction, Single-Data) | MIMD (Multiple-Instruction, Multiple-Data) |
Single Instruction, Single Data (SISD)
- A serial (non-parallel) computer
- Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle
- Single data: only one data stream is being used as input during any one clock cycle
- Examples: most PCs, single-CPU workstations, …
Single Instruction, Multiple Data (SIMD)
- A type of parallel computer
- Single instruction: all processing units execute the same instruction at any given clock cycle
- Multiple data: each processing unit can operate on a different data element
- Best suited for specialized problems characterized by a high degree of regularity, such as image processing
- Examples: Connection Machine CM-2, Cray J90, Pentium MMX instructions
[Figure: ADD V3, V1, V2 with V1 = (1, 3, 4, 5, 21, 3, 3, 5) and V2 = (32, 43, 2, 46, 87, 65, 43, 32); V3 receives the element-wise sums]
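The vector addition in the figure can be modelled as follows. This is only a model of the instruction's semantics, using the two operand rows from the slide: a Python list comprehension runs serially, whereas SIMD hardware would process all lanes in the same clock cycle. The name `simd_add` is illustrative.

```python
def simd_add(v1, v2):
    """Model of ADD V3, V1, V2: one instruction, applied to every lane."""
    if len(v1) != len(v2):
        raise ValueError("vector registers must have the same length")
    return [a + b for a, b in zip(v1, v2)]


v1 = [1, 3, 4, 5, 21, 3, 3, 5]
v2 = [32, 43, 2, 46, 87, 65, 43, 32]
v3 = simd_add(v1, v2)
print(v3)  # [33, 46, 6, 51, 108, 68, 46, 37]
```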
The Connection Machine 2 (SIMD)
The massively parallel Connection Machine 2 was a supercomputer produced by Thinking Machines Corporation, containing 32,768 (or more) 1-bit processors that work in parallel.
Multiple Instruction, Single Data (MISD)
- Few actual examples of this class of parallel computer have ever existed
- Some conceivable examples might be:
  - multiple frequency filters operating on a single signal stream
  - multiple cryptography algorithms attempting to crack a single coded message
  - the Data Flow Architecture
Multiple Instruction, Multiple Data (MIMD)
Currently, the most common type of parallel computer
Multiple Instruction: every processor may be executing a different instruction stream
Multiple Data: every processor may be working with a different data stream
Execution can be synchronous or asynchronous, deterministic or non-deterministic
Examples: most current supercomputers, computer clusters, multi-processor SMP machines (inc. some types of PCs)
IBM BlueGene/L DD2
Department of Energy's Lawrence Livermore National Laboratory (California, USA)
Currently the fastest machine on earth (70 TFLOPS)
Some facts:
- 32768 x 700 MHz PowerPC 440 CPUs (dual processors)
- 512 MB RAM per node, total = 16 TByte of RAM
- 3D torus network; 300 MB/sec per node
IBM BlueGene/L DD2
| Level | Composition | Performance | Memory |
|---|---|---|---|
| Chip | 2 processors | 2.8/5.6 GF/s | 4 MB |
| Compute Card | 2 chips, 2x1x1 | 5.6/11.2 GF/s | 0.5 GB DDR |
| Node Board | 32 chips, 4x4x2 (16 compute cards) | 90/180 GF/s | 8 GB DDR |
| Cabinet | 32 node boards, 8x8x16 | 2.9/5.7 TF/s | 256 GB DDR |
| System | 64 cabinets, 64x32x32 | 180/360 TF/s | 16 TB DDR |
What about Memory?
The interface between CPUs and Memory in Parallel Machines is of crucial importance
The bus between memory and CPU is very often the bottleneck; this is known as the von Neumann bottleneck
It limits how fast a machine can operate: what matters is the ratio between computation and communication
Communication in Parallel Machines
Programs act on data. An important question: how do processors access each other's data?
[Figure: Shared Memory Model (four CPUs attached to a single memory) and Message Passing Model (four CPU/memory pairs connected by a network)]
Shared Memory
Shared memory parallel computers vary widely, but generally have in common the ability for all processors to access all memory as a global address space
Multiple processors can operate independently but share the same memory resources
Changes in a memory location made by one processor are visible to all other processors
Shared memory machines can be divided into two main classes based upon memory access times: UMA and NUMA
Shared Memory (2)
[Figure: UMA (Uniform Memory Access): a single 4-processor machine, four CPUs sharing one memory. NUMA (Non-Uniform Memory Access): a 3-processor NUMA machine, three CPU/memory pairs linked by a fast memory interconnect]
Uniform Memory Access (UMA)
- Most commonly represented today by Symmetric Multiprocessor (SMP) machines
- Identical processors
- Equal access and access times to memory
- Sometimes called CC-UMA (Cache Coherent UMA). Cache coherent means that if one processor updates a location in shared memory, all the other processors know about the update. Cache coherency is accomplished at the hardware level.
- Very hard to scale
Non-Uniform Memory Access (NUMA)
- Often made by physically linking two or more SMPs. One SMP can directly access the memory of another SMP.
- Not all processors have equal access time to all memories
- Sometimes called DSM (Distributed Shared Memory)
- Advantages:
  - User-friendly programming perspective on memory
  - Data sharing between tasks is both fast and uniform due to the proximity of memory and CPUs
  - More scalable than SMPs
- Disadvantages:
  - Lack of scalability between memory and CPUs
  - Programmer responsibility for synchronization constructs that ensure "correct" access to global memory
  - Expensive: it becomes increasingly difficult and expensive to design and produce shared memory machines with ever increasing numbers of processors
UMA and NUMA
UMA: the Power Mac G5 features 2 PowerPC 970/G5 processors that share a common central memory (up to 8 GByte)
NUMA: SGI Origin 3900
- 16 R14000A processors per brick, each brick with 32 GBytes of RAM
- 12.8 GB/s aggregated memory bandwidth
- Scales up to 512 processors and 1 TByte of memory
Distributed Memory (DM)
- Processors have their own local memory
- Memory addresses in one processor do not map to another processor (no global address space)
- Because each processor has its own local memory, cache coherency does not apply
- Requires a communication network to connect inter-processor memory
- When a processor needs access to data in another processor, it is usually the task of the programmer to explicitly define how and when data is communicated
- Synchronization between tasks is the programmer's responsibility
- Very scalable
- Cost effective: use of off-the-shelf processors and networking
- Slower than UMA and NUMA machines
Distributed Memory
[Figure: three computers, each with its own CPU and memory, connected by a network interconnect. Photo: TITAN@DEI, a PC cluster interconnected by Fast Ethernet]
Hybrid Architectures
Today, most systems are a hybrid featuring both shared and distributed memory.
- Each node has several processors that share a central memory
- A fast switch interconnects the nodes
- In some cases the interconnect allows memory to be mapped among nodes; in most cases it provides a message passing interface
[Figure: four nodes, each with four CPUs sharing one memory, connected by a fast network interconnect]
ASCI White at the Lawrence Livermore National Laboratory
- Each node is an IBM POWER3 375 MHz NH-2 16-way SMP (i.e. 16 processors/node)
- Each node has 16 GB of memory
- A total of 512 nodes, interconnected by a 2 GB/sec node-to-node network
- The 512 nodes feature a total of 8192 processors, with a total of 8192 GB of memory
- It currently operates at 13.8 TFLOPS
Summary
| Architecture | CC-UMA | CC-NUMA | Distributed/Hybrid |
|---|---|---|---|
| Examples | SMPs, Sun Vexx, SGI Challenge, IBM Power3 | SGI Origin, HP Exemplar, IBM Power4 | Cray T3E, IBM SP2 |
| Programming | MPI, Threads, OpenMP, Shmem | MPI, Threads, OpenMP, Shmem | MPI |
| Scalability | <10 processors | <1000 processors | ~1000 processors |
| Drawbacks | Limited memory bandwidth; hard to scale | New architecture; point-to-point communication | Costly system administration; programming is hard to develop and maintain |
| Software availability | Great | Great | Limited |
Summary (2)
Plot of top 500 supercomputer sites over a decade
6. Multiprocessing
6.3. Programming Models and Challenges
Warning
We will now introduce the main ways in which you can program a parallel machine.
Don’t worry if you don’t immediately visualize all the primitives that the APIs provide. We will cover that later. For now, you just have to understand the main ideas behind each paradigm.
In summary: DON’T PANIC!
The main programming models…
A programming model abstracts the programmer from the hardware implementation
The programmer sees the whole machine as a big virtual computer which runs several tasks at the same time
The main models in current use are:
- Shared Memory
- Message Passing
- Data parallel / Parallel Programming Languages
Note that this classification is not all inclusive. There are hybrid approaches and some of the models overlap (e.g. data parallel with shared memory/message passing)
Shared Memory Model
[Figure: threads A, B, C and D, each running on its own processor, all access the variables below]
double matrix_A[N];
double matrix_B[N];
double result[N];
Globally Accessible Memory (Shared)
Shared Memory Model
Independently of the hardware, each program sees a global address space
Several tasks execute at the same time and read and write from/to the same virtual memory
Locks and semaphores may be used to control access to the shared memory
An advantage of this model is that there is no notion of data “ownership”. Thus, there is no need to explicitly specify the communication of data between tasks. Program development can often be simplified.
An important disadvantage is that it becomes more difficult to understand and manage data locality. Performance can be seriously affected.
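A minimal sketch of the shared memory model, using Python's `threading` module rather than any API named in these slides: every thread reads the same input and updates the same accumulator, and a lock controls access to it. The function name `parallel_sum` and the chunking scheme are assumptions made for the example.

```python
import threading


def parallel_sum(values, num_threads=4):
    """Sum `values` with several threads that share one accumulator."""
    shared = {"total": 0}          # visible to every thread
    lock = threading.Lock()        # controls access to shared["total"]

    def worker(start, stop):
        partial = sum(values[start:stop])   # reads the shared input directly
        with lock:                          # one writer at a time
            shared["total"] += partial

    # Ceiling division, so that every element lands in some chunk.
    chunk = max(1, (len(values) + num_threads - 1) // num_threads)
    threads = [
        threading.Thread(target=worker, args=(i, i + chunk))
        for i in range(0, len(values), chunk)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                   # wait until every thread is done
    return shared["total"]


print(parallel_sum(list(range(1, 101))))  # 5050
```

No data is ever sent anywhere: the only coordination is the lock around the update, which is exactly the part the programmer must get right.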
Shared Memory Models
There are two major shared memory models:
- All tasks have access to all the address space (typical in UMA machines running several threads)
- Each task has its own address space. Most of the address space is private. A certain zone is visible across all tasks. (typical in DSM machines running different processes)
[Figure: left, tasks A, B and C share a single memory (all the tasks share the same address space); right, tasks A and B each have a private memory (Memory A, Memory B) plus a shared memory zone]
Shared Memory Model – Closely Coupled Implementations
On shared memory platforms, the compiler translates user program variables into global memory addresses
Typically a thread model is used for developing the applications:
- POSIX Threads
- OpenMP
There are also some parallel programming languages that offer a global memory model, although data and tasks are distributed
For DSM machines, no standard exists, although there are some proprietary implementations
Shared Memory – Thread Model
- A single process can have multiple threads of execution
- Each thread can be scheduled on a different processor, taking advantage of the hardware
- All threads share the same address space
- From a programming perspective, thread implementations commonly comprise:
  - A library of subroutines that are called from within parallel code
  - A set of compiler directives embedded in either serial or parallel source code
- Unrelated standardization efforts have resulted in two very different implementations of threads: POSIX Threads and OpenMP
POSIX Threads
- Library based; requires parallel coding
- Specified by the IEEE POSIX 1003.1c standard (1995), also known as PThreads
- C language
- Most hardware vendors now offer PThreads
- Very explicit parallelism; requires significant programmer attention to detail
OpenMP
- Compiler directive based; can use serial code
- Jointly defined and endorsed by a group of major computer hardware and software vendors. The OpenMP Fortran API was released October 28, 1997. The C/C++ API was released in late 1998
- Portable / multi-platform, including Unix and Windows NT platforms
- Available in C/C++ and Fortran implementations
- Can be very easy and simple to use: provides for “incremental parallelism”
- No free compilers available
Message Passing Model
The programmer must send and receive messages explicitly
Message Passing Model
- A set of tasks that use their own local memory during computation
- Tasks exchange data by sending and receiving messages
  - Multiple tasks can reside on the same physical machine as well as across an arbitrary number of machines
- Data transfer usually requires cooperative operations to be performed by each process. For example, a send operation must have a matching receive operation.
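The send/receive pairing can be sketched with two tasks that communicate only through message queues. This is an in-process analogy built on Python's `queue` and `threading` modules, not MPI; the two channels and the names `worker` and `main_task` are assumptions made for the example.

```python
import queue
import threading


def worker(inbox, outbox):
    """Task with private state; it sees only what arrives as a message."""
    numbers = inbox.get()          # receive: blocks until a message arrives
    outbox.put(sum(numbers))       # send the result back


def main_task(data):
    to_worker = queue.Queue()      # channel: main -> worker
    from_worker = queue.Queue()    # channel: worker -> main
    t = threading.Thread(target=worker, args=(to_worker, from_worker))
    t.start()
    to_worker.put(list(data))      # send a copy: nothing is shared
    result = from_worker.get()     # matching receive for the worker's send
    t.join()
    return result


print(main_task([1, 2, 3, 4]))  # 10
```

Each `put` has exactly one matching `get`; if either side forgot its half of the pair, the other would block forever, which is the typical failure mode of message passing programs.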
Message Passing Implementations
- Message passing is generally implemented as libraries which the programmer calls
- A variety of message passing libraries have been available since the 1980s
  - These implementations differed substantially from each other, making it difficult for programmers to develop portable applications
- In 1992, the MPI Forum was formed with the primary goal of establishing a standard interface for message passing implementations
MPI – The Message Passing Interface
- Part 1 of the Message Passing Interface (MPI), the core, was released in 1994
- Part 2 (MPI-2), the extensions, was released in 1997
- Freely available on the web: http://www.mpi-forum.org/docs/docs.html
- MPI is now the “de facto” industry standard for message passing
  - Nevertheless, most systems do not implement the full specification, especially MPI-2
- For shared memory architectures, MPI implementations usually don’t use a network for task communications
  - Typically a set of devices is provided: some for network communication, some for shared memory. In most cases, they can coexist.
Data Parallel Model
Typically a set of tasks performs the same operations on different parts of a big array
Data Parallel Model
The data parallel model demonstrates the following characteristics:
- Most of the parallel work focuses on performing operations on a data set
- The data set is organized into a common structure, such as an array or cube
- A set of tasks works collectively on the same data structure; however, each task works on a different partition of it
- Tasks perform the same operation on their partition of work, for example, “add 4 to every array element”
On shared memory architectures, all tasks may have access to the data structure through global memory.
On distributed memory architectures the data structure is split up and resides as "chunks" in the local memory of each task
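The “add 4 to every array element” example can be sketched as below. It uses Python's `concurrent.futures` thread pool to stand in for the set of tasks; the contiguous partitioning and the function names are assumptions made for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def add_four(chunk):
    """The single operation every task applies to its own partition."""
    return [x + 4 for x in chunk]


def data_parallel_add_four(array, num_tasks=4):
    # Split the array into contiguous partitions, one per task.
    size = max(1, (len(array) + num_tasks - 1) // num_tasks)
    chunks = [array[i:i + size] for i in range(0, len(array), size)]
    with ThreadPoolExecutor(max_workers=num_tasks) as pool:
        results = pool.map(add_four, chunks)   # same operation, different data
        return [x for chunk in results for x in chunk]


print(data_parallel_add_four([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]))
```

The programmer writes the operation once (`add_four`) and decides how the data is partitioned; which task handles which chunk is left to the pool.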
Data Parallel Programming
- Typically accomplished by writing a program with data parallel constructs:
  - calls to a data parallel subroutine library
  - compiler directives
- In most cases, parallel compilers are used:
  - High Performance Fortran (HPF): extensions to Fortran 90 to support data parallel programming
  - Compiler directives: allow the programmer to specify the distribution and alignment of data. Fortran implementations are available for most common parallel platforms
- Distributed memory (DM) implementations have the compiler convert the program into calls to a message passing library to distribute the data to all the processes. All message passing is done invisibly to the programmer
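To give an idea of what “specifying the distribution of data” means, the sketch below computes which process owns each array element under two common schemes, block and cyclic (HPF offers directives for both). The helper names are hypothetical, and this is plain Python modelling the index arithmetic, not HPF.

```python
def block_owner(i, n, p):
    """Process that owns element i of an n-element array split into
    p contiguous blocks of ceil(n / p) elements."""
    block = (n + p - 1) // p
    return i // block


def cyclic_owner(i, p):
    """Process that owns element i when elements are dealt out
    round-robin across p processes."""
    return i % p


n, p = 8, 4
print([block_owner(i, n, p) for i in range(n)])   # [0, 0, 1, 1, 2, 2, 3, 3]
print([cyclic_owner(i, p) for i in range(n)])     # [0, 1, 2, 3, 0, 1, 2, 3]
```

A block distribution keeps neighbouring elements on the same process, which suits computations that touch adjacent elements; a cyclic distribution spreads uneven work more evenly.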
Summary
Middleware for parallel programming:
- Shared memory: all the tasks (threads or processes) see a global address space. They read and write directly from memory and synchronize explicitly.
- Message passing: the tasks have private memory. For exchanging information, they send and receive data through a network. There is always a send() and receive() primitive.
- Data parallel: the tasks work on different parts of a big array. Typically accomplished by using a parallel compiler which allows data distribution to be specified.
Final Considerations…
Beware of Amdahl's Law!
Load Balancing
Load balancing is always a factor to consider when developing a parallel application.
- Granularity too coarse: poor load balancing
- Granularity too fine: too much communication
The ratio computation/communication is of crucial importance!
[Figure: timeline of Tasks 1, 2 and 3 showing periods of work and periods of waiting]
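The granularity trade-off can be explored with a small scheduling model. Everything here is a simplifying assumption: the job lengths are made up, `static_schedule` hands each worker one big block of jobs, `dynamic_schedule` hands out one job at a time, and `overhead` stands for the communication cost of each hand-out.

```python
import heapq


def static_schedule(durations, workers):
    """Coarse granularity: one contiguous block of jobs per worker."""
    size = max(1, (len(durations) + workers - 1) // workers)
    loads = [sum(durations[i:i + size]) for i in range(0, len(durations), size)]
    return loads + [0] * (workers - len(loads))


def dynamic_schedule(durations, workers, overhead=0):
    """Fine granularity: the next job goes to whichever worker is free
    first; every hand-out costs `overhead` time units."""
    loads = [0] * workers
    heapq.heapify(loads)
    for d in durations:
        heapq.heappush(loads, heapq.heappop(loads) + d + overhead)
    return sorted(loads)


def wasted_time(loads):
    """Total time workers spend waiting for the slowest one."""
    return sum(max(loads) - load for load in loads)


jobs = [8, 7, 6, 1, 1, 1, 1, 1, 1]       # hypothetical job lengths
for loads in (static_schedule(jobs, 3), dynamic_schedule(jobs, 3)):
    print(loads, "finish:", max(loads), "waiting:", wasted_time(loads))
```

With these numbers the static split finishes at time 21 with two workers mostly waiting, while job-by-job hand-out finishes at time 9. Raising `overhead` shows the other side: the finer the granularity, the more of the run is spent on hand-outs rather than work.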
Amdahl's Law
The speedup depends on the amount of code that cannot be parallelized:

$$\mathrm{speedup}(n, s) = \frac{T}{T s + \frac{T (1 - s)}{n}} = \frac{1}{s + \frac{1 - s}{n}}$$

- n: number of processors
- s: percentage of code that cannot be made parallel
- T: time it takes to run the code serially
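Amdahl's formula is easy to evaluate directly. The sketch below takes s as a fraction between 0 and 1 rather than a percentage; the function name is illustrative.

```python
def amdahl_speedup(n: int, s: float) -> float:
    """Speedup on n processors when a fraction s of the run is serial."""
    if n < 1 or not 0.0 <= s <= 1.0:
        raise ValueError("need n >= 1 and 0 <= s <= 1")
    return 1.0 / (s + (1.0 - s) / n)


for s in (0.0, 0.05, 0.10, 0.20):
    print(f"s = {s:.2f}: speedup on 30 processors = {amdahl_speedup(30, s):.2f}")
```

Note that T cancels out: the speedup depends only on n and s, not on how long the serial program takes.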
Amdahl's Law – The Bad News!
[Figure: Speedup vs. Percentage of Non-Parallel Code. Speedup (0 to 30) against number of processors (1 to 30), with curves for 0% (linear speedup), 5%, 10% and 20% non-parallel code]
Efficiency Using 30 Processors
[Figure: Efficiency Using 30 Processors. x-axis: Percentage of Non-Parallel Code (0%, 5%, 10%, 20%); y-axis: Efficiency (%), from 0 to 100]

$$\mathrm{efficiency}(n, s) = \frac{\mathrm{speedup}(n, s)}{n}$$
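The efficiency definition can be checked numerically. The sketch below is only an illustration: the function names are assumptions for this example, and `s` is given as a fraction.

```python
def speedup(n: int, s: float) -> float:
    """Amdahl speedup: 1 / (s + (1 - s) / n)."""
    return 1.0 / (s + (1.0 - s) / n)


def efficiency(n: int, s: float) -> float:
    """Fraction of the n processors that is effectively used."""
    return speedup(n, s) / n


# Efficiency with 30 processors for the serial fractions plotted on the slide.
for s in (0.0, 0.05, 0.10, 0.20):
    print(f"s = {s:.0%}: efficiency = {efficiency(30, s):.1%}")
```

With 30 processors, 5% of non-parallel code already brings the efficiency down to roughly 41%, and 20% brings it to roughly 15%.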
What Is That “s” Anyway?
Three slides ago… “s: percentage of code that cannot be made parallel”
Actually, it’s worse than that: it’s the percentage of time that cannot be executed in parallel. It can be:
- Time spent communicating
- Time spent waiting for/sending jobs
- Time spent waiting for the completion of other processes
- Time spent calling the middleware for parallel programming
Remember… if s is even as small as 0.05, the maximum speedup is only 20
Maximum Speedup
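$$\mathrm{speedup}(n, s) = \frac{1}{s + \dfrac{1-s}{n}}$$

If you have ∞ processors, the term (1-s)/n will be 0, so the maximum possible speedup is 1/s

| non-parallel (s) | maximum speedup |
|---|---|
| 0% | ∞ (linear speedup) |
| 5% | 20 |
| 10% | 10 |
| 20% | 5 |
| 25% | 4 |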
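Written out as a limit, using the same symbols as the speedup formula and taking s = 0.25 as a worked case:

```latex
\lim_{n \to \infty} \mathrm{speedup}(n, s)
  = \lim_{n \to \infty} \frac{1}{s + \dfrac{1-s}{n}}
  = \frac{1}{s},
\qquad \text{e.g. } s = 0.25 \;\Rightarrow\; \frac{1}{0.25} = 4 .
```

The bound is approached slowly: with s = 0.05 and n = 30 the formula gives a speedup of about 12.2, still well short of the limit of 20.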
On the Positive Side…
You can run bigger problems
You can run several simultaneous jobs (you have more parallelism available)
Gustafson-Barsis with no equations:
“9 women cannot have a baby in 1 month, but they can have 9 babies in 9 months”
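For readers who do want the equation, the Gustafson-Barsis scaled speedup is usually written as n - s(n - 1). The sketch below assumes that form; note that here `s` is the serial fraction of the time measured on the parallel run, which is not the same quantity as the `s` in Amdahl's Law. The function name is an assumption for this example.

```python
def scaled_speedup(n: int, s: float) -> float:
    """Gustafson-Barsis scaled speedup: n - s * (n - 1).

    s is the fraction of the parallel run time spent in serial code,
    and the problem size is assumed to grow with the number of processors.
    """
    return n - s * (n - 1)


# The "9 babies in 9 months" case: no serial part, 9 workers, 9 times the work.
print(scaled_speedup(9, 0.0))

# A bigger problem on 30 processors with 5% serial time.
print(scaled_speedup(30, 0.05))
```

The point of the comparison is that when the problem grows with the machine, the serial part stops dominating the way it does for a fixed-size problem.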
6. Multiprocessing
6.4. Hardware
The Cache Coherence Problem (UMA)
Maintaining Coherence: Snooping
Snooping
Block Reads and Writes
- Multiple copies of a block are not a problem as long as there are only reads
- However, when there is a write, a processor must have exclusive access to the block it wants to write
- When processors perform a read, they must also always get the most recent value of the block in question
In snooping protocols, the hardware has to locate every cache that holds a copy of the block when a write occurs. There are then two possible approaches:
- Invalidate all caches that hold that block (write-invalidate)
- Update all caches that hold that block
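The write-invalidate approach can be sketched as a toy simulation. This is only an illustration under simplifying assumptions: the `Bus` and `Cache` classes are hypothetical, blocks are single values, and writes go straight through to memory, which real protocols generally avoid.

```python
class Bus:
    """Shared bus that every cache snoops on (toy model)."""

    def __init__(self, memory):
        self.memory = memory      # block name -> value
        self.caches = []

    def attach(self, cache):
        self.caches.append(cache)

    def broadcast_invalidate(self, block, writer):
        # Every other cache sees the write on the bus and drops its copy.
        for cache in self.caches:
            if cache is not writer:
                cache.lines.pop(block, None)


class Cache:
    """Private cache of one processor (toy model)."""

    def __init__(self, bus):
        self.bus = bus
        self.lines = {}           # valid blocks held by this cache
        bus.attach(self)

    def read(self, block):
        # On a miss, fetch the block from memory; many readers may hold copies.
        if block not in self.lines:
            self.lines[block] = self.bus.memory[block]
        return self.lines[block]

    def write(self, block, value):
        # Gain exclusive access first: invalidate all other copies.
        self.bus.broadcast_invalidate(block, writer=self)
        self.lines[block] = value
        self.bus.memory[block] = value   # write-through, an assumption of this sketch


bus = Bus({"X": 0})
cpu0, cpu1 = Cache(bus), Cache(bus)
cpu0.read("X")
cpu1.read("X")            # both caches now hold X
cpu0.write("X", 42)       # cpu1's copy is invalidated
print(cpu1.read("X"))     # cpu1 misses and re-fetches the new value
```

In the second approach (commonly called write-update), `broadcast_invalidate` would instead overwrite the other copies with the new value.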
Snooping Protocol (Example)
The Cache Coherence Problem (NUMA)
The snooping approach does not scale to machines with tens or hundreds of processors (NUMA)
In that case a different kind of protocol is used, based on directories
- A directory is a centralized location that keeps track of who holds each block
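The bookkeeping a directory performs can be sketched as follows. The `Directory` class and its method names are hypothetical and leave out everything a real protocol needs beyond tracking sharers (block states, acknowledgements, data transfer).

```python
class Directory:
    """Tracks which nodes hold a copy of each block (toy model)."""

    def __init__(self):
        self.sharers = {}   # block name -> set of node ids holding a copy

    def record_read(self, block, node):
        # The reading node becomes one more sharer of the block.
        self.sharers.setdefault(block, set()).add(node)

    def record_write(self, block, node):
        # Only the nodes listed in the directory need an invalidation
        # message; there is no broadcast to every processor.
        to_invalidate = self.sharers.get(block, set()) - {node}
        self.sharers[block] = {node}
        return to_invalidate


directory = Directory()
directory.record_read("X", node=0)
directory.record_read("X", node=7)
print(directory.record_write("X", node=7))   # nodes that must drop their copy
```

Because invalidations go only to the recorded sharers, the traffic depends on how widely a block is shared rather than on the total number of processors.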
5. Multiprocessing
5.4. Recent Developments and Examples
Trends
- It is now becoming extremely hard to scale processors in terms of individual performance and clock rate
  - The future is MULTIPROCESSING!!!
- Intel, like other manufacturers, is introducing Simultaneous Multi-Threading (SMT), which in its own terminology is called HyperThreading
  - A potentially reasonable performance gain (max = 30%) at the cost of a small number of extra transistors (5%)
  - Caution: it can lead to worse performance!
  - It prepares programmers for concurrent programming!!! (the widespread opinion is that Hyperthreading served only that purpose)
- Dual-core chips (two processors on the same die and/or package) will be commonplace within the next 2 to 3 years
- Multi-processor servers (SMP, Symmetric Multi-Processing) are now commonplace
- Clusters are now commonplace
Announcements...
How Does HyperThreading Work (1)?
“Normal” superscalar processor
Dual Processor (SMP)
How Does HyperThreading Work (2)?
Time-sliced Multithreaded CPU (Super-Threaded CPU)
Hyper-Threaded CPU
Motivations for Using Simultaneous Multi-Threading (SMT)
- There are usually more functional units available than are actually being used
  - Limits on the size of basic blocks and/or on the available instruction-level parallelism (ILP)
- Today's computers are constantly running more than one program/thread
  - There is independent work available to do. It just is not in the same thread!
- One area where this approach is very useful is hiding unavoidable memory access latencies or branch mispredictions
  - E.g. a thread that has to read data from memory may wait a long time for the data to arrive. During those periods, with SMT, another thread can execute.
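The latency-hiding argument can be illustrated with a deliberately crude model. Everything in it is an assumption made for this sketch: threads are strings of `C` (a one-cycle computation) and `M` (a memory access that blocks its thread for `miss_latency` extra cycles), and the processor has a single issue slot. A real SMT core can issue instructions from several threads in the same cycle, so this only captures the "another thread runs while one waits" part.

```python
def cycles_interleaved(threads, miss_latency):
    """Cycles needed when one issue slot is shared by all threads."""
    next_op = [0] * len(threads)    # index of the next operation per thread
    ready_at = [0] * len(threads)   # first cycle in which each thread may issue
    cycle = 0
    while any(next_op[i] < len(t) for i, t in enumerate(threads)):
        # Issue one operation from the first thread that is not blocked.
        for i, t in enumerate(threads):
            if next_op[i] < len(t) and ready_at[i] <= cycle:
                op = t[next_op[i]]
                next_op[i] += 1
                ready_at[i] = cycle + 1 + (miss_latency if op == "M" else 0)
                break
        cycle += 1
    # Account for a memory access that is still outstanding at the end.
    return max([cycle] + ready_at)


def cycles_back_to_back(threads, miss_latency):
    """Cycles needed when the threads run one after the other."""
    return sum(cycles_interleaved([t], miss_latency) for t in threads)


threads = ["MC", "CCC"]   # thread 0 stalls on memory, thread 1 only computes
print(cycles_back_to_back(threads, miss_latency=3))
print(cycles_interleaved(threads, miss_latency=3))
```

In this example the second thread's computations fit entirely inside the first thread's memory stall, so the interleaved run takes no longer than the first thread alone.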
Implementation (Basic Idea)
- Replicate the processor front-end and everything that is visible at the ISA (Instruction Set Architecture) level, e.g. registers, program counters, etc.
  - This way, one physical processor becomes two processors
- Some resources are partitioned (e.g. instruction issue queues) and others are shared (e.g. reorder buffers)
To Finish... An Example of a Cluster!
The GOOGLE cluster
- Has to serve 1000 queries/second, with no query taking more than 0.5 s!
- 8 billion pages indexed (8,058,044,651 on 1 May 2005)
- Technique used to maintain the index: inverted tables (see TC/BD)
- All pages are revisited monthly
GOOGLE cluster machines
- “Cheap” PCs with Intel processors and 256 MB of RAM
- About 6,000 processors and 12,000 disks (1 PByte of space, 2 disks per machine)
- Red Hat Linux
- 2 sites in California and 2 in Virginia
Network connection
- Each site has an OC48 (2.5 Gbps) link to the Internet
- Between each pair of sites there is an OC12 (622 Mbps) backup link
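The inverted tables mentioned above (usually called an inverted index) map each word to the set of pages that contain it. The sketch below shows the general idea only; the function names and the sample pages are made up for this example, and it says nothing about how Google actually implements its index.

```python
from collections import defaultdict


def build_inverted_index(pages):
    """Map each word to the set of page ids whose text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index


def search(index, query):
    """Return the ids of the pages that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical pages, used only to exercise the functions.
pages = {
    "p1": "cache coherence with snooping",
    "p2": "cache sizes and latency",
    "p3": "clusters of cheap machines",
}
index = build_inverted_index(pages)
print(search(index, "cache coherence"))
```

A query only touches the entries for its own words, which is what makes this structure suitable for answering many queries per second over a large collection.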
Racks and Racks
40 PCs/rack, 40 racks
At Google, the approach to redundancy is to use a massive set of complete machines!
Super-fast machines??
Reading Material
- Computer Architecture: A Quantitative Approach, 3rd Ed. Sections 6.1, 6.3, 6.5 (briefly), 6.9, 6.15
- Alternatively (or additionally), the material is quite well explained in Chapter 9 of Computer Organization and Design, 3rd Ed., D. Patterson & J. Hennessy, Morgan Kaufmann, ISBN 1-55860-604-1, August 2004
  - In particular, the description of the Google cluster was taken from there. The only material not covered was Section 9.6
  - This chapter of the book will be put online on the course site, available only to authenticated users
- Jon Stokes, “Introduction to Multithreading, Superthreading and Hyperthreading”, Ars Technica, October 2003, http://arstechnica.com/articles/paedia/cpu/hyperthreading.ars