Parallel program design

Speaker: 呂宗螢
Date: 2007/06/01


TRANSCRIPT

Page 1: Parallel program design

Parallel program design

Speaker: 呂宗螢  Date: 2007/06/01

Page 2: Parallel program design

Embedded and Parallel Systems Lab

2

Outline

Introduction
Parallel Algorithm Design
Parallel keywords
pthread
OpenMP
MPI
Conclusion

Page 3: Parallel program design

Embedded and Parallel Systems Lab

3

Introduction Why Use Parallel Computing?

Save time
Solve larger problems
Provide concurrency
Cost savings
The rise of multi-core CPUs

Intel® Core™2 duo Intel® Core™2 Quad AMD Opteron AMD Phenom Xbox360 PS3

Page 4: Parallel program design

Embedded and Parallel Systems Lab

4

Introduction

Parallel computing: the use of a parallel computer to reduce the time needed to solve a single computational problem.

Parallel programming: using a language that allows you to explicitly indicate how different portions of the computation may be executed concurrently by different processors.

In other words, a program is divided into n different parts that can execute at the same time; this reduces the execution time while producing the same result as the original program.

Page 5: Parallel program design

Embedded and Parallel Systems Lab

5

Introduction

Serial

Source : http://www.llnl.gov/computing/tutorials/parallel_comp

Page 6: Parallel program design

Embedded and Parallel Systems Lab

6

Introduction

Who’s Doing Parallel Computing

Page 7: Parallel program design

Embedded and Parallel Systems Lab

7

Introduction  What are they using it for?

Page 8: Parallel program design

Embedded and Parallel Systems Lab

8

Introduction

Common forms of parallelism:
Pipeline
fork
Thread
Symmetric Multiprocessors (SMP)
Cluster computing
Grid computing
  SETI@Home
  Folding@Home (PS3)

Page 9: Parallel program design

Embedded and Parallel Systems Lab

9

Introduction  Common parallel programming models, classified by memory

Distributed memory, message passing:
  PVM (Parallel Virtual Machine)
  MPI (Message Passing Interface)

Shared memory:
  DSM (distributed shared memory)
  fork / thread
  OpenMP

Page 10: Parallel program design

Embedded and Parallel Systems Lab

10

Introduction  Flynn's Classical Taxonomy

SISD: Single Instruction, Single Data
SIMD: Single Instruction, Multiple Data
MISD: Multiple Instruction, Single Data
MIMD: Multiple Instruction, Multiple Data

Page 11: Parallel program design

Embedded and Parallel Systems Lab

11

Introduction

SISD SIMD

Source : http://www.llnl.gov/computing/tutorials/parallel_comp

Page 12: Parallel program design

Embedded and Parallel Systems Lab

12

Introduction

MISD  MIMD

Source : http://www.llnl.gov/computing/tutorials/parallel_comp

Page 13: Parallel program design

Embedded and Parallel Systems Lab

13

Introduction

Amdahl's Law

ExTime_{new} = ExTime_{old} \times \left[(1 - Fraction_{enhanced}) + \frac{Fraction_{enhanced}}{Speedup_{enhanced}}\right]

Speedup_{overall} = \frac{ExTime_{old}}{ExTime_{new}} = \frac{1}{(1 - Fraction_{enhanced}) + \dfrac{Fraction_{enhanced}}{Speedup_{enhanced}}}

Best you could ever hope to do:

Speedup_{maximum} = \frac{1}{1 - Fraction_{enhanced}}
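A short worked example (the numbers are chosen here for illustration, they are not from the original slides): if 90% of the execution can be parallelized and that part is sped up by a factor of 8,

Speedup_{overall} = \frac{1}{(1 - 0.9) + 0.9/8} = \frac{1}{0.2125} \approx 4.7

while the upper bound, no matter how many processors are used, is Speedup_{maximum} = 1/(1 - 0.9) = 10.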

Page 14: Parallel program design

Embedded and Parallel Systems Lab

14

Parallel Algorithm Design

Ian Foster's four-step process for designing a parallel algorithm:
1. Partitioning
2. Communication
3. Agglomeration
4. Mapping

General principles of parallelization:
Maximize processor utilization
Minimize communication overhead
Load balancing

Page 15: Parallel program design

Embedded and Parallel Systems Lab

15

Parallel Algorithm Design

Partitioning: the process of dividing the computation and the data into pieces.
Domain decomposition
Functional decomposition

Page 16: Parallel program design

Embedded and Parallel Systems Lab

16

Parallel Algorithm Design

Communication: local communication and global communication.

Page 17: Parallel program design

Embedded and Parallel Systems Lab

17

Parallel Algorithm Design

Agglomeration: increasing locality (combining tasks that are connected by a channel eliminates that communication), and combining sending and receiving tasks.

Page 18: Parallel program design

Embedded and Parallel Systems Lab

18

Parallel Algorithm Design

Mapping: the process of assigning tasks to processors.

[Figure: a task graph with tasks A–I grouped and assigned to processors]

Page 19: Parallel program design

Embedded and Parallel Systems Lab

19

Foster’s parallel algorithm design

[Figure: starting from the problem, partitioning produces tasks A–I, communication adds channels between them, agglomeration combines D and F into one task (D&F), and mapping assigns the resulting tasks to processors]

Page 20: Parallel program design

Embedded and Parallel Systems Lab

20

Parallel Example: matrix multiplication, C = A × B

[Figure: the rows of A are divided among processes P1–P4; each process multiplies its rows by B, and the partial results are merged into C]

Page 21: Parallel program design

Embedded and Parallel Systems Lab

21

Decision Tree

Static number of tasks:
  Structured communication pattern:
    Roughly constant computation time per task — agglomerate tasks to minimize communication; create one task per processor.
    Computation time per task varies by region — cyclically map tasks to processors to balance the computational load.
  Unstructured communication pattern — use a static load-balancing algorithm.
Dynamic number of tasks:
  Frequent communication between tasks — use a dynamic load-balancing algorithm.
  Many short-lived tasks, no intertask communication — use a run-time task-scheduling algorithm.

Source : Michael J. Quinn, “Parallel Programming in C with MPI and OpenMP”

Page 22: Parallel program design

Embedded and Parallel Systems Lab

22

Parallel keywords

private data: data that is private to a process and is not affected by other processes.
shared data: data that is shared; every process can read it, and it is affected by the execution of other processes.
barrier: used for synchronization; a process that reaches the barrier waits until all processes have reached it before continuing.
reduction: combines the results computed by all processes (e.g. sum, max, min).
atomic: makes access to a memory location atomic, so the access is not affected by other processes, avoiding race conditions.
critical: a critical section; only one process may execute the region at a time, avoiding race conditions.
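To make these keywords concrete, here is a minimal OpenMP sketch in C (not from the original slides) that uses private data, a reduction, a critical section, and a barrier:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int n = 1000;
    long sum = 0;        /* combined across threads by the reduction */
    int max_id = 0;      /* shared; protected by the critical section */

    /* reduction: each thread sums into a private copy of sum,
       and the copies are combined at the end of the loop */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += i;

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();   /* private to each thread */

        /* critical: only one thread at a time updates max_id */
        #pragma omp critical
        {
            if (tid > max_id)
                max_id = tid;
        }

        /* barrier: wait until every thread has updated max_id */
        #pragma omp barrier
    }

    printf("sum = %ld, largest thread id = %d\n", sum, max_id);
    return 0;
}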

Page 23: Parallel program design

Embedded and Parallel Systems Lab

23

Thread

Page 24: Parallel program design

Embedded and Parallel Systems Lab

24

pthread

What is a Thread?
A thread is a logical flow that runs in the context of a process. Multiple threads can run concurrently in a single process. Each thread has its own thread context:
a unique integer thread ID (TID), stack, stack pointer, program counter, general-purpose registers, and condition codes.

Source : William W.-Y. Liang , “Linux System Programming”

Page 25: Parallel program design

Embedded and Parallel Systems Lab

25

Thread vs. Process
Process:
When a process executes a fork call, a new copy of the process is created with its own variables and its own PID. This new process is scheduled independently, and (in general) executes almost independently of the process that created it.
Thread:
When we create a new thread in a process, the new thread of execution gets its own stack (and hence local variables) but shares global variables, file descriptors, signal handlers, and its current directory state with the process that created it.

Source : William W.-Y. Liang , “Linux System Programming”

Page 26: Parallel program design

Embedded and Parallel Systems Lab

26

pthread Function

Function: int pthread_create(pthread_t *thread, const pthread_attr_t *attr, void *(*func)(void*), void *arg)
Purpose: create a new thread of execution.
Parameters: thread: ID of the created thread; attr: thread attribute object (NULL means default attributes); func: the thread function; arg: the argument passed to the thread function.
Return value: 0 on success, otherwise an error number.

Function: int pthread_join(pthread_t tid, void **thread_return)
Purpose: block the calling thread until the specified thread terminates.
Parameters: tid: ID of the thread to wait for; thread_return: buffer for the thread's return value.
Return value: 0 on success, otherwise an error number.

Page 27: Parallel program design

Embedded and Parallel Systems Lab

27

pthread Function

Function: void pthread_exit(void *retval)
Purpose: terminate the calling thread.
Parameters: retval: the thread's return value; if not NULL, it is delivered as thread_return in pthread_join.
Return value: none.

Function: pthread_t pthread_self(void)
Purpose: return the current thread's ID.
Parameters: none.
Return value: the thread ID (an unsigned long int).

Page 28: Parallel program design

Embedded and Parallel Systems Lab

28

Example: thread.c

#include <stdio.h>
#include <pthread.h>

char message[] = "Example: create new thread";

void *thread_function(void *arg)
{
    pthread_t tid = pthread_self();
    printf("thread_function is running\n");
    printf("new ID:%lu Argument is %s\n", (unsigned long)tid, (char*)arg);
    pthread_exit("new thread end\n");
}

int main(void)
{
    pthread_t new_thread;
    pthread_t master_thread = pthread_self();
    void *thread_result;

    pthread_create(&new_thread, NULL, thread_function, (void*)message);
    pthread_join(new_thread, &thread_result);
    printf("\nmaster ID:%lu the new thread return value is:%s\n",
           (unsigned long)master_thread, (char*)thread_result);
    return 0;
}

Page 29: Parallel program design

Embedded and Parallel Systems Lab

29

pthread Attribute

Function: int pthread_attr_init(pthread_attr_t *attr)
Purpose: initialize a thread attributes object.
Parameters: attr: thread attribute object.
Return value: 0 on success, otherwise an error number.

Function: int pthread_attr_destroy(pthread_attr_t *attr)
Purpose: destroy a thread attributes object.
Parameters: attr: thread attribute object.
Return value: 0 on success, otherwise an error number.

Page 30: Parallel program design

Embedded and Parallel Systems Lab

30

pthread Attributes

detachstate — the thread's detach state. PTHREAD_CREATE_DETACHED: all resources are released as soon as the thread ends. PTHREAD_CREATE_JOINABLE: when the thread ends, its thread ID and exit status are kept until some thread in the process calls pthread_join on it.
schedpolicy — the thread's scheduling policy. SCHED_FIFO: first in, first out. SCHED_RR: round robin. SCHED_OTHER: no priorities.
schedparam — the thread's scheduling parameters.
inheritsched — the thread's scheduling inheritance. PTHREAD_INHERIT_SCHED: the attributes are inherited from the creating thread. PTHREAD_EXPLICIT_SCHED: the attributes are taken from the thread attribute object (pthread_attr_t).
scope — the thread's scope: PTHREAD_SCOPE_SYSTEM or PTHREAD_SCOPE_PROCESS, but Linux only supports PTHREAD_SCOPE_SYSTEM.
guardsize — the thread's guard size (PAGESIZE bytes).
stackaddr — the thread's stack address.
stacksize — the thread's stack size.

Page 31: Parallel program design

Embedded and Parallel Systems Lab

31

Get pthread Attribute

int pthread_attr_getdetachstate(const pthread_attr_t *attr, int *detachstate);
int pthread_attr_getguardsize(const pthread_attr_t *attr, size_t *guardsize);
int pthread_attr_getinheritsched(const pthread_attr_t *attr, int *inheritsched);
int pthread_attr_getschedparam(const pthread_attr_t *attr, struct sched_param *param);
int pthread_attr_getschedpolicy(const pthread_attr_t *attr, int *policy);
int pthread_attr_getscope(const pthread_attr_t *attr, int *scope);
int pthread_attr_getstackaddr(const pthread_attr_t *attr, void **stackaddr);
int pthread_attr_getstacksize(const pthread_attr_t *attr, size_t *stacksize);

Page 32: Parallel program design

Embedded and Parallel Systems Lab

32

Set pthread Attribute

int pthread_attr_setdetachstate(pthread_attr_t *attr, int detachstate);
int pthread_attr_setguardsize(pthread_attr_t *attr, size_t guardsize);
int pthread_attr_setinheritsched(pthread_attr_t *attr, int inheritsched);
int pthread_attr_setschedparam(pthread_attr_t *attr, const struct sched_param *param);
int pthread_attr_setschedpolicy(pthread_attr_t *attr, int policy);
int pthread_attr_setscope(pthread_attr_t *attr, int scope);
int pthread_attr_setstackaddr(pthread_attr_t *attr, void *stackaddr);
int pthread_attr_setstacksize(pthread_attr_t *attr, size_t stacksize);
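A short sketch (not from the slides) of how these attribute calls are typically combined, here to create a detached thread with a 1 MB stack:

#include <stdio.h>
#include <pthread.h>

static void *worker(void *arg)
{
    printf("worker running\n");
    return NULL;
}

int main(void)
{
    pthread_t tid;
    pthread_attr_t attr;

    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED); /* no pthread_join needed */
    pthread_attr_setstacksize(&attr, 1024 * 1024);               /* 1 MB stack */

    pthread_create(&tid, &attr, worker, NULL);
    pthread_attr_destroy(&attr);   /* the attribute object may be destroyed after create */

    pthread_exit(NULL);            /* let the detached thread finish before the process exits */
}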


Page 35: Parallel program design

Embedded and Parallel Systems Lab

35

Reference

System Threads Reference: http://www.unix.org/version2/whatsnew/threadsref.html
Semaphore: http://www.mkssoftware.com/docs/man3/sem_init.3.asp
Richard Stones, Neil Matthew, “Beginning Linux Programming”
William W.-Y. Liang, “Linux System Programming”

Page 36: Parallel program design

Embedded and Parallel Systems Lab

36

OpenMP

Page 37: Parallel program design

Embedded and Parallel Systems Lab

37

OpenMP

OpenMP 2.5
Multi-threaded & shared memory
Fortran, C / C++

Basic syntax:
#pragma omp directive [clause]

Requirements and supported environments:
Windows: Visual Studio 2005 Standard, Intel C++ Compiler 9.1
Linux: gcc 4.2.0, Omni
Xbox 360 & PS3

Page 38: Parallel program design

Embedded and Parallel Systems Lab

38

Add #include <omp.h> at the top of the program.
In Visual Studio 2005 Standard: Project / Project Properties / Configuration Properties / C/C++ / Language, set OpenMP Support to Yes.

Page 39: Parallel program design

Embedded and Parallel Systems Lab

39

OpenMP Constructs

Page 40: Parallel program design

Embedded and Parallel Systems Lab

40

Types of Work-Sharing Constructs Loop : shares iterations of a loop

across the team. Represents a type of "data parallelism".

Source : http://www.llnl.gov/computing/tutorials/openMP/

Sections : breaks work into separate, discrete sections. Each section is executed by a thread. Can be used to implement a type of "functional parallelism".

Page 41: Parallel program design

Embedded and Parallel Systems Lab

41

Types of Work-Sharing Constructs

single: the enclosed code is executed by only one thread in the team (not necessarily the master thread).

Source : http://www.llnl.gov/computing/tutorials/openMP/

Page 42: Parallel program design

Embedded and Parallel Systems Lab

42

Loop work sharing

#pragma omp parallel for
for (int i = 0; i < 10000; i++)
    for (int j = 0; j < 100; j++)
        function(i);

is equivalent to

#pragma omp parallel
{   // the opening brace must be on its own line; it cannot follow "parallel" on the same line
    #pragma omp for
    for (int i = 0; i < 10000; i++)
        for (int j = 0; j < 100; j++)
            function(i);
}

parallel for requires the loop index to be an int and the iteration count to be known in advance.

Execution on a CPU with two threads:

Thread 0 (Master):
for (i = 0; i < 5000; i++)
    for (int j = 0; j < 100; j++)
        function(i);

Thread 1:
for (i = 5000; i < 10000; i++)
    for (int j = 0; j < 100; j++)
        function(i);
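When iterations do unequal amounts of work, the schedule clause (summarized in the clause table later) controls how iterations are handed out. A small sketch, not from the original slides; process() here is a stand-in for the slides' function(i):

#include <stdio.h>
#include <omp.h>

/* stand-in work function */
static void process(int i)
{
    printf("iteration %d handled by thread %d\n", i, omp_get_thread_num());
}

int main(void)
{
    /* dynamic scheduling: threads grab chunks of 4 iterations as they become free,
       which helps when iterations take unequal amounts of time */
    #pragma omp parallel for schedule(dynamic, 4)
    for (int i = 0; i < 32; i++)
        process(i);
    return 0;
}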

Page 43: Parallel program design

Embedded and Parallel Systems Lab

43

OpenMP example: log.cpp

#include <omp.h>

#pragma omp parallel for num_threads(2)   // divide the for loop evenly between 2 threads
for (y = 2; y < BufSizeY-2; y++)
  for (x = 2; x < BufSizeX-2; x++)
    for (z = 0; z < BufSizeBand; z++) {
      addr = (y*BufSizeX+x)*BufSizeBand+z;
      ans = (BYTE)(*(InBuf+addr))*16 +
            (BYTE)(*(InBuf+((y*BufSizeX+x+1)*BufSizeBand+z)))*(-2) +
            (BYTE)(*(InBuf+((y*BufSizeX+x-1)*BufSizeBand+z)))*(-2) +
            (BYTE)(*(InBuf+(((y+1)*BufSizeX+x)*BufSizeBand+z)))*(-2) +
            (BYTE)(*(InBuf+(((y-1)*BufSizeX+x)*BufSizeBand+z)))*(-2) +
            (BYTE)(*(InBuf+((y*BufSizeX+x+2)*BufSizeBand+z)))*(-1) +
            (BYTE)(*(InBuf+((y*BufSizeX+x-2)*BufSizeBand+z)))*(-1) +
            (BYTE)(*(InBuf+(((y+2)*BufSizeX+x)*BufSizeBand+z)))*(-1) +
            (BYTE)(*(InBuf+(((y-2)*BufSizeX+x)*BufSizeBand+z)))*(-1) +
            (BYTE)(*(InBuf+(((y+1)*BufSizeX+x+1)*BufSizeBand+z)))*(-1) +
            (BYTE)(*(InBuf+(((y+1)*BufSizeX+x-1)*BufSizeBand+z)))*(-1) +
            (BYTE)(*(InBuf+(((y-1)*BufSizeX+x+1)*BufSizeBand+z)))*(-1) +
            (BYTE)(*(InBuf+(((y-1)*BufSizeX+x-1)*BufSizeBand+z)))*(-1);
      *(OutBuf+addr) = abs(ans)/8;
    }

Page 44: Parallel program design

Embedded and Parallel Systems Lab

44

[Figure: source image and output image after the LoG conversion]

Page 45: Parallel program design

Embedded and Parallel Systems Lab

45

Sections work sharing

int main(int argc, char* argv[])
{
    #pragma omp parallel sections
    {
        #pragma omp section
        { toPNG(); }
        #pragma omp section
        { toJPG(); }
        #pragma omp section
        { toTIF(); }
    }
}

[Figure: the input image is converted to PNG, JPG, and TIF in parallel, one section per thread]

Page 46: Parallel program design

Embedded and Parallel Systems Lab

46

OpenMP notice

int Fe[10]; Fe[0] = 0;Fe[1] = 1; #pragma omp parallel for num_threads(2)for( i = 2; i < 10; ++ i )

Fe[i] = Fe[i-1] + Fe[i-2];

Data dependent

#pragma omp parallel {

#pragma omp for for( int i = 0; i < 1000000; ++ i )

sum += i; }

Race conditions
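The race condition above can be removed with the reduction clause (or critical/atomic); a minimal sketch, not from the original slides:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    long sum = 0;
    /* reduction gives each thread a private copy of sum and adds the copies
       together at the end, removing the race on the shared variable */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < 1000000; ++i)
        sum += i;
    printf("sum = %ld\n", sum);
    return 0;
}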

Page 47: Parallel program design

Embedded and Parallel Systems Lab

47

OpenMP notice

DeadLock#pragma omp parallel

private(me)

{

int me;

me = omp_get_thread_num ();

if (me == 0) goto Master;

#pragma omp barrier

Master:

#pragma omp single

write(*,*) ”done”

}

Page 48: Parallel program design

Embedded and Parallel Systems Lab

48

OpenMP example: matrix (1)

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define RANDOM_SEED 2882                         // random seed
#define VECTOR_SIZE 4                            // square matrix width, the same as the height
#define MATRIX_SIZE (VECTOR_SIZE * VECTOR_SIZE)  // total size of the matrix

int main(int argc, char *argv[])
{
    int i, j, k;
    int node_id;
    int *AA;        // sequential use & used to check whether the result is right or wrong
    int *BB;        // sequential use
    int *CC;        // sequential use
    int computing;
    int _vector_size = VECTOR_SIZE;
    int _matrix_size = MATRIX_SIZE;
    char c[10];

Page 49: Parallel program design

Embedded and Parallel Systems Lab

49

OpenMP example: matrix (2)

    if (argc > 1) {
        for (i = 1; i < argc; ) {
            if (strcmp(argv[i], "-s") == 0) {
                _vector_size = atoi(argv[i+1]);
                _matrix_size = _vector_size * _vector_size;
                i += 2;
            }
            else {
                printf("the only supported argument is:\n");
                printf("-s: the size of the vector, ex: -s 256\n");
                return 0;
            }
        }
    }

    AA = (int *)malloc(sizeof(int) * _matrix_size);
    BB = (int *)malloc(sizeof(int) * _matrix_size);
    CC = (int *)malloc(sizeof(int) * _matrix_size);

Page 50: Parallel program design

Embedded and Parallel Systems Lab

50

OpenMP example: matrix (3)

    srand(RANDOM_SEED);
    /* create matrix A and matrix B */
    for (i = 0; i < _matrix_size; i++) {
        AA[i] = rand() % 10;
        BB[i] = rand() % 10;
    }

    /* computing C = A * B */
    #pragma omp parallel for private(computing, j, k)
    for (i = 0; i < _vector_size; i++) {
        for (j = 0; j < _vector_size; j++) {
            computing = 0;
            for (k = 0; k < _vector_size; k++)
                computing += AA[i*_vector_size + k] * BB[k*_vector_size + j];
            CC[i*_vector_size + j] = computing;
        }
    }

Page 51: Parallel program design

Embedded and Parallel Systems Lab

51

OpenMP example: matrix (4)

    printf("\nVector_size:%d\n", _vector_size);
    printf("Matrix_size:%d\n", _matrix_size);
    printf("Processing time:%f\n", time);

    return 0;
}

Page 52: Parallel program design

Embedded and Parallel Systems Lab

52

OpenMP Directive Table

atomic — Specifies a memory location that will be updated atomically.
barrier — Synchronizes all threads in a team; all threads pause at the barrier until all threads execute the barrier.
critical — Specifies that code is only executed on one thread at a time.
flush — Specifies that all threads have the same view of memory for all shared objects.
for — Causes the work done in a for loop inside a parallel region to be divided among threads.
master — Specifies that only the master thread should execute a section of the program.
ordered — Specifies that code under a parallelized for loop should be executed like a sequential loop.
parallel — Defines a parallel region, which is code that will be executed by multiple threads in parallel.
sections — Identifies code sections to be divided among all threads.
single — Lets you specify that a section of code should be executed on a single thread, not necessarily the master thread.
threadprivate — Specifies that a variable is private to a thread.

Source :http://msdn2.microsoft.com/zh-tw/library/0ca2w8dk(VS.80).aspx

Page 53: Parallel program design

Embedded and Parallel Systems Lab

53

OpenMP Clause Table

copyin — Allows threads to access the master thread's value for a threadprivate variable.
copyprivate — Specifies that one or more variables should be shared among all threads.
default — Specifies the behavior of unscoped variables in a parallel region.
firstprivate — Specifies that each thread should have its own instance of a variable, and that the variable should be initialized with the value it has before the parallel construct.
if — Specifies whether a loop should be executed in parallel or in serial.
lastprivate — Specifies that the enclosing context's version of the variable is set equal to the private version of whichever thread executes the final iteration (for-loop construct) or last section (#pragma sections).
nowait — Overrides the barrier implicit in a directive.
num_threads — Sets the number of threads in a thread team.
ordered — Required on a parallel for statement if an ordered directive is to be used in the loop.
private — Specifies that each thread should have its own instance of a variable.
reduction — Specifies that one or more variables that are private to each thread are the subject of a reduction operation at the end of the parallel region.
schedule — Applies to the for directive. Has four methods: static, dynamic, guided, runtime.
shared — Specifies that one or more variables should be shared among all threads.

Source :http://msdn2.microsoft.com/zh-tw/library/0ca2w8dk(VS.80).aspx
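To make a few of these clauses concrete, a small sketch (not from the slides) combining num_threads, firstprivate, and lastprivate:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int base = 100;      /* copied into each thread by firstprivate */
    int last = 0;        /* written back from the final iteration by lastprivate */

    #pragma omp parallel for num_threads(4) firstprivate(base) lastprivate(last)
    for (int i = 0; i < 8; i++)
        last = base + i; /* after the loop, last == base + 7 */

    printf("last = %d\n", last);
    return 0;
}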

Page 54: Parallel program design

Embedded and Parallel Systems Lab

54

Reference

Michael J. Quinn, “Parallel Programming in C with MPI and OpenMP”
Introduction to Parallel Computing: http://www.llnl.gov/computing/tutorials/parallel_comp/
OpenMP standard: http://www.openmp.org/drupal/
OpenMP MSDN tutorial: http://msdn2.microsoft.com/en-us/library/tt15eb9t(VS.80).aspx
OpenMP tutorial: http://www.llnl.gov/computing/tutorials/openMP/#DO
Kang Su Gatlin, Pete Isensee, “Reap the Benefits of Multithreading without All the Work”, MSDN Magazine

Page 55: Parallel program design

Embedded and Parallel Systems Lab

55

MPI

Page 56: Parallel program design

Embedded and Parallel Systems Lab

56

MPI

MPI is a language-independent communications protocol used to program parallel computers.
Distributed memory
SPMD (Single Program Multiple Data)
Fortran, C / C++

Page 57: Parallel program design

Embedded and Parallel Systems Lab

57

MPI requirements and supported environments

Cluster environment
Windows:
  Microsoft AD (Active Directory) server
  Microsoft cluster server
Linux:
  NFS (Network File System)
  NIS (Network Information Service), also known as yellow pages
  SSH
  MPICH2

Page 58: Parallel program design

Embedded and Parallel Systems Lab

58

Installing MPI

Download mpich2-1.0.4p1.tar.gz from http://www-unix.mcs.anl.gov/mpi/mpich/

[shell]# tar -zxvf mpich2-1.0.4p1.tar.gz
[shell]# mkdir /home/yourhome/mpich2
[shell]# cd mpich2-1.0.4p1
[shell]# ./configure --prefix=/home/yourhome/mpich2   // installing into a directory you create yourself is recommended
[shell]# make
[shell]# make install

Next:

[shell]# cd ~yourhome        // go to your own home directory
[shell]# vi .mpd.conf        // create the file

with the content:
secretword=<secretword>   (choose any secretword you like)
Ex:
secretword=abcd1234

Page 59: Parallel program design

Embedded and Parallel Systems Lab

59

Installing MPI

[shell]# chmod 600 .mpd.conf
[shell]# vi .bash_profile

Change PATH=$PATH:$HOME/bin
to     PATH=$HOME/mpich2/bin:$PATH:$HOME/bin
and log in to the server again.

[shell]# vi mpd.hosts        // create the hosts list file in your home directory

ex:
cluster1
cluster2
cluster3
cluster4

Page 60: Parallel program design

Embedded and Parallel Systems Lab

60

MPI constructs

Point-to-Point Communication:
  Blocking: MPI_Send(), MPI_Recv()
  Nonblocking: MPI_Isend(), MPI_Irecv()
Collective Communication:
  Synchronization: MPI_Barrier()
  Data Exchange: MPI_Bcast(), MPI_Scatter(), MPI_Gather(), MPI_Alltoall()
  Collective Computation: MPI_Reduce()
Process Group:
  MPI_Comm_group(), MPI_Comm_create(), MPI_Group_incl(), MPI_Group_rank(), MPI_Group_size(), MPI_Comm_free()
Virtual Topology:
  MPI_Cart_create(), MPI_Cart_coords(), MPI_Cart_shift()

Page 61: Parallel program design

Embedded and Parallel Systems Lab

61

Basic structure of an MPI program

#include "mpi.h"
...
MPI_Init(&argc, &argv);
...
/* do some work or call MPI functions, e.g. MPI_Send() / MPI_Recv() */
...
MPI_Finalize();
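Filling in that skeleton, a minimal complete program (a sketch, not from the slides) looks like this:

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);                 /* start the MPI environment */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's ID */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of processes */

    printf("Hello from process %d of %d\n", rank, size);

    MPI_Finalize();                         /* shut down MPI */
    return 0;
}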

Page 62: Parallel program design

Embedded and Parallel Systems Lab

62

MPI Ethernet Control and Data Flow

Source : Douglas M. Pase, “Performance of Voltaire InfiniBand in IBM 64-Bit Commodity HPC Clusters,” IBM WhitePapers, 2005

Page 63: Parallel program design

Embedded and Parallel Systems Lab

63

MPI Communicator

[Figure: nine processes with ranks 0–8 inside the default communicator MPI_COMM_WORLD]

Page 64: Parallel program design

Embedded and Parallel Systems Lab

64

MPI Function

Function: int MPI_Init(int *argc, char ***argv)
Purpose: start the MPI execution environment; must be called before any other MPI function, and can pass main's command-line arguments (argc, argv) to all processes.
Parameters: argc: number of arguments; argv: the argument values.
Return value: int; returns MPI_SUCCESS (0) on success.

Function: int MPI_Finalize()
Purpose: terminate the MPI execution environment; must be called after all the work is done.
Parameters: none.
Return value: int; returns MPI_SUCCESS (0) on success.

Page 65: Parallel program design

Embedded and Parallel Systems Lab

65

MPI Function

Function: int MPI_Comm_size(MPI_Comm comm, int *size)
Purpose: get the total number of processes in the communicator.
Parameters: comm: IN, MPI_COMM_WORLD; size: OUT, the total number of processes.
Return value: int; returns MPI_SUCCESS (0) on success.

Function: int MPI_Comm_rank(MPI_Comm comm, int *rank)
Purpose: get this process's own process ID (rank).
Parameters: comm: IN, MPI_COMM_WORLD; rank: OUT, the current process's ID.
Return value: int; returns MPI_SUCCESS (0) on success.

Page 66: Parallel program design

Embedded and Parallel Systems Lab

66

MPI Function

Function: int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
Purpose: send data to the specified process, using the standard mode.
Parameters: buf: IN, the data (variable) to send; count: IN, how many elements to send; datatype: IN, the datatype of the data; dest: IN, the destination process ID; tag: IN, the message tag (channel); comm: IN, MPI_COMM_WORLD.
Return value: int; returns MPI_SUCCESS (0) on success.

Function: int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
Purpose: receive data from the specified process.
Parameters: buf: OUT, buffer for the received data; count: IN, how many elements to receive; datatype: IN, the datatype of the data; source: IN, the source process ID; tag: IN, the message tag (channel); comm: IN, MPI_COMM_WORLD; status: OUT, the MPI_Status.
Return value: int; returns MPI_SUCCESS (0) on success.

Page 67: Parallel program design

Embedded and Parallel Systems Lab

67

MPI Function

Status: indicates the source process ID and the tag of the message; in C it is the MPI_Status data type:

typedef struct MPI_Status {
    int count;
    int cancelled;
    int MPI_SOURCE;   // source process ID
    int MPI_TAG;      // tag sent by the source
    int MPI_ERROR;    // error code
} MPI_Status;

Function: double MPI_Wtime()
Purpose: returns the current time in seconds (as a floating-point number); usually used to measure how long a program runs.
Parameters: none.
Return value: double, the time.

Page 68: Parallel program design

Embedded and Parallel Systems Lab

68

MPI Function

Function: int MPI_Type_commit(MPI_Datatype *datatype)
Purpose: commit the datatype so it can be used in communication.
Parameters: datatype: INOUT, the new datatype.
Return value: int; returns MPI_SUCCESS (0) on success.

Function: int MPI_Type_free(MPI_Datatype *datatype)
Purpose: free the datatype.
Parameters: datatype: INOUT, the datatype to free.
Return value: int; returns MPI_SUCCESS (0) on success.

Page 69: Parallel program design

Embedded and Parallel Systems Lab

69

MPI Function

Function: int MPI_Type_contiguous(int count, MPI_Datatype oldtype, MPI_Datatype *newtype)
Purpose: build a new datatype by simply resizing an existing one (MPI_Datatype), i.e. combining several elements of the same type into one.
Parameters: count: IN, the size of the new type (how many oldtype elements it is made of); oldtype: IN, the existing datatype (MPI_Datatype); newtype: OUT, the new datatype.
Return value: int; returns MPI_SUCCESS (0) on success.

Page 70: Parallel program design

Embedded and Parallel Systems Lab

70

Steps for writing and running a program

1. Start the MPI environment
   mpdboot -n 4 -f mpd.hosts   // -n is the number of machines to start; mpd.hosts is the machine list
2. Write the MPI program
   vi hello.c
3. Compile
   mpicc hello.c -o hello.o
4. Run the program
   mpiexec -n 4 ./hello.o      // -n is the number of processes
5. Shut down MPI
   mpdallexit

Page 71: Parallel program design

Embedded and Parallel Systems Lab

71

MPI example: hello.c

#include "mpi.h"
#include <stdio.h>
#include <string.h>
#define SIZE 20

int main(int argc, char *argv[])
{
    int numtasks, rank, dest, source, rc, count, tag = 1;
    char inmsg[SIZE];
    char outmsg[SIZE];
    double starttime, endtime;
    MPI_Status Stat;
    MPI_Datatype strtype;

    MPI_Init(&argc, &argv);                         // start the MPI environment
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);           // get this process's ID

    MPI_Type_contiguous(SIZE, MPI_CHAR, &strtype);  // define the new datatype "string"
    MPI_Type_commit(&strtype);                      // commit the new datatype "string"

    starttime = MPI_Wtime();                        // get the current time

Page 72: Parallel program design

Embedded and Parallel Systems Lab

72

MPI example: hello.c

    if (rank == 0) {
        dest = 1;
        source = 1;
        strcpy(outmsg, "Who are you?");
        // send the message to process 1
        rc = MPI_Send(outmsg, 1, strtype, dest, tag, MPI_COMM_WORLD);
        printf("process %d has sent message: %s\n", rank, outmsg);
        // receive the message from process 1
        rc = MPI_Recv(inmsg, 1, strtype, source, tag, MPI_COMM_WORLD, &Stat);
        printf("process %d has received: %s\n", rank, inmsg);
    }
    else if (rank == 1) {
        dest = 0;
        source = 0;
        strcpy(outmsg, "I am process 1");
        rc = MPI_Recv(inmsg, 1, strtype, source, tag, MPI_COMM_WORLD, &Stat);
        printf("process %d has received: %s\n", rank, inmsg);
        rc = MPI_Send(outmsg, 1, strtype, dest, tag, MPI_COMM_WORLD);
        printf("process %d has sent message: %s\n", rank, outmsg);
    }

Page 73: Parallel program design

Embedded and Parallel Systems Lab

73

MPI example: hello.c

    endtime = MPI_Wtime();   // get the end time

    // use MPI_CHAR to count how many elements were actually received
    rc = MPI_Get_count(&Stat, MPI_CHAR, &count);
    printf("Task %d: Received %d char(s) from task %d with tag %d and used time %f\n",
           rank, count, Stat.MPI_SOURCE, Stat.MPI_TAG, endtime - starttime);

    MPI_Type_free(&strtype); // free the "string" datatype
    MPI_Finalize();          // shut down MPI
}

Output:
process 0 has sent message: Who are you?
process 1 has received: Who are you?
process 1 has sent message: I am process 1
Task 1: Received 20 char(s) from task 0 with tag 1 and used time 0.001302
process 0 has received: I am process 1
Task 0: Received 20 char(s) from task 1 with tag 1 and used time 0.002133

Page 74: Parallel program design

Embedded and Parallel Systems Lab

74

OpenMP vs. MPI

              OpenMP   MPI   DSM
private data  Yes      Yes   Yes
shared data   Yes      No    Yes
barrier       Yes      Yes   Yes
reduction     Yes      Yes   Yes / No
atomic        Yes      No    Yes / No
critical      Yes      No    Yes

Page 75: Parallel program design

Embedded and Parallel Systems Lab

75

Collective Communication Routines

Function: int MPI_Barrier(MPI_Comm comm)
Purpose: a process that reaches the barrier blocks and waits until all other processes have also reached it; once every process in the group has arrived, the block is released and execution continues.
Parameters: comm: IN, MPI_COMM_WORLD.
Return value: int; returns MPI_SUCCESS (0) on success.

Types of Collective Operations:
Synchronization: processes wait until all members of the group have reached the synchronization point.
Data Movement: broadcast, scatter/gather, all to all.
Collective Computation (reductions): one member of the group collects data from the other members and performs an operation (min, max, add, multiply, etc.) on that data.

Page 76: Parallel program design

Embedded and Parallel Systems Lab

76

MPI_Bcast

Function: int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm)
Purpose: broadcast a message so that every process receives the same message.
Parameters: buffer: INOUT, the message to send, and also the receive buffer; count: IN, how many elements to send; datatype: IN, the datatype; root: IN, the process responsible for sending the message; comm: IN, MPI_COMM_WORLD.
Return value: int; returns MPI_SUCCESS (0) on success.
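A short usage sketch (not from the slides): process 0 broadcasts a value that every process then sees.

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        value = 42;                                   /* only the root has the value initially */

    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD); /* after this call every rank has 42 */
    printf("rank %d sees value %d\n", rank, value);

    MPI_Finalize();
    return 0;
}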

Page 77: Parallel program design

Embedded and Parallel Systems Lab

77

MPI_Gather

Function: int MPI_Gather(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)
Purpose: collect the messages sent by all processes and deliver them, combined, to the specified receiving process.
Parameters: sendbuf: IN, the message to send; sendcount: IN, how many elements to send; sendtype: IN, the send datatype; recvbuf: OUT, the receive buffer; recvcount: IN, how many elements to receive from each process; recvtype: IN, the receive datatype; root: IN, the process that receives the gathered data; comm: IN, MPI_COMM_WORLD.
Return value: int; returns MPI_SUCCESS (0) on success.
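A minimal sketch (not from the slides): every rank contributes one int and rank 0 gathers them into an array.

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;
    int mine, all[64];                     /* assumes at most 64 processes */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    mine = rank * rank;                    /* each rank's contribution */
    MPI_Gather(&mine, 1, MPI_INT, all, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0)
        for (int i = 0; i < size; i++)
            printf("from rank %d: %d\n", i, all[i]);

    MPI_Finalize();
    return 0;
}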

Page 78: Parallel program design

Embedded and Parallel Systems Lab

78

[Figure: MPI_Gather — data from all processes is collected at the root process]

Page 79: Parallel program design

Embedded and Parallel Systems Lab

79

MPI_Allgather

Function: int MPI_Allgather(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, MPI_Comm comm)
Purpose: collect the messages sent by all processes and broadcast the combined result to all processes.
Parameters: sendbuf: IN, the message to send; sendcount: IN, how many elements to send; sendtype: IN, the send datatype; recvbuf: OUT, the receive buffer; recvcount: IN, how many elements to receive from each process; recvtype: IN, the receive datatype; comm: IN, MPI_COMM_WORLD.
Return value: int; returns MPI_SUCCESS (0) on success.

Page 80: Parallel program design

Embedded and Parallel Systems Lab

80

[Figure: MPI_Allgather — data from all processes is gathered and distributed back to every process]

Page 81: Parallel program design

Embedded and Parallel Systems Lab

81

MPI_Reduce

Function: int MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)
Purpose: perform an operation while gathering (e.g. MPI_SUM to add everything up) and deliver the result to the root process.
Parameters: sendbuf: IN, the message to send; recvbuf: OUT, the receive buffer; count: IN, how many elements to send/receive; datatype: IN, the datatype; op: IN, the operation to perform; root: IN, the process ID that receives the result; comm: IN, MPI_COMM_WORLD.
Return value: int; returns MPI_SUCCESS (0) on success.
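A minimal sketch (not from the slides): each rank contributes a partial value and rank 0 receives the total via MPI_SUM.

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, local, total = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    local = rank + 1;   /* each rank's partial value */
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)      /* total = 1 + 2 + ... + size */
        printf("sum over %d ranks = %d\n", size, total);

    MPI_Finalize();
    return 0;
}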

Page 82: Parallel program design

Embedded and Parallel Systems Lab

82

MPI_Reduce

MPI Reduction Operation   Meaning                  C Data Types
MPI_MAX                   maximum                  integer, float
MPI_MIN                   minimum                  integer, float
MPI_SUM                   sum                      integer, float
MPI_PROD                  product                  integer, float
MPI_LAND                  logical AND              integer
MPI_BAND                  bit-wise AND             integer, MPI_BYTE
MPI_LOR                   logical OR               integer
MPI_BOR                   bit-wise OR              integer, MPI_BYTE
MPI_LXOR                  logical XOR              integer
MPI_BXOR                  bit-wise XOR             integer, MPI_BYTE
MPI_MAXLOC                max value and location   float, double and long double
MPI_MINLOC                min value and location   float, double and long double

Page 83: Parallel program design

Embedded and Parallel Systems Lab

83

MPI example: matrix.c (1)

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define RANDOM_SEED 2882   // random seed
#define MATRIX_SIZE 800    // square matrix width, the same as the height
#define NODES 4            // the number of nodes; minimum is 1, don't use < 1
#define TOTAL_SIZE (MATRIX_SIZE * MATRIX_SIZE)   // total size of the matrix
#define CHECK

int main(int argc, char *argv[])
{
    int i, j, k;
    int node_id;
    int AA[MATRIX_SIZE][MATRIX_SIZE];
    int BB[MATRIX_SIZE][MATRIX_SIZE];
    int CC[MATRIX_SIZE][MATRIX_SIZE];

Page 84: Parallel program design

Embedded and Parallel Systems Lab

84

MPI example: matrix.c (2)

#ifdef CHECK
    int _CC[MATRIX_SIZE][MATRIX_SIZE];   // sequential version, used to check the parallel result CC
#endif
    int check = 1;
    int print = 0;
    int computing = 0;
    double time, seqtime;
    int numtasks;
    int tag = 1;
    int node_size;
    MPI_Status stat;
    MPI_Datatype rowtype;

    srand(RANDOM_SEED);

Page 85: Parallel program design

Embedded and Parallel Systems Lab

85

MPI example: matrix.c (3)

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &node_id);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

    if (numtasks != NODES) {
        printf("Must specify %d processors. Terminating.\n", NODES);
        MPI_Finalize();
        return 0;
    }
    if (MATRIX_SIZE % NODES != 0) {
        printf("MATRIX_SIZE must be divisible by NODES (%d)\n", NODES);
        MPI_Finalize();
        return 0;
    }

    MPI_Type_contiguous(MATRIX_SIZE, MPI_INT, &rowtype);   // one row of the int matrix
    MPI_Type_commit(&rowtype);

Page 86: Parallel program design

Embedded and Parallel Systems Lab

86

MPI example: matrix.c (4)

    /* create matrix A and matrix B */
    if (node_id == 0) {
        for (i = 0; i < MATRIX_SIZE; i++) {
            for (j = 0; j < MATRIX_SIZE; j++) {
                AA[i][j] = rand() % 10;
                BB[i][j] = rand() % 10;
            }
        }
    }

    /* send matrix A and matrix B to the other nodes */
    node_size = MATRIX_SIZE / NODES;

Page 87: Parallel program design

Embedded and Parallel Systems Lab

87

MPI example: matrix.c (5)

    // send AA
    if (node_id == 0)
        for (i = 1; i < NODES; i++)
            MPI_Send(&AA[i*node_size][0], node_size, rowtype, i, tag, MPI_COMM_WORLD);
    else
        MPI_Recv(&AA[node_id*node_size][0], node_size, rowtype, 0, tag, MPI_COMM_WORLD, &stat);

    // send BB
    if (node_id == 0)
        for (i = 1; i < NODES; i++)
            MPI_Send(&BB, MATRIX_SIZE, rowtype, i, tag, MPI_COMM_WORLD);
    else
        MPI_Recv(&BB, MATRIX_SIZE, rowtype, 0, tag, MPI_COMM_WORLD, &stat);

Page 88: Parallel program design

Embedded and Parallel Systems Lab

88

MPI example: matrix.c (6)

    /* computing C = A * B */
    time = -MPI_Wtime();
    for (i = node_id*node_size; i < (node_id*node_size + node_size); i++) {
        for (j = 0; j < MATRIX_SIZE; j++) {
            computing = 0;
            for (k = 0; k < MATRIX_SIZE; k++)
                computing += AA[i][k] * BB[k][j];
            CC[i][j] = computing;
        }
    }
    MPI_Allgather(&CC[node_id*node_size][0], node_size, rowtype,
                  &CC, node_size, rowtype, MPI_COMM_WORLD);
    time += MPI_Wtime();

Page 89: Parallel program design

Embedded and Parallel Systems Lab

89

MPI example: matrix.c (7)

#ifdef CHECK
    seqtime = -MPI_Wtime();
    if (node_id == 0) {
        for (i = 0; i < MATRIX_SIZE; i++) {
            for (j = 0; j < MATRIX_SIZE; j++) {
                computing = 0;
                for (k = 0; k < MATRIX_SIZE; k++)
                    computing += AA[i][k] * BB[k][j];
                _CC[i][j] = computing;
            }
        }
    }
    seqtime += MPI_Wtime();

Page 90: Parallel program design

Embedded and Parallel Systems Lab

90

    /* check the result */
    if (node_id == 0) {
        for (i = 0; i < MATRIX_SIZE; i++) {
            for (j = 0; j < MATRIX_SIZE; j++) {
                if (CC[i][j] != _CC[i][j]) {
                    check = 0;
                    break;
                }
            }
        }
    }

Page 91: Parallel program design

Embedded and Parallel Systems Lab

91

MPI example: matrix.c (8)

    /* print the result */
#endif
    if (node_id == 0) {
        printf("node_id=%d\ncheck=%s\nprocessing time:%f\n\n",
               node_id, (check) ? "success!" : "failure!", time);
#ifdef CHECK
        printf("sequential time:%f\n", seqtime);
#endif
    }

    MPI_Type_free(&rowtype);
    MPI_Finalize();
    return 0;
}

Page 92: Parallel program design

Embedded and Parallel Systems Lab

92

Reference

Top 500: http://www.top500.org/
Maarten Van Steen, Andrew S. Tanenbaum, “Distributed Systems: Principles and Paradigms”
System Threads Reference: http://www.unix.org/version2/whatsnew/threadsref.html
Semaphore: http://www.mkssoftware.com/docs/man3/sem_init.3.asp
Richard Stones, Neil Matthew, “Beginning Linux Programming”
W. Richard Stevens, “Networking APIs: Sockets and XTI”
William W.-Y. Liang, “Linux System Programming”
Michael J. Quinn, “Parallel Programming in C with MPI and OpenMP”
Introduction to Parallel Computing: http://www.llnl.gov/computing/tutorials/parallel_comp/

Page 93: Parallel program design

Embedded and Parallel Systems Lab

93

Reference

Michael J. Quinn, “Parallel Programming in C with MPI and OpenMP”
Introduction to Parallel Computing: http://www.llnl.gov/computing/tutorials/parallel_comp/
MPI standard: http://www-unix.mcs.anl.gov/mpi/
MPI tutorial: http://www.llnl.gov/computing/tutorials/mpi/

Page 94: Parallel program design

Embedded and Parallel Systems Lab

94

Conclusion

Coming up with a good parallel algorithm is very difficult.
Development and debugging tools are generally lacking.
A new generation of languages: IBM's X10, Sun's Fortress, Cray's Chapel.
X10 is a language that extends Java 1.4.

async (place.factory.place(1)) {
    for (int i = 1; i <= 10; i += 2)
        ans += i;
}

Page 95: Parallel program design

Embedded and Parallel Systems Lab

95

Reference

Top 500: http://www.top500.org/
Maarten Van Steen, Andrew S. Tanenbaum, “Distributed Systems: Principles and Paradigms”
System Threads Reference: http://www.unix.org/version2/whatsnew/threadsref.html
Semaphore: http://www.mkssoftware.com/docs/man3/sem_init.3.asp
Richard Stones, Neil Matthew, “Beginning Linux Programming”
W. Richard Stevens, “Networking APIs: Sockets and XTI”
William W.-Y. Liang, “Linux System Programming”
Michael J. Quinn, “Parallel Programming in C with MPI and OpenMP”
Introduction to Parallel Computing: http://www.llnl.gov/computing/tutorials/parallel_comp/

Page 96: Parallel program design

Embedded and Parallel Systems Lab

96

Reference

MPI standard: http://www-unix.mcs.anl.gov/mpi/
MPI tutorial: http://www.llnl.gov/computing/tutorials/mpi/
OpenMP standard: http://www.openmp.org/drupal/
OpenMP MSDN tutorial: http://msdn2.microsoft.com/en-us/library/tt15eb9t(VS.80).aspx
OpenMP tutorial: http://www.llnl.gov/computing/tutorials/openMP/#DO
Kang Su Gatlin, Pete Isensee, “Reap the Benefits of Multithreading without All the Work”, MSDN Magazine
Gary Anthes, “Languages for Supercomputing Get 'Suped' Up”, Computerworld, March 12, 2007
IBM X10 research: http://domino.research.ibm.com/comm/research_projects.nsf/pages/x10.X10-presentations.html

Page 97: Parallel program design

Embedded and Parallel Systems Lab

97

The End

Thank you very much!

Page 98: Parallel program design

Embedded and Parallel Systems Lab

98

Appendix

Pipeline

Page 99: Parallel program design

Embedded and Parallel Systems Lab

99

Pipeline