A COMPARISON MPI vs POSIX Threads


Page 1

A COMPARISON

MPI vs POSIX Threads

Page 2

Overview

MPI allows you to run multiple processes on 1 host. How would running MPI on 1 host compare with a POSIX threads solution?

Attempting to compare MPI vs POSIX run times.

Hardware:

Dual 6-core (2 threads per core), 12 logical: http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/AboutRage.txt

Intel Xeon CPU E5-2667 (see schematic): http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/xeon-e5-v2-datasheet-vol-1.pdf

2.96 GHz, 15 MB L3 cache

All code / output / analysis available here: http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/

Page 3

Specifics

Going to compare run times of code written in MPI vs code written using POSIX threads and shared memory.

Try to make the code as similar as possible, so we're comparing apples with oranges and not apples with monkeys.

Since we are on 1 machine, the bus is carrying all the communication traffic, which should make the POSIX and MPI versions similar (i.e. the network doesn't get involved). This comparison only makes sense on 1 machine.

Set up a test bed: try each step individually, check results, then automate.

Use the matrix-matrix multiply code we developed over the semester; everyone is familiar with the code and can make observations.
http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/pthread_matrix_21.c
http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/matmat_3.c
http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/matmat_no_mp.c

Use square matrices. Vary matrix sizes from 500 x 500 to 10,000 x 10,000 (plus a couple of big ones).

Matrix A will be filled with 1..n left to right and top down; matrix B will be the identity matrix. We can then check our results easily, since A*B = A when B is the identity matrix (a minimal sketch of this check follows below).
http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/mat_500_result.txt

Ran each step (compile / output result / parsing) many times and checked the results before writing the final scripts to do the processing.
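A minimal single-threaded sketch (my own illustration, not the posted course code) of the setup and check described above: A is filled 1..n*n row by row, B is the identity, so C = A*B must come back equal to A.

/* Sketch of the test setup: A filled 1..n*n left-to-right, top-down;
 * B is the identity matrix, so the product C = A*B must equal A,
 * which makes result checking trivial. Illustration only. */
#include <stdio.h>
#include <stdlib.h>

#define N 500   /* smallest size used in the runs */

int main(void)
{
    double *A = malloc((size_t)N * N * sizeof *A);
    double *B = calloc((size_t)N * N, sizeof *B);
    double *C = calloc((size_t)N * N, sizeof *C);

    for (long k = 0; k < (long)N * N; k++)
        A[k] = (double)(k + 1);          /* 1..n*n, row by row */
    for (int i = 0; i < N; i++)
        B[i * N + i] = 1.0;              /* identity matrix    */

    for (int i = 0; i < N; i++)          /* naive C = A * B    */
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i * N + j] += A[i * N + k] * B[k * N + j];

    for (long k = 0; k < (long)N * N; k++)
        if (C[k] != A[k]) { printf("mismatch at %ld\n", k); return 1; }
    printf("OK: A*B == A\n");
    free(A); free(B); free(C);
    return 0;
}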

Page 4

Matrix Sizes

MATRIX SIZE   NUM ELEMENTS   LOOP CALCULATIONS (N multiplies, N-1 adds)
500           250000         249750000
600           360000         431640000
700           490000         685510000
800           640000         1023360000
900           810000         1457190000
1000          1000000        1999000000
1100          1210000        2660790000
1200          1440000        3454560000
1300          1690000        4392310000
1400          1960000        5486040000
1500          2250000        6747750000
1600          2560000        8189440000
1700          2890000        9823110000
1800          3240000        11660760000
1900          3610000        13714390000
2000          4000000        15996000000
2100          4410000        18517590000
2200          4840000        21291160000
2300          5290000        24328710000
2400          5760000        27642240000
2500          6250000        31243750000
2600          6760000        35145240000
2700          7290000        39358710000
2800          7840000        43896160000
2900          8410000        48769590000
3000          9000000        53991000000
4000          16000000       1.27984E+11
5000          25000000       2.49975E+11
6000          36000000       4.31964E+11
7000          49000000       6.85951E+11
8000          64000000       1.02394E+12
9000          81000000       1.45792E+12
10000         100000000      1.9999E+12

Third column: just the number of calculations inside the loop for computing the matrix elements. For an n x n matrix, each of the n^2 output elements costs n multiplies and n-1 adds, so the total is n^2 * (2n - 1); a small sketch that reproduces this column follows below.
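A small sketch (a hypothetical helper of mine, not part of the project code) that reproduces the third column from the first:

/* Reproduce the LOOP CALCULATIONS column: for an n x n matrix, each of the
 * n*n output elements costs n multiplies and n-1 adds. Illustration only. */
#include <stdio.h>

int main(void)
{
    int sizes[] = {500, 1000, 2000, 3000, 5000, 10000};
    for (int i = 0; i < 6; i++) {
        long long n = sizes[i];
        long long elements = n * n;
        double ops = (double)elements * (2.0 * n - 1.0);  /* n muls + (n-1) adds */
        printf("%6lld  %12lld  %.6g\n", n, elements, ops);
    }
    return 0;
}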

Page 5

Specifics cont.

About the runs:

For each matrix size (500 -> 3000, then 4000, 5000, 6000, 7000, 8000, 9000, 10000), vary the thread count from 2 to 12 (POSIX) and the process count from 2 to 12 (MPI). Run 10 trials of each and take the average (the machine is mostly idle when not running tests, but we want to smooth spikes in run times caused by the system doing routine tasks).

Make observations about anomalies in the run times where appropriate.

Caveats:

All initial runs were made with no optimization, for testing; but hey, this is a class about performance. A second set of runs was made with optimization turned on, -O1 (note: -O2 and -O3 made no appreciable difference).

First-level optimization made a huge difference: more than a 3x improvement. The GNU optimization explanation can be found here: http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

Built with the individual -O1 flags to see if I could catch the "one" making the most difference (nope; the code isn't that complicated, and not all optimizations are flag controlled).

Regardless of whether the code is written in the most efficient fashion (and it's not), because of the similarity between the versions we can make some runs and observations.

Oh No moment ** Huge improvement in performance with optimized code. Why? What if the improvement (from compiler optimization) was due to the identity matrix? Maybe the compiler found a clever way to increase the speed because of the simple math, and it's not really doing all the calculations I thought it was? Came back and made matrix B non-identity: same performance. Whew.

I now believe the main performance improvement came from loop unrolling (a hand-unrolled illustration follows below). Ready to make the runs.
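To illustrate the loop-unrolling hypothesis, here is a hand-unrolled version of the inner dot-product loop. This is only my sketch of what an optimizer might do; it is not a dump of what GCC emits, and aggressive unrolling is not guaranteed at plain -O1.

/* Illustration only: a hand-unrolled inner dot product for row-major
 * n x n matrices of doubles. Not the project code, not GCC's output. */
#include <stdio.h>
#include <stdlib.h>

static double dot_row_col(const double *A, const double *B, int n, int i, int j)
{
    double sum = 0.0;
    int k = 0;
    for (; k + 4 <= n; k += 4)            /* 4 multiply-adds per iteration */
        sum += A[i * n + k]     * B[ k      * n + j]
             + A[i * n + k + 1] * B[(k + 1) * n + j]
             + A[i * n + k + 2] * B[(k + 2) * n + j]
             + A[i * n + k + 3] * B[(k + 3) * n + j];
    for (; k < n; k++)                    /* remainder if n is not a multiple of 4 */
        sum += A[i * n + k] * B[k * n + j];
    return sum;
}

int main(void)
{
    int n = 6;                            /* tiny demo size */
    double *A = malloc((size_t)n * n * sizeof *A);
    double *B = calloc((size_t)n * n, sizeof *B);
    for (int k = 0; k < n * n; k++) A[k] = k + 1;
    for (int i = 0; i < n; i++) B[i * n + i] = 1.0;   /* identity */
    printf("C[2][3] = %g (expect %g)\n", dot_row_col(A, B, n, 2, 3), A[2 * n + 3]);
    free(A); free(B);
    return 0;
}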

Page 6

Discussion

Please chime in as questions come up.

Process explanation (after initial testing and verification):
http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/process_explanation.txt

Attempted a 25,000 x 25,000 matrix. Compile error for the MPI version (exceeded the MPI_Bcast 2 GB limit on matrices; a sketch of this limit follows below): http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/BadCompileMPI.txt
Not an issue for POSIX threads, until you run out of memory on the machine and start swapping.

Settled on 12 processes / threads because of the number of cores available. Do you get enhanced or degraded performance by exceeding that number?
http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/POSIX_MANY_THREADS.txt

Example of process space / top output (10,000 x 10,000), from early testing before the runs started, pre-optimization:
http://web.cs.sunyit.edu/~rahnb1/CS523/final_project/RESULTS/RageTestRun_Debug_CPU_Usage.txt
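A back-of-the-envelope sketch (my own, plain C, assuming the relevant constraints are MPI_Bcast's int count argument and the common 2 GB per-message cap) of why a 25,000 x 25,000 matrix of doubles cannot be broadcast in one call; the exact failure recorded in BadCompileMPI.txt may differ.

/* Hedged sketch: size of a 25,000 x 25,000 double matrix vs. MPI_Bcast limits.
 * Assumptions: int `count` argument and a 2 GB per-message cap; the actual
 * error captured in BadCompileMPI.txt may have a different proximate cause. */
#include <limits.h>
#include <stdio.h>

int main(void)
{
    long long n = 25000;
    long long elements = n * n;                       /* 625,000,000 doubles */
    long long bytes = elements * (long long)sizeof(double);

    printf("elements = %lld, bytes = %lld (%.2f GB)\n",
           elements, bytes, bytes / (1024.0 * 1024.0 * 1024.0));

    if (bytes > (2LL << 30))
        printf("exceeds a 2 GB message: cannot be sent in one MPI_Bcast\n");
    if (elements > INT_MAX)
        printf("element count would not even fit in MPI_Bcast's int count\n");
    return 0;
}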

Page 7

Time Comparison (Boring)

[Chart: POSIX Threads Matrix Matrix Multiply, Matrix Size 4000 x 4000; Time (secs) vs. Number of POSIX threads (6-12)]

[Chart: MPI Matrix Matrix Multiply, Matrix Size 4000 x 4000; Time (secs) vs. Number of MPI Processes (6-12)]

Page 8

Time Comparison (still boring...)

In all these cases the times for 5, 4, 3, and 2 processes were much longer than for 6, so they are left off for comparison.

[Chart: MPI Matrix Matrix Multiply, Matrix Size 5000 x 5000; Time (secs) vs. Number of MPI Processes (6-12)]

[Chart: POSIX Threads Matrix Matrix Multiply, Matrix Size 5000 x 5000; Time (secs) vs. Number of POSIX Threads (6-12)]

MPI Doesn’t “catch” back up till 11 processes

POSIX Doesn’t “catch” back up till 9 processes

Page 9

MPI Time Curve

[Chart: MPI Matrix Sizes 2400x2400 - 3000x3000; one curve per size from 2400 x 2400 to 3000 x 3000; Time (secs) vs. Number of MPI Processes (2-12)]

Note: 3000 x 3000 performs better than 2900 x 2900.

Also marked on the chart: run time for 1 processor, optimized, 3000 x 3000, straight C with no MPI.

Page 10

POSIX Time Curve

[Chart: POSIX Matrix Sizes 2400x2400 - 3000x3000; one curve per size from 2400 x 2400 to 3000 x 3000; Time (secs) vs. Number of POSIX Threads (2-12)]

Up to here 3000 x 3000 performs better than 2900 x 2900

Page 11

POSIX Threads vs MPI Processes Run Times, Matrix Sizes 4000x4000 - 10,000 x 10,000

[Chart: POSIX Threads 4000 x 4000 - 10,000 x 10,000; one curve per size from 4000 x 4000 to 10,000 x 10,000; Time (secs) vs. Number of POSIX threads]

[Chart: MPI Processes 4000 x 4000 - 10,000 x 10,000; one curve per size from 4000 x 4000 to 10,000 x 10,000; Time (secs) vs. Number of MPI Processes]

Page 12

POSIX Threads 1500 x 1500 – 2500x2500

[Chart: POSIX Threads Matrix Sizes 1500 x 1500 - 2500 x 2500; one curve per size from 1500 x 1500 to 2500 x 2500; Time (secs) vs. Number of POSIX Threads (2-12)]

Page 13

1600 x 1600 case

Straight C runs long enough to watch the top output (here I can see the memory usage). The threaded, MPI, and non-MP code share the same basic structure for calculating the "C" matrix.

Suspect some kind of boundary issue here, possibly "false sharing"? The process fits entirely in the shared L3 cache (15 MB x 2 = 30 MB). Do the same number of calculations, but make the initial array allocations larger (output shown below; a sketch of the padded allocation follows it).

[rahnbj@rage ~/SUNY]$ foreach NUM_TRIALS (1 2 3 4 5)
foreach? ./a.out
foreach? end
Matrices (1600x1600) Size Allocated (1600 x 1600) : Run Time 21.979548 secs
Matrices (1600x1600) Size Allocated (1600 x 1600) : Run Time 21.980786 secs
Matrices (1600x1600) Size Allocated (1600 x 1600) : Run Time 21.971891 secs
Matrices (1600x1600) Size Allocated (1600 x 1600) : Run Time 21.974897 secs
Matrices (1600x1600) Size Allocated (1600 x 1600) : Run Time 22.012967 secs
[rahnbj@rage ~/SUNY]$ foreach NUM_TRIALS ( 1 2 3 4 5 )
foreach? ./a.out
foreach? end
Matrices (1600x1600) Size Allocated (1601 x 1601) : Run Time 12.890815 secs
Matrices (1600x1600) Size Allocated (1601 x 1601) : Run Time 12.903997 secs
Matrices (1600x1600) Size Allocated (1601 x 1601) : Run Time 12.881991 secs
Matrices (1600x1600) Size Allocated (1601 x 1601) : Run Time 12.884655 secs
Matrices (1600x1600) Size Allocated (1601 x 1601) : Run Time 12.887197 secs
[rahnbj@rage ~/SUNY]$
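A minimal sketch (my own, not the posted test program) of the padding experiment above: identical 1600 x 1600 work, but each matrix is allocated with a 1601-element row stride so successive rows no longer map onto the same cache sets.

/* Padding experiment: same 1600 x 1600 calculations, arrays allocated with a
 * slightly larger row stride (1601). Illustration only, not the posted code. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N      1600      /* logical matrix size           */
#define STRIDE 1601      /* allocated row length (padded) */

int main(void)
{
    double *A = malloc((size_t)STRIDE * STRIDE * sizeof *A);
    double *B = calloc((size_t)STRIDE * STRIDE, sizeof *B);
    double *C = calloc((size_t)STRIDE * STRIDE, sizeof *C);

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            A[i * STRIDE + j] = (double)(i * N + j + 1);
    for (int i = 0; i < N; i++)
        B[i * STRIDE + i] = 1.0;                       /* identity */

    clock_t t0 = clock();
    for (int i = 0; i < N; i++)                        /* same work, padded stride */
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i * STRIDE + k] * B[k * STRIDE + j];
            C[i * STRIDE + j] = sum;
        }
    printf("Matrices (%dx%d) Size Allocated (%d x %d) : Run Time %f secs\n",
           N, N, STRIDE, STRIDE, (double)(clock() - t0) / CLOCKS_PER_SEC);

    free(A); free(B); free(C);
    return 0;
}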

Page 14

Future Directions

POSIX threads with network memory (NFS)?

Combine MPI and POSIX threads? MPI across multiple machines, then POSIX threads on each? POSIX threads that launch MPI? http://cdac.in/index.aspx?id=ev_hpc_hegapa12_mode01_multicore_mpi_pthreads

Couldn't get MPE running with MPICH (would like to re-investigate why).

Investigate optimization techniques:
Did the compiler figure out how to reduce run times because of the simple matrix multiplies? <- NO
Rerun with a non-identity B matrix and compare times <- DONE

Try different languages, e.g. Chapel. Try different algorithms. Want to add OpenMP to the mix.

Found this paper on OpenMP vs direct POSIX programming (similar tests) http://www-polsys.lip6.fr/~safey/Reports/pasco.pdf

For < 6 processes, look at thread affinity and the assignment of threads to a physical processor (a minimal affinity sketch follows below).
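One possible starting point for that experiment, sketched with the GNU/Linux-specific pthread_attr_setaffinity_np; the core numbering is an assumption about the test machine, not something taken from the project code.

/* Hedged sketch: pin each worker thread to a specific core before it starts.
 * pthread_attr_setaffinity_np and sched_getcpu are GNU/Linux extensions;
 * compile with -pthread. Core layout (thread i -> core i) is an assumption. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

#define NUM_THREADS 4

static void *worker(void *arg)
{
    long id = (long)arg;
    printf("thread %ld running on CPU %d\n", id, sched_getcpu());
    return NULL;
}

int main(void)
{
    pthread_t tid[NUM_THREADS];

    for (long i = 0; i < NUM_THREADS; i++) {
        pthread_attr_t attr;
        cpu_set_t set;

        pthread_attr_init(&attr);
        CPU_ZERO(&set);
        CPU_SET((int)i, &set);            /* pin thread i to core i (assumed layout) */
        pthread_attr_setaffinity_np(&attr, sizeof(cpu_set_t), &set);

        pthread_create(&tid[i], &attr, worker, (void *)i);
        pthread_attr_destroy(&attr);
    }
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}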