
Page 1:

Performance Oriented MPI

Jeffrey M. Squyres

Andrew Lumsdaine

NERSC/LBNL and U. Notre Dame


Page 2:

Overview

Overview and History of MPI
Performance Oriented Point to Point
Collectives, Data Types
Diagnostics and Tuning
Rules of Thumb and Gotchas

Page 3:

Scope of This Talk

Beginning to intermediate user
General principles and rules of thumb
When and where performance might be available
Omit (advanced) low-level issues

Page 4:

Overview and History of MPI

Library (not language) specification
Goals
– Portability
– Efficiency
– Functionality (small and large)
Safety (communicators)
Conservative (current best practices)

Page 5:

Performance in MPI

MPI includes many performance-oriented features
These features are only potentially high-performance
The standard seeks not to preclude performance; it does not mandate it
Progress might only be made during MPI function calls

Page 6:

(Potential) Performance Features

Non-blocking operations
Persistent operations
Collective operations
MPI Datatypes

Page 7:

Basic Point to Point

“Six function MPI” includes
MPI_Send()
MPI_Recv()
These are useful, but there is more

Page 8:

Basic Point to Point

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0) {
    MPI_Send(&work, 1, MPI_INT, dest, TAG, MPI_COMM_WORLD);
} else {
    MPI_Recv(&result, 1, MPI_INT, src, TAG, MPI_COMM_WORLD, &status);
}

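Wrapped in the boilerplate the slide omits, a complete, compilable version might look like the following sketch; the peer ranks, the TAG value, and the work/result variables are illustrative assumptions:

#include <mpi.h>
#include <stdio.h>

#define TAG 17   /* arbitrary message tag (assumption) */

int main(int argc, char *argv[]) {
    int rank, work = 42, result = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Rank 0 sends one int to rank 1 */
        MPI_Send(&work, 1, MPI_INT, 1, TAG, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Rank 1 receives the int from rank 0 */
        MPI_Recv(&result, 1, MPI_INT, 0, TAG, MPI_COMM_WORLD, &status);
        printf("Rank 1 received %d\n", result);
    }

    MPI_Finalize();
    return 0;
}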

Page 9:

Non-Blocking Operations

MPI_Isend()
MPI_Irecv()
“I” is for immediate
Paired with MPI_Test()/MPI_Wait()

Page 10:

Non-Blocking Operations

MPI_Comm_rank(comm, &rank);
if (rank == 0) {
    MPI_Isend(sendbuf, count, MPI_REAL, 1, tag, comm, &request);
    /* Do some computation */
    MPI_Wait(&request, &status);
} else {
    MPI_Irecv(recvbuf, count, MPI_REAL, 0, tag, comm, &request);
    /* Do some computation */
    MPI_Wait(&request, &status);
}


Page 11:

Persistent Operations

MPI_Send_init()
MPI_Recv_init()
Creates a request but does not start it
MPI_Start() begins the communication
A single request can be re-used with multiple calls to MPI_Start()

Page 12:

Persistent Operations

MPI_Comm_rank(comm, &rank);
if (rank == 0)
    MPI_Send_init(sndbuf, count, MPI_REAL, 1, tag, comm, &request);
else
    MPI_Recv_init(rcvbuf, count, MPI_REAL, 0, tag, comm, &request);

/* … */

for (i = 0; i < n; i++) {
    MPI_Start(&request);
    /* Do some work */
    MPI_Wait(&request, &status);
}


Page 13:

Collective Operations

May be layered on point to point
May use tree communication patterns for efficiency
Synchronization! (No non-blocking collectives)

Page 14:

Collective Operations

MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, comm);

[Figure: linear fan-in reduction, O(P) steps, versus tree-based reduction, O(log P) steps]
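To see where the O(log P) cost comes from, here is a sketch of a tree-style sum built from point to point. It is illustrative of the kind of pattern an MPI_Reduce implementation may use internally, not how any particular library actually does it, and it assumes the process count P is a power of two:

/* Sketch: binomial-tree sum of one double onto rank 0.
   Each level halves the number of active ranks, so there
   are log2(P) levels. Assumes P is a power of two. */
double tree_sum(double myval, int rank, int P, MPI_Comm comm) {
    double other;
    MPI_Status status;
    int step;

    for (step = 1; step < P; step *= 2) {
        if (rank % (2 * step) == step) {
            /* Sender at this level: ship the partial sum and drop out */
            MPI_Send(&myval, 1, MPI_DOUBLE, rank - step, 0, comm);
            break;
        } else if (rank % (2 * step) == 0) {
            /* Receiver at this level: fold in the partner's partial sum */
            MPI_Recv(&other, 1, MPI_DOUBLE, rank + step, 0, comm, &status);
            myval += other;
        }
    }
    return myval;   /* The complete sum is valid only on rank 0 */
}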

Page 15:

MPI Datatypes

May allow MPI to send a message directly from memory
May avoid copying/packing
(General) high performance implementations not widely available

[Figure: data sent directly from user memory to the network versus an intermediate copy/pack step]
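As a concrete sketch, a strided datatype lets MPI send a column of a C matrix straight out of user memory, with no user-level packing loop. The matrix shape, dest, and TAG here are illustrative assumptions:

double A[100][100];     /* row-major, so columns are strided */
MPI_Datatype column_t;

/* 100 blocks of 1 double, successive blocks 100 doubles apart:
   exactly one column of A */
MPI_Type_vector(100, 1, 100, MPI_DOUBLE, &column_t);
MPI_Type_commit(&column_t);

/* Send column 3 directly from A's memory */
MPI_Send(&A[0][3], 1, column_t, dest, TAG, MPI_COMM_WORLD);

MPI_Type_free(&column_t);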

Page 16:

Quiz: MPI_Send()

After I call MPI_Send():
– The recipient has received the message
– I have sent the message
– I can write to the message buffer without corrupting the message
The only guarantee: I can write to the message buffer

Page 17:

Sidenote: MPI_Ssend()

MPI_Ssend() has the (perhaps) expected semantics
When MPI_Ssend() returns, the recipient has received the message
Useful for debugging (replace MPI_Send() with MPI_Ssend())
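The swap is mechanical; a sketch, reusing the buffer names from the earlier send example:

/* Normal build: may return as soon as the buffer is reusable,
   e.g., once the message has been copied into an internal buffer */
MPI_Send(&work, 1, MPI_INT, dest, TAG, MPI_COMM_WORLD);

/* Debugging build: does not return until the matching receive has
   started, so latent send/receive ordering bugs become visible hangs */
MPI_Ssend(&work, 1, MPI_INT, dest, TAG, MPI_COMM_WORLD);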

Page 18:

Quiz: MPI_Isend()

After I call MPI_Isend():
– The recipient has started to receive the message
– I have started to send the message
– I can write to the message buffer without corrupting the message
None of the above (I must call MPI_Test() or MPI_Wait())

Page 19:

Quiz: MPI_Isend()

True or False:
– I can overlap communication and computation by putting some computation between MPI_Isend() and MPI_Test()/MPI_Wait()
False (in many/most cases)

Page 20:

Communication is Still Computation

A CPU, usually the main one, must do the communication work
– Part of your process (inside MPI calls)
– Another process on main CPU
– Another thread on main CPU
– Another processor

Page 21:

No Free Lunch

Part of your process (most common)
– Fast but no overlap
Another process (daemons)
– Overlap, but slow (extra copies)
Another thread (rare)
– Overlap and fast, but difficult
Another processor (emerging)
– Overlap and fast, but more hardware
– E.g., Myri/gm, VIA

Page 22:

How Do I Get Performance?

Minimize time spent communicating
– Minimize data copies
Minimize synchronization
– I.e., time waiting for communication

Page 23:

Minimizing Communication Time

Bandwidth
Latency

Page 24:

Minimizing Latency

Collect small messages together (if you can)
– One 1024-byte message instead of 1024 one-byte messages
Minimize other overhead (e.g., copying)
Overlap with computation (if you can)
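A sketch of the aggregation point, assuming a contiguous byte buffer and a dest/TAG as in the earlier examples:

/* 1024 per-message latencies: one per send */
for (i = 0; i < 1024; i++)
    MPI_Send(&buf[i], 1, MPI_BYTE, dest, TAG, MPI_COMM_WORLD);

/* One latency: same payload in a single message */
MPI_Send(buf, 1024, MPI_BYTE, dest, TAG, MPI_COMM_WORLD);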

Page 25:

Example: Domain Decomposition

Page 26:

Naïve Approach

while (!done) {
    exchange(D, neighbors, myrank);
    dored(D);
    exchange(D, neighbors, myrank);
    doblack(D);
}

void exchange(Array D, int *neighbors, int myrank) {
    for (i = 0; i < 4; i++)
        MPI_Send(…);
    for (i = 0; i < 4; i++)
        MPI_Recv(…);
}


Page 27:

Naïve Approach

Deadlock! (Maybe)
Can fix with careful coordination of receiving versus sending on alternate processes
But this can still serialize

Page 28:

MPI_Sendrecv()

while (!done) {
    exchange(D, neighbors, myrank);
    dored(D);
    exchange(D, neighbors, myrank);
    doblack(D);
}

void exchange(Array D, int *neighbors, int myrank) {
    for (i = 0; i < 4; i++) {
        MPI_Sendrecv(…);
    }
}

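The slide leaves the arguments elided; one plausible way to fill them in (the buffer names, count, and datatype are assumptions) is:

for (i = 0; i < 4; i++) {
    /* Send to and receive from each neighbor in one call;
       MPI pairs the two halves so matched processes cannot deadlock */
    MPI_Sendrecv(sendbuf[i], count, MPI_DOUBLE, neighbors[i], TAG,
                 recvbuf[i], count, MPI_DOUBLE, neighbors[i], TAG,
                 MPI_COMM_WORLD, &status);
}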

Page 29:

Immediate Operations

while (!done) {
    exchange(D, neighbors, myrank);
    dored(D);
    exchange(D, neighbors, myrank);
    doblack(D);
}

void exchange(Array D, int *neighbors, int myrank) {
    for (i = 0; i < 4; i++) {
        MPI_Isend(…);
        MPI_Irecv(…);
    }
    MPI_Waitall(…);
}


Page 30:

Receive Before Sending

while (!done) {
    exchange(D, neighbors, myrank);
    dored(D);
    exchange(D, neighbors, myrank);
    doblack(D);
}

void exchange(Array D, int *neighbors, int myrank) {
    for (i = 0; i < 4; i++)
        MPI_Irecv(…);
    for (i = 0; i < 4; i++)
        MPI_Isend(…);
    MPI_Waitall(…);
}

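Filled in with hypothetical buffers, the exchange body might read:

MPI_Request reqs[8];
MPI_Status stats[8];

/* Post all receives first so matching sends find them already waiting */
for (i = 0; i < 4; i++)
    MPI_Irecv(recvbuf[i], count, MPI_DOUBLE, neighbors[i], TAG, comm, &reqs[i]);
for (i = 0; i < 4; i++)
    MPI_Isend(sendbuf[i], count, MPI_DOUBLE, neighbors[i], TAG, comm, &reqs[4 + i]);

MPI_Waitall(8, reqs, stats);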

Page 31:

Persistent Operations

for (i = 0; i < 4; i++) {
    MPI_Recv_init(…);
    MPI_Send_init(…);
}

while (!done) {
    exchange(D, neighbors, myrank);
    dored(D);
    exchange(D, neighbors, myrank);
    doblack(D);
}

void exchange(Array D, int *neighbors, int myrank) {
    MPI_Startall(…);
    MPI_Waitall(…);
}

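A filled-in sketch of the persistent pattern, with the same assumed buffers as above:

/* Setup, once: argument processing and channel setup happen here */
MPI_Request reqs[8];
MPI_Status stats[8];
for (i = 0; i < 4; i++) {
    MPI_Recv_init(recvbuf[i], count, MPI_DOUBLE, neighbors[i], TAG, comm, &reqs[i]);
    MPI_Send_init(sendbuf[i], count, MPI_DOUBLE, neighbors[i], TAG, comm, &reqs[4 + i]);
}

/* Each exchange: just start and complete the pre-built requests */
MPI_Startall(8, reqs);
MPI_Waitall(8, reqs, stats);

/* Teardown, once */
for (i = 0; i < 8; i++)
    MPI_Request_free(&reqs[i]);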

Page 32:

Overlapping

while (!done) {
    MPI_Startall(…);         /* Start exchanges */
    do_inner_red(D);         /* Internal computation */
    for (i = 0; i < 4; i++) {
        MPI_Waitany(…);      /* As information arrives */
        do_received_red(D);  /* Process */
    }
    MPI_Startall(…);
    do_inner_black(D);
    for (i = 0; i < 4; i++) {
        MPI_Waitany(…);
        do_received_black(D);
    }
}

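MPI_Waitany() reports which request in an array completed, so work can begin on that neighbor's data immediately; a sketch, where the request array and the helper are assumptions:

int idx;
MPI_Status status;

/* Completes as soon as ANY of the four receives finishes;
   idx identifies which one */
MPI_Waitany(4, recv_reqs, &idx, &status);
process_boundary_from(D, neighbors[idx]);   /* hypothetical helper */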

Page 33:

Advanced Overlap

MPI_Startall(…);             /* Start all receives */
/* … */
while (!done) {
    MPI_Startall(…);         /* Start sends */
    do_inner_red(D);         /* Internal computation */
    for (i = 0; i < 4; i++) {
        MPI_Waitany(…);      /* Wait on receives */
        if (received) {
            do_received_red(D);  /* Process */
            MPI_Start(…);    /* Restart receive */
        }
    }
    /* Repeat for black */
}


Page 34:

MPI Data Types

MPI_Type_vector
MPI_Type_struct
Etc.
MPI_Pack might be better

[Figure: noncontiguous data explicitly copied/packed into a contiguous buffer before going to the network]
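A sketch of the MPI_Pack route; the variables n and x are assumptions, and dest/TAG/comm are as in the earlier examples:

int n = 4;
double x[4] = {0.1, 0.2, 0.3, 0.4};
char packbuf[1024];
int position = 0;

/* Explicitly copy an int and four doubles into one contiguous buffer */
MPI_Pack(&n, 1, MPI_INT, packbuf, 1024, &position, comm);
MPI_Pack(x, 4, MPI_DOUBLE, packbuf, 1024, &position, comm);

/* position now holds the packed size in bytes */
MPI_Send(packbuf, position, MPI_PACKED, dest, TAG, comm);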

Page 35:

Minimizing Synchronization

At a synchronization point (e.g., with collective communication) all processes must arrive at the collective call
Can spend lots of time waiting
This is often an algorithmic issue
– E.g., check for convergence every 5 iterations instead of every iteration
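A sketch of the every-5-iterations idea; the solver step, tolerance, and error metric are assumptions:

int iter = 0, done = 0;
while (!done) {
    double local_err = do_iteration(D);   /* hypothetical solver step */
    if (++iter % 5 == 0) {
        double global_err;
        /* The only global synchronization: once per five iterations */
        MPI_Allreduce(&local_err, &global_err, 1, MPI_DOUBLE, MPI_MAX, comm);
        done = (global_err < TOL);        /* TOL is an assumption */
    }
}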

Page 36:

Gotchas

MPI_Probe
– Guarantees an extra memory copy
MPI_ANY_SOURCE
– Can cause additional (internal) looping
MPI_Alltoall
– All pairs must communicate
– Synchronization (avoid in general)

Page 37:

Diagnostic Tools

TotalView
Prism
Upshot
XMPI

Page 38:

Summary

Receive before sending
Collect small messages together
Overlap (if possible)
Use immediate operations
Use persistent operations
Use diagnostic tools