EFFICIENT DYNAMIC VERIFICATION
ALGORITHMS FOR MPI
APPLICATIONS
by
Sarvani Vakkalanka
A dissertation submitted to the faculty of The University of Utah
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy
in
Computer Science
School of Computing
The University of Utah
August 2010
Copyright © Sarvani Vakkalanka 2010
All Rights Reserved
The University of Utah Graduate School
STATEMENT OF DISSERTATION APPROVAL
The dissertation of
has been approved by the following supervisory committee members:
, Chair Date Approved
, Member
Date Approved
, Member
Date Approved
, Member
Date Approved
, Member
Date Approved
and by , Chair of
the Department of
and by Charles A. Wight, Dean of The Graduate School.
ABSTRACT
The Message Passing Interface (MPI) Application Programming Interface (API)
is widely used in almost all high performance computing applications. Yet,
conventional debugging tools for MPI suffer from two serious drawbacks: they
cannot prevent the exponentially growing number of redundant schedules from
being explored; and they cannot prevent the processes from being locked into a
small subset of schedules, so that the potentially buggy schedules are often reached
only when programs are ported to new platforms.
Dynamic verification methods are the natural choice for debugging real world
MPI programs when model extraction and maintenance are expensive. While many
dynamic verification tools exist for verifying shared memory programs, there are no
corresponding tools that support MPI – the lingua franca of parallel programming.
While interleaving reduction suggests the use of dynamic partial order reduction
(DPOR), four aspects of MPI make previous DPOR algorithms inapplicable: (i)
MPI contains asynchronous calls that can complete out of program order; (ii)
MPI has global synchronization operations that have weak semantics; (iii) the
runtime of MPI cannot, without intrusive modifications, be forced to pursue a
specific interleaving with nondeterministic wildcard receives; and (iv) the progress
of MPI operations can depend on platform-dependent runtime buffering, making
bugs sometimes appear when resources are added to boost performance. This
dissertation provides a formal model for MPI, and introduces a tailor-made no-
tion of Happens-Before ordering for MPI functions. The crucial feature of this
Happens-Before relation is that it elegantly solves all these four problems. MPI
dynamic analysis is turned into a prioritized scheduling algorithm respecting MPI’s
Happens-Before.
This dissertation contributes three algorithms that have been demonstrated
in the context of a practical MPI dynamic verification tool called In-Situ Partial
order (ISP). The Partial Order avoiding Elusive Interleavings (POE) algorithm is
a simple prioritized execution of the MPI transitions and is guaranteed to find
all deadlocks, assertion violations and resource leaks under zero buffering. The
POEOPT algorithm avoids many of the redundant interleavings of POE by fully
exploiting MPI’s Happens-Before. Finally, the POEMSE algorithm discovers all
minimal runtime bufferings needed to expose bugs. POEMSE’s
slack analysis has minimal overheads, and offers the power of verifying for safe
portability by considering all relevant bufferings that might exist in various plat-
forms. In effect, a program is dynamically verified not just with respect to the
platform on which the tool is run, but also with respect to all platforms.
To
Surya and Siri
CONTENTS
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
CHAPTERS
1. INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Specifics of this Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 Dissertation Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Message Passing Interface (MPI) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 MPI Program Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4.1 Necessity of DPOR for MPI . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4.2 MPI Formal Semantics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4.3 POE Dynamic Verification Algorithm . . . . . . . . . . . . . . . . . . . . 8
1.4.4 POEOPT Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4.5 POEMSE Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4.6 The ISP Tool . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.5 Impact of This Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2. BACKGROUND . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1 Message Passing Interface (MPI) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.1 MPI Isend . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.2 MPI Irecv . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.3 MPI Wait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.4 MPI Barrier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.1.5 MPI Ordering Guarantees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2 Dynamic Partial Order Reduction (DPOR) . . . . . . . . . . . . . . . . . . . . 20
2.2.1 DPOR Illustration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3 Applying DPOR to MPI: Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3. MPI FORMAL MODEL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.1 Formal Transition System for MPI . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.1.1 State Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.1.2 The State of an MPI Execution . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2 MPI Transition System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2.1 Process Transitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2.2 MPI Runtime Book-keeping Sets . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.3 MPI Runtime Transitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2.4 Conditional Matches-before . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2.5 Dynamic Instruction Rewriting . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2.6 One Transition or Multiple? . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2.7 Dependent Transition Group . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2.8 Selectors and Useful Predicates . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.3 Illustration of the Formal Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.4 Applying DPOR to MPI Transition System . . . . . . . . . . . . . . . . . . . 42
4. THE POE ALGORITHM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.1 MPI Transition Dependence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2 The POE Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2.1 Persistent Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2.2 Persistent Sets and MPI Program Correctness . . . . . . . . . . . . . 47
4.2.3 POE Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.3 Illustration of POE Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.4 Issues with POE Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.4.1 Redundant Interleavings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.4.2 POE and Buffered Sends . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5. POE AND REDUNDANT INTERLEAVINGS . . . . . . . . . . . . . . 57
5.1 POE and Redundant Interleavings . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.2 InterHB and Co-enabledness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.3 POE Algorithm Modified . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
6. DETERMINISTIC MPI PROGRAMS . . . . . . . . . . . . . . . . . . . . . 69
6.1 Deterministic MPI Programs and HB . . . . . . . . . . . . . . . . . . . . . . . . 69
7. HANDLING SLACK IN MPI PROGRAMS . . . . . . . . . . . . . . . . 73
7.1 Verification for Portability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
7.2 Introduction to Slack Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
7.2.1 Zero Buffering Can Miss Deadlocks . . . . . . . . . . . . . . . . . . . . . . 76
7.2.2 Too Much Buffering Can Miss Deadlocks . . . . . . . . . . . . . . . . . 77
7.3 Using HB to Detect Slack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
7.3.1 HB Graph and Paths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
7.4 Finding Minimal Wait Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
7.5 POEMSE Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
8. EXTENSIONS TO THE FORMAL MODEL . . . . . . . . . . . . . . . . 89
8.1 Handling More MPI Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
8.1.1 MPI Send and MPI Recv . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
8.1.2 MPI Waitall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
8.2 Communicators and Tags . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
8.2.1 Extensions to the Formal Model . . . . . . . . . . . . . . . . . . . . . . . . 94
9. ISP: A PRACTICAL DYNAMIC MPI VERIFIER . . . . . . . . . . 96
9.1 ISP Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
9.1.1 The Profiler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
9.1.2 The ISP Scheduler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
9.2 ISP: Implementation Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
9.2.1 Out-of-Order Issue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
9.2.2 Scheduling MPI Waitany . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
9.2.3 Buffering Sends . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
9.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
10. CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
10.1 Suggestions for Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
LIST OF FIGURES
1.1 Example MPI Program . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 GEM Front-end . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1 MPI Ordering Guarantees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2 Example Thread Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3 DPOR Illustration: Initial Interleaving . . . . . . . . . . . . . . . . . . . . . . . . 24
2.4 DPOR Illustration: Updating Backtrack Set . . . . . . . . . . . . . . . . . . . . 25
2.5 Simple MPI Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.6 Illustration of Surprising MPI Runtime Behavior with DPOR . . . . . . . 27
2.7 Crooked Barrier Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.1 Simple MPI Program . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2 Execution of Figure 3.1 with MPI Transitions . . . . . . . . . . . . . . . . . . . 41
3.3 MPI Execution with a Deadlock . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4 MPI Execution of Figure 3.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.1 Pseudocode for POE Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2 Pseudocode for GetTransition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.3 Pseudocode for UpdateBacktrack . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.4 Pseudocode for GenerateInterleaving . . . . . . . . . . . . . . . . . . . . . 50
4.5 Crooked Barrier Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.6 POE Interleaving 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.7 POE Interleaving 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.8 Redundant POE Interleavings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.9 POE and Persistent Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.10 Buffering Sends and POE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.1 Redundant POE Interleavings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.2 POE and Persistent Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.3 Simple Optimization and Redundancy . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.4 InterHB Relation Across Match-sets . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.5 HB Relation for Figure 5.3 Shown as Graph . . . . . . . . . . . . . . . . . . . . 64
5.6 Redundancy with New POE Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 64
5.7 Pseudocode for POEOPT Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.8 Pseudocode for GetBacktrack . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.9 Pseudocode for UpdateBacktrack . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.10 Pseudocode for AddtoBacktrack . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.11 Pseudocode for GenerateInterleaving . . . . . . . . . . . . . . . . . . . . . 67
7.1 Buffering Sends and Deadlocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
7.2 Specific Buffering Needed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
7.3 Path Breaking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
7.4 Example Formulas and GHB graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
7.5 Algorithm to Find Minimal Wait Sets . . . . . . . . . . . . . . . . . . . . . . . . . 83
7.6 Pseudocode for UpdateBacktrack . . . . . . . . . . . . . . . . . . . . . . . . . 84
7.7 Pseudocode for GetTransition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
7.8 Pseudocode for AddSlacktoBacktrack . . . . . . . . . . . . . . . . . . . . . 85
9.1 ISP Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
LIST OF TABLES
9.1 Comparison of POE with Marmot . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
9.2 Results for POE and POEOPT on MADRE . . . . . . . . . . . . . . . . . . . . . 103
9.3 Results for POE and POEMSE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
ACKNOWLEDGMENTS
This dissertation would not be complete without the help and support of faculty,
friends and family. I had the good fortune of meeting some of the smartest as well
as the most warm-hearted people during my graduate studies at the University of
Utah. The foremost among them is my advisor, Prof. Ganesh Gopalakrishnan. As
I was hunting in the department for a good advisor, it was universally acknowledged
by many graduate students that Prof. Ganesh is one of the best advisors in the
department. During my PhD studies, I also came to understand that he was more
than a good advisor. He is one of the most humble and warm-hearted persons and
a friend to all the students in the Gauss group. I would like to thank him for all
the opportunities, exposure, guidance and support he provided me right from the
very beginning of my PhD studies under him.
I would also like to thank Prof. Mike Kirby, who is my co-advisor, for his
support and encouragement. This dissertation would not be in its current form
without the valuable suggestions from Prof. Suresh Venkatasubramanian, Prof.
Matt Might and Prof. Stephen Siegel, who were also members of my dissertation
committee. I thank my dissertation committee from the bottom of my heart for
their understanding and support during the last year of my PhD when I was going
through some medical complications.
The best part of my PhD studies was being associated with the smart graduate
and undergraduate students in the Gauss group. Each person in the group is
special and I think I became a better person by my association with them. I have
to especially thank Subodh Sharma, who never refused to help me in any way he
could. I can never repay him for the rides to and from the Salt Lake City airport,
whether it was day or night. I also thank Anh Vo, Michael Delisi, Geof Sawaya,
Sriram Ananthakrishnan and Guodong Li for their help and input on my research
during the entire course of my PhD studies.
I would like to thank Rajeev Thakur of Argonne National Labs and Bronis de
Supinski of Lawrence Livermore National Labs for their support. Our tool ISP
would not be as successful without their intelligent and expert input.
This dissertation would not have reached the thesis editor without the help of
Karen Feinauer. I will always remember her as a person with a warm, welcoming
smile. My deepest gratitude goes to Karen for all the help with scanning the
corrections. This dissertation would literally not have seen the light of day without
her help.
There were times during the last 3.5 years when I was impossible to live with.
The ups and downs of research and my corresponding mood swings were felt more
by my family than anyone else. The person who took the brunt of all this is
my husband Surya who jumped with joy for me when I was successful and also
encouraged me when things did not go so well. He is the pillar who provided me
with immense support through some of the toughest times in my life. I know that
saying a mere “Thank You” is not sufficient. I only hope that I can be as good a
friend as he has been.
A decade ago, I would not have even dreamed that I was capable of doing a PhD.
My only goal was to finish my undergraduate studies and start on a well-paying job
to support my family financially. It was my twin sister Sridevi who showed me the
way to study while providing financial support. She is one of the most courageous
women who achieves her goals through sheer determination. It was through her
encouragement that I moved to the US for a PhD. I think there is no better place
to tell her that I love her and that my life would be incomplete without her presence
right from the day we were born.
CHAPTER 1
INTRODUCTION
It is no exaggeration to say that computer software already governs everything
we do as a human society. All computer software – regardless of its purpose –
must be correct as well as efficient. What differentiates various types of software
is the price we are willing to pay for achieving these goals, and for whom these
goals ultimately matter. Clearly, inefficient software (not producing results on
time; consuming excessive amounts of energy, etc.) is also “buggy.” We believe
in allowing humans to make such efficiency-related decisions, and focus on helping
them ensure the functional correctness of their designs (for now, “correct” can be
taken to mean “free of assertion violations and deadlocks”).
This dissertation is focused on correctness issues that arise in software that
underlies large-scale scientific simulations. Such simulations are responsible for
virtually all the high performance computing (HPC) simulation experiments that
scientists and engineers perform on an essentially unlimited class of problems (weather
modeling, earthquake prediction, safety of nuclear stockpiles, drug discovery, testing
nascent theories in Physics, to name a few). Our goal is to contribute tools that
practitioners in HPC can employ in their day-to-day work to ensure that their HPC
simulation programs are correct.
Day-to-day software development in HPC is still an arduous process, often
relying on primitive debugging methods such as “printf debugging.” Modern
commercial tools in this area (e.g., TotalView [57], STAT [55], etc.) are extremely
helpful for debugging errors after a crash has been recorded. However, these tools
have no analytical power that lets them study a piece of software over the fewest
number of concurrent interleavings or data inputs, and locate bugs with formal
assurance. They rely on human ingenuity for test data input selection – known
to be unreliable and nonscalable. They rely on the concurrency schedules that
naturally occur in the test environment for concurrency coverage – known to be very
inadequate even from simple studies [68]. Future HPC software will be far more
complex, employing, for example, innovative techniques for energy management
and load balancing. All these additions to the inherent complexity of the core
software will overwhelm even the best available methods.
The HPC community – composed of scientists and engineers who do not necessarily
have a computer science background – has expressed that today’s available
methods are incapable of providing the required levels of correctness. The 2009
ExaScale Software Study [7] points out the sheer complexity of Extreme Scale
computing system designs, which will witness an increased use of different system
components all the way from core-to-core communication protocols to middleware
that manages multiproblem integration. This study asserts, "...Handling such
components in a seamless way and allowing programmers to pursue efficiency
while still providing multiple safety nets are all open challenges, needing the use
of formal methods." In his recent talk entitled Slouching Towards Exascale:
Programming Models for High Performance Computing [30], Lusk observes,
"Formal methods provide the only truly scalable approach to developing correct
code in this complex [Exascale] programming environment." Such statements are
easily justified considering the economic and opportunity costs of errant HPC
simulations. For
example, today’s Petascale system installations can cost millions of dollars just in
energy costs [19].
1.1 Specifics of this Dissertation
This dissertation aims to develop practical concurrency verifiers based on formal
principles that will ensure that High Performance Computing (HPC) programs
written using the Message Passing Interface (MPI) library are free of egregious
and costly errors. We choose MPI because of its dominant position in HPC.
The importance of MPI is well known; it is employed in virtually all scientific
explorations requiring parallelism, such as weather simulation, medical imaging,
and earthquake modeling that are run on expensive high performance computing
clusters.
We want our verifiers to be:
- nonobtrusive, allowing designers to focus on problem solving;
- reliable, by scaling well;
- widely usable, by directly working on the designers’ programs (not requiring
models of these programs).
1.1.1 Dissertation Statement
Dynamic formal verification methods incorporating innovative partial order
reduction methods can help develop nonobtrusive, reliable, and widely usable tools
for MPI programs. Such tools can not only verify a given program with respect
to a given platform (machine, runtime) but also reliably predict and flag errors
pertaining to scheduling and buffering variations across all platforms.
1.2 Message Passing Interface (MPI)
The MPI standard [33] is an informal document that provides English descrip-
tions of the individual behaviors of about 300 MPI operations. There are several
popular MPI library implementations [34, 37, 27]. Typical MPI programs are
C/C++/Fortran programs that create a fixed number of processes at inception.
These processes then perform computations in their private stores, invoking various
MPI operations in the MPI library to synchronize and exchange data. MPI supports
the SPMD-based programming model. An example MPI program written in C is
shown in Figure 1.1.
All function calls of the form MPI_XXXX are calls into the MPI library. The MPI
program is executed with the number of processes as a command line input which is
passed as a parameter to MPI_Init (line 12). The MPI_Init library call will create
 1: #include <stdio.h>
 2: #define buf_size 128
 3: int main (int argc, char **argv) {
 4:   int nprocs = -1;
 5:   int rank = -1;
 6:   char processor_name[128];
 7:   int namelen = 128;
 8:   int buf0[buf_size];
 9:   int buf1[buf_size];
10:   MPI_Status status;
11:   /* init */
12:   MPI_Init (&argc, &argv);
13:   MPI_Comm_size (MPI_COMM_WORLD, &nprocs);
14:   MPI_Comm_rank (MPI_COMM_WORLD, &rank);
15:   MPI_Get_processor_name (processor_name, &namelen);
16:   printf ("(%d) is alive on %s\n", rank, processor_name);
17:   fflush (stdout);
18:   MPI_Barrier (MPI_COMM_WORLD);
19:   if (nprocs < 2) {
20:     printf ("not enough tasks\n");
21:   } else if (rank == 0) {
22:     MPI_Recv (buf1, buf_size, MPI_INT,
23:               MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &status);
24:     MPI_Recv (buf0, buf_size, MPI_INT,
25:               1, 0, MPI_COMM_WORLD, &status);
26:   } else if (rank == 1) {
27:     memset (buf0, 0, buf_size);
28:     memset (buf1, 1, buf_size);
29:     MPI_Send (buf0, buf_size, MPI_INT, 0, 0, MPI_COMM_WORLD);
30:     MPI_Send (buf1, buf_size, MPI_INT, 0, 0, MPI_COMM_WORLD);
31:   }
32:   MPI_Barrier (MPI_COMM_WORLD);
33:   MPI_Finalize ();
34:   printf ("(%d) Finished normally\n", rank);
35: }
Figure 1.1. Example MPI Program
the requested number of processes, and each process starts executing the
same program code immediately following MPI_Init. Every process is provided a
unique process rank in the range [0 . . . n − 1] where n is the number of processes
provided as the input. Every process begins executing MPI library calls with an
MPI_Init and finishes executing MPI library calls by calling MPI_Finalize.
After executing some MPI library calls (line 13–17), all the processes synchronize
at MPI_Barrier call (line 18). Every process blocks at MPI_Barrier until all other
processes execute the call to MPI_Barrier. Execution then branches based
on the process rank of the executing process. The process with rank 0 executes
MPI_Recv operations (lines 21–25) while the process with rank 1 executes MPI_Send
operations (lines 26–31). The rest of the processes with a rank greater than 1 will
block at MPI_Barrier (line 32) until process ranks 0 and 1 eventually execute an
MPI_Barrier. All processes finally execute MPI_Finalize and terminate.
1.3 MPI Program Verification
MPI programs can have many kinds of bugs [64] that can be very hard to
find with traditional debugging techniques. Common approaches for debugging
MPI programs include explicit modifications to the source code, message tracing,
and visualization. Programmers typically go through a number of debugging or
testing iterations before a bug is fixed. This iterative analysis and debugging is
time consuming, error prone and complicated, especially if the messages induce
nondeterministic behaviors.
The MPI bugs that arise due to nondeterministic communication races are
among the most difficult to debug. The programmer must be able to enumerate
all possible nondeterministic execution scenarios and test each of them for
various possible bugs. Such manual testing of various possible execution scenarios
is usually impractical for large applications. Testing tools [26, 64] are capable
of testing certain execution scenarios but do not guarantee coverage. Given the
complexity of MPI applications and the difficulty of debugging them, we are convinced
that there is a need for verification tools for MPI programs. Verification tools
usually employ well-known verification algorithms that guarantee coverage and
hence bug detection.
There are two popular forms of formal verification: model-based verification
and dynamic verification. Model-based verification [54, 50] tools usually
require programmers to build a model of their application in a different language
and then verify their model against various properties. Model-based verification
will only help the programmer debug and guarantee bug freedom in the model
but not in the actual program. Also, building a model for a large and complex
MPI program itself can be time consuming. A model-based verification tool called
MPI-SPIN [50] is presently available for MPI.
Dynamic verification tools [4, 11, 24] take as their input the user code provided
with a test harness. Then, using customized scheduling algorithms, they enforce specific
classes of concurrent schedules to occur. Such schedules are effective in hunting
down bugs and are often sufficient to provide important formal coverage guaran-
tees. Dynamic verification tools almost always employ techniques such as dynamic
partial order reduction (DPOR) [69, 9, 17], bounded preemption searching [4, 36],
or combinations of DPOR and symmetry reduction [69] to prevent redundant
state/interleaving explorations. While many such tools exist for verifying shared
memory programs, there is a noticeable dearth of dynamic verification tools sup-
porting the scientific programming community that employs the Message Passing
Interface (MPI).
Though model-based verification tools are popular for their guaranteed coverage
for all inputs, we believe that dynamic verification provides a more practical solution
for MPI programs. Most programs are not input-centric and any specific inputs
are usually handled separately in a different code path. It is usually sufficient to
run the dynamic verification tools with possible input test harnesses and get the
required coverage. Additionally, dynamic verification tools are very easy to use
with little to no programmer effort.
This dissertation presents novel dynamic verification algorithms for MPI that
have been implemented in a tool called ISP, which stands for In-Situ Partial
order.
1.4 Contributions
1.4.1 Necessity of DPOR for MPI
Our first contribution (described in Chapter 2) provides reasons why a new
dynamic partial order reduction algorithm for MPI is necessary. We show through
illustrations that a direct application of the classical DPOR algorithm does not work
for MPI programs. This also forms the motivation behind the algorithms developed
in this dissertation.
1.4.2 MPI Formal Semantics
Our next contribution is a simple and intuitive formal semantics for MPI (de-
scribed in Chapter 3). We provide formal transition semantics for four MPI func-
tions: namely, MPI_Irecv, MPI_Isend, MPI_Barrier, and MPI_Wait. The transi-
tion semantics are divided into two parts called the Process transitions and Runtime
transitions. This division among semantics follows directly from the fact that the
MPI program execution environment consists of the MPI processes that execute
the program code and an MPI runtime daemon that serves these processes. The
MPI runtime contains the library that implements the MPI standard. Processes
issue MPI function calls into the MPI runtime. The MPI runtime is responsible
for the actual execution of the MPI operations issued by the processes according
to the standard.
Our formal transition model is constructed from our experience in building a
formal TLA+ model for MPI [40, 42] and reading the MPI standard. The formal
transition model has embedded within it the ordering guarantees among the MPI
operations described by the MPI standard. We call these ordering guarantees
IntraHB (Intra-Happens-Before) ordering since the ordering is described only for
MPI operations within a process. Our MPI verification tool ISP implements the
runtime transitions of the formal model. The formal model has been extended to
60 MPI operations, all of which are implemented by our verification tool ISP.
1.4.3 POE Dynamic Verification Algorithm
POE stands for Partial Order under Elusive interleavings. The POE algorithm
(described in Chapter 4) is a prioritized execution of the MPI transitions in the
formal model. The prioritized execution allows the discovery of full nondeterminism
in an MPI program. However, the POE algorithm can only generate interleavings
when the MPI sends are not provided any buffering by the MPI runtime. Also,
the POE algorithm can generate a large number of redundant interleavings which
can unnecessarily increase the verification time. Our tool ISP implements the POE
algorithm and has verified a number of small as well as large MPI programs.
1.4.4 POEOPT Algorithm
The POEOPT algorithm is an optimized POE algorithm (described in Chap-
ter 5) that attempts to reduce the redundant interleavings generated by the POE
algorithm. We found that the IntraHB relation among the MPI operations does
not provide the information required to eliminate the redundant interleavings. We
extend the IntraHB relation with the InterHB (Inter-Happens-Before) relation that
is derived from the formal MPI transitions system and IntraHB relation of MPI
operations. We use both the IntraHB and InterHB analysis of an MPI program
execution to extend the POE algorithm to the POEOPT algorithm.
1.4.5 POEMSE Algorithm
MPI programs exhibit slack inelastic behavior [31]. That is, a program can
exhibit new behaviors, deadlocking or entering an erroneous state, when more
slack (i.e., buffering) is provided to the MPI_Isend operations
by the runtime. The MPI_Isend operation sends a message to another process
that receives this message. The messages are copied from the memory space of
the process sending the message to the memory space of the process receiving
the message. However, it is possible for the MPI runtime to provide buffer space
to the messages. In this case, the message being sent is copied into the runtime
provided buffer even when there is no process to receive that message. The buffering
availability for a process depends on the current runtime buffer usage by the
process and a configuration parameter called eager limit. The eager limit is usually
configured into the MPI runtime and is purely a decision of the MPI library
implementation. The MPI standard does not specify any rules or
guidelines on the eager limit.
The buffer availability for a message is a dynamic property. A program can show
two different behaviors when executed with two different libraries. One solution is
to buffer all the sends to help discover the new behaviors due to slack. However, the
send operations themselves can contribute to deadlocks when they are not buffered.
Buffering all the sends would hence miss these deadlock behaviors.
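This buffering sensitivity can be sketched with a toy message-passing model, written in plain Python for this illustration only (the function and names are our own; this is not MPI code): two processes that each send to the other before receiving deadlock under rendezvous (unbuffered) sends, but run to completion when the runtime buffers the sends.

```python
def runs_to_completion(procs, buffered):
    """procs[i] is process i's program: a list of ("send", dest) / ("recv", src).
    Returns True if every process finishes, False if the system deadlocks."""
    pc = [0] * len(procs)          # per-process program counters
    inflight = []                  # messages copied into the runtime buffer
    while True:
        moved = False
        for p, ops in enumerate(procs):
            if pc[p] >= len(ops):
                continue
            kind, peer = ops[pc[p]]
            if kind == "send":
                if buffered:       # eager send: copy out and return immediately
                    inflight.append((p, peer))
                    pc[p] += 1
                    moved = True
                else:              # rendezvous: blocked until the peer is at a recv
                    q = procs[peer]
                    if pc[peer] < len(q) and q[pc[peer]] == ("recv", p):
                        pc[p] += 1
                        pc[peer] += 1
                        moved = True
            elif kind == "recv" and (peer, p) in inflight:
                inflight.remove((peer, p))
                pc[p] += 1
                moved = True
        if not moved:              # no process can take a step
            return all(pc[p] == len(procs[p]) for p in range(len(procs)))

# Head-to-head exchange: deadlocks without buffering, succeeds with it.
head_to_head = [[("send", 1), ("recv", 1)], [("send", 0), ("recv", 0)]]
```

In this model the same program is correct or deadlocked depending solely on whether the runtime chose to buffer the sends, which is exactly why the send-buffering decision must be part of the verification search.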
A brute-force approach is to verify the program under every combination of
buffered and unbuffered sends. This is prohibitively expensive since programs
in general contain a large number of sends. The POEMSE algorithm (described in
Chapter 7) uses
the InterHB and IntraHB analysis to find the minimal send sets that must be
buffered in order to find the new behaviors due to buffering. In addition to this, it
also finds all the minimal sets of sends to be buffered that can cause a new behavior
and generates an interleaving for every such minimal set. Our experimental results
show that the POEMSE algorithm performs well in practice and its verification
time is only slightly more than that of the POE algorithm.
1.4.6 The ISP Tool
The above algorithms are nonintrusive and reliable, and can be used as the
basis for creating widely usable tools. These facts are clearly brought out through
the ISP tool that is also one of the major contributions of this dissertation. The
implementation of ISP is described in Chapter 8.
The impact that our work has had is now described in the next section.
1.5 Impact of This Dissertation
• Our publications include [60, 59, 46, 58, 66, 62, 1, 67].
• We have built and released a dynamic formal verification tool for MPI C
programs called In-Situ Partial Order Analysis (ISP [24]). While several
students have contributed toward ISP (for which we are very grateful), this
dissertation provided the core ideas as well as much of the implementation.
There have been over 200 worldwide downloads of ISP.
• We held half-day tutorials featuring ISP at ICS (May’09, [13]), FM09 (Nov’09,
[14]), and PPoPP (Jan’10, [15]).
• We presented a full-day invited tutorial featuring ISP at EuroPVM/MPI
(Sep’09, [16]).
• Our NSF REU undergraduates Alan Humphrey and Chris Derrick built the
Graphical Explorer for Message Passing (GEM [23]). Figure 1.2 presents a
snapshot of GEM’s user interface. GEM was officially accepted as part of the
PTP 3.0 version in 12/09.
• Our NSF REU undergraduates Sawaya and Atzeni built a Concurrency Education website [6]
containing all examples from a popular MPI textbook [38] for teaching MPI
using ISP.
1.6 Related Work
The area of formal verification has been successfully applied to many applica-
tions that include applications in telecommunication software design (e.g., [18]),
aerospace software (e.g., [65]), device driver design (e.g., [3]) and operating sys-
tem kernels (e.g., [35]). The use of formal methods for HPC software design,
and in particular to MPI-based parallel/distributed program design, has found an
increasing level of activity in the recent years. The earliest use of model checking
in this area is by Matlin et al. [32], who used the SPIN model checker [18] to
Figure 1.2. GEM Front-end
verify parts of the MPD process manager. Subsequently, Siegel and Avrunin used
model checking to verify MPI programs that employ a limited set of two-sided MPI
communication primitives [51]. Siegel subsequently published several techniques
for efficiently analyzing MPI programs [52, 48, 2].
Siegel provides a state model for MPI programs and describes how the state
model is incorporated into MPI SPIN [52, 48, 2]. Deadlock properties for de-
terministic programs when the programs have no wildcard receives are proved in
[52]. Siegel later proposed the “Urgent” algorithm to check for deadlocks in MPI
programs with wildcard receives [48].
The “Urgent” algorithm is defined only for blocking (synchronous) mode receives,
and it is not clear how the algorithm can be extended to nonblocking mode receives.
Also, the “Urgent” algorithm does an exponential search on all the sends for every
buffering possibility, which can be expensive.
Some of the earlier publications of the “Utah Verification” group in this area
pertained to the use of model checking to analyze MPI programs [39, 41], an
executable formal specification of MPI [40, 29] and an efficient model checking
algorithm for MPI [42]. One difficulty in model checking is the need to create an
accurate model of the program being verified. This step is tedious and error prone.
If the model itself is not accurate, the verification will not be accurate. To avoid
this problem, an in-situ model checker ISP was first developed in [44] which dealt
with MPI one-sided communication. Techniques to enhance the efficiency of this
algorithm were reported in [61]. Our recent work [60, 59, 58, 66, 62] introduces
the POE algorithm which is implemented into our tool ISP. We also employed the
POE algorithm for detecting the presence of functionally irrelevant barriers in MPI
programs [46].
Other research groups have approached the formal verification of MPI programs
through schedule perturbation [68], data flow analysis [56] and by detecting bug
patterns [45]. A survey of MPI-related tools and debuggers can be found in [47].
CHAPTER 2
BACKGROUND
This chapter provides a basic introduction to Message Passing Interface (MPI)
along with a detailed description of a small set of MPI functions in Section 2.1.
A detailed description of the classic Dynamic Partial Order Reduction (DPOR)
is provided in Section 2.2. Section 2.3 describes various issues that arise when
classical DPOR is directly applied to MPI. The initial impetus for our work was
the inapplicability of classical DPOR to MPI. This led to a thorough formalization
of MPI and a new understanding of how to handle many aspects of MPI based
on a single unifying formalism: a new Happens-Before order for MPI. Using the
techniques in this dissertation, we can analyze MPI programs that allow out of
order message matching, have collective operations, and whose behavior can alter
significantly in a resource-dependent manner.
2.1 Message Passing Interface (MPI)
This section provides a basic introduction to MPI and four MPI functions:
MPI_Isend, MPI_Irecv, MPI_Wait, MPI_Barrier in English. The reason we chose
these functions is that a thorough understanding of over 60 MPI functions can
be obtained by studying just these four functions. This section is not intended
to replace the MPI standard and the readers are encouraged to read the MPI
standard [33] for a more extensive introduction to the above functions. Most of
the dissertation only deals with these four MPI functions to keep the formal model
simple. The formal model can be easily extended to handle more MPI functions
(Chapter 8). ISP, the tool that implements our formal model, handles over 60 of the
most frequently used MPI functions. A formal notation for the MPI functions introduced
in this section is provided in Section 3.1.
Most MPI programs have two or more processes communicating through MPI
functions. All the processes have MPI process ids called ranks ∈ N0 = {0, 1, . . .}
that range from 0 . . . n − 1 for n processes. In addition to the processes, the
MPI execution environment also has an MPI daemon process which we call MPI
Runtime. The MPI library that implements the MPI standard is a part of the
MPI runtime. The processes issue the MPI functions into the MPI runtime. By
“issue” we mean that the MPI function call is invoked by the MPI process. The
MPI runtime keeps track of the MPI functions issued by the processes, matches
the MPI functions across processes and transfers data across processes according
to the MPI standard. The MPI runtime hence forms the critical component of the
MPI execution environment.
Every MPI process starts execution by issuing MPI_Init (argc, argv). A
process cannot issue any other MPI function unless it issues an MPI_Init. Every
MPI process that issues an MPI_Init must also issue an MPI_Finalize eventually.
No further MPI functions can be issued by a process once it issues an MPI_Finalize
except for MPI_Finalized which checks if an MPI_Finalize has been invoked.
MPI_Finalized is a local process action and we ignore any MPI_Finalized call
executed by a program. We assume that all the examples provided
in this dissertation always implicitly issue an MPI_Init at the beginning and an
MPI_Finalize at the end of an execution and do not explicitly show them in any
of the examples. We also assume that all processes are single threaded.
Every MPI function will attain the following states during its lifetime in the
MPI runtime:
• issued : The MPI function has been issued into the MPI runtime.
• returned : The MPI function call has returned and the process that issued
the function can continue executing.
• matched: Since most MPI functions usually work in a group (for example, an
MPI_Isend from one process will be matched with a corresponding MPI_Irecv
from another process), an MPI function is considered matched when the MPI
runtime is able to match various MPI functions into a group which we call a
match-set. All the MPI function calls in the match-set will be considered as
having attained the matched state.
• complete: An MPI function can be considered to be complete by the MPI
process that issued the MPI function when all visible memory effects have
occurred (e.g., when an MPI_Isend is buffered by the MPI runtime, the
MPI_Isend can be considered as complete when the message buffer has been
copied out from the process memory space into the runtime memory space).
We adapt the “complete” state from the MPI standard which applies for
MPI_Isend and MPI_Irecv and extend it to MPI_Wait and MPI_Barrier
trivially to keep the state model consistent.
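One way to make the intent of these states concrete is a small checker, a Python sketch of our own (not part of ISP or the MPI standard), that validates an event history for a single unbuffered send against the lifecycle above: the function must be issued before anything else, and an unbuffered send can complete only after it has matched.

```python
def legal_unbuffered_send_history(events):
    """Check a sequence of lifecycle events ("issued", "returned", "matched",
    "complete") for one unbuffered MPI_Isend against our toy state model:
    "issued" must come first, and "complete" requires a prior "matched"."""
    seen = set()
    for e in events:
        if e != "issued" and "issued" not in seen:
            return False        # nothing may precede the issue
        if e == "complete" and "matched" not in seen:
            return False        # no buffering: completion needs a match
        seen.add(e)
    return True
```

A buffered send would relax the second condition, since the message can be copied into runtime buffer space before any match occurs.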
The semantics of an MPI program are determined by the order in which sends
and receives are allowed to match. Matching does not imply that data transfer
has occurred; it is simply a commitment on part of sends and receives to ‘pair
with each other.’ Completion is mainly of local significance. An MPI_Isend that
completes allows the MPI_Wait operation waiting on it to return. This allows later
operations (coming after MPI_Wait in program order) to begin matching. This
is how completion indirectly affects matching – a crucial aspect of MPI behavior
that gets modulated by the amount of runtime buffering. More specifically, early
completion is possible in buffer resource endowed systems that provide higher eager
limits [21]. While the modulation of message passing behavior by the amount of
buffering is a well-known result [31, 48], our dissertation provides the first efficient
analysis of where such buffering matters – based on MPI Happens-Before.
We now describe the MPI functions MPI_Isend, MPI_Irecv, MPI_Barrier, and
MPI_Wait.
2.1.1 MPI Isend
MPI_Isend is a nonblocking send that has the following prototype:
MPI_Isend (void *buff, int count, MPI_Datatype datatype, int dest,
           int tag, MPI_Comm comm, MPI_Request *handle);
where buff is the starting address of the data buffer that needs to be transferred
to the receiving side, datatype is the abstract type of the data in buff, count is
the number of elements of type datatype in buff, dest is the destination process
rank where the message is to be sent, tag is the message tag, and comm is the
MPI communicator. Tags and communicators provide fine grained communication
across processes. For simplicity, we abstract away the tags and communicators.
Chapter 8 provides a detailed description on communicators and tags and how our
formal model can be extended to handle them. The handle is set by the MPI
runtime and uniquely identifies the MPI_Isend in the MPI runtime.
Notation: We denote MPI_Isend by S.
The function call to S may return immediately (nonblocking) while the actual
send can happen at a later time. An S is considered complete by the process
issuing it if the data from buff is copied out. buff can be either copied out into
the MPI runtime provided buffer or to the buffer space of the MPI process receiving
this message. Buffer availability in the MPI runtime depends on a configuration
parameter called eager limit. An S issued by a process may be buffered if the message
size is below the eager limit. However, there is no guarantee that an S with a small
message size will always be buffered by the runtime. If the MPI runtime buffer is
available, the S can be completed immediately by the MPI runtime. Otherwise, the
S can be completed by the runtime only after it is matched with a receive operation
issued by the dest process and the data is copied from buff to the receiving buffer
space. It is illegal for the MPI process to reuse the send buffer (buff) before the
send is completed. The completion of a send is detected by the process issuing it
using MPI_Wait.
We use S̄ to denote a buffered send and S to denote a send with no runtime
buffering.
2.1.2 MPI Irecv
MPI_Irecv is a nonblocking receive with the following prototype:
MPI_Irecv (void *buff, int count, MPI_Datatype datatype, int src,
int tag, MPI_Comm comm, MPI_Request *handle);
where buff is the starting address of the memory where the data is to be received,
count, datatype have the same semantics as described for MPI_Isend, and src is
the rank of the process from where the message is to be received. The src can also
be MPI_ANY_SOURCE which indicates that the receive can be matched with an S from
any process when S’s dest is the same as the receiving process rank. It is customary
to call receives with src set to MPI_ANY_SOURCE as wildcard receives and for ease
of notation we denote MPI_ANY_SOURCE as ‘*’. The data is received into buff and
handle is returned by the MPI runtime which uniquely identifies the receive in the
MPI runtime.
Notation: We denote an MPI_Irecv by R.
The function call to an R may return immediately and is considered complete
when all the data is copied into buff. It is illegal to reuse buff before the receive
completes. The completion of a receive is detected by the process using MPI_Wait.
2.1.3 MPI Wait
MPI_Wait is a blocking call and is used to detect the completion of a send (S)
or a receive (R) and has the following prototype:
MPI_Wait (MPI_Request *handle, MPI_Status *status),
where handle is returned in an S or an R and status describes the status of the
S or R corresponding to handle.
Notation: We denote an MPI_Wait by W .
The MPI runtime blocks the call to W until the send or receive is complete. The
MPI runtime resources associated with the handle are freed when a W returns and
handle is set to a special field called MPI_REQUEST_NULL. A W call with handle
set to MPI_REQUEST_NULL is ignored by the MPI runtime. An S or R without an
eventual W is considered a resource leak.
2.1.4 MPI Barrier
A barrier call has prototype MPI_Barrier (MPI_Comm comm).
Notation: We denote MPI_Barrier by B.
B is a blocking function and is used to synchronize MPI processes.
A process blocks after issuing the barrier until all the participating processes
with the same comm also issue their respective barriers. Note that unlike the
traditional barriers used in threads where all the instructions before the thread
barrier must also be complete when the barrier returns, the MPI B does not provide
any such guarantees. An MPI B can be considered as a weak fence instruction.
2.1.5 MPI Ordering Guarantees
The ordering guarantees provided by the MPI runtime according to the MPI
standard define the order in which MPI program execution proceeds. MPI requires
all MPI library implementations (i.e., MPI runtime) to provide the following FIFO
ordering guarantees:
• For any two sends Sj and Sk, j < k from the same process i (i.e., Sj is issued
before Sk by process i) targeting the same destination (say process rank l),
the earlier send Sj is always matched with a receive before the later send Sk.
Note that this order is irrespective of the buffering status of the sends, i.e.,
the sends Sj and Sk can complete out-of-order. Consider the MPI execution in
Figure 2.1(a). Pi and Pl are two processes with ranks i and l, respectively. Pi
issues two sends to process l, S1 and S2, respectively, where S1 sends a million
bytes in data buffer d1 while S2 sends 10 bytes in data buffer d2. The W3(2)
is the W corresponding to S2 and W4(1) is the wait operation corresponding
to S1. Since S2 only sends 10 bytes, it is possible that S2 is provided MPI
runtime buffer and hence complete before S1 which is completed only after
the million bytes of data is copied into the d1 of R1. The solid directed edge
between S1 and S2 shows that S1 will be matched before S2. Hence, even if
S2 completes before S1, S1 is always matched with the first matching R1
(shown by the dotted line between S1 and R1).
• For any two receives Rj, Rk, j < k from the same process l receiving from
the same source (say i), the earlier receive Rj is always matched with a send
before the later receive Rk. Note that the receives can complete out-of-order.
Figure 2.1(a) shows two MPI receive operations of Pl: R1 and R2 that receive
messages from Pi. R1 is matched before R2 (shown as a solid directed edge
from R1 to R2). Since R2 receives only 10 bytes, it is possible that W3(2)
unblocks immediately since R2 has received the data while W4(1) which
corresponds to R1 remains blocked until R1 completes.
• For any two receives Rj, Rk, j < k from the same process l, when the first
receive Rj can receive from any source (called wildcard receive), the first
receive Rj is always matched with a send before the later receive Rk. This
scenario is depicted in Figure 2.1(b).
• For any two receives Rj, Rk, j < k from the same process l, where Rj is a
nonwildcard receive and Rk is a wildcard receive, Rj is matched before Rk
only when a matching send is available. Otherwise, Rk can be matched with
a send before Rj. In a sense, Rk(∗) has the ability to “reach over” Rj and
match. We call this behavior conditional-matches-before. This scenario is
shown in Figure 2.1(c). R1 receives a message from Pm’s S1. Since there is a
matching S1 of Pm available, R1 is matched before R2. However, if Pm did
not have S1 available, then R2 can match before R1 and Pl will block on
W4(1) until a send from Pm is available. Since the matching is dependent on
the availability of a matching S, we call this “conditional matches-before”
ordering.
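The rules above can be summarized in a small executable model, a Python sketch under our own simplified encoding (it ignores tags and communicators): receives are listed by source rank in program order, '*' stands for MPI_ANY_SOURCE, and sends_available is the set of ranks that currently have a posted send.

```python
def matchable_now(recvs, sends_available):
    """Indices of the receives (in program order) that the runtime may
    legally match next, under the FIFO ordering guarantees."""
    matchable = []
    for j, src in enumerate(recvs):
        earlier = recvs[:j]
        if "*" in earlier:
            break                        # an earlier wildcard always matches first
        if src == "*":
            # conditional-matches-before: a wildcard may "reach over" earlier
            # receives only when none of them has a matching send available
            earlier_can_match = any(s in sends_available for s in earlier)
            if sends_available and not earlier_can_match:
                matchable.append(j)
        elif src in sends_available and src not in earlier:
            matchable.append(j)          # same-source receives match in FIFO order
    return matchable
```

Running the model on the scenario of Figure 2.1(c): with recvs = [m, '*'] and a send from m available, only the first receive may match; with the send from m absent but some other send posted, the wildcard reaches over it.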
The MPI standard requires that two messages sent from process i towards
process l must be matched in their issue order (called nonovertaking in MPI). The role
(a) S and R Ordering (b) Wildcard Receive Ordering
(c) Conditional Ordering
Figure 2.1. MPI Ordering Guarantees
of matching order guarantee is to constrain the order in which sends and receives
match so as to guarantee nonovertaking. Notice that the sends and receives need
only to be matched in order. However, since the completion is only detected by a
W , the MPI standard does not enforce any order on the completion of the sends
and receives, leaving the choice to the MPI library implementation. Also note that
all the orders are defined on MPI functions within a process (i.e., all these are Intra
orders).
2.2 Dynamic Partial Order Reduction (DPOR)
Dynamic Partial Order Reduction (DPOR) [9] dynamically tracks various inter-
actions between threads/processes and generates only the Mazurkiewicz traces [5]
(called relevant interleavings henceforth). This is done by identifying the
backtracking points in the interleavings and updating the backtrack sets dynamically
so that, by the end of the execution, persistent [9, 10] sets have been formed at every
such point.
In multithreaded programming, the most common bugs are deadlocks and data
races. Deadlocks arise due to improper lock and unlock operations on mutexes
and data races occur when a shared memory location is accessed concurrently by two
or more threads, at least one access being a write. Deadlocks and data races can be
notoriously hard to debug. Many approaches have been proposed to discover data races
and deadlocks in programs [8, 36].
However, applying the DPOR algorithm is the only method that can guarantee cov-
erage. We describe the classical DPOR algorithm in the context of multithreaded
programs in this section. Note that the DPOR algorithm will only help generate
interleavings. A more sophisticated analysis on the interleavings generated may be
required to actually detect the bugs. For example, detecting data races will require
a lock-set or Happens-Before analysis on the interleavings to actually detect the
presence of the data race in the interleaving. We limit this section to describe how
DPOR can be used to generate interleavings only.
Let σi denote a state. A state is identified by the values assumed by the variables
in that state. Let σ0 be the initial state where the values assumed by all variables are
⊥ (undefined). Let enabled(σi) be the set of program instructions (called transitions
henceforth) that can be executed in σi. Let backtrack(σi) ⊆ enabled(σi) be the
backtrack points that denote the transitions that must be executed from σi in
order to explore all relevant interleavings. An interleaving I is shown as
σ0 −t0→ σ1 −t1→ · · · −tn−1→ σn, where σ0 is the start state and σn is the terminating
state. σi −ti→ σi+1 is a state transition in I from σi to σi+1 when transition ti is
executed from σi. proc(ti) denotes the process or thread executing the transition
ti. When backtrack(σi) = enabled(σi) for every state σi, then the entire state space
is explored.
The DPOR algorithm works by identifying the backtrack points based on two
notions:
1. Co-enabledness of transitions.
2. Dependence between transitions.
Definition 2.1 A transition t1 is co-enabled with a transition t2 if there exists
some state σi such that t1, t2 ∈ enabled(σi) [9].
Definition 2.2 Two transitions t1, t2 are independent implies that the following
properties hold for all states σi:
1. if t1 ∈ enabled(σi) and σi −t1→ σj, then t2 ∈ enabled(σi) iff t2 ∈ enabled(σj).
2. if t1, t2 ∈ enabled(σi), then there is a unique state σj such that σi −t1t2→ σj and
σi −t2t1→ σj.
Otherwise, the transitions are dependent. [9]
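As a quick sanity check of Definition 2.2 (using our own toy Python encoding, not taken from [9]): for two lock(l) transitions on the same mutex, executing one disables the other, so property 1 fails and the two transitions are dependent.

```python
def step_lock(state, thread):
    """Execute thread's lock(l); only legal when the mutex is free."""
    assert state["owner"] is None
    return {"owner": thread}

def lock_enabled(state):
    """A lock(l) transition is enabled exactly when no thread holds l."""
    return state["owner"] is None

# From a state where l is free, both p1's and p2's lock(l) are enabled.
free = {"owner": None}
after_p1 = step_lock(free, "p1")
# Property 1 of Definition 2.2 would require p2's lock to stay enabled
# across p1's lock; it does not, so the two lock transitions are dependent.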
Independent transitions are also known as commuting transitions. DPOR re-
quires that the transition dependence and co-enabledness are correctly or conserva-
tively identified. Two lock transitions on the same mutex by different threads are
dependent whereas locks on different mutexes are independent. Also, a write access
and read/write access to a shared variable by two different threads are dependent
whereas accesses to distinct shared variables are independent. Though identi-
fying dependence/independence between transitions is straightforward, detecting
co-enabledness can be more involved. For multithreaded programs, dependent tran-
sitions are always conservatively considered co-enabled. The conservative approach
can cause redundant interleavings but will not affect the correctness or completeness.
As in classical partial order reduction, only dependent transitions can cause the
exploration of new states. We now describe how the DPOR algorithm fills the
backtrack sets after generating an interleaving σ0 −t0→ σ1 −t1→ · · · −tn−1→ σn. We only
provide a simple description of the DPOR algorithm. For more details, readers are
encouraged to read [9].
1. The DPOR algorithm first generates an interleaving I = σ0 −t0→ σ1 −t1→ · · · −tn−1→
σn.
2. The algorithm maintains a stack of the states generated in an interleaving
where σn is at the top of the stack.
3. For every state σi, the backtrack sets are updated (goto step 7).
4. Pop the states out of the stack until a state where backtrack(σi) ≠ ∅ is found.
If the stack is empty, then there are no more interleavings to be explored.
Hence, exit. Otherwise, goto step 5.
5. Restart all the processes and regenerate the interleaving by executing t0 . . . ti−1
until σi is reached. Now explore the transitions in backtrack(σi) and generate
an interleaving.
6. goto step 2.
7. For the transition ti executed from σi in I, find the set T ⊆ {t0 . . . ti−1} of
transitions in I such that every transition in T is dependent and may be
co-enabled with ti.
8. Find the transition tj ∈ T such that j ≥ k for all tk ∈ T .
9. Update backtrack(σj) with proc(ti)’s transition in enabled(σj). If no such
transition exists, then let backtrack(σj) = enabled(σj).
10. goto step 4.
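Steps 7 through 9, the backtrack-set update, can be sketched as follows (a Python sketch with our own hypothetical helper signatures; dependent, co_enabled, and proc are supplied as callbacks):

```python
def update_backtracks(trace, backtrack, enabled, dependent, co_enabled, proc):
    """Fill backtrack sets for one completed interleaving (steps 7-9).
    trace[i] is the transition executed from state i; backtrack and
    enabled map a state index to a set of transitions."""
    for i, ti in enumerate(trace):
        # step 7: earlier transitions dependent and possibly co-enabled with ti
        T = [j for j in range(i)
             if dependent(trace[j], ti) and co_enabled(trace[j], ti)]
        if not T:
            continue
        j = max(T)                          # step 8: the latest such transition
        mine = {t for t in enabled[j] if proc(t) == proc(ti)}
        # step 9: add proc(ti)'s enabled transition, else the whole enabled set
        backtrack[j] |= mine if mine else enabled[j]
```

Applied to the interleaving of Figure 2.3, where only the two lock operations are considered dependent, the only dependent pair is t0 and t4, so only backtrack(σ0) is updated, and it receives p2's lock, matching Figure 2.4.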
We now explain the DPOR algorithm through an illustrative example.
2.2.1 DPOR Illustration
Consider the multithreaded program execution of two threads p1 and p2 shown
in Figure 2.2.
Variables x and y are shared variables. The only two possible final states of the
thread executions are x = 2, y = 1 and x = 3, y = 1. We now illustrate the
p1 : lock(l); x = 1; x = 2; unlock(l);
p2 : lock(l); y = 1; x = 3; unlock(l);
Figure 2.2. Example Thread Execution
execution of the DPOR algorithm for the example in Figure 2.2 when only the lock
operations are considered dependent. Figure 2.3 shows the generation of the first
interleaving. The very first interleaving is generated by arbitrarily selecting some
transition in enabled(σi) for all states. The program terminates (or deadlocks)
when enabled(σn) = ∅. In the example, the initial state σ0 has the variables x, y set
to ⊥ and enabled(σ0) has the lock(l) operations of p1 and p2. Note that for state σ1,
Figure 2.3. DPOR Illustration: Initial Interleaving
enabled(σ1) does not contain p2’s lock operation because p2’s lock instruction
remains disabled until p1 eventually unlocks (state σ4). The execution proceeds
until the final state σ8 is reached. At this point, the DPOR algorithm updates
the backtrack sets for all the states. Note that the only dependence is between
transitions t4 and t0, as shown in Figure 2.4. Once the dependence is recognized,
the backtrack set of σ0 (shown in bold font) is updated with p2’s transition enabled
Figure 2.4. DPOR Illustration: Updating Backtrack Set
in σ0 which is p2 : lock(l). No other backtrack sets are updated. The DPOR
algorithm now pops all the states from σ8 through σ1 since all their backtrack sets
are empty. A new execution is restarted from σ0 with p2’s lock operation and will
result in a new final state where x = 2 and y = 1.
The DPOR algorithm works for multithreaded programs but will fail when
applied to MPI programs. Multithreaded programs are guaranteed sequential
consistency for the atomic lock and unlock operations which also behave as strong
fence instructions. However, MPI does not provide any such guarantees. The
following section illustrates these issues in more detail.
2.3 Applying DPOR to MPI : Issues
We now describe the issues in applying DPOR to MPI programs through illustrative
examples. Consider the example program shown in Figure 2.5.
The example in Figure 2.5 shows the dynamic execution of three MPI processes
P0, P1 and P2. Process P0 issues an MPI_Isend (shown as S0) to dest = P1 with
the buffer d0 having the value 0. Similarly, P2 issues a send (S2) to dest = P1 with
the buffer d2 as 2. Process P1 issues an MPI_Irecv R1 which is a wildcard receive
(src = ∗) and with the receiving buffer d1. (Note that the MPI functions have
only the arguments necessary to explain the examples.) The wildcard receive R1
can receive data from either S0 or S2. However, P1 has an error when R1 receives
from S2. In order to discover this error, the program must enter into a state where
d1 is 2. The goal is to explore all the possible nondeterminism due to wildcard
receives. That is, it is necessary to match a wildcard receive with all the possible
P0: S0(P1, d0 = 0);
P1: R1(∗, d1); if (d1 == 2) error;
P2: S2(P1, d2 = 2);
Figure 2.5. Simple MPI Example
sends. Since DPOR helps explore all relevant interleavings, we apply DPOR to the
above simple MPI program, as shown in Figure 2.6.
The first interleaving generated is shown in Figure 2.6(a). In this interleaving,
S0 is issued first, followed by R1 and finally, S2 is issued. The value of d1 is 0 as
expected since S0 is issued early. The dependence between the two sends matching
the same wildcard receive R1 causes the backtrack set of σ0 to be updated with
S2. Therefore, in the next interleaving, S2 will be issued earlier instead of S0
so that R1 receives from S2 instead of S0. The second interleaving is shown in
Figure 2.6(b). However, note that d1 can still receive a value of 0 (shown in bold
font) instead of 2, contrary to what is expected. This happens because the issue order
of the sends does not determine the matching order when the sends are from different
processes. The MPI runtime decides the matching between the sends and receives.
In [68], the authors present how skewed MPI runtime matches are in real world
MPI execution environments. Unfortunately, the authors’ solution in [68] is both
highly wasteful (it introduces random delays in the main computational path) and
(a) Update Backtrack for First Interleaving (b) Surprising Result with DPOR
Figure 2.6. Illustration of Surprising MPI Runtime Behavior with DPOR
still does not guarantee that the offending message matches will be enforced. Since
Jitterbug causes exponentially more useless schedule perturbations than useful
ones, the designer cannot expect the method (even ignoring its slowdown of
operations) to yield benefits.
Therefore, even when S2 is issued earlier, it is possible for the MPI runtime to
match the later issued send S0 with R1. Hence, it is possible that the error is never
caught since DPOR erroneously assumes that all nondeterministic code paths are
explored.
DPOR, when applied to multithreaded programs, depends on the fact that the
underlying cache coherence protocol ensures that, when only one process executes
an instruction at a time, multiple writes to a shared variable happen in issue
order. This is not true for MPI. Therefore, applying classic DPOR to the
example in Figure 2.5 can result in a bug omission.
However, let us assume that it is possible to implement an MPI runtime that has
verification support so that the MPI runtime matches the sends with receives in the
order they are issued. This should solve the problem with DPOR. Unfortunately,
this is not sufficient. MPI programs can exhibit complex behaviors in which a send
cannot be issued earlier even though it can match a wildcard receive. We illustrate
this with our “Crooked Barrier” example shown in
Figure 2.7.
In the example shown in Figure 2.7, B0, B1 and B2 are the matching MPI_Barrier
operations issued by processes P0, P1, P2, respectively. Note that the barrier B0 is
issued after S0 is issued. Processes P1 and P2 are blocked at their barriers B1 and
B2 until P0 issues its barrier.

P0: S0(P1, d0 = 0); B0;
P1: B1; R1(∗, d1); if (d1 == 2) error
P2: B2; S2(P1, d2 = 2);

Figure 2.7. Crooked Barrier Example

Once P0’s barrier is issued, all the barriers unblock
so that R1 and S2 are issued. The verification-based MPI runtime matches S0 with
R1 since S0 is issued before S2. However, it is possible for S2 to also match with R1.
In order to accomplish this, it is necessary for S2 to be issued into the MPI runtime
before S0. This is impossible because B0 cannot be issued unless S0 is issued.
Hence, in any execution of the program in Figure 2.7, S0 is always issued before S2.
Thus, even with a verification-based MPI runtime, it will not be possible to explore
all nondeterministic code paths in an MPI program. The example in Figure 2.7
clearly illustrates that the DPOR algorithm for the multithreaded programs cannot
be applied as is to MPI programs.
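The impossibility argument for the crooked barrier can be checked mechanically. The sketch below (Python; the issue-order DAG is a hand-built abstraction of Figure 2.7, not the output of any tool) enumerates every linearization of the issue-order constraints and confirms that S0 is issued before S2 in all of them:

```python
from itertools import permutations

ops = ["S0", "B0", "B1", "B2", "R1", "S2"]
# a -> b: a must be issued before b (program order within a process, plus
# the rule that R1 and S2 issue only after every barrier has been issued).
before = {("S0", "B0"),
          ("B0", "R1"), ("B1", "R1"), ("B2", "R1"),
          ("B0", "S2"), ("B1", "S2"), ("B2", "S2")}

def legal(order):
    pos = {op: k for k, op in enumerate(order)}
    return all(pos[a] < pos[b] for a, b in before)

schedules = [p for p in permutations(ops) if legal(p)]
assert schedules                                      # some execution exists
# In every legal issue order, S0 precedes S2 ...
assert all(s.index("S0") < s.index("S2") for s in schedules)
# ... yet the runtime may still match R1 with either S0 or S2.
```

So a runtime that faithfully matches in issue order can never produce the S2 match on its own; the receive itself has to be rewritten, which is the role of the dynamic determinization of receives contributed by this dissertation.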
We need specialized formal verification tools for important domains, each of
which has its own computational model. This dissertation discovers and presents
the MPI computational model uniformly in terms of a Happens-Before relation
between MPI operations.
The DPOR algorithms developed in this dissertation do not require any changes
to MPI programs or the MPI library implementations to support our dynamic
verification algorithms. We enforce the requisite matches during replays of our
DPOR for MPI by dynamically determinizing MPI receive operations. This way, we
can “fire and forget” MPI receives fully knowing that they will match the intended
sends. In a nutshell, this dissertation contributes two key ideas in this area:
• Dynamically determine all send matches: This is a guaranteed algo-
rithm that will find all senders that can ever match a wildcard receive.
• Replay over all dynamic rewrites: We dynamically replay the execution
for each sender by dynamically rewriting the receive to match that sender.
ISP employs similar techniques for handling other sources of nondeterminism,
such as that introduced by MPI_Iprobe.
CHAPTER 3
MPI FORMAL MODEL
This chapter presents a formal transition system for MPI (Section 3.2). To keep
the formal model simple, this chapter assumes that all the sends are unbuffered,
i.e., none of the sends are provided with any runtime buffering. Buffered sends are
dealt with in Chapter 7. We illustrate the application of our formal model to a
small MPI program in Section 3.3. Section 3.4 illustrates why the classical DPOR
is still not directly adaptable to the MPI transitions described in this chapter.
3.1 Formal Transition System for MPI
Let N0 denote {0, 1, . . .} and let N denote {1, 2, . . .}. As in set theory, we often
write k ∈ n to mean k ∈ {0, . . . , n − 1}.

Consider an MPI program execution with PID ∈ N MPI processes, each denoted
by Pi for i ∈ PID. View each Pi as a sequence. Thus, Pi,j can be regarded as the
jth member of the sequence Pi, denoting the jth MPI operation issued by the ith
MPI process. Sequence Pi is of length |Pi|. We assume that our MPI programs
terminate; therefore, execution sequences are finite.
Let Op denote the set of all MPI operations. An MPI operation belonging to
Op is one of these:
1. Si,j(k) for i, k ∈ PID and j ∈ |Pi|. This is the MPI call MPI_Isend(to:k)
issued as the jth call by MPI process i.
2. Ri,j(k) for i, k ∈ PID and j ∈ |Pi|. This is the MPI call MPI_Irecv(from:k)
issued as the jth call by MPI process i.
3. Ri,j(∗) for i ∈ PID and j ∈ |Pi|. This is the MPI call MPI_Irecv(MPI_ANY_SOURCE)
issued as the jth call by MPI process i.
4. Wi,j′(hi,j) for i ∈ PID and j, j′ ∈ |Pi| and j < j′. This is the MPI call
MPI_Wait(handle) where handle is the wait handle returned by an earlier
issued Si,j(k), Ri,j(k), or Ri,j(∗).
5. Bi,j for i ∈ PID and j ∈ |Pi|. This is the MPI call MPI_Barrier.
Recognizers for members of Op:
• isS(Fi,j) is true when F = S and false otherwise.
• isR(Fi,j) is true when F = R and false otherwise.
• isW (Fi,j) is true when F = W and false otherwise.
• isB(Fi,j) is true when F = B and false otherwise.
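A direct encoding of Op and its recognizers can be sketched as follows (Python; the record layout is an illustrative assumption, not ISP's actual data structure):

```python
from collections import namedtuple

# An MPI operation F_{i,j}(arg): kind in {"S", "R", "W", "B"}, issued by
# process i as its j-th call; arg is the destination/source rank, "*" for a
# wildcard receive, or the handle of an earlier call for a wait.
Op = namedtuple("Op", ["kind", "i", "j", "arg"])

def isS(op): return op.kind == "S"
def isR(op): return op.kind == "R"
def isW(op): return op.kind == "W"
def isB(op): return op.kind == "B"

s01 = Op("S", 0, 1, 1)       # S_{0,1}(1): MPI_Isend to rank 1
r11 = Op("R", 1, 1, "*")     # R_{1,1}(*): wildcard MPI_Irecv
w02 = Op("W", 0, 2, s01)     # W_{0,2}(h_{0,1}): wait on the earlier send
assert isS(s01) and isR(r11) and isW(w02) and not isB(r11)
```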
3.1.1 State Model
Every MPI function is in one of the following four execution states:
• issued (I): The MPI function has been issued into the MPI runtime.
• returned (R): The MPI function has returned and the process that issued this
function can continue executing.
• matched (M): Since most MPI functions usually work in a group (for example,
an S from one process will be matched with a corresponding R from another
process), an MPI function is considered matched when the MPI runtime is able
to match the various MPI functions into a group which we call a match-set.
All the function calls in the match-set will be considered as having attained
the matched state.
• complete (C): An MPI function is considered complete, from the viewpoint
of the MPI process that issued it, when all of its visible memory effects have
occurred (e.g., if the MPI runtime has sufficient buffering, an MPI_Isend can
be considered complete when its memory buffer has been copied out into the
runtime buffer). The completion condition differs across MPI functions (e.g.,
the R matching an S may not have seen the data yet, and still the S can
complete on the sender side when the send is buffered).
3.1.2 The State of an MPI Execution
Our formal model does not model the actual MPI programs or the underlying
language semantics. Instead, we model dynamic execution sequences which are
presented by an “Oracle” that understands the underlying language semantics.
We communicate with the “Oracle” asking it for the next MPI operation that is
executed by a process. Hence, the formal model presented here abstracts away the
program variable values or local process states. Our formal model discovers the
various send and receive matches, which implicitly models the program data. Note
that the formal model directly applies to our ISP scheduler, which is aware only
of the dynamic MPI operations executed by a process; the rest of the program
instructions remain invisible to the scheduler. All values returned by an MPI
call (e.g., MPI receive data, status flags) are assumed to be available to this
Oracle, which can use them in conditional branches to decide the next MPI
operation.
The state of an MPI program execution is denoted by the record
{I : 2^Op, M : 2^(2^Op), C : 2^Op, R : 2^Op, pc : PID → N0}
or, more compactly, as the tuple
〈I, M, C, R, pc〉.
Here, I denotes those instructions that have been issued. Set R denotes those
instructions whose calls have returned to the calling process. Set M denotes those
calls that have matched, and C denotes calls that have completed. M will consist
of sets of matching MPI calls: either sets of the form {Si,j(k), Rk,l(i)}, containing
matching sends and receives, or {Bi,j | i ∈ PID, j ∈ |Pi|}, showing matching
barriers.
The initial state of our transition system, σ0, is
〈∅, ∅, ∅, ∅, λi.0〉.
Since every process starts execution with MPI_Init, Pi,0 is MPI_Init for i ∈
PID.
A transition moves the system from state σ to the next state σ′, and is written
σ −t→ σ′. The MPI execution system consists of process transitions and MPI
runtime transitions. The MPI transition system provided in this section is very
generic; an actual MPI runtime can follow any specific scheduling strategy
consistent with the transitions described here.
3.2 MPI Transition System
We are now ready to present the MPI transition system.
3.2.1 Process Transitions
The process transitions consist of issuing the visible MPI operations into the
MPI runtime. We have four process transitions for each of the MPI functions: PS,
PR, PW and PB. Let Σ be the reached states predicate.
The process transitions are defined using a rule that infers the new reached
state set Σ. For a process Pi let Curi denote the instruction being executed by Pi
at program counter pci in state 〈I,M, C, R, pc〉. The process transition for a S for
i ∈ PID is as follows:
PS :Σ(σ as 〈I,M, C, R, pc〉), isS(Curi)
Σ〈I ∪ {Curi}, M,C,R, pc〉
When process Pi has to issue a send Si,j(k) at its current program counter, the
state transition occurs by issuing the send into the MPI runtime which involves
updating the I set with Si,j(k). Note that except for the change in the runtime
state, there is no change in the local process state. Even though the send can
return immediately, the PS transition does not show any increment in the program
counter for Pi. This is because the MPI standard places no restriction on how
soon the send must return.
The process transition when the operation issued by Pi is Ri,j(k) is shown below.
PR :Σ(σ as 〈I,M, C, R, pc〉), isR(Curi)
Σ〈I ∪ {Curi}, M,C,R, pc〉
The PR transition is similar to the PS transition and the state transition involves
updating the I set of the runtime state with Curi.
PW :Σ(σ as 〈I,M, C, R, pc〉), isW (Curi)
Σ〈I ∪ {Curi}, M,C,R, pc〉
The PW transition shows the state transition when Pi issues Wi,j′(hi,j) where
hi,j refers to an earlier send Si,j or receive Ri,j. The state transition results in the
state with the I set updated with Curi.
Finally, the PB transition, shown below, applies when Pi issues a Barrier. The
state transition for PB is similar to the rest of the process transitions.
PB :Σ(σ as 〈I,M, C, R, pc〉), isB(Curi)
Σ〈I ∪ {Curi}, M,C,R, pc〉
3.2.2 MPI Runtime Book-keeping Sets
As the processes issue the MPI operations into the MPI runtime, at every state
σ, the MPI runtime also maintains certain book-keeping sets. These sets help the
runtime transitions follow the MPI ordering guarantees described in Section 2.1.5.
The book-keeping sets are defined for a state σ = 〈I, M, C, R, pc〉, and all of
them are subsets of I × I.
Definition 3.1 Nonovertake(σ as 〈I, M, C, R, pc〉) ⊆ I × I =
{〈Si,j(k), Si,j′(k)〉, 〈Ri,j(k), Ri,j′(k)〉, 〈Ri,j(∗), Ri,j′(k)〉, 〈Ri,j(∗), Ri,j′(∗)〉 | j < j′}.
The Nonovertake set, as the name suggests, tracks the nonovertaking sends and
receives that must be matched in a particular order. When a send is matched with
a receive, the send and the receive tuple will enter the M set to signify that a send
and receive have been matched. The Nonovertake set allows the sends/receives
to enter the M set in program order when the nonovertaking property must be
maintained according to the standard. For example, Si,j′ or Ri,j′ cannot enter
the M set before Si,j or Ri,j, respectively.
Definition 3.2 Resource(σ as 〈I,M, C, R, pc〉) ⊆ I × I =
{〈Si,j, Wi,j′(hi,j)〉, 〈Ri,j, Wi,j′(hi,j)〉 | j < j′}.
The Resource set tracks the completion order of a send (receive) and its
corresponding W. When an MPI function completes, the MPI runtime moves it
to the C set. However, the completion of Wi,j′ depends on when Si,j (Ri,j)
completes, i.e., when the send buffer is copied out of process Pi’s buffer space
(when Ri,j receives the data into its receive buffer). Wi,j′ can enter the C set
only after Si,j (Ri,j) enters the C set. We call this the Resource set because the
W frees the resources assigned to its handle.
Definition 3.3 Fence(σ as 〈I, M, C, R, pc〉) ⊆ I × I =
{〈Wi,j, Fi,j′〉, 〈Bi,j, Fi,j′〉 | j < j′, F ∈ {S, R, W, B}}.
The Fence set indicates that the blocking MPI functions W and B act as fences:
when a Wi,j or Bi,j is issued, no later MPI instruction Fi,j′ can be issued until
Wi,j (Bi,j) moves into the C set.
Definition 3.4 IntraHB(σ as 〈I, M, C, R, pc〉) ⊆ I × I is defined as follows:
IntraHB(σ) = Nonovertake(σ) ∪ Resource(σ) ∪ Fence(σ)
The IntraHB relation is used by the MPI runtime to move the MPI operations
into various sets (and hence cause state transitions). Note that the IntraHB is a
relation across the MPI operations issued by the same process. Hence, the name
Intra-HappensBefore. Before we present the full set of MPI Runtime transitions,
we define Ancestor, Descendant and Ready sets.
Definition 3.5 Ancestor(σ : state, y : op) = {x | 〈x, y〉 ∈ (IntraHB(σ))+}
Definition 3.6 Descendant(σ : state, x : op) = {y | 〈x, y〉 ∈ (IntraHB(σ))+}
Definition 3.7 Ready(σ as 〈I, M, C, R, pc〉) =
{x ∈ I | ∀y ∈ Ancestor(σ, x) : (¬isW(x) ⇒ ∃m ∈ M : y ∈ m) ∧ (isW(x) ⇒ y ∈ C)}.
The Ready set defines the set of MPI operations in I that are ready to be
matched, so that they can enter the M set when the matching MPI operations
are found. A non-W operation is in the Ready set when all of its ancestors have
been matched (i.e., are in the M set). A W operation is in the Ready set when
all of its ancestors are in the C set.
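Definitions 3.1 through 3.7 can be prototyped directly. The sketch below (Python; the tuple encoding of operations and the simplification that matched operations are not removed from the ready set are illustrative assumptions) builds IntraHB for a two-process program — S0,1(1); W0,2 against R1,1(0); W1,2 — and computes Ready:

```python
# Ops as (kind, i, j, arg).
S01 = ("S", 0, 1, 1); W02 = ("W", 0, 2, S01)
R11 = ("R", 1, 1, 0); W12 = ("W", 1, 2, R11)
I = {S01, W02, R11, W12}                       # all four are issued

def nonovertake(I):
    return {(x, y) for x in I for y in I
            if x[0] == y[0] and x[0] in "SR"   # same kind, send or recv
            and x[1] == y[1] and x[2] < y[2]   # same process, earlier first
            and (x[3] == y[3] or x[3] == "*")} # same target, or wildcard first

def resource(I):
    return {(x, y) for x in I for y in I
            if y[0] == "W" and y[3] == x}      # wait on x's handle

def fence(I):
    return {(x, y) for x in I for y in I
            if x[0] in "WB" and x[1] == y[1] and x[2] < y[2]}

def intra_hb(I):
    return nonovertake(I) | resource(I) | fence(I)

def ancestors(I, x):                           # transitive closure of IntraHB
    hb, acc, frontier = intra_hb(I), set(), {x}
    while frontier:
        frontier = {a for (a, b) in hb if b in frontier} - acc
        acc |= frontier
    return acc

def ready(I, M, C):
    matched = {op for m in M for op in m}
    return {x for x in I
            if all(y in (C if x[0] == "W" else matched)
                   for y in ancestors(I, x))}

M = [{S01, R11}]                               # send/receive matched ...
C = set()                                      # ... but nothing completed yet
assert ready(I, M, C) == {S01, R11}            # waits blocked until C grows
assert ready(I, M, {S01, R11}) == I            # after completion, waits ready
```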
3.2.3 MPI Runtime Transitions
We are now ready to present the MPI runtime transitions. We first present
the RSRet and RRRet transitions, which stand for the MPI Runtime Send Return
and Runtime Receive Return transitions.
RSRet :Σ(σ as 〈I,M, C, R, pc〉), Si,j ∈ I ∧ Si,j /∈ R
Σ〈I,M, C, R ∪ {Si,j}, pc[i ← pci + 1]〉
RRRet :Σ(σ as 〈I,M, C, R, pc〉), Ri,j ∈ I ∧Ri,j /∈ R
Σ〈I,M, C, R ∪ {Ri,j}, pc[i ← pci + 1]〉
The RSRet and RRRet transitions define the control transfer back to the processes
when they issue the Si,j and Ri,j instructions. The control transfer is shown by
incrementing and updating the program counter of process Pi and by updating the
R set of the MPI runtime state.
RSR :Σ(σ as 〈I,M, C, R, pc〉), {Si,j(k), Rk,l(i)} ⊆ Ready(σ)
Σ(σ′ as 〈I,M ∪ {{Si,j, Rk,l}}, C, R, pc〉)
Assert : Ready(σ′) = Ready(σ)− {Si,j, Rk,l}
The MPI runtime transition RSR shows the formation of a send and a receive
match set when they are both ready to be matched (i.e., Si,j and Rk,l ∈ Ready(σ)).
The transition matches the send and receive and moves them to the M set. By
virtue of Definition 3.7, Ready(σ′) will satisfy the assertion shown. This assertion
(separately provable from Definition 3.7) shows how matched items are removed
from Ready. The MPI runtime transition to complete the send and receive once
they are matched is as follows:
RSC :Σ(σ as 〈I,M, C, R, pc〉), {Si,j, Rk,l} ∈ M ∧ Si,j /∈ C
Σ〈I,M, C ∪ {Si,j}, R, pc〉
RRC :Σ(σ as 〈I,M, C, R, pc〉), {Si,j, Rk,l} ∈ M ∧Rk,l /∈ C
Σ〈I,M, C ∪ {Rk,l}, R, pc〉
The RSC and RRC transitions look for sends and receives that have been
matched and update the C set with the send and receive operations. Note that
matching and completion can happen back-to-back. However, the MPI runtime
can also first match and only later perform the actual data transfer (which
completes the send and receive operations), for various reasons such as large
data buffers, a busy network, or performance optimizations. Our runtime
transition system captures this latitude provided by the MPI standard.
RWC :Σ(σ as 〈I,M, C, R, pc〉), Wi,j ∈ Ready(σ)
Σ(σ′ as 〈I,M ∪ {{Wi,j}}, C ∪ {Wi,j}, R, pc〉)
Assert : Ready(σ′) = Ready(σ)− {Wi,j}
RWC is the transition that completes a W operation. When the W operation
is in Ready(σ), the W is ready to be completed, by virtue of the definition of
the Ready set: a W operation enters the Ready set only when its corresponding
send or receive has completed and is in the C set.
RBC :Σ(σ as 〈I,M, C, R, pc〉), bar as {Bi,j | Bi,j ∈ Ready(σ)}, | bar |= PID
Σ(σ′ as 〈I,M ∪ {bar}, C ∪ bar,R, pc〉)
Assert : Ready(σ′) = Ready(σ)− bar
The RBC transition matches and completes a B operation. When Ready(σ)
contains a B operation for every process in PID, the transition matches all the
barriers by updating M with {bar} and also updates the C set. The Ready(σ′)
set is also appropriately updated.
RWRet :Σ(σ as 〈I,M, C, R, pc〉), Wi,j ∈ C ∧Wi,j /∈ R
Σ〈I,M, C, R ∪ {Wi,j}, pc[i ← pci + 1]〉
RBRet :Σ(σ as 〈I,M, C, R, pc〉), Bi,j ∈ C ∧Bi,j /∈ R
Σ〈I,M, C, R ∪ {Bi,j}, pc[i ← pci + 1]〉
The RWRet and RBRet return the control back to the processes issuing the B
and W operations once the B and W are completed. The final runtime transition
is the RSR∗ transition that matches a send and a wildcard receive.
RSR∗ : Σ(σ as 〈I,M, C, R, pc〉), {Si,j(k), Rk,l′(∗)} ⊆ Ready(σ), ¬∃l < l′ : Rk,l(i) ∈ Ready(σ)
Σ(σ′ as 〈I,M ∪ {{Si,j, Rk,l′(i)}}, C, R, pc〉)
Assert : Ready(σ′) = Ready(σ)− {Si,j, Rk,l′}
3.2.4 Conditional Matches-before
The RSR∗ transition matches a send with a wildcard receive. However, the send
can be matched only when there is no nonwildcard receive that can match it. This
satisfies the conditional matches-before requirement. For two receives Rk,l(i) and
Rk,l′(∗) (l < l′) issued by Pk, the later wildcard receive Rk,l′ cannot be matched
with an available send Si,j(k) before the earlier receive Rk,l is matched. Note that
〈Rk,l, Rk,l′〉 /∈ IntraHB.
This makes it possible for both Rk,l and Rk,l′ to be in Ready(σ) at the same
time. By checking that Rk,l is not in the Ready(σ), the conditional matches-before
order is preserved.
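The side condition of RSR∗ is a one-line check over the ready set. A sketch (Python; the tuple encoding (kind, i, j, arg) of operations is an illustrative assumption):

```python
# The RSR* guard: S_{i,j}(k) may match the wildcard receive R_{k,l'}(*) only
# if no earlier specific receive R_{k,l}(i), with l < l', is also ready.
def may_match_wildcard(send, recv, ready):
    _, i, _, k = send
    _, kk, lp, arg = recv
    if arg != "*" or kk != k:
        return False
    return not any(r[0] == "R" and r[1] == kk and r[2] < lp and r[3] == i
                   for r in ready)

S01 = ("S", 0, 1, 1)                 # S_{0,1}(1)
Rk1 = ("R", 1, 1, 0)                 # R_{1,1}(0): specific receive from P0
Rk2 = ("R", 1, 2, "*")               # R_{1,2}(*): later wildcard receive
assert not may_match_wildcard(S01, Rk2, {S01, Rk1, Rk2})  # blocked by R_{1,1}
assert may_match_wildcard(S01, Rk2, {S01, Rk2})           # R_{1,1} matched away
```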
3.2.5 Dynamic Instruction Rewriting
Also note that the M set now contains Rk,l′(i), which rewrites the src field of
the receive to the rank of the sending process with which it is matched.
3.2.6 One Transition or Multiple?
One may wonder whether RSR∗ is a single transition or a collection of one or
more. Each firing of RSR∗ is a single transition; the rule RSR∗, of course, defines
a family of transitions, one per sender that can match the wildcard receive. These
transitions are the only ones that are dependent – a notion that will be defined
formally in the next chapter. For this reason, we call the set of transitions denoted
by RSR∗ a dependent transition group.
3.2.7 Dependent Transition Group
Let Rk,l(∗) be a wildcard receive statement in an MPI process execution Pk,
for some k ∈ PID. Let σ ∈ Σ, and let Si,j(k) (for various i, j, k) represent sends
such that transition RSR∗ can fire by virtue of {Si,j(k), Rk,l(∗)} ⊆ Ready(σ). Let
the set of all these transitions be denoted τ. We say that τ forms a dependent
transition group (DTG) of state σ. isDtg(τ) is true precisely for such sets τ.
3.2.8 Selectors and Useful Predicates
Several predicates associated with these transition rules are now defined. These
will be used in subsequent chapters.
• is∗(t): for a transition t, is∗(t) is true exactly when t is an RSR∗ transition.
• has∗(τ): for a set of transitions τ , has∗(τ) is true if there is a t ∈ τ such that
is∗(t).
• hasnon∗(τ): for a set of transitions τ , hasnon∗(τ) is true if there is a t ∈ τ such
that ¬is∗(t).
• choose∗(τ) denotes t ∈ τ such that is∗(t) (like Hilbert’s choice operator).
• choosenon∗(τ) denotes t ∈ τ such that ¬is∗(t) (like Hilbert’s choice operator).
• all∗(τ) denotes the dependent transition groups in τ. That is,
all∗(τ) = {g ⊆ τ | isDtg(g)}.
Notice that multiple wildcard moves may be enabled at a state σ; that is,
there could be multiple DTGs at a state. Also notice that in the “crooked barrier”
example presented earlier (Figure 2.7), it is possible to have a barrier transition
and the wildcard receive transition both enabled at a state. Thus, DTGs and
regular (deterministic) transitions can be enabled at the same time. Our algorithm
POE prioritizes deterministic transitions until we reach a state with only DTGs.
POEOPT optimizes POE by treating the DTGs as independent as far as possible.
This concludes the MPI runtime transitions. The next section illustrates MPI
program execution using the MPI runtime transitions provided in this section.
3.3 Illustration of the Formal Model
We now illustrate the working of the formal MPI model as a state transition system.
Consider the simple MPI execution shown in Figure 3.1. Figure 3.2 shows the
execution of the MPI program in Figure 3.1. Each state σi is labeled with
• The I,M, C, R sets that identify the state σi.
• enabled(σi) is the set of transitions that can be executed from σi.
• The IntraHB relation among the MPI operations in I, and
• Ready(σi).

P0: S0,1(1); W0,2(h0,1);
P1: R1,1(0); W1,2(h1,1);

Figure 3.1. Simple MPI Program

Figure 3.2. Execution of Figure 3.1 with MPI Transitions
The MPI execution of Figure 3.2 proceeds as follows:
• σ0 has the process transitions enabled to issue S0,1 and R1,1. The rest of the
sets are empty. The process transitions are denoted as PS : S0,1 and PR : R1,1,
which instantiate particular PS and PR transitions, respectively.
• The PS : S0,1 transition is executed from state σ0 and reaches state σ1 with
S0,1 in σ1.I. Since S0,1 has no ancestors, i.e., Ancestors(σ1, S0,1) = ∅, S0,1 is
also in Ready(σ1).
• RSRet : S0,1 is now enabled in σ1; this transition is executed from σ1 to
generate σ2.
• Since S0,1 returned, W0,2 is now ready to be issued which is evident from
PW : W0,2 ∈ enabled(σ2).
• PW : W0,2 is executed from σ2 to generate σ3.
• Note that W0,2 will not be in Ready(σ3), due to the IntraHB relation between
S0,1 and W0,2 and the fact that S0,1 /∈ σ3.C.
• W0,2 will enter Ready(σ7) after S0,1 completes by executing RSC : S0,1 in σ6.
• The rest of the execution can be understood similarly. The execution ends
when there are no more transitions to be executed, i.e., enabled(σ13) = ∅.
3.4 Applying DPOR to MPI Transition System
We now present the DPOR algorithm applied to the MPI transition system and
discuss the issues that arise when DPOR is applied to the MPI transitions of MPI
programs. We redo the example presented in Figure 2.5 with minor changes. Note
that these changes do not in any way change the semantics of the example.
Consider the example in Figure 3.3. Note that the example deadlocks when R1,1
is matched with S0,1. An MPI execution for this example is shown in Figure 3.4.
P0: S0,1(1); W0,2(h0,1);
P1: R1,1(∗); R1,2(0); W1,3(h1,1); W1,4(h1,2);
P2: S2,1(1); W2,2(h2,1);

Figure 3.3. MPI Execution with a Deadlock

Figure 3.4. MPI Execution of Figure 3.3
By the time the execution reaches state σi, the process transitions PS : S2,1, S0,1
and PR : R1,1, as well as the runtime transitions RRRet : R1,1 and RSRet : S2,1, S0,1,
have been executed. Ready(σi) contains {S2,1, R1,1, S0,1}, which enables two RSR∗
transitions, as shown in enabled(σi). At σi, the RSR∗ : {S2,1, R1,1} transition is
executed.
S0,1 will be eventually matched with R1,2. Once the interleaving is generated,
the DPOR algorithm starts updating the backtrack sets. The only dependent
transitions are RSR∗ : {S0,1, R1,1} and RSR∗ : {S2,1, R1,1}, since executing one of
them disables the other (R1,1 is removed from Ready(σi)). However, unlike in
thread programs, once a transition is disabled, it never becomes enabled again;
the RSR∗ : {S0,1, R1,1} transition never gets enabled again in the current
interleaving. In thread-based programs, if a thread instruction remains disabled,
this leads to a deadlock. This is not necessarily true for MPI, as seen in this
example. Since the
dependent transition is never enabled, the DPOR algorithm will never update the
backtrack sets and the deadlock remains undetected. Note that it is possible to
have an interleaving (execution) where one of the RSR∗ transitions is never enabled
in the execution. This can happen if the RSR∗ : {S2,1, R1,1} is executed before the
PS : S0,1 is executed.
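The bug omission can be replayed on a toy abstraction of Figure 3.3 (Python; the string labels and hand-written ready-set updates are illustrations, not the formal transition system):

```python
# After RSR* fires {S21, R11}, R11 leaves the ready set, so the dependent
# alternative {S01, R11} is disabled and never re-enabled in this interleaving.
ready = {"S01", "S21", "R11"}
enabled_matches = [{"S01", "R11"}, {"S21", "R11"}]   # two RSR* candidates

fired = {"S21", "R11"}                 # the runtime picks this match
ready -= fired                         # matched ops leave the ready set

# Replay the rest of the interleaving: S01 later matches the specific
# receive R12, and R11 never returns to the ready set.
ready |= {"R12"}                       # P1 issues its next receive
ready -= {"S01", "R12"}                # deterministic match {S01, R12}

# The alternative wildcard match is never co-enabled again, so classic
# DPOR never adds it to any backtrack set: the deadlock goes undetected.
assert "R11" not in ready
later_enabled = [m for m in enabled_matches if m <= ready]
assert later_enabled == []
```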
CHAPTER 4
THE POE ALGORITHM
This chapter presents the POE algorithm, which stands for Partial Order
avoiding Elusive Interleavings. This version of POE is applicable to MPI programs
where sends do not have runtime buffering. Section 4.1 presents the dependence
properties of MPI transitions. Section 4.2 presents the POE algorithm along with
the proof of correctness. Section 4.3 illustrates the working of the POE algorithm.
Finally, Section 4.4 presents two drawbacks of the POE algorithm and concludes
the chapter.
4.1 MPI Transition Dependence
This section presents dependence and independence properties of MPI transi-
tions. Transition independence is presented in Definition 2.2.
Definition 4.1 An MPI transition t is enabled in a state σi (written t ∈
enabled(σi)) when it can be fired according to an MPI transition rule (presented
in Section 3.2).
Definition 4.2 Two transitions t1 and t2 are co-enabled when there is a state
σi such that {t1, t2} ⊆ enabled(σi).
Definition 4.3 Let t(σ) denote the state attained after transition t fires. Tran-
sitions t1 and t2 are independent exactly when, for all states σ, t2 ∈ enabled(σ) ⇔
t2 ∈ enabled(t1(σ)), and further t1(t2(σ)) = t2(t1(σ)). Two transitions are depen-
dent if they are not independent.
The definition in [5, Chapter 10] allows transitions to be independent even if
one transition can enable the other. We follow the stricter definition above,
along the lines of [9].
Lemma 4.4 All transitions of τ such that isDtg(τ) are pairwise dependent.
All other transition pairs are independent.
Lemma 4.4 is very important in the development of dynamic verification algo-
rithms for MPI programs. The above lemma implies that in programs that do not
have any wildcard receives, all the transitions are independent. Hence, the number
of relevant interleavings is only one. For programs with wildcard receives, the
number of relevant interleavings is governed by the dependent transition groups
generated by RSR∗ transitions. Compared to shared memory programs, MPI
programs have far fewer relevant interleavings. In particular, operations involving
different communicators are completely independent. This explains why random
delay tricks such as in [68] are far less effective for message passing programs, and
why algorithms such as POE are even more important.
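The interleaving count implied by Lemma 4.4 can be made concrete. A back-of-the-envelope sketch (Python; the per-receive candidate counts are made-up numbers, and treating the choices as independent gives only an upper bound):

```python
from math import factorial, prod

# Deterministic MPI program (no wildcard receives): one relevant interleaving.
assert prod([]) == 1

# Program with three wildcard receives that can match 2, 3 and 2 senders:
# the relevant interleavings are bounded by the product of the match choices.
candidates = [2, 3, 2]
assert prod(candidates) == 12

# Contrast: even 7 freely interleaved shared-memory operations admit
# factorial-many schedules, dwarfing the MPI bound above.
assert factorial(7) == 5040
```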
4.2 The POE Algorithm
This section presents the POE algorithm and its proof of correctness. Before
describing the POE algorithm, we first define and explain the notion of persistent
sets which underlies almost all our correctness arguments. In almost all cases, our
proof goal will be to show that for every reachable state σi, our algorithms compute
persistent backtrack sets.
4.2.1 Persistent Sets
Definition 4.5 For a reachable state σi ∈ Σ of an MPI program, a set of
transitions τ ⊆ enabled(σi) is persistent iff, for all nonempty sequences of
transitions σi −ti→ σi+1 −ti+1→ σi+2 · · · −tp−1→ σp that lie outside τ, tp is
independent of all transitions in τ.
4.2.2 Persistent Sets and MPI Program Correctness
Our interest is in detecting two classes of bugs:
• Deadlocks (states where all MPI processes are stuck). Even “partial dead-
locks” (deadlocks involving a proper subset of processes) will turn into full
deadlocks because we expect our terminating MPI processes to call MPI_Finalize.
• Violations of C assert statements placed within MPI processes. C assert
violations are equivalent to deadlocks because following [12], we can check
each assertion as a precondition of MPI transitions, and prevent the transition
from firing in case the precondition is false.
It is shown in [10] that persistent set-based search will reveal all deadlocks. That
is, if there is any path in the global state space where σ is reachable, then there
is a path traversing through only persistent sets where σ is reachable. Thus, the
correctness of all our POE variants will be argued by showing that they compute
persistent backtrack sets.
4.2.3 POE Algorithm
This section presents the POE algorithm and proves that the backtrack set at
every state generated by the POE algorithm is persistent.
The POE algorithm has some book-keeping sets, variables and helper routines
that are defined below:
• backtrack(σi) is the backtrack set of transitions of state σi. It has the same
semantics as in the classical DPOR algorithm.
• done(σi) ⊆ backtrack(σi) tracks the transitions of the backtrack(σi) that have
already been executed from σi.
• statevec is a vector of states explored in an interleaving. statevec also behaves
as a stack where a new state is pushed at the top of statevec and a state is
popped from the top of statevec. Initially, the statevec is empty.
• curr(σi) denotes the MPI transition that was executed from σi in the inter-
leaving generated in statevec. Note that curr(σi) ∈ backtrack(σi).
• Execute(σi, curr(σi)) is a helper routine that executes the MPI transition
curr(σi) in the current state σi and returns the resulting next state.
• GetTransition takes a set of MPI transitions as its argument and returns
a transition from the argument set as per the pseudo-code.
We are now ready to describe the full POE algorithm. Figures 4.1, 4.2, 4.3, and
4.4 provide the pseudo-code for the full POE algorithm. Figure 4.1 is the main
POE routine, which is invoked with two inputs: the initial state σ0 and an empty
statevec. The initial state σ0 is pushed onto statevec, and GetTransition
is invoked by POE to get the transition to be executed from σ0; backtrack(σ0)
is updated with curr(σ0) (lines 2–4, Figure 4.1).
POE then invokes GenerateInterleaving, which generates an interleaving
by selecting transitions in a prioritized manner (lines 6 and 10) by invoking
GetTransition. All RSR∗ transitions have the lowest priority; the rest of the
transitions have the same priority.
Once an interleaving is generated, the backtrack sets are updated by the POE
algorithm, as shown by the routine UpdateBacktrack in Figure 4.3. The rule
for generating the backtrack sets is simple: in a given state σ, if curr(σ) is not
an RSR∗ transition, then backtrack(σ) is the singleton set {curr(σ)}. Otherwise,
backtrack(σ) = enabled(σ).
After the backtrack set of every state is updated, the POE algorithm starts
popping states off statevec until a state σ = statevec[i] is reached with
backtrack(σ) ≠ done(σ), i.e., some transition in backtrack(σ) has not yet been
executed. If no such state exists, the POE algorithm pops all the states out of
statevec and terminates, signalling that there are no more interleavings to be
explored. Otherwise,
1: POE(σ0, statevec) {
2:   statevec.push(σ0);
3:   curr(σ0) = GetTransition(enabled(σ0));
4:   backtrack(σ0) = backtrack(σ0) ∪ {curr(σ0)};
5:   while (! statevec.empty()) {
6:     GenerateInterleaving (statevec);
7:     UpdateBacktrack (statevec);
8:     for (i = statevec.size()−1; i ≥ 0; i−−) {
9:       if (backtrack(statevec[i]) == done(statevec[i])) {
10:        statevec[i].pop();
11:      } else {
12:        break;
13:      }
14:    }
15:  }
Figure 4.1. Pseudocode for POE Algorithm
1: GetTransition(set of transitions T ) {
2:   if hasnon∗(T )
3:     return choosenon∗(T );
4:   else
5:     return choose∗(T )
6: }
Figure 4.2. Pseudocode for GetTransition
1: UpdateBacktrack(statevec) {
2:   for each (σ ∈ statevec) {
3:     if (enabled(σ) = ∅)
4:       return;
5:     if (is∗(curr(σ)))
6:       backtrack(σ) = enabled(σ);
7:   }
8: }
Figure 4.3. Pseudocode for UpdateBacktrack
1:  GenerateInterleaving(statevec) {
2:    σ = statevec[0];
3:    for (i = 0; i < statevec.size()−1; i++) {
4:      σ = Execute(statevec[i], curr(statevec[i]));
5:    }
6:    curr(σ) = GetTransition(backtrack(σ) − done(σ));
7:    do {
8:      σ = Execute(σ, curr(σ));
9:      statevec.push(σ);
10:     curr(σ) = GetTransition(enabled(σ));
11:     backtrack(σ) = backtrack(σ) ∪ {curr(σ)};
12:     done(σ) = done(σ) ∪ {curr(σ)};
13:   } while (enabled(σ) ≠ ∅);
14: }
Figure 4.4. Pseudocode for GenerateInterleaving
the algorithm invokes GenerateInterleaving on the statevec, which results
in restarting all the MPI processes.
The algorithm for GenerateInterleaving is shown in Figure 4.4. Lines 2–5
show the state generation when the program is restarted. The algorithm executes
the same transitions curr(σ) for all but the state that is at the top of statevec.
From this state, a new transition is executed from backtrack(σ) − done(σ) (lines
6–7, Figure 4.4). This causes new states to be generated, which are pushed
onto statevec along with the backtrack sets generated for these states. States
are generated until enabled(σ) = ∅, which is the terminating state (lines 8–14,
Figure 4.4).
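To make the control flow of Figures 4.1 through 4.4 concrete, the following Python sketch drives the same statevec / backtrack / done bookkeeping over an abstract transition system. This is an illustration only, not the actual implementation: `enabled`, `execute`, and `is_star` are caller-supplied assumptions, and "replay" is simulated by caching states in the frames rather than restarting processes.

```python
def poe(initial, enabled, execute, is_star):
    """Replay-style DFS in the spirit of Figures 4.1-4.4.  `enabled(s)` lists
    the transitions of state s, `execute(s, t)` returns the successor, and
    `is_star(t)` marks RSR*-style (nondeterministic) transitions."""

    def get_transition(ts):
        non_star = [t for t in ts if not is_star(t)]   # prioritized choice
        return min(non_star) if non_star else min(ts)

    frames = [{"state": initial, "curr": None,
               "backtrack": {get_transition(enabled(initial))}, "done": set()}]
    interleavings = []

    while frames:
        # GenerateInterleaving: branch at the top frame, then run to the end.
        top = frames[-1]
        top["curr"] = get_transition(top["backtrack"] - top["done"])
        top["done"].add(top["curr"])
        run = [f["curr"] for f in frames]
        state = execute(top["state"], top["curr"])
        while enabled(state):
            t = get_transition(enabled(state))
            frames.append({"state": state, "curr": t,
                           "backtrack": {t}, "done": {t}})
            run.append(t)
            state = execute(state, t)
        interleavings.append(tuple(run))
        # UpdateBacktrack: widen backtrack where a *-transition was taken.
        for f in frames:
            if f["curr"] is not None and is_star(f["curr"]):
                f["backtrack"] = set(enabled(f["state"]))
        # Pop frames whose backtrack set has been fully explored.
        while frames and frames[-1]["backtrack"] <= frames[-1]["done"]:
            frames.pop()
    return interleavings
```

On a toy system where a state is the set of remaining transitions and `*`-prefixed names mark RSR∗-style choices, the driver first runs the prioritized interleaving and then revisits only the widened `*` branch points.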
We now prove the following theorem:
Theorem 4.6 For any state σ generated by the POE algorithm, backtrack(σ)
is persistent.
Proof: By induction, in postorder (over the successor relation).
Basis case: The state after MPI_Finalize is persistent, as it has an empty
set of enabled transitions.
Induction Hypothesis: Pick some state σi and assume that all its successors
are persistent.
Induction Step: Now consider σi.
Our algorithm guarantees that either ¬has∗(backtrack(σi)) or backtrack(σi) =
enabled(σi). The latter case preserves persistence. The former case also
preserves persistence because, in any state satisfying ¬has∗(backtrack(σi)),
no transition in backtrack(σi) can be dependent with any other transition
(Lemma 4.4).
Note: This proof did not need the induction hypothesis because this version of
POE ensures locally (for each state) that only persistent sets are chosen. We will
employ the same proof structure for later versions of POE, and in those proofs, we
will need the induction hypothesis.
4.3 Illustration of POE Algorithm
We illustrate the POE algorithm on the “Crooked Barrier” example shown in
Figure 4.5.
Figure 4.6 and Figure 4.7 show the interleavings generated by the POE
algorithm for Figure 4.5. The POE algorithm selects one of the process
transitions in enabled(σ0), adds it to backtrack(σ0), and updates curr(σ0).
GenerateInterleaving is now invoked with σ0 at the top of statevec.
GenerateInterleaving then executes curr(σ0) (line 8) and generates the
next state σ1. One of the transitions in enabled(σ1) is selected using
GetTransition, which implements the prioritized execution semantics of POE.

P0           P1           P2
S0,1(1)      B1,1         B2,1
B0,2         R1,2(∗)      S2,2(1)
W0,3(h0,1)   R1,3(2)      W2,3(h2,2)
             W1,4(h1,2)
             W1,5(h1,3)

Figure 4.5. Crooked Barrier Example

Figure 4.6. POE Interleaving 1

GetTransition always selects a non-RSR∗ transition if available. This can be seen in the
interleaving in Figure 4.6 where the transitions executed from σ0 to σl to σm are
non-RSR∗ transitions. σm has only RSR∗ transitions in enabled(σm). GetTransition
arbitrarily selects RSR∗ : {S0,1, R1,2}, generates the rest of the interleaving,
and returns. The POE algorithm then invokes UpdateBacktrack for each of
the states generated. UpdateBacktrack only updates a state when is∗(curr(σ))
is true. In this case, it updates backtrack(σm) with enabled(σm). For the rest of
the states, backtrack(σ) = {curr(σ)}. The POE algorithm will start popping off
states from statevec until it reaches a state where backtrack(σ) − done(σ) ≠ ∅.
Figure 4.7. POE Interleaving 2
All the states from σr to σm+1 get popped out. Since statevec is not empty,
GenerateInterleaving is now invoked and generates the second interleaving
shown in Figure 4.7. GenerateInterleaving will now re-execute the same
set of transitions from σ0 until σm (lines 3–5). The transition to be executed
from σm is selected from backtrack(σm) − done(σm) which is RSR∗ : {S2,2, R1,2}.
The second interleaving will eventually reach the final deadlocked state σf where
Ready(σf) ≠ ∅ and the processes have not executed MPI_Finalize.
4.4 Issues with POE Algorithm
Though the POE algorithm is guaranteed to detect deadlocks and generate all
relevant interleavings for an MPI program, it does so under the assumption that
none of the sends have any runtime buffering. Also, the POE algorithm can result
in a number of redundant interleavings as will be made evident in this section.
4.4.1 Redundant Interleavings
The POE algorithm will cause multiple interleavings only when there is a state
with multiple RSR∗ transitions. Consider the MPI program in Figure 4.8.
The POE algorithm execution will result in a state σi where enabled(σi) =
{RSR∗ : {S0,1, R1,1}, RSR∗ : {S2,1, R3,1}}. The POE algorithm would now make
backtrack(σi) = enabled(σi), which results in two interleavings, even though
the number of relevant interleavings for the program in Figure 4.8 is only 1. Note that
for n such independent RSR∗ transitions co-enabled in a state, the POE algorithm
will cause n! interleavings while just 1 interleaving is sufficient. However, this
redundancy cannot be eliminated simply by omitting transitions from different
DTG groups from the backtrack sets. Consider the example program shown in Figure 4.9.
The POE algorithm would enter a state σi where enabled(σi) = {RSR∗ :
{S0,1, R1,1}, RSR∗ : {S2,1, R3,1}}. Now consider the scenario where the POE
algorithm would consider the two RSR∗ transitions as independent and add only
one of them to the backtrack set, say RSR∗ : {S0,1, R1,1}. In this case, it is possible
to take a transition in enabled(σi) − backtrack(σi) = {RSR∗ : {S2,1, R3,1}} and
enter a state σj where enabled(σj) = {RSR∗ : {S0,1, R1,1},RSR∗ : {S2,3, R1,1}}. The
transition RSR∗ : {S0,1, R1,1} can be disabled by the transition RSR∗ : {S2,3, R1,1}.
P0           P1           P2           P3
S0,1(1)      R1,1(∗)      S2,1(3)      R3,1(∗)
W0,2(h0,1)   W1,2(h1,1)   W2,2(h2,1)   W3,2(h3,1)
Figure 4.8. Redundant POE Interleavings
P0           P1           P2           P3
S0,1(1)      R1,1(∗)      S2,1(3)      R3,1(∗)
W0,2(h0,1)   W1,2(h1,1)   W2,2(h2,1)   W3,2(h3,1)
S0,3(3)      R1,3(∗)      S2,3(1)      R3,3(∗)
W0,4(h0,3)   W1,4(h1,3)   W2,4(h2,3)   W3,4(h3,3)
Figure 4.9. POE and Persistent Sets
An identical scenario occurs when backtrack(σi) contains only RSR∗ : {S2,1, R3,1}.
This makes backtrack(σi) nonpersistent. The POE algorithm hence sets
backtrack(σi) = enabled(σi) to keep the backtrack sets persistent for RSR∗
transitions.
Chapter 5 extends the POE algorithm to the POEOPT algorithm to reduce this
redundancy.
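The n! blow-up noted above can be seen concretely. In the following sketch, the strings are merely labels for n independent, co-enabled wildcard match-sets (they are not real operations); with backtrack(σ) = enabled(σ), POE ends up exploring one interleaving per ordering of the matches:

```python
from math import factorial
from itertools import permutations

# Labels standing in for n independent, co-enabled wildcard match-sets:
matches = ["RSR*{S0,1,R1,1}", "RSR*{S2,1,R3,1}", "RSR*{S4,1,R5,1}"]

# One interleaving per ordering of the independent matches:
orderings = set(permutations(matches))
assert len(orderings) == factorial(len(matches))   # 3! = 6
```

A single ordering would already cover the one relevant outcome, which is exactly the redundancy POEOPT targets.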
4.4.2 POE and Buffered Sends
The POE algorithm works correctly only when the sends do not have adequate
buffering. If sends can be buffered, it can miss deadlocks present in the program.
Consider the MPI example shown in Figure 4.10.
When none of the sends are buffered, only S1,1 can match the wildcard receive
R2,1 and there is no deadlock. However, when S1,1 is buffered, W1,2 can complete
even before the send is matched. This enables S0,1 and R1,3 to match; since S0,1
is matched, it can complete, unblocking W0,2. Now, S0,3 is issued and since the
P0           P1           P2
S0,1(1)      S1,1(2)      R2,1(∗)
W0,2(h0,1)   W1,2(h1,1)   W2,2(h2,1)
S0,3(2)      R1,3(0)      R2,3(0)
W0,4(h0,3)   W1,4(h1,3)   W2,4(h2,3)
Figure 4.10. Buffering Sends and POE
wildcard receive is not yet matched, it can be matched with S0,3, resulting in a
deadlock since R2,3 will not have a matching send. Note that this deadlock cannot
happen when none of the sends are buffered. We call this the slack-inelastic [31]
property of MPI. One solution would be to buffer all the sends. However, this will
mean that any deadlocks corresponding to nonbuffered sends will not be detected
by POE. Since buffer allocation is a dynamic property, our goal is to extend POE
so that it can detect all deadlocks. Chapter 7 extends the POE algorithm to handle
buffered sends.
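The slack-inelastic behavior above can be reproduced with a small brute-force explorer over a toy model of Figure 4.10. The model below is an assumption-laden simplification (sends and receives are nonblocking, waits block until their operation is matched, and a buffered send's wait completes immediately); it is meant only to show that a deadlock appears exactly when S1,1 is buffered, not to model full MPI semantics.

```python
# Toy model of the Figure 4.10 program.  Each process is a list of ops:
#   ("S", dest)  -- nonblocking send to process `dest`
#   ("R", src)   -- nonblocking receive from `src` ("*" = wildcard)
#   ("W", i)     -- wait on the op issued at index i of the same process
PROGRAM = [
    [("S", 1), ("W", 0), ("S", 2), ("W", 2)],    # P0: S0,1(1) W0,2 S0,3(2) W0,4
    [("S", 2), ("W", 0), ("R", 0), ("W", 2)],    # P1: S1,1(2) W1,2 R1,3(0) W1,4
    [("R", "*"), ("W", 0), ("R", 0), ("W", 2)],  # P2: R2,1(*) W2,2 R2,3(0) W2,4
]

def explore(buffered):
    """Return True iff some schedule of PROGRAM deadlocks.  `buffered` holds
    (proc, op_index) pairs of sends whose waits complete without a match."""

    def advance(pcs, matched):
        # Issue nonblocking ops; complete any waits that can complete.
        pcs = list(pcs)
        for p, ops in enumerate(PROGRAM):
            while pcs[p] < len(ops):
                kind, arg = ops[pcs[p]]
                if kind in ("S", "R"):
                    pcs[p] += 1                      # issued immediately
                elif (p, arg) in matched or (p, arg) in buffered:
                    pcs[p] += 1                      # wait completes
                else:
                    break                            # wait blocks
        return tuple(pcs)

    def pending(pcs, matched, kind):
        return [(p, i, PROGRAM[p][i][1])
                for p in range(len(PROGRAM)) for i in range(pcs[p])
                if PROGRAM[p][i][0] == kind and (p, i) not in matched]

    def dfs(pcs, matched):
        pcs = advance(pcs, matched)
        sends = pending(pcs, matched, "S")
        recvs = pending(pcs, matched, "R")
        choices = [(s, r) for s in sends for r in recvs
                   if s[2] == r[0] and r[2] in ("*", s[0])]
        if not choices:   # no match possible: deadlock iff someone is stuck
            return any(pc < len(ops) for pc, ops in zip(pcs, PROGRAM))
        return any(dfs(pcs, matched | {(s[0], s[1]), (r[0], r[1])})
                   for s, r in choices)

    return dfs((0, 0, 0), frozenset())
```

With no buffering, `explore(set())` finds no deadlocking schedule; buffering S1,1 alone, `explore({(1, 0)})`, exposes the deadlock where R2,1 matches S0,3 and R2,3 is left without a sender.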
CHAPTER 5
POE AND REDUNDANT
INTERLEAVINGS
This chapter extends the POE algorithm to reduce the redundant interleavings
that it generates. Section 5.1 provides a few examples of how
POE generates redundant interleavings. We then define the InterHB relation in
Section 5.2 and use this to derive co-enabledness properties of MPI operations.
Section 5.3 then describes the POEOPT algorithm that uses the co-enabledness
properties derived in Section 5.2 to reduce the redundant interleavings in POE.
Section 5.3 also provides the proof that the backtrack set of every state generated
by the POEOPT algorithm is persistent.
5.1 POE and Redundant Interleavings
This section presents a few examples to describe scenarios where the POE
algorithm can contribute to redundant interleavings. The POE algorithm generates
multiple interleavings only in the presence of wildcard receives. For deterministic
MPI programs with no wildcard receives, the POE algorithm optimally produces
only a single interleaving. Hence, the programs of interest are those MPI programs
that have wildcard receives.
Consider the MPI program execution shown in Figure 5.1.
The POE algorithm executes Figure 5.1 from state σ0 as follows:
• The PS transitions are executed: σ0 −PS:{S0,1}→ σ1 −PS:{S2,1}→ σ2 −PS:{S4,1}→ σ3.
• The PR transitions are executed: σ3 −PR:{R1,1}→ σ4 −PR:{R3,1}→ σ5.
• The PW transitions are executed: σ5 −PW:{W0,2}→ σ6 −PW:{W1,2}→ σ7 −PW:{W2,2}→ σ8 −PW:{W3,2}→ σ9 −PW:{W4,2}→ σ10.
P0           P1           P2           P3           P4
S0,1(1)      R1,1(∗)      S2,1(1)      R3,1(∗)      S4,1(3)
W0,2(h0,1)   W1,2(h1,1)   W2,2(h2,1)   W3,2(h3,1)   W4,2(h4,1)
             R1,3(2)
             W1,4(h1,3)
Figure 5.1. Redundant POE Interleavings
• There are no more process transitions available at σ10 and enabled(σ10) = {
RSR∗ : {S0,1, R1,1}, RSR∗ : {S2,1, R1,1}, RSR∗ : {S4,1, R3,1}}.
• For state σ10, backtrack(σ10) = enabled(σ10).
• The POE algorithm will generate one interleaving for each transition in
backtrack(σ10), resulting in 3 interleavings.
However, the above program requires only 2 interleavings. Also, if there were
two more processes such that RSR∗ : {S5,1(6), R6,1(∗)} were also enabled in σ10,
the POE algorithm would generate 10 interleavings while only 2 interleavings are
sufficient to detect the deadlock present in the program.
The redundancy in the POE algorithm arises when there are multiple RSR∗
transitions enabled in a state σi and the wildcard receives involved in the RSR∗
transitions are different. POE generates these redundant interleavings in order to
keep the backtrack sets persistent. Consider the example in Figure 5.2.
The POE algorithm will execute Figure 5.2 as follows:
P0           P1           P2           P3
S0,1(1)      R1,1(∗)      S2,1(3)      R3,1(∗)
W0,2(h0,1)   W1,2(h1,1)   W2,2(h2,1)   W3,2(h3,1)
S0,3(3)      R1,3(∗)      S2,3(1)      R3,3(∗)
W0,4(h0,3)   W1,4(h1,3)   W2,4(h2,3)   W3,4(h3,3)
Figure 5.2. POE and Persistent Sets
• The PS transitions are executed: σ0 −PS:{S0,1}→ σ1 −PS:{S2,1}→ σ2.
• The PR transitions are executed: σ2 −PR:{R1,1}→ σ3 −PR:{R3,1}→ σ4.
• The PW transitions are executed: σ4 −PW:{W0,2}→ σ5 −PW:{W1,2}→ σ6 −PW:{W2,2}→ σ7 −PW:{W3,2}→ σ8.
• There are no more process transitions available at σ8 and enabled(σ8) = {
RSR∗ : {S0,1, R1,1}, RSR∗ : {S2,1, R3,1}}.
If backtrack(σ8) ≠ enabled(σ8), the POE algorithm would generate only 3
interleavings when there are 4 relevant interleavings that match {R1,1, S0,1},
{S2,3, R1,1}, {S2,1, R3,1}, {S0,3, R3,1}. Such a backtrack(σ8) is therefore not persistent.
The goal of this chapter is to reduce the redundant interleavings while keeping
the backtrack sets persistent. One simple optimization would be to look at the
generated interleaving I = σ0 → σ1 → . . . → σn and check whether there is some send
Sm,n(i) such that for all σi generated in I, RSR∗ : {Sm,n(i), Ri,j(∗)} /∈ enabled(σi),
and update the backtrack set to the enabled set only if such an Sm,n exists. This will
fix the redundant interleavings in Figure 5.1 and will also maintain the persistent
backtrack sets for Figure 5.2. Now let us apply the simple optimization to the
example in Figure 5.3.
When the simple optimization is applied to the POE algorithm execution of
Figure 5.3, the optimization would find that S3,5 can be matched with R1,1 and
P0           P1           P2           P3
S0,1(1)      R1,1(∗)      R2,1(∗)      S3,1(2)
W0,2(h0,1)   W1,2(h1,1)   W2,2(h2,1)   W3,2(h3,1)
             S1,3(3)                   R3,3(1)
             W1,4(h1,3)                W3,4(h3,3)
             R1,5(∗)                   S3,5(1)
             W1,6(h1,5)                W3,6(h3,5)
Figure 5.3. Simple Optimization and Redundancy
that there is no RSR∗ : {S3,5, R1,1} transition enabled in any state. This will cause
both RSR∗ : {S0,1, R1,1} and RSR∗ : {S3,1, R2,1} to be added to the backtrack set.
However, notice that S3,5 and R1,1 will never be in Ready(σi) for any state σi to
form a RSR∗ transition. The number of relevant interleavings for Figure 5.3 is one
while two interleavings are generated even by applying the simple optimization.
The POE algorithm is only aware of the IntraHB relation which dictates the
order in which the operations enter and leave the Ready set within a process. In
order to address redundancy issues, the POE algorithm must also be able to detect
whether two MPI operations across processes can be in the Ready set at the same
state to form a transition. The POE algorithm does not have this information
available.
Section 5.2 introduces the InterHB relation, which helps characterize
the co-enabledness properties of MPI operations across processes.
5.2 InterHB and Co-enabledness
The MPI runtime (R) transitions that match various MPI operations (RSR,
RSR∗, RBC) are enabled in a state σi depending on the MPI operations available
in Ready(σi). For example, an RSR : {Si,j(k), Rk,l(i)} transition is in enabled(σi)
only when Si,j(k) ∈ Ready(σi) and Rk,l(i) ∈ Ready(σi). An RBC transition is in
enabled(σi) only when the barrier operations B of all processes are in Ready(σi).
In order to eliminate the redundancy due to the multiple RSR∗ transitions, we
only need to know if there exists any state σi such that Si,j(k) and its matching
wildcard receive Rk,l(∗) can both be in Ready(σi). We therefore wish to detect the
co-enabledness of MPI operations where the co-enabledness of two MPI operations
Mi,j and Nk,l (M, N ∈ {S,R,B,W}) is defined as follows:
Definition 5.1 Two MPI operations Mi,j and Nk,l, with M, N ∈ {S,R,W, B},
are co-enabled iff {Mi,j, Nk,l} ⊆ Ready(σi) for some state σi.
The Ready set of a state is a function of the IntraHB relation among the
MPI operations defined in section 3.2. The following lemma directly follows from
Definition 3.7 of Ready set and the fact that the IntraHB relations do not change
across interleavings.
Lemma 5.2 If two MPI operations Mi,j and Ni,k (j < k, M, N ∈ {S,R,W, B})
are such that 〈Mi,j, Ni,k〉 ∈ IntraHB(σi), then there is no state σi such that
{Mi,j, Ni,k} ⊆ Ready(σi). That is, Mi,j can never be co-enabled with Ni,k.
Lemma 5.2 says that MPI operations that are related by the IntraHB relation
can never be co-enabled. The IntraHB relation is only among MPI operations with
the same MPI process rank.
Since co-enabledness among MPI operations is defined based on their pres-
ence/absence in the Ready set of the states, the only transitions that can cause MPI
operations to be added or removed from the Ready set are RSR,RBC ,RWC ,RSR∗.
We need to find the co-enabledness among MPI operations across process ranks to
detect whether a wildcard receive and a matching send issued by a different process
can be co-enabled. To do so, we now add an InterHB relation among
MPI operations across process ranks.
The InterHB relation is defined from the IntraHB relation and the match-sets
formed between MPI operations in an interleaving. Figure 5.4 shows the
IntraHB and InterHB relation across MPI operations. The IntraHB relation is
shown as a solid line between MPI operations within the same process rank and
InterHB edges are shown as dotted lines between MPI operations across MPI
processes.
(a) InterHB for deterministic matches. (b) InterHB for nondeterministic matches. (c) InterHB for barrier matches.
Figure 5.4. InterHB Relation Across Match-sets
Consider Figure 5.4(a). Let Ri,j and Sm,n be such that {Ri,j, Sm,n} ⊆ Ready(σ).
From Lemma 5.2, we know that Ri,j and Mi,k can never be co-enabled. Similarly,
Sm,n and Nm,p can never be co-enabled. However, when a match set is formed
between Ri,j and Sm,n, both of them leave their Ready set at the same time. This
means that Sm,n cannot be co-enabled with Mi,k and Ri,j cannot be co-enabled with
Nm,p. We hence show this using a dotted edge (InterHB relation). The same holds
for Figure 5.4(c).
For nondeterministic matches, even if Ri,j(∗) and Sm,n are in the Ready(σ), it
is still possible that the Ri,j(∗) can match with some other send and can cause Sm,n
to remain in the Ready set while Ri,j is removed. Therefore, it is possible for Sm,n
to be co-enabled with Mi,k. However, Ri,j can never be co-enabled with Nm,p which
is shown as a dotted edge in Figure 5.4(b). The InterHB relation is generated
only after an interleaving I = σ0 → σi → . . . → σn is generated using the POE
algorithm.
We now formally define InterHB.
Definition 5.3 For an interleaving I = σ0 → σ1 → . . . → σn,
InterHB(σn as 〈I,M, C, R, ls〉) ⊆ I × I is defined as follows:
• If {Ri,j(m), Sm,n(i)} ⊆ Ready(σj) where σj is some state in I, then for all
Mi,k ∈ Descendants(σn, Ri,j) and Nm,p ∈ Descendants(σn, Sm,n) we have
that 〈Ri,j, Nm,p〉 ∈ InterHB(σn) and 〈Sm,n, Mi,k〉 ∈ InterHB(σn).
• If {Ri,j(∗), Sm,n(i)} ⊆ Ready(σj) where σj is some state in I, then for all
Nm,p ∈ Descendants(σn, Sm,n) we have that 〈Ri,j, Nm,p〉 ∈ InterHB(σn).
• If {Bi,j, Bm,n} ⊆ Ready(σj) where σj is some state in I, then for all Mi,k ∈
Descendants(σn, Bi,j) and Nm,p ∈ Descendants(σn, Bm,n) we have that 〈Bi,j, Nm,p〉 ∈
InterHB(σn) and 〈Bm,n, Mi,k〉 ∈ InterHB(σn).
Definition 5.4 Given an interleaving I = σ0 → σ1 → . . . → σn, HB(I) =
IntraHB(σn) ∪ InterHB(σn).
Let HB∗(I) denote the transitive closure of HB(I). When the context is clear,
we denote HB(I) as HB and HB∗(I) as HB∗.
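Since HB∗ is just the transitive closure of a finite relation, it can be computed with a standard Floyd-Warshall-style pass. The sketch below hand-encodes a fragment of the HB graph of Figure 5.3 (the edge list is an assumption for illustration, not the full relation) to check that R1,1 and S3,5 are HB∗ related:

```python
def transitive_closure(edges):
    """Floyd-Warshall-style closure of a relation given as a set of pairs."""
    closure = set(edges)
    nodes = {x for edge in edges for x in edge}
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if (i, k) in closure and (k, j) in closure:
                    closure.add((i, j))
    return closure

# Assumed fragment of Figure 5.3's IntraHB edges along process P1:
intra_hb = {("R1,1", "W1,2"), ("W1,2", "S1,3"), ("S1,3", "W1,4"),
            ("W1,4", "R1,5"), ("R1,5", "W1,6")}
# From the deterministic match {S1,3, R3,3}: S1,3 InterHB-precedes the
# descendants of R3,3, which include S3,5.
inter_hb = {("S1,3", "S3,5")}

hb_star = transitive_closure(intra_hb | inter_hb)
# R1,1 reaches S3,5 through S1,3, so by Lemma 5.5 the two can never be
# co-enabled in any equivalent interleaving:
assert ("R1,1", "S3,5") in hb_star
```

This is exactly the path shown with darker lines in Figure 5.5.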
The following lemma follows directly from the construction of InterHB from
IntraHB.
Lemma 5.5 Let I and I ′ be two equivalent interleavings. For two MPI operations
Mi,j and Nk,m, if 〈Mi,j, Nk,m〉 ∈ HB∗(I), then 〈Mi,j, Nk,m〉 ∈ HB∗(I ′).
Proof : Since I and I ′ are equivalent, the M sets are the same in the final
states of the interleavings. If there are no nondeterministic receives, the InterHB
relation between the operations will also be the same since the deterministic receive
and its matching send must be co-enabled in some state in both I and I ′. If there
is a wildcard receive, then it is possible that there are two sends Si,j(l) and Sm,n(l)
co-enabled in a state with the matching wildcard receive Rl,r(∗) in I but Sm,n is not
co-enabled with Rl,r in I ′. In this case, Sm,n is matched with Rl,r′ where r′ > r
in I ′ and there is an InterHB relation from Rl,r′ to descendants of Sm,n. Also,
there is an IntraHB relation from Rl,r to Rl,r′ in both I and I ′. By transitivity of
HB, Rl,r and all descendants of Sm,n are also HB related in I ′.
Since the backtrack sets for a state are updated based on the current interleaving I,
Lemma 5.5 facilitates optimizations that can reduce the redundant
interleavings in the POE algorithm.
Figure 5.5 shows the HB relation for the MPI program in Figure 5.3 as a graph
among the MPI operations. The IntraHB relation is shown using solid lines and
InterHB is shown using dashed lines. Note that R1,1 is HB related to the matching
send S3,5 (shown using darker lines). Therefore R1,1 and S3,5 cannot be co-enabled
in an equivalent interleaving by Lemma 5.5. The reader may verify that R1,1 can
indeed never be matched with S3,5. The POE algorithm can now be extended to
update the backtrack(σ) to enabled(σ) in an interleaving only when the receive and
send are not HB related. This can still cause redundant interleavings.
Consider the MPI execution in Figure 5.6. The POE algorithm, using the HB
relation, would work as follows:
Figure 5.5. HB Relation for Figure 5.3 Shown as Graph
Figure 5.6. Redundancy with New POE Algorithm
1. The first interleaving I is generated by GenerateInterleaving.
2. There is a state σi in I such that enabled(σi) = {RSR∗ : {S0,1, R1,1},RSR∗ :
{S3,1, R2,1},RSR∗ : {S5,1, R4,1}}. Let curr(σi) = RSR∗ : {S0,1, R1,1}.
3. The UpdateBacktrack is invoked on σi.
4. Since S3,3 can be matched with R1,1, and S3,3 and R1,1 are not HB(I) related,
backtrack(σi) = enabled(σi).
However, note that it is not required to add RSR∗ : {S5,1, R4,1} to backtrack(σi).
Adding redundant transitions to the backtrack set can exponentially increase the
number of interleavings.
The POE algorithm must be able to decide which transitions must be added to
the backtrack set in order to co-enable R1,1 and S3,3 in some state, instead of adding
every enabled transition to the backtrack set. The algorithm could minimally add
just RSR∗ : {S3,1, R2,1} to backtrack(σi), since R2,1 and S3,3 are HB∗ related. This is
because the HB relation also provides the order in which the MPI operations enter
the Ready set (i.e., enabling order). We use this insight to develop the POEOPT
algorithm described in the next section.
5.3 POE Algorithm Modified
We now present the POEOPT algorithm (Figures 5.7, 5.8, 5.9, 5.10, 5.11) which
extends the POE algorithm to handle the redundant interleavings due to RSR∗
transitions in the backtrack set. The POEOPT algorithm differs from POE only in
the way the backtrack sets are updated. The rest of the algorithm is exactly the same.
UpdateBacktrack only updates the backtrack set for the states that
have only RSR∗ transitions and invokes AddtoBacktrack. For the rest of
the states, the backtrack sets remain unchanged. Consider a state σi that has
only RSR∗ transitions enabled. The algorithm selects one of the RSR∗ transitions.
The original POE algorithm would then update the backtrack(σi) to enabled(σi).
Instead, the POEOPT algorithm updates the backtrack sets such that, if Ri,j(∗)
is the receive involved in curr(σi), then backtrack(σi) is updated with the DTG of
curr(σi) in enabled(σi). In order to detect whether it is possible for some other send
Sk,l(i) /∈ Ready(σi) (i.e., Sk,l is not co-enabled with Ri,j) to match with Ri,j,
HB∗ is consulted to check whether 〈Ri,j, Sk,l〉 ∈ HB∗. If 〈Ri,j, Sk,l〉 ∈ HB∗, then
backtrack(σi) is not updated. Otherwise, some RSR∗ : {Rp,q(∗), Sm,n(p)} transition
is added to backtrack(σi), where 〈Rp,q, Sk,l〉 ∈ HB∗.
We now prove that the backtrack sets are persistent for every state in statevec.
1:  POEOPT(σ0, statevec) {
2:    statevec.push(σ0);
3:    curr(σ0) = GetTransition(enabled(σ0));
4:    backtrack(σ0) = backtrack(σ0) ∪ {curr(σ0)};
5:    while (! statevec.empty()) {
6:      GenerateInterleaving(statevec);
7:      UpdateBacktrack(statevec);
8:      for (i = statevec.size()−1; i ≥ 0; i−−) {
9:        if (backtrack(statevec[i]) == done(statevec[i])) {
10:         statevec[i].pop();
11:       } else {
12:         break;
13:       }
14:     }
15:   }
16: }
Figure 5.7. Pseudocode for POEOPT Algorithm
1: GetTransition(set of transitions T ) {
2:   if hasnon∗(T )
3:     return choosenon∗(T );
4:   else
5:     return choose∗(T );
6: }
Figure 5.8. Pseudocode for GetTransition
1: UpdateBacktrack(statevec) {
2:   for each (σ ∈ statevec) {
3:     if (enabled(σ) = ∅)
4:       return;
5:     ti = curr(σ);
6:     if (is∗(ti))
7:       AddtoBacktrack(ti, σ, statevec);
8:   }
9: }
Figure 5.9. Pseudocode for UpdateBacktrack
1:  AddtoBacktrack(Transition ti, σ, statevec) {
2:    backtrack(σ) = backtrack(σ) ∪ DTG(σ, ti);
3:    let Ri,j(∗) be the receive operation of ti;
4:    for each Sk,l /∈ Ready(σ) such that 〈Ri,j, Sk,l〉 /∈ HB∗ {
5:      for each (t ∈ enabled(σ) − backtrack(σ)) {
6:        let Rp,q(∗) be the receive operation of t;
7:        if (〈Rp,q, Sk,l〉 ∈ HB∗) {
8:          backtrack(σ) = backtrack(σ) ∪ {t};
9:        }
10:     }
11:   }
12: }
Figure 5.10. Pseudocode for AddtoBacktrack
1:  GenerateInterleaving(statevec) {
2:    σ = statevec[0];
3:    for (i = 0; i < statevec.size()−1; i++) {
4:      σ = Execute(statevec[i], curr(statevec[i]));
5:    }
6:    curr(σ) = GetTransition(backtrack(σ) − done(σ));
7:    do {
8:      σ = Execute(σ, curr(σ));
9:      statevec.push(σ);
10:     curr(σ) = GetTransition(enabled(σ));
11:     backtrack(σ) = backtrack(σ) ∪ {curr(σ)};
12:     done(σ) = done(σ) ∪ {curr(σ)};
13:   } while (enabled(σ) ≠ ∅);
14: }
Figure 5.11. Pseudocode for GenerateInterleaving
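The AddtoBacktrack routine of Figure 5.10 can be sketched in Python as follows. All parameter names are assumptions introduced for this illustration: `ready`, `enabled`, `dtg`, `recv_of`, and `hb_star` stand in for Ready(σ), enabled(σ), DTG(σ, t), the receive of an RSR∗ transition, and the HB∗ relation (encoded as a set of pairs). Unlike the pseudocode, this sketch returns the new backtrack set instead of mutating global state.

```python
def add_to_backtrack(t_i, sigma, all_sends, ready, enabled, hb_star,
                     dtg, recv_of):
    """Hedged sketch of AddtoBacktrack (Figure 5.10)."""
    backtrack = set(dtg(sigma, t_i))
    r = recv_of(t_i)                                   # the wildcard Ri,j(*)
    for s in all_sends:                                # candidate later sends
        if s in ready(sigma) or (r, s) in hb_star:
            continue           # already co-enabled, or can never match r
        for t in enabled(sigma) - backtrack:
            if (recv_of(t), s) in hb_star:             # t's match enables s
                backtrack.add(t)
    return backtrack
```

On the Figure 5.6 scenario, where S3,3 becomes ready only after R2,1 is matched, this adds RSR∗ : {S3,1, R2,1} to the backtrack set but leaves out the redundant RSR∗ : {S5,1, R4,1}.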
Theorem 5.6 For any state σ generated by the POEOPT algorithm, backtrack(σ)
is persistent.
Proof: We prove by induction along the postorder.
Basis Case: The final states are persistent because the set of enabled transi-
tions is empty.
Induction Hypothesis: All successors of σi are persistent.
Induction Step: Consider the transition ti taken out of σi in the current
interleaving. Clearly, ti is the first transition of many interleavings that start from
σi, all of which have already been explored. Consider a transition t, as in Figure 5.10,
involving the receive Rp,q(∗). Suppose we have not formed a persistent set at σi
and thereby have left out t from the persistent set. That is, t starts an interleaving
I in the full state space, but we did not include t in our persistent set.
Since t itself is independent of ti (they are not in the same DTG), this means
that t must lead to a transition t^d_i that is dependent on ti. Such a transition must
be in the same DTG as ti, and involve a send such as Sk,l(i). Clearly, Sk,l(i) was
not ready at σi (or else t^d_i would have been in the persistent set at σi). Since Sk,l(i)
did get enabled later (it was the send transition that was part of t^d_i), we must have
〈Rp,q, Sk,l〉 ∈ HB∗. But now, since t and ti are independent, we can surmise that an
interleaving equivalent to I, say I ′, was pursued, and that ti is the first transition
of such an interleaving.
There are two relationships between I and I ′:
• They have the same HB relationship (Lemma 5.5),
• Since equivalent transition sequences represent the same matching decisions
between MPI processes, they define the same control flow branching decisions.
The latter fact was tacit in most of our descriptions, but we are explicating it
for clarity now. More specifically, since I is based on a dynamic execution sequence,
the reader may wonder whether the same dynamic execution sequence would exist
as some I ′. It would indeed, as this argument shows. What we are saying is
that the same MPI instructions would be “processed” along I ′ also, and these
MPI instructions would be situated in the same HB relation. Furthermore, due to
our induction hypothesis, we can say that all the computations after ti were done
“correctly,” meaning that an equivalent interleaving I ′ will indeed be found.
Now that we have established that I ′ would also compute the same HB, we
can observe from our algorithm in Figure 5.10 that we would have added t to the
backtrack set at σi, contradicting the fact that t is outside the persistent set.
CHAPTER 6
DETERMINISTIC MPI PROGRAMS
This chapter proves that, for deterministic MPI programs, if Ri,j and Sm,n are a
receive and a send operation such that 〈Ri,j, Sm,n〉 /∈ HB∗(I) and 〈Sm,n, Ri,j〉 /∈ HB∗(I),
then Ri,j and Sm,n are co-enabled in some equivalent interleaving I ′.
6.1 Deterministic MPI Programs and HB
Deterministic MPI programs have the properties enumerated below:
• Since there are no wildcard receives in deterministic MPI programs, there
is only one relevant interleaving and every interleaving is equivalent to any
other interleaving of the program.
• The HB∗ relation will remain the same for any interleaving.
• If there is no deadlock in an interleaving, then there can be no deadlocks in
any other equivalent interleaving of the program.
We now prove that, for a receive operation Ri,j and send operation Sm,n in a
deterministic program (note that Ri,j and Sm,n need not match; they may not even
target each other), if Ri,j and Sm,n cannot be co-enabled, then 〈Ri,j, Sm,n〉 ∈ HB∗
or 〈Sm,n, Ri,j〉 ∈ HB∗.
Lemma 6.1 Consider a deadlock-free interleaving of a deterministic MPI program
I = σ0 −t0→ σ1 −t1→ . . . −tn−1→ σn. If Ri,j cannot be co-enabled with Sm,n in I,
then 〈Ri,j, Sm,n〉 ∈ HB∗ or 〈Sm,n, Ri,j〉 ∈ HB∗.
Proof : Given an interleaving I = σ0 → σ1 → . . . → σn, we use the notation
σi < σj when i < j to denote that σi was generated before σj in I.
For ease of notation, we denote the issue order among MPI operations of a
process as follows: if Mi,j is an MPI operation, we write Ni,j′ for an MPI
operation that is issued after Mi,j, where j′ > j. Similarly, we write Fi,j′′
for an MPI operation issued after Ni,j′, where j′′ > j′.
Let σa be the state in I where RSR : {Ri,j, Sk,l} is executed for some Sk,l and
σb be the state in I where RSR : {Rp,q, Sm,n} is executed for some Rp,q.
Since Ri,j and Sm,n cannot be co-enabled, either σa < σb or σb < σa. We prove
by contradiction for the case when σa < σb (the other case is similar).
• Assume that 〈Ri,j, Sm,n〉 /∈ HB∗.
• Consider an interleaving I ′ equivalent to I where RSR : {Ri,j, Sk,l} is executed
in σa′ only when enabled(σa′) = {RSR : {Ri,j, Sk,l}} (i.e., RSR : {Ri,j, Sk,l} is
executed only when there is no other transition to be executed).
• Since I and I ′ are equivalent, HB∗(I) and HB∗(I ′) are equal (Lemma 5.5).
We use HB∗ to denote the HB relation for both I and I ′.
• Since RSR : {Ri,j, Sk,l} is the only transition in enabled(σa′), all the processes
must be blocked either at a W or a B.
• If all the processes are blocked at B operation, this will cause a RBC to be
enabled in σa′. This is not possible since enabled(σa′) is a singleton containing
only the RSR : {Ri,j, Sk,l} transition. Hence, at least one of the processes must
be blocked at a W operation.
• If both Pi and Pk are blocked at a B, then executing RSR : {Ri,j, Sk,l} will not
unblock the B operations (since there is some other process that is blocked
on a W ) and hence will result in a deadlock. This is not possible since I is
deadlock-free. Hence, either Pi or Pk must be blocked on a W .
• If both Pi and Pk are blocked at a W :
– Assume that the process Pi is blocked at Wi,j′ such that 〈Ri,j, Wi,j′〉 /∈
HB∗(I ′) (i.e., Wi,j′ is not the W corresponding to the receive Ri,j).
– Similarly, assume that the process Pk is blocked at Wk,l′ such that
〈Sk,l, Wk,l′〉 /∈ HB∗.
– Executing the RSR : {Ri,j, Sk,l} transition from σa′ will not unblock any
of the waits causing a deadlock.
– Since I is deadlock-free, at least one of Wi,j′ or Wk,l′ must be such
that 〈Ri,j, Wi,j′〉 ∈ HB∗ or 〈Sk,l, Wk,l′〉 ∈ HB∗.
• Without loss of generality, assume that 〈Ri,j, Wi,j′〉 ∈ HB∗.
• Since {Ri,j, Sk,l} ⊆ Ready(σa′), 〈Sk,l, Wi,j′〉 ∈ HB∗ (by InterHB construc-
tion). Also, for all Fi,j′′ where F ∈ {S,R,W, B}, 〈Sk,l, Fi,j′′〉 ∈ HB∗.
• Executing the RSR : {Ri,j, Sk,l} transition from σa′ will unblock Wi,j′ .
• Some MPI operation Fi,j′′ following Wi,j′ will unblock some process Pr.
– If Pr is blocked at Wr,k, then 〈Fi,j′′ , Wr,k〉 ∈ HB∗ since Fi,j′′ must be
matched with the send or receive corresponding to Wr,k. Also, for all
Mr,k′ , we have 〈Fi,j′′ , Mr,k′〉 ∈ HB∗. Therefore, 〈Ri,j, Mr,k′〉 ∈ HB∗.
– If Pr is blocked at Br,k, then Fi,j′′ = Bi,j′′ . Therefore, 〈Bi,j′′ , Fr,k′〉 ∈ HB∗
for all k′ > k. Therefore, 〈Ri,j, Fr,k′〉 ∈ HB∗.
• Hence, as each process unblocks, there is an HB∗ relation from Ri,j to all the
MPI operations following the blocked W or B operations of every process.
For every process Pl, that is blocked at Wl,p or Bl,p, 〈Ri,j, Fl,p′〉 ∈ HB∗.
• If Sm,n is issued after the blocking W or B of Pm, then 〈Ri,j, Sm,n〉 ∈ HB∗.
This is a contradiction.
• If Sm,n were issued before the blocking W or B of Pm, since Sm,n and Ri,j are
not co-enabled, Sm,n /∈ Ready(σa′). Therefore, there is some MPI operation
Fm,r ∈ Ready(σa′) and r < n and 〈Fm,r, Sm,n〉 ∈ HB∗. Since enabled(σa′)
consists of only a single transition, Fm,r must be matched with an MPI
operation Fl,p′ of some process Pl that is issued after the blocking operation
of Pl (Wl,p or Bl,p). Therefore, 〈Fl,p′, Sm,n〉 ∈ HB∗. Also, 〈Ri,j, Fl,p′〉 ∈ HB∗.
Therefore, 〈Ri,j, Sm,n〉 ∈ HB∗. This is a contradiction.
Lemma 6.2 (Corollary of Lemma 6.1) Consider a deadlock-free interleaving of
a deterministic MPI program I = σ0 −t0→ σ1 −t1→ · · · −tn−1→ σn. If 〈Ri,j, Sm,n〉 /∈ HB∗
and 〈Sm,n, Ri,j〉 /∈ HB∗, then Ri,j and Sm,n are co-enabled in some state.
The only difference between the HB∗ for a deterministic MPI program and an
MPI program with wildcard receives is the absence of the InterHB edge between
a send and descendants of the matching receive. Given an interleaving I, we define
the Deterministic(HB(I)) as follows:
Definition 6.3 Given an interleaving I = σ0 → σ1 → . . . → σn, Deterministic(
HB(I)) = HB(I) ∪ {〈Si,j(k), Fk,p′〉 | {Si,j, Rk,p} ∈ M(σn), Fk,p′
∈ Descendants(σn, Rk,p)}.
By generating a deterministic HB for an interleaving, if there is no path between
a receive and a send in Deterministic(HB), then the receive and send can be
co-enabled in an equivalent interleaving. We use this result to find the sends that
must be buffered to eliminate the HB∗ relations between a send and a matching
wildcard receive so as to co-enable them.
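As a concrete illustration of this check (a minimal sketch, not the tool's implementation: MPI operations are encoded as plain strings, and `hb_edges`, `matches`, and `descendants` are hypothetical encodings of HB(I), the M set of σn, and Descendants(σn, ·)):

```python
from collections import defaultdict

def deterministic_hb(hb_edges, matches, descendants):
    """Deterministic(HB(I)) per Definition 6.3: add an edge from each
    send to every descendant of its matching receive."""
    edges = set(hb_edges)
    for send, recv in matches.items():
        for op in descendants.get(recv, ()):
            edges.add((send, op))
    return edges

def reachable(edges, src, dst):
    """True iff there is a path src -> dst, i.e., the pair is HB*-related."""
    adj = defaultdict(list)
    for a, b in edges:
        adj[a].append(b)
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(adj[node])
    return False
```

A receive R and a send S may be co-enabled in an equivalent interleaving exactly when neither `reachable(det, R, S)` nor `reachable(det, S, R)` holds in the deterministic graph.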
CHAPTER 7
HANDLING SLACK IN MPI PROGRAMS
This chapter deals with the slack-inelastic deadlocks of MPI programs described
in Section 4.4. Section 7.2 provides the reasons why buffering all sends is not a
solution. Section 7.4 characterizes the complexity of finding minimal sets of sends
to be buffered that can guarantee to detect all deadlocks, including head-to-head
deadlocks. Section 7.5 describes the minimal slack enumeration variant of POEOPT,
namely POEMSE, and its proof of correctness.
7.1 Verification for Portability
The importance of verifying a program for portability cannot be over-emphasized.
Given the growing popularity of dynamic verification, many important questions
must be answered:
• Does it matter where we run dynamic verification?
• In particular, having verified a program on one platform, what can we say
about its correctness on another platform?
While it is too ambitious to solve verification for portability in general, this
dissertation offers the following unique contribution: For buffering (slack) sensitive
behavioral variations, POEMSE guarantees that verifying a program to be correct
on any platform implies correctness on all platforms. In effect, POEMSE is able
to compute where slack matters, and simulate slack whenever it matters during
verification.
It is known from [31] (in the context of CHP programs - CSP for hardware) and
[48] (for MPI) that some MPI programs can have more behaviors when buffer
sizes are increased (say, when their eager limits are increased). Yet, the MPI
community is still unaware of these results. Commercial tools such as the Intel
Message Checker [22] still look for deadlocks by setting all send buffer sizes to zero
and relying on timeouts to tell when a deadlock has been encountered. However, the
deadlock in Figure 7.1 cannot be revealed by this approach. While one may hope to
detect deadlocks by simulating infinite buffering, the example in Figure 7.2 shows
that sometimes deadlocks are triggered only if some of the sends are buffered. While
we shall detail these examples in Section 7.2, the nasty reality is that one must be
prepared to verify for all combinations of sends with/without buffering. POEMSE
avoids this exponential cost by determining where (for which sends) slack matters,
and only replays the analysis for those.
As a real world example of the complexity of predicting when a send will have
slack, consider the discussion of eager limit computation given in [20]:
The Parallel Environment implementation of MPI uses an eager send protocol for messages whose size is up to the eager limit. This value can be allowed to default, or can be specified with the MP_EAGER_LIMIT environment variable or the -eager_limit command-line flag. In an eager send, the entire message is sent immediately to its destination and the send buffer is returned to the application. Since the message is sent without knowing if there is a matching receive waiting, the message may need to be stored in the early arrival buffer at the destination, until a matching receive is posted by the application. The MPI standard requires that an eager send be done only if it can be guaranteed that there is sufficient buffer space. If a send is posted at some source (sender) when buffer space cannot be guaranteed, the send must not complete at the source until it is known that there will be a place for the message at the destination.
P0               P1               P2
S0,1(1)          S1,1(2)          R2,1(∗)
W0,2(h0,1)       W1,2(h1,1)       W2,2(h2,1)
S0,3(2)          R1,3(0)          R2,3(0)
W0,4(h0,3)       W1,4(h1,3)       W2,4(h2,3)

Figure 7.1. Buffering Sends and Deadlocks
Figure 7.2. Specific Buffering Needed
PE MPI uses a credit flow control, by which senders track the buffer space that can be guaranteed at each destination. For each source-destination pair, an eager send consumes a message credit at the source, and a match at the destination generates a message credit. The message credits generated at the destination are returned to the sender to enable additional eager sends. The message credits are returned piggyback on an application message when possible. If there is no return traffic, they will accumulate at the destination until their number reaches some threshold, and then be sent back as a batch to minimize network traffic. When a sender has no message credits, its sends must proceed using rendezvous protocol until message credits become available. The fall back to rendezvous protocol may impact performance. With a reasonable supply of message credits, most applications will find that the credits return soon enough to enable messages that are not larger than the eager limit to continue to be sent eagerly.
From this discussion, it must be clear that algorithms such as POEMSE are
essential if we are to avoid the cost of an exponential slack analysis. While
POEMSE's analysis could, in the worst case, be exponential in the number of sends
issued at runtime, in practice the number of cases explored is extremely low
compared to this worst case.
7.2 Introduction to Slack Analysis
The POE algorithms developed until now assumed that all the sends are
nonbuffered, i.e., none of the sends is provided any runtime buffering or slack. We
now remove the buffering constraints on the sends and allow each send to be either
buffered or nonbuffered by the runtime. Once a send is buffered,
the send can be completed at any time. We now define the parameterized runtime
send buffering transition RSC as follows:
RSC(µ) :Σ(σ as 〈I,M, C, R, ls〉), {Si,j} = µ, Si,j /∈ C, {Si,j, Rk,l} /∈ M
Σ〈I,M, C ∪ {Si,j}, R, ls〉
The RSC(µ) transition completes any send in µ that is not matched yet.
In Section 7.4, we shall define a big-step transition RSBC(µ) that
completes all the sends in µ. We will be feeding as µ sets of sends that are
determined to be minimal. We shall now present examples that illustrate how these
minimal send sets are determined.
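For illustration only (not the actual verifier's data structures), the RSC transition can be sketched as a pure function over a stripped-down state 〈I, M, C〉, with sends encoded as strings:

```python
from dataclasses import dataclass

@dataclass
class State:
    """Hypothetical encoding of a slice of sigma = <I, M, C, R, ls>:
    `issued` is I restricted to sends, `matched` is the M set (as
    frozenset pairs), and `completed` is C."""
    issued: set
    matched: set
    completed: set

def rsc(state, send):
    """RSC({send}): complete an issued send that is neither completed nor
    matched yet, modeling runtime buffering (slack) for that send."""
    assert send in state.issued
    assert send not in state.completed
    assert not any(send in m for m in state.matched)
    return State(state.issued, state.matched, state.completed | {send})
```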
7.2.1 Zero Buffering Can Miss Deadlocks
Consider an MPI program execution in Figure 7.1. The MPI program in
Figure 7.1 will be deadlocked only when either S0,1 or S1,1 or both are buffered.
There is no deadlock when both S0,1 and S1,1 remain nonbuffered. The POEOPT
algorithm, which executes under zero buffering for all sends, will not be able to detect
this deadlock. Also note that it is sufficient to buffer just one of S0,1 or S1,1. The
buffering status of S0,3 will not matter in detecting the deadlock. This example
shows how providing slack to sends can cause communication races with respect to
wildcard receives and hence can result in erroneous behaviors.
One solution to detect all the communication races would be to buffer all the
sends in an MPI program. This will detect all the deadlocks involving communi-
cation races with respect to wildcard receives. Hence, it will now be sufficient to
execute the POE algorithm with full buffering for all sends to detect all commu-
nication races. However, nonbuffered sends are themselves a source of a deadlock.
Under insufficient buffering, a send can remain blocked on its wait when there is no
matching receive for the send and hence result in a deadlocked state. By allowing all
sends to be buffered, the POE algorithm will not be able to detect any head-to-head
deadlocks involving nonbuffered sends.
One solution that comes immediately to mind would be to run the POE algo-
rithm twice : once with all sends buffered and once with none of the sends buffered.
However, this will not detect all the deadlocks.
7.2.2 Too Much Buffering Can Miss Deadlocks
The example in Figure 7.2 will not deadlock when none of the sends are buffered
or all the sends are buffered. However, it would deadlock only when S0,1 is buffered
and S1,1 is not buffered.
When the POE algorithm is executed with zero buffering, S1,1 will match with
R2,1. This matching will cause S1,1 and R2,1 to complete and will unblock the waits
W1,2 and W2,2. S0,1 will now be matched with R1,3. This will unblock W0,2 and
W1,4. S0,3 will be matched with R2,3 which will unblock the waits W0,4 and W2,4.
R1,5 will be matched with S2,5 while P0 is blocked on W0,6. W1,6 and W2,6 will
unblock, resulting in matching R2,7 with S0,5. This will unblock W0,6. Finally, R0,7
will be matched with S1,7 which will cause the rest of the waits to unblock. Hence,
the POE algorithm has completed without a deadlock when none of the sends are
buffered.
Now consider the POE algorithm execution when all the sends are buffered.
The POE algorithm still executes with RSR∗ transition having the least priority.
S0,1 and S1,1 get buffered which will unblock W0,2 and W1,2. Similarly S0,3 and S0,5
will be buffered which causes W0,4 and W0,6 to be unblocked. R1,3 is matched with
S0,1. This matching will unblock W1,4. At this point, R2,1 can be matched with
either S1,1 or S0,3. When R2,1 is matched with S1,1, the match sets are the same
as those generated in the zero buffering case. When R2,1 is matched with S0,3, S1,1
will be matched with R2,7 and will terminate the program with no deadlock. Both
the buffering and nonbuffering executions will not detect a deadlock. The deadlock
only happens when only S0,1 is buffered and S1,1 and S2,5 are not buffered. The
rest of the sends may or may not be buffered. Since they are always matched,
their buffering status would not matter. We therefore look for those sends whose
buffering status would result in deadlocks.
A naive brute-force solution would be to execute the MPI program with all
possible buffering scenarios for all the sends. The example in Figure 7.2 would
result in at least 2^6 interleavings just for the six sends present in the execution.
This approach deteriorates rapidly as the number of sends in the program
increases. This chapter extends the POEOPT algorithm to handle slack and detect
any deadlock present with a reduced number of interleavings that does not
deteriorate as rapidly as the number of sends increases.
The crux of our analysis is to be able to tell that S0,1(1) is the only send with
this property in this whole program. To summarize:
• We must discover all minimal sets of sends to buffer so that other sends may
match with wildcard receives. In our example, we buffered S0,1(1), but the
send that matches with the wildcard receive as a result is S0,3(2).
• The minimal number of sends is not unique. Therefore, we must find all
possible such minimal sets of sends, and re-run the analysis for each of them.
• We must not buffer more than this minimal set in each case, because we may
then miss head-to-head deadlocks.
7.3 Using HB to Detect Slack
This section describes how to identify the slack properties of various sends based
on the HB relation. Since the HB relation is built after the program execution, the
first step would be to execute the POEOPT algorithm with all sends nonbuffered.
This will generate the initial HB graph. Buffering sends will also affect the HB
graph. When a send Si,j is buffered, Wi,j′(hi,j) will return immediately. We consider
such waits to have turned into no-ops. This will have the effect of deleting the
IntraHBs associated with these waits, i.e., for j′′ > j′ and F ∈ {S, R, W, B}, we
will remove 〈Wi,j′, Fi,j′′〉 from IntraHB. We call these waits culprit waits and the
sends associated with these waits culprit sends.
sends associated with these waits culprit sends.
Lemma 5.5 can be used to detect if a wildcard receive Ri,j and send Sm,n(i) can
be matched. Using this lemma, Ri,j and Sm,n cannot be co-enabled in any equivalent
interleaving I ′ if 〈Ri,j, Sm,n〉 ∈ HB∗(I) or 〈Sm,n, Ri,j〉 ∈ HB∗(I). Otherwise, they
may be co-enabled.
Of these cases, we need not consider the case of 〈Sm,n, Ri,j〉 ∈ HB∗ for this
simple reason. Suppose 〈Sm,n, Ri,j〉 ∈ HB∗ with respect to the initial nonbuffered
execution. Then, it means that there was an earlier receive in process i with which
Sm,n matched. Thus, nothing can make Sm,n match Ri,j in the buffered execution.
If 〈Ri,j, Sm,n〉 ∈ HB∗, then Ri,j and Sm,n cannot be co-enabled. However,
Lemma 6.2 uses the deterministic HB relation to detect that when 〈Ri,j, Sm,n〉 /∈
Deterministic(HB∗), they can be co-enabled. Therefore, when 〈Ri,j, Sm,n〉 ∈
HB∗, we need to detect the sends that can be buffered so that 〈Ri,j, Sm,n〉 /∈
Deterministic(HB∗). This means that it is sufficient to buffer those sends that will
cause the path from Ri,j to Sm,n in Deterministic(HB∗) to be broken.
7.3.1 HB Graph and Paths
We now describe how to detect the sends that need to be buffered to co-enable
a wildcard receive Ri,j and a send Sm,n(i) when 〈Ri,j, Sm,n〉 ∈ HB. We first convert
the HB relation into an HB graph called GHB, defined as follows:

Definition 7.1 GHB = (V, E), where the set of vertices V is the set of MPI
operations invoked by the various processes, and if 〈opi, opj〉 ∈ HB, then 〈opi, opj〉 ∈ E.
Hence, if two MPI operations opi and opj are HB-related, there is a path
between opi and opj in GHB. When a send is buffered, the HB relation is updated
by removing any edges going out of the W corresponding to the send, which will
break the paths through that wait.
Given a wildcard receive and its matching send in an interleaving I, we generate
the GDeterministic(HB(I)) graph and break the paths in GDeterministic(HB(I)). If there
is no path between a receive and a send in GDeterministic(HB(I)), we know from
Lemma 6.2 that the receive and send can be co-enabled in some state in an
equivalent interleaving.
If a path contains multiple culprit waits, buffering the send corresponding to just
one of those waits is sufficient to break the path.
Figure 7.3 shows the path between R2,1 and its matching send S0,3, which involves
the waits W1,2 and W0,2. It is sufficient to buffer the send corresponding to either
of the waits, as we have described before. We also need to buffer sends in all possible
ways in order to detect deadlocks involving any of the sends. We hence need to
detect all possible ways to break the paths given a set of culprit waits involved in
various paths between a wildcard receive and its matching send. We call these sets
minimal wait sets.
Figure 7.3. Path Breaking
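The path enumeration and the path-breaking effect of buffering can be sketched as follows (a minimal illustration assuming a hypothetical adjacency-list encoding of GHB; buffering a culprit send is modeled by deleting the outgoing edges of its wait):

```python
def all_paths(adj, src, dst, path=None):
    """Enumerate all simple paths src -> dst in the HB graph via DFS.
    `adj` maps an operation to the list of its HB successors."""
    path = (path or []) + [src]
    if src == dst:
        return [path]
    found = []
    for nxt in adj.get(src, ()):
        if nxt not in path:  # keep paths simple
            found.extend(all_paths(adj, nxt, dst, path))
    return found

def buffer_send(adj, culprit_wait):
    """Buffering a send turns its wait into a no-op: delete every IntraHB
    edge going out of the corresponding wait."""
    return {op: ([] if op == culprit_wait else succs)
            for op, succs in adj.items()}
```

On a chain resembling Figure 7.3, buffering the send of either wait on the path empties the set of paths, so the wildcard receive and its candidate send become co-enabled.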
7.4 Finding Minimal Wait Sets
Definition 7.2 Let π be a path between two MPI operations in GDeterministic(HB)
= (V, E). Let OnPath(π) be the set of Wi,j(hi,k) operations on path π such that
〈Si,k, Wi,j〉 ∈ E (these are the culprit waits and their associated culprit sends).

Definition 7.3 Let ζ be the set of all paths between Ri,j and Sk,l(i) such that
for every π ∈ ζ, OnPath(π) ≠ ∅. Let Wall = ∪π∈ζ OnPath(π) be the set of all the
culprit waits on all paths. With respect to Wall, we can now define a minimal wait
set Wmin ⊆ Wall as follows:
For any {w, w′} ⊆ Wmin and any path π ∈ ζ, w ∈ OnPath(π) ⇒ w′ /∈ OnPath(π),
and for every path π ∈ ζ, ∃c ∈ Wmin such that c ∈ OnPath(π).
That is, there is exactly one wait in Wmin whose send is buffered on every path.
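The "exactly one wait of Wmin per path" condition is cheap to check for a candidate set; a sketch (each path is encoded simply as the list OnPath(π) of its culprit waits, a hypothetical encoding):

```python
def is_minimal_wait_set(paths, w_min):
    """Definition 7.3 check: every path must contain exactly one wait
    from the candidate set w_min."""
    return all(sum(1 for w in path if w in w_min) == 1 for path in paths)
```

For the paths {w1, w2} and {w2, w3}, both {w2} and {w1, w3} qualify, while {w1} (second path uncovered) and {w1, w2} (first path doubly covered) do not. This same check serves as the polynomial-time certificate test in the proof of Theorem 7.4.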
Theorem 7.4 Given a set of paths ζ between Ri,j and Sk,l(i), finding Wmin is
NP-Complete.
Proof : We prove this by reducing monotone-1-in-3 SAT to our problem. The
above problem is in NP: given a certificate Wc, we can check in polynomial time
that each path has exactly one wait in Wc. A monotone-1-in-3 SAT formula
f is a 3-CNF formula that has no negations and must be satisfied by assigning exactly one
literal in every clause to true. Given a formula f , let v represent the set of variables
and c be the set of clauses. v represents Wall, i.e., each variable represents a wait
W . We construct a Happens-Before graph G = (V, E) with V = v and for every
clause ci = (x1 ∨ x2 ∨ x3), we add edges 〈x1, x2〉 ∈ E and 〈x2, x3〉 ∈ E. That
is, x1 → x2 → x3 forms a path in the graph. There is a source vertex labeled
Ri,j(∗) and sink vertex Sm,n(i) where 〈Ri,j, x1〉 ∈ E and 〈x3, Sm,n〉 ∈ E. A path
starts at the source vertex and ends at the sink vertex. If there is a Wmin for these
paths, then exactly one wait (one variable) is selected from every path. Setting the
variables corresponding to these waits to true in f will satisfy f . Conversely, if f
can be satisfied, the variables that have been set to true in each clause form a Wmin
set.
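The clause gadget of this reduction can be sketched as follows (a simplified illustration: variables are strings, and "R" and "S" stand for the source vertex Ri,j(∗) and sink vertex Sm,n(i)):

```python
def sat_to_hb_graph(clauses):
    """Theorem 7.4 reduction sketch: each clause (x1, x2, x3) contributes
    the path R -> x1 -> x2 -> x3 -> S, so a 1-in-3 satisfying assignment
    corresponds to picking exactly one vertex on every clause path."""
    edges = set()
    for x1, x2, x3 in clauses:
        edges |= {("R", x1), (x1, x2), (x2, x3), (x3, "S")}
    return edges
```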
Figure 7.4(a) and Figure 7.4(b) show the construction of GHB graphs for the
Monotone 1-in-3-SAT formulae when there are multiple Wmins possible and when
no Wmin is possible, respectively.
Theorem 7.5 Finding all minimal wait sets is #P-Complete.
Proof : Counting the satisfying assignments of monotone-1-in-3 SAT is a
#P-Complete problem, and our reduction is such that the number of solutions to
the SAT problem is equal to the number of minimal wait sets.
Since finding all possible minimal wait sets is #P-Complete, we propose the
algorithm in Figure 7.5, which finds all the subsets of the waits in Wall (i.e., the
powerset) and sorts the subsets by size. Then, it iterates over each subset in the sorted
order and checks whether buffering the waits in the set will break all the paths. If so, it
(a) Example with multiple Wmin (b) Example with no Wmin
Figure 7.4. Example Formulas and GHB graphs
MinimalWaitSets(ζ, Wall) {
  PWall = SortBySize(Powerset(Wall));
  for each (s ∈ PWall) {
    if (BreaksAllPaths(ζ, s)) {
      PWall = PWall − {p ∈ PWall | s ⊂ p};
    } else {
      PWall = PWall − {s};
    }
  }
  return PWall;
}
Figure 7.5. Algorithm to Find Minimal Wait Sets
removes all the supersets of the set from the powerset. If the set does not break all
the paths, then the set itself is removed from the powerset.
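This smallest-first scan with superset pruning can be sketched in a few lines (an illustration, not the tool's implementation; paths are again encoded as lists of their culprit waits):

```python
from itertools import combinations

def breaks_all_paths(paths, s):
    """A wait set s breaks all paths iff every path contains a wait in s."""
    return all(any(w in s for w in path) for path in paths)

def minimal_wait_sets(paths, w_all):
    """Scan subsets of w_all smallest-first; keep a subset if it breaks
    all paths, and prune every proper superset of a kept subset."""
    kept = []
    for size in range(1, len(w_all) + 1):
        for combo in combinations(sorted(w_all), size):
            s = frozenset(combo)
            if any(k < s for k in kept):  # proper superset of a kept set
                continue
            if breaks_all_paths(paths, s):
                kept.append(s)
    return kept
```

For the two paths {w1, w2} and {w2, w3}, the scan keeps {w2} and {w1, w3} and prunes every set containing them.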
The buffering transition for a send only buffers one send at a time. However,
the minimal wait sets can contain more than one wait whose send must be buffered.
Since RSC(µ) transitions are independent of all other MPI transitions, we combine
multiple RSC(µ) transitions into a big-step transition that, given a set of sends
µ, buffers all the sends in µ as follows:
RSBC(µ) :Σ(σ as 〈I,M, C, R, pc〉), µ ⊆ {Si,j ∈ I | Si,j /∈ C, {Si,j, Rk,l} /∈ M}
Σ〈I,M, C ∪ µ, R, pc〉
Note that we could instead add individual RSC transitions to the backtrack sets.
However, this would only cause redundant interleavings, which is very inefficient
for POE, a stateless dynamic verification algorithm. The RSBC transition avoids
these redundant interleavings.
7.5 POEMSE Algorithm
We now provide the POEMSE algorithm that extends the POEOPT algorithm
to handle slack. The algorithm differs from POEOPT with respect to updating
the backtrack sets. The rest of the algorithm remains unchanged. We only
provide the pseudocode where there are any changes or additions, as shown in
Figures 7.6, 7.7, and 7.8. The GetTransition routine is similar to the
GetTransition of POE except that the RSC transitions are never executed. This
emulates the zero-buffering behavior of POEOPT.
The pseudocode for UpdateBacktrack is shown in Figure 7.6. The pseudocode
invokes AddSlacktoBacktrack when an RSR∗ transition is executed from
that state, as before. Figure 7.8 uses the following helper routines:
• GetHBGraph takes a HB relation as input and returns a HB graph. In
the POEMSE algorithm, the Deterministic(HB) graph is passed as input.
• FindPaths takes the HB graph GHB and two MPI operations as input and
returns all the paths between the operations in the graph.
• FindWaits finds all the culprit waits in the paths, i.e., it finds the Wall set.
• GetMinSendSets returns the send sets corresponding to each of the minimal
wait sets.
• GetRSBC takes a set of sends as input and returns the RSBC transition for
those sends.
UpdateBacktrack(statevec) {
  for each (σ ∈ statevec) {
    if (enabled(σ) = ∅)
      return;
    ti = curr(σ);
    if (is∗(ti)) {
      AddSlacktoBacktrack(ti, σ, statevec);
      AddtoBacktrack(ti, σ, statevec);
    }
  }
}
Figure 7.6. Pseudocode for UpdateBacktrack
GetTransition(set of transitions T) {
  TB = {t ∈ T | isRSC(t)};
  if (hasnon∗(T − TB))
    return choosenon∗(T − TB);
  else if (has∗(T − TB))
    return choose∗(T − TB);
}
Figure 7.7. Pseudocode for GetTransition
AddSlacktoBacktrack(Transition ti, σ, statevec) {
  let Ri,j(∗) be the receive operation of ti and Sm,n(i) be some compatible
      send that we want to try and co-enable with Ri,j;
  GHB = GetHBGraph(Deterministic(HB));
  ζ = FindPaths(GHB, Ri,j, Sm,n);
  Wall = FindWaits(ζ);
  mws = MinimalWaitSets(ζ, Wall);
  if (mws = ∅)
    return;
  mss = GetMinSendSets(mws);
  for each (µ ∈ mss) {
    t = GetRSBC(µ);
    backtrack(σ) = backtrack(σ) ∪ {t};
  }
}
Figure 7.8. Pseudocode for AddSlacktoBacktrack
The backtrack(σ) is updated for the states where an RSR∗ transition is executed
in the current interleaving. From the Deterministic(HB) relation, the GHB graph
is generated. All the paths between the wildcard receive Ri,j and its matching send
Sm,n are found. The minimal wait sets are generated in mws. The mws is converted
into mss, where each of the waits in the sets is replaced by its corresponding
send, and the RSBC transition for each of the minimal send sets in mss is added
to the backtrack sets.
We now prove the following invariant for all states σi generated by the POEMSE
algorithm.
Lemma 7.6 In the POEMSE algorithm, when a state σi is popped from statevec,
if there exists an RSBC transition ti ∈ enabled(σi) such that
• executing ti can cause Rk,l(∗) ∈ Ready(σi) and some Sm,n(k) /∈ Ready(σi) to
become co-enabled, and
• RSR∗ : {Si,j, Rk,l} ∈ backtrack(σi),
then ti is in backtrack(σi).
Proof : Induction, by post-order, as with POEOPT.
• Basis case: The final state either has enabled(σi) = ∅ or contains only RSC
transitions, and the above invariant holds vacuously.
• Induction: Assume that the invariant holds for all successors of state σi.
• If RSR∗ : {Si,j, Rk,l} /∈ backtrack(σi), then the invariant holds vacuously.
These are the states where hasnon∗ is true.
• When RSR∗ : {Si,j, Rk,l} ∈ backtrack(σi), we prove by contradiction. Assume
that there is some RSBC transition ti involving one or more sends that
was not included in backtrack(σi) but must be included in backtrack(σi)
for Rk,l to be co-enabled with some Sm,n. Let σi+1 be the state reached
after executing ti from σi. Since ti is independent of all transitions, every
transition in enabled(σi) − {ti} is also available in enabled(σi+1). Therefore,
every interleaving generated from σi can also be generated from σi+1. Let
succ(σi) be the set of states generated from σi after executing transitions
from backtrack(σi).
Let DTGk be the dependent transition group of Rk,l. Clearly, Sm,n cannot
be from any process that is already in DTGk since Sm,n is IntraHB-related
to some Si,j(k) ∈ Ready(σi) or Rk,l and IntraHB relation does not change
across interleavings. Therefore, Sm,n must be from a process different from
the processes involved in DTGk. Let Sm,n be a descendant of some other
MPI operation Fm,r that belongs to some DTGm. Since DTGm and DTGk
are independent, DTGm is still enabled in some σl ∈ succ(σi).
By induction hypothesis, every code path where Sm,n occurs will be explored
for DTGm from σi (since interleavings from σi also include the interleavings
from succ(σi)). Therefore, every interleaving generated from σi+1 involving
Sm,n is equivalent to some interleaving generated from σi. For the interleavings
generated from σi, one of the following must hold:
– Rk,l and Sm,n are HB-related and buffering the sends in ti will not break
the paths between Rk,l and Sm,n, in which case Rk,l and Sm,n cannot be
co-enabled in an equivalent interleaving (Lemma 5.5), or
– Rk,l and Sm,n are not HB-related and hence can be co-enabled in an
equivalent interleaving without buffering any more sends (Lemma 6.2),
or
– There is a path from Rk,l to Sm,n involving culprit waits corresponding
to the sends in ti, and buffering them will break all the paths.
In the first two cases, it is not necessary to buffer any more sends and
therefore, it is not necessary to add ti to backtrack(σi), which contradicts
the assumption that it is necessary to add ti to backtrack(σi). In the last
case, the POEMSE algorithm will find all minimal wait sets and will add ti to
backtrack(σi), which is a contradiction.
Theorem 7.7 For any state σ generated by the POEMSE algorithm, the set
backtrack(σ) is persistent.
Proof : The proof follows directly from Lemma 7.6.
Theorem 7.8 The POEMSE algorithm will find deadlocks at Wi,j′(hi,j) when
Si,j does not have a matching receive.
Proof : The proof follows directly from the fact that POEMSE is persistent
and buffers all possible minimal wait sets at every state σi. Also, the very first
interleaving generated from any state σi executes the POEOPT algorithm with the
RSC transitions never executed. Since POEOPT is persistent when the sends are
zero buffered, the POEMSE algorithm will find the head-to-head deadlocks.
CHAPTER 8
EXTENSIONS TO THE FORMAL MODEL
This chapter extends the formal model presented in Chapter 3 to
• support more MPI functions: namely, MPI_Send, MPI_Recv, MPI_Waitall in
Section 8.1 and
• handle communicators and tags (Section 8.2).
8.1 Handling More MPI Functions
This section describes how the formal model can easily be extended to handle
a few very frequently used MPI functions.
8.1.1 MPI Send and MPI Recv
The MPI functions MPI_Send and MPI_Recv have the following prototypes:
int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest,
int tag, MPI_Comm comm);
int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int src,
int tag, MPI_Comm comm, MPI_Status *status);
MPI_Send blocks until the send completes. Similarly, MPI_Recv blocks until the
receive operation completes. MPI_Send will be denoted as Sb when the buffering
status of the send is immaterial. Sb will be used to explicitly denote when Sb is
buffered by the runtime.
Op is now extended with Sbi,j(k), Rbi,j(k), and Rbi,j(∗), where i, k ∈ PID and
j ∈ |Pi|.
The Nonovertake ordering (Definition 3.1) is extended to handle Sb and Rb as
follows:
Definition 8.1 Nonovertakeb(σ as 〈I,M, C, R, pc〉) ⊆ I × I = Nonovertake(σ)
∪ {〈Si,j(k), Sbi,j′(k)〉, 〈Sbi,j(k), Si,j′(k)〉, 〈Sbi,j(k), Sbi,j′(k)〉}
∪ {〈Ri,j(k), Rbi,j′(k)〉, 〈Rbi,j(k), Ri,j′(k)〉, 〈Rbi,j(k), Rbi,j′(k)〉}
∪ {〈Ri,j(∗), Rbi,j′(k)〉, 〈Rbi,j(∗), Ri,j′(k)〉, 〈Rbi,j(∗), Rbi,j′(k)〉}
∪ {〈Ri,j(∗), Rbi,j′(∗)〉, 〈Rbi,j(∗), Ri,j′(∗)〉, 〈Rbi,j(∗), Rbi,j′(∗)〉}
where i, k ∈ PID, j, j′ ∈ |Pi| and j < j′.
The Fence(σ) ordering (Definition 3.3) is updated as follows:

Definition 8.2 Fenceb(σ as 〈I,M, C, R, pc〉) ⊆ I × I =
Fence(σ) ∪ {〈Wi,j, Fi,j′〉, 〈Bi,j, Fi,j′〉, 〈Sbi,j, Fi,j′〉, 〈Rbi,j, Fi,j′〉}
where j < j′ and F ∈ {S, R, W, B, Sb, Rb}.
The isS(Fi,j) predicate is extended to return true when F = S or F = Sb.
Similarly, isR(Fi,j) is extended to return true when F = R or F = Rb. The PS
and PR transitions will now also include the process transitions for Sb and Rb MPI
operations.
The following runtime transitions are added to support the Sb and Rb MPI
operations:
RSRb :
Σ(σ), {Si,j(k), Rbk,l(i)} ⊆ Ready(σ)
Σ(σ′ as 〈I, M ∪ {{Si,j, Rbk,l}}, C, R, ls〉),
Assert : Ready(σ′) = Ready(σ) − {Si,j, Rbk,l}

RSbR :
Σ(σ), {Sbi,j(k), Rk,l(i)} ⊆ Ready(σ)
Σ(σ′ as 〈I, M ∪ {{Sbi,j, Rk,l}}, C, R, ls〉),
Assert : Ready(σ′) = Ready(σ) − {Sbi,j, Rk,l}

RSbRb :
Σ(σ), {Sbi,j(k), Rbk,l(i)} ⊆ Ready(σ)
Σ(σ′ as 〈I, M ∪ {{Sbi,j, Rbk,l}}, C, R, ls〉),
Assert : Ready(σ′) = Ready(σ) − {Sbi,j, Rbk,l}

The RSRb∗, RSbR∗, and RSbRb∗ transitions are added similarly.
The above runtime transitions indicate the various possible matchings between
different types of MPI send and receive operations. However, note that for any send
or receive, only one of the above transitions is enabled due to the Nonovertake
rule.
The RSbC rule to complete a nonbuffered send is:

RSbC :
Σ(σ), ({Sbi,j, Rk,l} ∈ M ∨ {Sbi,j, Rbk,l} ∈ M), Sbi,j /∈ C
Σ〈I, M, C ∪ {Sbi,j}, R ∪ {Sbi,j}, ls[i ← lsi + 1]〉

The corresponding rule when Sbi,j is buffered by the runtime has no matching
precondition:

RSbC :
Σ(σ), Sbi,j /∈ C
Σ〈I, M, C ∪ {Sbi,j}, R ∪ {Sbi,j}, ls[i ← lsi + 1]〉
The RRbC rule to complete a blocking receive is:

RRbC :
Σ(σ), ({Si,j, Rbk,l} ∈ M ∨ {Sbi,j, Rbk,l} ∈ M), Rbk,l /∈ C
Σ〈I, M, C ∪ {Rbk,l}, R ∪ {Rbk,l}, ls[k ← lsk + 1]〉
The Rb and Sb operations return only when they are completed, unlike their
nonblocking counterparts S and R that can return at any time.
8.1.2 MPI Waitall
The MPI Waitall operation has the following prototype:
int MPI_Waitall( int count, MPI_Request array_of_requests[],
MPI_Status array_of_statuses[]);
and its arguments include an array of MPI_Request handles, where count is the size
of array_of_requests. The handles can be the MPI_Request handles of either S or
R. MPI_Waitall is denoted as Walli,j′(H), where H is the set of MPI handles and
hi,j ∈ H denotes either Si,j or Ri,j.
The Resource(σ) (Definition 3.2) and Fence(σ) (Definition 3.3) sets are updated
as follows:
Definition 8.3 ResourceWall(σ) =
Resource(σ) ∪ ⋃hi,j∈H {〈Si,j, Walli,j′〉, 〈Ri,j, Walli,j′〉 | j < j′}.

Definition 8.4 FenceWall(σ) =
Fenceb(σ) ∪ {〈Walli,j, Fi,j′〉 | j < j′, F ∈ {S, R, W, B, Sb, Rb}}.
We extend the isW (Fi,j) predicate to return true when F = W or F = Wall
and false otherwise. The PW transition will now include the process transition for
Wall. The runtime transitions for Wall are presented below:
RWallC :
Σ(σ), Walli,j ∈ Ready(σ)
Σ(σ′ as 〈I, M, C ∪ {Walli,j}, R, ls〉),
Assert : Ready(σ′) = Ready(σ) − {Walli,j}

RWRet :
Σ(σ), Walli,j ∈ C, Walli,j /∈ R
Σ〈I, M, C, R ∪ {Walli,j}, ls[i ← lsi + 1]〉
The MPI_Waitany operation behaves like Wall except that MPI_Waitany
can return when at least one of the sends or receives corresponding to its handles
is complete. This requires a change to the definition of the Ready set so that
MPI_Waitany enters the Ready(σ) set when at least one of its sends or receives is
complete, instead of all of them as required for Wall.
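The difference between the two Ready conditions can be sketched with two predicates over a hypothetical set of completed operations:

```python
def waitall_ready(handles, completed):
    """Wall enters Ready only when every send/receive behind its
    handles has completed."""
    return all(h in completed for h in handles)

def waitany_ready(handles, completed):
    """MPI_Waitany needs only one of its handles' operations complete."""
    return any(h in completed for h in handles)
```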
8.2 Communicators and Tags
The formal model presented in Chapter 3 abstracts away the communicators
and tags. We now describe how the formal model can be extended to handle
communicators and tags. Consider an MPI program execution with n processes.
MPI allows the n processes to be divided into subsets called groups. Every process
can belong to one or more groups. All the processes in a group with m ≤ n processes
are ranked from 0 to m − 1 within the group. Initially, when the MPI processes
execute MPI_Init, all processes by default belong to the group MPI_GROUP_WORLD.
All groups are created as subsets of MPI_GROUP_WORLD.
MPI groups are created using one of many group construction APIs provided
by the MPI library. For example, subsets of a group can be constructed using:

int MPI_Group_incl(MPI_Group ingroup, int m, int *ranks,
MPI_Group *newgroup);

where newgroup contains m processes from ingroup and ranks gives the ranks of
the m processes in ingroup that must be included in newgroup. Note that
newgroup will also rank its processes from 0 to m − 1. Hence, a process in newgroup
can have a different rank than it has in ingroup. Since groups are essentially sets of
processes, various set operations on groups are provided to create new groups.
The following is a list of a few MPI group creation functions.
• MPI_Group_incl creates a subset of a group.
• MPI_Group_difference creates the set difference of two groups.
• MPI_Group_union creates a new group that is the union of two groups.
• MPI_Group_intersection creates a new group that is the intersection of two groups.
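Since a group is essentially an ordered set of process ranks, these constructors can be modeled directly with list operations. The following sketch mirrors their semantics; the helper names (group_incl, group_union, and so on) are illustrative and are not the MPI API, and a group is modeled here as a list mapping group rank to MPI_GROUP_WORLD rank:

```python
# A group is modeled as an ordered list whose i-th entry is the
# MPI_GROUP_WORLD rank of the process with group rank i.
# Helper names below are illustrative, not actual MPI functions.

def group_incl(ingroup, ranks):
    # newgroup re-ranks the selected processes from 0 to m-1,
    # in the order given by `ranks` (as MPI_Group_incl does).
    return [ingroup[r] for r in ranks]

def group_union(g1, g2):
    # Union keeps g1's order, then appends members of g2 not in g1.
    return g1 + [p for p in g2 if p not in g1]

def group_intersection(g1, g2):
    return [p for p in g1 if p in g2]

def group_difference(g1, g2):
    return [p for p in g1 if p not in g2]

world = [0, 1, 2, 3]                  # MPI_GROUP_WORLD with n = 4
evens = group_incl(world, [0, 2])     # world ranks 0 and 2, re-ranked 0..1
odds = group_difference(world, evens)
```

Note how group_incl makes the re-ranking explicit: world rank 2 becomes rank 1 inside evens, which is exactly why a process can hold different ranks in different groups.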
Processes within the same group can communicate in disjoint contexts by creating various communicators associated with the group. The MPI library by default associates the MPI_COMM_WORLD communicator with MPI_GROUP_WORLD. The MPI library provides communicator creation APIs, and every communicator created is uniquely identified by the MPI runtime.
The formal model assumes that all communication happens with comm = MPI_COMM_WORLD. The process ranks are hence the ranks of the processes in MPI_GROUP_WORLD. An MPI_Isend sends to another process receiving on the same communicator comm; the dest field is the process rank in the group associated with comm. The MPI library provides a mapping function (MPI_Group_translate_ranks) that maps the rank of a process in any group to its rank in MPI_GROUP_WORLD. We therefore assume that a process rank is always mapped to MPI_GROUP_WORLD, so the formal model only needs to check that the communicators are the same.
Tags provide more fine-grained communication within a communicator. A tag is an integer that, along with the communicator, identifies a message. When messages are matched, the tags must match along with the communicators. The tag field can also be the wildcard MPI_ANY_TAG, denoted as “*”.
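The matching condition just described (communicators must be equal, the receive must name the sender or use the wildcard source, and the tags must agree or the receive must use the wildcard tag) can be written as a small predicate. This is an illustrative sketch, not ISP code:

```python
ANY_TAG = "*"   # models MPI_ANY_TAG
ANY_SRC = "*"   # models MPI_ANY_SOURCE

def can_match(send, recv):
    """send = (src, dest, comm, tag); recv = (src, owner, comm, tag).
    A receive matches a send when they name the same communicator,
    the receive is posted by the send's destination, the receive names
    the sender (or is a wildcard), and the tags agree (or the receive
    uses the wildcard tag)."""
    s_src, s_dest, s_comm, s_tag = send
    r_src, r_owner, r_comm, r_tag = recv
    return (s_comm == r_comm
            and s_dest == r_owner
            and r_src in (s_src, ANY_SRC)
            and r_tag in (s_tag, ANY_TAG))

# P0 sends to P1 on communicator 0 with tag 7:
send = (0, 1, 0, 7)
assert can_match(send, (0, 1, 0, 7))        # exact match
assert can_match(send, ("*", 1, 0, "*"))    # wildcard source and tag
assert not can_match(send, (0, 1, 1, 7))    # different communicator
```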
Note that communicators and tags can only dictate the IntraHB order among the operations within a process and do not contribute any additional nondeterminism.
8.2.1 Extensions to the Formal Model
We now extend the MPI operations S and R with a communicator comm and a tag, where comm ∈ N and tag ∈ N ∪ {∗}. Hence, the set Op now contains {Si,j(k, comm, tag), Ri,j(k, comm, tag), Ri,j(∗, comm, tag), Wi,j′(hi,j), Bi,j(comm)}.
The Nonovertake rule (Definition 3.1) is redefined as follows:
Definition 8.5 Nonovertake(σ) =
{〈Si,j(k, commj, tagj), Si,j′(k, commj′, tagj′)〉, 〈Ri,j(k, commj, tagj), Ri,j′(k, commj′, tagj′)〉,
〈Ri,j(∗, commj, tagj), Ri,j′(k, commj′, tagj′)〉, 〈Ri,j(∗, commj, tagj), Ri,j′(∗, commj′, tagj′)〉
| j < j′ ∧ commj = commj′ ∧ (tagj = “∗” ∨ tagj = tagj′)}.
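Definition 8.5 can be read as a predicate over two same-process operations: the program-order-later operation j′ is forced to match after j exactly when both name the same communicator and the earlier tag is the wildcard or equals the later tag. A sketch of that reading (illustrative only, not part of the formal model's mechanization):

```python
def nonovertake(op1, op2):
    """op = (kind, target, comm, tag), with kind in {"S", "R"} and "*"
    as the wildcard; op1 is the program-order-earlier operation.
    Returns True when op1 must match before op2 (Definition 8.5)."""
    k1, t1, c1, tag1 = op1
    k2, t2, c2, tag2 = op2
    if k1 != k2 or c1 != c2:
        return False
    # Two sends to the same destination are ordered; two receives are
    # ordered when the earlier names the same source or is a wildcard.
    same_target = (t1 == t2) or (k1 == "R" and t1 == "*")
    return same_target and (tag1 == "*" or tag1 == tag2)

assert nonovertake(("S", 2, 0, 5), ("S", 2, 0, 5))       # same dest, same tag
assert nonovertake(("R", "*", 0, "*"), ("R", 1, 0, 9))   # wildcard precedes specific
assert not nonovertake(("S", 2, 0, 5), ("S", 2, 1, 5))   # different communicator
```

Observe that a specific-source receive followed by a wildcard receive is not in the set, matching the four pair shapes listed in the definition.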
The MPI transitions RSR, RBC and RSR∗ are extended to support MPI com-
municators and tags as follows:
RSR :
Σ(σ),
{Si,j′(k, commi, tagi), Rk,l′(i, commk, tagk)} ⊆ Ready(σ),
commi = commk,
tagi = ∗ =⇒ Si,j(k, commi, tagk) ∉ Ready(σ) for all j < j′,
tagk = ∗ =⇒ Rk,l(i, commi, tagi) ∉ Ready(σ) for all l < l′,
tagi ≠ ∗ ∧ tagk ≠ ∗ =⇒ tagi = tagk
Σ(σ′ as 〈I,M ∪ {{Si,j′, Rk,l′(i)}}, C, R, ls〉), Assert : Ready(σ′) = Ready(σ) − {Si,j′, Rk,l′}
RSR∗ :
Σ(σ),
{Si,j′(k, commi, tagi), Rk,l′(∗, commk, tagk)} ⊆ Ready(σ),
commi = commk,
Rk,l(i, commi, ∗) ∉ Ready(σ) ∧ Rk,l(i, commi, tagi) ∉ Ready(σ) for all l < l′,
tagi = ∗ =⇒ Si,j(k, commi, tagk) ∉ Ready(σ) for all j < j′,
tagk = ∗ =⇒ Rk,l(∗, commi, tagi) ∉ Ready(σ) for all l < l′,
tagi ≠ ∗ ∧ tagk ≠ ∗ =⇒ tagi = tagk
Σ(σ′ as 〈I,M ∪ {{Si,j′, Rk,l′(i)}}, C, R, ls〉), Assert : Ready(σ′) = Ready(σ) − {Si,j′, Rk,l′}
RBC :
Σ(σ),
bar(comm) as {Bi,j(commi) | Bi,j ∈ Ready(σ) ∧ commi = comm},
| bar | = size(comm)
Σ(σ′ as 〈I,M ∪ {bar}, C ∪ bar, R, ls〉), Assert : Ready(σ′) = Ready(σ) − bar
where size(comm) is the number of processes in the group corresponding to
comm.
CHAPTER 9
ISP: A PRACTICAL DYNAMIC MPI
VERIFIER
This chapter presents ISP, our dynamic MPI verification tool, which incorporates the verification algorithms POE, POEOPT and POEMSE. Section 9.1 describes the architecture of ISP. Section 9.2 describes various implementation tricks incorporated into ISP in order to implement the verification algorithms. Finally, Section 9.3 presents experimental results for the three POE algorithm variations.
9.1 ISP Architecture
ISP behaves as an auxiliary MPI runtime and performs the matching of var-
ious MPI functions. ISP uses the actual MPI runtime (henceforth referred to as
MPI library) to transfer data and complete the MPI operations. ISP works by
intercepting the MPI calls made by the target program and making decisions on
when to send the MPI calls to the MPI library. This is accomplished by the two
main components of ISP : the Profiler and the Scheduler. Figure 9.1 provides an
overview of ISP’s components and their interaction with the program as well as the
MPI library.
9.1.1 The Profiler
The interception of MPI calls is accomplished by compiling the ISP profiler
together with the target program’s source code. The profiler makes use of MPI’s
profiling mechanism (PMPI). It provides its own version of MPI f for each corre-
sponding MPI function f . Within each of these MPI f, the profiler communicates
with the scheduler using TCP sockets to send information about the MPI call the
process wants to execute. It will then wait for the scheduler to make a decision
Figure 9.1. ISP Architecture
whether the MPI call must be sent into the MPI library or postponed until a later time. When the scheduler gives permission to fire f, the corresponding PMPI_f is issued to the MPI library. Since all MPI libraries come with a PMPI_f for every MPI function f, this approach provides a portable and lightweight instrumentation mechanism for the MPI programs being verified.
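The interposition pattern can be pictured abstractly: each wrapper reports the call to the scheduler, blocks until permission arrives, and only then invokes the PMPI-level routine. The sketch below simulates this handshake in-process; the real profiler is C code talking to the scheduler over TCP, and all names here are illustrative:

```python
class Scheduler:
    """Stands in for the ISP scheduler: records announced calls and
    grants permission (the real scheduler may postpone the call)."""
    def __init__(self):
        self.log = []

    def announce(self, rank, call):
        self.log.append((rank, call))
        return True   # permission to fire the call

def make_wrapper(scheduler, rank, name, pmpi_fn):
    # Corresponds to the profiler's MPI_f: report, wait for permission,
    # then issue the underlying PMPI_f.
    def mpi_fn(*args):
        if scheduler.announce(rank, name):
            return pmpi_fn(*args)
    return mpi_fn

issued = []
sched = Scheduler()
MPI_Barrier = make_wrapper(sched, 0, "MPI_Barrier",
                           lambda: issued.append("PMPI_Barrier"))
MPI_Barrier()
assert sched.log == [(0, "MPI_Barrier")]
assert issued == ["PMPI_Barrier"]
```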
9.1.2 The ISP Scheduler
The ISP scheduler carries out the verification algorithms. Since every process starts executing with an MPI_Init, every process invokes the MPI_Init provided by the profiler. The profiler's MPI_Init establishes a TCP connection with the scheduler and communicates its process rank to the scheduler. This TCP connection is used for all further communication between the process and the scheduler, and the scheduler maintains a mapping between each process rank and its corresponding TCP connection. Once the connection with the scheduler is established, the processes issue a PMPI_Init into the MPI library. The processes finally return from the profiler's MPI_Init and continue executing the program.
Whenever a process wishes to execute an MPI function, it invokes the profiler's MPI_f, which communicates this information to the scheduler over the TCP connection. The profiler does not always issue the PMPI_f call into the MPI library when the process calls the profiler's MPI_f. For nonblocking calls like MPI_Isend and MPI_Irecv, the profiler code sends the information to the scheduler, stores it in a structure within the profiler, and returns. When the process executes a fence instruction like MPI_Wait, the scheduler makes its matching decisions and sends a message to the process to issue the PMPI_Isend (or other nonblocking function) corresponding to the Wait call. The MPI library is not aware of the existence of the MPI_Isend until this time. Eventually, the scheduler sends a message to the process to issue the PMPI_Wait, at which time the process returns. It must be noted that the scheduler allows a process to execute a fence MPI function only when the Wait can complete and hence return; otherwise, the scheduler detects a deadlock.
9.2 ISP: Implementation Issues
This section briefly describes various implementation decisions made in ISP in order to support the verification algorithms.
9.2.1 Out-of-Order Issue
For nonblocking calls, the PMPI f functions are not executed when the MPI f
is executed by the process. The reason behind this decision is the nonblocking
wildcard receive function MPI_Irecv . If the process executing the wildcard receive
into the profiler also executes the PMPI Irecv into the library, the actual matching
of the receive with a send will be decided by the MPI library. Since the scheduler
MUST ensure that the matching that happens in the library is the matching the
scheduler has decided, the scheduler postpones the issue of the wildcard receive into
the MPI library until a later time. Once the scheduler decides that the wildcard
receive must be matched with the send of a particular process rank, it communicates
this decision to the process to execute a PMPI Irecv with the src set to the process
99
rank of the send (we call this Dynamic source rewrite). Note that the MPI
library never knows the existence of a wildcard receive. Since the nondeterminism
is taken away, the library matches the sends and receives as the scheduler decides.
Since the scheduler must know all the sends that can match with a nonblocking
wildcard receive, a wildcard receive may be issued out-of-order into the MPI library.
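Dynamic source rewrite can be pictured as follows: once the scheduler fixes the match for a wildcard receive, the receive is issued to the library with its src field replaced by the chosen sender's rank. This is an illustrative sketch; the names are not ISP's actual code:

```python
def rewrite_source(recv, chosen_src):
    """recv models a posted wildcard receive, e.g.
    {"src": "*", "comm": c, "tag": t}. Return the receive that is
    actually issued to the MPI library, with the wildcard source
    replaced by the rank the scheduler decided on."""
    assert recv["src"] == "*", "only wildcard receives are rewritten"
    issued = dict(recv)       # the library never sees the wildcard
    issued["src"] = chosen_src
    return issued

wild = {"src": "*", "comm": 0, "tag": 7}
# The scheduler decides the send from rank 2 is the match:
assert rewrite_source(wild, 2) == {"src": 2, "comm": 0, "tag": 7}
```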
9.2.2 Scheduling MPI Waitany
Due to the out-of-order issue behavior of ISP, when a nonblocking call such as MPI_Irecv is invoked by the process, the profiler provides a unique MPI_Request handle for the nonblocking receive. When MPI_Waitany is invoked by a process with a set of request handles, it is sufficient to complete only one of the MPI_Isend or MPI_Irecv operations corresponding to the requests. Consider an MPI_Waitany that has n requests when only i of the sends or receives have been issued to the MPI library. The MPI library is aware of only these i requests and has no knowledge of the existence of the remaining n − i requests. When a PMPI_Waitany is called into the MPI library, the library aborts the process with an error that the request structure is invalid. We get around this issue by setting all of the n − i unissued requests to MPI_REQUEST_NULL; such requests are ignored by the library.
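The workaround amounts to a simple filter: before forwarding PMPI_Waitany, replace every request that has not yet been issued to the library with MPI_REQUEST_NULL, which the library ignores. A sketch with illustrative names:

```python
REQUEST_NULL = None   # stands in for MPI_REQUEST_NULL

def prepare_waitany(requests, issued):
    """requests: the handles the process passed to MPI_Waitany;
    issued: the subset the scheduler has already sent into the MPI
    library. Unissued handles are nulled so the library never sees a
    request structure it does not know about."""
    return [r if r in issued else REQUEST_NULL for r in requests]

reqs = ["h0", "h1", "h2", "h3"]
# Only h1 and h3 have been issued to the library so far:
assert prepare_waitany(reqs, {"h1", "h3"}) == [REQUEST_NULL, "h1", REQUEST_NULL, "h3"]
```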
9.2.3 Buffering Sends
In order to implement the POEMSE algorithm, the scheduler must be able to provide buffering for sends so that the waits corresponding to those sends can unblock. The scheduler cannot rely on the MPI library to provide buffering according to the scheduler's wishes, so the solution is implemented in the profiler. The profiler buffers a send by copying its data into a separate heap space. When the wait corresponding to the send is later issued into the profiler, the wait never issues a PMPI_Wait into the MPI library and instead returns from MPI_Wait immediately. The profiler eventually issues a PMPI_Wait for the buffered send once the send is matched with a receive. Note that the scheduler allows a send to be issued into the library only when there is a matching receive.
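Buffering a send thus decouples the application-level Wait from the library-level completion: the payload is copied into profiler-owned space so the Wait can return immediately, and the real PMPI_Wait is deferred until the send is matched. A sketch of that life cycle (illustrative names, not ISP code):

```python
class BufferedSend:
    def __init__(self, data):
        self.copy = bytes(data)   # profiler-owned copy of the payload
        self.waited = False       # application-level Wait already returned?
        self.completed = False    # deferred PMPI_Wait issued after a match?

def mpi_wait(send):
    # The application's Wait returns without touching the MPI library.
    send.waited = True

def on_match(send):
    # Only once a matching receive exists does the profiler issue the
    # deferred PMPI_Wait for the buffered send.
    send.completed = True

s = BufferedSend(b"payload")
mpi_wait(s)
assert s.waited and not s.completed   # Wait unblocked, send still pending
on_match(s)
assert s.completed
```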
9.3 Experimental Results
This section presents experimental results obtained by running ISP on various MPI programs. Our results are reported on the following MPI programs:
• The Umpire test suite [64] consists of a set of small MPI programs that capture various error and deadlock patterns in MPI programs.
• MADRE [53, 49] is a collection of memory-aware parallel redistribution algorithms addressing the problem of efficiently moving data blocks across processes without exceeding the allotted memory of each process. MADRE is an interesting target for ISP because it belongs to a class of MPI programs that make use of wildcard receives, which can result in deadlocks that easily go undetected.
• ParMETIS [43] is a parallel library that provides implementations of several effective graph partitioning algorithms, including several parallel routines especially suitable for graph partitioning in a parallel computing environment. ParMETIS has more than 14K lines of code and executes more than a million MPI calls when run with 32 processes.
We compare the POE algorithm with a well-known MPI testing tool called Marmot [26]. Marmot detects deadlocks using a timeout mechanism, and its architecture is similar to ISP's: the MPI calls of each process are trapped by Marmot, and when a process does not provide Marmot with its next MPI function before a user-defined timeout expires, Marmot signals a deadlock warning. We run both the POE algorithm and Marmot on the Umpire test suite. The results for a small set of the benchmarks are shown in Table 9.1; readers can find the full set of results at [25].
Table 9.1 has three columns. The first column provides the Umpire bench-
mark programs. The second column shows the result of running the Umpire
Table 9.1. Comparison of POE with Marmot

Umpire Benchmark           POE                                      Marmot
any_src-can-deadlock7.c    Deadlock detected (2 interleavings)      Deadlock caught in 5/10 runs
any_src-can-deadlock10.c   Deadlock detected (1 interleaving)       Deadlock caught in 7/10 runs
basic-deadlock10.c         Deadlock detected (1 interleaving)       Deadlock caught in 10/10 runs
basic-deadlock2.c          No deadlock detected (2 interleavings)   No deadlock caught in 20 runs
collective-misorder.c      Deadlock detected (1 interleaving)       Deadlock caught in 10/10 runs
collective-misorder2.c     Deadlock detected (1 interleaving)       No deadlock caught in 20 runs
benchmark on ISP executing the POE algorithm; we show the number of interleavings generated by POE. The last column shows the result of running the benchmark with Marmot. Each benchmark is run multiple times on Marmot to see how reliably Marmot detects a deadlock. As the results show, Marmot does not necessarily detect the presence of a deadlock every time it is run; detection is not guaranteed when the program contains nondeterministic wildcard receives. For a deterministic program like collective-misorder.c, the deadlock is detected by Marmot in every run. The reason can be directly deduced from the fact that all interleavings of a deterministic program are equivalent: if there is a deadlock in one interleaving, there will be a deadlock in every interleaving. POE detects a deadlock in collective-misorder2.c, which is also a deterministic program, even though Marmot does not detect any. This is because our formal model strictly treats all collective MPI functions as barriers, whereas the MPI standard gives MPI libraries latitude in implementing other collectives like MPI_Bcast, where these operations are not necessarily as synchronizing as MPI_Barrier. Our formal model
uses the strictest definition possible so that all potential deadlocks are detected irrespective of the MPI library on which the program will eventually be executed.
We now provide the experimental results for ParMETIS and MADRE. ParMETIS has no nondeterministic wildcard receives; hence, for any number of processes, the POE algorithm generates only a single interleaving. Table 9.2 shows the experimental results on MADRE.
The results are shown for the POE algorithm run on different MADRE programs with different process counts. A “-” in a column indicates that ISP did not terminate even after exploring more than 150,000 interleavings. The results indicate that the POE algorithm also suffers from the state explosion problem inherent in all verification tools: the benchmarks in MADRE have different DTG groups, due to which the POE algorithm's interleaving count explodes. However, the POEOPT algorithm is able to reduce the number of interleavings in many cases to just one.
The benchmarks above did not exhibit the slack-inelastic patterns targeted by the POEMSE algorithm. Our first study was therefore to see whether the POEMSE algorithm would detect deadlocks in a small set of hand-coded examples. The POEMSE algorithm successfully detected deadlocks in these programs where POE failed to find them, as shown in Table 9.3.
Our second study was to measure the overhead of POEMSE on large MPI applications. ParMETIS∗ is a modified version of ParMETIS in which a small part of the algorithm was rewritten using wildcard receives. In our benchmarks, even in the presence of wildcard receives, where the new algorithm has to run extra steps to make sure that all possible matchings in the presence of buffering are covered, the overhead is less than 3%.
Finally, we study large examples with slack-variant patterns inserted into them. This is shown as ParMETISb, where we rewrote the algorithm of ParMETIS again, this time not only to introduce wildcard receives, but also to allow the possibility of a different order of matching that can only be discovered by allowing certain sends to be buffered. Our experiments show that POEMSE successfully discovered the alternate matches.
Table 9.2. Results for POE and POEOPT on MADRE

MADRE   POE interleavings            POEOPT interleavings
        2 Procs  3 Procs  4 Procs    2 Procs  3 Procs  4 Procs
sbt1    2        6        24         1        1        1
sbt2    2        6        24         1        1        1
sbt3    20       1680     -          1        1        1
sbt4    1        20       1680       1        1        1
sbt5    1        20       1680       1        20       1680
sbt6    1        20       1680       1        20       1680
sbt7    20       -        -          1        20       1680
sbt8    1        1        1          1        1        1
sbt9    -        1        -          1        1        1
Table 9.3. Results for POE and POEMSE

Number of interleavings (note the extra necessary interleavings of POEMSE)

Benchmark       POEMSE                 POE
sendbuff.c      5                      1
sendbuff-1a.c   2 (deadlock caught)    1
sendbuff2.c     1                      1
sendbuff3.c     6                      1
sendbuff4.c     3                      1
ParMETISb       2                      1

Overhead of POEMSE on ParMETIS / ParMETIS∗ (runtime in seconds; (x) denotes x interleavings)

Benchmark            POEMSE      POE
ParMETIS (4 procs)   20.9 (1)    20.5 (1)
ParMETIS (8 procs)   93.4 (1)    92.6 (1)
ParMETIS∗            18.2 (2)    18.7 (2)
CHAPTER 10
CONCLUSIONS
Standardization success stories such as MPI are extremely rare in computing practice. Enduring standards provide the perfect context within which to create robust design and debugging techniques. Yet, the formal methods community has largely ignored MPI and related developments, while the same community produces a vast number of papers on shared memory concurrency formalization. We can offer only two explanations: (i) the sheer number of MPI API calls (more than 300 in MPI 2.0) seems to have discouraged the CS community from understanding MPI, and (ii) the problems solved using MPI are typically not taught in mainstream CS classes.
This dissertation contributes to a deep understanding of the primitives underlying MPI. As we show, one can understand MPI through a small set of primitive notions such as nonblocking sends, receives, waits and barriers. If one teaches this much smaller primitive basis of MPI to newcomers, they will have a much easier time reasoning about their MPI programs and possible optimizations. One can also formulate a whole range of analysis problems in terms of the Happens-Before relation for MPI that we contribute. As a concrete example, prior to our work, there was no formal, systematic way to argue whether the following one-line MPI program will deadlock or not, both with and without slack. Using our Happens-Before relation for MPI, we can precisely analyze even tricky “auto-send” examples of this nature.
P0: Irecv(from P0, x, &h); Wait(&h); Barrier; Isend(to P0, 22);
10.1 Suggestions for Future Work
• State space: Though the POE algorithms developed in this dissertation guarantee full coverage, they unfortunately suffer from the state space explosion problem. These verification algorithms would benefit from exploiting either programmer knowledge or static analysis techniques that provide information on semantic equivalence between various wildcard matches; this can considerably reduce the state space and the verification time.
• Implementing MPI_Waitany: The MPI functions MPI_Waitany and MPI_Waitsome are sources of nondeterminism. For n request handles, repeated calls to MPI_Waitany can cause n! interleavings, and MPI_Waitsome can cause 2^n − 1 interleavings. Clearly, the presence of these MPI functions in a program can quickly cause the state space of a verification tool to explode. As a novel solution, we have employed simple static analysis that examines the control flow to decide the requests on which Waitany or Waitsome must interleave. Our initial results [63] are encouraging. In our experience, a full interleaving exploration over all subsets of requests is in general wasteful; more sophisticated static analysis can help in building better verification tools for MPI.
• MPI and OpenMP: MPI+OpenMP mixes MPI with threads. Today's multicore architectures provide opportunities for higher parallelism, and the most efficient way to exploit that parallelism is by using threads. Future MPI programs will exploit this parallelism using well-known threading implementations like OpenMP, where each process can execute multiple threads and each thread can invoke MPI functions. These programs suffer both from traditional data races due to memory shared among threads and from MPI-related bugs, so debugging multithreaded MPI programs will be even more challenging. Consider multiple MPI processes, each with multiple threads, where all threads of one process issue a wildcard receive and threads of the other processes issue sends. A send can now match the wildcard receive of any thread; in effect, the send becomes a kind of wildcard send. Extending the algorithms in this dissertation to handle multiple threads would be good future work.
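The interleaving counts quoted for MPI_Waitany and MPI_Waitsome above follow from elementary combinatorics: calling MPI_Waitany repeatedly until all n requests complete can yield any of the n! completion orders, while a single MPI_Waitsome may report any nonempty subset of the n requests, giving 2^n − 1 outcomes. A quick check of both counts:

```python
from math import factorial
from itertools import combinations

def waitany_orders(n):
    # Completion orders when Waitany is called until all n requests finish.
    return factorial(n)

def waitsome_outcomes(n):
    # Nonempty subsets of n requests that a single Waitsome may report.
    return sum(1 for k in range(1, n + 1)
               for _ in combinations(range(n), k))

assert waitany_orders(3) == 6
assert waitsome_outcomes(3) == 2**3 - 1 == 7
```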
REFERENCES
[1] Aananthakrishnan, S., Delisi, M., Vakkalanka, S. S., Vo, A., Gopalakrishnan, G., Kirby, R. M., and Thakur, R. How formal dynamic verification tools facilitate novel concurrency visualizations. In EuroPVM/MPI (2009), pp. 261–270.
[2] Avrunin, G. S., Siegel, S. F., and Siegel, A. R. Finite-state verification for high performance computing. In Proceedings of the Second International Workshop on Software Engineering for High Performance Computing System Applications, St. Louis, Missouri, USA, May 15, 2005 (2005), P. M. Johnson, Ed., pp. 68–72.
[3] Ball, T., Cook, B., Levin, V., and Rajamani, S. K. SLAM and Static Driver Verifier: Technology transfer of formal methods inside Microsoft. In Proceedings of IFM 04: Integrated Formal Methods (April 2004), Springer, pp. 1–20.
[4] CHESS: Find and reproduce heisenbugs in concurrent programs. http://research.microsoft.com/en-us/projects/chess. Accessed 12/8/09.
[5] Clarke, E. M., Grumberg, O., and Peled, D. A. Model Checking. MIT Press, 2000.
[6] Concurrency education. http://www.cs.utah.edu/formal_verification/Concurrency_Education.
[7] Exascale computing study report. http://users.ece.gatech.edu/~mrichard/ExascaleComputingStudyReports/ECS_reports.htm. Accessed 12/16/09.
[8] Ferrante, J., and McKinley, K. S., Eds. Proceedings of the ACM SIGPLAN 2007 Conference on Programming Language Design and Implementation, San Diego, California, USA, June 10-13, 2007 (2007), ACM.
[9] Flanagan, C., and Godefroid, P. Dynamic partial-order reduction for model checking software. In POPL ’05: Proceedings of the 32nd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (2005), pp. 110–121.
[10] Godefroid, P. Partial-Order Methods for the Verification of Concurrent Systems - An Approach to the State-Explosion Problem, vol. 1032 of Lecture Notes in Computer Science. Springer, 1996.
[11] Godefroid, P. Model checking for programming languages using VeriSoft. In POPL 97: Principles of Programming Languages (1997), pp. 174–186.
[12] Godefroid, P., and Wolper, P. Using partial orders for the efficient verification of deadlock freedom and safety properties. Formal Methods in System Design 2, 2 (1993), 149–164.
[13] Gopalakrishnan, G. Practical formal verification of MPI and thread programs, 2009. Half-day tutorial, 23rd International Conference on Supercomputing, ICS 2009.
[14] Gopalakrishnan, G., and Kirby, R. M. Practical MPI and pthread dynamic verification, Nov. 2009. Half-day tutorial, 16th International Symposium on Formal Methods, FM 2009.
[15] Gopalakrishnan, G., and Kirby, R. M. Dynamic verification of message passing and threading, Jan. 2010. Half-day tutorial, 15th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, PPoPP 2010.
[16] Gopalakrishnan, G., Kirby, R. M., and Vo, A. Practical formal verification of MPI and thread programs, Sept. 2009. Full-day tutorial, EuroPVM/MPI 2009.
[17] Havelund, K., and Pressburger, T. Model checking Java programs using Java PathFinder. International Journal on Software Tools for Technology Transfer 2, 4 (Apr. 2000).
[18] Holzmann, G. J. The Spin Model Checker. Addison-Wesley, Boston, 2004.
[19] A modest proposal for petascale computing. http://www.hpcwire.com/blogs/17909359.html. Mentions energy costs of petascale machines.
[20] PE MPI buffer management for eager protocol. http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=/com.ibm.cluster.pe431.mpiprog.doc/am106_buff.html.
[21] III, J. W., and Bova, S. Where is the overlap? In Message Passing Interface Developer’s and User’s Conference (MPIDC) (1999).
[22] Intel Message Checker. http://www.intel.com/cd/software/products/asmo-na/eng/227074.htm.
[23] GEM - ISP Eclipse plugin. http://www.cs.utah.edu/formal_verification/ISP-Eclipse.
[24] ISP. http://www.cs.utah.edu/formal_verification/ISP_Release.
[25] Test results comparing ISP, Marmot, and mpirun. http://www.cs.utah.edu/fv/ISP_Tests.
[26] Krammer, B., Bidmon, K., Müller, M. S., and Resch, M. M. MARMOT: An MPI analysis and checking tool. In Parallel Computing 2003 (Sept. 2003).
[27] LAM/MPI parallel computing. http://www.lam-mpi.org/.
[28] Lastovetsky, A., Kechadi, T., and Dongarra, J., Eds. Recent Advances in Parallel Virtual Machine and Message Passing Interface, 15th European PVM/MPI User’s Group Meeting, Proceedings (2008), vol. 5205 of LNCS, Springer.
[29] Li, G., DeLisi, M., Gopalakrishnan, G., and Kirby, R. M. Formal specification of the MPI-2.0 standard in TLA+. In Principles and Practices of Parallel Programming (PPoPP) (2008), pp. 283–284.
[30] Slouching towards exascale: Programming models for high-performance computing. http://www.cs.utah.edu/ec2. Accessed 12/16/09.
[31] Manohar, R., and Martin, A. J. Slack elasticity in concurrent computing. In Proceedings of the Fourth International Conference on the Mathematics of Program Construction (1998), Springer-Verlag, pp. 272–285. Lecture Notes in Computer Science 1422.
[32] Matlin, O. S., Lusk, E. L., and McCune, W. Spinning parallel systems software. In Proceedings of the 9th International SPIN Workshop on Model Checking of Software (London, UK, 2002), Springer-Verlag, pp. 213–220.
[33] MPI 2.1 Standard. http://www.mpi-forum.org/docs/.
[34] MPICH2: High performance and widely portable MPI. http://www.mcs.anl.gov/mpi/mpich.
[35] Musuvathi, M., Park, D., Chou, A., Engler, D., and Dill, D. L. CMC: A pragmatic approach to model checking real code. In Proceedings of the Fifth Symposium on Operating System Design and Implementation (December 2002).
[36] Musuvathi, M., and Qadeer, S. Iterative context bounding for systematic testing of multithreaded programs. In Programming Languages Design and Implementation (PLDI) 2007 (2007), pp. 446–455.
[37] Open MPI: Open source high performance MPI. http://www.open-mpi.org/.
[38] Pacheco, P. Parallel Programming with MPI. Morgan Kaufmann, 1996. ISBN 1-55860-339-5.
[39] Palmer, R., Barrus, S., Yang, Y., Gopalakrishnan, G., and Kirby, R. M. Gauss: A framework for verifying scientific computing software. In Workshop on Software Model Checking (2005). Electronic Notes in Theoretical Computer Science (ENTCS), No. 953.
[40] Palmer, R., Delisi, M., Gopalakrishnan, G., and Kirby, R. M. An approach to formalization and analysis of message passing libraries. In Formal Methods for Industry Critical Systems (FMICS 2007) (2008), S. Leue and P. Merino, Eds., pp. 164–181. LNCS 4916.
[41] Palmer, R., Gopalakrishnan, G., and Kirby, R. M. Formal specification and verification using +CAL: An experience report. In Proceedings of Verify’06 (FLoC 2006) (2006).
[42] Palmer, R., Gopalakrishnan, G., and Kirby, R. M. Semantics driven dynamic partial-order reduction of MPI-based parallel programs. In Parallel and Distributed Systems: Testing and Debugging (PADTAD - V) (2007), pp. 43–53.
[43] ParMETIS - Parallel graph partitioning and fill-reducing matrix ordering. http://glaros.dtc.umn.edu/gkhome/metis/parmetis/overview.
[44] Pervez, S., Palmer, R., Gopalakrishnan, G., Kirby, R. M., Thakur, R., and Gropp, W. Practical model checking method for verifying correctness of MPI programs. In EuroPVM/MPI (2007), pp. 344–353. LNCS 4757.
[45] Quinlan, D., Vuduc, R., and Misherghi, G. Techniques for the specification of bug patterns. In Parallel and Distributed Systems: Testing and Debugging (PADTAD) (2007).
[46] Sharma, S., Vakkalanka, S., Gopalakrishnan, G., Kirby, R. M., Thakur, R., and Gropp, W. A formal approach to detect functionally irrelevant barriers in MPI programs. In Lastovetsky et al. [28].
[47] Sharma, S. V., Gopalakrishnan, G., and Kirby, R. M. A survey of MPI related debuggers and tools. Tech. Rep. UUCS-07-015, University of Utah, School of Computing, 2007. http://www.cs.utah.edu/research/techreports.shtml.
[48] Siegel, S. F. Efficient verification of halting properties for MPI programs with wildcard receives. In Verification, Model Checking, and Abstract Interpretation: 6th International Conference, VMCAI 2005, Paris, January 17–19, 2005, Proceedings (2005), R. Cousot, Ed., vol. 3385 of LNCS, pp. 413–429.
[49] Siegel, S. F. The MADRE web page. http://vsl.cis.udel.edu/madre, 2008.
[50] Siegel, S. F. The MPI-Spin web page. http://vsl.cis.udel.edu/mpi-spin, 2008.
[51] Siegel, S. F., and Avrunin, G. S. Verification of MPI-based software for scientific computation. In Model Checking Software: 11th International SPIN Workshop, Barcelona, Spain, April 1–3, 2004, Proceedings (2004), S. Graf and L. Mounier, Eds., vol. 2989 of LNCS, Springer-Verlag, pp. 286–303.
[52] Siegel, S. F., and Avrunin, G. S. Modeling wildcard-free MPI programs for verification. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (Chicago, IL, June 2005), pp. 95–106.
[53] Siegel, S. F., and Siegel, A. R. MADRE: The Memory-Aware Data Redistribution Engine. In Lastovetsky et al. [28].
[54] “Spin - Formal Verification” web site. http://spinroot.com, 2008.
[55] Stack Trace Analysis Tool. https://computing.llnl.gov/code/STAT.
[56] Strout, M. M., Kreaseck, B., and Hovland, P. D. Data-flow analysis for MPI programs. In International Conference on Parallel Programming (ICPP) (2006), pp. 175–184.
[57] TotalView concurrency tool. http://www.totalviewtech.com.
[58] Vakkalanka, S., DeLisi, M., Gopalakrishnan, G., and Kirby, R. M. Scheduling considerations for building dynamic verification tools for MPI. In Parallel and Distributed Systems - Testing and Debugging (PADTAD-VI) (Seattle, WA, July 2008).
[59] Vakkalanka, S., DeLisi, M., Gopalakrishnan, G., Kirby, R. M., Thakur, R., and Gropp, W. Implementing efficient dynamic formal verification methods for MPI programs. In Lastovetsky et al. [28].
[60] Vakkalanka, S., Gopalakrishnan, G., and Kirby, R. M. Dynamic verification of MPI programs with reductions in presence of split operations and relaxed orderings. In Computer Aided Verification (CAV 2008) (2008), pp. 66–79.
[61] Vakkalanka, S., Sharma, S. V., Gopalakrishnan, G., and Kirby, R. M. ISP: A tool for model checking MPI programs. In Principles and Practices of Parallel Programming (PPoPP) (2008), pp. 285–286.
[62] Vakkalanka, S., Vo, A., Gopalakrishnan, G., and Kirby, R. M. Reduced execution semantics of MPI: From theory to practice. In FM 2009 (Nov. 2009), pp. 724–740.
[63] Vakkalanka, S. S., Szubzda, G., Vo, A., Gopalakrishnan, G., Kirby, R. M., and Thakur, R. Static-analysis assisted dynamic verification of MPI Waitany programs (poster abstract). In PVM/MPI (2009), M. Ropo, J. Westerholm, and J. Dongarra, Eds., vol. 5759 of Lecture Notes in Computer Science, Springer, pp. 329–330.
[64] Vetter, J. S., and de Supinski, B. R. Dynamic software testing of MPI applications with Umpire. In Supercomputing ’00: Proceedings of the 2000 ACM/IEEE Conference on Supercomputing (CDROM) (2000), IEEE Computer Society. Article 51.
[65] Visser, W., Havelund, K., Brat, G., and Park, S. Model checking programs. In The Fifteenth IEEE International Conference on Automated Software Engineering (ASE’00) (Sept. 2000).
[66] Vo, A., Vakkalanka, S., DeLisi, M., Gopalakrishnan, G., Kirby, R. M., and Thakur, R. Formal verification of practical MPI programs. In Principles and Practices of Parallel Programming (PPoPP) (2009), pp. 261–269.
[67] Vo, A., Vakkalanka, S., Williams, J., Gopalakrishnan, G., Kirby, R. M., and Thakur, R. Sound and efficient dynamic verification of MPI programs with probe non-determinism. In EuroPVM/MPI (Sept. 2009), pp. 271–281.
[68] Vuduc, R., Schulz, M., Quinlan, D., de Supinski, B., and Saebjornsen, A. Improved distributed memory applications testing by message perturbation. In Parallel and Distributed Systems: Testing and Debugging (PADTAD - IV) (2006).
[69] Yang, Y., Chen, X., Gopalakrishnan, G., and Kirby, R. M. Efficient stateful dynamic partial order reduction. In SPIN ’08: Proceedings of the 15th International SPIN Workshop on Model Checking Software (2008), Lecture Notes in Computer Science, Springer.