EFFICIENT DYNAMIC VERIFICATION
ALGORITHMS FOR MPI
APPLICATIONS
by
Sarvani Vakkalanka
A dissertation submitted to the faculty of The University of Utah
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy
in
Computer Science
School of Computing
The University of Utah
August 2010
Copyright © Sarvani Vakkalanka 2010
All Rights Reserved
The University of Utah Graduate School
STATEMENT OF DISSERTATION APPROVAL
The dissertation of
has been approved by the following supervisory committee members:
, Chair Date Approved
, Member
Date Approved
, Member
Date Approved
, Member
Date Approved
, Member
Date Approved
and by , Chair of
the Department of
and by Charles A. Wight, Dean of The Graduate School.
ABSTRACT
The Message Passing Interface (MPI) Application Programming Interface (API)
is widely used in almost all high performance computing applications. Yet,
conventional debugging tools for MPI suffer from two serious drawbacks: they
cannot prevent the exponentially growing number of redundant schedules from
being explored; and they cannot prevent the processes from being locked into a
small subset of schedules, so that the potentially buggy schedules are often reached
only when programs are ported to new platforms.
Dynamic verification methods are the natural choice for debugging real world
MPI programs when model extraction and maintenance are expensive. While many
dynamic verification tools exist for verifying shared memory programs, there are no
corresponding tools that support MPI – the lingua franca of parallel programming.
While interleaving reduction suggests the use of dynamic partial order reduction
(DPOR), four aspects of MPI make previous DPOR algorithms inapplicable: (i)
MPI contains asynchronous calls that can complete out of program order; (ii)
MPI has global synchronization operations that have weak semantics; (iii) the
runtime of MPI cannot, without intrusive modifications, be forced to pursue a
specific interleaving with nondeterministic wildcard receives; and (iv) the progress
of MPI operations can depend on platform-dependent runtime buffering, making
bugs sometimes appear when resources are added to boost performance. This
dissertation provides a formal model for MPI, and introduces a tailor-made no-
tion of Happens-Before ordering for MPI functions. The crucial feature of this
Happens-Before relation is that it elegantly solves all these four problems. MPI
dynamic analysis is turned into a prioritized scheduling algorithm respecting MPI’s
Happens-Before.
This dissertation contributes three algorithms that have been demonstrated
in the context of a practical MPI dynamic verification tool called In-Situ Partial
order (ISP). The Partial Order avoiding Elusive Interleavings (POE) algorithm is
a simple prioritized execution of the MPI transitions and is guaranteed to find
all deadlocks, assertion violations and resource leaks under zero buffering. The
POEOPT algorithm avoids many of the redundant interleavings of POE by fully
exploiting MPI’s Happens-Before. Finally, the POEMSE algorithm discovers all
minimal runtime bufferings needed to expose bugs. POEMSE’s
slack analysis has minimal overheads, and offers the power of verifying for safe
portability by considering all relevant bufferings that might exist in various plat-
forms. In effect, a program is dynamically verified not just with respect to the
platform on which the tool is run, but also with respect to all platforms.
To
Surya and Siri
CONTENTS
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
CHAPTERS
1. INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Specifics of this Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 Dissertation Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Message Passing Interface (MPI) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 MPI Program Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4.1 Necessity of DPOR for MPI . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4.2 MPI Formal Semantics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4.3 POE Dynamic Verification Algorithm . . . . . . . . . . . . . . . . . . . . 8
1.4.4 POEOPT Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4.5 POEMSE Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4.6 The ISP Tool . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.5 Impact of This Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2. BACKGROUND . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1 Message Passing Interface (MPI) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.1 MPI Isend . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.2 MPI Irecv . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.3 MPI Wait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.4 MPI Barrier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.1.5 MPI Ordering Guarantees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2 Dynamic Partial Order Reduction (DPOR) . . . . . . . . . . . . . . . . . . . . 20
2.2.1 DPOR Illustration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3 Applying DPOR to MPI: Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3. MPI FORMAL MODEL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.1 Formal Transition System for MPI . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.1.1 State Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.1.2 The State of an MPI Execution . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2 MPI Transition System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2.1 Process Transitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2.2 MPI Runtime Book-keeping Sets . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.3 MPI Runtime Transitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2.4 Conditional Matches-before . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2.5 Dynamic Instruction Rewriting . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2.6 One Transition or Multiple? . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2.7 Dependent Transition Group . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2.8 Selectors and Useful Predicates . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.3 Illustration of the Formal Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.4 Applying DPOR to MPI Transition System . . . . . . . . . . . . . . . . . . . 42
4. THE POE ALGORITHM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.1 MPI Transition Dependence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2 The POE Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2.1 Persistent Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2.2 Persistent Sets and MPI Program Correctness . . . . . . . . . . . . . 47
4.2.3 POE Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.3 Illustration of POE Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.4 Issues with POE Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.4.1 Redundant Interleavings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.4.2 POE and Buffered Sends . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5. POE AND REDUNDANT INTERLEAVINGS . . . . . . . . . . . . . . 57
5.1 POE and Redundant Interleavings . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.2 InterHB and Co-enabledness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.3 POE Algorithm Modified . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
6. DETERMINISTIC MPI PROGRAMS . . . . . . . . . . . . . . . . . . . . . 69
6.1 Deterministic MPI Programs and HB . . . . . . . . . . . . . . . . . . . . . . . . 69
7. HANDLING SLACK IN MPI PROGRAMS . . . . . . . . . . . . . . . . 73
7.1 Verification for Portability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
7.2 Introduction to Slack Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
7.2.1 Zero Buffering Can Miss Deadlocks . . . . . . . . . . . . . . . . . . . . . . 76
7.2.2 Too Much Buffering Can Miss Deadlocks . . . . . . . . . . . . . . . . . 77
7.3 Using HB to Detect Slack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
7.3.1 HB Graph and Paths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
7.4 Finding Minimal Wait Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
7.5 POEMSE Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
8. EXTENSIONS TO THE FORMAL MODEL . . . . . . . . . . . . . . . . 89
8.1 Handling More MPI Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
8.1.1 MPI Send and MPI Recv . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
8.1.2 MPI Waitall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
8.2 Communicators and Tags . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
8.2.1 Extensions to the Formal Model . . . . . . . . . . . . . . . . . . . . . . . . 94
9. ISP: A PRACTICAL DYNAMIC MPI VERIFIER . . . . . . . . . . 96
9.1 ISP Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
9.1.1 The Profiler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
9.1.2 The ISP Scheduler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
9.2 ISP: Implementation Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
9.2.1 Out-of-Order Issue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
9.2.2 Scheduling MPI Waitany . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
9.2.3 Buffering Sends . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
9.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
10. CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
10.1 Suggestions for Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
LIST OF FIGURES
1.1 Example MPI Program . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 GEM Front-end . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1 MPI Ordering Guarantees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2 Example Thread Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3 DPOR Illustration: Initial Interleaving . . . . . . . . . . . . . . . . . . . . . . . . 24
2.4 DPOR Illustration: Updating Backtrack Set . . . . . . . . . . . . . . . . . . . . 25
2.5 Simple MPI Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.6 Illustration of Surprising MPI Runtime Behavior with DPOR . . . . . . . 27
2.7 Crooked Barrier Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.1 Simple MPI Program . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2 Execution of Figure 3.1 with MPI Transitions . . . . . . . . . . . . . . . . . . . 41
3.3 MPI Execution with a Deadlock . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4 MPI Execution of Figure 3.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.1 Pseudocode for POE Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2 Pseudocode for GetTransition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.3 Pseudocode for UpdateBacktrack . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.4 Pseudocode for GenerateInterleaving . . . . . . . . . . . . . . . . . . . . . 50
4.5 Crooked Barrier Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.6 POE Interleaving 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.7 POE Interleaving 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.8 Redundant POE Interleavings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.9 POE and Persistent Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.10 Buffering Sends and POE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.1 Redundant POE Interleavings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.2 POE and Persistent Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.3 Simple Optimization and Redundancy . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.4 InterHB Relation Across Match-sets . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.5 HB Relation for Figure 5.3 Shown as Graph . . . . . . . . . . . . . . . . . . . . 64
5.6 Redundancy with New POE Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 64
5.7 Pseudocode for POEOPT Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.8 Pseudocode for GetBacktrack . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.9 Pseudocode for UpdateBacktrack . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.10 Pseudocode for AddtoBacktrack . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.11 Pseudocode for GenerateInterleaving . . . . . . . . . . . . . . . . . . . . . 67
7.1 Buffering Sends and Deadlocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
7.2 Specific Buffering Needed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
7.3 Path Breaking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
7.4 Example Formulas and GHB graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
7.5 Algorithm to Find Minimal Wait Sets . . . . . . . . . . . . . . . . . . . . . . . . . 83
7.6 Pseudocode for UpdateBacktrack . . . . . . . . . . . . . . . . . . . . . . . . . 84
7.7 Pseudocode for GetTransition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
7.8 Pseudocode for AddSlacktoBacktrack . . . . . . . . . . . . . . . . . . . . . 85
9.1 ISP Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
LIST OF TABLES
9.1 Comparison of POE with Marmot . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
9.2 Results for POE and POEOPT on MADRE . . . . . . . . . . . . . . . . . . . . . 103
9.3 Results for POE and POEMSE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
ACKNOWLEDGMENTS
This dissertation would not be complete without the help and support of faculty,
friends and family. I had the good fortune of meeting some of the smartest as well
as the most warm-hearted people during my graduate studies at the University of
Utah. The foremost among them is my advisor, Prof. Ganesh Gopalakrishnan. As
I was hunting in the department for a good advisor, it was universally acknowledged
by many graduate students that Prof. Ganesh is one of the best advisors in the
department. During my PhD studies, I also came to understand that he was more
than a good advisor. He is one of the most humble and warm-hearted persons and
a friend to all the students in the Gauss group. I would like to thank him for all
the opportunities, exposure, guidance and support he provided me right from the
very beginning of my PhD studies under him.
I would also like to thank Prof. Mike Kirby, who is my co-advisor, for his
support and encouragement. This dissertation would not be in its current form
without the valuable suggestions from Prof. Suresh Venkatasubramanian, Prof.
Matt Might and Prof. Stephen Siegel, who were also members of my dissertation
committee. I thank my dissertation committee from the bottom of my heart for
their understanding and support during the last year of my PhD when I was going
through some medical complications.
The best part of my PhD studies was being associated with the smart graduate
and undergraduate students in the Gauss group. Each person in the group is
special and I think I became a better person by my association with them. I have
to especially thank Subodh Sharma, who never refused to help me in any way he
could. I can never repay him for the rides to and from the Salt Lake City airport,
whether it was day or night. I also thank Anh Vo, Michael Delisi, Geof Sawaya,
Sriram Ananthakrishnan and Guodong Li for their help and input on my research
during the entire course of my PhD studies.
I would like to thank Rajeev Thakur of Argonne National Labs and Bronis de
Supinski of Lawrence Livermore National Labs for their support. Our tool ISP
would not be as successful without their intelligent and expert input.
This dissertation would not have reached the thesis editor without the help of
Karen Feinauer. I will always remember her as a person with a warm, welcoming
smile. My deepest gratitude goes to Karen for all the help with scanning the
corrections. This dissertation would literally not have seen the light of day without
her help.
There were times during the last 3.5 years when I was impossible to live with.
The ups and downs of research and my corresponding mood swings were felt more
by my family than anyone else. The person who took the brunt of all this is
my husband Surya who jumped with joy for me when I was successful and also
encouraged me when things did not go so well. He is the pillar who provided me
with immense support through some of the toughest times in my life. I know that
saying a mere “Thank You” is not sufficient. I only hope that I can be as good a
friend as he has been.
A decade ago, I would not have even dreamed that I was capable of doing a PhD.
My only goal was to finish my undergraduate studies and start on a well-paying job
to support my family financially. It was my twin sister Sridevi who showed me the
way to study while providing financial support. She is one of the most courageous
women who achieves her goals through sheer determination. It was through her
encouragement that I moved to the US for a PhD. I think there is no better place
to tell her that I love her and that my life would be incomplete without her presence
right from the day we were born.
CHAPTER 1
INTRODUCTION
It is no exaggeration to say that computer software already governs everything
we do as a human society. All computer software – regardless of its purpose –
must be correct as well as efficient. What differentiates various types of software
is the price we are willing to pay for achieving these goals, and for whom these
goals ultimately matter. Clearly, inefficient software (not producing results on
time; consuming excessive amounts of energy, etc.) is also “buggy.” We believe
in allowing humans to make such efficiency-related decisions, and focus on helping
them ensure the functional correctness of their designs (for now, “correct” can be
taken to mean “free of assertion violations and deadlocks”).
This dissertation is focused on correctness issues that arise in software that
underlies large-scale scientific simulations. Such simulations are responsible for
virtually all the high performance computing (HPC) simulation experiments that
scientists and engineers perform on an essentially unlimited class of problems (weather
modeling, earthquake prediction, safety of nuclear stockpiles, drug discovery, testing
nascent theories in Physics, to name a few). Our goal is to contribute tools that
practitioners in HPC can employ in their day-to-day work to ensure that their HPC
simulation programs are correct.
Day-to-day software development in HPC is still an arduous process, often
relying on primitive debugging methods such as “printf debugging.” Modern
commercial tools in this area (e.g., TotalView [57], STAT [55], etc.) are extremely
helpful for debugging errors after a crash has been recorded. However, these tools
have no analytical power that lets them study a piece of software over the fewest
number of concurrent interleavings or data inputs, and locate bugs with formal
assurance. They rely on human ingenuity for test data input selection – known
to be unreliable and nonscalable. They rely on the concurrency schedules that
naturally occur in the test environment for concurrency coverage – known to be very
inadequate even from simple studies [68]. Future HPC software will be far more
complex, employing, for example, innovative techniques for energy management
and load balancing. All these additions to the inherent complexity of the core
software will overwhelm even the best available methods.
The HPC community – composed of scientists and engineers who do not necessarily
have a computer science background – has expressed that today’s available
methods are incapable of providing the required levels of correctness. The 2009
ExaScale Software Study [7] points out the sheer complexity of Extreme Scale
computing system designs, which will witness an increased use of different system
components all the way from core-to-core communication protocols to middleware
that manages multiproblem integration. This study asserts, "...Handling such
components in a seamless way and allowing programmers to pursue efficiency
while still providing multiple safety nets are all open challenges, needing the use
of formal methods." In his recent talk entitled Slouching Towards Exascale:
Programming Models for High Performance Computing [30], Lusk observes,
"Formal methods provide the only truly scalable approach to developing correct
code in this complex [Exascale] programming environment." Such statements are
easily justified considering the economic and opportunity costs of errant HPC
simulations. For
example, today’s Petascale system installations can cost millions of dollars just in
energy costs [19].
1.1 Specifics of this Dissertation
This dissertation aims to develop practical concurrency verifiers based on formal
principles that will ensure that High Performance Computing (HPC) programs
written using the Message Passing Interface (MPI) library are free of egregious
and costly errors. We choose MPI because of its dominant position in HPC.
The importance of MPI is well known; it is employed in virtually all scientific
explorations requiring parallelism, such as weather simulation, medical imaging,
and earthquake modeling that are run on expensive high performance computing
clusters.
We want our verifiers to be:
- nonobtrusive, allowing designers to focus on problem solving;
- reliable, by scaling well;
- widely usable, by directly working on the designers’ programs (not requiring
models of these programs).
1.1.1 Dissertation Statement
Dynamic formal verification methods incorporating innovative partial order
reduction methods can help develop nonobtrusive, reliable, and widely usable tools
for MPI programs. Such tools can not only verify a given program with respect
to a given platform (machine, runtime) but also reliably predict and flag errors
pertaining to scheduling and buffering variations across all platforms.
1.2 Message Passing Interface (MPI)
The MPI standard [33] is an informal document that provides English descrip-
tions of the individual behaviors of about 300 MPI operations. There are several
popular MPI library implementations [34, 37, 27]. Typical MPI programs are
C/C++/Fortran programs that create a fixed number of processes at inception.
These processes then perform computations in their private stores, invoking various
MPI operations in the MPI library to synchronize and exchange data. MPI supports
the SPMD-based programming model. An example MPI program written in C is
shown in Figure 1.1.
All function calls of the form MPI_XXXX are calls into the MPI library. The MPI
program is executed with the number of processes as a command line input which is
passed as a parameter to MPI_Init (line 12). The MPI_Init library call will create
 1: #include <stdio.h>
 2: #define buf_size 128
 3: int main (int argc, char **argv) {
 4:   int nprocs = -1;
 5:   int rank = -1;
 6:   char processor_name[128];
 7:   int namelen = 128;
 8:   int buf0[buf_size];
 9:   int buf1[buf_size];
10:   MPI_Status status;
11:   /* init */
12:   MPI_Init (&argc, &argv);
13:   MPI_Comm_size (MPI_COMM_WORLD, &nprocs);
14:   MPI_Comm_rank (MPI_COMM_WORLD, &rank);
15:   MPI_Get_processor_name (processor_name, &namelen);
16:   printf ("(%d) is alive on %s\n", rank, processor_name);
17:   fflush (stdout);
18:   MPI_Barrier (MPI_COMM_WORLD);
19:   if (nprocs < 2) {
20:     printf ("not enough tasks\n");
21:   } else if (rank == 0) {
22:     MPI_Recv (buf1, buf_size, MPI_INT,
23:               MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &status);
24:     MPI_Recv (buf0, buf_size, MPI_INT,
25:               1, 0, MPI_COMM_WORLD, &status);
26:   } else if (rank == 1) {
27:     memset (buf0, 0, buf_size);
28:     memset (buf1, 1, buf_size);
29:     MPI_Send (buf0, buf_size, MPI_INT, 0, 0, MPI_COMM_WORLD);
30:     MPI_Send (buf1, buf_size, MPI_INT, 0, 0, MPI_COMM_WORLD);
31:   }
32:   MPI_Barrier (MPI_COMM_WORLD);
33:   MPI_Finalize ();
34:   printf ("(%d) Finished normally\n", rank);
35: }
Figure 1.1. Example MPI Program
the requested number of processes, and each process starts executing the
same program code immediately following MPI_Init. Every process is provided a
unique process rank in the range [0 . . . n − 1] where n is the number of processes
provided as the input. Every process begins executing MPI library calls with an
MPI_Init and finishes executing MPI library calls by calling MPI_Finalize.
After executing some MPI library calls (line 13–17), all the processes synchronize
at MPI_Barrier call (line 18). Every process blocks at MPI_Barrier until all other
processes execute the call to MPI_Barrier. Execution then branches based
on the process rank of the executing process. The process with rank 0 executes
MPI_Recv operations (lines 21–25) while the process with rank 1 executes MPI_Send
operations (lines 26–31). The rest of the processes with a rank greater than 1 will
block at MPI_Barrier (line 32) until process ranks 0 and 1 eventually execute an
MPI_Barrier. All processes finally execute MPI_Finalize and terminate.
1.3 MPI Program Verification
MPI programs can have many kinds of bugs [64] that can be very hard to
find with traditional debugging techniques. Common approaches for debugging
MPI programs include explicit modifications to the source code, message tracing,
and visualization. Programmers typically go through a number of debugging or
testing iterations before a bug is fixed. This iterative analysis and debugging is
time consuming, error prone and complicated, especially if the messages induce
nondeterministic behaviors.
The MPI bugs that arise due to nondeterministic communication races are
among the most difficult to debug. The programmer must be able to enumerate
all possible nondeterministic execution scenarios and test each of them for
various possible bugs. Such manual testing of various possible execution scenarios
is usually impractical for large applications. Testing tools [26, 64] are capable
of testing certain execution scenarios but do not guarantee coverage. Given the
complexity of MPI applications and the difficulty of debugging them, we are convinced
that there is a need for verification tools for MPI programs. Verification tools
usually employ well-known verification algorithms that guarantee coverage and
hence bug detection.
There are two popular forms of formal verification: model-based verification
and dynamic verification. Model-based verification [54, 50] tools usually
require programmers to build a model of their application in a different language
and then verify their model against various properties. Model-based verification
will only help the programmer debug and guarantee bug freedom in the model
but not in the actual program. Also, building a model for a large and complex
MPI program itself can be time consuming. A model-based verification tool called
MPI-SPIN [50] is presently available for MPI.
Dynamic verification tools [4, 11, 24] take as their input the user code provided
with a test harness. Then, using customized scheduling algorithms, they enforce specific
classes of concurrent schedules to occur. Such schedules are effective in hunting
down bugs and are often sufficient to provide important formal coverage guaran-
tees. Dynamic verification tools almost always employ techniques such as dynamic
partial order reduction (DPOR) [69, 9, 17], bounded preemption searching [4, 36],
or combinations of DPOR and symmetry reduction [69] to prevent redundant
state/interleaving explorations. While many such tools exist for verifying shared
memory programs, there is a noticeable dearth of dynamic verification tools sup-
porting the scientific programming community that employs the Message Passing
Interface (MPI).
Though model-based verification tools are popular for their guaranteed coverage
for all inputs, we believe that dynamic verification provides a more practical solution
for MPI programs. Most programs are not input-centric and any specific inputs
are usually handled separately in a different code path. It is usually sufficient to
run the dynamic verification tools with possible input test harnesses and get the
required coverage. Additionally, dynamic verification tools are very easy to use
with little to no programmer effort.
This dissertation presents novel dynamic verification algorithms for MPI that
have been implemented in a tool called ISP, which stands for In-Situ Partial
order.
1.4 Contributions
1.4.1 Necessity of DPOR for MPI
Our first contribution (described in Chapter 2) provides reasons why a new
dynamic partial order reduction algorithm for MPI is necessary. We show through
illustrations that a direct application of the classical DPOR algorithm does not work
for MPI programs. This also forms the motivation behind the algorithms developed
in this dissertation.
1.4.2 MPI Formal Semantics
Our next contribution is a simple and intuitive formal semantics for MPI (de-
scribed in Chapter 3). We provide formal transition semantics for four MPI func-
tions: namely, MPI_Irecv, MPI_Isend, MPI_Barrier, and MPI_Wait. The transi-
tion semantics are divided into two parts called the Process transitions and Runtime
transitions. This division among semantics follows directly from the fact that the
MPI program execution environment consists of the MPI processes that execute
the program code and an MPI runtime daemon that serves these processes. The
MPI runtime contains the library that implements the MPI standard. Processes
issue MPI function calls into the MPI runtime. The MPI runtime is responsible
for the actual execution of the MPI operations issued by the processes according
to the standard.
Our formal transition model is constructed from our experience in building a
formal TLA+ model for MPI [40, 42] and reading the MPI standard. The formal
transition model has embedded within it the ordering guarantees among the MPI
operations described by the MPI standard. We call these ordering guarantees
IntraHB (Intra-Happens-Before) ordering since the ordering is described only for
MPI operations within a process. Our MPI verification tool ISP implements the
runtime transitions of the formal model. The formal model has been extended to
60 MPI operations, all of which are implemented by our verification tool ISP.
1.4.3 POE Dynamic Verification Algorithm
POE stands for Partial Order under Elusive interleavings. The POE algorithm
(described in Chapter 4) is a prioritized execution of the MPI transitions in the
formal model. The prioritized execution allows the discovery of full nondeterminism
in an MPI program. However, the POE algorithm can only generate interleavings
when the MPI sends are not provided any buffering by the MPI runtime. Also,
the POE algorithm can generate a large number of redundant interleavings which
can unnecessarily increase the verification time. Our tool ISP implements the POE
algorithm and has verified a number of small as well as large MPI programs.
1.4.4 POEOPT Algorithm
The POEOPT algorithm is an optimized POE algorithm (described in Chap-
ter 5) that attempts to reduce the redundant interleavings generated by the POE
algorithm. We found that the IntraHB relation among the MPI operations does
not provide the information required to eliminate the redundant interleavings. We
extend the IntraHB relation with the InterHB (Inter-Happens-Before) relation that
is derived from the formal MPI transitions system and IntraHB relation of MPI
operations. We use both the IntraHB and InterHB analysis of an MPI program
execution to extend the POE algorithm to the POEOPT algorithm.
1.4.5 POEMSE Algorithm
MPI programs exhibit slack inelastic behavior [31]. That is, a program can
exhibit new behaviors, deadlocking or entering an erroneous state, when more
slack (i.e., buffering) is provided to the MPI_Isend operations
by the runtime. The MPI_Isend operation sends a message to another process
that receives this message. The messages are copied from the memory space of
the process sending the message to the memory space of the process receiving
the message. However, it is possible for the MPI runtime to provide buffer space
to the messages. In this case, the message being sent is copied into the runtime
provided buffer even when there is no process to receive that message. The buffering
availability for a process depends on the current runtime buffer usage by the
process and a configuration parameter called eager limit. The eager limit is usually
configured into the MPI runtime and is purely a decision of the MPI library
implementation. The MPI standard does not specify any rules or
guidelines on the eager limit.
The buffer availability for a message is a dynamic property. A program can show
two different behaviors when executed with two different libraries. One solution is
to buffer all the sends to help discover the new behaviors due to slack. However, the
send operations themselves can contribute to deadlocks when they are not buffered.
Buffering all the sends would hence miss these deadlock behaviors.
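This buffering sensitivity can be sketched with a toy message-passing model, written in plain Python for this illustration only (the function and names are our own; this is not MPI code): two processes that each send to the other before receiving deadlock under rendezvous (unbuffered) sends, but run to completion when the runtime buffers the sends.

```python
def runs_to_completion(procs, buffered):
    """procs[i] is process i's program: a list of ("send", dest) / ("recv", src).
    Returns True if every process finishes, False if the system deadlocks."""
    pc = [0] * len(procs)          # per-process program counters
    inflight = []                  # messages copied into the runtime buffer
    while True:
        moved = False
        for p, ops in enumerate(procs):
            if pc[p] >= len(ops):
                continue
            kind, peer = ops[pc[p]]
            if kind == "send":
                if buffered:       # eager send: copy out and return immediately
                    inflight.append((p, peer))
                    pc[p] += 1
                    moved = True
                else:              # rendezvous: blocked until the peer is at a recv
                    q = procs[peer]
                    if pc[peer] < len(q) and q[pc[peer]] == ("recv", p):
                        pc[p] += 1
                        pc[peer] += 1
                        moved = True
            elif kind == "recv" and (peer, p) in inflight:
                inflight.remove((peer, p))
                pc[p] += 1
                moved = True
        if not moved:              # no process can take a step
            return all(pc[p] == len(procs[p]) for p in range(len(procs)))

# Head-to-head exchange: deadlocks without buffering, succeeds with it.
head_to_head = [[("send", 1), ("recv", 1)], [("send", 0), ("recv", 0)]]
```

In this model the same program is correct or deadlocked depending solely on whether the runtime chose to buffer the sends, which is exactly why the send-buffering decision must be part of the verification search.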
A brute-force approach is to verify the program under every combination of
buffered and unbuffered sends. This is prohibitively expensive since programs
in general contain a large number of sends. The POEMSE algorithm (described in
Chapter 7) uses
the InterHB and IntraHB analysis to find the minimal send sets that must be
buffered in order to find the new behaviors due to buffering. In addition to this, it
also finds all the minimal sets of sends to be buffered that can cause a new behavior
and generates an interleaving for every such minimal set. Our experimental results
show that the POEMSE algorithm performs well in practice and its verification
time is only slightly more than that of the POE algorithm.
1.4.6 The ISP Tool
The above algorithms are nonintrusive and reliable, and can be used as the
basis for creating widely usable tools. These facts are clearly brought out through
the ISP tool that is also one of the major contributions of this dissertation. The
implementation of ISP is described in Chapter 8.
The impact that our work has had is now described in the next section.
1.5 Impact of This Dissertation
• Our publications include [60, 59, 46, 58, 66, 62, 1, 67].
• We have built and released a dynamic formal verification tool for MPI C
programs called In-Situ Partial Order Analysis (ISP [24]). While several
students have contributed toward ISP (for which we are very grateful), this
dissertation provided the core ideas as well as much of the implementation.
There have been over 200 worldwide downloads of ISP.
• We held half-day tutorials featuring ISP at ICS (May’09, [13]), FM09 (Nov’09,
[14]), and PPoPP (Jan’10, [15]).
• We presented a full-day invited tutorial featuring ISP at EuroPVM/MPI
(Sep’09, [16]).
• Our NSF REU undergraduates Alan Humphrey and Chris Derrick built the
Graphical Explorer for Message Passing (GEM [23]). Figure 1.2 presents a
snapshot of GEM’s user interface. GEM was officially accepted as part of the
PTP 3.0 version in 12/09.
• Our NSF REU undergraduates Sawaya and Atzeni built a Concurrency Education website [6]
containing all examples from a popular MPI textbook [38] for teaching MPI
using ISP.
1.6 Related Work
The area of formal verification has been successfully applied to many applica-
tions that include applications in telecommunication software design (e.g., [18]),
aerospace software (e.g., [65]), device driver design (e.g., [3]) and operating sys-
tem kernels (e.g., [35]). The use of formal methods for HPC software design,
and in particular to MPI-based parallel/distributed program design, has found an
increasing level of activity in the recent years. The earliest use of model checking
in this area is by Matlin et al. [32], who used the SPIN model checker [18] to
Figure 1.2. GEM Front-end
verify parts of the MPD process manager. Subsequently, Siegel and Avrunin used
model checking to verify MPI programs that employ a limited set of two-sided MPI
communication primitives [51]. Siegel subsequently published several techniques
for efficiently analyzing MPI programs [52, 48, 2].
Siegel provides a state model for MPI programs and describes how the state
model is incorporated into MPI SPIN [52, 48, 2]. Deadlock properties for de-
terministic programs when the programs have no wildcard receives are proved in
[52]. Siegel later proposed the “Urgent” algorithm to check for deadlocks in MPI
programs with wildcard receives [48].
The “Urgent” algorithm is defined only for blocking (synchronous) mode receives,
and it is not clear how the algorithm can be extended to nonblocking mode receives.
Also, the “Urgent” algorithm does an exponential search on all the sends for every
buffering possibility, which can be expensive.
Some of the earlier publications of the “Utah Verification” group in this area
pertained to the use of model checking to analyze MPI programs [39, 41], an
executable formal specification of MPI [40, 29] and an efficient model checking
algorithm for MPI [42]. One difficulty in model checking is the need to create an
accurate model of the program being verified. This step is tedious and error prone.
If the model itself is not accurate, the verification will not be accurate. To avoid
this problem, an in-situ model checker ISP was first developed in [44] which dealt
with MPI one-sided communication. Techniques to enhance the efficiency of this
algorithm were reported in [61]. Our recent work [60, 59, 58, 66, 62] introduces
the POE algorithm which is implemented into our tool ISP. We also employed the
POE algorithm for detecting the presence of functionally irrelevant barriers in MPI
programs [46].
Other research groups have approached the formal verification of MPI programs
through schedule perturbation [68], data flow analysis [56] and by detecting bug
patterns [45]. A survey of MPI-related tools and debuggers can be found in [47].
CHAPTER 2
BACKGROUND
This chapter provides a basic introduction to Message Passing Interface (MPI)
along with a detailed description of a small set of MPI functions in Section 2.1.
A detailed description of the classic Dynamic Partial Order Reduction (DPOR)
is provided in Section 2.2. Section 2.3 describes various issues that arise when
classical DPOR is directly applied to MPI. The initial impetus for our work was
the inapplicability of classical DPOR to MPI. This led to a thorough formalization
of MPI and a new understanding of how to handle many aspects of MPI based
on a single unifying formalism: a new Happens-Before order for MPI. Using the
techniques in this dissertation, we can analyze MPI programs that allow out of
order message matching, have collective operations, and whose behavior can alter
significantly in a resource-dependent manner.
2.1 Message Passing Interface (MPI)
This section provides a basic introduction to MPI and four MPI functions:
MPI_Isend, MPI_Irecv, MPI_Wait, MPI_Barrier in English. The reason we chose
these functions is that a thorough understanding of over 60 MPI functions can
be obtained by studying just these four functions. This section is not intended
to replace the MPI standard and the readers are encouraged to read the MPI
standard [33] for a more extensive introduction to the above functions. Most of
the dissertation only deals with these four MPI functions to keep the formal model
simple. The formal model can be easily extended to handle more MPI functions
(Chapter 8). ISP, the tool that implements our formal model, handles over 60 of the
most frequently used MPI functions. A formal notation for the MPI functions introduced
in this section is provided in Section 3.1.
Most MPI programs have two or more processes communicating through MPI
functions. All the processes have MPI process ids called ranks ∈ N0 = {0, 1, . . .}
that range from 0 . . . n − 1 for n processes. In addition to the processes, the
MPI execution environment also has an MPI daemon process which we call MPI
Runtime. The MPI library that implements the MPI standard is a part of the
MPI runtime. The processes issue the MPI functions into the MPI runtime. By
“issue” we mean that the MPI function call is invoked by the MPI process. The
MPI runtime keeps track of the MPI functions issued by the processes, matches
the MPI functions across processes and transfers data across processes according
to the MPI standard. The MPI runtime hence forms the critical component of the
MPI execution environment.
Every MPI process starts execution by issuing MPI_Init (argc, argv). A
process cannot issue any other MPI function unless it issues an MPI_Init. Every
MPI process that issues an MPI_Init must also issue an MPI_Finalize eventually.
No further MPI functions can be issued by a process once it issues an MPI_Finalize
except for MPI_Finalized which checks if an MPI_Finalize has been invoked.
MPI_Finalized is a local process action and we ignore any MPI_Finalized call
executed by a program. We assume that all the examples provided
in this dissertation always implicitly issue an MPI_Init at the beginning and an
MPI_Finalize at the end of an execution and do not explicitly show them in any
of the examples. We also assume that all processes are single threaded.
Every MPI function will attain the following states during its lifetime in the
MPI runtime:
• issued : The MPI function has been issued into the MPI runtime.
• returned : The MPI function call has returned and the process that issued
the function can continue executing.
• matched: Since most MPI functions usually work in a group (for example, an
MPI_Isend from one process will be matched with a corresponding MPI_Irecv
from another process), an MPI function is considered matched when the MPI
runtime is able to match various MPI functions into a group which we call a
match-set. All the MPI function calls in the match-set will be considered as
having attained the matched state.
• complete: An MPI function can be considered to be complete by the MPI
process that issued the MPI function when all visible memory effects have
occurred (e.g., when an MPI_Isend is buffered by the MPI runtime, the
MPI_Isend can be considered as complete when the message buffer has been
copied out from the process memory space into the runtime memory space).
We adapt the “complete” state from the MPI standard which applies for
MPI_Isend and MPI_Irecv and extend it to MPI_Wait and MPI_Barrier
trivially to keep the state model consistent.
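One way to make the intent of these states concrete is a small checker, a Python sketch of our own (not part of ISP or the MPI standard), that validates an event history for a single unbuffered send against the lifecycle above: the function must be issued before anything else, and an unbuffered send can complete only after it has matched.

```python
def legal_unbuffered_send_history(events):
    """Check a sequence of lifecycle events ("issued", "returned", "matched",
    "complete") for one unbuffered MPI_Isend against our toy state model:
    "issued" must come first, and "complete" requires a prior "matched"."""
    seen = set()
    for e in events:
        if e != "issued" and "issued" not in seen:
            return False        # nothing may precede the issue
        if e == "complete" and "matched" not in seen:
            return False        # no buffering: completion needs a match
        seen.add(e)
    return True
```

A buffered send would relax the second condition, since the message can be copied into runtime buffer space before any match occurs.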
The semantics of an MPI program are determined by the order in which sends
and receives are allowed to match. Matching does not imply that data transfer
has occurred; it is simply a commitment on part of sends and receives to ‘pair
with each other.’ Completion is mainly of local significance. An MPI_Isend that
completes allows the MPI_Wait operation waiting on it to return. This allows later
operations (coming after MPI_Wait in program order) to begin matching. This
is how completion indirectly affects matching – a crucial aspect of MPI behavior
that gets modulated by the amount of runtime buffering. More specifically, early
completion is possible in buffer resource endowed systems that provide higher eager
limits [21]. While the modulation of message passing behavior by the amount of
buffering is a well-known result [31, 48], our dissertation provides the first efficient
analysis of where such buffering matters – based on MPI Happens-Before.
We now describe the MPI functions MPI_Isend, MPI_Irecv, MPI_Barrier, and
MPI_Wait.
2.1.1 MPI Isend
MPI_Isend is a nonblocking send that has the following prototype:
MPI_Isend (void *buff, int count, MPI_Datatype datatype, int dest,
           int tag, MPI_Comm comm, MPI_Request *handle);
where buff is the starting address of the data buffer that needs to be transferred
to the receiving side, datatype is the abstract type of the data in buff, count is
the number of elements of type datatype in buff, dest is the destination process
rank where the message is to be sent, tag is the message tag, and comm is the
MPI communicator. Tags and communicators provide fine grained communication
across processes. For simplicity, we abstract away the tags and communicators.
Chapter 8 provides a detailed description on communicators and tags and how our
formal model can be extended to handle them. The handle is set by the MPI
runtime and uniquely identifies the MPI_Isend in the MPI runtime.
Notation: We denote MPI_Isend by S.
The function call to S may return immediately (nonblocking) while the actual
send can happen at a later time. An S is considered complete by the process
issuing it if the data from buff is copied out. buff can be either copied out into
the MPI runtime provided buffer or to the buffer space of the MPI process receiving
this message. Buffer availability in the MPI runtime depends on a configuration
parameter called eager limit. An S issued by a process may be buffered if the message
size is below the eager limit. However, there is no guarantee that an S with a small
message size will always be buffered by the runtime. If the MPI runtime buffer is
available, the S can be completed immediately by the MPI runtime. Otherwise, the
S can be completed by the runtime only after it is matched with a receive operation
issued by the dest process and the data is copied from buff to the receiving buffer
space. It is illegal for the MPI process to reuse the send buffer (buff) before the
send is completed. The completion of a send is detected by the process issuing it
using MPI_Wait.
We use S̄ to denote a buffered send and S to denote a send with no runtime
buffering.
2.1.2 MPI Irecv
MPI_Irecv is a nonblocking receive with the following prototype:
MPI_Irecv (void *buff, int count, MPI_Datatype datatype, int src,
int tag, MPI_Comm comm, MPI_Request *handle);
where buff is the starting address of the memory where the data is to be received,
count, datatype have the same semantics as described for MPI_Isend, and src is
the rank of the process from where the message is to be received. The src can also
be MPI_ANY_SOURCE which indicates that the receive can be matched with an S from
any process when S’s dest is the same as the receiving process rank. It is customary
to call receives with src set to MPI_ANY_SOURCE as wildcard receives and for ease
of notation we denote MPI_ANY_SOURCE as ‘*’. The data is received into buff and
handle is returned by the MPI runtime which uniquely identifies the receive in the
MPI runtime.
Notation: We denote an MPI_Irecv by R.
The function call to an R may return immediately and is considered complete
when all the data is copied into buff. It is illegal to reuse buff before the receive
completes. The completion of a receive is detected by the process using MPI_Wait.
2.1.3 MPI Wait
MPI_Wait is a blocking call and is used to detect the completion of a send (S)
or a receive (R) and has the following prototype:
MPI_Wait (MPI_Request *handle, MPI_Status *status),
where handle is returned in an S or an R and status describes the status of the
S or R corresponding to handle.
Notation: We denote an MPI_Wait by W .
The MPI runtime blocks the call to W until the send or receive is complete. The
MPI runtime resources associated with the handle are freed when a W returns and
handle is set to a special field called MPI_REQUEST_NULL. A W call with handle
set to MPI_REQUEST_NULL is ignored by the MPI runtime. An S or R without an
eventual W is considered a resource leak.
2.1.4 MPI Barrier
A barrier call has prototype MPI_Barrier (MPI_Comm comm).
Notation: We denote MPI_Barrier by B.
B is a blocking function and is used to synchronize MPI processes.
A process blocks after issuing the barrier until all the participating processes
with the same comm also issue their respective barriers. Note that unlike the
traditional barriers used in threads where all the instructions before the thread
barrier must also be complete when the barrier returns, the MPI B does not provide
any such guarantees. An MPI B can be considered as a weak fence instruction.
2.1.5 MPI Ordering Guarantees
The ordering guarantees provided by the MPI runtime according to the MPI
standard define the order in which MPI program execution proceeds. MPI requires
all MPI library implementations (i.e., MPI runtime) to provide the following FIFO
ordering guarantees:
• For any two sends Sj and Sk, j < k from the same process i (i.e., Sj is issued
before Sk by process i) targeting the same destination (say process rank l),
the earlier send Sj is always matched with a receive before the later send Sk.
Note that this order is irrespective of the buffering status of the sends, i.e.,
the sends Sj and Sk can complete out-of-order. Consider the MPI execution in
Figure 2.1(a). Pi and Pl are two processes with ranks i and l, respectively. Pi
issues two sends to process l, S1 and S2, respectively, where S1 sends a million
bytes in data buffer d1 while S2 sends 10 bytes in data buffer d2. The W3(2)
is the W corresponding to S2 and W4(1) is the wait operation corresponding
to S1. Since S2 only sends 10 bytes, it is possible that S2 is provided MPI
runtime buffer and hence complete before S1 which is completed only after
the million bytes of data is copied into the d1 of R1. The solid directed edge
between S1 and S2 shows that S1 will be matched before S2. Hence, even if
S2 completes before S1, S1 is always matched with the first matching R1
(shown by the dotted line between S1 and R1).
• For any two receives Rj, Rk, j < k from the same process l receiving from
the same source (say i), the earlier receive Rj is always matched with a send
before the later receive Rk. Note that the receives can complete out-of-order.
Figure 2.1(a) shows two MPI receive operations of Pl: R1 and R2 that receive
messages from Pi. R1 is matched before R2 (shown as a solid directed edge
from R1 to R2). Since R2 receives only 10 bytes, it is possible that W3(2)
unblocks immediately since R2 has received the data while W4(1) which
corresponds to R1 remains blocked until R1 completes.
• For any two receives Rj, Rk, j < k from the same process l, when the first
receive Rj can receive from any source (called wildcard receive), the first
receive Rj is always matched with a send before the later receive Rk. This
scenario is depicted in Figure 2.1(b).
• For any two receives Rj, Rk, j < k from the same process l, where Rj is a
nonwildcard receive and Rk is a wildcard receive, Rj is matched before Rk
only when a matching send is available. Otherwise, Rk can be matched with
a send before Rj. In a sense, Rk(∗) has the ability to “reach over” Rj and
match. We call this behavior conditional-matches-before. This scenario is
shown in Figure 2.1(c). R1 receives a message from Pm’s S1. Since there is a
matching S1 of Pm available, R1 is matched before R2. However, if Pm did
not have S1 available, then R2 can match before R1 and Pl will block on
W4(1) until a send from Pm is available. Since the matching is dependent on
the availability of a matching S, we call this “conditional matches-before”
ordering.
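The rules above can be summarized in a small executable model, a Python sketch under our own simplified encoding (it ignores tags and communicators): receives are listed by source rank in program order, '*' stands for MPI_ANY_SOURCE, and sends_available is the set of ranks that currently have a posted send.

```python
def matchable_now(recvs, sends_available):
    """Indices of the receives (in program order) that the runtime may
    legally match next, under the FIFO ordering guarantees."""
    matchable = []
    for j, src in enumerate(recvs):
        earlier = recvs[:j]
        if "*" in earlier:
            break                        # an earlier wildcard always matches first
        if src == "*":
            # conditional-matches-before: a wildcard may "reach over" earlier
            # receives only when none of them has a matching send available
            earlier_can_match = any(s in sends_available for s in earlier)
            if sends_available and not earlier_can_match:
                matchable.append(j)
        elif src in sends_available and src not in earlier:
            matchable.append(j)          # same-source receives match in FIFO order
    return matchable
```

Running the model on the scenario of Figure 2.1(c): with recvs = [m, '*'] and a send from m available, only the first receive may match; with the send from m absent but some other send posted, the wildcard reaches over it.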
The MPI standard requires that two messages sent from process i towards
process l must be matched in their issue order (called nonovertaking in MPI). The role
(a) S and R Ordering (b) Wildcard Receive Ordering
(c) Conditional Ordering
Figure 2.1. MPI Ordering Guarantees
of matching order guarantee is to constrain the order in which sends and receives
match so as to guarantee nonovertaking. Notice that the sends and receives need
only to be matched in order. However, since the completion is only detected by a
W , the MPI standard does not enforce any order on the completion of the sends
and receives, leaving the choice to the MPI library implementation. Also note that
all the orders are defined on MPI functions within a process (i.e., all these are Intra
orders).
2.2 Dynamic Partial Order Reduction (DPOR)
Dynamic Partial Order Reduction (DPOR) [9] dynamically tracks various inter-
actions between threads/processes and generates only the Mazurkiewicz traces [5]
(called relevant interleavings henceforth). This is done by identifying the
backtracking points in the interleavings and updating the backtrack sets dynamically
so that, by the end of the execution, persistent [9, 10] sets have been formed at every
such point.
In multithreaded programming, the most common bugs are deadlocks and data
races. Deadlocks arise due to improper lock and unlock operations on mutexes
and data races occur when a shared memory location is accessed concurrently by two
or more threads, at least one access being a write. Deadlocks and data races can be
notoriously hard to debug. Many approaches have been proposed to discover data races
and deadlocks in programs [8, 36].
However, applying the DPOR algorithm is the only method that can guarantee cov-
erage. We describe the classical DPOR algorithm in the context of multithreaded
programs in this section. Note that the DPOR algorithm will only help generate
interleavings. A more sophisticated analysis on the interleavings generated may be
required to actually detect the bugs. For example, detecting data races will require
a lock-set or Happens-Before analysis on the interleavings to actually detect the
presence of the data race in the interleaving. We limit this section to describe how
DPOR can be used to generate interleavings only.
Let σi denote a state. A state is identified by the values assumed by the variables
in that state. Let σ0 be the initial state where the values assumed by all variables are
⊥ (undefined). Let enabled(σi) be the set of program instructions (called transitions
henceforth) that can be executed in σi. Let backtrack(σi) ⊆ enabled(σi) be the
backtrack points that denote the transitions that must be executed from σi in
order to explore all relevant interleavings. An interleaving I is shown as
σ0 −t0→ σ1 −t1→ · · · −tn−1→ σn, where σ0 is the start state and σn is the terminating
state. σi −ti→ σi+1 is a state transition in I from σi to σi+1 when transition ti is
executed from σi. proc(ti) denotes the process or thread executing the transition
ti. When backtrack(σi) = enabled(σi) for every state σi, then the entire state space
is explored.
The DPOR algorithm works by identifying the backtrack points based on two
notions:
1. Co-enabledness of transitions.
2. Dependence between transitions.
Definition 2.1 A transition t1 is co-enabled with a transition t2 if there exists
some state σi such that t1, t2 ∈ enabled(σi) [9].
Definition 2.2 Two transitions t1, t2 are independent implies that the following
properties hold for all states σi:
1. if t1 ∈ enabled(σi) and σi −t1→ σj, then t2 ∈ enabled(σi) iff t2 ∈ enabled(σj).
2. if t1, t2 ∈ enabled(σi), then there is a unique state σj such that σi −t1t2→ σj and
σi −t2t1→ σj.
Otherwise, the transitions are dependent. [9]
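As a quick sanity check of Definition 2.2 (using our own toy Python encoding, not taken from [9]): for two lock(l) transitions on the same mutex, executing one disables the other, so property 1 fails and the two transitions are dependent.

```python
def step_lock(state, thread):
    """Execute thread's lock(l); only legal when the mutex is free."""
    assert state["owner"] is None
    return {"owner": thread}

def lock_enabled(state):
    """A lock(l) transition is enabled exactly when no thread holds l."""
    return state["owner"] is None

# From a state where l is free, both p1's and p2's lock(l) are enabled.
free = {"owner": None}
after_p1 = step_lock(free, "p1")
# Property 1 of Definition 2.2 would require p2's lock to stay enabled
# across p1's lock; it does not, so the two lock transitions are dependent.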
Independent transitions are also known as commuting transitions. DPOR re-
quires that the transition dependence and co-enabledness are correctly or conserva-
tively identified. Two lock transitions on the same mutex by different threads are
dependent whereas locks on different mutexes are independent. Also, a write access
and read/write access to a shared variable by two different threads are dependent
whereas accesses to distinct shared variables are independent. Though identi-
fying dependence/independence between transitions is straightforward, detecting
co-enabledness can be more involved. For multithreaded programs, dependent tran-
sitions are always conservatively considered co-enabled. The conservative approach
can cause redundant interleavings but will not affect the correctness or completeness.
As in classical partial order reduction, only dependent transitions can cause the
exploration of new states. We now describe how the DPOR algorithm fills the
backtrack sets after generating an interleaving σ0 −t0→ σ1 −t1→ · · · −tn−1→ σn. We only
provide a simple description of the DPOR algorithm. For more details, readers are
encouraged to read [9].
1. The DPOR algorithm first generates an interleaving I = σ0 −t0→ σ1 −t1→ · · · −tn−1→
σn.
2. The algorithm maintains a stack of the states generated in an interleaving
where σn is at the top of the stack.
3. For every state σi, the backtrack sets are updated (goto step 7).
4. Pop the states out of the stack until a state where backtrack(σi) ≠ ∅ is found.
If the stack is empty, then there are no more interleavings to be explored.
Hence, exit. Otherwise, goto step 5.
5. Restart all the processes and regenerate the interleaving by executing t0 . . . ti−1
until σi is reached. Now explore the transitions in backtrack(σi) and generate
an interleaving.
6. goto step 2.
7. For the transition ti executed from σi in I, find the set T ⊆ {t0 . . . ti−1} of
transitions in I such that every transition in T is dependent and may be
co-enabled with ti.
8. Find the transition tj ∈ T such that j ≥ k for all tk ∈ T .
9. Update backtrack(σj) with proc(ti)’s transition in enabled(σj). If no such
transition exists, then let backtrack(σj) = enabled(σj).
10. goto step 4.
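Steps 7 through 9, the backtrack-set update, can be sketched as follows (a Python sketch with our own hypothetical helper signatures; dependent, co_enabled, and proc are supplied as callbacks):

```python
def update_backtracks(trace, backtrack, enabled, dependent, co_enabled, proc):
    """Fill backtrack sets for one completed interleaving (steps 7-9).
    trace[i] is the transition executed from state i; backtrack and
    enabled map a state index to a set of transitions."""
    for i, ti in enumerate(trace):
        # step 7: earlier transitions dependent and possibly co-enabled with ti
        T = [j for j in range(i)
             if dependent(trace[j], ti) and co_enabled(trace[j], ti)]
        if not T:
            continue
        j = max(T)                          # step 8: the latest such transition
        mine = {t for t in enabled[j] if proc(t) == proc(ti)}
        # step 9: add proc(ti)'s enabled transition, else the whole enabled set
        backtrack[j] |= mine if mine else enabled[j]
```

Applied to the interleaving of Figure 2.3, where only the two lock operations are considered dependent, the only dependent pair is t0 and t4, so only backtrack(σ0) is updated, and it receives p2's lock, matching Figure 2.4.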
We now explain the DPOR algorithm through an illustrative example.
2.2.1 DPOR Illustration
Consider the multithreaded program execution of two threads p1 and p2 shown
in Figure 2.2.
Variables x and y are shared variables. The only two possible final states of the
thread executions are x = 2, y = 1 and x = 3, y = 1. We now illustrate the
p1 : lock(l); x = 1; x = 2; unlock(l);
p2 : lock(l); y = 1; x = 3; unlock(l);
Figure 2.2. Example Thread Execution
execution of the DPOR algorithm for the example in Figure 2.2 when only the lock
operations are considered dependent. Figure 2.3 shows the generation of the first
interleaving. The very first interleaving is generated by arbitrarily selecting some
transition in enabled(σi) for all states. The program terminates (or deadlocks)
when enabled(σn) = ∅. In the example, the initial state σ0 has the variables x, y set
to ⊥ and enabled(σ0) has the lock(l) operations of p1 and p2. Note that for state σ1,
Figure 2.3. DPOR Illustration: Initial Interleaving
enabled(σ1) does not contain p2’s lock operation because p2’s lock instruction
remains disabled until p1 eventually unlocks (state σ4). The execution proceeds
until the final state σ8 is reached. At this point, the DPOR algorithm updates
the backtrack sets for all the states. Note that the only dependence is between
transitions t4 and t0, as shown in Figure 2.4. Once the dependence is recognized,
the backtrack set of σ0 (shown in bold font) is updated with p2’s transition enabled
Figure 2.4. DPOR Illustration: Updating Backtrack Set
in σ0 which is p2 : lock(l). No other backtrack sets are updated. The DPOR
algorithm now pops all the states from σ8 through σ1 since all their backtrack sets
are empty. A new execution is restarted from σ0 with p2’s lock operation and will
result in a new final state where x = 2 and y = 1.
The DPOR algorithm works for multithreaded programs but will fail when
applied to MPI programs. Multithreaded programs are guaranteed sequential
consistency for the atomic lock and unlock operations which also behave as strong
fence instructions. However, MPI does not provide any such guarantees. The
following section illustrates these issues in more detail.
2.3 Applying DPOR to MPI : Issues
We now describe the issues in applying DPOR to MPI programs through illustrative
examples. Consider the example program shown in Figure 2.5.
The example in Figure 2.5 shows the dynamic execution of three MPI processes
P0, P1 and P2. Process P0 issues an MPI_Isend (shown as S0) to dest = P1 with
the buffer d0 having the value 0. Similarly, P2 issues a send (S2) to dest = P1 with
the buffer d2 as 2. Process P1 issues an MPI_Irecv R1 which is a wildcard receive
(src = ∗) and with the receiving buffer d1. (Note that the MPI functions have
only the arguments necessary to explain the examples.) The wildcard receive R1
can receive data from either S0 or S2. However, P1 has an error when R1 receives
from S2. In order to discover this error, the program must enter into a state where
d1 is 2. The goal is to explore all the possible nondeterminism due to wildcard
receives. That is, it is necessary to match a wildcard receive with all the possible
P0: S0(P1, d0 = 0);
P1: R1(∗, d1); if (d1 == 2) error;
P2: S2(P1, d2 = 2);
Figure 2.5. Simple MPI Example
sends. Since DPOR helps explore all relevant interleavings, we apply DPOR to the
above simple MPI program, as shown in Figure 2.6.
The first interleaving generated is shown in Figure 2.6(a). In this interleaving,
S0 is issued first, followed by R1 and finally, S2 is issued. The value of d1 is 0 as
expected since S0 is issued early. The dependence between the two sends matching
the same wildcard receive R1 causes the backtrack set of σ0 to be updated with
S2. Therefore, in the next interleaving, S2 will be issued earlier instead of S0
so that R1 receives from S2 instead of S0. The second interleaving is shown in
Figure 2.6(b). However, note that d1 can still receive a value of 0 (shown in bold
font) instead of 2, contrary to what is expected. This happens because the issue order
of the sends does not determine the matching order when the sends are from different
processes. The MPI runtime decides the matching between the sends and receives.
In [68], the authors present how skewed MPI runtime matches are in real world
MPI execution environments. Unfortunately, the authors’ solution in [68] is both
highly wasteful (it introduces random delays in the main computational path) and
(a) Update Backtrack for First Interleaving (b) Surprising Result with DPOR
Figure 2.6. Illustration of Surprising MPI Runtime Behavior with DPOR
still does not guarantee that the offending message matches will be enforced. Since
Jitterbug causes exponentially more useless schedule perturbations than useful
ones, the designer cannot expect the method (even ignoring its slowdown of
operations) to yield benefits.
Therefore, even when S2 is issued earlier, it is possible for the MPI runtime to
match the later issued send S0 with R1. Hence, it is possible that the error is never
caught since DPOR erroneously assumes that all nondeterministic code paths are
explored.
DPOR, when applied to multithreaded programs, depends on the fact that the
underlying cache coherence protocol ensures that, when only one process executes
an instruction at a time, multiple writes to a shared variable happen in issue
order. This is not true for MPI. Therefore, applying classic DPOR to the
example in Figure 2.5 can result in a bug omission.
However, let us assume that it is possible to implement an MPI runtime that has
verification support so that the MPI runtime matches the sends with receives in the
order they are issued. This should solve the problem with DPOR. Unfortunately,
this is not sufficient. MPI programs can exhibit complex behaviors in which a send
cannot be issued earlier even though it can match a wildcard receive. We illustrate
this with our “Crooked Barrier” example shown in
Figure 2.7.
In the example shown in Figure 2.7, B0, B1 and B2 are the matching MPI_Barrier
operations issued by processes P0, P1, P2, respectively. Note that the barrier B0 is
issued after S0 is issued. Processes P1 and P2 are blocked at their barriers B1 and
B2 until P0 issues its barrier.

P0: S0(P1, d0 = 0); B0;
P1: B1; R1(∗, d1); if (d1 == 2) error
P2: B2; S2(P1, d2 = 2);

Figure 2.7. Crooked Barrier Example

Once P0’s barrier is issued, all the barriers unblock
so that R1 and S2 are issued. The verification-based MPI runtime matches S0 with
R1 since S0 is issued before S2. However, it is possible for S2 to also match with R1.
In order to accomplish this, it is necessary for S2 to be issued into the MPI runtime
before S0. This is impossible because B0 cannot be issued unless S0 is issued.
Hence, in any execution of the program in Figure 2.7, S0 is always issued before S2.
Thus, even with a verification-based MPI runtime, it will not be possible to explore
all nondeterministic code paths in an MPI program. The example in Figure 2.7
clearly illustrates that the DPOR algorithm for the multithreaded programs cannot
be applied as is to MPI programs.
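The impossibility argument for the crooked barrier can be checked mechanically. The sketch below (Python; the issue-order DAG is a hand-built abstraction of Figure 2.7, not the output of any tool) enumerates every linearization of the issue-order constraints and confirms that S0 is issued before S2 in all of them:

```python
from itertools import permutations

ops = ["S0", "B0", "B1", "B2", "R1", "S2"]
# a -> b: a must be issued before b (program order within a process, plus
# the rule that R1 and S2 issue only after every barrier has been issued).
before = {("S0", "B0"),
          ("B0", "R1"), ("B1", "R1"), ("B2", "R1"),
          ("B0", "S2"), ("B1", "S2"), ("B2", "S2")}

def legal(order):
    pos = {op: k for k, op in enumerate(order)}
    return all(pos[a] < pos[b] for a, b in before)

schedules = [p for p in permutations(ops) if legal(p)]
assert schedules                                      # some execution exists
# In every legal issue order, S0 precedes S2 ...
assert all(s.index("S0") < s.index("S2") for s in schedules)
# ... yet the runtime may still match R1 with either S0 or S2.
```

So a runtime that faithfully matches in issue order can never produce the S2 match on its own; the receive itself has to be rewritten, which is the role of the dynamic determinization of receives contributed by this dissertation.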
We need specialized formal verification tools for important domains, each of
which has its own computational model. This dissertation discovers and presents
the MPI computational model uniformly in terms of a Happens-Before relation
between MPI operations.
The DPOR algorithms developed in this dissertation do not require any changes
to MPI programs or the MPI library implementations to support our dynamic
verification algorithms. We enforce the requisite matches during replays of our
DPOR for MPI by dynamically determinizing MPI receive operations. This way, we
can “fire and forget” MPI receives fully knowing that they will match the intended
sends. In a nutshell, this dissertation contributes two key ideas in this area:
• Dynamically determine all send matches: This is a guaranteed algo-
rithm that will find all senders that can ever match a wildcard receive.
• Replay over all dynamic rewrites: We dynamically replay the execution
for each sender by dynamically rewriting the receive to match that sender.
ISP employs similar techniques for handling other sources of nondeterminism,
such as that introduced by MPI_Iprobe.
CHAPTER 3
MPI FORMAL MODEL
This chapter presents a formal transition system for MPI (Section 3.2). To keep
the formal model simple, this chapter assumes that all the sends are unbuffered,
i.e., none of the sends are provided with any runtime buffering. Buffered sends are
dealt with in Chapter 7. We illustrate the application of our formal model to a
small MPI program in Section 3.3. Section 3.4 illustrates why the classical DPOR
is still not directly adaptable to the MPI transitions described in this chapter.
3.1 Formal Transition System for MPI
Let N0 denote {0, 1, . . .} and let N denote {1, 2, . . .}. As in set theory, we often
write k ∈ n to mean k ∈ {0, . . . , n − 1}.

Consider an MPI program execution with PID ∈ N MPI processes, each denoted
by Pi for i ∈ PID. View each Pi as a sequence. Thus, Pi,j can be regarded as the
jth member of the sequence Pi, denoting the jth MPI operation issued by the ith
MPI process. Sequence Pi is of length |Pi|. We assume that our MPI programs
terminate; therefore, execution sequences are finite.
Let Op denote the set of all MPI operations. An MPI operation belonging to
Op is one of these:
1. Si,j(k) for i, k ∈ PID and j ∈ |Pi|. This is the MPI call MPI_Isend(to:k)
issued as the jth call by MPI process i.
2. Ri,j(k) for i, k ∈ PID and j ∈ |Pi|. This is the MPI call MPI_Irecv(from:k)
issued as the jth call by MPI process i.
3. Ri,j(∗) for i ∈ PID and j ∈ |Pi|. This is the MPI call MPI_Irecv(MPI_ANY_SOURCE)
issued as the jth call by MPI process i.
4. Wi,j′(hi,j) for i ∈ PID and j, j′ ∈ |Pi| and j < j′. This is the MPI call
MPI_Wait(handle) where handle is the wait handle returned by an earlier
issued Si,j(k), Ri,j(k), or Ri,j(∗).
5. Bi,j for i ∈ PID and j ∈ |Pi|. This is the MPI call MPI_Barrier.
Recognizers for members of Op:
• isS(Fi,j) is true when F = S and false otherwise.
• isR(Fi,j) is true when F = R and false otherwise.
• isW (Fi,j) is true when F = W and false otherwise.
• isB(Fi,j) is true when F = B and false otherwise.
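A direct encoding of Op and its recognizers can be sketched as follows (Python; the record layout is an illustrative assumption, not ISP's actual data structure):

```python
from collections import namedtuple

# An MPI operation F_{i,j}(arg): kind in {"S", "R", "W", "B"}, issued by
# process i as its j-th call; arg is the destination/source rank, "*" for a
# wildcard receive, or the handle of an earlier call for a wait.
Op = namedtuple("Op", ["kind", "i", "j", "arg"])

def isS(op): return op.kind == "S"
def isR(op): return op.kind == "R"
def isW(op): return op.kind == "W"
def isB(op): return op.kind == "B"

s01 = Op("S", 0, 1, 1)       # S_{0,1}(1): MPI_Isend to rank 1
r11 = Op("R", 1, 1, "*")     # R_{1,1}(*): wildcard MPI_Irecv
w02 = Op("W", 0, 2, s01)     # W_{0,2}(h_{0,1}): wait on the earlier send
assert isS(s01) and isR(r11) and isW(w02) and not isB(r11)
```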
3.1.1 State Model
Every MPI function is in one of the following four execution states:
• issued (I): The MPI function has been issued into the MPI runtime.
• returned (R): The MPI function has returned and the process that issued this
function can continue executing.
• matched (M): Since most MPI functions usually work in a group (for example,
an S from one process will be matched with a corresponding R from another
process), an MPI function is considered matched when the MPI runtime is able
to match the various MPI functions into a group which we call a match-set.
All the function calls in the match-set will be considered as having attained
the matched state.
• complete (C): An MPI function is considered complete, from the viewpoint
of the MPI process that issued it, when all of its visible memory effects have
occurred (e.g., if the MPI runtime has sufficient buffering, an MPI_Isend can
be considered complete when its memory buffer has been copied out into the
runtime buffer). The completion condition differs across MPI functions (e.g.,
the R matching an S may not have seen the data yet, and still the S can
complete on the sender side when the send is buffered).
3.1.2 The State of an MPI Execution
Our formal model does not model the actual MPI programs or the underlying
language semantics. Instead, we model dynamic execution sequences which are
presented by an “Oracle” that understands the underlying language semantics.
We communicate with the “Oracle” asking it for the next MPI operation that is
executed by a process. Hence, the formal model presented here abstracts away the
program variable values or local process states. Our formal model discovers the
various send and receive matches, which implicitly models the program data. Note
that the formal model directly applies to our ISP scheduler, which is aware only
of the dynamic MPI operations executed by a process; the rest of the program
instructions remain invisible to the scheduler. All values returned by an MPI
call (e.g., MPI receive data, status flags) are assumed to be available to this
Oracle, which can use them in conditional branches to decide the next MPI
operation.
The state of an MPI program execution is denoted by the record
{I : 2^Op, M : 2^(2^Op), C : 2^Op, R : 2^Op, pc : PID → N0}
or, more compactly, as the tuple
〈I, M, C, R, pc〉.
Here, I denotes those instructions that have been issued. Set R denotes those
instructions whose calls have returned to the calling process. Set M denotes those
calls that have matched, and C denotes calls that have completed. M will consist
of sets of matching MPI calls: either sets of the form {Si,j(k), Rk,l(i)}, containing
matching sends and receives, or {Bi,j | i ∈ PID, j ∈ |Pi|}, showing matching
barriers.
The initial state of our transition system, σ0, is
〈∅, ∅, ∅, ∅, λi.0〉.
Since every process starts execution with MPI_Init, Pi,0 is MPI_Init for i ∈
PID.
A transition moves the system from state σ to the next state σ′, and is written
σ −t→ σ′. The MPI execution system consists of process transitions and MPI
runtime transitions. The MPI transition system provided in this section is very
generic; an actual MPI runtime can follow any specific scheduling strategy
consistent with the transitions described here.
3.2 MPI Transition System
We are now ready to present the MPI transition system.
3.2.1 Process Transitions
The process transitions consist of issuing the visible MPI operations into the
MPI runtime. We have four process transitions for each of the MPI functions: PS,
PR, PW and PB. Let Σ be the reached states predicate.
The process transitions are defined using a rule that infers the new reached
state set Σ. For a process Pi let Curi denote the instruction being executed by Pi
at program counter pci in state 〈I,M, C, R, pc〉. The process transition for a S for
i ∈ PID is as follows:
PS :Σ(σ as 〈I,M, C, R, pc〉), isS(Curi)
Σ〈I ∪ {Curi}, M,C,R, pc〉
When process Pi has to issue a send Si,j(k) at its current program counter, the
state transition occurs by issuing the send into the MPI runtime which involves
updating the I set with Si,j(k). Note that except for the change in the runtime
state, there is no change in the local process state. Even though the send can
return immediately, the PS transition does not show any increment in the program
counter for Pi. This is because the MPI standard places no restriction on how
soon the send must return.
The process transition when the operation issued by Pi is Ri,j(k) is shown below.
PR :Σ(σ as 〈I,M, C, R, pc〉), isR(Curi)
Σ〈I ∪ {Curi}, M,C,R, pc〉
The PR transition is similar to the PS transition and the state transition involves
updating the I set of the runtime state with Curi.
PW :Σ(σ as 〈I,M, C, R, pc〉), isW (Curi)
Σ〈I ∪ {Curi}, M,C,R, pc〉
The PW transition shows the state transition when Pi issues Wi,j′(hi,j) where
hi,j refers to an earlier send Si,j or receive Ri,j. The state transition results in the
state with the I set updated with Curi.
Finally, the PB transition, shown below, applies when Pi issues a Barrier. The
state transition for PB is similar to the rest of the process transitions.
PB :Σ(σ as 〈I,M, C, R, pc〉), isB(Curi)
Σ〈I ∪ {Curi}, M,C,R, pc〉
3.2.2 MPI Runtime Book-keeping Sets
As the processes issue the MPI operations into the MPI runtime, at every state
σ, the MPI runtime also maintains certain book-keeping sets. These sets help the
runtime transitions follow the MPI ordering guarantees described in Section 2.1.5.
The book-keeping sets are defined for a state σ = 〈I, M, C, R, pc〉, and all of
them are subsets of I × I.
Definition 3.1 Nonovertake(σ as 〈I, M, C, R, pc〉) ⊆ I × I =
{〈Si,j(k), Si,j′(k)〉, 〈Ri,j(k), Ri,j′(k)〉, 〈Ri,j(∗), Ri,j′(k)〉, 〈Ri,j(∗), Ri,j′(∗)〉 | j < j′}.
The Nonovertake set, as the name suggests, tracks the nonovertaking sends and
receives that must be matched in a particular order. When a send is matched with
a receive, the send and the receive tuple will enter the M set to signify that a send
and receive have been matched. The Nonovertake set allows the sends/receives
to enter the M set in program order when the nonovertaking property must be
maintained according to the standard. For example, Si,j′ or Ri,j′ cannot enter
the M set before Si,j or Ri,j, respectively.
Definition 3.2 Resource(σ as 〈I,M, C, R, pc〉) ⊆ I × I =
{〈Si,j, Wi,j′(hi,j)〉, 〈Ri,j, Wi,j′(hi,j)〉 | j < j′}.
The Resource set tracks the completion order of a send (receive) and its
corresponding W. When an MPI function completes, the MPI runtime moves it
to the C set. However, the completion of Wi,j′ depends on when Si,j (Ri,j)
completes, i.e., when the send buffer is copied out of process Pi’s buffer space
(when Ri,j receives the data into its receive buffer). Wi,j′ can enter the C set
only after Si,j (Ri,j) enters the C set. We call this the Resource set because the
W frees the resources assigned to its handle.
Definition 3.3 Fence(σ as 〈I, M, C, R, pc〉) ⊆ I × I =
{〈Wi,j, Fi,j′〉, 〈Bi,j, Fi,j′〉 | j < j′, F ∈ {S, R, W, B}}.
The Fence set indicates that the blocking MPI functions W and B act as fences:
when a Wi,j or Bi,j is issued, no later MPI instruction Fi,j′ can be issued until
Wi,j (Bi,j) moves into the C set.
Definition 3.4 IntraHB(σ as 〈I, M, C, R, pc〉) ⊆ I × I is defined as follows:
IntraHB(σ) = Nonovertake(σ) ∪ Resource(σ) ∪ Fence(σ)
The IntraHB relation is used by the MPI runtime to move the MPI operations
into various sets (and hence cause state transitions). Note that the IntraHB is a
relation across the MPI operations issued by the same process. Hence, the name
Intra-HappensBefore. Before we present the full set of MPI Runtime transitions,
we define Ancestor, Descendant and Ready sets.
Definition 3.5 Ancestor(σ : state, y : op) = {x | 〈x, y〉 ∈ (IntraHB(σ))+}
Definition 3.6 Descendant(σ : state, x : op) = {y | 〈x, y〉 ∈ (IntraHB(σ))+}
Definition 3.7 Ready(σ as 〈I, M, C, R, pc〉) =
{x ∈ I | ∀y ∈ Ancestor(σ, x) : (¬isW(x) ⇒ ∃m ∈ M : y ∈ m) ∧ (isW(x) ⇒ y ∈ C)}.
The Ready set defines the set of MPI operations in I that are ready to be
matched, so that they can enter the M set when the matching MPI operations
are found. A non-W operation is in the Ready set when all of its ancestors have
been matched (i.e., are in the M set). A W operation is in the Ready set when
all of its ancestors are in the C set.
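Definitions 3.1 through 3.7 can be prototyped directly. The sketch below (Python; the tuple encoding of operations and the simplification that matched operations are not removed from the ready set are illustrative assumptions) builds IntraHB for a two-process program — S0,1(1); W0,2 against R1,1(0); W1,2 — and computes Ready:

```python
# Ops as (kind, i, j, arg).
S01 = ("S", 0, 1, 1); W02 = ("W", 0, 2, S01)
R11 = ("R", 1, 1, 0); W12 = ("W", 1, 2, R11)
I = {S01, W02, R11, W12}                       # all four are issued

def nonovertake(I):
    return {(x, y) for x in I for y in I
            if x[0] == y[0] and x[0] in "SR"   # same kind, send or recv
            and x[1] == y[1] and x[2] < y[2]   # same process, earlier first
            and (x[3] == y[3] or x[3] == "*")} # same target, or wildcard first

def resource(I):
    return {(x, y) for x in I for y in I
            if y[0] == "W" and y[3] == x}      # wait on x's handle

def fence(I):
    return {(x, y) for x in I for y in I
            if x[0] in "WB" and x[1] == y[1] and x[2] < y[2]}

def intra_hb(I):
    return nonovertake(I) | resource(I) | fence(I)

def ancestors(I, x):                           # transitive closure of IntraHB
    hb, acc, frontier = intra_hb(I), set(), {x}
    while frontier:
        frontier = {a for (a, b) in hb if b in frontier} - acc
        acc |= frontier
    return acc

def ready(I, M, C):
    matched = {op for m in M for op in m}
    return {x for x in I
            if all(y in (C if x[0] == "W" else matched)
                   for y in ancestors(I, x))}

M = [{S01, R11}]                               # send/receive matched ...
C = set()                                      # ... but nothing completed yet
assert ready(I, M, C) == {S01, R11}            # waits blocked until C grows
assert ready(I, M, {S01, R11}) == I            # after completion, waits ready
```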
3.2.3 MPI Runtime Transitions
We are now ready to present the MPI runtime transitions. We first present
the RSRet and RRRet transitions, which stand for the MPI Runtime Send Return
and Runtime Receive Return transitions.
RSRet :Σ(σ as 〈I,M, C, R, pc〉), Si,j ∈ I ∧ Si,j /∈ R
Σ〈I,M, C, R ∪ {Si,j}, pc[i ← pci + 1]〉
RRRet :Σ(σ as 〈I,M, C, R, pc〉), Ri,j ∈ I ∧Ri,j /∈ R
Σ〈I,M, C, R ∪ {Ri,j}, pc[i ← pci + 1]〉
The RSRet and RRRet transitions define the control transfer back to the processes
when they issue the Si,j and Ri,j instructions. The control transfer is shown by
incrementing and updating the program counter of process Pi and by updating the
R set of the MPI runtime state.
RSR :Σ(σ as 〈I,M, C, R, pc〉), {Si,j(k), Rk,l(i)} ⊆ Ready(σ)
Σ(σ′ as 〈I,M ∪ {{Si,j, Rk,l}}, C, R, pc〉)
Assert : Ready(σ′) = Ready(σ)− {Si,j, Rk,l}
The MPI runtime transition RSR shows the formation of a send and a receive
match set when they are both ready to be matched (i.e., Si,j and Rk,l ∈ Ready(σ)).
The transition matches the send and receive and moves them to the M set. By
virtue of Definition 3.7, Ready(σ′) will satisfy the assertion shown. This assertion
(separately provable from Definition 3.7) shows how matched items are removed
from Ready. The MPI runtime transition to complete the send and receive once
they are matched is as follows:
RSC :Σ(σ as 〈I,M, C, R, pc〉), {Si,j, Rk,l} ∈ M ∧ Si,j /∈ C
Σ〈I,M, C ∪ {Si,j}, R, pc〉
RRC :Σ(σ as 〈I,M, C, R, pc〉), {Si,j, Rk,l} ∈ M ∧Rk,l /∈ C
Σ〈I,M, C ∪ {Rk,l}, R, pc〉
The RSC and RRC transitions look for sends and receives that have been
matched and update the C set with the send and receive operations. Note that
matching and completion can happen back-to-back. However, the MPI runtime
can also first match and only later perform the actual data transfer (which
completes the send and receive operations), for various reasons such as large
data buffers, a busy network, or performance optimizations. Our runtime
transition system captures this latitude provided by the MPI standard.
RWC :Σ(σ as 〈I,M, C, R, pc〉), Wi,j ∈ Ready(σ)
Σ(σ′ as 〈I,M ∪ {{Wi,j}}, C ∪ {Wi,j}, R, pc〉)
Assert : Ready(σ′) = Ready(σ)− {Wi,j}
RWC is the transition that completes a W operation. When the W operation
is in Ready(σ), the W is ready to be completed, by virtue of the definition of
the Ready set: a W operation enters the Ready set only when its corresponding
send or receive has completed and is in the C set.
RBC :Σ(σ as 〈I,M, C, R, pc〉), bar as {Bi,j | Bi,j ∈ Ready(σ)}, | bar |= PID
Σ(σ′ as 〈I,M ∪ {bar}, C ∪ bar,R, pc〉)
Assert : Ready(σ′) = Ready(σ)− bar
The RBC transition matches and completes a B operation. When Ready(σ)
contains a B operation for every process in PID, the transition matches all the
barriers by updating M with {bar} and also updates the C set. The Ready(σ′)
set is also appropriately updated.
RWRet :Σ(σ as 〈I,M, C, R, pc〉), Wi,j ∈ C ∧Wi,j /∈ R
Σ〈I,M, C, R ∪ {Wi,j}, pc[i ← pci + 1]〉
RBRet :Σ(σ as 〈I,M, C, R, pc〉), Bi,j ∈ C ∧Bi,j /∈ R
Σ〈I,M, C, R ∪ {Bi,j}, pc[i ← pci + 1]〉
The RWRet and RBRet return the control back to the processes issuing the B
and W operations once the B and W are completed. The final runtime transition
is the RSR∗ transition that matches a send and a wildcard receive.
RSR∗ : Σ(σ as 〈I,M, C, R, pc〉), {Si,j(k), Rk,l′(∗)} ⊆ Ready(σ), ¬∃l < l′ : Rk,l(i) ∈ Ready(σ)
Σ(σ′ as 〈I,M ∪ {{Si,j, Rk,l′(i)}}, C, R, pc〉)
Assert : Ready(σ′) = Ready(σ)− {Si,j, Rk,l′}
3.2.4 Conditional Matches-before
The RSR∗ transition matches a send with a wildcard receive. However, the send
can be matched only when there is no nonwildcard receive that can match it. This
satisfies the conditional matches-before requirement. For two receives Rk,l(i) and
Rk,l′(∗) (l < l′) issued by Pk, the later wildcard receive Rk,l′ cannot be matched
with an available send Si,j(k) before the earlier receive Rk,l is matched. Note that
〈Rk,l, Rk,l′〉 /∈ IntraHB.
This makes it possible for both Rk,l and Rk,l′ to be in Ready(σ) at the same
time. By checking that Rk,l is not in the Ready(σ), the conditional matches-before
order is preserved.
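The side condition of RSR∗ is a one-line check over the ready set. A sketch (Python; the tuple encoding (kind, i, j, arg) of operations is an illustrative assumption):

```python
# The RSR* guard: S_{i,j}(k) may match the wildcard receive R_{k,l'}(*) only
# if no earlier specific receive R_{k,l}(i), with l < l', is also ready.
def may_match_wildcard(send, recv, ready):
    _, i, _, k = send
    _, kk, lp, arg = recv
    if arg != "*" or kk != k:
        return False
    return not any(r[0] == "R" and r[1] == kk and r[2] < lp and r[3] == i
                   for r in ready)

S01 = ("S", 0, 1, 1)                 # S_{0,1}(1)
Rk1 = ("R", 1, 1, 0)                 # R_{1,1}(0): specific receive from P0
Rk2 = ("R", 1, 2, "*")               # R_{1,2}(*): later wildcard receive
assert not may_match_wildcard(S01, Rk2, {S01, Rk1, Rk2})  # blocked by R_{1,1}
assert may_match_wildcard(S01, Rk2, {S01, Rk2})           # R_{1,1} matched away
```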
3.2.5 Dynamic Instruction Rewriting
Also note that the M set now contains Rk,l′(i), which rewrites the src field of
the receive to the rank of the sending process with which it is matched.
3.2.6 One Transition or Multiple?
One may wonder whether RSR∗ is a single transition or a collection of one or
more. Each firing of RSR∗ is a single transition; the rule RSR∗, of course, defines
a family of transitions, one per sender that can match the wildcard receive. These
transitions are the only ones that are dependent – a notion that will be defined
formally in the next chapter. For this reason, we call the set of transitions denoted
by RSR∗ a dependent transition group.
3.2.7 Dependent Transition Group
Let Rk,l(∗) be a wildcard receive statement in an MPI process execution Pk,
for some k ∈ PID. Let σ ∈ Σ, and let Si,j(k) (for various i, j, k) represent sends
such that transition RSR∗ can fire by virtue of {Si,j(k), Rk,l(∗)} ⊆ Ready(σ). Let
the set of all these transitions be denoted τ. We say that τ forms a dependent
transition group (DTG) of state σ. isDtg(τ) is true precisely for such sets τ.
3.2.8 Selectors and Useful Predicates
Several predicates associated with these transition rules are now defined. These
will be used in subsequent chapters.
• is∗(t): for a transition t, is∗(t) is true exactly when t is an RSR∗ transition.
• has∗(τ): for a set of transitions τ , has∗(τ) is true if there is a t ∈ τ such that
is∗(t).
• hasnon∗(τ): for a set of transitions τ , hasnon∗(τ) is true if there is a t ∈ τ such
that ¬is∗(t).
• choose∗(τ) denotes t ∈ τ such that is∗(t) (like Hilbert’s choice operator).
• choosenon∗(τ) denotes t ∈ τ such that ¬is∗(t) (like Hilbert’s choice operator).
• all∗(τ) denotes the dependent transition groups in τ. That is,
all∗(τ) = {g ⊆ τ | isDtg(g)}.
Notice that multiple wildcard moves may be enabled at a state σ; that is,
there could be multiple DTGs at a state. Also notice that in the “crooked barrier”
example presented earlier (Figure 2.7), it is possible to have a barrier transition
and the wildcard receive transition both enabled at a state. Thus, DTGs and
regular (deterministic) transitions can be enabled at the same time. Our algorithm
POE prioritizes deterministic transitions until we reach a state with only DTGs.
POEOPT optimizes POE by treating the DTGs as independent as far as possible.
This concludes the MPI runtime transitions. The next section illustrates MPI
program execution using the MPI runtime transitions provided in this section.
3.3 Illustration of the Formal Model
We now illustrate the working of the formal MPI model as a state transition system.
Consider the simple MPI execution shown in Figure 3.1. Figure 3.2 shows the
execution of the MPI program in Figure 3.1. Each state σi is labeled with
• The I,M, C, R sets that identify the state σi.
• enabled(σi) is the set of transitions that can be executed from σi.
• The IntraHB relation among the MPI operations in I, and
• Ready(σi).

P0: S0,1(1); W0,2(h0,1);
P1: R1,1(0); W1,2(h1,1);

Figure 3.1. Simple MPI Program

Figure 3.2. Execution of Figure 3.1 with MPI Transitions
The MPI execution of Figure 3.2 proceeds as follows:
• σ0 has the process transitions enabled to issue S0,1 and R1,1. The rest of the
sets are empty. The process transitions are denoted as PS : S0,1 and PR : R1,1,
which instantiate particular PS and PR transitions, respectively.
• The PS : S0,1 transition is executed from state σ0 and reaches state σ1 with
S0,1 in σ1.I. Since S0,1 has no ancestors, i.e., Ancestors(σ1, S0,1) = ∅, S0,1 is
also in Ready(σ1).
• RSRet : S0,1 is now enabled in σ1; this transition is executed from σ1 to
generate σ2.
• Since S0,1 returned, W0,2 is now ready to be issued which is evident from
PW : W0,2 ∈ enabled(σ2).
• PW : W0,2 is executed from σ2 to generate σ3.
• Note that W0,2 will not be in Ready(σ3), due to the IntraHB relation between
S0,1 and W0,2 and the fact that S0,1 /∈ σ3.C.
• W0,2 will enter Ready(σ7) after S0,1 completes by executing RSC : S0,1 in σ6.
• The rest of the execution can be understood similarly. The execution ends
when there are no more transitions to be executed, i.e., enabled(σ13) = ∅.
3.4 Applying DPOR to MPI Transition System
We now present the DPOR algorithm applied to the MPI transition system and
discuss the issues that arise when DPOR is applied to the MPI transitions of MPI
programs. We redo the example presented in Figure 2.5 with minor changes. Note
that these changes do not in any way change the semantics of the example.
Consider the example in Figure 3.3. Note that the example deadlocks when R1,1
is matched with S0,1. An MPI execution for this example is shown in Figure 3.4.
P0: S0,1(1); W0,2(h0,1);
P1: R1,1(∗); R1,2(0); W1,3(h1,1); W1,4(h1,2);
P2: S2,1(1); W2,2(h2,1);

Figure 3.3. MPI Execution with a Deadlock

Figure 3.4. MPI Execution of Figure 3.3
By the time the execution reaches state σi, the process transitions PS : S2,1, S0,1
and PR : R1,1, as well as the runtime transitions RRRet : R1,1 and RSRet : S2,1, S0,1,
have been executed. Ready(σi) contains {S2,1, R1,1, S0,1}, which enables two RSR∗
transitions, as shown in enabled(σi). At σi, the RSR∗ : {S2,1, R1,1} transition is
executed.
S0,1 will be eventually matched with R1,2. Once the interleaving is generated,
the DPOR algorithm starts updating the backtrack sets. The only dependent
transitions are RSR∗ : {S0,1, R1,1} and RSR∗ : {S2,1, R1,1}, since executing one of
them disables the other (R1,1 is removed from Ready(σi)). However, unlike in
thread programs, once a transition is disabled, it never becomes enabled again;
the RSR∗ : {S0,1, R1,1} transition never gets enabled again in the current
interleaving. In thread-based programs, if a thread instruction remains disabled,
this leads to a deadlock. This is not necessarily true for MPI, as seen in this
example. Since the
dependent transition is never enabled, the DPOR algorithm will never update the
backtrack sets and the deadlock remains undetected. Note that it is possible to
have an interleaving (execution) where one of the RSR∗ transitions is never enabled
in the execution. This can happen if the RSR∗ : {S2,1, R1,1} is executed before the
PS : S0,1 is executed.
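The bug omission can be replayed on a toy abstraction of Figure 3.3 (Python; the string labels and hand-written ready-set updates are illustrations, not the formal transition system):

```python
# After RSR* fires {S21, R11}, R11 leaves the ready set, so the dependent
# alternative {S01, R11} is disabled and never re-enabled in this interleaving.
ready = {"S01", "S21", "R11"}
enabled_matches = [{"S01", "R11"}, {"S21", "R11"}]   # two RSR* candidates

fired = {"S21", "R11"}                 # the runtime picks this match
ready -= fired                         # matched ops leave the ready set

# Replay the rest of the interleaving: S01 later matches the specific
# receive R12, and R11 never returns to the ready set.
ready |= {"R12"}                       # P1 issues its next receive
ready -= {"S01", "R12"}                # deterministic match {S01, R12}

# The alternative wildcard match is never co-enabled again, so classic
# DPOR never adds it to any backtrack set: the deadlock goes undetected.
assert "R11" not in ready
later_enabled = [m for m in enabled_matches if m <= ready]
assert later_enabled == []
```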
CHAPTER 4
THE POE ALGORITHM
This chapter presents the POE algorithm, which stands for Partial Order
avoiding Elusive Interleavings. This version of POE is applicable to MPI programs
where sends do not have runtime buffering. Section 4.1 presents the dependence
properties of MPI transitions. Section 4.2 presents the POE algorithm along with
the proof of correctness. Section 4.3 illustrates the working of the POE algorithm.
Finally, Section 4.4 presents two drawbacks of the POE algorithm and concludes
the chapter.
4.1 MPI Transition Dependence
This section presents dependence and independence properties of MPI transi-
tions. Transition independence is presented in Definition 2.2.
Definition 4.1 An MPI transition t is enabled in a state σi (written t ∈
enabled(σi)) when it can be fired according to an MPI transition rule (presented
in Section 3.2).
Definition 4.2 Two transitions t1 and t2 are co-enabled when there is a state
σi such that {t1, t2} ⊆ enabled(σi).
Definition 4.3 Let t(σ) denote the state attained after transition t fires. Tran-
sitions t1 and t2 are independent exactly when, for all states σ, t2 ∈ enabled(σ) ⇔
t2 ∈ enabled(t1(σ)), and further t1(t2(σ)) = t2(t1(σ)). Two transitions are depen-
dent if they are not independent.
The definition in [5, Chapter 10] allows transitions to be independent even if
one transition can enable the other. We follow the stricter definition above,
along the lines of [9].
Lemma 4.4 All transitions of τ such that isDtg(τ) are pairwise dependent.
All other transition pairs are independent.
Lemma 4.4 is very important in the development of dynamic verification algo-
rithms for MPI programs. The above lemma implies that in programs that do not
have any wildcard receives, all the transitions are independent. Hence, the number
of relevant interleavings is only one. For programs with wildcard receives, the
number of relevant interleavings is governed by the dependent transition groups
generated by RSR∗ transitions. Compared to shared memory programs, MPI
programs have far fewer relevant interleavings. In particular, operations involving
different communicators are completely independent. This explains why random
delay tricks such as in [68] are far less effective for message passing programs, and
why algorithms such as POE are even more important.
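The interleaving count implied by Lemma 4.4 can be made concrete. A back-of-the-envelope sketch (Python; the per-receive candidate counts are made-up numbers, and treating the choices as independent gives only an upper bound):

```python
from math import factorial, prod

# Deterministic MPI program (no wildcard receives): one relevant interleaving.
assert prod([]) == 1

# Program with three wildcard receives that can match 2, 3 and 2 senders:
# the relevant interleavings are bounded by the product of the match choices.
candidates = [2, 3, 2]
assert prod(candidates) == 12

# Contrast: even 7 freely interleaved shared-memory operations admit
# factorial-many schedules, dwarfing the MPI bound above.
assert factorial(7) == 5040
```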
4.2 The POE Algorithm
This section presents the POE algorithm and its proof of correctness. Before
describing the POE algorithm, we first define and explain the notion of persistent
sets which underlies almost all our correctness arguments. In almost all cases, our
proof goal will be to show that for every reachable state σi, our algorithms compute
persistent backtrack sets.
4.2.1 Persistent Sets
Definition 4.5 For a reachable state σi ∈ Σ of an MPI program, a set of
transitions τ ⊆ enabled(σi) is persistent iff, for all nonempty sequences of
transitions σi −ti→ σi+1 −ti+1→ σi+2 · · · −tp−1→ σp that lie outside τ, tp is
independent of all transitions in τ.
4.2.2 Persistent Sets and MPI Program Correctness
Our interest is in detecting two classes of bugs:
• Deadlocks (states where all MPI processes are stuck). Even “partial dead-
locks” (deadlocks involving a proper subset of processes) will turn into full
deadlocks because we expect our terminating MPI processes to call MPI_Finalize.
• Violations of C assert statements placed within MPI processes. C assert
violations are equivalent to deadlocks because following [12], we can check
each assertion as a precondition of MPI transitions, and prevent the transition
from firing in case the precondition is false.
It is shown in [10] that persistent set-based search will reveal all deadlocks. That
is, if there is any path in the global state space where σ is reachable, then there
is a path traversing through only persistent sets where σ is reachable. Thus, the
correctness of all our POE variants will be argued by showing that they compute
persistent backtrack sets.
4.2.3 POE Algorithm
This section presents the POE algorithm and proves that the backtrack set at
every state generated by the POE algorithm is persistent.
The POE algorithm has some book-keeping sets, variables and helper routines
that are defined below:
• backtrack(σi) is the backtrack set of transitions of state σi. It has the same
semantics as in the classical DPOR algorithm.
• done(σi) ⊆ backtrack(σi) tracks the transitions of the backtrack(σi) that have
already been executed from σi.
• statevec is a vector of states explored in an interleaving. statevec also behaves
as a stack where a new state is pushed at the top of statevec and a state is
popped from the top of statevec. Initially, the statevec is empty.
• curr(σi) denotes the MPI transition that was executed from σi in the inter-
leaving generated in statevec. Note that curr(σi) ∈ backtrack(σi).
• Execute(σi, curr(σi)) is a helper routine that executes the MPI transition
curr(σi) in the current state σi and returns the resulting next state.
• GetTransition takes a set of MPI transitions as its argument and returns
a transition from the argument set as per the pseudo-code.
We are now ready to describe the full POE algorithm. Figures 4.1, 4.2, 4.3, and
4.4 provide the pseudo-code for the full POE algorithm. Figure 4.1 is the main
POE routine, which is invoked with two inputs: the initial state σ0 and an empty
statevec. The initial state σ0 is pushed onto statevec, and GetTransition
is invoked by POE to get the transition to be executed from σ0; backtrack(σ0)
is updated with curr(σ0) (lines 2–4, Figure 4.1).
POE then invokes GenerateInterleaving, which generates an interleaving
by selecting transitions in a prioritized manner (lines 6 and 10) by invoking
GetTransition. All RSR∗ transitions have the lowest priority; the rest of the
transitions have the same priority.
Once an interleaving is generated, the backtrack sets are updated by the POE
algorithm, as shown by the routine UpdateBacktrack in Figure 4.3. The rule
for generating the backtrack sets is simple: in a given state σ, if curr(σ) is not
an RSR∗ transition, then backtrack(σ) is the singleton set {curr(σ)}. Otherwise,
backtrack(σ) = enabled(σ).
After the backtrack set of every state is updated, the POE algorithm starts
popping states off statevec until a state σ = statevec[i] is reached with
backtrack(σ) ≠ done(σ), i.e., some transition in backtrack(σ) has not yet been
executed. If no such state exists, the POE algorithm pops all the states out of
statevec and terminates, signalling that there are no more interleavings to be
explored. Otherwise,
1: POE(σ0, statevec) {
2:   statevec.push(σ0);
3:   curr(σ0) = GetTransition(enabled(σ0));
4:   backtrack(σ0) = backtrack(σ0) ∪ {curr(σ0)};
5:   while (! statevec.empty()) {
6:     GenerateInterleaving (statevec);
7:     UpdateBacktrack (statevec);
8:     for (i = statevec.size()−1; i ≥ 0; i−−) {
9:       if (backtrack(statevec[i]) == done(statevec[i])) {
10:        statevec[i].pop();
11:      } else {
12:        break;
13:      }
14:    }
15:  }
Figure 4.1. Pseudocode for POE Algorithm
1: GetTransition(set of transitions T ) {
2:   if hasnon∗(T )
3:     return choosenon∗(T );
4:   else
5:     return choose∗(T )
6: }
Figure 4.2. Pseudocode for GetTransition
1: UpdateBacktrack(statevec) {
2:   for each (σ ∈ statevec) {
3:     if (enabled(σ) = ∅)
4:       return;
5:     if (is∗(curr(σ)))
6:       backtrack(σ) = enabled(σ);
7:   }
8: }
Figure 4.3. Pseudocode for UpdateBacktrack
1:  GenerateInterleaving(statevec) {
2:    σ = statevec[0];
3:    for (i = 0; i < statevec.size()−1; i++) {
4:      σ = Execute(statevec[i], curr(statevec[i]));
5:    }
6:    curr(σ) = GetTransition(backtrack(σ) − done(σ));
7:    do {
8:      σ = Execute(σ, curr(σ));
9:      statevec.push(σ);
10:     curr(σ) = GetTransition(enabled(σ));
11:     backtrack(σ) = backtrack(σ) ∪ {curr(σ)};
12:     done(σ) = done(σ) ∪ {curr(σ)};
13:   } while (enabled(σ) ≠ ∅);
14: }
Figure 4.4. Pseudocode for GenerateInterleaving
the algorithm invokes GenerateInterleaving on the statevec, which results
in restarting all the MPI processes.
The algorithm for GenerateInterleaving is shown in Figure 4.4. Lines 2–5
show the state generation when the program is restarted. The algorithm executes
the same transitions curr(σ) for all but the state that is at the top of statevec.
From this state, a new transition is executed from backtrack(σ) − done(σ) (lines
6–7, Figure 4.4). This causes new states to be generated, which are pushed
onto statevec along with the backtrack sets generated for these states. States
are generated until enabled(σ) = ∅, which is the terminating state (lines 8–14,
Figure 4.4).
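To make the control flow of Figures 4.1 through 4.4 concrete, the following Python sketch drives the same statevec / backtrack / done bookkeeping over an abstract transition system. This is an illustration only, not the actual implementation: `enabled`, `execute`, and `is_star` are caller-supplied assumptions, and "replay" is simulated by caching states in the frames rather than restarting processes.

```python
def poe(initial, enabled, execute, is_star):
    """Replay-style DFS in the spirit of Figures 4.1-4.4.  `enabled(s)` lists
    the transitions of state s, `execute(s, t)` returns the successor, and
    `is_star(t)` marks RSR*-style (nondeterministic) transitions."""

    def get_transition(ts):
        non_star = [t for t in ts if not is_star(t)]   # prioritized choice
        return min(non_star) if non_star else min(ts)

    frames = [{"state": initial, "curr": None,
               "backtrack": {get_transition(enabled(initial))}, "done": set()}]
    interleavings = []

    while frames:
        # GenerateInterleaving: branch at the top frame, then run to the end.
        top = frames[-1]
        top["curr"] = get_transition(top["backtrack"] - top["done"])
        top["done"].add(top["curr"])
        run = [f["curr"] for f in frames]
        state = execute(top["state"], top["curr"])
        while enabled(state):
            t = get_transition(enabled(state))
            frames.append({"state": state, "curr": t,
                           "backtrack": {t}, "done": {t}})
            run.append(t)
            state = execute(state, t)
        interleavings.append(tuple(run))
        # UpdateBacktrack: widen backtrack where a *-transition was taken.
        for f in frames:
            if f["curr"] is not None and is_star(f["curr"]):
                f["backtrack"] = set(enabled(f["state"]))
        # Pop frames whose backtrack set has been fully explored.
        while frames and frames[-1]["backtrack"] <= frames[-1]["done"]:
            frames.pop()
    return interleavings
```

On a toy system where a state is the set of remaining transitions and `*`-prefixed names mark RSR∗-style choices, the driver first runs the prioritized interleaving and then revisits only the widened `*` branch points.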
We now prove the following theorem:
Theorem 4.6 For any state σ generated by the POE algorithm, backtrack(σ)
is persistent.
Proof: By induction, in postorder (over the successor relation).
Basis case: The state after MPI_Finalize is persistent, as it has an empty
set of enabled transitions.
Induction Hypothesis: Pick some state σi and assume that all its successors
are persistent.
Induction Step: Now consider σi.
Our algorithm guarantees that either ¬has∗(backtrack(σi)) or backtrack(σi) =
enabled(σi). The latter case preserves persistence. The former case also
preserves persistence because, in any state satisfying ¬has∗(backtrack(σi)),
no transition in backtrack(σi) can be dependent with any other transition
(Lemma 4.4).
Note: This proof did not need the induction hypothesis because this version of
POE ensures locally (for each state) that only persistent sets are chosen. We will
employ the same proof structure for later versions of POE, and in those proofs, we
will need the induction hypothesis.
4.3 Illustration of POE Algorithm
We illustrate the POE algorithm on the “Crooked Barrier” example shown in
Figure 4.5.
Figure 4.6 and Figure 4.7 show the interleavings generated by the POE
algorithm for Figure 4.5. The POE algorithm selects one of the process
transitions in enabled(σ0), adds it to backtrack(σ0), and updates curr(σ0).
GenerateInterleaving is now invoked with σ0 at the top of statevec.
GenerateInterleaving then executes curr(σ0) (line 8) and generates the
next state σ1. One of the transitions in enabled(σ1) is selected using
GetTransition, which implements the prioritized execution semantics of POE.

P0           P1           P2
S0,1(1)      B1,1         B2,1
B0,2         R1,2(∗)      S2,2(1)
W0,3(h0,1)   R1,3(2)      W2,3(h2,2)
             W1,4(h1,2)
             W1,5(h1,3)

Figure 4.5. Crooked Barrier Example

Figure 4.6. POE Interleaving 1

GetTransition always selects a non-RSR∗ transition if available. This can be seen in the
interleaving in Figure 4.6 where the transitions executed from σ0 to σl to σm are
non-RSR∗ transitions. σm has only RSR∗ transitions in enabled(σm). GetTransition
arbitrarily selects RSR∗ : {S0,1, R1,2}, generates the rest of the interleaving,
and returns. The POE algorithm then invokes UpdateBacktrack for each of
the states generated. UpdateBacktrack only updates a state when is∗(curr(σ))
is true. In this case, it updates backtrack(σm) with enabled(σm). For the rest of
the states, backtrack(σ) = {curr(σ)}. The POE algorithm will start popping off
states from statevec until it reaches a state where backtrack(σ) − done(σ) ≠ ∅.
Figure 4.7. POE Interleaving 2
All the states from σr to σm+1 get popped out. Since statevec is not empty,
GenerateInterleaving is now invoked and generates the second interleaving
shown in Figure 4.7. GenerateInterleaving will now re-execute the same
set of transitions from σ0 until σm (lines 3–5). The transition to be executed
from σm is selected from backtrack(σm) − done(σm) which is RSR∗ : {S2,2, R1,2}.
The second interleaving will eventually reach the final deadlocked state σf where
Ready(σf) ≠ ∅ and the processes have not executed MPI_Finalize.
4.4 Issues with POE Algorithm
Though the POE algorithm is guaranteed to detect deadlocks and generate all
relevant interleavings for an MPI program, it does so under the assumption that
none of the sends have any runtime buffering. Also, the POE algorithm can result
in a number of redundant interleavings as will be made evident in this section.
4.4.1 Redundant Interleavings
The POE algorithm will cause multiple interleavings only when there is a state
with multiple RSR∗ transitions. Consider the MPI program in Figure 4.8.
The POE algorithm execution will result in a state σi where enabled(σi) =
{RSR∗ : {S0,1, R1,1}, RSR∗ : {S2,1, R3,1}}. The POE algorithm would now make
backtrack(σi) = enabled(σi), which results in two interleavings, even though
the number of relevant interleavings for the program in Figure 4.8 is only 1. Note that
for n such independent RSR∗ transitions co-enabled in a state, the POE algorithm
will cause n! interleavings while just 1 interleaving is sufficient. However, this
redundancy cannot be eliminated simply by omitting transitions from different
DTG groups from the backtrack sets. Consider the example program shown in Figure 4.9.
The POE algorithm would enter a state σi where enabled(σi) = {RSR∗ :
{S0,1, R1,1}, RSR∗ : {S2,1, R3,1}}. Now consider the scenario where the POE
algorithm would consider the two RSR∗ transitions as independent and add only
one of them to the backtrack set, say RSR∗ : {S0,1, R1,1}. In this case, it is possible
to take a transition in enabled(σi) − backtrack(σi) = {RSR∗ : {S2,1, R3,1}} and
enter a state σj where enabled(σj) = {RSR∗ : {S0,1, R1,1},RSR∗ : {S2,3, R1,1}}. The
transition RSR∗ : {S0,1, R1,1} can be disabled by the transition RSR∗ : {S2,3, R1,1}.
P0           P1           P2           P3
S0,1(1)      R1,1(∗)      S2,1(3)      R3,1(∗)
W0,2(h0,1)   W1,2(h1,1)   W2,2(h2,1)   W3,2(h3,1)
Figure 4.8. Redundant POE Interleavings
P0           P1           P2           P3
S0,1(1)      R1,1(∗)      S2,1(3)      R3,1(∗)
W0,2(h0,1)   W1,2(h1,1)   W2,2(h2,1)   W3,2(h3,1)
S0,3(3)      R1,3(∗)      S2,3(1)      R3,3(∗)
W0,4(h0,3)   W1,4(h1,3)   W2,4(h2,3)   W3,4(h3,3)
Figure 4.9. POE and Persistent Sets
An identical scenario occurs when backtrack(σi) contains only RSR∗ : {S2,1, R3,1}.
This makes backtrack(σi) nonpersistent. The POE algorithm hence sets
backtrack(σi) = enabled(σi) to keep the backtrack sets persistent for RSR∗
transitions.
Chapter 5 extends the POE algorithm to the POEOPT algorithm to reduce this
redundancy.
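The n! blow-up noted above can be seen concretely. In the following sketch, the strings are merely labels for n independent, co-enabled wildcard match-sets (they are not real operations); with backtrack(σ) = enabled(σ), POE ends up exploring one interleaving per ordering of the matches:

```python
from math import factorial
from itertools import permutations

# Labels standing in for n independent, co-enabled wildcard match-sets:
matches = ["RSR*{S0,1,R1,1}", "RSR*{S2,1,R3,1}", "RSR*{S4,1,R5,1}"]

# One interleaving per ordering of the independent matches:
orderings = set(permutations(matches))
assert len(orderings) == factorial(len(matches))   # 3! = 6
```

A single ordering would already cover the one relevant outcome, which is exactly the redundancy POEOPT targets.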
4.4.2 POE and Buffered Sends
The POE algorithm works correctly only when the sends do not have adequate
buffering. If sends can be buffered, it can miss deadlocks present in the program.
Consider the MPI example shown in Figure 4.10.
When none of the sends are buffered, only S1,1 can match the wildcard receive
R2,1 and there is no deadlock. However, when S1,1 is buffered, W1,2 can complete
even before the send is matched. This enables S0,1 and R1,3 to match; since S0,1
is matched, it can complete, unblocking W0,2. Now, S0,3 is issued and since the
P0           P1           P2
S0,1(1)      S1,1(2)      R2,1(∗)
W0,2(h0,1)   W1,2(h1,1)   W2,2(h2,1)
S0,3(2)      R1,3(0)      R2,3(0)
W0,4(h0,3)   W1,4(h1,3)   W2,4(h2,3)
Figure 4.10. Buffering Sends and POE
wildcard receive is not yet matched, it can be matched with S0,3, resulting in a
deadlock since R2,3 will not have a matching send. Note that this deadlock cannot
happen when none of the sends are buffered. We call this the slack-inelastic [31]
property of MPI. One solution would be to buffer all the sends. However, this will
mean that any deadlocks corresponding to nonbuffered sends will not be detected
by POE. Since buffer allocation is a dynamic property, our goal is to extend POE
so that it can detect all deadlocks. Chapter 7 extends the POE algorithm to handle
buffered sends.
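The slack-inelastic behavior above can be reproduced with a small brute-force explorer over a toy model of Figure 4.10. The model below is an assumption-laden simplification (sends and receives are nonblocking, waits block until their operation is matched, and a buffered send's wait completes immediately); it is meant only to show that a deadlock appears exactly when S1,1 is buffered, not to model full MPI semantics.

```python
# Toy model of the Figure 4.10 program.  Each process is a list of ops:
#   ("S", dest)  -- nonblocking send to process `dest`
#   ("R", src)   -- nonblocking receive from `src` ("*" = wildcard)
#   ("W", i)     -- wait on the op issued at index i of the same process
PROGRAM = [
    [("S", 1), ("W", 0), ("S", 2), ("W", 2)],    # P0: S0,1(1) W0,2 S0,3(2) W0,4
    [("S", 2), ("W", 0), ("R", 0), ("W", 2)],    # P1: S1,1(2) W1,2 R1,3(0) W1,4
    [("R", "*"), ("W", 0), ("R", 0), ("W", 2)],  # P2: R2,1(*) W2,2 R2,3(0) W2,4
]

def explore(buffered):
    """Return True iff some schedule of PROGRAM deadlocks.  `buffered` holds
    (proc, op_index) pairs of sends whose waits complete without a match."""

    def advance(pcs, matched):
        # Issue nonblocking ops; complete any waits that can complete.
        pcs = list(pcs)
        for p, ops in enumerate(PROGRAM):
            while pcs[p] < len(ops):
                kind, arg = ops[pcs[p]]
                if kind in ("S", "R"):
                    pcs[p] += 1                      # issued immediately
                elif (p, arg) in matched or (p, arg) in buffered:
                    pcs[p] += 1                      # wait completes
                else:
                    break                            # wait blocks
        return tuple(pcs)

    def pending(pcs, matched, kind):
        return [(p, i, PROGRAM[p][i][1])
                for p in range(len(PROGRAM)) for i in range(pcs[p])
                if PROGRAM[p][i][0] == kind and (p, i) not in matched]

    def dfs(pcs, matched):
        pcs = advance(pcs, matched)
        sends = pending(pcs, matched, "S")
        recvs = pending(pcs, matched, "R")
        choices = [(s, r) for s in sends for r in recvs
                   if s[2] == r[0] and r[2] in ("*", s[0])]
        if not choices:   # no match possible: deadlock iff someone is stuck
            return any(pc < len(ops) for pc, ops in zip(pcs, PROGRAM))
        return any(dfs(pcs, matched | {(s[0], s[1]), (r[0], r[1])})
                   for s, r in choices)

    return dfs((0, 0, 0), frozenset())
```

With no buffering, `explore(set())` finds no deadlocking schedule; buffering S1,1 alone, `explore({(1, 0)})`, exposes the deadlock where R2,1 matches S0,3 and R2,3 is left without a sender.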
CHAPTER 5
POE AND REDUNDANT
INTERLEAVINGS
This chapter extends the POE algorithm to reduce the redundant interleavings
that it generates. Section 5.1 provides a few examples of how
POE generates redundant interleavings. We then define the InterHB relation in
Section 5.2 and use this to derive co-enabledness properties of MPI operations.
Section 5.3 then describes the POEOPT algorithm that uses the co-enabledness
properties derived in Section 5.2 to reduce the redundant interleavings in POE.
Section 5.3 also provides the proof that the backtrack set of every state generated
by the POEOPT algorithm is persistent.
5.1 POE and Redundant Interleavings
This section presents a few examples to describe scenarios where the POE
algorithm can contribute to redundant interleavings. The POE algorithm generates
multiple interleavings only in the presence of wildcard receives. For deterministic
MPI programs with no wildcard receives, the POE algorithm optimally produces
only a single interleaving. Hence, the programs of interest are those MPI programs
that have wildcard receives.
Consider the MPI program execution shown in Figure 5.1.
The POE algorithm executes Figure 5.1 from state σ0 as follows:
• The PS transitions are executed: σ0 −PS:{S0,1}→ σ1 −PS:{S2,1}→ σ2 −PS:{S4,1}→ σ3.
• The PR transitions are executed: σ3 −PR:{R1,1}→ σ4 −PR:{R3,1}→ σ5.
• The PW transitions are executed: σ5 −PW:{W0,2}→ σ6 −PW:{W1,2}→ σ7 −PW:{W2,2}→ σ8 −PW:{W3,2}→ σ9 −PW:{W4,2}→ σ10.
P0           P1           P2           P3           P4
S0,1(1)      R1,1(∗)      S2,1(1)      R3,1(∗)      S4,1(3)
W0,2(h0,1)   W1,2(h1,1)   W2,2(h2,1)   W3,2(h3,1)   W4,2(h4,1)
             R1,3(2)
             W1,4(h1,3)
Figure 5.1. Redundant POE Interleavings
• There are no more process transitions available at σ10 and enabled(σ10) = {
RSR∗ : {S0,1, R1,1}, RSR∗ : {S2,1, R1,1}, RSR∗ : {S4,1, R3,1}}.
• For state σ10, backtrack(σ10) = enabled(σ10).
• The POE algorithm will generate one interleaving for each transition in
backtrack(σ10), resulting in 3 interleavings.
However, the above program requires only 2 interleavings. Also, if there were
two more processes such that RSR∗ : {S5,1(6), R6,1(∗)} were also enabled in σ10,
the POE algorithm would generate 10 interleavings while only 2 interleavings are
sufficient to detect the deadlock present in the program.
The redundancy in the POE algorithm arises when there are multiple RSR∗
transitions enabled in a state σi and the wildcard receives involved in the RSR∗
transitions are different. POE generates these redundant interleavings in order to
keep the backtrack sets persistent. Consider the example in Figure 5.2.
The POE algorithm will execute Figure 5.2 as follows:
P0           P1           P2           P3
S0,1(1)      R1,1(∗)      S2,1(3)      R3,1(∗)
W0,2(h0,1)   W1,2(h1,1)   W2,2(h2,1)   W3,2(h3,1)
S0,3(3)      R1,3(∗)      S2,3(1)      R3,3(∗)
W0,4(h0,3)   W1,4(h1,3)   W2,4(h2,3)   W3,4(h3,3)
Figure 5.2. POE and Persistent Sets
• The PS transitions are executed: σ0 −PS:{S0,1}→ σ1 −PS:{S2,1}→ σ2.
• The PR transitions are executed: σ2 −PR:{R1,1}→ σ3 −PR:{R3,1}→ σ4.
• The PW transitions are executed: σ4 −PW:{W0,2}→ σ5 −PW:{W1,2}→ σ6 −PW:{W2,2}→ σ7 −PW:{W3,2}→ σ8.
• There are no more process transitions available at σ8 and enabled(σ8) = {
RSR∗ : {S0,1, R1,1}, RSR∗ : {S2,1, R3,1}}.
If backtrack(σ8) ≠ enabled(σ8), the POE algorithm would generate only 3
interleavings when there are 4 relevant interleavings that match {R1,1, S0,1},
{S2,3, R1,1}, {S2,1, R3,1}, {S0,3, R3,1}. Such a backtrack(σ8) is therefore not persistent.
The goal of this chapter is to reduce the redundant interleavings while keeping
the backtrack sets persistent. One simple optimization would be to look at the
generated interleaving I = σ0 → σ1 → . . . → σn and check whether there is some send
Sm,n(i) such that for all σi generated in I, RSR∗ : {Sm,n(i), Ri,j(∗)} /∈ enabled(σi),
and update the backtrack set to the enabled set only if such an Sm,n exists. This will
fix the redundant interleavings in Figure 5.1 and will also maintain the persistent
backtrack sets for Figure 5.2. Now let us apply the simple optimization to the
example in Figure 5.3.
When the simple optimization is applied to the POE algorithm execution of
Figure 5.3, the optimization would find that S3,5 can be matched with R1,1 and
P0           P1           P2           P3
S0,1(1)      R1,1(∗)      R2,1(∗)      S3,1(2)
W0,2(h0,1)   W1,2(h1,1)   W2,2(h2,1)   W3,2(h3,1)
             S1,3(3)                   R3,3(1)
             W1,4(h1,3)                W3,4(h3,3)
             R1,5(∗)                   S3,5(1)
             W1,6(h1,5)                W3,6(h3,5)
Figure 5.3. Simple Optimization and Redundancy
that there is no RSR∗ : {S3,5, R1,1} transition enabled in any state. This will cause
both RSR∗ : {S0,1, R1,1} and RSR∗ : {S3,1, R2,1} to be added to the backtrack set.
However, notice that S3,5 and R1,1 will never be in Ready(σi) for any state σi to
form a RSR∗ transition. The number of relevant interleavings for Figure 5.3 is one
while two interleavings are generated even by applying the simple optimization.
The POE algorithm is only aware of the IntraHB relation which dictates the
order in which the operations enter and leave the Ready set within a process. In
order to address redundancy issues, the POE algorithm must also be able to detect
whether two MPI operations across processes can be in the Ready set at the same
state to form a transition. The POE algorithm does not have this information
available.
Section 5.2 introduces the InterHB relation, which helps characterize
the co-enabledness properties of MPI operations across processes.
5.2 InterHB and Co-enabledness
The MPI runtime (R) transitions that match various MPI operations (RSR,
RSR∗, RBC) are enabled in a state σi depending on the MPI operations available
in Ready(σi). For example, an RSR : {Si,j(k), Rk,l(i)} transition is in enabled(σi)
only when Si,j(k) ∈ Ready(σi) and Rk,l(i) ∈ Ready(σi). An RBC transition is in
enabled(σi) only when the barrier operations B of all processes are in Ready(σi).
In order to eliminate the redundancy due to the multiple RSR∗ transitions, we
only need to know if there exists any state σi such that Si,j(k) and its matching
wildcard receive Rk,l(∗) can both be in Ready(σi). We therefore wish to detect the
co-enabledness of MPI operations where the co-enabledness of two MPI operations
Mi,j and Nk,l (M, N ∈ {S,R,B,W}) is defined as follows:
Definition 5.1 Two MPI operations Mi,j and Nk,l, with M, N ∈ {S,R,W, B},
are co-enabled iff {Mi,j, Nk,l} ⊆ Ready(σi) for some state σi.
The Ready set of a state is a function of the IntraHB relation among the
MPI operations defined in section 3.2. The following lemma directly follows from
Definition 3.7 of Ready set and the fact that the IntraHB relations do not change
across interleavings.
Lemma 5.2 If two MPI operations Mi,j and Ni,k (j < k, M, N ∈ {S,R,W, B})
are such that 〈Mi,j, Ni,k〉 ∈ IntraHB(σi), then there is no state σi such that
{Mi,j, Ni,k} ⊆ Ready(σi). That is, Mi,j can never be co-enabled with Ni,k.
Lemma 5.2 says that MPI operations that are related by the IntraHB relation
can never be co-enabled. The IntraHB relation is only among MPI operations with
the same MPI process rank.
Since co-enabledness among MPI operations is defined based on their pres-
ence/absence in the Ready set of the states, the only transitions that can cause MPI
operations to be added or removed from the Ready set are RSR,RBC ,RWC ,RSR∗.
We need to find the co-enabledness among MPI operations across process ranks to
detect whether a wildcard receive and a matching send issued by a different process
can be co-enabled. To do so, we now add an InterHB relation among
MPI operations across process ranks.
The InterHB relation is defined from the IntraHB relation and the match-sets
formed between MPI operations in an interleaving. Figure 5.4 shows the
IntraHB and InterHB relation across MPI operations. The IntraHB relation is
shown as a solid line between MPI operations within the same process rank and
InterHB edges are shown as dotted lines between MPI operations across MPI
processes.
(a) InterHB for deterministic matches. (b) InterHB for nondeterministic matches. (c) InterHB for barrier matches.
Figure 5.4. InterHB Relation Across Match-sets
Consider Figure 5.4(a). Let Ri,j and Sm,n be such that {Ri,j, Sm,n} ⊆ Ready(σ).
From Lemma 5.2, we know that Ri,j and Mi,k can never be co-enabled. Similarly,
Sm,n and Nm,p can never be co-enabled. However, when a match set is formed
between Ri,j and Sm,n, both of them leave their Ready set at the same time. This
means that Sm,n cannot be co-enabled with Mi,k and Ri,j cannot be co-enabled with
Nm,p. We hence show this using a dotted edge (InterHB relation). The same holds
for Figure 5.4(c).
For nondeterministic matches, even if Ri,j(∗) and Sm,n are in the Ready(σ), it
is still possible that the Ri,j(∗) can match with some other send and can cause Sm,n
to remain in the Ready set while Ri,j is removed. Therefore, it is possible for Sm,n
to be co-enabled with Mi,k. However, Ri,j can never be co-enabled with Nm,p which
is shown as a dotted edge in Figure 5.4(b). The InterHB relation is generated
only after an interleaving I = σ0 → σi → . . . → σn is generated using the POE
algorithm.
We now formally define InterHB.
Definition 5.3 For an interleaving I = σ0 → σ1 → . . . → σn,
InterHB(σn as 〈I,M, C, R, ls〉) ⊆ I × I is defined as follows:
• If {Ri,j(m), Sm,n(i)} ⊆ Ready(σj) where σj is some state in I, then for all
Mi,k ∈ Descendants(σn, Ri,j) and Nm,p ∈ Descendants(σn, Sm,n) we have
that 〈Ri,j, Nm,p〉 ∈ InterHB(σn) and 〈Sm,n, Mi,k〉 ∈ InterHB(σn).
• If {Ri,j(∗), Sm,n(i)} ⊆ Ready(σj) where σj is some state in I, then for all
Nm,p ∈ Descendants(σn, Sm,n) we have that 〈Ri,j, Nm,p〉 ∈ InterHB(σn).
• If {Bi,j, Bm,n} ⊆ Ready(σj) where σj is some state in I, then for all Mi,k ∈
Descendants(σn, Bi,j) and Nm,p ∈ Descendants(σn, Bm,n) we have that 〈Bi,j, Nm,p〉 ∈
InterHB(σn) and 〈Bm,n, Mi,k〉 ∈ InterHB(σn).
Definition 5.4 Given an interleaving I = σ0 → σ1 → . . . → σn, HB(I) =
IntraHB(σn) ∪ InterHB(σn).
Let HB∗(I) denote the transitive closure of HB(I). When the context is clear,
we denote HB(I) as HB and HB∗(I) as HB∗.
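Since HB∗ is just the transitive closure of a finite relation, it can be computed with a standard Floyd-Warshall-style pass. The sketch below hand-encodes a fragment of the HB graph of Figure 5.3 (the edge list is an assumption for illustration, not the full relation) to check that R1,1 and S3,5 are HB∗ related:

```python
def transitive_closure(edges):
    """Floyd-Warshall-style closure of a relation given as a set of pairs."""
    closure = set(edges)
    nodes = {x for edge in edges for x in edge}
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if (i, k) in closure and (k, j) in closure:
                    closure.add((i, j))
    return closure

# Assumed fragment of Figure 5.3's IntraHB edges along process P1:
intra_hb = {("R1,1", "W1,2"), ("W1,2", "S1,3"), ("S1,3", "W1,4"),
            ("W1,4", "R1,5"), ("R1,5", "W1,6")}
# From the deterministic match {S1,3, R3,3}: S1,3 InterHB-precedes the
# descendants of R3,3, which include S3,5.
inter_hb = {("S1,3", "S3,5")}

hb_star = transitive_closure(intra_hb | inter_hb)
# R1,1 reaches S3,5 through S1,3, so by Lemma 5.5 the two can never be
# co-enabled in any equivalent interleaving:
assert ("R1,1", "S3,5") in hb_star
```

This is exactly the path shown with darker lines in Figure 5.5.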
The following lemma follows directly from the construction of InterHB from
IntraHB.
Lemma 5.5 Let I and I ′ be two equivalent interleavings. For two MPI operations
Mi,j and Nk,m, if 〈Mi,j, Nk,m〉 ∈ HB∗(I), then 〈Mi,j, Nk,m〉 ∈ HB∗(I ′).
Proof : Since I and I ′ are equivalent, the M sets are the same in the final
states of the interleavings. If there are no nondeterministic receives, the InterHB
relation between the operations will also be the same since the deterministic receive
and its matching send must be co-enabled in some state in both I and I ′. If there
is a wildcard receive, then it is possible that there are two sends Si,j(l) and Sm,n(l)
co-enabled in a state with the matching wildcard receive Rl,r(∗) in I but Sm,n is not
co-enabled with Rl,r in I ′. In this case, Sm,n is matched with Rl,r′ where r′ > r
in I ′ and there is an InterHB relation from Rl,r′ to descendants of Sm,n. Also,
there is an IntraHB relation from Rl,r to Rl,r′ in both I and I ′. By transitivity of
HB, Rl,r and all descendants of Sm,n are also HB related in I ′.
Since the backtrack sets for a state are updated based on the current interleaving I,
Lemma 5.5 facilitates optimizations that can reduce the redundant
interleavings in the POE algorithm.
Figure 5.5 shows the HB relation for the MPI program in Figure 5.3 as a graph
among the MPI operations. The IntraHB relation is shown using solid lines and
InterHB is shown using dashed lines. Note that R1,1 is HB related to the matching
send S3,5 (shown using darker lines). Therefore R1,1 and S3,5 cannot be co-enabled
in an equivalent interleaving by Lemma 5.5. The reader may verify that R1,1 can
indeed never be matched with S3,5. The POE algorithm can now be extended to
update the backtrack(σ) to enabled(σ) in an interleaving only when the receive and
send are not HB related. This can still cause redundant interleavings.
Consider the MPI execution in Figure 5.6. The POE algorithm, using the HB
relation, would work as follows:
Figure 5.5. HB Relation for Figure 5.3 Shown as Graph
Figure 5.6. Redundancy with New POE Algorithm
1. The first interleaving I is generated by GenerateInterleaving.
2. There is a state σi in I such that enabled(σi) = {RSR∗ : {S0,1, R1,1},RSR∗ :
{S3,1, R2,1},RSR∗ : {S5,1, R4,1}}. Let curr(σi) = RSR∗ : {S0,1, R1,1}.
3. The UpdateBacktrack is invoked on σi.
4. Since S3,3 can be matched with R1,1, and S3,3 and R1,1 are not HB(I) related,
backtrack(σi) = enabled(σi).
However, note that it is not required to add RSR∗ : {S5,1, R4,1} to backtrack(σi).
Adding redundant transitions to the backtrack set can exponentially increase the
number of interleavings.
The POE algorithm must be able to decide which transitions must be added to
the backtrack set in order to co-enable R1,1 and S3,3 in some state, instead of adding
every enabled transition to the backtrack set. The algorithm could minimally add
just RSR∗ : {S3,1, R2,1} to backtrack(σi), since R2,1 and S3,3 are HB∗ related. This is
because the HB relation also provides the order in which the MPI operations enter
the Ready set (i.e., enabling order). We use this insight to develop the POEOPT
algorithm described in the next section.
5.3 POE Algorithm Modified
We now present the POEOPT algorithm (Figures 5.7, 5.8, 5.9, 5.10, 5.11) which
extends the POE algorithm to handle the redundant interleavings due to RSR∗
transitions in the backtrack set. The POEOPT algorithm differs from POE only in
the way the backtrack sets are updated. The rest of the algorithm is exactly the same.
UpdateBacktrack only updates the backtrack set for the states that
have only RSR∗ transitions and invokes AddtoBacktrack. For the rest of
the states, the backtrack sets remain unchanged. Consider a state σi that has
only RSR∗ transitions enabled. The algorithm selects one of the RSR∗ transitions.
The original POE algorithm would then update the backtrack(σi) to enabled(σi).
Instead, the POEOPT algorithm updates the backtrack sets such that, if Ri,j(∗)
is the receive involved in curr(σi), then backtrack(σi) is updated with the DTG of
curr(σi) in enabled(σi). In order to detect whether it is possible for some other send
Sk,l(i) /∈ Ready(σi) (i.e., Sk,l is not co-enabled with Ri,j) to match with Ri,j,
HB∗ is consulted to check whether 〈Ri,j, Sk,l〉 ∈ HB∗. If 〈Ri,j, Sk,l〉 ∈ HB∗, then
backtrack(σi) is not updated. Otherwise, some RSR∗ : {Rp,q(∗), Sm,n(p)} transition
is added to backtrack(σi), where 〈Rp,q, Sk,l〉 ∈ HB∗.
We now prove that the backtrack sets are persistent for every state in statevec.
1:  POEOPT(σ0, statevec) {
2:    statevec.push(σ0);
3:    curr(σ0) = GetTransition(enabled(σ0));
4:    backtrack(σ0) = backtrack(σ0) ∪ {curr(σ0)};
5:    while (! statevec.empty()) {
6:      GenerateInterleaving(statevec);
7:      UpdateBacktrack(statevec);
8:      for (i = statevec.size()−1; i ≥ 0; i−−) {
9:        if (backtrack(statevec[i]) == done(statevec[i])) {
10:         statevec[i].pop();
11:       } else {
12:         break;
13:       }
14:     }
15:   }
16: }
Figure 5.7. Pseudocode for POEOPT Algorithm
1: GetTransition(set of transitions T ) {
2:   if hasnon∗(T )
3:     return choosenon∗(T );
4:   else
5:     return choose∗(T );
6: }
Figure 5.8. Pseudocode for GetTransition
1: UpdateBacktrack(statevec) {
2:   for each (σ ∈ statevec) {
3:     if (enabled(σ) = ∅)
4:       return;
5:     ti = curr(σ);
6:     if (is∗(ti))
7:       AddtoBacktrack(ti, σ, statevec);
8:   }
9: }
Figure 5.9. Pseudocode for UpdateBacktrack
1:  AddtoBacktrack(Transition ti, σ, statevec) {
2:    backtrack(σ) = backtrack(σ) ∪ DTG(σ, ti);
3:    let Ri,j(∗) be the receive operation of ti;
4:    for each Sk,l /∈ Ready(σ) such that 〈Ri,j, Sk,l〉 /∈ HB∗ {
5:      for each (t ∈ enabled(σ) − backtrack(σ)) {
6:        let Rp,q(∗) be the receive operation of t;
7:        if (〈Rp,q, Sk,l〉 ∈ HB∗) {
8:          backtrack(σ) = backtrack(σ) ∪ {t};
9:        }
10:     }
11:   }
12: }
Figure 5.10. Pseudocode for AddtoBacktrack
1:  GenerateInterleaving(statevec) {
2:    σ = statevec[0];
3:    for (i = 0; i < statevec.size()−1; i++) {
4:      σ = Execute(statevec[i], curr(statevec[i]));
5:    }
6:    curr(σ) = GetTransition(backtrack(σ) − done(σ));
7:    do {
8:      σ = Execute(σ, curr(σ));
9:      statevec.push(σ);
10:     curr(σ) = GetTransition(enabled(σ));
11:     backtrack(σ) = backtrack(σ) ∪ {curr(σ)};
12:     done(σ) = done(σ) ∪ {curr(σ)};
13:   } while (enabled(σ) ≠ ∅);
14: }
Figure 5.11. Pseudocode for GenerateInterleaving
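The AddtoBacktrack routine of Figure 5.10 can be sketched in Python as follows. All parameter names are assumptions introduced for this illustration: `ready`, `enabled`, `dtg`, `recv_of`, and `hb_star` stand in for Ready(σ), enabled(σ), DTG(σ, t), the receive of an RSR∗ transition, and the HB∗ relation (encoded as a set of pairs). Unlike the pseudocode, this sketch returns the new backtrack set instead of mutating global state.

```python
def add_to_backtrack(t_i, sigma, all_sends, ready, enabled, hb_star,
                     dtg, recv_of):
    """Hedged sketch of AddtoBacktrack (Figure 5.10)."""
    backtrack = set(dtg(sigma, t_i))
    r = recv_of(t_i)                                   # the wildcard Ri,j(*)
    for s in all_sends:                                # candidate later sends
        if s in ready(sigma) or (r, s) in hb_star:
            continue           # already co-enabled, or can never match r
        for t in enabled(sigma) - backtrack:
            if (recv_of(t), s) in hb_star:             # t's match enables s
                backtrack.add(t)
    return backtrack
```

On the Figure 5.6 scenario, where S3,3 becomes ready only after R2,1 is matched, this adds RSR∗ : {S3,1, R2,1} to the backtrack set but leaves out the redundant RSR∗ : {S5,1, R4,1}.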
Theorem 5.6 For any state σ generated by the POEOPT algorithm, backtrack(σ)
is persistent.
Proof: We prove by induction along the postorder.
Basis Case: The final states are persistent because the set of enabled transi-
tions is empty.
Induction Hypothesis: All successors of σi are persistent.
Induction Step: Consider the transition ti taken out of σi in the current
interleaving. Clearly, ti is the first transition of many interleavings that start from
σi, all of which have already been explored. Consider a transition t, as in Figure 5.10,
involving the receive Rp,q(∗). Suppose we have not formed a persistent set at σi
and thereby have left out t from the persistent set. That is, t starts an interleaving
I in the full state space, but we did not include t in our persistent set.
Since t itself is independent of ti (they are not in the same DTG), this means
that t must lead to a transition t^d_i that is dependent on ti. Such a transition must
be in the same DTG as ti, and involve a send such as Sk,l(i). Clearly, Sk,l(i) was
not ready at σi (or else t^d_i would have been in the persistent set at σi). Since Sk,l(i)
did get enabled later (it was the send transition that was part of t^d_i), we must have
〈Rp,q, Sk,l〉 ∈ HB∗. But now, since t and ti are independent, we can surmise that an
interleaving equivalent to I, say I ′, was pursued, and that ti is the first transition
of such an interleaving.
There are two relationships between I and I ′:
• They have the same HB relationship (Lemma 5.5),
• Since equivalent transition sequences represent the same matching decisions
between MPI processes, they define the same control flow branching decisions.
The latter fact was tacit in most of our descriptions, but we are explicating it
for clarity now. More specifically, since I is based on a dynamic execution sequence,
the reader may wonder whether the same dynamic execution sequence would exist
as some I ′. It would indeed, as this argument shows. What we are saying is
that the same MPI instructions would be “processed” along I ′ also, and these
MPI instructions would be situated in the same HB relation. Furthermore, due to
our induction hypothesis, we can say that all the computations after ti were done
“correctly,” meaning that an equivalent interleaving I ′ will indeed be found.
Now that we have established that I ′ would also compute the same HB, we
can observe from our algorithm in Figure 5.10 that we would have added t to the
backtrack set at σi, contradicting the fact that t is outside the persistent set.
CHAPTER 6
DETERMINISTIC MPI PROGRAMS
This chapter proves that, for deterministic MPI programs, if Ri,j and Sm,n are a
receive and a send operation such that 〈Ri,j, Sm,n〉 /∈ HB∗(I) and 〈Sm,n, Ri,j〉 /∈ HB∗(I),
then Ri,j and Sm,n are co-enabled in some equivalent interleaving I ′.
6.1 Deterministic MPI Programs and HB
Deterministic MPI programs have the properties enumerated below:
• Since there are no wildcard receives in deterministic MPI programs, there
is only one relevant interleaving and every interleaving is equivalent to any
other interleaving of the program.
• The HB∗ relation will remain the same for any interleaving.
• If there is no deadlock in an interleaving, then there can be no deadlocks in
any other equivalent interleaving of the program.
We now prove that, for a receive operation Ri,j and send operation Sm,n in a
deterministic program (note that Ri,j and Sm,n need not match; they may not even
target each other), if Ri,j and Sm,n cannot be co-enabled, then 〈Ri,j, Sm,n〉 ∈ HB∗
or 〈Sm,n, Ri,j〉 ∈ HB∗.
Lemma 6.1 Consider a deadlock-free interleaving of a deterministic MPI program
I = σ0 −t0→ σ1 −t1→ . . . −tn−1→ σn. If Ri,j cannot be co-enabled with Sm,n in I,
then 〈Ri,j, Sm,n〉 ∈ HB∗ or 〈Sm,n, Ri,j〉 ∈ HB∗.
Proof : Given an interleaving I = σ0 → σ1 → . . . → σn, we use the notation
σi < σj when i < j to denote that σi was generated before σj in I.
For ease of notation, we denote the issue order among MPI operations of a
process as follows: if Mi,j is an MPI operation, we write Ni,j′ for an MPI
operation that is issued after Mi,j, where j′ > j. Similarly, we write Fi,j′′
for an MPI operation issued after Ni,j′, where j′′ > j′.
Let σa be the state in I where RSR : {Ri,j, Sk,l} is executed for some Sk,l and
σb be the state in I where RSR : {Rp,q, Sm,n} is executed for some Rp,q.
Since Ri,j and Sm,n cannot be co-enabled, either σa < σb or σb < σa. We prove
by contradiction for the case when σa < σb (the other case is similar).
• Assume that 〈Ri,j, Sm,n〉 /∈ HB∗.
• Consider an interleaving I ′ equivalent to I where RSR : {Ri,j, Sk,l} is executed
in σa′ only when enabled(σa′) = {RSR : {Ri,j, Sk,l}} (i.e., RSR : {Ri,j, Sk,l} is
executed only when there is no other transition to be executed).
• Since I and I ′ are equivalent, HB∗(I) and HB∗(I ′) are equal (Lemma 5.5).
We use HB∗ to denote the HB relation for both I and I ′.
• Since RSR : {Ri,j, Sk,l} is the only transition in enabled(σa′), all the processes
must be blocked either at a W or a B.
• If all the processes are blocked at B operation, this will cause a RBC to be
enabled in σa′. This is not possible since enabled(σa′) is a singleton containing
only the RSR : {Ri,j, Sk,l} transition. Hence, at least one of the processes must
be blocked at a W operation.
• If both Pi and Pk are blocked at a B, then executing RSR : {Ri,j, Sk,l} will not
unblock the B operations (since there is some other process that is blocked
on a W ) and hence will result in a deadlock. This is not possible since I is
deadlock-free. Hence, either Pi or Pk must be blocked on a W .
• If both Pi and Pk are blocked at a W :
– Assume that the process Pi is blocked at Wi,j′ such that 〈Ri,j, Wi,j′〉 /∈
HB∗(I ′) (i.e., Wi,j′ is not the W corresponding to the receive Ri,j).
– Similarly, assume that the process Pk is blocked at Wk,l′ such that
〈Sk,l, Wk,l′〉 /∈ HB∗.
– Executing the RSR : {Ri,j, Sk,l} transition from σa′ will not unblock any
of the waits causing a deadlock.
– Since I is deadlock-free, at least one of Wi,j′ or Wk,l′ must be such
that 〈Ri,j, Wi,j′〉 ∈ HB∗ or 〈Sk,l, Wk,l′〉 ∈ HB∗.
• Without loss of generality, assume that 〈Ri,j, Wi,j′〉 ∈ HB∗.
• Since {Ri,j, Sk,l} ⊆ Ready(σa′), 〈Sk,l, Wi,j′〉 ∈ HB∗ (by InterHB construc-
tion). Also, for all Fi,j′′ where F ∈ {S,R,W, B}, 〈Sk,l, Fi,j′′〉 ∈ HB∗.
• Executing the RSR : {Ri,j, Sk,l} transition from σa′ will unblock Wi,j′ .
• Some MPI operation Fi,j′′ following Wi,j′ will unblock some process Pr.
– If Pr is blocked at Wr,k, then 〈Fi,j′′ , Wr,k〉 ∈ HB∗ since Fi,j′′ must be
matched with the send or receive corresponding to Wr,k. Also, for all
Mr,k′ , we have 〈Fi,j′′ , Mr,k′〉 ∈ HB∗. Therefore, 〈Ri,j, Mr,k′〉 ∈ HB∗.
– If Pr is blocked at Br,k, then Fi,j′′ = Bi,j′′ . Therefore, 〈Bi,j′′ , Fr,k′〉 ∈ HB∗
for all k′ > k. Therefore, 〈Ri,j, Fr,k′〉 ∈ HB∗.
• Hence, as each process unblocks, there is an HB∗ relation from Ri,j to all the
MPI operations following the blocked W or B operations of every process.
For every process Pl, that is blocked at Wl,p or Bl,p, 〈Ri,j, Fl,p′〉 ∈ HB∗.
• If Sm,n is issued after the blocking W or B of Pm, then 〈Ri,j, Sm,n〉 ∈ HB∗.
This is a contradiction.
• If Sm,n were issued before the blocking W or B of Pm, since Sm,n and Ri,j are
not co-enabled, Sm,n /∈ Ready(σa′). Therefore, there is some MPI operation
Fm,r ∈ Ready(σa′) and r < n and 〈Fm,r, Sm,n〉 ∈ HB∗. Since enabled(σa′)
consists of only a single transition, Fm,r must be matched with an MPI
operation Fl,p′ of some process Pl that is issued after the blocking operation
of Pl (Wl,p or Bl,p). Therefore, 〈Fl,p′, Sm,n〉 ∈ HB∗. Also, 〈Ri,j, Fl,p′〉 ∈ HB∗.
Therefore, 〈Ri,j, Sm,n〉 ∈ HB∗. This is a contradiction.
Lemma 6.2 (Corollary of Lemma 6.1) Consider a deadlock-free interleaving of
a deterministic MPI program I = σ0 −t0→ σ1 −t1→ · · · −tn−1→ σn. If 〈Ri,j, Sm,n〉 /∈ HB∗
and 〈Sm,n, Ri,j〉 /∈ HB∗, then Ri,j and Sm,n are co-enabled in some state.
The only difference between the HB∗ for a deterministic MPI program and an
MPI program with wildcard receives is the absence of the InterHB edge between
a send and descendants of the matching receive. Given an interleaving I, we define
the Deterministic(HB(I)) as follows:
Definition 6.3 Given an interleaving I = σ0 → σ1 → . . . → σn, Deterministic(
HB(I)) = HB(I) ∪ {〈Si,j(k), Fk,p′〉 | {Si,j, Rk,p} ∈ M(σn), Fk,p′
∈ Descendants(σn, Rk,p)}.
By generating a deterministic HB for an interleaving, if there is no path between
a receive and a send in Deterministic(HB), then the receive and send can be
co-enabled in an equivalent interleaving. We use this result to find the sends that
must be buffered to eliminate the HB∗ relations between a send and a matching
wildcard receive so as to co-enable them.
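As a concrete illustration of this check (a minimal sketch, not the tool's implementation: MPI operations are encoded as plain strings, and `hb_edges`, `matches`, and `descendants` are hypothetical encodings of HB(I), the M set of σn, and Descendants(σn, ·)):

```python
from collections import defaultdict

def deterministic_hb(hb_edges, matches, descendants):
    """Deterministic(HB(I)) per Definition 6.3: add an edge from each
    send to every descendant of its matching receive."""
    edges = set(hb_edges)
    for send, recv in matches.items():
        for op in descendants.get(recv, ()):
            edges.add((send, op))
    return edges

def reachable(edges, src, dst):
    """True iff there is a path src -> dst, i.e., the pair is HB*-related."""
    adj = defaultdict(list)
    for a, b in edges:
        adj[a].append(b)
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(adj[node])
    return False
```

A receive R and a send S may be co-enabled in an equivalent interleaving exactly when neither `reachable(det, R, S)` nor `reachable(det, S, R)` holds in the deterministic graph.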
CHAPTER 7
HANDLING SLACK IN MPI PROGRAMS
This chapter deals with the slack-inelastic deadlocks of MPI programs described
in Section 4.4. Section 7.2 provides the reasons why buffering all sends is not a
solution. Section 7.4 characterizes the complexity of finding minimal sets of sends
to be buffered that can guarantee to detect all deadlocks, including head-to-head
deadlocks. Section 7.5 describes the minimal slack enumeration variant of POEOPT,
namely POEMSE, and its proof of correctness.
7.1 Verification for Portability
The importance of verifying a program for portability cannot be over-emphasized.
Given the growing popularity of dynamic verification, many important questions
must be answered:
• Does it matter where we run dynamic verification?
• In particular, having verified a program on one platform, what can we say
about its correctness on another platform?
While it is too ambitious to solve verification for portability in general, this
dissertation offers the following unique contribution: For buffering (slack) sensitive
behavioral variations, POEMSE guarantees that verifying a program to be correct
on any platform implies correctness on all platforms. In effect, POEMSE is able
to compute where slack matters, and simulate slack whenever it matters during
verification.
It is known from [31] (in the context of CHP programs - CSP for hardware) and
[48] (for MPI) that some MPI programs can have more behaviors when buffer
sizes are increased (say, when their eager limits are increased). Yet, the MPI
community is still unaware of these results. Commercial tools such as the Intel
Message Checker [22] still look for deadlocks by setting all send buffer sizes to zero
and relying on timeouts to tell when a deadlock has been encountered. However, the
deadlock in Figure 7.1 cannot be revealed by this approach. While one may hope to
detect deadlocks by simulating infinite buffering, the example in Figure 7.2 shows
that sometimes deadlocks are triggered only if some of the sends are buffered. While
we shall detail these examples in Section 7.2, the nasty reality is that one must be
prepared to verify for all combinations of sends with/without buffering. POEMSE
avoids this exponential cost by determining where (for which sends) slack matters,
and only replays the analysis for those.
As a real world example of the complexity of predicting when a send will have
slack, consider the discussion of eager limit computation given in [20]:
The Parallel Environment implementation of MPI uses an eager send protocol for messages whose size is up to the eager limit. This value can be allowed to default, or can be specified with the MP_EAGER_LIMIT environment variable or the -eager_limit command-line flag. In an eager send, the entire message is sent immediately to its destination and the send buffer is returned to the application. Since the message is sent without knowing if there is a matching receive waiting, the message may need to be stored in the early arrival buffer at the destination, until a matching receive is posted by the application. The MPI standard requires that an eager send be done only if it can be guaranteed that there is sufficient buffer space. If a send is posted at some source (sender) when buffer space cannot be guaranteed, the send must not complete at the source until it is known that there will be a place for the message at the destination.
P0               P1               P2
S0,1(1)          S1,1(2)          R2,1(∗)
W0,2(h0,1)       W1,2(h1,1)       W2,2(h2,1)
S0,3(2)          R1,3(0)          R2,3(0)
W0,4(h0,3)       W1,4(h1,3)       W2,4(h2,3)

Figure 7.1. Buffering Sends and Deadlocks
Figure 7.2. Specific Buffering Needed
PE MPI uses a credit flow control, by which senders track the buffer space that can be guaranteed at each destination. For each source-destination pair, an eager send consumes a message credit at the source, and a match at the destination generates a message credit. The message credits generated at the destination are returned to the sender to enable additional eager sends. The message credits are returned piggyback on an application message when possible. If there is no return traffic, they will accumulate at the destination until their number reaches some threshold, and then be sent back as a batch to minimize network traffic. When a sender has no message credits, its sends must proceed using rendezvous protocol until message credits become available. The fall back to rendezvous protocol may impact performance. With a reasonable supply of message credits, most applications will find that the credits return soon enough to enable messages that are not larger than the eager limit to continue to be sent eagerly.
From this discussion, it must be clear that algorithms such as POEMSE are
essential if we are to avoid the cost of an exponential slack analysis. While
POEMSE's analysis could, in the worst case, be exponential in the number of sends
issued at runtime, in practice the number of cases explored is extremely low
compared to this worst case.
7.2 Introduction to Slack Analysis
The POE algorithms developed until now assumed that all the sends are
nonbuffered, i.e., none of the sends is provided any runtime buffering or slack. We
now remove the buffering constraints on the sends and allow each send to be either
buffered or nonbuffered by the runtime. Once a send is buffered,
the send can be completed at any time. We now define the parameterized runtime
send buffering transition RSC as follows:
RSC(µ) :Σ(σ as 〈I,M, C, R, ls〉), {Si,j} = µ, Si,j /∈ C, {Si,j, Rk,l} /∈ M
Σ〈I,M, C ∪ {Si,j}, R, ls〉
The RSC(µ) transition completes any send in µ that is not matched yet.
In Section 7.4, we shall define a big-step transition RSBC(µ) that
completes all the sends in µ. We will be feeding as µ sets of sends that are
determined to be minimal. We shall now present examples that illustrate how these
minimal send sets are determined.
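For illustration only (not the actual verifier's data structures), the RSC transition can be sketched as a pure function over a stripped-down state 〈I, M, C〉, with sends encoded as strings:

```python
from dataclasses import dataclass

@dataclass
class State:
    """Hypothetical encoding of a slice of sigma = <I, M, C, R, ls>:
    `issued` is I restricted to sends, `matched` is the M set (as
    frozenset pairs), and `completed` is C."""
    issued: set
    matched: set
    completed: set

def rsc(state, send):
    """RSC({send}): complete an issued send that is neither completed nor
    matched yet, modeling runtime buffering (slack) for that send."""
    assert send in state.issued
    assert send not in state.completed
    assert not any(send in m for m in state.matched)
    return State(state.issued, state.matched, state.completed | {send})
```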
7.2.1 Zero Buffering Can Miss Deadlocks
Consider an MPI program execution in Figure 7.1. The MPI program in
Figure 7.1 will be deadlocked only when either S0,1 or S1,1 or both are buffered.
There is no deadlock when both S0,1 and S1,1 remain nonbuffered. The POEOPT
algorithm, which executes under zero buffering for all sends, will not be able to detect
this deadlock. Also note that it is sufficient to buffer just one of S0,1 or S1,1. The
buffering status of S0,3 will not matter in detecting the deadlock. This example
shows how providing slack to sends can cause communication races with respect to
wildcard receives and hence can result in erroneous behaviors.
One solution to detect all the communication races would be to buffer all the
sends in an MPI program. This will detect all the deadlocks involving communi-
cation races with respect to wildcard receives. Hence, it will now be sufficient to
execute the POE algorithm with full buffering for all sends to detect all commu-
nication races. However, nonbuffered sends are themselves a source of a deadlock.
Under insufficient buffering, a send can remain blocked on its wait when there is no
matching receive for the send and hence result in a deadlocked state. By allowing all
sends to be buffered, the POE algorithm will not be able to detect any head-to-head
deadlocks involving nonbuffered sends.
One solution that comes immediately to mind would be to run the POE algo-
rithm twice : once with all sends buffered and once with none of the sends buffered.
However, this will not detect all the deadlocks.
7.2.2 Too Much Buffering Can Miss Deadlocks
The example in Figure 7.2 will not deadlock when none of the sends are buffered
or all the sends are buffered. However, it would deadlock only when S0,1 is buffered
and S1,1 is not buffered.
When the POE algorithm is executed with zero buffering, S1,1 will match with
R2,1. This matching will cause S1,1 and R2,1 to complete and will unblock the waits
W1,2 and W2,2. S0,1 will now be matched with R1,3. This will unblock W0,2 and
W1,4. S0,3 will be matched with R2,3 which will unblock the waits W0,4 and W2,4.
R1,5 will be matched with S2,5 while P0 is blocked on W0,6. W1,6 and W2,6 will
unblock, resulting in matching R2,7 with S0,5. This will unblock W0,6. Finally, R0,7
will be matched with S1,7 which will cause the rest of the waits to unblock. Hence,
the POE algorithm has completed without a deadlock when none of the sends are
buffered.
Now consider the POE algorithm execution when all the sends are buffered.
The POE algorithm still executes with RSR∗ transition having the least priority.
S0,1 and S1,1 get buffered which will unblock W0,2 and W1,2. Similarly S0,3 and S0,5
will be buffered which causes W0,4 and W0,6 to be unblocked. R1,3 is matched with
S0,1. This matching will unblock W1,4. At this point, R2,1 can be matched with
either S1,1 or S0,3. When R2,1 is matched with S1,1, the match sets are the same
as those generated in the zero buffering case. When R2,1 is matched with S0,3, S1,1
will be matched with R2,7 and will terminate the program with no deadlock. Both
the buffering and nonbuffering executions will not detect a deadlock. The deadlock
only happens when only S0,1 is buffered and S1,1 and S2,5 are not buffered. The
rest of the sends may or may not be buffered. Since they are always matched,
their buffering status would not matter. We therefore look for those sends whose
buffering status would result in deadlocks.
A naive brute-force solution would be to execute the MPI program with all
possible buffering scenarios for all the sends. The example in Figure 7.2 would
result in at least 2^6 interleavings just for the six sends present in the execution.
This approach deteriorates rapidly as the number of sends in the program
increases. This chapter extends the POEOPT algorithm to handle slack and detect
any deadlock present with a reduced number of interleavings that does not
deteriorate as rapidly as the number of sends increases.
The crux of our analysis is to be able to tell that S0,1(1) is the only send with
this property in this whole program. To summarize:
• We must discover all minimal sets of sends to buffer so that other sends may
match with wildcard receives. In our example, we buffered S0,1(1), but the
send that matches with the wildcard receive as a result is S0,3(2).
• The minimal number of sends is not unique. Therefore, we must find all
possible such minimal sets of sends, and re-run the analysis for each of them.
• We must not buffer more than this minimal set in each case, because we may
then miss head-to-head deadlocks.
7.3 Using HB to Detect Slack
This section describes how to identify the slack properties of various sends based
on the HB relation. Since the HB relation is built after the program execution, the
first step would be to execute the POEOPT algorithm with all sends nonbuffered.
This will generate the initial HB graph. Buffering sends will also affect the HB
graph. When a send Si,j is buffered, Wi,j′(hi,j) will return immediately. We consider
such waits to have turned into no-ops. This will have the effect of deleting the
IntraHBs associated with these waits, i.e., for j′′ > j′ and F ∈ {S, R, W, B}, we
will remove 〈Wi,j′, Fi,j′′〉 from IntraHB. We call these waits culprit waits and the
sends associated with these waits culprit sends.
sends associated with these waits culprit sends.
Lemma 5.5 can be used to detect if a wildcard receive Ri,j and send Sm,n(i) can
be matched. Using this lemma, Ri,j and Sm,n cannot be co-enabled in any equivalent
interleaving I ′ if 〈Ri,j, Sm,n〉 ∈ HB∗(I) or 〈Sm,n, Ri,j〉 ∈ HB∗(I). Otherwise, they
may be co-enabled.
Of these cases, we need not consider the case of 〈Sm,n, Ri,j〉 ∈ HB∗ for this
simple reason. Suppose 〈Sm,n, Ri,j〉 ∈ HB∗ with respect to the initial nonbuffered
execution. Then, it means that there was an earlier receive in process i with which
Sm,n matched. Thus, nothing can make Sm,n match Ri,j in the buffered execution.
If 〈Ri,j, Sm,n〉 ∈ HB∗, then Ri,j and Sm,n cannot be co-enabled. However,
Lemma 6.2 uses the deterministic HB relation to detect that when 〈Ri,j, Sm,n〉 /∈
Deterministic(HB∗), they can be co-enabled. Therefore, when 〈Ri,j, Sm,n〉 ∈
HB∗, we need to detect the sends that can be buffered so that 〈Ri,j, Sm,n〉 /∈
Deterministic(HB∗). This means that it is sufficient to buffer those sends that will
cause the path from Ri,j to Sm,n in Deterministic(HB∗) to be broken.
7.3.1 HB Graph and Paths
We now describe how to detect the sends that need to be buffered to co-enable
a wildcard receive Ri,j and a send Sm,n(i) when 〈Ri,j, Sm,n〉 ∈ HB. We first convert
the HB relation into an HB graph called GHB, defined as follows:

Definition 7.1 GHB = (V, E), where the set of vertices V is the set of MPI
operations invoked by the various processes, and if 〈opi, opj〉 ∈ HB, then 〈opi, opj〉 ∈ E.
Hence, if two MPI operations opi and opj are HB-related, there is a path
between opi and opj in GHB. When a send is buffered, the HB relation is updated
by removing any edges going out of the W corresponding to the send, which will
break the paths through that wait.
Given a wildcard receive and its matching send in an interleaving I, we generate
the GDeterministic(HB(I)) graph and break the paths in GDeterministic(HB(I)). If there
is no path between a receive and a send in GDeterministic(HB(I)), we know from
Lemma 6.2 that the receive and send can be co-enabled in some state in an
equivalent interleaving.
If a path contains multiple culprit waits, buffering the send corresponding to just
one of those waits is sufficient to break the path.
Figure 7.3 shows the path between R2,1 and its matching send S0,3, which involves
the waits W1,2 and W0,2. It is sufficient to buffer the send corresponding to either
of the waits, as we have described before. We also need to buffer sends in all possible
ways in order to detect deadlocks involving any of the sends. We hence need to
detect all possible ways to break the paths given a set of culprit waits involved in
various paths between a wildcard receive and its matching send. We call these sets
minimal wait sets.
Figure 7.3. Path Breaking
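The path enumeration and the path-breaking effect of buffering can be sketched as follows (a minimal illustration assuming a hypothetical adjacency-list encoding of GHB; buffering a culprit send is modeled by deleting the outgoing edges of its wait):

```python
def all_paths(adj, src, dst, path=None):
    """Enumerate all simple paths src -> dst in the HB graph via DFS.
    `adj` maps an operation to the list of its HB successors."""
    path = (path or []) + [src]
    if src == dst:
        return [path]
    found = []
    for nxt in adj.get(src, ()):
        if nxt not in path:  # keep paths simple
            found.extend(all_paths(adj, nxt, dst, path))
    return found

def buffer_send(adj, culprit_wait):
    """Buffering a send turns its wait into a no-op: delete every IntraHB
    edge going out of the corresponding wait."""
    return {op: ([] if op == culprit_wait else succs)
            for op, succs in adj.items()}
```

On a chain resembling Figure 7.3, buffering the send of either wait on the path empties the set of paths, so the wildcard receive and its candidate send become co-enabled.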
7.4 Finding Minimal Wait Sets
Definition 7.2 Let π be a path between two MPI operations in GDeterministic(HB)
= (V, E). Let OnPath(π) be the set of Wi,j(hi,k) operations on path π such that
〈Si,k, Wi,j〉 ∈ E (these are the culprit waits and their associated culprit sends).

Definition 7.3 Let ζ be the set of all paths between Ri,j and Sk,l(i) such that
for every π ∈ ζ, OnPath(π) ≠ ∅. Let Wall = ∪π∈ζ OnPath(π) be the set of all the
culprit waits on all paths. With respect to Wall, we can now define a minimal wait
set Wmin ⊆ Wall as follows:
For any {w, w′} ⊆ Wmin and any path π ∈ ζ, w ∈ OnPath(π) ⇒ w′ /∈ OnPath(π),
and for every path π ∈ ζ, ∃c ∈ Wmin such that c ∈ OnPath(π).
That is, there is exactly one wait in Wmin whose send is buffered on every path.
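The "exactly one wait of Wmin per path" condition is cheap to check for a candidate set; a sketch (each path is encoded simply as the list OnPath(π) of its culprit waits, a hypothetical encoding):

```python
def is_minimal_wait_set(paths, w_min):
    """Definition 7.3 check: every path must contain exactly one wait
    from the candidate set w_min."""
    return all(sum(1 for w in path if w in w_min) == 1 for path in paths)
```

For the paths {w1, w2} and {w2, w3}, both {w2} and {w1, w3} qualify, while {w1} (second path uncovered) and {w1, w2} (first path doubly covered) do not. This same check serves as the polynomial-time certificate test in the proof of Theorem 7.4.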
Theorem 7.4 Given a set of paths ζ between Ri,j and Sk,l(i), finding Wmin is
NP-Complete.
Proof : We prove this by reducing monotone-1-in-3 SAT to our problem. The
above problem is in NP: given a certificate Wc, we can check in polynomial time
that each path has exactly one wait in Wc. A monotone-1-in-3 SAT formula
f is a 3-CNF formula that has no negations and must be satisfied by assigning exactly one
literal in every clause to true. Given a formula f , let v represent the set of variables
and c be the set of clauses. v represents Wall, i.e., each variable represents a wait
W . We construct a Happens-Before graph G = (V, E) with V = v and for every
clause ci = (x1 ∨ x2 ∨ x3), we add edges 〈x1, x2〉 ∈ E and 〈x2, x3〉 ∈ E. That
is, x1 → x2 → x3 forms a path in the graph. There is a source vertex labeled
Ri,j(∗) and sink vertex Sm,n(i) where 〈Ri,j, x1〉 ∈ E and 〈x3, Sm,n〉 ∈ E. A path
starts at the source vertex and ends at the sink vertex. If there is a Wmin for these
paths, then exactly one wait (one variable) is selected from every path. Setting the
variables corresponding to these waits to true in f will satisfy f . Conversely, if f
can be satisfied, the variables that have been set to true in each clause form a Wmin
set.
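The clause gadget of this reduction can be sketched as follows (a simplified illustration: variables are strings, and "R" and "S" stand for the source vertex Ri,j(∗) and sink vertex Sm,n(i)):

```python
def sat_to_hb_graph(clauses):
    """Theorem 7.4 reduction sketch: each clause (x1, x2, x3) contributes
    the path R -> x1 -> x2 -> x3 -> S, so a 1-in-3 satisfying assignment
    corresponds to picking exactly one vertex on every clause path."""
    edges = set()
    for x1, x2, x3 in clauses:
        edges |= {("R", x1), (x1, x2), (x2, x3), (x3, "S")}
    return edges
```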
Figure 7.4(a) and Figure 7.4(b) show the construction of GHB graphs for the
Monotone 1-in-3-SAT formulae when there are multiple Wmins possible and when
no Wmin is possible, respectively.
Theorem 7.5 Finding all minimal wait sets is #P-Complete.
Proof : Counting the satisfying assignments of monotone-1-in-3 SAT is a
#P-Complete problem, and our reduction is such that the number of solutions to
the SAT problem is equal to the number of minimal wait sets.
Since finding all possible minimal wait sets is #P-Complete, we propose the
algorithm in Figure 7.5, which finds all the subsets of the waits in Wall (i.e., the
powerset) and sorts the subsets by size. Then, it iterates over each subset in the sorted
order and checks whether buffering the waits in the set will break all the paths. If so, it
(a) Example with multiple Wmin (b) Example with no Wmin
Figure 7.4. Example Formulas and GHB graphs
MinimalWaitSets(ζ, Wall) {
  PWall = SortBySize(Powerset(Wall));
  for each (s ∈ PWall) {
    if (BreaksAllPaths(ζ, s)) {
      PWall = PWall − {p ∈ PWall | s ⊂ p};
    } else {
      PWall = PWall − {s};
    }
  }
  return PWall;
}
Figure 7.5. Algorithm to Find Minimal Wait Sets
removes all the supersets of the set from the powerset. If the set does not break all
the paths, then the set itself is removed from the powerset.
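This smallest-first scan with superset pruning can be sketched in a few lines (an illustration, not the tool's implementation; paths are again encoded as lists of their culprit waits):

```python
from itertools import combinations

def breaks_all_paths(paths, s):
    """A wait set s breaks all paths iff every path contains a wait in s."""
    return all(any(w in s for w in path) for path in paths)

def minimal_wait_sets(paths, w_all):
    """Scan subsets of w_all smallest-first; keep a subset if it breaks
    all paths, and prune every proper superset of a kept subset."""
    kept = []
    for size in range(1, len(w_all) + 1):
        for combo in combinations(sorted(w_all), size):
            s = frozenset(combo)
            if any(k < s for k in kept):  # proper superset of a kept set
                continue
            if breaks_all_paths(paths, s):
                kept.append(s)
    return kept
```

For the two paths {w1, w2} and {w2, w3}, the scan keeps {w2} and {w1, w3} and prunes every set containing them.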
The buffering transition for a send only buffers one send at a time. However,
the minimal wait sets can contain more than one wait whose send must be buffered.
Since RSC(µ) transitions are independent of all other MPI transitions, we combine
multiple RSC(µ) transitions into a big-step transition that, given a set of sends
µ, buffers all the sends in µ as follows:
RSBC(µ) :Σ(σ as 〈I,M, C, R, pc〉), µ ⊆ {Si,j ∈ I | Si,j /∈ C, {Si,j, Rk,l} /∈ M}
Σ〈I,M, C ∪ µ, R, pc〉
Note that we could instead add individual RSC transitions to the backtrack sets.
However, this would only cause redundant interleavings, which is very inefficient
for POE, a stateless dynamic verification algorithm. The RSBC transition avoids
these redundant interleavings.
7.5 POEMSE Algorithm
We now provide the POEMSE algorithm that extends the POEOPT algorithm
to handle slack. The algorithm differs from POEOPT with respect to updating
the backtrack sets. The rest of the algorithm remains unchanged. We only
provide the pseudocode where there are any changes or additions, as shown in
Figures 7.6, 7.7, and 7.8. The GetTransition routine is similar to the
GetTransition of POE except that the RSC transitions are never executed. This
emulates the zero-buffering behavior of POEOPT.
The pseudocode for UpdateBacktrack is shown in Figure 7.6. The pseudocode
invokes AddSlacktoBacktrack when an RSR∗ transition is executed from
that state, as before. Figure 7.8 uses the following helper routines:
• GetHBGraph takes a HB relation as input and returns a HB graph. In
the POEMSE algorithm, the Deterministic(HB) graph is passed as input.
• FindPaths takes the HB graph GHB and two MPI operations as input and
returns all the paths between the operations in the graph.
• FindWaits finds all the culprit waits in the paths, i.e., it finds the Wall set.
• GetMinSendSets returns the send sets corresponding to each of the minimal
wait sets.
• GetRSBC takes a set of sends as input and returns the RSBC transition for
those sends.
UpdateBacktrack(statevec) {
  for each (σ ∈ statevec) {
    if (enabled(σ) = ∅)
      return;
    ti = curr(σ);
    if (is∗(ti)) {
      AddSlacktoBacktrack(ti, σ, statevec);
      AddtoBacktrack(ti, σ, statevec);
    }
  }
}
Figure 7.6. Pseudocode for UpdateBacktrack
GetTransition(set of transitions T) {
  TB = {t ∈ T | isRSC(t)};
  if (hasnon∗(T − TB))
    return choosenon∗(T − TB);
  else if (has∗(T − TB))
    return choose∗(T − TB);
}
Figure 7.7. Pseudocode for GetTransition
AddSlacktoBacktrack(Transition ti, σ, statevec) {
  let Ri,j(∗) be the receive operation of ti and Sm,n(i) be some compatible
      send that we want to try and co-enable with Ri,j;
  GHB = GetHBGraph(Deterministic(HB));
  ζ = FindPaths(GHB, Ri,j, Sm,n);
  Wall = FindWaits(ζ);
  mws = MinimalWaitSets(ζ, Wall);
  if (mws = ∅)
    return;
  mss = GetMinSendSets(mws);
  for each (µ ∈ mss) {
    t = GetRSBC(µ);
    backtrack(σ) = backtrack(σ) ∪ {t};
  }
}
Figure 7.8. Pseudocode for AddSlacktoBacktrack
The backtrack(σ) is updated for the states where an RSR∗ transition is executed
in the current interleaving. From the Deterministic(HB) relation, the GHB graph
is generated. All the paths between the wildcard receive Ri,j and its matching send
Sm,n are found. The minimal wait sets are generated in mws. The mws is converted
into mss, where each of the waits in the sets is replaced by its corresponding
send, and the RSBC transition for each of the minimal send sets in mss is added
to the backtrack sets.
We now prove the following invariant for all states σi generated by the POEMSE
algorithm.
Lemma 7.6 In the POEMSE algorithm, when a state σi is popped from statevec,
if there exists an RSBC transition ti ∈ enabled(σi) such that
• executing ti can cause Rk,l(∗) ∈ Ready(σi) and some Sm,n(k) /∈ Ready(σi) to
become co-enabled, and
• RSR∗ : {Si,j, Rk,l} ∈ backtrack(σi),
then ti is in backtrack(σi).
Proof : Induction, by post-order, as with POEOPT.
• Basis case: The final state either has enabled(σi) = ∅ or contains only RSC
transitions, and the above invariant holds vacuously.
• Induction: Assume that the invariant holds for all successors of state σi.
• If RSR∗ : {Si,j, Rk,l} /∈ backtrack(σi), then the invariant holds vacuously.
These are the states where hasnon∗ is true.
• When RSR∗ : {Si,j, Rk,l} ∈ backtrack(σi), we prove by contradiction. Assume
that there is some RSBC transition ti involving one or more sends that
was not included in backtrack(σi) but must be included in backtrack(σi)
for Rk,l to be co-enabled with some Sm,n. Let σi+1 be the state reached
after executing ti from σi. Since ti is independent of all transitions, every
transition in enabled(σi) − {ti} is also available in enabled(σi+1). Therefore,
every interleaving generated from σi can also be generated from σi+1. Let
succ(σi) be the set of states generated from σi after executing transitions
from backtrack(σi).
Let DTGk be the dependent transition group of Rk,l. Clearly, Sm,n cannot
be from any process that is already in DTGk since Sm,n is IntraHB-related
to some Si,j(k) ∈ Ready(σi) or Rk,l and IntraHB relation does not change
across interleavings. Therefore, Sm,n must be from a process different from
the processes involved in DTGk. Let Sm,n be a descendant of some other
MPI operation Fm,r that belongs to some DTGm. Since DTGm and DTGk
are independent, DTGm is still enabled in some σl ∈ succ(σi).
By induction hypothesis, every code path where Sm,n occurs will be explored
for DTGm from σi (since interleavings from σi also include the interleavings
from succ(σi)). Therefore, every interleaving generated from σi+1 involving
Sm,n is equivalent to some interleaving generated from σi. For the interleavings
generated from σi, one of the following must hold:
– Rk,l and Sm,n are HB-related and buffering the sends in ti will not break
the paths between Rk,l and Sm,n, in which case Rk,l and Sm,n cannot be
co-enabled in an equivalent interleaving (Lemma 5.5), or
– Rk,l and Sm,n are not HB-related and hence can be co-enabled in an
equivalent interleaving without buffering any more sends (Lemma 6.2),
or
– There is a path from Rk,l to Sm,n involving culprit waits corresponding
to the sends in ti, and buffering them will break all the paths.
In the first two cases, it is not necessary to buffer any more sends and
therefore, it is not necessary to add ti to backtrack(σi), which contradicts
the assumption that it is necessary to add ti to backtrack(σi). In the last
case, the POEMSE algorithm will find all minimal wait sets and will add ti to
backtrack(σi), which is a contradiction.
Theorem 7.7 For any state σ generated by the POEMSE algorithm, the set
backtrack(σ) is persistent.
Proof : The proof follows directly from Lemma 7.6.
Theorem 7.8 The POEMSE algorithm will find deadlocks at Wi,j′(hi,j) when
Si,j does not have a matching receive.
Proof : The proof follows directly from the fact that POEMSE is persistent
and buffers all possible minimal wait sets at every state σi. Also, the very first
interleaving generated from any state σi executes the POEOPT algorithm with the
RSC transitions never executed. Since POEOPT is persistent when the sends are
zero buffered, the POEMSE algorithm will find the head-to-head deadlocks.
CHAPTER 8
EXTENSIONS TO THE FORMAL MODEL
This chapter extends the formal model presented in Chapter 3 to
• support more MPI functions: namely, MPI_Send, MPI_Recv, MPI_Waitall in
Section 8.1 and
• handle communicators and tags (Section 8.2).
8.1 Handling More MPI Functions
This section describes how the formal model can easily be extended to handle
a few very frequently used MPI functions.
8.1.1 MPI Send and MPI Recv
The MPI functions MPI_Send and MPI_Recv have the following prototypes:
int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest,
int tag, MPI_Comm comm);
int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int src,
int tag, MPI_Comm comm, MPI_Status *status);
MPI_Send blocks until the send completes. Similarly, MPI_Recv blocks until the
receive operation completes. MPI_Send will be denoted as Sb when the buffering
status of the send is immaterial. Sb will be used to explicitly denote when Sb is
buffered by the runtime.
Op is now extended with Sbi,j(k), Rbi,j(k), and Rbi,j(∗), where i, k ∈ PID and
j ∈ |Pi|.
The Nonovertake ordering (Definition 3.1) is extended to handle Sb and Rb as
follows:
Definition 8.1 Nonovertakeb(σ as 〈I,M, C, R, pc〉) ⊆ I × I = Nonovertake(σ)
∪ {〈Si,j(k), Sbi,j′(k)〉, 〈Sbi,j(k), Si,j′(k)〉, 〈Sbi,j(k), Sbi,j′(k)〉}
∪ {〈Ri,j(k), Rbi,j′(k)〉, 〈Rbi,j(k), Ri,j′(k)〉, 〈Rbi,j(k), Rbi,j′(k)〉}
∪ {〈Ri,j(∗), Rbi,j′(k)〉, 〈Rbi,j(∗), Ri,j′(k)〉, 〈Rbi,j(∗), Rbi,j′(k)〉}
∪ {〈Ri,j(∗), Rbi,j′(∗)〉, 〈Rbi,j(∗), Ri,j′(∗)〉, 〈Rbi,j(∗), Rbi,j′(∗)〉}
where i, k ∈ PID, j, j′ ∈ |Pi| and j < j′.
The Fence(σ) ordering (Definition 3.3) is updated as follows:

Definition 8.2 Fenceb(σ as 〈I,M, C, R, pc〉) ⊆ I × I =
Fence(σ) ∪ {〈Wi,j, Fi,j′〉, 〈Bi,j, Fi,j′〉, 〈Sbi,j, Fi,j′〉, 〈Rbi,j, Fi,j′〉}
where j < j′ and F ∈ {S, R, W, B, Sb, Rb}.
The isS(Fi,j) predicate is extended to return true when F = S or F = Sb.
Similarly, isR(Fi,j) is extended to return true when F = R or F = Rb. The PS
and PR transitions will now also include the process transitions for Sb and Rb MPI
operations.
The following runtime transitions are added to support the Sb and Rb MPI
operations:
RSRb :
Σ(σ), {Si,j(k), Rbk,l(i)} ⊆ Ready(σ)
Σ(σ′ as 〈I, M ∪ {{Si,j, Rbk,l}}, C, R, ls〉),
Assert : Ready(σ′) = Ready(σ) − {Si,j, Rbk,l}

RSbR :
Σ(σ), {Sbi,j(k), Rk,l(i)} ⊆ Ready(σ)
Σ(σ′ as 〈I, M ∪ {{Sbi,j, Rk,l}}, C, R, ls〉),
Assert : Ready(σ′) = Ready(σ) − {Sbi,j, Rk,l}

RSbRb :
Σ(σ), {Sbi,j(k), Rbk,l(i)} ⊆ Ready(σ)
Σ(σ′ as 〈I, M ∪ {{Sbi,j, Rbk,l}}, C, R, ls〉),
Assert : Ready(σ′) = Ready(σ) − {Sbi,j, Rbk,l}

The RSRb∗, RSbR∗, and RSbRb∗ transitions are added similarly.
The above runtime transitions indicate the various possible matchings between
different types of MPI send and receive operations. However, note that for any send
or receive, only one of the above transitions is enabled due to the Nonovertake
rule.
The RSbC rule to complete a nonbuffered send is:

RSbC :
Σ(σ), ({Sbi,j, Rk,l} ∈ M ∨ {Sbi,j, Rbk,l} ∈ M), Sbi,j /∈ C
Σ〈I, M, C ∪ {Sbi,j}, R ∪ {Sbi,j}, ls[i ← lsi + 1]〉

The corresponding rule when Sbi,j is buffered by the runtime has no matching
precondition:

RSbC :
Σ(σ), Sbi,j /∈ C
Σ〈I, M, C ∪ {Sbi,j}, R ∪ {Sbi,j}, ls[i ← lsi + 1]〉
The RRbC rule to complete a blocking receive is:

RRbC :
Σ(σ), ({Si,j, Rbk,l} ∈ M ∨ {Sbi,j, Rbk,l} ∈ M), Rbk,l /∈ C
Σ〈I, M, C ∪ {Rbk,l}, R ∪ {Rbk,l}, ls[k ← lsk + 1]〉
The Rb and Sb operations return only when they are completed, unlike their
nonblocking counterparts S and R that can return at any time.
8.1.2 MPI Waitall
The MPI Waitall operation has the following prototype:
int MPI_Waitall( int count, MPI_Request array_of_requests[],
MPI_Status array_of_statuses[]);
and its arguments include an array of MPI_Request handles, where count is the size
of array_of_requests. The handles can be the MPI_Request handles of either S or
R. MPI_Waitall is denoted as Walli,j′(H), where H is the set of MPI handles and
hi,j ∈ H denotes either Si,j or Ri,j.
The Resource(σ) (Definition 3.2) and Fence(σ) (Definition 3.3) sets are updated
as follows:
Definition 8.3 ResourceWall(σ) =
Resource(σ) ∪ ⋃hi,j∈H {〈Si,j, Walli,j′〉, 〈Ri,j, Walli,j′〉 | j < j′}.

Definition 8.4 FenceWall(σ) =
Fenceb(σ) ∪ {〈Walli,j, Fi,j′〉 | j < j′, F ∈ {S, R, W, B, Sb, Rb}}.
We extend the isW (Fi,j) predicate to return true when F = W or F = Wall
and false otherwise. The PW transition will now include the process transition for
Wall. The runtime transitions for Wall are presented below:
RWallC :
Σ(σ), Walli,j ∈ Ready(σ)
Σ(σ′ as 〈I, M, C ∪ {Walli,j}, R, ls〉),
Assert : Ready(σ′) = Ready(σ) − {Walli,j}

RWRet :
Σ(σ), Walli,j ∈ C, Walli,j /∈ R
Σ〈I, M, C, R ∪ {Walli,j}, ls[i ← lsi + 1]〉
The MPI_Waitany operation behaves like Wall except that MPI_Waitany
can return when at least one of the sends or receives corresponding to its handles
is complete. This requires a change to the definition of the Ready set so that
MPI_Waitany enters the Ready(σ) set when at least one of its sends or receives is
complete, instead of all of them as required for Wall.
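The difference between the two Ready conditions can be sketched with two predicates over a hypothetical set of completed operations:

```python
def waitall_ready(handles, completed):
    """Wall enters Ready only when every send/receive behind its
    handles has completed."""
    return all(h in completed for h in handles)

def waitany_ready(handles, completed):
    """MPI_Waitany needs only one of its handles' operations complete."""
    return any(h in completed for h in handles)
```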
8.2 Communicators and Tags
The formal model presented in Chapter 3 abstracts away the communicators
and tags. We now describe how the formal model can be extended to handle
communicators and tags. Consider an MPI program execution with n processes.
MPI allows the n processes to be divided into subsets called groups. Every process
can belong to one or more groups. All the processes in a group with m ≤ n processes
are ranked from 0 to m − 1 within the group. Initially, when the MPI processes
execute MPI_Init, all processes by default belong to the group MPI_GROUP_WORLD.
All groups are created as subsets of MPI_GROUP_WORLD.
MPI groups are created using one of many group construction APIs provided
by the MPI library. For example, subsets of a group can be constructed using:

int MPI_Group_incl(MPI_Group ingroup, int m, int *ranks,
MPI_Group *newgroup);

where newgroup contains m processes from ingroup and ranks gives the ranks of
the m processes in ingroup that must be included in newgroup. Note that
newgroup will also rank its processes from 0 to m − 1. Hence, a process in newgroup
can have a different rank than it has in ingroup. Since groups are essentially sets of
processes, various set operations on groups are provided to create new groups.
The following is a list of a few MPI group creation functions.
• MPI_Group_incl creates a subset of a group.
• MPI_Group_difference creates the set difference of two groups.
• MPI_Group_union creates a new group that is the union of two groups.
• MPI_Group_intersection creates a new group that is the intersection of two groups.
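Since a group is essentially an ordered set of process ranks, these constructors can be modeled directly with list operations. The following sketch mirrors their semantics; the helper names (group_incl, group_union, and so on) are illustrative and are not the MPI API, and a group is modeled here as a list mapping group rank to MPI_GROUP_WORLD rank:

```python
# A group is modeled as an ordered list whose i-th entry is the
# MPI_GROUP_WORLD rank of the process with group rank i.
# Helper names below are illustrative, not actual MPI functions.

def group_incl(ingroup, ranks):
    # newgroup re-ranks the selected processes from 0 to m-1,
    # in the order given by `ranks` (as MPI_Group_incl does).
    return [ingroup[r] for r in ranks]

def group_union(g1, g2):
    # Union keeps g1's order, then appends members of g2 not in g1.
    return g1 + [p for p in g2 if p not in g1]

def group_intersection(g1, g2):
    return [p for p in g1 if p in g2]

def group_difference(g1, g2):
    return [p for p in g1 if p not in g2]

world = [0, 1, 2, 3]                  # MPI_GROUP_WORLD with n = 4
evens = group_incl(world, [0, 2])     # world ranks 0 and 2, re-ranked 0..1
odds = group_difference(world, evens)
```

Note how group_incl makes the re-ranking explicit: world rank 2 becomes rank 1 inside evens, which is exactly why a process can hold different ranks in different groups.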
Processes within the same group can communicate in disjoint contexts by creating various communicators associated with the group. The MPI library by default associates the MPI_COMM_WORLD communicator with MPI_GROUP_WORLD. The MPI library provides communicator creation APIs, and every communicator created is uniquely identified by the MPI runtime.
The formal model assumes that all communication happens with comm = MPI_COMM_WORLD. The process ranks are hence the ranks of the processes in MPI_GROUP_WORLD. An MPI_Isend sends to another process receiving on the same communicator comm; the dest field is the process rank in the group associated with comm. The MPI library provides a mapping function (MPI_Group_translate_ranks) that maps the rank of a process in any group to its rank in MPI_GROUP_WORLD. We therefore assume that a process rank is always mapped to MPI_GROUP_WORLD, so the formal model only needs to check that the communicators are the same.
Tags provide more fine-grained communication within a communicator. A tag is an integer that, along with the communicator, identifies a message. When messages are matched, the tags must match along with the communicators. The tag field can also be the wildcard MPI_ANY_TAG, denoted as “*”.
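The matching condition just described (communicators must be equal, the receive must name the sender or use the wildcard source, and the tags must agree or the receive must use the wildcard tag) can be written as a small predicate. This is an illustrative sketch, not ISP code:

```python
ANY_TAG = "*"   # models MPI_ANY_TAG
ANY_SRC = "*"   # models MPI_ANY_SOURCE

def can_match(send, recv):
    """send = (src, dest, comm, tag); recv = (src, owner, comm, tag).
    A receive matches a send when they name the same communicator,
    the receive is posted by the send's destination, the receive names
    the sender (or is a wildcard), and the tags agree (or the receive
    uses the wildcard tag)."""
    s_src, s_dest, s_comm, s_tag = send
    r_src, r_owner, r_comm, r_tag = recv
    return (s_comm == r_comm
            and s_dest == r_owner
            and r_src in (s_src, ANY_SRC)
            and r_tag in (s_tag, ANY_TAG))

# P0 sends to P1 on communicator 0 with tag 7:
send = (0, 1, 0, 7)
assert can_match(send, (0, 1, 0, 7))        # exact match
assert can_match(send, ("*", 1, 0, "*"))    # wildcard source and tag
assert not can_match(send, (0, 1, 1, 7))    # different communicator
```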
Note that communicators and tags can only dictate the IntraHB order among the operations within a process and do not contribute any additional nondeterminism.
8.2.1 Extensions to the Formal Model
We now extend the MPI operations S and R with a communicator comm and a tag, where comm ∈ N and tag ∈ N ∪ {∗}. Hence, the set Op now contains {Si,j(k, comm, tag), Ri,j(k, comm, tag), Ri,j(∗, comm, tag), Wi,j′(hi,j), Bi,j(comm)}.
The Nonovertake rule (Definition 3.1) is redefined as follows:
Definition 8.5 Nonovertake(σ) =
{〈Si,j(k, commj, tagj), Si,j′(k, commj′, tagj′)〉, 〈Ri,j(k, commj, tagj), Ri,j′(k, commj′, tagj′)〉,
〈Ri,j(∗, commj, tagj), Ri,j′(k, commj′, tagj′)〉, 〈Ri,j(∗, commj, tagj), Ri,j′(∗, commj′, tagj′)〉
| j < j′ ∧ commj = commj′ ∧ (tagj = “∗” ∨ tagj = tagj′)}.
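Definition 8.5 can be read as a predicate over two same-process operations: the program-order-later operation j′ is forced to match after j exactly when both name the same communicator and the earlier tag is the wildcard or equals the later tag. A sketch of that reading (illustrative only, not part of the formal model's mechanization):

```python
def nonovertake(op1, op2):
    """op = (kind, target, comm, tag), with kind in {"S", "R"} and "*"
    as the wildcard; op1 is the program-order-earlier operation.
    Returns True when op1 must match before op2 (Definition 8.5)."""
    k1, t1, c1, tag1 = op1
    k2, t2, c2, tag2 = op2
    if k1 != k2 or c1 != c2:
        return False
    # Two sends to the same destination are ordered; two receives are
    # ordered when the earlier names the same source or is a wildcard.
    same_target = (t1 == t2) or (k1 == "R" and t1 == "*")
    return same_target and (tag1 == "*" or tag1 == tag2)

assert nonovertake(("S", 2, 0, 5), ("S", 2, 0, 5))       # same dest, same tag
assert nonovertake(("R", "*", 0, "*"), ("R", 1, 0, 9))   # wildcard precedes specific
assert not nonovertake(("S", 2, 0, 5), ("S", 2, 1, 5))   # different communicator
```

Observe that a specific-source receive followed by a wildcard receive is not in the set, matching the four pair shapes listed in the definition.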
The MPI transitions RSR, RBC and RSR∗ are extended to support MPI com-
municators and tags as follows:
RSR :
Σ(σ),
{Si,j′(k, commi, tagi), Rk,l′(i, commk, tagk)} ⊆ Ready(σ),
commi = commk,
tagi = ∗ =⇒ Si,j(k, commi, tagk) ∉ Ready(σ) for all j < j′,
tagk = ∗ =⇒ Rk,l(i, commi, tagi) ∉ Ready(σ) for all l < l′,
tagi ≠ ∗ ∧ tagk ≠ ∗ =⇒ tagi = tagk
Σ(σ′ as 〈I,M ∪ {{Si,j′, Rk,l′(i)}}, C, R, ls〉), Assert : Ready(σ′) = Ready(σ) − {Si,j′, Rk,l′}
RSR∗ :
Σ(σ),
{Si,j′(k, commi, tagi), Rk,l′(∗, commk, tagk)} ⊆ Ready(σ),
commi = commk,
Rk,l(i, commi, ∗) ∉ Ready(σ) ∧ Rk,l(i, commi, tagi) ∉ Ready(σ) for all l < l′,
tagi = ∗ =⇒ Si,j(k, commi, tagk) ∉ Ready(σ) for all j < j′,
tagk = ∗ =⇒ Rk,l(∗, commi, tagi) ∉ Ready(σ) for all l < l′,
tagi ≠ ∗ ∧ tagk ≠ ∗ =⇒ tagi = tagk
Σ(σ′ as 〈I,M ∪ {{Si,j′, Rk,l′(i)}}, C, R, ls〉), Assert : Ready(σ′) = Ready(σ) − {Si,j′, Rk,l′}
RBC :
Σ(σ),
bar(comm) as {Bi,j(commi) | Bi,j ∈ Ready(σ) ∧ commi = comm},
| bar | = size(comm)
Σ(σ′ as 〈I,M ∪ {bar}, C ∪ bar, R, ls〉), Assert : Ready(σ′) = Ready(σ) − bar
where size(comm) is the number of processes in the group corresponding to
comm.
CHAPTER 9
ISP: A PRACTICAL DYNAMIC MPI
VERIFIER
This chapter presents ISP, our dynamic MPI verification tool, which incorporates the verification algorithms POE, POEOPT and POEMSE. Section 9.1 describes the architecture of ISP. Section 9.2 describes various implementation tricks incorporated into ISP in order to implement the verification algorithms. Finally, Section 9.3 presents experimental results for the three POE algorithm variations.
9.1 ISP Architecture
ISP behaves as an auxiliary MPI runtime and performs the matching of var-
ious MPI functions. ISP uses the actual MPI runtime (henceforth referred to as
MPI library) to transfer data and complete the MPI operations. ISP works by
intercepting the MPI calls made by the target program and making decisions on
when to send the MPI calls to the MPI library. This is accomplished by the two
main components of ISP : the Profiler and the Scheduler. Figure 9.1 provides an
overview of ISP’s components and their interaction with the program as well as the
MPI library.
9.1.1 The Profiler
The interception of MPI calls is accomplished by compiling the ISP profiler
together with the target program’s source code. The profiler makes use of MPI’s
profiling mechanism (PMPI). It provides its own version of MPI f for each corre-
sponding MPI function f . Within each of these MPI f, the profiler communicates
with the scheduler using TCP sockets to send information about the MPI call the
process wants to execute. It will then wait for the scheduler to make a decision
Figure 9.1. ISP Architecture
whether the MPI call must be sent into the MPI library or postponed until a later time. When the scheduler gives permission to fire f, the corresponding PMPI_f is issued to the MPI library. Since all MPI libraries come with a PMPI_f for every MPI function f, this approach provides a portable and lightweight instrumentation mechanism for the MPI programs being verified.
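The interposition pattern can be pictured abstractly: each wrapper reports the call to the scheduler, blocks until permission arrives, and only then invokes the PMPI-level routine. The sketch below simulates this handshake in-process; the real profiler is C code talking to the scheduler over TCP, and all names here are illustrative:

```python
class Scheduler:
    """Stands in for the ISP scheduler: records announced calls and
    grants permission (the real scheduler may postpone the call)."""
    def __init__(self):
        self.log = []

    def announce(self, rank, call):
        self.log.append((rank, call))
        return True   # permission to fire the call

def make_wrapper(scheduler, rank, name, pmpi_fn):
    # Corresponds to the profiler's MPI_f: report, wait for permission,
    # then issue the underlying PMPI_f.
    def mpi_fn(*args):
        if scheduler.announce(rank, name):
            return pmpi_fn(*args)
    return mpi_fn

issued = []
sched = Scheduler()
MPI_Barrier = make_wrapper(sched, 0, "MPI_Barrier",
                           lambda: issued.append("PMPI_Barrier"))
MPI_Barrier()
assert sched.log == [(0, "MPI_Barrier")]
assert issued == ["PMPI_Barrier"]
```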
9.1.2 The ISP Scheduler
The ISP scheduler carries out the verification algorithms. Since every process starts executing with an MPI_Init, every process invokes the MPI_Init provided by the profiler. The profiler's MPI_Init establishes a TCP connection with the scheduler and communicates its process rank to the scheduler. This TCP connection is used for all further communication between the process and the scheduler, and the scheduler maintains a mapping between each process rank and its corresponding TCP connection. Once the connection with the scheduler is established, the processes issue a PMPI_Init into the MPI library. The processes finally return from the profiler's MPI_Init and continue executing the program.
Whenever a process wishes to execute an MPI function, it invokes the profiler's MPI_f, which communicates this information to the scheduler over the TCP connection. The profiler does not always issue the PMPI_f call into the MPI library when the process calls the profiler's MPI_f. For nonblocking calls like MPI_Isend and MPI_Irecv, the profiler code sends the information to the scheduler, stores it in a structure within the profiler, and returns. When the process executes a fence instruction like MPI_Wait, the scheduler makes its matching decisions and sends a message to the process to issue the PMPI_Isend (or other nonblocking function) corresponding to the Wait call. The MPI library is not aware of the existence of the MPI_Isend until this time. Eventually, the scheduler sends a message to the process to issue the PMPI_Wait, at which time the process returns. It must be noted that the scheduler allows a process to execute a fence MPI function only when the Wait can complete and hence return; otherwise, the scheduler detects a deadlock.
9.2 ISP: Implementation Issues
This section briefly describes various implementation decisions made in ISP in order to support the verification algorithms.
9.2.1 Out-of-Order Issue
For nonblocking calls, the PMPI f functions are not executed when the MPI f
is executed by the process. The reason behind this decision is the nonblocking
wildcard receive function MPI_Irecv . If the process executing the wildcard receive
into the profiler also executes the PMPI Irecv into the library, the actual matching
of the receive with a send will be decided by the MPI library. Since the scheduler
MUST ensure that the matching that happens in the library is the matching the
scheduler has decided, the scheduler postpones the issue of the wildcard receive into
the MPI library until a later time. Once the scheduler decides that the wildcard
receive must be matched with the send of a particular process rank, it communicates
this decision to the process to execute a PMPI Irecv with the src set to the process
99
rank of the send (we call this Dynamic source rewrite). Note that the MPI
library never knows the existence of a wildcard receive. Since the nondeterminism
is taken away, the library matches the sends and receives as the scheduler decides.
Since the scheduler must know all the sends that can match with a nonblocking
wildcard receive, a wildcard receive may be issued out-of-order into the MPI library.
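Dynamic source rewrite can be pictured as follows: once the scheduler fixes the match for a wildcard receive, the receive is issued to the library with its src field replaced by the chosen sender's rank. This is an illustrative sketch; the names are not ISP's actual code:

```python
def rewrite_source(recv, chosen_src):
    """recv models a posted wildcard receive, e.g.
    {"src": "*", "comm": c, "tag": t}. Return the receive that is
    actually issued to the MPI library, with the wildcard source
    replaced by the rank the scheduler decided on."""
    assert recv["src"] == "*", "only wildcard receives are rewritten"
    issued = dict(recv)       # the library never sees the wildcard
    issued["src"] = chosen_src
    return issued

wild = {"src": "*", "comm": 0, "tag": 7}
# The scheduler decides the send from rank 2 is the match:
assert rewrite_source(wild, 2) == {"src": 2, "comm": 0, "tag": 7}
```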
9.2.2 Scheduling MPI Waitany
Due to the out-of-order issue behavior of ISP, when a nonblocking call such as MPI_Irecv is invoked by the process, the profiler provides a unique MPI_Request handle for the nonblocking receive. When MPI_Waitany is invoked by a process with a set of request handles, it is sufficient to complete only one of the MPI_Isend or MPI_Irecv operations corresponding to the requests. Consider an MPI_Waitany that has n requests when only i of the sends or receives have been issued to the MPI library. The MPI library is aware of only these i requests and has no knowledge of the existence of the remaining n − i requests. When a PMPI_Waitany is called into the MPI library, the library aborts the process with an error that the request structure is invalid. We get around this issue by setting all of the n − i unissued requests to MPI_REQUEST_NULL; such requests are ignored by the library.
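The workaround amounts to a simple filter: before forwarding PMPI_Waitany, replace every request that has not yet been issued to the library with MPI_REQUEST_NULL, which the library ignores. A sketch with illustrative names:

```python
REQUEST_NULL = None   # stands in for MPI_REQUEST_NULL

def prepare_waitany(requests, issued):
    """requests: the handles the process passed to MPI_Waitany;
    issued: the subset the scheduler has already sent into the MPI
    library. Unissued handles are nulled so the library never sees a
    request structure it does not know about."""
    return [r if r in issued else REQUEST_NULL for r in requests]

reqs = ["h0", "h1", "h2", "h3"]
# Only h1 and h3 have been issued to the library so far:
assert prepare_waitany(reqs, {"h1", "h3"}) == [REQUEST_NULL, "h1", REQUEST_NULL, "h3"]
```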
9.2.3 Buffering Sends
In order to implement the POEMSE algorithm, the scheduler must be able to provide buffering for sends so that the waits corresponding to those sends can unblock. The scheduler cannot rely on the MPI library to provide buffering according to the scheduler's wishes, so the solution is implemented in the profiler. The profiler buffers a send by copying its data into a separate heap space. When the wait corresponding to the send is later issued into the profiler, the wait never issues a PMPI_Wait into the MPI library and instead returns from MPI_Wait immediately. The profiler eventually issues a PMPI_Wait for the buffered send once the send is matched with a receive. Note that the scheduler allows a send to be issued into the library only when there is a matching receive.
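Buffering a send thus decouples the application-level Wait from the library-level completion: the payload is copied into profiler-owned space so the Wait can return immediately, and the real PMPI_Wait is deferred until the send is matched. A sketch of that life cycle (illustrative names, not ISP code):

```python
class BufferedSend:
    def __init__(self, data):
        self.copy = bytes(data)   # profiler-owned copy of the payload
        self.waited = False       # application-level Wait already returned?
        self.completed = False    # deferred PMPI_Wait issued after a match?

def mpi_wait(send):
    # The application's Wait returns without touching the MPI library.
    send.waited = True

def on_match(send):
    # Only once a matching receive exists does the profiler issue the
    # deferred PMPI_Wait for the buffered send.
    send.completed = True

s = BufferedSend(b"payload")
mpi_wait(s)
assert s.waited and not s.completed   # Wait unblocked, send still pending
on_match(s)
assert s.completed
```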
9.3 Experimental Results
This section presents experimental results obtained by running ISP on various MPI programs. Our results are reported on the following MPI programs:
• The Umpire test suite [64] consists of a set of small MPI programs that capture various error and deadlock patterns in MPI programs.
• MADRE [53, 49] is a collection of memory-aware parallel redistribution algorithms addressing the problem of efficiently moving data blocks across processes without exceeding the allotted memory of each process. MADRE is an interesting target for ISP because it belongs to a class of MPI programs that make use of wildcard receives, which can result in deadlocks that easily go undetected.
• ParMETIS [43] is a parallel library that provides implementations of several effective graph partitioning algorithms, including several parallel routines especially suitable for graph partitioning in a parallel computing environment. ParMETIS has more than 14K lines of code and executes more than a million MPI calls when run with 32 processes.
We compare the POE algorithm with a well-known MPI testing tool called Marmot [26]. Marmot detects deadlocks using a timeout mechanism, and its architecture is similar to ISP's: the MPI calls of each process are trapped by Marmot, and when a process does not provide Marmot with its next MPI function before a user-defined timeout expires, Marmot signals a deadlock warning. We run both the POE algorithm and Marmot on the Umpire test suite. The results for a small set of the benchmarks are shown in Table 9.1; readers can find the full set of results at [25].
Table 9.1 has three columns. The first column provides the Umpire bench-
mark programs. The second column shows the result of running the Umpire
Table 9.1. Comparison of POE with Marmot

Umpire Benchmark           POE                                      Marmot
any_src-can-deadlock7.c    Deadlock detected (2 interleavings)      Deadlock caught in 5/10 runs
any_src-can-deadlock10.c   Deadlock detected (1 interleaving)       Deadlock caught in 7/10 runs
basic-deadlock10.c         Deadlock detected (1 interleaving)       Deadlock caught in 10/10 runs
basic-deadlock2.c          No deadlock detected (2 interleavings)   No deadlock caught in 20 runs
collective-misorder.c      Deadlock detected (1 interleaving)       Deadlock caught in 10/10 runs
collective-misorder2.c     Deadlock detected (1 interleaving)       No deadlock caught in 20 runs
benchmark on ISP executing the POE algorithm; we show the number of interleavings generated by POE. The last column shows the result of running the benchmark with Marmot. Each benchmark is run multiple times on Marmot to see how reliably Marmot detects a deadlock. As the results show, Marmot does not necessarily detect the presence of a deadlock every time it is run; detection is not guaranteed when the program contains nondeterministic wildcard receives. For a deterministic program like collective-misorder.c, the deadlock is detected by Marmot in every run. The reason can be directly deduced from the fact that all interleavings of a deterministic program are equivalent: if there is a deadlock in one interleaving, there will be a deadlock in every interleaving. POE detects a deadlock in collective-misorder2.c, which is also a deterministic program, even though Marmot does not detect any. This is because our formal model strictly treats all collective MPI functions as barriers, whereas the MPI standard gives MPI libraries latitude in implementing other collectives like MPI_Bcast, where these operations are not necessarily as synchronizing as MPI_Barrier. Our formal model
uses the strictest definition possible so that all potential deadlocks are detected irrespective of the MPI library on which the program will eventually be executed.
We now provide the experimental results for ParMETIS and MADRE. ParMETIS has no nondeterministic wildcard receives; hence, for any number of processes, the POE algorithm generates only a single interleaving. Table 9.2 shows the experimental results on MADRE.
The results are shown for the POE algorithm run on different MADRE programs with different process counts. A “-” in a column indicates that ISP did not terminate even after exploring more than 150,000 interleavings. The results indicate that the POE algorithm also suffers from the state explosion problem inherent in all verification tools: the benchmarks in MADRE have different DTG groups, due to which the POE algorithm's interleaving count explodes. However, the POEOPT algorithm is able to reduce the number of interleavings in many cases to just one.
The benchmarks above did not exhibit the slack-inelastic patterns targeted by the POEMSE algorithm. Our first study was therefore to see whether the POEMSE algorithm would detect deadlocks in a small set of hand-coded examples. The POEMSE algorithm successfully detected deadlocks in these programs where POE failed to find them, as shown in Table 9.3.
Our second study was to measure the overhead of POEMSE on large MPI applications. ParMETIS∗ is a modified version of ParMETIS in which a small part of the algorithm was rewritten using wildcard receives. In our benchmarks, even in the presence of wildcard receives, where the new algorithm has to run extra steps to make sure that all possible matchings in the presence of buffering are covered, the overhead is less than 3%.
Finally, we study large examples with slack-variant patterns inserted into them. This is shown as ParMETISb, where we rewrote the algorithm of ParMETIS again, this time not only to introduce wildcard receives, but also to allow the possibility of a different order of matching that can only be discovered by allowing certain sends to be buffered. Our experiments show that POEMSE successfully discovered the alternate matches.
Table 9.2. Results for POE and POEOPT on MADRE

MADRE   POE interleavings            POEOPT interleavings
        2 Procs  3 Procs  4 Procs    2 Procs  3 Procs  4 Procs
sbt1    2        6        24         1        1        1
sbt2    2        6        24         1        1        1
sbt3    20       1680     -          1        1        1
sbt4    1        20       1680       1        1        1
sbt5    1        20       1680       1        20       1680
sbt6    1        20       1680       1        20       1680
sbt7    20       -        -          1        20       1680
sbt8    1        1        1          1        1        1
sbt9    -        1        -          1        1        1
Table 9.3. Results for POE and POEMSE

Number of interleavings (note the extra necessary interleavings of POEMSE)

Benchmark       POEMSE                 POE
sendbuff.c      5                      1
sendbuff-1a.c   2 (deadlock caught)    1
sendbuff2.c     1                      1
sendbuff3.c     6                      1
sendbuff4.c     3                      1
ParMETISb       2                      1

Overhead of POEMSE on ParMETIS / ParMETIS∗ (runtime in seconds; (x) denotes x interleavings)

Benchmark            POEMSE      POE
ParMETIS (4 procs)   20.9 (1)    20.5 (1)
ParMETIS (8 procs)   93.4 (1)    92.6 (1)
ParMETIS∗            18.2 (2)    18.7 (2)
CHAPTER 10
CONCLUSIONS
Standardization success stories such as MPI are extremely rare in computing practice. Enduring standards provide the perfect context within which to create robust design and debugging techniques. Yet, the formal methods community has largely ignored MPI and related developments, while the same community produces a vast number of papers on shared memory concurrency formalization. We can offer only two explanations: (i) the sheer number of MPI API calls (more than 300 in MPI 2.0) seems to have discouraged the CS community from understanding MPI, and (ii) the problems solved using MPI are typically not taught in mainstream CS classes.
This dissertation contributes to a deep understanding of the primitives underlying MPI. As we show, one can understand MPI through a small set of primitive notions such as nonblocking sends, receives, waits and barriers. If one teaches this much smaller primitive basis of MPI to newcomers, they will have a much easier time reasoning about their MPI programs and possible optimizations. One can also formulate a whole range of analysis problems in terms of the Happens-Before relation for MPI that we contribute. As a concrete example, prior to our work, there was no formal, systematic way to argue whether the following one-line MPI program will deadlock or not, both with and without slack. Using our Happens-Before relation for MPI, we can precisely analyze even tricky “auto-send” examples of this nature.
P0: Irecv(from P0, x, &h); Wait(&h); Barrier; Isend(to P0, 22);
10.1 Suggestions for Future Work
• State space: Though the POE algorithms developed in this dissertation guarantee full coverage, they unfortunately suffer from the state space explosion problem. These verification algorithms would benefit from exploiting either programmer knowledge or static analysis techniques that provide information on semantic equivalence between various wildcard matches; this can considerably reduce the state space and the verification time.
• Implementing MPI_Waitany: The MPI functions MPI_Waitany and MPI_Waitsome are sources of nondeterminism. For n request handles, repeated calls to MPI_Waitany can cause n! interleavings, and MPI_Waitsome can cause 2^n − 1 interleavings. Clearly, the presence of these MPI functions in a program can quickly cause the state space of a verification tool to explode. As a novel solution, we have employed simple static analysis that examines the control flow to decide the requests on which Waitany or Waitsome must interleave. Our initial results [63] are encouraging. In our experience, a full interleaving exploration over all subsets of requests is in general wasteful; more sophisticated static analysis can help in building better verification tools for MPI.
• MPI and OpenMP: MPI+OpenMP mixes MPI with threads. Today's multicore architectures provide opportunities for higher parallelism, and the most efficient way to exploit that parallelism is by using threads. Future MPI programs will exploit this parallelism using well-known threading implementations like OpenMP, where each process can execute multiple threads and each thread can invoke MPI functions. These programs suffer both from traditional data races due to memory shared among threads and from MPI-related bugs, so debugging multithreaded MPI programs will be even more challenging. Consider multiple MPI processes, each with multiple threads, where all threads of one process issue a wildcard receive and threads of the other processes issue sends. A send can now match the wildcard receive of any thread; in effect, the send becomes a kind of wildcard send. Extending the algorithms in this dissertation to handle multiple threads would be good future work.
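The interleaving counts quoted for MPI_Waitany and MPI_Waitsome above follow from elementary combinatorics: calling MPI_Waitany repeatedly until all n requests complete can yield any of the n! completion orders, while a single MPI_Waitsome may report any nonempty subset of the n requests, giving 2^n − 1 outcomes. A quick check of both counts:

```python
from math import factorial
from itertools import combinations

def waitany_orders(n):
    # Completion orders when Waitany is called until all n requests finish.
    return factorial(n)

def waitsome_outcomes(n):
    # Nonempty subsets of n requests that a single Waitsome may report.
    return sum(1 for k in range(1, n + 1)
               for _ in combinations(range(n), k))

assert waitany_orders(3) == 6
assert waitsome_outcomes(3) == 2**3 - 1 == 7
```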
REFERENCES
[1] Aananthakrishnan, S., Delisi, M., Vakkalanka, S. S., Vo, A., Gopalakrishnan, G., Kirby, R. M., and Thakur, R. How formal dynamic verification tools facilitate novel concurrency visualizations. In EuroPVM/MPI (2009), pp. 261–270.
[2] Avrunin, G. S., Siegel, S. F., and Siegel, A. R. Finite-state verification for high performance computing. In Proceedings of the Second International Workshop on Software Engineering for High Performance Computing System Applications, St. Louis, Missouri, USA, May 15, 2005 (2005), P. M. Johnson, Ed., pp. 68–72.
[3] Ball, T., Cook, B., Levin, V., and Rajamani, S. K. SLAM and Static Driver Verifier: Technology transfer of formal methods inside Microsoft. In Proceedings of IFM 04: Integrated Formal Methods (April 2004), Springer, pp. 1–20.
[4] CHESS: Find and reproduce heisenbugs in concurrent programs. http://research.microsoft.com/en-us/projects/chess. Accessed 12/8/09.
[5] Clarke, E. M., Grumberg, O., and Peled, D. A. Model Checking. MIT Press, 2000.
[6] Concurrency education. http://www.cs.utah.edu/formal_verification/Concurrency_Education.
[7] Exascale computing study report. http://users.ece.gatech.edu/~mrichard/ExascaleComputingStudyReports/ECS_reports.htm. Accessed 12/16/09.
[8] Ferrante, J., and McKinley, K. S., Eds. Proceedings of the ACM SIGPLAN 2007 Conference on Programming Language Design and Implementation, San Diego, California, USA, June 10-13, 2007 (2007), ACM.
[9] Flanagan, C., and Godefroid, P. Dynamic partial-order reduction for model checking software. In POPL ’05: Proceedings of the 32nd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (2005), pp. 110–121.
[10] Godefroid, P. Partial-Order Methods for the Verification of Concurrent Systems - An Approach to the State-Explosion Problem, vol. 1032 of Lecture Notes in Computer Science. Springer, 1996.
[11] Godefroid, P. Model checking for programming languages using VeriSoft. In POPL 97: Principles of Programming Languages (1997), pp. 174–186.
[12] Godefroid, P., and Wolper, P. Using partial orders for the efficient verification of deadlock freedom and safety properties. Formal Methods in System Design 2, 2 (1993), 149–164.
[13] Gopalakrishnan, G. Practical formal verification of MPI and thread programs, 2009. Half-day tutorial, 23rd International Conference on Supercomputing, ICS 2009.
[14] Gopalakrishnan, G., and Kirby, R. M. Practical MPI and pthread dynamic verification, Nov. 2009. Half-day tutorial, 16th International Symposium on Formal Methods, FM 2009.
[15] Gopalakrishnan, G., and Kirby, R. M. Dynamic verification of message passing and threading, Jan. 2010. Half-day tutorial, 15th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, PPoPP 2010.
[16] Gopalakrishnan, G., Kirby, R. M., and Vo, A. Practical formal verification of MPI and thread programs, Sept. 2009. Full-day tutorial, EuroPVM/MPI 2009.
[17] Havelund, K., and Pressburger, T. Model checking Java programs using Java PathFinder. International Journal on Software Tools for Technology Transfer 2, 4 (Apr. 2000).
[18] Holzmann, G. J. The Spin Model Checker. Addison-Wesley, Boston, 2004.
[19] A modest proposal for petascale computing. http://www.hpcwire.com/blogs/17909359.html. Mentions energy costs of petascale machines.
[20] PE MPI buffer management for eager protocol. http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=/com.ibm.cluster.pe431.mpiprog.doc/am106_buff.html.
[21] III, J. W., and Bova, S. Where is the overlap? In Message Passing Interface Developer’s and User’s Conference (MPIDC) (1999).
[22] Intel Message Checker. http://www.intel.com/cd/software/products/asmo-na/eng/227074.htm.
[23] GEM - ISP Eclipse plugin. http://www.cs.utah.edu/formal_verification/ISP-Eclipse.
[24] ISP. http://www.cs.utah.edu/formal_verification/ISP_Release.
[25] Test results comparing ISP, Marmot, and mpirun. http://www.cs.utah.edu/fv/ISP_Tests.
[26] Krammer, B., Bidmon, K., Müller, M. S., and Resch, M. M. MARMOT: An MPI analysis and checking tool. In Parallel Computing 2003 (Sept. 2003).
[27] LAM/MPI parallel computing. http://www.lam-mpi.org/.
[28] Lastovetsky, A., Kechadi, T., and Dongarra, J., Eds. Recent Advances in Parallel Virtual Machine and Message Passing Interface, 15th European PVM/MPI User’s Group Meeting, Proceedings (2008), vol. 5205 of LNCS, Springer.
[29] Li, G., DeLisi, M., Gopalakrishnan, G., and Kirby, R. M. Formal specification of the MPI-2.0 standard in TLA+. In Principles and Practices of Parallel Programming (PPoPP) (2008), pp. 283–284.
[30] Slouching towards exascale: Programming models for high-performance computing. http://www.cs.utah.edu/ec2. Accessed 12/16/09.
[31] Manohar, R., and Martin, A. J. Slack elasticity in concurrent computing. In Proceedings of the Fourth International Conference on the Mathematics of Program Construction (1998), Springer-Verlag, pp. 272–285. Lecture Notes in Computer Science 1422.
[32] Matlin, O. S., Lusk, E. L., and McCune, W. Spinning parallel systems software. In Proceedings of the 9th International SPIN Workshop on Model Checking of Software (London, UK, 2002), Springer-Verlag, pp. 213–220.
[33] MPI 2.1 Standard. http://www.mpi-forum.org/docs/.
[34] MPICH2: High performance and widely portable MPI. http://www.mcs.anl.gov/mpi/mpich.
[35] Musuvathi, M., Park, D., Chou, A., Engler, D., and Dill, D. L. CMC: A pragmatic approach to model checking real code. In Proceedings of the Fifth Symposium on Operating System Design and Implementation (December 2002).
[36] Musuvathi, M., and Qadeer, S. Iterative context bounding for systematic testing of multithreaded programs. In Programming Languages Design and Implementation (PLDI) 2007 (2007), pp. 446–455.
[37] Open MPI: Open source high performance MPI. http://www.open-mpi.org/.
[38] Pacheco, P. Parallel Programming with MPI. Morgan Kaufmann, 1996. ISBN 1-55860-339-5.
[39] Palmer, R., Barrus, S., Yang, Y., Gopalakrishnan, G., and Kirby, R. M. Gauss: A framework for verifying scientific computing software. In Workshop on Software Model Checking (2005). Electronic Notes in Theoretical Computer Science (ENTCS), No. 953.
[40] Palmer, R., Delisi, M., Gopalakrishnan, G., and Kirby, R. M. An approach to formalization and analysis of message passing libraries. In Formal Methods for Industry Critical Systems (FMICS 2007) (2008), S. Leue and P. Merino, Eds., pp. 164–181. LNCS 4916.
[41] Palmer, R., Gopalakrishnan, G., and Kirby, R. M. Formal specification and verification using +CAL: An experience report. In Proceedings of Verify’06 (FLoC 2006) (2006).
[42] Palmer, R., Gopalakrishnan, G., and Kirby, R. M. Semantics driven dynamic partial-order reduction of MPI-based parallel programs. In Parallel and Distributed Systems: Testing and Debugging (PADTAD - V) (2007), pp. 43–53.
[43] ParMETIS - Parallel graph partitioning and fill-reducing matrix ordering. http://glaros.dtc.umn.edu/gkhome/metis/parmetis/overview.
[44] Pervez, S., Palmer, R., Gopalakrishnan, G., Kirby, R. M., Thakur, R., and Gropp, W. Practical model checking method for verifying correctness of MPI programs. In EuroPVM/MPI (2007), pp. 344–353. LNCS 4757.
[45] Quinlan, D., Vuduc, R., and Misherghi, G. Techniques for the specification of bug patterns. In Parallel and Distributed Systems: Testing and Debugging (PADTAD) (2007).
[46] Sharma, S., Vakkalanka, S., Gopalakrishnan, G., Kirby, R. M., Thakur, R., and Gropp, W. A formal approach to detect functionally irrelevant barriers in MPI programs. In Lastovetsky et al. [28].
[47] Sharma, S. V., Gopalakrishnan, G., and Kirby, R. M. A survey of MPI related debuggers and tools. Tech. Rep. UUCS-07-015, University of Utah, School of Computing, 2007. http://www.cs.utah.edu/research/techreports.shtml.
[48] Siegel, S. F. Efficient verification of halting properties for MPI programs with wildcard receives. In Verification, Model Checking, and Abstract Interpretation: 6th International Conference, VMCAI 2005, Paris, January 17–19, 2005, Proceedings (2005), R. Cousot, Ed., vol. 3385 of LNCS, pp. 413–429.
[49] Siegel, S. F. The MADRE web page. http://vsl.cis.udel.edu/madre, 2008.
[50] Siegel, S. F. The MPI-Spin web page. http://vsl.cis.udel.edu/mpi-spin, 2008.
[51] Siegel, S. F., and Avrunin, G. S. Verification of MPI-based software for scientific computation. In Model Checking Software: 11th International SPIN Workshop, Barcelona, Spain, April 1–3, 2004, Proceedings (2004), S. Graf and L. Mounier, Eds., vol. 2989 of LNCS, Springer-Verlag, pp. 286–303.
[52] Siegel, S. F., and Avrunin, G. S. Modeling wildcard-free MPI programs for verification. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (Chicago, IL, June 2005), pp. 95–106.
[53] Siegel, S. F., and Siegel, A. R. MADRE: The Memory-Aware Data Redistribution Engine. In Lastovetsky et al. [28].
[54] “Spin - Formal Verification” web site. http://spinroot.com, 2008.
[55] Stack Trace Analysis Tool. https://computing.llnl.gov/code/STAT.
[56] Strout, M. M., Kreaseck, B., and Hovland, P. D. Data-flow analysis for MPI programs. In International Conference on Parallel Programming (ICPP) (2006), pp. 175–184.
[57] TotalView concurrency tool. http://www.totalviewtech.com.
[58] Vakkalanka, S., DeLisi, M., Gopalakrishnan, G., and Kirby, R. M. Scheduling considerations for building dynamic verification tools for MPI. In Parallel and Distributed Systems - Testing and Debugging (PADTAD-VI) (Seattle, WA, July 2008).
[59] Vakkalanka, S., DeLisi, M., Gopalakrishnan, G., Kirby, R. M., Thakur, R., and Gropp, W. Implementing efficient dynamic formal verification methods for MPI programs. In Lastovetsky et al. [28].
[60] Vakkalanka, S., Gopalakrishnan, G., and Kirby, R. M. Dynamic verification of MPI programs with reductions in presence of split operations and relaxed orderings. In Computer Aided Verification (CAV 2008) (2008), pp. 66–79.
[61] Vakkalanka, S., Sharma, S. V., Gopalakrishnan, G., and Kirby, R. M. ISP: A tool for model checking MPI programs. In Principles and Practices of Parallel Programming (PPoPP) (2008), pp. 285–286.
[62] Vakkalanka, S., Vo, A., Gopalakrishnan, G., and Kirby, R. M. Reduced execution semantics of MPI: From theory to practice. In FM 2009 (Nov. 2009), pp. 724–740.
[63] Vakkalanka, S. S., Szubzda, G., Vo, A., Gopalakrishnan, G., Kirby, R. M., and Thakur, R. Static-analysis assisted dynamic verification of MPI Waitany programs (poster abstract). In PVM/MPI (2009), M. Ropo, J. Westerholm, and J. Dongarra, Eds., vol. 5759 of Lecture Notes in Computer Science, Springer, pp. 329–330.
[64] Vetter, J. S., and de Supinski, B. R. Dynamic software testing of MPI applications with Umpire. In Supercomputing ’00: Proceedings of the 2000 ACM/IEEE Conference on Supercomputing (CDROM) (2000), IEEE Computer Society. Article 51.
[65] Visser, W., Havelund, K., Brat, G., and Park, S. Model checking programs. In The Fifteenth IEEE International Conference on Automated Software Engineering (ASE’00) (Sept. 2000).
[66] Vo, A., Vakkalanka, S., DeLisi, M., Gopalakrishnan, G., Kirby, R. M., and Thakur, R. Formal verification of practical MPI programs. In Principles and Practices of Parallel Programming (PPoPP) (2009), pp. 261–269.
[67] Vo, A., Vakkalanka, S., Williams, J., Gopalakrishnan, G., Kirby, R. M., and Thakur, R. Sound and efficient dynamic verification of MPI programs with probe non-determinism. In EuroPVM/MPI (Sept. 2009), pp. 271–281.
[68] Vuduc, R., Schulz, M., Quinlan, D., de Supinski, B., and Saebjornsen, A. Improved distributed memory applications testing by message perturbation. In Parallel and Distributed Systems: Testing and Debugging (PADTAD - IV) (2006).
[69] Yang, Y., Chen, X., Gopalakrishnan, G., and Kirby, R. M. Efficient stateful dynamic partial order reduction. In SPIN ’08: Proceedings of the 15th International SPIN Workshop on Model Checking Software (2008), Lecture Notes in Computer Science, Springer.