
F. A. BIANCHI AND A. MARGARA AND M. PEZZÈ 1

A Survey of Recent Trends in Testing Concurrent Software Systems

Francesco A. Bianchi, Alessandro Margara, Mauro Pezzè

Abstract—Many modern software systems are composed of multiple execution flows that run simultaneously, spanning from applications designed to exploit the power of modern multi-core architectures to distributed systems consisting of multiple components deployed on different physical nodes. We collectively refer to such systems as concurrent systems. Concurrent systems are difficult to test, since the faults that derive from their concurrent nature depend on the interleavings of the actions performed by the individual execution flows. Testing techniques that target these faults must take into account the concurrency aspects of the systems. The increasingly rapid spread of parallel and distributed architectures has led to a deluge of concurrent software systems, and to an explosion of testing techniques for such systems in the last decade. The current lack of a comprehensive classification, analysis and comparison of the many testing techniques for concurrent systems limits the understanding of the strengths and weaknesses of each approach and hampers future advancements in the field. This survey provides a framework to capture the key features of the available techniques to test concurrent software systems, identifies a set of classification criteria to review and compare the available techniques, and discusses in detail their strengths and weaknesses, leading to a thorough assessment of the field and paving the way for future progress.

Index Terms—Survey, Classification, Testing, Concurrent Systems, Parallel Systems, Distributed Systems


1 INTRODUCTION

Concurrent systems are common in several application domains: many interactive applications exploit the multi-threaded paradigm to decouple the input-output processing from the back-end computation; applications developed for single nodes often exploit multiple threads to enable parallel computations on multi-core architectures; Web applications implement the client-server communication paradigm with client-side and server-side computations; mobile applications often access data and interact with remote servers; peer-to-peer services coordinate a multitude of computing nodes.

Concurrent software systems are composed of multiple execution flows that execute simultaneously, and the need to synchronize the execution flows leads to new problems and introduces new design and verification challenges. The behavior of concurrent systems depends not only on the sequence of actions executed within each individual flow, but also on the interleavings of the actions in the different execution flows. Wrong interleavings may lead to concurrency faults regardless of the correctness of the computation of each execution flow. The problem of developing reliable concurrent systems has attracted a lot of interest in the software engineering community, and has led to several solutions for designing [1], implementing [2], refactoring [3], modeling, verifying and validating concurrent software systems [4].

Concurrency faults are intrinsically non-deterministic, since they occur only in the presence of specific interleavings, and the interleavings depend on execution conditions that are not under the direct control of the program. The testing techniques that address the problem of efficiently exploring the space of the interleavings consider one or more of the following activities: (i) generating test cases, which are sequences of operations that stimulate the system; (ii) selecting a subset of interleavings for the execution flows; (iii) executing the system with the selected test cases and interleavings and validating the results.

• The authors are with the Faculty of Informatics, Università della Svizzera italiana, USI Lugano. E-mail: [email protected]

Manuscript received June 14, 2016; revised December 16, 2016, and 5 May, 2017.

Although the problem of testing concurrent systems has attracted the attention of the research community since the late seventies, and interest has grown considerably in the last decade, to the best of our knowledge a precise survey and classification of the progress and the results in the field is still missing.

In this paper, we provide a comprehensive survey of the state of the art in testing concurrent software systems. We studied the recent literature by systematically browsing the main publishers and scientific search engines, and we traced back the results to the seminal work of the last forty years. We present a general framework that captures the different aspects of the problem of testing concurrent software systems and that we use to identify a set of classification criteria that drive the survey of the different approaches. The survey classifies and compares the state-of-the-art techniques, discusses their advantages and limitations, and indicates open problems and possible research directions in the area of testing concurrent software systems.

The remainder of the paper is organized as follows. Section 2 introduces background information on concurrent systems, defines the terminology used in the paper and presents a general framework that captures the key aspects of the testing techniques for concurrent systems. Section 3 presents the historical trends in the research on testing concurrent systems and describes the methodology we adopt to select the relevant work. Section 4 introduces a set of criteria to classify testing techniques for concurrent systems. Section 5 surveys and classifies the state-of-the-art techniques. Section 6 discusses the key observations that emerge from the analysis of the literature. Section 7 briefly overviews some important areas that lie on the borders of the scope of this survey, and Section 8 summarizes the contribution of the paper.

2 CONCURRENT SOFTWARE SYSTEMS

In this section, we define the scope of our analysis and introduce the terminology that we adopt in this paper. To do so, we define a conceptual framework that captures the main elements of the different approaches to test concurrent software systems. In the remainder of the paper, we use the framework to structure our survey.

2.1 Concurrent Systems

In this survey, we follow the definition of concurrent system proposed by Andrews and Schneider in their popular survey and book [5], [6], which represent classic references and accommodate the wide range of heterogeneous techniques and tools presented in the literature.

A system is concurrent if it includes a number of execution flows¹ that can progress simultaneously, and that interact with each other. This definition encompasses both flows that execute in overlapping time frames, like concurrent programs executed on multi-core, multi-processor parallel and multi-node distributed architectures, and flows that execute only in non-overlapping frames, like concurrent programs executed on single-core architectures. Depending on the specific architecture and programming paradigm, execution flows can be concretely implemented as processes on different physical machines, processes within the same machine or threads within the same process, as is common in modern programming languages such as C++, Java, C# and Erlang.

We distinguish two classes of concurrent systems based on the mechanism they adopt to enable the interaction between execution flows: shared memory and message passing systems. In shared memory systems, execution flows interact by accessing a common memory. In message passing systems, execution flows interact by exchanging messages. Message passing can be used either by execution flows hosted on the same physical node or on different physical nodes (distributed systems). Conversely, shared memory mechanisms are only possible when the execution flows are located on the same node (as in multi-threaded systems).

We model a shared memory as a repository of one or more data items. A data item has an associated value and type. The type of a data item determines the set of values it is allowed to assume. We model the interaction of an execution flow f with the repository using two primitive operations: write operations $w_x(v)$, meaning that f updates the value of the data item x to v, and read operations $r_x(v)$, meaning that f reads the value v of x. Operations are composed of one or more instructions. Instructions are atomic, meaning that their execution cannot be interleaved with other instructions, while operations are in general not atomic. This model captures both operations on simple data, like primitive variables in C, and operations on complex data structures like Java objects, where types are classes, data items are objects and operations are methods that can operate only on some of the fields of the objects.

1. Although in the paper we adopt the terminology of Andrews and Schneider, we prefer the term execution flow over process to avoid biases towards a specific technology.

We model message passing systems using two primitive operations: send operations $s_f(m)$ that send a message m to the execution flow f, and receive operations $r_f(m)$ that receive a message m from the execution flow f. Message passing can be either synchronous or asynchronous. An execution flow f that sends a synchronous message $s_{f'}(m)$ to an execution flow f′ must wait for f′ to receive the message m before continuing, while an execution flow f that sends an asynchronous message $s_{f'}(m)$ to an execution flow f′ can progress immediately without waiting for m to be received by f′.

The message passing paradigm can be mapped to the shared memory paradigm by modeling a send primitive as a write operation on a shared queue and a receive primitive as a read operation on the same shared queue. Thus, without loss of generality, we refer to shared memory systems in most of the definitions and examples presented in this survey.
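The mapping from message passing to shared memory can be illustrated with a minimal Python sketch. The class `SharedQueue` and the function `run_demo` are hypothetical names introduced only for this illustration, and the sketch assumes asynchronous sends with a blocking receive.

```python
import threading
from collections import deque


class SharedQueue:
    """A queue kept in shared memory and guarded by a condition variable."""

    def __init__(self):
        self._items = deque()
        self._cond = threading.Condition()

    def send(self, message):
        # A send primitive is modeled as a write operation on the shared queue.
        with self._cond:
            self._items.append(message)
            self._cond.notify()

    def receive(self):
        # A receive primitive is modeled as a (blocking) read on the same queue.
        with self._cond:
            while not self._items:
                self._cond.wait()
            return self._items.popleft()


def run_demo():
    """One sender flow and one receiver flow exchange three messages."""
    channel = SharedQueue()
    received = []

    def receiver():
        for _ in range(3):
            received.append(channel.receive())

    flow = threading.Thread(target=receiver)
    flow.start()
    for message in ("m1", "m2", "m3"):
        channel.send(message)  # asynchronous: the sender does not wait
    flow.join()
    return received
```

With a single sender and a single receiver the queue preserves the sending order, so the receiver observes the messages in the order in which they were written.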

2.2 Interleaving of Execution Flows

The behavior of a concurrent system depends not only on the input parameters and the sequences of instructions of the individual flows, but also on the interleaving of instructions from the different execution flows that comprise the system.

Following the vast majority of approaches that we discuss in Section 5, we introduce the main concepts of concurrency under the assumption of a sequentially consistent model [7]. This model guarantees that all the execution flows in a concurrent system observe the same order of instructions, and that this order preserves the order of instructions defined in the individual execution flows. We discuss the implications of relaxing this assumption at the end of this subsection, and in the survey we consider approaches regardless of this assumption.

Under the assumption of sequential consistency, we can model the interleaving of instructions of multiple execution flows in a concrete program execution with a history, which is an ordered sequence of instructions of the different execution flows.

In a shared memory system, histories include sequences of invocations of read and write operations on data items. Since in general the operations on shared data items are not atomic, we model the invocation and the termination of an operation op as two distinct instructions. The execution of an operation o′ overlaps the execution of another operation o if the invocation of o′ occurs between the invocation and the termination of o. In a message passing system, histories include sequences of atomic send and receive operations.

Given a concrete execution ex of a concurrent system S, a history $H_{ex}$ is a sequence of instructions that (i) contains the union of all and only the instructions of the individual execution flows that comprise ex, and (ii) preserves the order of the individual execution flows: for all instructions $o_i$ and $o_j$ that occur in ex and belong to the same execution flow f, if $o_i$ occurs before $o_j$ in f, then $o_i$ occurs before $o_j$ in $H_{ex}$.

We use the term interleaving of an execution ex to indicate the order of instructions defined in the history $H_{ex}$.
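The two conditions of the definition can be made concrete with a small Python sketch. The representation of instructions as `(flow, label)` pairs and the names `interleavings` and `is_history` are assumptions made for this illustration only.

```python
from typing import Dict, Iterator, List, Tuple

Instruction = Tuple[str, str]  # (execution flow name, instruction label)


def interleavings(flows: Dict[str, List[str]]) -> Iterator[List[Instruction]]:
    """Yield every sequence that preserves the order within each flow."""
    runnable = [name for name, instrs in flows.items() if instrs]
    if not runnable:
        yield []
        return
    for name in runnable:
        # Schedule the next instruction of this flow, then recurse on the rest.
        remaining = dict(flows)
        remaining[name] = flows[name][1:]
        for tail in interleavings(remaining):
            yield [(name, flows[name][0])] + tail


def is_history(candidate: List[Instruction], flows: Dict[str, List[str]]) -> bool:
    """Check conditions (i) and (ii) by projecting the candidate on each flow."""
    if any(flow not in flows for flow, _ in candidate):
        return False
    for name, instrs in flows.items():
        projection = [label for flow, label in candidate if flow == name]
        if projection != instrs:  # missing, extra or reordered instructions
            return False
    return True
```

Projecting a candidate sequence on a flow and comparing it with the instructions of that flow checks at once that no instruction is missing or added and that the order within the flow is preserved.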

Listing 1: A non-deterministic concurrent system

begin f_1
  if (x==0) {
    print OK
  }
end f_1

begin f_2
  x=1
end f_2

Listing 1 exemplifies the impact of the interleaving of instructions from multiple execution flows on the result of an execution. In Listing 1 both execution flows f_1 and f_2 access a common data item x with initial value 0: f_2 writes 1 to x, while f_1 prints OK if it reads 0 for x. Multiple interleavings are possible. If the write operation x = 1 of f_2 occurs before the read operation x == 0 of f_1 in the history, then f_1 reads 1 and does not print anything; otherwise f_1 reads 0 and prints OK.²
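As a hedged illustration, the following Python sketch replays the code of Listing 1 under both possible orders of its two instructions, assuming that the read and the write of x are atomic. The step labels and the function `run` are hypothetical names used only here.

```python
from itertools import permutations

READ = "f_1: if (x==0) print OK"
WRITE = "f_2: x=1"


def run(order):
    """Execute the two atomic instructions of Listing 1 in the given order."""
    x = 0          # initial value of the shared data item
    output = []
    for step in order:
        if step == READ:
            if x == 0:   # the condition in the code of f_1
                output.append("OK")
        elif step == WRITE:
            x = 1
    return output


# Map each interleaving to the output it produces.
outcomes = {order: run(order) for order in permutations([READ, WRITE])}
```

The two interleavings produce different outputs for the same input, which is the source of non-determinism that the testing techniques surveyed in this paper have to control.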

Systems that do not assume sequential consistency refer to relaxed memory models. Examples of systems that refer to relaxed models are shared memory systems where different execution flows may observe different orders of operations due to a lazy synchronization between the caches of multiple cores in a multi-core architecture. Several popular programming languages refer to relaxed models by allowing the compiler to re-order the operations within a single execution flow to improve the performance. This is allowed in the memory models of both Java [8], [9] and C++ [10].

The histories of relaxed systems may violate the properties that characterize the history of sequentially consistent systems, leading to new types of possible concurrency faults that cannot occur under the sequential consistency hypothesis. In this survey we review both the many approaches that work under the sequential consistency hypothesis and the few approaches that extend to relaxed models.

2.3 Synchronization Mechanisms

Concurrent programming languages offer various synchronization mechanisms to constrain the order of instructions and thus prevent erroneous behaviors. The synchronization mechanisms depend on the concurrency paradigm, the granularity of the synchronization structures and the constraints imposed on the system architecture. For instance, Java offers synchronized blocks and atomic instructions to assure that program blocks are executed without the interleaving of instructions of other execution flows [11]. Other programming languages like C and C++ offer locks, mutexes and semaphores to constrain the concurrent execution of code regions. Yet other concurrent programming environments, like POSIX threads and OpenMP, offer barrier synchronization to constrain the access to code regions executed concurrently by multiple execution flows: barriers and phasers introduce program points that all the execution flows in a group must reach before any of them is allowed to proceed [12]. In the context of message passing, programming languages and libraries that implement the actor-based paradigm ensure that individual messages are processed atomically and in isolation [13]. Synchronous message passing ensures that the sender of a message can progress only after the message has been successfully delivered to the recipient [14].

2. In this example, we assume that read and write of x are atomic. Relaxing the atomicity assumption would produce even more interleavings.
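Two of the synchronization mechanisms discussed in this subsection, locks and barriers, can be sketched with the Python standard `threading` module. The functions `locked_counter` and `barrier_phases` and their parameters are hypothetical and serve only as an illustration; other languages offer analogous constructs.

```python
import threading


def locked_counter(n_threads=4, increments=1000):
    """Increment a shared counter under mutual exclusion."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments):
            with lock:  # only one flow at a time executes the update
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


def barrier_phases(n_threads=3):
    """Record two phases of each flow, separated by a barrier."""
    barrier = threading.Barrier(n_threads)
    log, log_lock = [], threading.Lock()

    def worker(i):
        with log_lock:
            log.append(("phase 1", i))
        barrier.wait()  # no flow proceeds until all flows have arrived
        with log_lock:
            log.append(("phase 2", i))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

The lock removes the interleavings in which two flows update the counter at the same time, while the barrier removes the interleavings in which a flow starts its second phase before the others have completed the first one.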

2.4 Testing Concurrent Systems

In this paper we focus on software testing techniques that target concurrency faults, which are faults caused by unexpected interleavings of instructions of otherwise correct execution flows. Concurrency faults can be extremely hard to reveal and reproduce, since they manifest only in the presence of specific interleavings that may be rarely executed. To expose concurrency faults, testing techniques for concurrent systems need to sample not only a potentially infinite input space, but also the space of possible interleavings, which can grow exponentially with the number of execution flows and the number of instructions that comprise the flows.
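To give a sense of this growth, the following Python sketch counts the interleavings of flows whose instructions are atomic and not constrained by any synchronization, which is an assumption of this illustration. Under that assumption the count is the multinomial coefficient of the flow lengths; `count_interleavings` is a hypothetical helper name.

```python
from math import factorial


def count_interleavings(lengths):
    """Number of interleavings of flows with the given instruction counts.

    Assumes atomic instructions and no synchronization constraints, so the
    result is (n_1 + ... + n_k)! / (n_1! * ... * n_k!).
    """
    total = factorial(sum(lengths))
    for n in lengths:
        total //= factorial(n)  # exact: every partial quotient is an integer
    return total
```

For example, two flows of two instructions each admit 6 interleavings, while two flows of ten instructions each already admit 184756; synchronization reduces these numbers, so they are best read as upper bounds.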

The many approaches for testing concurrent systems that have been proposed so far address different aspects of the problem. Our detailed analysis of the literature led to a simple conceptual framework that captures the different aspects of the problem and relates the many approaches for testing concurrent systems. Figure 1 presents the conceptual framework that we use to provide a comprehensive view of the problem and to organize this survey.

Approaches for testing concurrent systems deal with specific types of target systems and address one or more of the three main aspects of the problem visualized with rectangles in Figure 1: generating test cases, selecting interleavings and comparing the results with oracles. Generating test cases amounts to sampling the program input space and producing a finite set of test cases to exercise the target system. Selecting interleavings amounts to augmenting the test cases with different interleavings of the execution flows to exercise the operations that process the same input data in different orders. Comparing the results with oracles amounts to checking the behavior of the target system with respect to some oracles. The approaches that we found in the literature focus on either generating test cases or selecting interleavings, sometimes dealing with comparing with oracles as well.

Figure 1 presents a conceptual framework for the testing techniques, but does not prescribe a specific process. Some approaches may first generate a set of test cases and a set of relevant interleavings and then compare the execution results with oracles, while other approaches may alternate the selection of interleavings and the comparison with oracles by executing each interleaving as soon as it is identified.

The approaches for generating test cases sample the input space to produce a finite set of test cases by considering the target system. They optionally also consider a target property of interleaving, a system model that provides additional information about the target system, or both.


Fig. 1: A general framework for testing concurrent software systems. [Diagram elements: Target System, Input, System Model, Property of Interleavings, Testing Technique, Generating Test Cases, Test Cases, Selecting Interleavings, Comparing with Oracles, Oracle, Output.]

The approaches for selecting interleavings identify a subset of relevant interleavings to be executed, and target either the interleaving space as a whole or some specific properties of interleaving. The techniques that target the interleaving space as a whole, hereafter space exploration techniques, explore the space of interleavings randomly, exhaustively or driven by some coverage criteria or heuristics. Two relevant classes of space exploration techniques are stress testing and bounded search techniques.

As we discuss in detail in Section 4, properties of interleaving typically identify patterns of interactions across execution flows that are likely to expose concurrency faults. Approaches that target some interleaving properties, hereafter property based techniques, aim to identify the interleavings that are most likely to expose such patterns.

Some property based techniques use the property of interleavings not only to select a relevant subset of interleavings, but also to generate a set of test cases that can execute the identified interleavings. Such techniques steer the generation towards test cases that can manifest interleavings that expose the property of interest.

Most property based techniques rely on some dynamic or hybrid analysis to build an abstract model of the system and capture the order relations between the program instructions, and then exploit the model to identify a set of interleavings that expose the property of interest.

Some property based techniques, hereafter detection techniques, use the model to simply detect the presence of the pattern of interest in the analyzed trace. Other property based techniques, hereafter prediction techniques, also look for alternative interleavings that may expose the property of interest, usually relying on model checking or SAT/SMT solvers.

Approaches that also address the problem of comparing the results with oracles execute the system in a controlled environment that forces the selected interleavings and compare the results of the execution with the given oracle.

3 TRENDS IN RESEARCH ON TESTING CONCURRENT SYSTEMS

In this section we present an analysis of the research on testing concurrent systems conducted in the last fifteen years, referring to the seminal work of the last forty years.

Concurrency has been investigated since the early sixties with pioneering work on models for concurrent systems, like the research of Carl Adam Petri [15] and the inspiring work on process algebras of Tony Hoare [16] and Robin Milner [17].

In the seventies, with the emergence of distributed architectures, the focus of the research extended towards the analysis and verification of distributed systems with Lipton’s influential work on the theory of reduction [18] and Lamport’s seminal work on distributed systems [7].

The nineties saw the term testing concurrent systems appear regularly in the literature [19], [20], [21], together with analysis techniques that are at the core of many popular approaches for testing concurrent systems [22], [23], [24], [25], [26].

The research on testing concurrent systems has grown rapidly in the last fifteen years, fostered by the rapid spread of multi-core technologies, distributed, Web and mobile architectures and novel concurrent paradigms. Our survey indicates that most of the concurrent software testing techniques developed in the last fifteen years target shared memory systems, and only a few cope with (distributed) message passing systems, which are addressed mainly by run-time monitoring and model based verification approaches.

To provide a comprehensive survey of the emerging trends in testing concurrent software systems, we systematically reviewed the literature from 2000 to 2015. We (i) searched the online repositories of the main scientific publishers, including IEEE Xplore, ACM Digital Library, Springer Online Library and Elsevier Online Library, and more generally the Web through popular online search engines such as Google Scholar and Microsoft Academic Search, collecting papers that were published from the year 2000 and that present one of the following sets of keywords in their title or abstract: “testing + concurrent”, “testing + multi-thread”, “testing + parallel”, “testing + distributed”, (ii) considered all publications that are cited by or cite the papers in our repository, and that match the same criteria, (iii) manually analyzed the proceedings of the conferences and the journals where the papers in our repository appear, (iv) filtered out the papers outside the scope of our analysis, as defined in Section 2, for example papers on hardware testing and papers on theoretical aspects or purely monitoring approaches with no direct application to software testing (see Section 7 for an overview of such papers), and (v) filtered out workshop papers and preliminary work that has been later subsumed by conference or journal publications.

Fig. 2: Number of publications from 2000 to 2015 that witness novel research contributions and address different concurrency properties. [Chart; year axis from 2000 to 2015, count axis from 0 to 14; legend: Data race, Atomicity, Deadlock, Combined, Order, Exploration.]

Fig. 3: Number of publications from 2000 to 2015 that witness novel research contributions in different research communities. [Chart; year axis from 2000 to 2015, count axis from 0 to 14; legend: Software engineering, Programming languages, Systems, Formal methods.]

We selected 94 papers that were published in the last 15 years and that we identified as unique representatives of clusters of related publications. The papers focus on the main concurrency properties, and reflect the interest of different communities in the area of testing concurrent systems. We discuss the possible biases of our choices in Section 6.8.

Figures 2 and 3 show the distribution of the papers over the years according to the concurrency properties they deal with and the research communities they refer to³, respectively. The figures indicate a significant increase of interest and results: more than 80% of the papers have been published in the second half of the considered period (2008-2015), and almost 65% have appeared since 2010. Figure 2 indicates that the recent research has focused on specific interleaving properties, which we use in Section 4 as the main classification criterion. Figure 3 suggests a growing interest in the software engineering and programming languages communities, and a stable interest in the systems and formal methods communities, whose main focus has been on techniques for monitoring concurrent systems and theoretical investigations, respectively. We briefly discuss both topics in the related work Section 7 since they are relevant but not central to the topic of this paper.

4 TOWARDS A CLASSIFICATION SCHEMA

Our analysis of the literature led to the definition of the framework of Figure 1 in Section 2, which inspired a classification schema presented in this section and used in Section 5 to organize our survey. Figure 4 overviews our classification schema, which includes seven distinct classification criteria that distinguish techniques based on:

Input. Most approaches assume the availability of a set of input test cases, a few approaches work on some models of the system under test, and yet others require both test cases and models.

Selection of interleavings. Many approaches refer to some properties of the interleavings to select a relevant subset (property based), other approaches either exhaustively explore the interleaving space or exploit some coverage criteria or heuristics (space exploration). Property based detection techniques simply check for the presence of patterns of interest in the execution traces, while prediction techniques also select new interleavings that potentially expose the property of interest.

Property of interleavings. Many approaches select interleavings referring to some specific concurrency properties: data race, atomicity violation, deadlock, combined or order violation properties.

Output and oracle. Some techniques simply report whether the explored interleavings satisfy the property of interest (property satisfying interleavings), while others identify failing executions according to some oracles, in the form of system crashes, deadlocks or violated assertions.

Guarantees. Different techniques offer various levels of assurance in terms of precision and correctness of their results.

Target systems. Most techniques target some specific kinds of systems that depend on the communication model (shared memory, message passing or general, that is, independent from the communication paradigm), the programming paradigm (either general or specific, such as object oriented, actor based or event based paradigms) and the consistency model (either sequential or relaxed).

3. We identify the communities through the publication venues.


Fig. 4: A classification schema for the approaches to test concurrent systems. [Diagram labels, grouped by top-level criterion: Input (Test case, System model); Selection of Interleavings (Property based: Detection, Prediction; Space exploration: Stress testing, Exhaustive exploration, Coverage criteria, Heuristics); Property of Interleavings (Data race, Atomicity violation, Deadlock, Combined, Order violation, with the sub-labels High level, Low level, Data centric, Code centric, Resource, Communication); Output / Oracle (Property satisfying interleavings; Failing execution: System crash, Deadlock, Violated assertion); Guarantees (Feasibility, Soundness, Completeness, Property Soundness, Property Completeness); Target System (Communication model: Shared memory, Message passing, General; Programming paradigm: General, Specific; Consistency model: Sequential, Relaxed); Testing Technique (Analysis: Dynamic, Hybrid; Type of testing: Functional, Non-functional; Granularity of testing: Unit, Integration, System; Scope of testing; Testing architecture: Centralized, Distributed sync., Distributed indep.).]

Testing technique. The techniques differ in terms of the type of analysis, which can be dynamic or hybrid, the type of testing, which can focus on either functional or non-functional properties, the granularity of testing (unit, integration or system testing), the scope of testing, and the testing architecture used to implement the technique, which can be either centralized or distributed.

In the remainder of this section, we discuss the classification criteria in detail.

4.1 Input

All the techniques take as input the system under test, which we keep implicit in our classification. Many techniques also require a set of test cases, while others generate test cases automatically, thus implementing the generating test cases feature of Figure 1.

Some techniques also require a system model that specifies either some relevant properties or the expected behavior of the system. For instance, some techniques rely on code annotations to identify either code blocks that are intended to be atomic or sets of data items that are assumed to be updated atomically [27]. For some techniques, the presence of a system model is optional: they work independently of an initial system model, but can benefit from an optional model to improve the accuracy of the approach.

4.2 Selection of Interleavings

Selecting interleavings is the primary objective of many approaches. Some techniques exploit properties of interest to select interleavings: we refer to them as property based. Other techniques explore the space of interleavings exhaustively or randomly, possibly exploiting heuristics or coverage criteria: we refer to them as space exploration techniques.

4.2.1 Property based techniques

Property based techniques select interleavings according to one or more properties of interest. They typically apply some form of analysis to an execution trace to identify relevant synchronization constraints between instructions. For instance, lockset analysis focuses on lock based synchronization and looks for accesses to shared data items that are not protected with locks [23]. Happens-before analysis extends this approach by capturing general order constraints. Different types of happens-before analysis apply to different synchronization mechanisms [28], and present various cost-accuracy trade-offs [29], [30].
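The idea behind lockset analysis can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the algorithm of any specific tool: the trace format `(flow, operation, target)` and the function `lockset_warnings` are assumptions, and refinements used in practice (for example, special handling of initialization or read-only sharing) are omitted.

```python
from collections import defaultdict


def lockset_warnings(trace):
    """Report data items accessed by several flows with no common lock.

    trace: list of (flow, op, target), with op in
    {"acquire", "release", "read", "write"}.
    """
    held = defaultdict(set)        # flow -> locks currently held
    candidates = {}                # data item -> locks held at every access so far
    accessors = defaultdict(set)   # data item -> flows that accessed it
    written = set()                # data items with at least one write

    for flow, op, target in trace:
        if op == "acquire":
            held[flow].add(target)
        elif op == "release":
            held[flow].discard(target)
        elif op in ("read", "write"):
            if target in candidates:
                candidates[target] &= held[flow]   # refine the candidate set
            else:
                candidates[target] = set(held[flow])
            accessors[target].add(flow)
            if op == "write":
                written.add(target)

    return sorted(
        item for item, locks in candidates.items()
        if not locks and len(accessors[item]) > 1 and item in written
    )
```

A data item is reported when the set of locks held at all of its accesses becomes empty, at least two flows access it and at least one access is a write. Since the check ignores other order constraints, it may report accesses that can never actually happen concurrently, which is one of the cost-accuracy trade-offs mentioned above.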

Property based techniques exploit the order information identified with the analysis to either simply understand whether the analyzed trace exposes the property of interest (detection techniques) or also identify alternative interleavings that can expose the property of interest (prediction techniques).

4.2.2 Space exploration techniques

Space exploration techniques explore the space of interleavings without referring to a specific property, and include:

Stress testing. Stress testing approaches execute the test suite several times, aiming to observe different interleavings. They do not offer any guarantee of observing a given portion of the interleaving space, and do not introduce any mechanism to improve the probability of executing new interleavings.

Exhaustive exploration. Exhaustive exploration approaches aim to execute all possible interleavings. Since in general the space of interleavings can be huge, these techniques either limit both the number of instructions and the execution flows of the input test cases, or introduce bounds to the exploration space [31]. They also often adopt reduction techniques such as dynamic partial order reduction [32] to avoid executing equivalent interleavings.

Coverage criteria. Coverage criteria identify the interleavings to exercise in terms of depth of the explored space. For example, in shared memory systems, a coverage criterion might require that for each pair of instructions $i_1$ and $i_2$ that belong to different execution flows and operate on the same data item d there should exist at least a test case that exercises the interleaving in which $i_1$ occurs before $i_2$ and one that exercises the interleaving in which $i_2$ occurs before $i_1$.
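The example coverage criterion above can be sketched in Python as follows. The encoding of instructions as identifiers mapped to `(flow, data item)` pairs and the functions `obligations` and `covered` are hypothetical and introduced only for this illustration.

```python
from itertools import combinations


def obligations(accesses):
    """Compute the ordered pairs that the criterion requires to exercise.

    accesses: instruction id -> (execution flow, data item).
    """
    required = set()
    for a, b in combinations(sorted(accesses), 2):
        (flow_a, item_a), (flow_b, item_b) = accesses[a], accesses[b]
        if flow_a != flow_b and item_a == item_b:
            required.add((a, b))  # a occurs before b
            required.add((b, a))  # b occurs before a
    return required


def covered(histories, required):
    """Return the obligations exercised by at least one executed history."""
    done = set()
    for history in histories:
        position = {instr: k for k, instr in enumerate(history)}
        for first, second in required:
            if (first in position and second in position
                    and position[first] < position[second]):
                done.add((first, second))
    return done
```

The difference between `obligations(...)` and `covered(...)` gives the orders that still have to be exercised, which a test driver could use to decide which interleaving to force next.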

Heuristics. Heuristics guide the exploration of the interleaving space. For instance, some techniques prioritize interleavings by their diversity with respect to the executed ones. Similarly, some other techniques prioritize interleavings that can be obtained by introducing a bounded number of scheduling constraints.

4.3 Property of Interleavings

Property based techniques target specific properties of interleavings, which are patterns of interactions between execution flows that are likely to violate the developers' assumptions on the order of execution of instructions. Some property based techniques target the classical properties of interleavings: data races, atomicity violations and deadlocks. Other techniques combine multiple classical properties. Yet other techniques focus on program-specific or domain-specific order violations.

4.3.1 Data Races

A data race occurs when two operations from different execution flows access the same data item d concurrently, at least one is a write operation, and no synchronization mechanism is used to control the (order of) accesses to d. A system is data race free if no data races can occur during its execution.

Listing 2: An example of data race

begin f_1
  x='AAAAAAAA'
end f_1

begin f_2
  x='BBBBBBBB'
end f_2

Listing 2 shows an example of data race. Both execution flows f_1 and f_2 write on the data item x of type string. Let us assume that the underlying programming language can write atomically 32 bits (4 characters). If f_1 and f_2 start writing on x concurrently, the resulting value of x can be any combination of 'AAAA' and 'BBBB', for example x='AAAABBBB', where the first 4 characters come from the write instruction in f_1 and the remaining ones from the write instruction in f_2.
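The possible outcomes of Listing 2 can be checked with a small simulation. This Python sketch assumes, as in the text, that each assignment is split into atomic 4-character writes performed from left to right; the chunking model and all helper names are illustrative assumptions.

```python
from itertools import combinations

CHUNK = 4  # assumed atomic write granularity: 32 bits = 4 characters


def chunk_writes(flow, value):
    """Split one string assignment into atomic chunk writes (flow, offset, text)."""
    return [(flow, i, value[i:i + CHUNK]) for i in range(0, len(value), CHUNK)]


def interleavings(a, b):
    """Yield every merge of two write sequences that keeps each flow's own order."""
    total = len(a) + len(b)
    for slots in combinations(range(total), len(a)):
        chosen, ia, ib = set(slots), iter(a), iter(b)
        yield [next(ia) if k in chosen else next(ib) for k in range(total)]


def final_value(schedule, length=8):
    """Apply the chunk writes in schedule order; the last write to a chunk wins."""
    x = [" "] * length
    for _flow, offset, text in schedule:
        x[offset:offset + len(text)] = text
    return "".join(x)


f1 = chunk_writes("f_1", "AAAAAAAA")
f2 = chunk_writes("f_2", "BBBBBBBB")
outcomes = sorted({final_value(s) for s in interleavings(f1, f2)})
print(outcomes)
```

Under these assumptions the simulation yields four distinct final values, including the mixed value 'AAAABBBB' discussed in the text.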

Data races might represent violations of atomicity assumptions on the execution of individual operations. Such violations break the serializability of the system behavior. The concept of serializability was originally defined in the context of database systems as a guarantee for the correctness of transactions [33]. In our context a history is serializable if it is equivalent to a serial history, which is a history in which all the atomicity assumptions are satisfied. Two histories are equivalent if they produce the same values for all the data items.⁴ A data race implies an uncontrolled access to a data item, which may or may not be an error. For instance, in the example of Listing 2 the value x='AAAABBBB' may be valid or not depending on the application logic.

Data race techniques target either low level or high level data races, and propose different kinds of analysis and algorithms. Techniques for detecting low level data races work at a fine granularity level, and typically consider accesses to individual memory locations. Techniques for detecting high level data races check misbehaviors due to different execution flows invoking concurrent operations on shared complex data structures, like public methods of an object or a library in an object oriented program.

Data races can also occur in programs that implement the message passing paradigm, when the code fragments that process two messages access a common data item, and at least one fragment modifies that data item.

4.3.2 Atomicity violations

Atomicity violations extend the concept of atomicity to sequences of operations. An atomicity violation occurs when a sequence of operations of an execution flow that is assumed to be executed atomically is interleaved with conflicting operations from other flows. Many atomicity violation techniques use serializability to validate the correctness of interleavings. However, checking for the serializability of an interleaving can be very expensive, since it requires comparing the interleaving with all the possible serial histories. Thus, many techniques check for specific patterns of interleavings that are known to be non serializable.

4. In the literature this property is referred to as view serializability, as opposed to conflict serializability [34].


Listing 3: An example of atomicity violation

begin f_1
  if (x>0)
    x = x-1
end f_1

begin f_2
  x = 0
end f_2

Listing 3 shows an example of atomicity violation. In the example, we assume that x initially holds a non-negative value, and we expect its value to always remain non-negative. The property is satisfied under the assumption that f_1 executes atomically, that is, the sequence of operations in f_1 is executed without interleaved operations of f_2. However, if the atomicity of f_1 is not properly enforced through synchronization mechanisms, the conflicting operation x=0 in f_2 can occur between the two operations of f_1, causing the value of x to become negative (-1).
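The three possible placements of the write in f_2 relative to the two operations of f_1 can be replayed directly. In this Python sketch the step names (`check`, `decrement`, `reset`) and the initial value x=1 are assumptions chosen for illustration.

```python
def run(schedule, x=1):
    """Replay one interleaving of Listing 3 and return the final value of x."""
    guard = False
    for step in schedule:
        if step == "check":        # f_1 evaluates the condition x > 0
            guard = x > 0
        elif step == "decrement":  # f_1 executes x = x - 1 if the check passed
            if guard:
                x -= 1
        elif step == "reset":      # f_2 executes x = 0
            x = 0
    return x


schedules = [
    ("reset", "check", "decrement"),   # f_2 runs before f_1 (serial)
    ("check", "reset", "decrement"),   # f_2 interleaves inside f_1
    ("check", "decrement", "reset"),   # f_2 runs after f_1 (serial)
]
results = {s: run(list(s)) for s in schedules}
for s, value in results.items():
    print(s, "->", value)
```

Only the non-serial schedule, where the reset falls between the check and the decrement, drives x to -1; both serial schedules leave x at 0.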

Atomicity violations can also occur in programs that implement the message passing paradigm, when the code fragments that process two or more messages, and that shall be executed atomically, are interleaved with the processing of some other messages.

We distinguish two main classes of approaches to detect atomicity violations, namely code centric and data centric approaches, based on their definition of atomic blocks.

Code centric. Code centric techniques work at a coarse granularity level. They target code blocks that should be executed atomically according to some specification of the system, and verify if the system implementation guarantees the atomicity requirements. Atomic code blocks can be either explicitly specified [27], [35] or implicitly assumed [36], [37], based on some heuristics or some hypotheses of the programming paradigm.

Data centric. Data centric atomicity violations were first studied in 2006 by Vaziri et al., who introduced the concept of atomic-set serializability as a programming abstraction to ensure data consistency [38]. Atomic-set serializability builds on the concepts of atomic sets, which represent sets of data items that are correlated by some consistency constraints, and units of work, which are the blocks of code used to update variables in an atomic set. In the atomic-set programming model, developers only need to specify atomic sets and units of work, and the compiler infers synchronization mechanisms to avoid potentially dangerous interleavings.

Some testing techniques use patterns that violate atomic-set serializability to select dangerous interleavings. Since atomic-set serializability approaches require the definition of atomic sets, these techniques either rely on some code annotations or infer atomic sets using either heuristics or assumptions about the programming paradigm in use.

4.3.3 Deadlocks

Deadlocks occur when the synchronization mechanisms indefinitely prevent some execution flows from continuing their execution. This happens in the presence of circular waits, where each execution flow of a given set of flows is waiting for another execution flow from the same set to progress, and thus cannot continue its own execution.

Listing 4: An example of deadlock

begin f_1
  acquire(L1)
  acquire(L2)
  ...
  release(L2)
  release(L1)
end f_1

begin f_2
  acquire(L2)
  acquire(L1)
  ...
  release(L1)
  release(L2)
end f_2

Listing 4 shows a simple example of deadlock due to an incorrect use of locks. Locks provide two atomic primitives, acquire(L) and release(L). When an execution flow f acquires a lock L (acquire(L)), no other execution flow f′ can acquire L until f releases the lock (release(L)). Locks are used to implement mutual exclusion: for instance, to guarantee the atomicity of a set O of operations, each execution flow shall acquire a lock L before executing an operation in O, and release the lock L upon completing the operation, to prevent other flows from executing an operation in O concurrently.

In the example of Listing 4, the execution flow f_1 first acquires lock L1 and then lock L2 before releasing L1, while the execution flow f_2 first acquires lock L2 and then lock L1 before releasing L2. If f_1 acquires L1 and f_2 acquires L2 before f_1 makes further progress, then f_1 is blocked waiting for f_2 to release L2 and f_2 is blocked waiting for f_1 to release L1, thus resulting in a deadlock.
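The circular wait of Listing 4 can be exposed by recording, for each flow, which locks are already held when a new lock is requested, and then looking for a cycle in the resulting lock-order graph. The Python sketch below is a simplified illustration under that assumption; the trace encoding and the helper names are hypothetical, and the check only signals a potential deadlock because it does not verify that the conflicting acquisitions can actually overlap.

```python
def lock_order_edges(trace):
    """Return edges (held, requested) observed in one flow's acquire/release trace."""
    held, edges = [], set()
    for op, lock in trace:
        if op == "acquire":
            edges.update((h, lock) for h in held)
            held.append(lock)
        else:  # release
            held.remove(lock)
    return edges


def has_cycle(edges):
    """Detect a cycle in the lock-order graph with a depth-first search."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:
            return True  # back edge: circular lock order
        visiting.add(node)
        found = any(visit(n) for n in graph.get(node, ()))
        visiting.discard(node)
        done.add(node)
        return found

    return any(visit(n) for n in list(graph))


f1 = [("acquire", "L1"), ("acquire", "L2"), ("release", "L2"), ("release", "L1")]
f2 = [("acquire", "L2"), ("acquire", "L1"), ("release", "L1"), ("release", "L2")]
edges = lock_order_edges(f1) | lock_order_edges(f2)
print(sorted(edges), has_cycle(edges))
```

If both flows acquired the locks in the same order (L1 before L2), the graph would contain the single edge (L1, L2) and no cycle would be reported.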

Deadlocks may depend either on an incorrect order of accesses to shared resources (resource deadlocks) or on an incorrect communication protocol between execution flows (communication deadlocks) [39].

Resource deadlocks. Resource deadlocks occur when a set of execution flows try to access some common resources and each execution flow in the set requests a resource held by another execution flow in the set. The code in Listing 4 is an example of resource deadlock.

Communication deadlocks. Communication deadlocks occur in message passing systems when a set of execution flows exchange messages and each of them waits for a message from another execution flow in the same set.

4.3.4 Combined

Some testing techniques address combined properties of interleavings, that is, more than one of the properties defined above.

4.3.5 Order violation

Data races, atomicity violations and deadlocks are classic and well studied properties that in general result in specific violations of the order of interleavings. Some techniques target program- or domain-specific violations of the order of interleavings that cannot be traced back to classic interleaving properties. We refer to these techniques as order violation approaches. For instance, some techniques target the domain of concurrent object oriented programs, and focus on interleavings that lead to null pointer dereferences, which occur when an execution flow erroneously tries to dereference an object reference after it has been set to null by another execution flow [40], even if the accesses are suitably protected with locks that avoid data races.

4.4 Output and Oracle

Testing techniques produce two types of outputs: some simply produce property satisfying interleavings, that is, interleavings that expose a property of interest, while others compare the results produced by executing an interleaving with an oracle, and return the failing executions, thus implementing the comparing with oracle feature of Figure 1.

In general, not all the interleavings that exhibit a property of interest lead to a failure. Thus, techniques that output property satisfying interleavings may result in false positives. For instance, some techniques that detect data races may signal many benign data races.

Oracles define criteria to discriminate between acceptable and failing executions, and can be system crash, deadlock or violated assertion oracles. System crash and deadlock oracles, also known as implicit oracles, identify executions that lead to system crashes and deadlocks, respectively. Violated assertion oracles exploit assertions about the correct behavior of the system, based either on explicit specifications or implicit assumptions about the programming paradigm.

4.5 Guarantees

Different techniques for selecting interleavings guarantee various levels of validity of the results.

A technique guarantees the feasibility of the results if it produces only interleavings that can be observed in some concrete executions. Not all techniques guarantee the feasibility of interleavings: for instance, some prediction techniques that analyze traces and produce alternative interleavings of the observed operations may miss some program constraints, and thus return interleavings that do not correspond to any feasible execution.

A technique guarantees soundness of the results if it produces only interleavings that lead to an oracle violation. A technique guarantees the completeness of the results if it produces all the interleavings that can be observed in some concrete executions and that lead to an oracle violation. A technique guarantees property soundness if it identifies only feasible interleavings that exhibit the property of interest for the considered test cases. A technique guarantees property completeness if it identifies all feasible interleavings that exhibit the property of interest for the considered test cases.

The concepts of property soundness and property completeness only apply to property based techniques. Several authors of property based techniques present their approach as sound and/or complete with respect to the property of interleavings they consider. However, their claims often rely on the assumption that only some specific synchronization mechanisms are used. In the general case, the type of order relations and analysis adopted in most property based techniques, such as happens-before analysis [28] or causally-precedes relations [29], can introduce approximations that hamper both soundness and completeness [30].

Soundness and completeness describe the accuracy of a technique in detecting concurrency faults. Property soundness and property completeness describe the accuracy of a technique in detecting interleavings that expose a given property.
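The five guarantees can be read as set inclusions over interleavings. The following Python sketch encodes the definitions given above; the interleaving identifiers and the example sets are invented purely for illustration.

```python
def guarantees(reported, feasible, failing, with_property):
    """Evaluate the guarantees of one technique on explicit sets of interleavings.

    reported:      interleavings returned by the technique
    feasible:      interleavings observable in some concrete execution
    failing:       interleavings that lead to an oracle violation
    with_property: interleavings that exhibit the property of interest
    """
    return {
        "feasibility": reported <= feasible,
        "soundness": reported <= failing,
        "completeness": (feasible & failing) <= reported,
        "property soundness": reported <= (feasible & with_property),
        "property completeness": (feasible & with_property) <= reported,
    }


# Hypothetical scenario: i2 is a feasible but benign data race, i4 is infeasible.
feasible = {"i1", "i2", "i3"}
with_property = {"i1", "i2", "i4"}
failing = {"i1"}
reported = {"i1", "i2"}
result = guarantees(reported, feasible, failing, with_property)
for name, holds in result.items():
    print(f"{name}: {holds}")
```

In this scenario the technique is property sound and property complete, yet it is not sound with respect to the oracle, because the benign interleaving i2 is reported as a false positive.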

4.6 Target System

We identify three elements that characterize the type of target concurrent systems: the communication model, the programming paradigm and the consistency model.

The communication model specifies how the execution flows interact with each other, and includes shared memory and message passing models. General techniques target both types of systems.

Many techniques target a specific programming paradigm, sometimes identified by the target synchronization mechanisms. For instance, some techniques exploit specific properties of the object oriented paradigm, such as encapsulation of state or subtype substitutability. Similarly, other techniques build on the assumptions provided in actor based systems. Some techniques only consider faults that arise from the use of specific synchronization mechanisms, such as deadlocks that derive from the incorrect use of lock based synchronization. Yet other techniques do not make any assumption on the programming paradigm adopted and work with general systems.

Finally, some testing techniques assume a sequential consistency model, while other techniques can be applied to relaxed consistency models.

4.7 Testing Technique

We characterize the many testing techniques proposed so far along five axes: the type of analysis they implement, the type, granularity and scope of the testing technique, and the type of testing architecture they adopt.

4.7.1 Analysis

Testing techniques implement either dynamic or hybrid analysis. Dynamic analysis techniques use only information derived from executions of the system under test. Hybrid analysis techniques use both static and dynamic information. Strictly speaking, techniques that rely on static information only are not testing techniques, and thus fall outside the scope of this survey. We discuss the most relevant static analysis techniques in the related work section (Section 7).

4.7.2 Type of testing

Testing techniques may target either functional or non-functional properties. Functional testing techniques check the correctness of the program results with respect to a given oracle, while non-functional testing techniques verify non-functional properties such as response time and scalability.

4.7.3 Granularity of testing

Different testing techniques work at various granularity levels that span from unit to integration and system. Unit testing techniques target individual units in isolation. For instance, in object oriented programs, unit testing considers classes in isolation, without taking into account their interactions with other components of the system. Integration testing techniques focus on the interactions between units, and check that the communication interface between the units under test works as expected. System testing techniques target the system as a whole, and verify that it meets its requirements. In Section 5 we indicate both the granularity level and the type of system components considered by the testing technique.

4.7.4 Scope of testing

Some testing techniques have a limited scope in the development process. For example, many techniques are used for system validation, while other techniques target regression testing.

4.7.5 Testing Architecture

Testing techniques refer to different testing architectures that characterize the concrete infrastructure used to exercise the system under test. Such infrastructures are composed of one or more test drivers, which provide input data to one or more execution flows of the system under test and observe the outputs produced by the system under test. For instance, to test a client-server distributed system, a driver can initialize a client, submit some requests to the server, wait for replies from the server, and evaluate the received replies with respect to an oracle.

Testing architectures can be either centralized, when a single driver interacts with the system under test, or distributed, when more than one driver interacts with (different) execution flows of the system under test. The above scenario of a client-server distributed system exemplifies a centralized testing architecture. A framework for testing a peer-to-peer system in a distributed environment with a test driver for each peer is a simple example of a distributed testing architecture.

We distinguish between synchronized and independent distributed testing architectures. In synchronized testing architectures, the drivers coordinate with each other by exchanging messages, while in independent testing architectures the drivers execute independently and do not exchange messages to coordinate their actions. In distributed testing architectures, the driver synchronization is used to overcome controllability and observability problems. These problems occur if a driver cannot determine when to produce a particular input, or whether a particular output is generated in response to a specific input or not, respectively [41].

5 DETAILED SURVEY

We classify the main approaches for testing concurrent systems according to the criteria discussed in Section 4. Figure 5 shows the organization of this section. We use selection of interleavings as the main classification criterion, and distinguish between property based and space exploration techniques. We classify property based approaches according to the target property of interleaving as data race (Section 5.1), atomicity violation (Section 5.2), deadlock (Section 5.3), combined (Section 5.4) and order violation (Section 5.5), and discuss space exploration techniques in Section 5.6.

This organization groups together approaches that implement related classes of methodologies and algorithms to detect faults. This is the case of property based techniques that exploit the same property of interleavings, which typically target the same type of concurrency faults, as well as space exploration approaches, which typically target generic types of faults.

We summarize the classification of the techniques in six tables according to the criteria. Table 1, Table 2, Table 3, Table 4 and Table 5 overview property based techniques. Table 6 overviews space exploration techniques. The six tables share the same structure. The rows indicate the techniques with their name, when available, or with the name of the authors of the paper that proposed the technique. Rows are grouped by subcategory when applicable, and report the approaches sorted by main contribution within the same subcategory. The contribution of most approaches is multifaceted; we list the approaches according to what we identify as their core novelty, mentioning other elements when particularly relevant. The columns correspond to the criteria identified in Figure 4.

In the six tables we do not report explicitly type of testing, scope of testing and testing architecture, since all the techniques we analyze (i) implement functional testing, with the only exception of SpeedGun, which focuses on performance, (ii) are in the scope of validation testing, with the exception of ReConTest, SimRT and SpeedGun, which are explicitly designed for regression testing, and (iii) implement a centralized architecture. In Table 6, we also omit columns prediction, property completeness and property soundness, since they apply only to property based techniques.

5.1 Property Based: Data Race

We classify the large number of techniques designed to detect data races in shared memory programs according to both their granularity and the type of analysis they perform. We consider techniques that target low level and high level data races, and we further classify low level techniques as lockset, happens-before and hybrid analysis techniques.

Low level data race detection techniques. Low level data race detection techniques target data races that occur at the level of individual memory locations. They rely on some form of analysis to track the order relations between memory instructions in a given execution trace, and either detect the occurrence of a data race or predict if a data race is possible in alternative interleavings.

These techniques rely either on lockset analysis, which simply identifies concurrent memory accesses not protected by locks, or on happens-before analysis, which detects some order relations among concurrent memory accesses. Testing techniques that rely on happens-before analysis often claim to be property complete, meaning that they can detect all the data races that can be generated with alternative schedules of an input execution trace. However, happens-before analysis is in general conservative: for instance, when it observes the release of a lock followed by an acquisition of the same lock in an execution trace, it interprets the two operations as totally ordered, while they could appear in a different order in other interleavings of the same trace. Because of this, traditional happens-before analysis may miss some possible interleavings, and may thus miss some faults [29].
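The conservative behavior described above can be reproduced with a textbook vector clock implementation of happens-before analysis. The Python sketch below is a minimal illustration, not the algorithm of any specific surveyed tool: the trace format, the helper names and the two example schedules are assumptions. It orders a lock release before the next acquisition of the same lock, and reports two conflicting accesses as a race only when neither happens before the other.

```python
def join(a, b):
    """Component-wise maximum of two vector clocks stored as dicts."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}


def leq(a, b):
    """True if vector clock a happens before (or equals) vector clock b."""
    return all(v <= b.get(k, 0) for k, v in a.items())


def hb_races(trace):
    """Return (item, earlier_flow, later_flow) for unordered conflicting accesses."""
    clocks, lock_clocks, history, races = {}, {}, {}, []
    for flow, op, target in trace:
        vc = clocks.setdefault(flow, {flow: 1})
        if op == "acquire":    # inherit the order from the last release of the lock
            clocks[flow] = join(vc, lock_clocks.get(target, {}))
        elif op == "release":  # publish the current clock, then advance it
            lock_clocks[target] = dict(vc)
            vc[flow] += 1
        else:                  # "read" or "write" on data item `target`
            for other, other_op, snap in history.get(target, []):
                if other != flow and "write" in (op, other_op) and not leq(snap, vc):
                    races.append((target, other, flow))
            history.setdefault(target, []).append((flow, op, dict(vc)))
    return races


# Observed schedule: the release in f_1 precedes the acquisition in f_2.
observed = [("f_1", "write", "y"), ("f_1", "acquire", "L"), ("f_1", "release", "L"),
            ("f_2", "acquire", "L"), ("f_2", "release", "L"), ("f_2", "write", "y")]
# Alternative schedule of the same operations: f_2 takes the lock first.
alternative = [("f_2", "acquire", "L"), ("f_2", "release", "L"), ("f_2", "write", "y"),
               ("f_1", "write", "y"), ("f_1", "acquire", "L"), ("f_1", "release", "L")]
print(hb_races(observed), hb_races(alternative))
```

On the observed schedule the two unprotected writes to y appear ordered through the lock L, so no race is reported, while the alternative schedule of the same operations exposes the race.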

The property completeness problem has been addressed either by implementing variants of the happens-before relation that capture the order of events more accurately, and thus reduce the possibility of missing some faults [29], [30], or by exploiting model checking to explore re-orderings of instructions that are not allowed according to the over-restrictive happens-before analysis, but are still possible in practice. Some hybrid solutions combine the advantages of the less accurate but inexpensive lockset analysis with the more accurate but expensive happens-before analysis [42].

Testing techniques
  Property based
    Data race (Section 5.1)
      Low level: Lockset, Happens-before, Hybrid
      High level
    Atomicity violation (Section 5.2): Code centric, Data centric
    Deadlock (Section 5.3)
    Combined (Section 5.4)
    Order violation (Section 5.5)
  Exploration (Section 5.6): Stress testing, Exhaustive exploration, Coverage criteria, Heuristics

Fig. 5: Organization of the detailed survey

High level data race detection techniques. High level data race detection techniques target complex data structures, such as objects in object oriented programs, and look for interleavings that lead to results not compatible with any serial execution. The naïve approach to detect high level data races consists in comparing the results of an interleaving with all possible serial executions. Many testing techniques tackle the evident scalability issues of exhaustively analyzing all possible serial executions by exploiting some form of specification provided by the developer, which indicates relevant order characteristics among operations, such as commutativity.

5.1.1 Low level — Lockset

Lockset analysis was proposed for data race detection by Savage et al. in the late nineties, based on the theory of reduction that Lipton introduced in the seventies [18]. Savage et al.'s Eraser approach [23] addresses both the conservative limitations of static data race analysis and the performance problems of happens-before analysis, the two popular classes of approaches for detecting data races that were investigated at that time. Indeed, in the nineties happens-before analysis was considered too expensive, since it requires information for each execution flow about the concurrent accesses to each shared data item. Lockset analysis reveals possible data races by both dynamically computing the set of common locks held when accessing shared data items, and identifying execution flows that access the same shared data items without sharing any lock. By taking into account only lock based synchronization, lockset analysis improves the efficiency with respect to happens-before analysis, since it only needs to store information about the set of locks held by the execution flows, but loses accuracy, since it ignores additional order relations determined by other synchronization mechanisms. Thus, the techniques based on lockset analysis are not property sound.
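The core idea of lockset analysis can be sketched in a few lines: for each shared data item, keep the set of locks that were held on every access so far, and raise a warning when that set becomes empty. The Python example below is a deliberately simplified illustration that omits the refinements of the actual Eraser algorithm; the trace format and the function name are assumptions.

```python
def lockset_warnings(trace):
    """Return the data items accessed by several flows with no common lock.

    trace: list of (flow, op, target) with op in
           {"acquire", "release", "read", "write"}.
    """
    held = {}        # flow -> locks currently held by that flow
    candidates = {}  # data item -> locks held on every access seen so far
    accessors = {}   # data item -> flows that accessed it
    warnings = []
    for flow, op, target in trace:
        locks = held.setdefault(flow, set())
        if op == "acquire":
            locks.add(target)
        elif op == "release":
            locks.discard(target)
        else:  # read or write of data item `target`
            if target in candidates:
                candidates[target] &= locks   # refine the candidate lockset
            else:
                candidates[target] = set(locks)
            accessors.setdefault(target, set()).add(flow)
            unprotected = not candidates[target] and len(accessors[target]) > 1
            if unprotected and target not in warnings:
                warnings.append(target)
    return warnings


# Hypothetical trace: y is always accessed under lock L, x is not.
trace = [
    ("f_1", "acquire", "L"), ("f_1", "write", "x"), ("f_1", "write", "y"),
    ("f_1", "release", "L"),
    ("f_2", "acquire", "L"), ("f_2", "write", "y"), ("f_2", "release", "L"),
    ("f_2", "write", "x"),
]
print(lockset_warnings(trace))
```

Because only locks are tracked, the warning on x would be raised even if the two flows were ordered by another mechanism, such as a join or a barrier, which is the source of the false positives discussed above.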

In the last fifteen years, research on data race detection focused mostly on reducing the amount of false positives, moving the attention back to happens-before analysis and forward to hybrid approaches. The few recent techniques based exclusively on lockset analysis aim either to improve precision and efficiency or to extend to new programming paradigms. Shacham et al. [43] and Racez [44] improve the precision and efficiency, respectively, while von Praun and Gross [45] and ACCORD [46] extend lockset analysis to object oriented and array based concurrent programs, respectively.

Improving precision

Shacham et al., PPoPP 2005 [43]. Shacham et al. combine dynamic lockset analysis with model checking to detect data races in general shared memory programs. They capture locking constraints with dynamic lockset analysis, generate alternative interleavings for the executed trace with model checking, and execute the interleavings identified by the model checker to reveal faults. Lockset analysis ensures property completeness but not property soundness, while the final execution phase guarantees both soundness and completeness with respect to the oracle.

Improving efficiency

Racez, ICSE 2011 [44]. Racez reduces the overhead of lockset analysis with a sampling approach that captures only a subset of the synchronization operations and the memory accesses performed by the target system, thus trading accuracy for lower overhead. Racez inherits the idea of sampling from previous work such as LiteRace and Pacer, which we discuss in the remainder of this section. Racez further reduces the overhead by both instrumenting only synchronization operations and delegating the detection of memory operations to the hardware performance monitoring unit (PMU). The PMU stores hardware information in a buffer that is accessed asynchronously with system-level calls, and is usually adopted for performance monitoring. This combination of user-level instrumentation to monitor synchronization operations and hardware monitoring of memory accesses leads to a very low overhead (Sheng et al. report an overhead as low as 2.8% in some experiments performed on real systems). The approach is neither property complete, due to the approximations introduced by the sampling, nor property sound, since it relies on the imprecise lockset analysis.

TABLE 1: Data race detection techniques

Columns, by group:
Input: Test Case, Model
Output / Oracle: Sat. Interl., Crash, Deadlock, Assert. Viol.
Select. Interl.: Predict.
Target System: Commun., Paradigm, Consist.
Testing Tech.: Granularity, Analysis
Guarantees: Prop. Complet., Prop. Sound., Complet., Soundness, Feasibility

Low level — Lockset
Shacham et al. X X X X X SM General Seq S D X X X X
Racez X X SM General Seq S D - - X
von Praun and Gross X X SM OO Seq S H - - X
ACCORD X X X X SM Fork-join Seq S D - -

Low level — Happens-before
FastTrack X X SM General Seq S D - - X
LiteRace X X SM General Seq U D - - X
Pacer X X SM General Seq S D - - X
SOS X X SM General Seq S D - - X
Carisma X X SM General Seq S D - - X
ReEnact X X X SM General Seq S D - - X
Narayanasamy et al. X X X SM General Seq S D X - - X
Frost X X X X X SM General Seq S D X
Portend X X X SM General Seq S D X - - X
Tian et al. X X SM General Seq S D - - X
RDIT X X X SM General Seq S D - -
Smaragdakis et al. X X X SM General Seq S D - -
DrFinder X X X SM General Seq S D - - X
RVPredict X X X SM General Seq S D - -
WebRacer X X SM Web platforms Seq S D - - X
EventRacer X X SM Event-based Seq S D - - X
DroidRacer X X SM Android apps Seq S D - - X
Java RaceFinder X X X SM General Rel S D X - - X
Relaxer X X X SM General Rel S D - - X

Low level — Hybrid
Choi et al. X X SM General Seq S H - - X
Wester et al. X X SM General Seq S D - - X
RaceMob X X SM General Seq S H X - - X
RaceFuzzer X X X SM General Seq S D X X
RaceTrack X X SM General Seq S D - - X
Goldilocks X X SM General Seq S H - - X
MultiRace X X SM General Rel S D - - X
SimRT X X X SM General Seq S D X X
Racageddon X X SM General Seq S D - - X
Narada X X X SM General Seq S D X X

High level
Colt X X X X SM OO Seq U D - - X
Dimitrov et al. X X X SM OO Seq U D - - X

In this and all the tables in the paper, the techniques are sorted according to their publication date within each subcategory, and are named as indicated by the proposers, or by the authors of the seminal paper when a name is not available.

Legend (common to all tables):
SM: shared memory; MP: message passing
Seq: sequential consistency; Rel: relaxed consistency
S: system testing; U: unit testing
H: hybrid analysis; D: dynamic analysis

Extending to new paradigms

von Praun and Gross, OOPSLA 2001 [45]. von Praun and Gross exploit the encapsulation features of object oriented programs to reduce the time and space overhead of the instrumentation, and thus optimize lockset analysis. The approach takes advantage of the confinement property that characterizes objects and object fields that can be accessed only by one execution flow, and excludes them during the instrumentation. The technique pairs static analysis to detect encapsulation properties and dynamic lockset analysis to detect memory accesses. It searches for data races within the executed interleavings, without predicting faults; thus, it is not property complete.

ACCORD, PPoPP 2011 [46]. ACCORD targets data races in array-based concurrent programs characterized by fork-join parallelism, like programs written in OpenMP, CILK and TBB. ACCORD strengthens lockset analysis by relying on an input model expressed as source code annotations that specify the intended concurrency coordination strategy. The model specifies the set of memory locations that each execution flow reads and writes, distinguishing between accesses that should only occur when a lock is acquired and accesses that might occur without mutual exclusion. ACCORD is one of the few approaches that exploit annotations from the developers to improve the accuracy of the results. ACCORD automatically verifies that the concurrency coordination strategy is data race free using a constraint solver. It also automatically generates assertions to check that the implementation conforms to the specified concurrency coordination strategy during testing. ACCORD is neither property complete, since the constraint solving might not be able to generate solutions in all cases, nor property sound, because the specification only targets lock based synchronization mechanisms.

5.1.2 Low level — Happens-before

Happens-before analysis was introduced by Leslie Lamport in the late seventies [47] and has been adopted in several early techniques to detect data races in concurrent programs [48], [49].

Happens-before analysis is more precise than lockset analysis, since it can deal with any kind of synchronization mechanism beyond lock based synchronization, and thus can avoid many false positives that are inevitable in lockset analysis, at the price of additional computational costs. Happens-before analysis has been widely studied in the last fifteen years in the context of testing for data races, with focus on (i) improving performance, (ii) reducing false positives, (iii) improving completeness, (iv) tailoring the analysis to specific programming paradigms or languages, and (v) extending the analysis to relaxed (non sequential) memory models.

Different approaches improve the performance of happens-before analysis by maintaining additional information at runtime (FastTrack [50]), reducing the expensive monitoring activities to a subset of samples (LiteRace [51], Pacer [52], SOS [53] and Carisma [54]), exploiting the parallelism of multi-core systems (Wester et al. [55]), and implementing happens-before analysis in hardware (ReEnact [56]).

Some approaches reduce the amount of false positives by (i) classifying data races as harmful or benign (Narayanasamy et al. [57], Frost [58] and Portend [59]), (ii) exploiting program-specific synchronization mechanisms (Tian et al. [60]) or (iii) considering synchronization events generated from external libraries (RDIT [61]).

Smaragdakis et al. and DrFinder improve the completeness of happens-before analysis by taking into account alternative interleavings of locked code regions [29], [62], while RVPredict improves completeness by integrating happens-before analysis with control flow information [30].

Some techniques improve the precision of the analysis by reducing its scope to specific types of applications, and exploiting the domain semantics to infer order constraints more precisely: WebRacer [63] and EventRacer [64] adapt the analysis to Web and event-based applications, respectively; DroidRacer [65] applies the analysis to Android applications.

RaceFinder [66] and Relaxer [67] augment the analysis with model checking to capture order relations in the relaxed Java memory model.

Improving performance

FastTrack, PLDI 2009 [50]. FastTrack introduces a runtime happens-before analysis with constant time and space complexity, thus improving over the linear complexity of traditional vector clock approaches. FastTrack builds upon the observation that, in the context of data race detection, the full generality of vector clocks is not needed to characterize a large fraction of read and write operations. Thus, it proposes a lightweight representation of the happens-before information that records only the information about the last write operation on each data item. In this way, it reduces the cost of vector clock comparison by up to an order of magnitude compared to the original happens-before analysis. FastTrack is neither property complete nor property sound.
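The idea of recording only the last write can be sketched as follows. This is a simplified illustration under stated assumptions, not FastTrack's actual algorithm: the class `EpochDetector` and its methods are hypothetical names, only lock-based synchronization is modeled, and accesses are checked against the last write only, so a write that races with an earlier read is not reported here. The last write is stored as a pair (clock value, thread id), so each check is a single integer comparison rather than a full vector clock comparison.

```python
class EpochDetector:
    """Simplified race detector that keeps one (clock, thread) pair per variable."""

    def __init__(self, n_threads):
        # Each thread owns one component of its vector clock, starting at 1.
        self.clocks = [[1 if i == t else 0 for i in range(n_threads)]
                       for t in range(n_threads)]
        self.lock_clocks = {}   # lock -> vector clock stored at the last release
        self.last_write = {}    # variable -> (clock value, writer thread id)
        self.races = []         # (variable, earlier writer, current thread)

    def _ordered(self, epoch, tid):
        # Constant-time check: the write is ordered before the current access
        # if the current thread has already observed the writer's clock value.
        clock, writer = epoch
        return clock <= self.clocks[tid][writer]

    def _check(self, tid, var):
        prev = self.last_write.get(var)
        if prev is not None and not self._ordered(prev, tid):
            self.races.append((var, prev[1], tid))

    def read(self, tid, var):
        self._check(tid, var)

    def write(self, tid, var):
        self._check(tid, var)
        self.last_write[var] = (self.clocks[tid][tid], tid)

    def acquire(self, tid, lock):
        # Join the clock left by the last release of this lock, if any.
        released = self.lock_clocks.get(lock)
        if released is not None:
            self.clocks[tid] = [max(a, b)
                                for a, b in zip(self.clocks[tid], released)]

    def release(self, tid, lock):
        self.lock_clocks[lock] = list(self.clocks[tid])
        self.clocks[tid][tid] += 1


# Writes protected by the same lock are ordered through the lock hand-over.
protected = EpochDetector(2)
protected.acquire(0, "m"); protected.write(0, "x"); protected.release(0, "m")
protected.acquire(1, "m"); protected.write(1, "x"); protected.release(1, "m")

# Writes without any synchronization are reported as a race.
unprotected = EpochDetector(2)
unprotected.write(0, "y")
unprotected.write(1, "y")
```

The per-variable state has a fixed size regardless of the number of threads, which is the source of the constant space cost per data item in this sketch.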

LiteRace, PLDI 2009 [51]. LiteRace is the first technique that introduces sampling to reduce the analysis overhead. LiteRace instruments only cold regions, defined as the less frequently accessed code elements, based on the assumption that frequently accessed code elements (hot regions) are less likely to be involved in data races. LiteRace trades property completeness (number of detected data races) for efficiency. The technique is not property sound, since the happens-before analysis might be unaware of program-specific synchronization mechanisms.
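The cold-region assumption can be illustrated with a small sampler whose monitoring rate drops as a code region becomes hot. This is a deterministic stand-in written for illustration, not LiteRace's actual sampling policy: the class `ColdRegionSampler` is a hypothetical name, and the rule "monitor the n-th execution only when n is a power of two" is an assumption chosen to keep the example reproducible.

```python
from collections import Counter


class ColdRegionSampler:
    """Monitor a region often while it is cold, rarely once it becomes hot."""

    def __init__(self):
        self.executions = Counter()  # region name -> executions seen so far

    def should_monitor(self, region):
        self.executions[region] += 1
        n = self.executions[region]
        # Assumed policy: monitor executions 1, 2, 4, 8, ... of each region,
        # so the monitored fraction shrinks as the region gets hotter.
        return n & (n - 1) == 0


sampler = ColdRegionSampler()

# A cold region executed three times: executions 1 and 2 are monitored.
cold_hits = sum(sampler.should_monitor("init_config") for _ in range(3))

# A hot region executed a thousand times: only executions 1, 2, 4, ..., 512.
hot_hits = sum(sampler.should_monitor("inner_loop") for _ in range(1000))
```

Under this policy two of the three cold executions are monitored, against ten of the thousand hot ones, which shows where the efficiency gain comes from and why races that occur only in hot regions can be missed.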

Pacer, PLDI 2010 [52]. Similar to LiteRace, Pacer targets the efficiency of happens-before analysis through sampling. Unlike LiteRace, which relies on a heuristic approach, Pacer estimates the probability of finding data races within code regions based on the sampling rate, thus trading precision for costs. Pacer extends FastTrack by alternating a sampling period that implements the classic FastTrack algorithm, and a non-sampling period that reduces the amount of recorded information and simplifies the management of vector clocks. The non-sampling period reduces the overhead of precise happens-before analysis. Pacer is not property sound since the happens-before analysis does not capture program-specific synchronization mechanisms. It is not property complete due to the sampling approach that does not analyze the entire execution.

SOS, OOPSLA 2011 [53]. SOS improves FastTrack and Pacer by reducing the overhead of dynamic happens-before analysis, while maintaining the precision of the original approach. SOS introduces the concept of stationary objects, which are objects that are only read after being written during the initialization period, and optimizes happens-before analysis by excluding stationary objects from the analysis, since they cannot lead to data races. SOS reduces the average overhead of FastTrack by 45%, while increasing the amount of detected races, and reveals over five times more races than Pacer when considering 50% runtime overhead.
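The notion of stationary object can be sketched on a recorded trace. The function below is a simplified illustration, not the SOS algorithm: `stationary_objects` is a hypothetical name, the trace format is assumed, and the initialization period of an object is approximated as everything that precedes its first read.

```python
def stationary_objects(trace):
    """Return the objects that are never written after their first read.

    `trace` is an assumed format: a sequence of (object name, operation)
    pairs, where the operation is either "read" or "write".
    """
    seen = set()       # every object that appears in the trace
    read_once = set()  # objects that have been read at least once
    demoted = set()    # objects written after a read: not stationary
    for obj, op in trace:
        seen.add(obj)
        if op == "read":
            read_once.add(obj)
        elif op == "write" and obj in read_once:
            demoted.add(obj)
    return seen - demoted


trace = [
    ("config", "write"), ("config", "write"), ("config", "read"),
    ("counter", "write"), ("counter", "read"), ("counter", "write"),
    ("config", "read"),
]
# "config" is only read after initialization, "counter" is written again.
stationary = stationary_objects(trace)
```

A detector built on this classification would skip the happens-before bookkeeping for `config` and keep monitoring `counter`.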

Carisma, ISSTA 2012 [54]. Carisma improves both the performance and the accuracy of LiteRace and Pacer by exploiting the similarity between multiple accesses to the same data structures, such as arrays or lists. When processing many accesses to the same data structure, techniques such as LiteRace and Pacer waste time with redundant memory access sampling. Carisma dynamically infers the application contexts that specify mappings between memory locations and high level data structures, and uses the contexts to compute the distribution of memory locations across data structures to better balance the sampling budget. Being based on FastTrack, Carisma is neither property sound nor property complete.

ReEnact, ISCA 2003 [56]. ReEnact improves the performance of happens-before analysis by proposing an original hardware implementation. It segments a program execution into epochs, saves the state of the program in cache prior to each epoch, and persistently saves the state of the epoch only at the end of its execution, if it does not suffer from data races. Otherwise ReEnact rolls back, restores the previous state, and re-executes the epoch under the same interleaving with additional instrumentation to collect useful information for the developers. ReEnact implements happens-before analysis based on vector clocks, and reduces the analysis overhead to a bare 5.8% of the program execution time on average. ReEnact automatically repairs data race problems that match a set of predefined bug patterns, such as a missing lock-unlock synchronization, and relies on program annotations to distinguish benign data races. Hence, it is neither property sound nor property complete since it is not predictive.

Reducing false positives

Narayanasamy et al., PLDI 2007 [57]. Narayanasamy et al.’s technique reduces the false positive rate by automatically classifying the detected races as either benign or harmful. For a given data race, the approach replays the execution for the different orders among the memory operations involved in the data race, and classifies the race as harmful only if the executions result in different program states. The approach can still lead to misclassifications, since it would erroneously classify as benign a data race between two interleavings that lead to indistinguishable state and results only incidentally [59]. The approach is not property complete, due to imprecisions in the dynamic analysis. It is property sound, since it checks that the two different orders between the memory operations involved in the data race are feasible.
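The replay-based classification can be sketched as follows. This is a minimal illustration, not the technique's implementation: `classify_race` is a hypothetical name, the program state is modeled as a dictionary, and each racing operation is modeled as a function that mutates that state, whereas the original approach replays recorded executions.

```python
import copy


def classify_race(state, op_a, op_b):
    """Run the two racing operations in both orders and compare final states."""
    first = copy.deepcopy(state)
    op_a(first)
    op_b(first)

    second = copy.deepcopy(state)
    op_b(second)
    op_a(second)

    return "benign" if first == second else "harmful"


def increment(s):
    s["x"] = s["x"] + 1


def square(s):
    s["x"] = s["x"] * s["x"]


def set_ready(s):
    s["ready"] = True


# Both threads store the same value: the order does not matter.
redundant = classify_race({"ready": False}, set_ready, set_ready)

# From x = 2 the two orders yield 9 and 5: the race is classified as harmful.
from_two = classify_race({"x": 2}, increment, square)

# From x = 0 both orders happen to yield 1, so the same race looks benign.
from_zero = classify_race({"x": 0}, increment, square)
```

The last case shows the misclassification discussed above: the two orders agree only incidentally for the observed input, so a verdict obtained from a single replayed execution does not generalize to other inputs.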

Frost, SOSP 2011 [58]. Frost detects non-benign data races by comparing the results and program state obtained by executing multiple replicas of the same program with different interleavings. Frost segments an execution into epochs, and runs each epoch on three replicas. It executes a replica instrumented with dynamic happens-before analysis to detect synchronization points in the program, and the other two replicas with a non-preemptive controlled scheduler on a single thread. Frost schedules these two replicas to be as complementary as possible, meaning that it schedules two statements in reversed order in the two replicas whenever the synchronization mechanisms allow so. Frost both infers the presence of harmful data races and identifies the most likely faulty replica by comparing the output of the three replicas. When used at runtime, Frost can also recover from a faulty execution by rolling back and re-executing the faulty epoch. Frost incurs a utilization cost of 3×, since it executes three replicas of each epoch. Simultaneously executing the replicas on spare cores can reduce the increased utilization to a bare 3% to 12% overhead. Frost is neither property sound nor property complete because of the imprecision introduced by the happens-before analysis and by the epoch splitting mechanism.

Portend, ASPLOS 2012 [59]. Portend proposes a precise classification of data races, which includes different types of harmful data races based on the effects that they have on the system under test. It addresses the limitations of data race classification approaches that rely on the identity of both outputs and state of the pair of interleavings induced by a data race: accidental identity and differences that do not depend on data races. Portend considers data races as benign only if they produce both the same results and state under all possible test inputs, and checks this property with symbolic execution. Portend is not property complete, due to imprecisions of dynamic analysis. It is property sound, since it checks if the two different orders between the memory operations involved in the data race are both feasible.

Tian et al., ISSTA 2008 [60]. Tian et al.’s dynamic technique proposes an automated mechanism to infer program-specific synchronization mechanisms to improve the accuracy of happens-before analysis and thus reduce the rate of false positives. Specifically, Tian et al. target synchronizations based on flags, test-and-set locks and barriers. They infer common patterns that indicate the presence of such synchronization mechanisms, and dynamically detect these patterns. The approach improves the precision of data race detection, but does not detect all possible synchronization constraints, and thus cannot guarantee either property soundness or property completeness.

RDIT, FSE 2015 [61]. RDIT reduces the amount of false positives of happens-before analysis by analyzing missing events, which are events not captured by happens-before analysis because they belong to portions of the code that cannot be instrumented, such as external libraries. Missing events are responsible for many false positives. RDIT tackles this problem by extending the happens-before relation to take into account also the invocations of external functions even if their implementation cannot be instrumented. RDIT assumes that two external functions that operate on some shared memory locations can possibly use such locations for synchronization purposes, and adds a happens-before relation between the invocation of the two functions, thus reducing the number of false positives that could be generated if this relation is ignored. RDIT relies on RVPredict, discussed in the remainder of this section, for the data race detection based on the computed happens-before relations. Thus, it does not ensure either property completeness or property soundness.

Improving completeness

Smaragdakis et al., POPL 2012 [29]. Smaragdakis et al. observe that, by analysing single execution traces, happens-before analysis may infer incorrect order relations and thus miss some data races. Indeed, the order relation that the happens-before analysis infers between the release and a subsequent acquisition of the same lock may be violated in other execution traces. Smaragdakis et al. mitigate this problem by proposing a new causally-precedes (CP) relation that captures the order of execution of statements. The CP relation relaxes the happens-before relation with respect to lock releases and acquisitions by inferring an order between two lock-protected blocks if and only if they contain conflicting statements. Smaragdakis et al. detect CP-races that occur when two conflicting memory accesses are not CP related, and demonstrate that a CP-race always corresponds to a feasible interleaving with a data race. The technique is neither property complete nor property sound, since it can miss relations due to program-specific synchronization mechanisms.

DrFinder, FSE 2015 [62]. DrFinder targets the problem of detecting hidden data races, which are data races that previous happens-before analyses, as well as the causally-precedes analysis and the precise analysis implemented in RVPredict, may miss due to the over-constraining nature of the analysis. A hidden data race consists of a pair of accesses to the same shared memory location that are in a happens-before relation only for a subset of all possible interleavings. For instance, hidden data races can occur when a happens-before relation depends on the order of acquisition of a lock, which can change from execution to execution. The key intuition of DrFinder is that many hidden races can be detected by reversing the order of execution of one or more operations in a happens-before relation. For instance, reversing the execution order of two locking operations could remove a happens-before relation and expose a hidden data race. DrFinder introduces a new may-trigger relation that specifies if a function may trigger a lock acquisition either directly or indirectly, through a chain of function invocations. DrFinder is a predictive technique. It computes the may-trigger relation on an execution trace, looks for alternative interleavings that might expose data races, and executes the selected interleavings to check their feasibility. DrFinder is neither property complete nor property sound.

RVPredict, PLDI 2014 [30]. RVPredict defines an order relation to detect data races that improves the accuracy of the classic happens-before analysis and the causally-precedes analysis of Smaragdakis et al. The key insight of the proposed relation consists in taking into account control flow information. Previous dynamic data race prediction approaches permute an execution trace to identify possible feasible traces that may suffer from a data race by relying on a conservative definition of feasibility that may miss some valid interleavings. In particular, they assume read-write consistency, meaning that every read in a valid permutation returns the same value as in the original trace. RVPredict exploits a more precise definition of feasible permutations: it encodes the order relation as a set of constraints, and invokes a constraint solver to detect races. Thanks to the additional information, RVPredict can detect up to two orders of magnitude more concurrency faults than approaches based on happens-before or causally-precedes analysis. RVPredict is not property complete, since it relies on constraint solving, and is not property sound since it can miss program-specific synchronization mechanisms like all the techniques based on classic happens-before analysis.

Tailoring to specific paradigms

WebRacer, PLDI 2012 [63]. WebRacer enhances happens-before analysis by taking advantage of the semantics of Web platforms and in particular information about the specification and implementation of the different browsers, like the order constraints between loading, parsing and executing. WebRacer focuses on (i) variable races, which represent data races caused by concurrent accesses to shared memory locations, such as JavaScript variables, (ii) HTML races, which occur when accesses to DOM nodes that represent HTML elements may occur both before and after their creation, (iii) function races, which occur when function invocations may occur both before and after the parsing of the functions, and (iv) event dispatch races, which occur when events may fire both before and after adding the corresponding event handlers. WebRacer is neither property sound, since happens-before analysis can miss program-specific synchronization mechanisms, nor property complete, since it does not check for alternative interleavings of the analyzed execution trace. WebRacer automatically generates sets of events to interact with Web sites, thus producing concrete test cases.

EventRacer, OOPSLA 2013 [64]. EventRacer extends WebRacer by formulating the happens-before relation for event-based programs, and improves the classification of data races to reduce false positives. EventRacer extends the analysis to ad-hoc synchronization mechanisms that developers often use to extend the few synchronization primitives available in Web platforms, thus eliminating many benign data races that data race detectors such as WebRacer report. EventRacer prunes benign races by introducing the notion of race coverage: a race a covers a race b if the race on a is used as a synchronization element to eliminate the race on b. An inexpensive inspection of uncovered races can quickly identify races on synchronization variables, and eliminate the false positives covered by those races. The set of data items with uncovered races is 14 times smaller on average than the set of all data items with races. EventRacer implements an efficient vector clock algorithm to perform happens-before analysis based on chain decomposition [68].

DroidRacer, PLDI 2014 [65]. DroidRacer dynamically exploits the concurrency semantics of the Android programming model to derive a precise happens-before relation that reduces the amount of false positives for Android applications. DroidRacer takes into account the non determinism that derives from both multi-threaded executions and asynchronous tasks, a concurrency mechanism implemented in Android to prevent complex computations from running on the user interface thread. Although DroidRacer is the only technique specifically conceived for Android applications so far, the concurrency model of Android, which is based on asynchronous callbacks, is similar to the model of Web platforms targeted by WebRacer and EventRacer. Compared with EventRacer, DroidRacer does not consider ad-hoc synchronization mechanisms, and thus it is not property sound. It is not property complete either, due to the limitations in the happens-before analysis it relies on.

Extending to relaxed memory models

Java RaceFinder, ASE 2009 [66]. Unlike most data race detection techniques, Java RaceFinder (JRF) does not assume a sequentially consistent memory model and introduces a new happens-before analysis to capture ordering relations in the relaxed Java memory model. JRF relies on the Java PathFinder model checker to generate interleavings that may result in data races, and explores the interleaving space driven by patterns that increase the probability of identifying a data race. JRF is property complete, since it relies on model checking and exploits a precise variant of happens-before analysis that does not produce false order constraints. It is not property sound, since it can miss program-specific synchronization mechanisms.

Relaxer, ISSTA 2011 [67]. In line with Java RaceFinder, Relaxer defines a dynamic analysis approach to detect data races in a relaxed memory model. The analysis detects potential data races in a sequentially consistent execution trace by computing the set of potential happens-before cycles, which represent possible violations of sequential consistency, uses the detected races to predict alternative interleavings on a relaxed memory model, and exploits a biased-random scheduler to force the occurrence of such interleavings. Relaxer is not property complete, since it uses a random scheduler that could miss some interleavings that suffer from a data race, and is not property sound, since it may miss program-specific synchronization mechanisms.

5.1.3 Low level — Hybrid

Hybrid techniques combine lockset and happens-before analyses to benefit from the accuracy of happens-before analysis with the efficiency of lockset analysis. Following the seminal work of Choi et al. [69], hybrid techniques limit the scope of expensive happens-before analysis to code fragments that either static or dynamic lockset analysis efficiently isolates as possibly affected by data races.
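The two-stage structure of hybrid techniques can be sketched as a cheap lockset filter followed by a happens-before check on the surviving candidates. This is a minimal illustration under stated assumptions, not the algorithm of any surveyed tool: the `Access` record, the function names and the precomputed vector clocks are all hypothetical.

```python
from itertools import combinations
from typing import FrozenSet, NamedTuple, Tuple


class Access(NamedTuple):
    """Hypothetical record of one memory access observed at runtime."""
    thread: int
    var: str
    is_write: bool
    locks: FrozenSet[str]    # locks held by the thread during the access
    clock: Tuple[int, ...]   # vector clock of the thread at the access


def lockset_candidates(accesses):
    """Cheap filter: conflicting accesses that share no common lock."""
    for a, b in combinations(accesses, 2):
        if (a.var == b.var and a.thread != b.thread
                and (a.is_write or b.is_write)
                and not (a.locks & b.locks)):
            yield a, b


def ordered(c1, c2):
    """True if vector clock c1 happens before c2."""
    return all(x <= y for x, y in zip(c1, c2)) and c1 != c2


def hybrid_races(accesses):
    """Expensive check, applied only to the candidates the filter lets through."""
    return [(a, b) for a, b in lockset_candidates(accesses)
            if not ordered(a.clock, b.clock) and not ordered(b.clock, a.clock)]


no_locks = frozenset()
accesses = [
    Access(0, "x", True, no_locks, (1, 0, 0)),
    # Thread 1 was forked after thread 0's write, so its clock dominates it.
    Access(1, "x", True, no_locks, (2, 1, 0)),
    # Thread 2 never synchronized with the other two threads.
    Access(2, "x", True, no_locks, (0, 0, 1)),
    # Both accesses to y hold lock m: the lockset filter discards the pair.
    Access(0, "y", True, frozenset({"m"}), (3, 0, 0)),
    Access(1, "y", True, frozenset({"m"}), (2, 2, 0)),
]
candidates = list(lockset_candidates(accesses))
races = hybrid_races(accesses)
```

In this trace the lockset filter alone would report three candidate pairs on `x`, including the pair ordered by the fork, which a pure lockset analysis would flag as a false positive. The happens-before stage removes that pair and leaves the two pairs that involve thread 2.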

The approaches differ in their focus, which can be on (i) improving performance by means of specific execution frameworks, (ii) reducing the amount of false positives, (iii) targeting specific synchronization mechanisms, (iv) covering relaxed memory models, and (v) completing the identified interleavings with test cases.

In line with the seminal work of Choi et al., Wester et al. and RaceMob improve performance by pipelining the analysis on multicore hardware [55] and by crowdsourcing to distribute data race detection across several executions [70], respectively. RaceFuzzer [71] reduces the amount of false positives by randomly generating executions that expose data races and observing their effects. RaceTrack [72] targets both lock-based and fork-join synchronization primitives, while Goldilocks [73] targets the synchronization mechanisms of software transactions. MultiRace [74] covers relaxed memory models. SimRT [75] complements interleavings with test case priorities, while Racageddon [76] and Narada [77] complement interleavings with test cases.

Seminal work

Choi et al., PLDI 2002 [69]. Choi et al. reduce the overhead of data race detection by combining static and dynamic optimization techniques. They determine a set of statements that can cause data races with static lockset analysis, and verify if the candidate statements lead to a data race by tracing the runtime memory accesses of the selected statements only. They introduce a weaker-than dependence relation among statements that identifies statements whose instrumentation would be redundant. They cache the memory accesses to further optimize the analysis. The combination of these optimizations reduces the runtime overhead down to 13% to 42%, by trading completeness for performance. The weaker-than relation guarantees to report at least one (not every) memory access operation involved in a data race on a memory location. As a consequence, the technique is not property complete. It is not property sound, since it could miss ad-hoc synchronization mechanisms. O’Callahan and Choi later extend this approach by relying on dynamic lockset analysis [42].

Improving performance

Wester et al., ASPLOS 2013 [55]. Wester et al. propose a parallel infrastructure that exploits multiple cores to speed up lockset and happens-before analyses. The technique splits an execution trace into multiple epochs, which are executed on different cores in a pipeline. A preliminary cheap analysis identifies epochs whose initial conditions depend on the results of other epochs to determine the minimum amount of information needed to initialize all epochs. The technique takes advantage of multiple cores to execute different epochs in parallel, and guarantees that an epoch runs entirely on a single core, thus significantly optimizing the analysis due to the limited synchronization on data structures that store information on individual epochs. The technique inherits the limitations of the tools it uses for data race detection, and thus it is neither property complete nor property sound.

RaceMob, SOSP 2013 [70]. RaceMob optimizes dynamic analysis with crowdsourcing. RaceMob statically identifies candidate data races with lockset analysis, and distributes the dynamic validation of these candidate data races to many people. Each individual monitors only the memory accesses related to a subset of candidate races, so only a small subset of memory accesses needs to be instrumented for each individual, which reduces the overhead of the single executions. RaceMob is property sound. It is not property complete,
