
Distributed Programming Framework for Fast Iterative Optimization in Networked Cyber-Physical Systems

RAHUL BALANI, IBM Research, India
LUCAS F. WANNER and MANI B. SRIVASTAVA, University of California, Los Angeles

Large-scale coordination and control problems in cyber-physical systems are often expressed within the networked optimization model. While significant advances have taken place in optimization techniques, their widespread adoption in practical implementations has been impeded by the complexity of internode coordination and lack of programming support for the same. Currently, application developers build their own elaborate coordination mechanisms for synchronized execution and coherent access to shared resources via distributed and concurrent controller processes. However, these mechanisms typically tend to be error prone and inefficient due to tight constraints on application development time and cost. This is unacceptable in many CPS applications, as it can result in expensive and often irreversible side-effects in the environment due to inaccurate or delayed reaction of the control system.

This article explores the design of a distributed shared memory (DSM) architecture that abstracts the details of internode coordination. It simplifies application design by transparently managing routing, messaging, and discovery of nodes for coherent access to shared resources. Our key contribution is the design of provably correct locality-sensitive synchronization mechanisms that exploit the spatial locality inherent in actuation to drive faster and scalable application execution through opportunistic data parallel operation. As a result, applications encoded in the proposed Hotline Application Programming Framework are error free, and in many scenarios, exhibit faster reactions to environmental events over conventional implementations.

Relative to our prior work, this article extends Hotline with a new locality-sensitive coordination mechanism for improved reaction times and two tunable iteration control schemes for lower message costs. Our extensive evaluation demonstrates that realistic performance and cost of applications are highly sensitive to the prevalent deployment, network, and environmental characteristics. This highlights the importance of Hotline, which provides user-configurable options to trivially tune these metrics and thus affords time to the developers for implementing, evaluating, and comparing multiple algorithms.

Categories and Subject Descriptors: C.2.4 [Computer-Communication Networks]: Distributed Systems—Distributed applications; D.1.3 [Programming Techniques]: Concurrent Programming; J.7 [Computer Applications]: Computers in Other Systems—Industrial control

General Terms: Design, Algorithms, Performance

Additional Key Words and Phrases: Wireless sensor/actuator networks, distributed optimization, distributed shared memory, synchronization, subgradient methods

ACM Reference Format:
Rahul Balani, Lucas F. Wanner, and Mani B. Srivastava. 2014. Distributed programming framework for fast iterative optimization in networked cyber-physical systems. ACM Trans. Embedd. Comput. Syst. 13, 2s, Article 66 (January 2014), 26 pages.
DOI: http://dx.doi.org/10.1145/2544375.2544386

This work is supported by the National Science Foundation under grant CNS-0435060, grant CCR-0325197, and grant EN-CS-0329609.
Authors' addresses: R. Balani (corresponding author), IBM Research, New Delhi, India; email: rabalani@in.ibm.com; M. B. Srivastava, Electrical Engineering Department, University of California, Los Angeles; L. F. Wanner, Computer Science Department, University of California, Los Angeles.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or permissions@acm.org.
© 2014 ACM 1539-9087/2014/01-ART66 $15.00

DOI: http://dx.doi.org/10.1145/2544375.2544386


1. INTRODUCTION

Wireless sensor and actuator networks (WSANs) form an integral part of cyber-physical systems (CPSs). Actuator control problems in several important CPS applications, such as water-efficient irrigation [Park et al. 2009], energy-efficient HVACs, self-reconfiguring visual surveillance [Kansal et al. 2006], and personalized light control [Singhvi et al. 2005], are often expressed as optimization of a cost function, involving control inputs to actuators and sensor data, through techniques proposed in model predictive control. Programmers implement these applications by selecting an optimization algorithm that satisfies their performance requirements given energy constraints and network characteristics of the deployment. This research focuses on the scenarios where these algorithms execute on distributed controllers and continuously drive the physical state of the world towards application-specific goals by iteratively estimating optimal control inputs of each actuator. Henceforth, this article interchangeably refers to these control inputs as actuator variables (or simply variables).

1.1. Problem

In the absence of any high-level coordination mechanisms, distributed implementations are hard to encode and subsequently tend to be error prone and inefficient due to complex communication requirements. Specifically, distributed controllers need to coordinate with each other for coherent access to shared resources, like sensors and actuators. In addition, some algorithms [Kansal et al. 2006; Nedic and Bertsekas 2001; Rabbat and Nowak 2004] require serializability guarantees such that the result of any distributed execution is equivalent to some predefined or random sequence of estimations (and actuations) in each iteration. Satisfying these communication requirements in wireless mesh networks is known to be hard due to intermittent link qualities and, consequently, the respective mechanisms tend to be rather elaborate. Moreover, in all this complexity, programmers often miss domain-specific optimizations that could improve performance of the algorithm and/or decrease its communication cost.

The persistent need for decreasing development costs and time-to-market often results in rushed implementations, causing further errors and inefficiencies. This problem is exacerbated when application developers have to implement and evaluate multiple algorithms in a short time to select the best one for final deployment in terms of its performance and cost. The latter claim is justified in this text by demonstrating that theoretical evaluations of multiple algorithms alone are insufficient for this selection process, as they typically use universal metrics which are detached from the implementation platform and deployment characteristics.

In contrast with traditional sensor network applications that do not involve actuation, errors or inaccuracies in the calculated values of actuator control inputs in CPSs could cause persistent and expensive side-effects in the environment. Moreover, inefficient execution could prevent timely reactions to events, which could again result in undesirable side-effects, such as over-watering in agriculture [Park et al. 2009], missed events in security surveillance [Kansal et al. 2006], and damage to the marine ecosystem through waste water overflow in sewers [Montestruque et al. 2008]. Therefore, application programmers typically over-provision the deployment with high-cost computation and communication resources to (1) counter the impact of inefficient execution, and (2) avoid the catastrophic effects through well-understood, robust, and, more often than not, centralized solutions proposed for resourceful environments, like desktops and servers. Subsequently, the cost of deployment is high and often prohibitive for applications that require a large number of sensors and actuators to employ coordinated control, such as in many commercial and residential buildings.


1.2. Objective and Proposed Solution

In this research, we aim to simplify the development of distributed optimization algorithms on wireless sensor and actuator networks such that the application:

(1) is error-free,
(2) exhibits a low reaction time (high performance) to environmental events,
(3) incurs a low messaging cost for efficient operation in resource-constrained scenarios, and
(4) can be trivially (re)configured for improved performance and cost under varying runtime characteristics.

Numerous distributed programming, macroprogramming, and middleware frameworks have been proposed in prior literature to simplify application development with minimal execution overhead, but they are either (a) incomplete, due to lack of synchronization primitives [Whitehouse et al. 2004; Hnat et al. 2008], (b) inefficient, due to missed opportunities for domain-specific optimizations [Kothari et al. 2007], or (c) inapplicable [Fung et al. 2002; Madden et al. 2003; Girod et al. 2008] with respect to support for distributed iterative algorithms. Further details are provided in Section 9.

We achieve the objectives stated previously through the design and implementation of the Hotline Application Programming Framework (APF). Hotline supports large-scale coordinated control in a wireless network of sensors and actuators through locality-sensitive shared memory and synchronization primitives. These primitives enable error-free implementation of iterative optimization algorithms, as they are proven to be correct, that is, deadlock-free, safe, and fair under realistic assumptions in Balani [2011].

Moreover, our prior work demonstrates that the proposed primitives can exploit spatial locality inherent in actuation to improve the reaction time of various applications and algorithms. Specifically, Balani et al. [2011b] introduce this domain-specific performance optimization in Hotline through the design and implementation of a locality-sensitive synchronization mechanism. It facilitates data parallel execution of (sub)iterations in the incremental subgradient method [Nedic and Bertsekas 2001] and a randomized local search algorithm [Kansal et al. 2006] such that the outcome is equivalent to some random sequence of estimations in each iteration (serializability). A subsequent publication [Balani et al. 2011a] demonstrates significant increases in efficiency via similar algorithmic modifications to consensus-based subgradient methods [Johansson et al. 2008].

1.3. Contributions

Relative to our prior work, this article makes the following contributions.

(1) It extends Hotline with a new locality-sensitive coordination mechanism that facilitates parallel execution of iterative optimization algorithms such that their outcome is equivalent to a predefined and static sequence of estimations. Although this requirement could be trivially satisfied by the prior mechanism [Balani et al. 2011b], the resulting implementation displays significantly worse performance. The correctness of the new algorithm is shown in Balani [2011].

(2) It adds iteration control schemes to the new and prior synchronization mechanisms in Hotline for controlling their message costs.

(3) It subsequently extends Hotline with several implementation and configuration options, including the choice of different coordination mechanisms, that exhibit widely varying performance and cost in differing runtime conditions. It therefore enables programmers to trivially configure the applications for efficient execution based on the runtime factors.


Finally, this article highlights the extremely variable impact of user-selectable configurations on the performance and cost of applications under a variety of deployment, network, and environmental factors. In some instances, the results demonstrate a 51–84% reduction in application latency due to data parallel execution with the new coordination mechanism in a network of 49 sensors/controllers. However, selection of the same coordination mechanism and configuration results in at least 18% higher latency over the traditional sequential execution in other deployment conditions due to a higher communication overhead. In another instance, the prior coordination mechanism (proposed in Balani et al. [2011b]) exhibits a lower message cost than the new algorithm when the events (i.e., modifications to the environment) are rare. This emphasizes the importance of the Hotline APF, which affords more time to the programmers for evaluating and comparing multiple algorithms by simplifying their implementation and configuration. Hotline has been implemented in TinyOS-2.x, and its source is available at Balani [2009].

2. LOCALIZED ACTUATION AND ITS IMPLICATION IN CPS APPLICATIONS

Hotline is naturally suited for implementing distributed optimization algorithms in numerous CPS applications where actuators demonstrate bounded spatiotemporal influence. This locality in actuation is due to the:

—physical limitations of actuation devices, like light sources, sprinklers, pan-tilt-zoom (PTZ) cameras, etc.,

—environmental characteristics of the deployment, like obstacles, or resistance to actuation through wind or slope of the agricultural fields, and

—application objectives that induce planned but localized overlap in influence zones of actuators to achieve complete geographical coverage.

In this section, we analyze the impact of localized actuation on the performance of iterative optimization algorithms. This analysis motivates the design and implementation of the Hotline APF, which enables programmers to discover and exploit this locality for automatic improvements in application performance. Without Hotline, developers have to manually detect this locality and fine-tune the system for optimal performance. Moreover, the resulting implementation is not agile and may require considerable effort for re-tuning when deployment characteristics change over time.

2.1. Optimization Problem

Consider the following optimization problem

$$\min_{x} \; f(x) = \sum_{i=1}^{N} f_i(x) \quad \text{s.t.} \quad x \in X, \qquad (1)$$

where each $f_i : \mathbb{R}^M \rightarrow \mathbb{R}$ is a convex function and $X$ is a nonempty, closed, and convex subset of $\mathbb{R}^M$. This class of optimization problems is found in optimal control for finite-time rendezvous of multiple dynamical agents, cooperative multitarget tracking, estimation and control in sensor/actuator networks, and resource allocation in computer networks. However, this article uses a simpler example of an intelligent light-control application commonly envisioned for huge office spaces, workshops, and theatre/multimedia production. It is important to note that the ideas discussed here are applicable to other algorithms and applications as well that demonstrate locality similar to the following analysis. For instance, Balani et al. [2011b] demonstrate a 22% improvement in performance for a distributed local search algorithm proposed for cooperative visual surveillance using a network of pan-tilt-zoom cameras [Kansal et al. 2006].


2.2. Personalized Light Control

Consider a light-control application with M light sources and N light sensors in a room. Each light sensor i corresponds to an occupant of the room and has an associated incident light intensity L*_i desired by the user. The sensors also act as distributed controllers that regulate the output intensities I = (I_1, . . . , I_M)^T of the light sources to achieve resultant light intensities such that the error between actual and desired values is minimized. This can be expressed as a convex optimization problem, where the optimal control inputs I* to the light sources are determined by

$$I^{*} = \arg\min_{I \in X} F(I) = \arg\min_{I \in X} \sum_{i=1}^{N} \left( L_i(I) + \phi_i - L^{*}_{i} \right)^{2} \qquad (2)$$
$$\text{s.t.} \quad L_i(I) = \sum_{j=1}^{M} a_{ij} I_j, \qquad X = \left\{ (I_1, \ldots, I_M)^{T} \mid 0 \leq I_j \leq I_{max} \right\},$$

where $L_i : \mathbb{R}^M \rightarrow \mathbb{R}$ models the resultant light intensity at sensor i from the M controllable light sources given I, and φ_i represents the modeling error and incident light at sensor i from uncontrollable sources in the room, like windows. Assuming the location and orientation of all light sources and sensors is fixed, the model coefficients a_ij ∈ [0, 1] are constant. I_max is the maximum output intensity of all light sources.
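To make the quadratic structure of Equation (2) concrete, the following minimal NumPy sketch evaluates F(I) for a hypothetical two-sensor, three-source deployment; the coefficients, ambient terms, and desired intensities are illustrative values and not taken from the paper.

```python
import numpy as np

def objective(I, A, phi, L_star):
    """F(I) = sum_i (L_i(I) + phi_i - L*_i)^2, with L_i(I) = sum_j a_ij * I_j."""
    residual = A @ I + phi - L_star
    return float(residual @ residual)

# Hypothetical 2-sensor / 3-source deployment; coefficients a_ij lie in [0, 1].
A = np.array([[0.8, 0.3, 0.0],
              [0.0, 0.4, 0.7]])
phi = np.array([10.0, 5.0])        # ambient light at each sensor
L_star = np.array([300.0, 250.0])  # desired intensities
I = np.array([200.0, 150.0, 180.0])
print(objective(I, A, phi, L_star))
```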

2.3. Incremental Subgradient Algorithm

A popular method for solving problems of type (1) is the subgradient algorithm, which consists of the following iterative procedure when applied to problem (2):

$$I^{k} = P_X \left[ I^{k-1} - \alpha \sum_{i=1}^{N} g_i(I^{k-1}) \right], \qquad (3)$$

where g_i(I) is a subgradient of f_i at I, α is a constant positive stepsize, and P_X denotes projection on the set $X \subset \mathbb{R}^M$ [Balani 2011]. The subgradient method could also be implemented in an incremental fashion, as proposed in Nedic and Bertsekas [2001]. For each kth iteration, this entails changing the variable I^{k-1} incrementally through N subiterations to obtain I^k, and, in each step, using only the subgradient corresponding to a single component function f_i. The procedure is shown as follows:

$$\psi_{i+1,k} = P_X \left[ \psi_{i,k} - \alpha \, g_i(\psi_{i,k}) \right], \quad \text{with } \psi_{0,k} = I^{k-1}, \qquad (4)$$

where ψ_{i,k} is the estimate at the ith subiteration of the kth iteration and, at the end of N subiterations, I^k = ψ_{N,k}. Rabbat and Nowak [2004] observed that this method can be performed in a distributed manner by circulating the estimate ψ_{i,k} between the sensor nodes following a logical ring, where they perform a subiteration according to Equation (4) using only a single subgradient corresponding to the node's component function f_i. This incremental subgradient scheme has advantages over the standard method (Equation (3)) in terms of rate of convergence.
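A minimal simulation sketch of the incremental update in Equation (4), assuming the quadratic component functions of Equation (2) and a box constraint set X. The step size, I_max, and model data are placeholders (the same illustrative values as in the earlier sketch); each iteration performs N subiterations in a randomized order and projects the estimate back onto X by clipping.

```python
import numpy as np

def grad_i(I, a_i, phi_i, Lstar_i):
    """Gradient of the component f_i(I) = (a_i . I + phi_i - L*_i)^2; since f_i is
    differentiable, this is also its (unique) subgradient."""
    return 2.0 * (a_i @ I + phi_i - Lstar_i) * a_i

def project(I, I_max):
    """Projection P_X onto the box X = { I : 0 <= I_j <= I_max }."""
    return np.clip(I, 0.0, I_max)

def incremental_subgradient(I0, A, phi, L_star, alpha, I_max, iterations, rng=None):
    """Equation (4): each iteration k runs N subiterations, each using only the
    subgradient of a single component function, in a randomized sequence."""
    rng = rng or np.random.default_rng(0)
    I = np.asarray(I0, dtype=float).copy()
    N = A.shape[0]
    for _ in range(iterations):
        for i in rng.permutation(N):          # randomized order of subiterations
            I = project(I - alpha * grad_i(I, A[i], phi[i], L_star[i]), I_max)
    return I

# Illustrative data: 2 sensors, 3 light sources.
A = np.array([[0.8, 0.3, 0.0],
              [0.0, 0.4, 0.7]])
phi = np.array([10.0, 5.0])
L_star = np.array([300.0, 250.0])
print(incremental_subgradient(np.zeros(3), A, phi, L_star,
                              alpha=0.05, I_max=400.0, iterations=200))
```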

Prior literature has typically evaluated these iterative optimization algorithms in terms of rate of convergence, defined as a function of the number of iterations μ required for the value of the objective function f(xμ) to reach within an ε-ball of the optimal value f(x∗) starting from an initial estimate x0. However, in practice, the user is ultimately concerned with convergence latency, defined as the time it takes for the algorithm to complete μ iterations for a given ε and x0. It is equivalent to the reaction time of the application to events. Events are defined as modifications in the external environment, such as users entering or exiting a room or changes in ambient light, to which the controllers must react by altering actuator control inputs.


Fig. 1. A sample four-node deployment. Edges between nodes in the figures on the left show internode dependencies, and the node identifiers determine the desired order of execution in respective figures.

Our prior work [Balani et al. 2011a] demonstrates that the former metric is often misleading, as a lower iteration count, although important, does not necessarily translate to lower latency.

It has been shown that the convergence rate of the incremental subgradient method depends on the exact sequence of execution, with the best rate attained when the sequence is randomized in each iteration at runtime. Implementing a dynamic ring topology to support this cyclical communication is known to be hard. Moreover, a quick analysis of the subgradient algorithm demonstrates that its convergence latency scales poorly for large networks of sensors and actuators due to the N subiterations and the number of link packets required to transport the updated estimates of M control inputs over possibly multiple hops after each ith subiteration. The latter component is typically omitted from theoretical evaluation of algorithms for universal applicability but has a significant impact on their performance, as discussed later in Section 8. The following analysis demonstrates that spatial locality in actuator influence can be exploited to reduce the convergence latency of the algorithm.

2.4. Data Parallel Execution

In the light-control application, we observe that each light source affects only a local and adjacent set of sensors, and the light intensity at each sensor is influenced by a local set of actuators. As a result, the coefficients a_ij of the light models L_i(I) in Equation (2) are nonzero only if source j influences sensor i. Subsequently, a trivial analysis of Equation (4) reveals that each controller i accesses and updates the estimates for only the local set of actuators β_i that influence its sensor measurements, that is, β_i = { j | a_ij ≠ 0, j ∈ [1, . . . , M]}. This results in localized data dependencies between controllers due to the common actuators they control.

Figure 1(a) shows that, with the consequential dependencies, executing the subiterations in some desired order of execution leads to the depicted dataflow graph (DFG) when the iterations are unrolled/unfolded. In an execution schedule determined by this DFG, nodes 2 and 3 can operate in parallel, as can nodes 1 and 4, due to the non-intersecting sets of actuator variables required by them. Further, note that the iterations overlap at nodes 1 and 4, that is, the nodes operate concurrently but execute different iterations of the algorithm. Subsequently, the algorithm latency is reduced not only due to parallel execution but also because only O(β_i) bits need to be exchanged at each hop instead of O(M) bits, where β_i ≪ M in most practical scenarios.
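A small sketch of this dependency analysis, under the assumption that the coefficient matrix is known: it computes β_i from the nonzero a_ij and, for a given execution order, places each controller at a DFG level such that controllers on the same level can execute their subiterations in parallel. The helper names and the example matrix are illustrative, not part of Hotline.

```python
import numpy as np

def actuator_sets(A):
    """beta_i: indices of actuators j with a_ij != 0 for each controller i."""
    return [set(np.flatnonzero(row)) for row in A]

def parallel_levels(beta, order):
    """Within one iteration, controller i depends on every controller that precedes it
    in `order` and shares at least one actuator with it. Its DFG level is one more than
    the maximum level of those predecessors; equal levels can run in parallel."""
    level = {}
    for pos, i in enumerate(order):
        preds = [j for j in order[:pos] if beta[i] & beta[j]]
        level[i] = 1 + max((level[j] for j in preds), default=0)
    return level

# Illustrative 4-controller, 5-actuator chain: adjacent controllers share one actuator.
A = np.array([[1, 1, 0, 0, 0],
              [0, 1, 1, 0, 0],
              [0, 0, 1, 1, 0],
              [0, 0, 0, 1, 1]], dtype=float)
beta = actuator_sets(A)
print(parallel_levels(beta, order=[0, 1, 2, 3]))  # {0: 1, 1: 2, 2: 3, 3: 4}: fully sequential
print(parallel_levels(beta, order=[1, 2, 0, 3]))  # controllers 0 and 2 share a level
```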

2.4.1. Scheduling Problem Formulation. Deriving this optimal parallel scheme can be formulated as a scheduling problem, where a set of N tasks (subiterations) needs to be scheduled on N controllers in each iteration such that the effect of executing the resulting minimal-length schedule over μ iterations is equivalent to executing the subiterations in a given serial order (i.e., serializability). This equivalence is defined in terms of the order of updates to actuator control inputs. The data requirements (i.e., shared actuator variables) of all the tasks are known only at runtime (but before the scheduling operation), and each controller must execute exactly one task corresponding to each iteration. As soon as a task completes execution at a node, a copy of the task corresponding to the next iteration is waiting to be scheduled at the same node, and it can be executed even when other nodes are executing (or waiting to execute) tasks from the previous iteration, as long as the serializability guarantee is not violated across iterations. This data parallel model is natural for the class of CPS applications considered in this research, as the sensor and actuator data are naturally distributed over the network and are available to the controllers in spatial locality of the respective devices.

Note that this is a simpler problem to solve than the classical distributed scheduling problems, as the tasks have already been allocated to their respective controllers. Prior literature in data parallel scheduling typically employs static dataflow analyses by unfolding the loop iterations to determine loop-carried dependencies [Parhi and Messerschmitt 1991]. This is not an option in Hotline for most of the applications where the data dependencies are determined at runtime by deployment characteristics. For the same reason, any iterative algorithm implemented without Hotline will require careful post-deployment analysis to discover these dependencies at runtime and achieve optimal performance through manual fine-tuning. This analysis could be a daunting task for developers, considering the scale of CPS applications currently envisioned for commercial, residential, and environmental spaces.

The optimal solution to the preceding problem, presented in Section 4, schedules the tasks (on their respective nodes) in the order defined by the DFG. It is incorporated in Hotline through coordination mechanisms supporting the synchronization primitives. While this solution focuses on a static sequence of operations, some algorithms require the sequence to be randomized in each iteration at runtime in order to avoid local minima, such as in a distributed version of simulated annealing [Kansal et al. 2006], or to obtain faster convergence, such as in the incremental subgradient method. Section 4 also describes the coordination algorithm proposed in Balani et al. [2011b] that supports this randomization at runtime with minimal programming effort.

This potential performance improvement is limited by several factors, such as the selected sequence, internode dependencies, and communication overhead. For instance, at one extreme end of the spectrum, a sequence such as the one shown in Figure 1(b) cannot be parallelized due to the sequential dataflow requirements in its corresponding DFG. Therefore, selecting a sequence with the best performance from amongst many possible permutations requires exhaustive comparison of their impact on the rate of convergence of the algorithm, attainable parallelism, and communication latencies. The following sections describe the design and implementation of the Hotline APF, which enables developers to navigate these nonintuitive algorithm- and deployment-specific trade-offs to balance the performance and cost of the applications with ease.

3. SOFTWARE ARCHITECTURE

In the Hotline APF, an application consists of three logical entities—sensors, actuators, and controllers—distributed throughout the network. They may reside physically on any device and communicate with each other over the network or locally through a software API. For instance, the controller entities reside at the respective sensors in the light-control application, while the actuator entities reside at the corresponding light sources. In contrast, the controller entities reside at the actuators in the visual surveillance application discussed in Balani et al. [2011b].

Programmers implement distributed algorithms at the controllers, where they access shared sensor data and current values of actuator variables using a concise get/put interface in each iteration, perform some computation to update the estimates of optimal control inputs, and manage actuators by writing back to the actuator variables.


Fig. 2.

Subsequently, Hotline reduces the programmers' burden, as its runtime manages the underlying routing, messaging, and discovery of nodes necessary for transparent and synchronized access to remote variables. It reuses mechanisms similar to Hood [Whitehouse et al. 2004] and Abstract Regions [Welsh and Mainland 2004] to share sensor measurements, which essentially follow a multiple-reader/single-writer model. This article does not delve into their details, but instead focuses on shared actuator variables that are accessed and modified by multiple controllers.

3.1. Coarse Physical Resources

Hotline associates each controller node i with a set σ(i) of all actuator variables that the node will need to access in every iteration of the distributed algorithm. Each unique set σ(·) is therefore represented collectively as a single resource that directly controls the effect of actuators on their physical environment. This is the first input to the parallel scheduling problem. It is formally defined as

$$\sigma(i) = \{\, u_j \mid j \in \beta_i \,\}, \qquad (5)$$

where u_j is the control input of actuator j, and β_i is the set of actuators which can influence the state of the environment in a local region A that is of interest to node i. Figure 2(b) illustrates this set of actuators for each controller in a personalized light-control application, where u_j = I_j.

Application developers can define these resources statically through compile-time specification of β_i for each controller, as shown in Figure 2(b), through the Resource interface, or let Hotline discover them dynamically through continuous rule-based application at each node. The latter option is provided to support runtime modifications in the deployment due to mobility or faults. It is used in Balani et al. [2011b] to implement distributed visual surveillance (Figure 2(c)), but is omitted here for brevity.


3.2. Locks and Synchronization Barriers

Hotline provides lock primitives to serialize access to shared actuator variables. In each iteration, programmers guard any read or write access to the variables by requesting a lock on the respective resource at each controller, as shown in Figure 2(a). They also associate a (unique) priority with each request to define the required order of execution in that iteration. This is the second input to the scheduling problem. Note that a single lock guards access to all the shared variables that are collectively defined as a resource at each controller. Combined with the lock arbitration protocols described in Section 4, this obviates the need for deadlock detection and resolution mechanisms at runtime that are common in prior approaches [Kothari et al. 2007], which manage locks at the granularity of individual variables.

While the locks guarantee serializability within an iteration, Hotline provides synchronization barriers to ensure coherency across consecutive iterations. Programmers place barriers at the end of each iteration to prevent nodes from proceeding to their next iteration until they have synchronized the current iteration with their selected set of neighbors. Hotline provides PhyLock, a distributed lock manager that grants locks to nodes in decreasing order of priorities and manages synchronization between nodes. PhyLock is short for Physical Lock, as it locks resources in the deployment that have a physical effect on the environment through actuation. Programmers can select from three different coordination mechanisms implemented in PhyLock that support both the lock and synchronization interfaces. This enables the same application code, with little or no modification, to work seamlessly across all three implementations, which exhibit a wide range of performance-cost characteristics. The next section describes all three mechanisms but focuses primarily on the two locality-sensitive protocols that enable opportunistic data parallel execution.
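The per-iteration pattern described above can be summarized as: request the lock with a priority, read and update the shared actuator variables through get/put, release the lock, and hit the barrier. The sketch below mimics that pattern with a hypothetical, simplified Python stand-in; the real framework exposes NesC interfaces (PhyLock, SharedMemory, Sync) with an event-driven structure rather than this blocking API.

```python
class HotlineStub:
    """Hypothetical stand-in for the Hotline runtime; illustrates the call pattern only."""
    def __init__(self, shared):
        self.shared = shared                 # global key -> actuator variable value

    def lock(self, priority):
        """Block until the priority-ordered lock on this node's resource is granted."""

    def get(self, key):
        return self.shared[key]

    def put(self, key, value):
        self.shared[key] = value             # propagated to the clique on release

    def unlock(self):
        """Release the lock; cached updates are synchronized with conflicting nodes."""

    def sync(self):
        """Barrier: wait until coordination neighbours reach the same iteration."""

def controller_iteration(hl, resource_keys, priority, update):
    hl.lock(priority)                               # guard all reads/writes this iteration
    estimate = {k: hl.get(k) for k in resource_keys}
    for k, v in update(estimate).items():           # local computation, e.g., one step of Eq. (4)
        hl.put(k, v)                                # write back updated control inputs
    hl.unlock()
    hl.sync()                                       # barrier before the next iteration

# Toy usage: one controller nudging its two actuators towards a setpoint of 100.
hl = HotlineStub({"u1": 0.0, "u2": 0.0})
controller_iteration(hl, ["u1", "u2"], priority=(1, 7),
                     update=lambda est: {k: v + 0.5 * (100.0 - v) for k, v in est.items()})
print(hl.shared)   # {'u1': 50.0, 'u2': 50.0}
```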

4. PHYLOCK: DISTRIBUTED LOCK ARBITRATION AND SYNCHRONIZATION

A token-based coordination mechanism in PhyLock, called PhyLock-Token, exposes the lock interface to support strictly sequential execution by enforcing mutual exclusion amongst all the nodes in the network. It makes use of a static ring topology overlaid on a wireless mesh network. On the contrary, the alternative locality-sensitive coordination mechanisms in PhyLock, named PhyLock-Static and PhyLock-Dynamic based on differing requirements (static vs. random) on the order of execution, solve the scheduling problem optimally to enable data parallel execution whenever possible. They enforce priority-based mutual exclusion only amongst the local set of nodes that require conflicting access to common actuator variables. This maintains serializability in writes to shared variables while simultaneously allowing nonconflicting nodes to acquire locks and execute in parallel.

A conflict is flagged between a pair of nodes when at least one of them needs to access a common variable for a write operation. The pairs of conflicting nodes are henceforth referred to as coordination neighbors, and all the neighbors of a node define its coordination neighborhood or clique. The distributed synchronization scheme in both coordination algorithms is derived from Peleg [2000] and operates on these cliques as well. In this mechanism, the nodes send a SYNC message to all the nodes in their neighborhood with the latest completed iteration number and wait to receive SYNC messages from them with an equal or higher iteration number. Peleg [2000] demonstrates that the iteration count at any two neighbors subsequently differs by at most one when one of the nodes in the pair is waiting for a third node to complete its older iteration and synchronize. The overlapping operation (shown in Figure 1(a)) can be explained by the recursive application of this result to the whole network.


4.1. Conflicts and Coordination Graph (G)

It follows from the previous discussion that discovery of conflicts and communication between conflicting neighbors form the primary building blocks of the proposed coordination mechanisms. Hotline automatically discovers pairs of conflicting neighbors and manages coordination cliques for each node, along with the communication between their respective members. It only requires programmers to associate (bind()) distinct READ or WRITE permissions, denoted by τ(i, u_j), with each actuator variable u_j accessed by a node i to enable this critical service. Subsequently, Hotline defines and utilizes a binary commutative operator, written here as ⊗, that operates on nodes i and l to return the set of all common variables u_j in resources σ(i) and σ(l) that need to be accessed for a write operation by at least one of the operands. It is expressed as

$$i \otimes l = \{\, u_j \mid u_j \in \sigma(i) \cap \sigma(l),\ \tau(i, u_j) = \texttt{WRITE} \ \vee\ \tau(l, u_j) = \texttt{WRITE} \,\}, \qquad (6)$$

that is, a common variable belongs to the set when at least one of the two nodes holds a WRITE permission on it. Thus, Hotline marks a conflict between nodes i and l (or, correspondingly, their lock requests) if and only if i ⊗ l ≠ ∅.

This article analyzes the properties of the coordination mechanisms implemented in Hotline by associating a coordination graph G with the network of controller nodes such that there is an edge between all conflicting nodes. Formally, it can be defined as G = (V, E), where V = {1, . . . , N} is the set of vertices represented by the N controller nodes, and E = {(i, l) | i ⊗ l ≠ ∅, ∀ i, l ∈ V} is the set of edges that denote conflicts between the corresponding end points (Figure 2(b)). Consequently, a node's coordination clique is equivalent to the set of its adjacent nodes in G and can be formally defined as

$$\xi(i) = \{\, l \mid (i, l) \in E,\ i \neq l \,\}. \qquad (7)$$
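A small sketch of the conflict test and clique construction of Equations (6) and (7), with hypothetical data structures: each node's resource is a set of variable keys, and τ is a per-node map from key to permission.

```python
def conflicts(sigma_i, sigma_l, tau_i, tau_l):
    """Conflict set of Eq. (6): common variables that at least one node must WRITE."""
    return {u for u in sigma_i & sigma_l
            if tau_i.get(u) == "WRITE" or tau_l.get(u) == "WRITE"}

def coordination_cliques(sigma, tau):
    """xi(i) of Eq. (7): all nodes l whose conflict set with i is non-empty."""
    nodes = list(sigma)
    return {i: {l for l in nodes
                if l != i and conflicts(sigma[i], sigma[l], tau[i], tau[l])}
            for i in nodes}

# Hypothetical 3-controller example: controllers 0 and 1 both write actuator "u1".
sigma = {0: {"u0", "u1"}, 1: {"u1", "u2"}, 2: {"u3"}}
tau = {0: {"u0": "WRITE", "u1": "WRITE"},
       1: {"u1": "WRITE", "u2": "WRITE"},
       2: {"u3": "WRITE"}}
print(coordination_cliques(sigma, tau))   # {0: {1}, 1: {0}, 2: set()}
```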

4.2. PhyLock-Dynamic

PhyLock-Dynamic is designed to randomize the sequence of operations (subiterations) at runtime and execute them in parallel whenever possible. Consequently, it employs separate lock and synchronization protocols to enable this randomization. Its lock arbitration mechanism is similar to a quorum-based mutual exclusion protocol [Maekawa 1985], where each node requesting a lock on a resource must communicate with every member of an associated coordination clique to convey the REQUEST, obtain permissions (REPLYs), and release the lock when it is done. It is based on the fact that if node i receives permission to access its resource σ(i) from all the nodes in its clique ξ(i), no other conflicting node can lock its resource. The protocol state machine is shown in Figure 3(a), and the algorithm is detailed in Balani et al. [2011b]. It is similar to that of Ricart and Agrawala [1981], except for its operation over cliques and inbuilt timeouts and retransmissions to ensure reliability in the face of communication losses.

PhyLock-Dynamic utilizes the unique priority ζ(i) associated with each new request for σ(i) to avoid deadlocks when multiple nodes request simultaneous access to resources. The priority is a (sequence, identifier) tuple, where the sequence number is provided by the programmer. A lower sequence number has a higher priority, but in case they are equal, unique node identifiers are transparently utilized by PhyLock to break the tie by selecting the node with a lower identifier. The sequence numbers can be set using global time stamps with a time sync protocol like FTSP (default), Lamport's logical clocks [Lamport 1978], or Maekawa's sequence numbers [Maekawa 1985], such that the order of execution can be randomized by controlling the invocation time of lock requests.
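Because the priority is a (sequence, identifier) tuple compared with lower values winning and the identifier breaking ties, the rule can be expressed directly with lexicographic comparison; the snippet below is an illustrative check, not Hotline code.

```python
# Priority is a (sequence, node_id) tuple; lower compares as higher priority, with the
# node identifier breaking ties. Python's tuple ordering captures this directly.
def higher_priority(p, q):
    """True if priority p = (sequence, node_id) outranks q."""
    return p < q

print(higher_priority((5, 3), (7, 1)))  # True: lower sequence number wins
print(higher_priority((5, 3), (5, 1)))  # False: equal sequence, lower id (node 1) wins
```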

4.2.1. Iteration Control. As a desirable side-effect of this lock arbitration protocol, controllers can stop or restart iterations on demand via some modifications to the synchronization scheme proposed in Peleg [2000]. Stopping the iterations when the system achieves stability after reacting to all past events could potentially reduce the cumulative message cost of the protocol to offset the overhead of additional message exchange.


Fig. 3. State machines for proposed coordination mechanisms in PhyLock.

In addition, due to various deployment and environmental factors, it may sometimes be desirable to allow only a subset of nodes in the spatiotemporal locality of events to react to them. This is because other nodes in the network may not be affected significantly by the changes to actuator control inputs in the locality of the events, and the system operation (as measured by the sensors) may still be within tolerable limits. Therefore, PhyLock-Dynamic incurs a lower message cost when events are rare, and the optimization algorithm needs only a few iterations to calculate optimal values of control inputs. The latter condition is typically satisfied when the initial estimates (I0) of the control variables are not far from the optimum (I∗).

Hotline exposes an IterationControl interface comprising start() and stop() commands that are implemented by PhyLock-Dynamic, as explained next. On detecting the absence of events at a node, the application can command the iterations to stop(). PhyLock-Dynamic puts the node in an idle state, granting permissions to all future REQUESTs irrespective of priorities. In this state, the synchronization module at the node automatically responds to any SYNC messages from the neighbors by transmitting a SYNC message containing its iteration count updated to match the maximum of the iteration counts at any of its neighbors. Hotline therefore maintains an internal iteration counter, hidden from the application, that is always used for synchronization. The internal counter increments by one with every call to sync() in normal execution. This enables all of the node's neighbors to proceed with their iterations, as previously discussed. It is henceforth referred to as the dummy sync. In the absence of this modification, the iterating nodes would wait indefinitely for the non-iterating nodes in their neighborhood to synchronize and therefore block the progress of the algorithm.

However, as soon as any event is detected at a non-iterating node, it can restart its iterations by invoking the start() command and requesting a lock. PhyLock-Dynamic restarts normal execution by transmitting a REQUEST message with a new priority, and the synchronization module stops the dummy sync procedure. Any future calls to the sync() function automatically utilize the latest iteration count that the module may have reached during its dummy sync phase. This is necessary to prevent the node from obtaining an undue advantage from its actual iteration count, which may have become significantly lower than its neighbors' counts due to its stopped iterations.
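The stop/start behaviour around the dummy sync can be pictured as follows; this is a hypothetical, much-simplified model of the internal counter described above, not the TinyOS implementation.

```python
class SyncState:
    """Toy model: an idle (stopped) node answers any neighbour SYNC by advancing an
    internal counter to the maximum seen in its clique, so iterating neighbours are
    never blocked; on restart, normal sync() continues from that advanced count."""
    def __init__(self):
        self.internal_count = 0
        self.iterating = True

    def stop(self):
        self.iterating = False

    def start(self):
        self.iterating = True           # future sync() calls reuse internal_count

    def on_sync_received(self, neighbor_count):
        if not self.iterating:
            self.internal_count = max(self.internal_count, neighbor_count)
            return self.internal_count  # dummy SYNC reply with the updated count
        return None

    def sync(self):
        self.internal_count += 1        # normal execution: one increment per iteration
        return self.internal_count

s = SyncState()
s.sync(); s.sync()            # two normal iterations -> internal count 2
s.stop()
print(s.on_sync_received(5))  # idle node jumps to 5 so neighbours are not blocked
s.start()
print(s.sync())               # resumes from the advanced count: 6
```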


4.3. PhyLock-Static

PhyLock-Static is designed for continuously iterating nodes, irrespective of events. It eliminates the need for redundant REQUEST messages when priorities are predefined and static, and replaces the REPLY messages with SYNC to combine synchronization with lock arbitration. As a result, it improves performance and reduces message overhead during the reaction phase of the application, when the controllers are actively adjusting control variables to react to events in the environment. This protocol is therefore suitable for scenarios with a high frequency of events and/or where the number of required iterations is high.

4.3.1. Algorithm. In this algorithm, nodes first exchange their static priorities ζ(·) with their coordination neighbors and store the received values in a neighbor table. This happens only once at the beginning of the deployment, thus its message cost is amortized over the lifetime of the deployment. After this phase is over, the nodes initialize their iteration count C(i) to one and begin in state lock wait, where they wait to acquire a lock on their respective resources. However, they set Ci(j) = 0 for the other nodes in their clique, that is, ∀j ∈ ξ(i). The state machine is shown in Figure 3(b). The lock requests are implicit in this method, and the algorithm proceeds as follows.

—ST1. A node i acquires the lock on its resource if all the higher-priority nodes in its clique have completed C(i) iterations, as indicated by the Ci(j) stored at the node, that is, if the following condition is satisfied: Ci(j) = C(i) ∀j ∈ ξ(i) s.t. ζ(j) < ζ(i). It releases the lock when it is done and transmits a SYNC message to its clique signalling its new iteration count C(i). Next, it moves to state sync wait, where it waits for the lower-priority nodes to complete their iterations. However, if the condition is not satisfied, it stays in the lock wait state and waits to receive SYNC messages in step ST2.

—ST2. Upon receiving a SYNC message from a neighbor j, a node i updates the stored iteration count Ci(j) from the received message. It moves to step ST1 if it is in state lock wait; otherwise, it moves to step ST3.

—ST3. A node i moves to state lock wait if all the lower-priority nodes in its clique have completed C(i) iterations, that is, Ci(j) = C(i) ∀j ∈ ξ(i) s.t. ζ(j) > ζ(i). It also increments C(i) before moving to step ST1 as soon as the preceding condition is satisfied. Otherwise, it stays in the sync wait state and moves back to step ST2.
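The following round-based simulation sketch plays out steps ST1–ST3 on a small clique graph, assuming instantaneous SYNC delivery and using node identifiers as static priorities (both assumptions are illustrative). It shows non-conflicting nodes locking in the same round and consecutive iterations overlapping across the network.

```python
def simulate_phylock_static(cliques, priority, max_iter):
    """Each round reports the set of nodes that acquire their locks in that round."""
    nodes = list(cliques)
    C = {i: 1 for i in nodes}                  # iteration a node is currently attempting
    state = {i: "lock_wait" for i in nodes}
    seen = {i: {j: 0 for j in cliques[i]} for i in nodes}   # C_i(j) learned from SYNCs
    schedule = []
    while any(C[i] <= max_iter for i in nodes):
        # ST1: lock if every higher-priority clique member has completed C(i) iterations.
        granted = [i for i in nodes
                   if state[i] == "lock_wait" and C[i] <= max_iter and
                   all(seen[i][j] >= C[i] for j in cliques[i] if priority[j] < priority[i])]
        schedule.append(sorted(granted))
        for i in granted:                      # execute subiteration C(i), release, SYNC
            for j in cliques[i]:
                seen[j][i] = C[i]              # ST2 at every clique member
            state[i] = "sync_wait"
        for i in nodes:                        # ST3: wait for lower-priority members
            if state[i] == "sync_wait" and \
               all(seen[i][j] >= C[i] for j in cliques[i] if priority[j] > priority[i]):
                C[i] += 1
                state[i] = "lock_wait"
    return schedule

# Hypothetical 4-node chain of conflicts; identifiers double as static priorities.
cliques = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
priority = {1: 1, 2: 2, 3: 3, 4: 4}
print(simulate_phylock_static(cliques, priority, max_iter=2))
# [[1], [2], [1, 3], [2, 4], [3], [4]] -- node 1 runs its second iteration while node 3
# is still on its first, so iterations overlap at non-conflicting nodes.
```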

4.3.2. Iteration Control. It is observed that PhyLock-Static imposes an unnecessary message overhead when the system has stabilized and the controllers do not need to modify the actuator variables. Therefore, a controllable and bounded delay Δ is introduced at step ST3 to delay the move to state lock wait after a node i has determined that all the lower-priority nodes in its clique have synchronized with it. The proposed adaptive mechanism enables programmers to control the inter-iteration period through the IterationControl interface and tune the message cost of the algorithm to runtime conditions. On every call to the stop() command, it doubles the period until a compile-time configurable maximum limit is reached. If the current period is set to zero, it sets the period to a configurable minimum value. This forced delay in iterations slows the rate of message exchange but potentially increases the reaction time of the application when the system has to react to a new event after achieving stability. The latter is influenced by the state of Δ and the iterations at all the nodes that must execute before the event-detecting node(s) can obtain lock(s) in the next iteration. Nevertheless, the mechanism resets the period to zero as soon as any event is detected to enable a quicker response after the initial delay.
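A sketch of the adaptive inter-iteration delay Δ described above: each stop() doubles the period up to a maximum, and an event (start()) resets it to zero. The concrete limits are hypothetical, since the actual bounds are compile-time configuration options.

```python
class IterationDelay:
    MIN_PERIOD_MS = 250      # hypothetical compile-time minimum
    MAX_PERIOD_MS = 16000    # hypothetical compile-time maximum

    def __init__(self):
        self.period_ms = 0

    def stop(self):
        """System stable: back off the iteration rate."""
        if self.period_ms == 0:
            self.period_ms = self.MIN_PERIOD_MS
        else:
            self.period_ms = min(2 * self.period_ms, self.MAX_PERIOD_MS)

    def start(self):
        """Event detected: react as quickly as possible again."""
        self.period_ms = 0

d = IterationDelay()
for _ in range(3):
    d.stop()
print(d.period_ms)   # 1000 after three stop() calls (250 -> 500 -> 1000)
d.start()
print(d.period_ms)   # 0
```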

The decision to exercise control over algorithm iterations is application specific. Programmers can use sensor measurements to detect events in their locality and optionally combine it with a flooding mechanism to propagate the news of detection and trigger iterations at neighboring nodes. The latter mechanism could potentially reduce the reaction time with PhyLock-Static by recursively speeding up iterations at all the nodes in the network, but may not be necessary with PhyLock-Dynamic, which does not require participation from all the nodes. If required, such a mechanism needs to be implemented separately with PhyLock-Static. However, with PhyLock-Dynamic, the lockRequested() event defined in the PhyLock interface could be used to trivially implement the wakeup scheme by treating the explicit lock REQUESTs as wakeup triggers.

5. IMPLEMENTATION

Hotline has been implemented in NesC as a runtime library on top of TinyOS-2.x. It consists of three components that reside with the respective logical entities. Programmers configure the components at sensors and actuators, besides encoding the distributed algorithm at controllers, to enable resource management and discovery of cliques for supporting all localized coordination mechanisms in Hotline.

5.1. Clique Management

Hotline represents the resource at each node as a list of preconfigured, globally unique keys associated with the control inputs of actuators in β_i (Equation (5)). The Clique Manager in Hotline proactively advertises the resource definitions, along with their respective permissions, throughout the network using the DIP dissemination protocol [Lin and Levis 2008] to identify and maintain the coordination clique at each node (Equations (7) and (6)). Subsequently, it enables programmers to iterate over the list of a node's neighbors through a NeighborTable interface, similar to Whitehouse et al. [2004].

In the process, it also discovers and maintains unidirectional routes from a node to all the members of its clique. Programmers can control the discovery process through the Sync interface, as shown in Figure 2(a) for the light-control application. Any transmission from a node to a coordination neighbor uses source routing along the discovered path to provide unreliable delivery of packets. Communication in the reverse direction uses the path maintained by the original recipient, as the cliques are symmetric (Equations (6), (7)). Packetization and reliable delivery of messages are pushed up to the coordination mechanisms that utilize services provided by the Clique Manager. Consequently, Hotline provides a compile-time flag that allows the Manager to broadcast packets or use another unreliable multicast routing service provided by the programmer.

The current implementation of Hotline also includes the actuators in β_i as a part of the clique at each controller to enable sharing of all the updates to actuator control variables with the respective actuators as well. The Hotline library at the actuators therefore defines the resource at each actuator as comprising only its unique identifier bound with the READ permission. It propagates these definitions amongst the controllers, and vice versa disseminates controller resource definitions in the actuator network to maintain symmetry of cliques. However, the actuators do not interfere with the other coordination mechanisms running at the respective controllers.

5.2. Shared Variable Repositories

The Shared Memory Manager in Hotline transparently caches shared actuator variables at the respective controllers to reduce data access latency. Following an eager Release Consistency (RC) model for cache coherency, any updates to a local cache are synchronized across all relevant copies during the lock release operation. This model is favored over lazy RC schemes, which synchronize updates at the next lock request, because each node undergoes multiple iterations and another node is typically already waiting to acquire its lock on a conflicting resource. As a result, the eager model minimizes the expected time a node has to wait for consistent data access and reduces the number of messages exchanged over the network by bundling all updates in a single message. Similarly to the coordination mechanisms described in previous sections, the cache coherency protocol propagates variable updates originating at a node to only its clique.

Hotline proactively mirrors these caches of actuator control inputs accessed by a controller, along with any other normal variables shared by it, at each of its neighbors, similar to Whitehouse et al. [2004]. These Shared Variable Repositories (SVRs), although not necessary for the implementation of the Shared Memory Manager, enable neighborhood-based consensus algorithms [Balani et al. 2011a] and other spatial operations proposed in Welsh and Mainland [2004] by exposing direct access to these repositories through a SharedVariable interface [Whitehouse et al. 2004]. It is important to note that while the SharedMemory interface, used in Figure 2(a), indexes each control variable by its global key and returns its latest value irrespective of the controller that updated it, the SharedVariable interface indexes each variable with a (key, node id) tuple to return the value written by the controller with identifier = node id. The SharedVariable interface is also used by the implementations of the synchronization protocols to exchange SYNC messages by sharing the iteration count at a node as a normal variable with its neighbors.
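The difference between the two indexing schemes can be illustrated with a toy model (hypothetical classes, not the NesC interfaces): SharedMemory-style access keeps one latest value per global key, whereas SharedVariable-style access keeps one value per (key, writer) pair.

```python
class SharedMemoryView:
    """One latest value per global key, regardless of which controller wrote it."""
    def __init__(self):
        self.latest = {}                       # key -> latest value
    def put(self, key, value):
        self.latest[key] = value
    def get(self, key):
        return self.latest[key]

class SharedVariableView:
    """One value per (key, node_id) pair: the value written by that specific controller."""
    def __init__(self):
        self.per_writer = {}                   # (key, node_id) -> value
    def put(self, key, node_id, value):
        self.per_writer[(key, node_id)] = value
    def get(self, key, node_id):
        return self.per_writer[(key, node_id)]

svr = SharedVariableView()
svr.put("u1", node_id=3, value=0.7)
svr.put("u1", node_id=5, value=0.9)
print(svr.get("u1", node_id=3), svr.get("u1", node_id=5))   # 0.7 0.9
```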

6. EVALUATION METRICS AND SETUP

In this section, we describe the evaluation metrics and the common experimental setup used in Sections 7 through 8 to analyze the impact of various user-configurable parameters, including the choice of coordination mechanisms, the respective iteration control schemes, and the desired sequence of operation, on the performance and cost of the personalized light-control application implemented in Hotline.

6.1. Performance and Cost

Recall that the performance of applications is measured in terms of the time required to complete μ ≥ 1 iterations of the algorithm. In parallel execution with PhyLock-Static, it is given by

\[
T(\mathit{DFG}, \mu) =
\begin{cases}
(S - O_v)\,(\mu - 1)\,\delta + S\,\delta, & \text{if } \mu > 1,\\
S\,\delta, & \text{if } \mu = 1,
\end{cases}
\qquad (8)
\]

where δ is the mean scheduling delay between subsets of nodes that can be scheduled in consecutive steps, S is the number of delay units required to execute one complete iteration of the algorithm, and Ov is the number of overlapping delay units between consecutive iterations for μ > 1. The associated DFG is generated from the analysis of the desired sequence of operations given internode data dependencies in the deployment. Figure 4 illustrates these parameters through an example of a 3 × 3 grid of nodes when the desired sequence of operations is determined by their unique identifiers. The minimum value of Ov is 0, when consecutive iterations overlap but do not decrease application latency, such as in Figure 1(a). However, if the nodes are executed sequentially, as in the classical implementation of the incremental subgradient method (using PhyLock-Token), or a parallel schedule does not exist due to reasons mentioned in Section 2, then S = N − 1 and Ov = −1, resulting in T = (Nμ − 1)δ for N nodes in the deployment. For execution with PhyLock-Dynamic, Equation (8) needs to be modified, as the schedule is irregular due to randomization. However, this change is straightforward, and the modified equation is given by

\[
T(\mathit{DFG}, \mu) =
\begin{cases}
\sum_{k=1}^{\mu-1} (S_k - O_{v,k})\,\delta + S_{\mu}\,\delta, & \text{if } \mu > 1,\\
S_1\,\delta, & \text{if } \mu = 1,
\end{cases}
\qquad (9)
\]

where Sk and Ov,k vary with each iteration k.
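To make Equations (8) and (9) concrete, the following small Python sketch evaluates both expressions; the function names and the list-based encoding of the per-iteration values Sk and Ov,k are illustrative assumptions.

    # Worked sketch of Equations (8) and (9); delta is the mean scheduling delay.
    def latency_static(S, Ov, mu, delta=1.0):
        """Eq. (8): regular parallel schedule under PhyLock-Static."""
        if mu == 1:
            return S * delta
        return (S - Ov) * (mu - 1) * delta + S * delta

    def latency_dynamic(S_list, Ov_list, delta=1.0):
        """Eq. (9): S_k and Ov_k vary per iteration under PhyLock-Dynamic.
        S_list has mu entries; Ov_list has mu - 1 entries (overlaps between
        consecutive iterations)."""
        mu = len(S_list)
        if mu == 1:
            return S_list[0] * delta
        return sum((S_list[k] - Ov_list[k]) * delta for k in range(mu - 1)) + S_list[-1] * delta

    # Example: strictly sequential execution (as with PhyLock-Token) over N nodes
    # has S = N - 1 and Ov = -1, which reduces to T = (N*mu - 1)*delta, as stated above.
    N, mu = 9, 3
    assert latency_static(S=N - 1, Ov=-1, mu=mu) == (N * mu - 1)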


Fig. 4. (a) A sample nine-node deployment; (b) the metrics S and Ov. Dotted lines demarcate consecutive iterations that overlap at concurrently-executing nodes.

In a distributed system, the scheduling delay δ is determined primarily by the communication delay, as computation latency is relatively negligible. Correspondingly, the scheduling cost (δC) is defined as the number of messages exchanged by each controller for scheduling its execution in every iteration of the algorithm. Programmers can select an appropriate coordination mechanism and configure its iteration control scheme to influence the value of both δ and δC, as required. The iteration count μ of the incremental subgradient method is affected by the choice of execution sequence and the user tolerance bound B on the difference between desired and actual light intensities obtained at the sensors. The remaining parameters S and Ov are impacted by the choice of desired execution sequence and the selected coordination mechanism, which enforces strictly sequential (PhyLock-Token), regular parallel (PhyLock-Static), or randomized parallel execution (PhyLock-Dynamic).

The evaluation in subsequent sections demonstrates that the impact of these user-controllable parameters on the performance and cost of the application, through δ, S, Ov, μ, and δC, respectively, varies with the following deployment, network, and environmental factors: (1) number of sensors (N); (2) number of actuators (M); (3) influence range of actuators (γ); (4) communication topology (Gcomm), determined by communication range; (5) network stack (Layer 2 and Layer 3); (6) packet loss rates; and (7) user entry/exit events. In the process, it intends to emphasize that no single configuration has a clear advantage over others in all the scenarios considered in this text. For instance, in one scenario, the difference in latency between the best and worst configuration is observed to be as large as 47x.

6.2. Simulation Setup

In order to perform this evaluation, the personalized light-control application is implemented in Hotline and two different types of deployments, D1 and D2, of M light sources and N sensors/controllers are simulated in the TOSSIM simulator [Levis et al. 2003]. The desired light intensities L∗i associated with the sensors are selected randomly to generate ten different scenarios for the simulations. They are activated when the respective users enter the office space (detected by some other mechanism), and are otherwise set to zero to conserve electrical energy when the users exit. Application code implementing the incremental subgradient method is shown in Figure 2(a). For simplicity, the three implementations of the application are referred to as Light-Token, Light-Static, and Light-Dynamic based on their respective underlying coordination mechanisms.


Fig. 5. Types of deployments.

Wherever applicable, programmers can configure the parameters that are specific to the algorithm (step-size α, sequence), to Hotline (iteration control, maximum inter-iteration period), or to the application (B).

The simulated deployments, shown in Figure 5(a), represent different possible organizations and densities of lights in a typical office space with cubicles. They can be ordered in increasing range of actuators (measured by the number of sensors γj they affect) as D1, D2.1, and D2.2. This variance captures the effect of not only the fading of light with distance, but of obstacles, such as cubicle walls, as well. Consequently, the size of coordination cliques (ξ(i)) also increases in these deployments due to higher overlap in actuator influence zones. The deployment parameters are summarized in Figure 5(b). It is important to note that in Figure 5(c), the average size of cliques also increases with the number of sensors and actuators in the network and asymptotically approaches the corresponding maximum |ξ|max fixed by the actuator influence range. For a given number of sensors N, deployment D1 has a higher number of actuators M than D2.x, such that M > N.

All sensors in these deployments are configured to be 1-hop away from the respective actuators in βi that could influence their measurements, assuming that the communication range is at least equal to, if not greater than, the actuation range. This is also desirable for allowing sensors to directly control the actuators. However, the communication range of sensors is varied in the simulations to vary internode connectivity in the underlying 1-hop communication graph Gcomm. The network stack has been implemented on top of the stock CSMA MAC layer provided in TinyOS-2.x and provides two unreliable routing services: the first service, referred to as BCAST, provides a simple broadcast mechanism that is used in simulations where all the pairs of coordination neighbors are within 1-hop communication range; and the second service, referred to as MHOP, provides one-to-one unicast over multiple hops using source routing. These two schemes can be placed at opposite ends of the message-efficiency spectrum, and their usage demonstrates the impact of the network stack on the performance of coordination protocols in Hotline. The performance with other multicast mechanisms is expected to vary between these two extremes, depending on their respective efficiencies.

Finally, different packet-loss rates are simulated by randomly dropping packets at nodes. A maximum link MTU of 127 bytes for IEEE 802.15.4 is used, which roughly allows plink = 96 bytes of data payload, including all routing and transport headers.


Fig. 6. Impact of selected sequences with PhyLock-Static.

In practice, although smaller link packets have significantly higher delivery ratios, reliable delivery is assumed with the selected MTU to obtain optimistic results for Light-Token, which exchanges larger messages than the localized coordination mechanisms proposed in Hotline.

7. MICROBENCHMARKS

This section evaluates the impact of the PhyLock-Static/Dynamic coordination mechanisms and the selected sequence on scheduling delay (δ), cost (δC), and the parameters S and Ov. It is an application-independent analysis, as it does not consider the influence of the execution sequence on the iteration count μ of the optimization algorithm.

7.1. Impact of Coordination Mechanisms

In execution with PhyLock-Static or -Dynamic, both δ and δC scale with the size of coordination cliques (ξ) and tolerate packet loss with graceful degradation of performance. The detailed results are presented in Balani [2011] but omitted here for brevity. They demonstrate that the proposed coordination mechanisms can scale to a large number of nodes N as long as the average clique size |ξ| ≪ N. Similar results are observed for δC as well. As expected, execution with PhyLock-Dynamic exhibits a higher δ and δC than PhyLock-Static due to its higher communication overhead. However, with the iteration control mechanisms enabled in both protocols, Section 8.4 demonstrates that PhyLock-Dynamic decreases reaction time and message cost under certain assumptions on the spatiotemporal distribution of events.

7.2. Impact of Selected Sequence

This section verifies the impact of the selected sequence(s) on the extent of parallelism in an execution schedule and, subsequently, their impact on application latency as a function of the iteration count μ. It first considers execution with PhyLock-Static due to its regular schedule.

7.2.1. PhyLock-Static. This research quantifies the extent of parallelism in an execution schedule using the tuple 〈S, S − Ov〉. Equation (8) demonstrates that, given a fixed δ, it is desirable to minimize both components of the tuple to reduce application latency. The table in Figure 6(a) lists the values of 〈S, S − Ov〉 for different sequences of operations executed in networks of varying sizes using PhyLock-Static. Each column in the table demonstrates that, for a fixed sequence and network size, parallelism decreases with increasing range of actuators. This is symbolized by increasing values of S and S − Ov from top to bottom in the respective columns. Moreover, given a constant actuator range and network size across each row, the values of the tuple demonstrate a wide range of parallelism in schedules generated from different sequences.

Assuming a constant δ = 1 unit, Figure 6(b) plots application latency (given by Equation (8)) as a function of the iteration count μ. All the curves are therefore linear, starting at offset S and increasing with slope equal to S − Ov.


Fig. 7. The graphs demonstrate the parallel-slowdown effect through observed application latencies and mean scheduling delays for different sequences in execution with PhyLock-Static.

Fig. 8. The graphs plot observed application latency as a function of μ in execution with PhyLock-Static.

Note that this is the expected impact on application latency in terms of the number of delay units. The sequence labeled “Ring” is the worst in terms of parallelism, with S ≈ N − 1 and Ov = −1, respectively, confirming that the execution is sequential and that its iterations cannot overlap in the generated schedule. On the contrary, the sequence marked “GColor” is the best, as it is obtained from the output of a centralized, greedy graph-coloring heuristic algorithm. The other sequences labeled “Seq-*” are selected randomly, while the one labeled “Nodeid” corresponds to an order specified by the unique node identifiers. The relative performance of the application with different sequences depends on the required iteration count. This is an important consideration in the selection of an execution sequence, because the sequence also influences the rate of convergence of the algorithm, which ultimately determines the iteration count.
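A GColor-style sequence can be thought of as the output of a standard greedy coloring pass over the conflict graph induced by the coordination cliques: nodes in the same clique receive different colors, nodes of the same color can execute concurrently, and ordering nodes color by color yields a highly parallel sequence. The sketch below illustrates this idea; the function name, the highest-degree-first ordering, and the tie-breaking rule are assumptions and not necessarily the exact heuristic used here.

    # Sketch of deriving a GColor-like sequence from a conflict graph (assumed details).
    def greedy_color_sequence(conflicts):
        """conflicts: dict mapping node -> set of conflicting (clique) neighbors."""
        color = {}
        for node in sorted(conflicts, key=lambda n: -len(conflicts[n])):  # high degree first
            used = {color[nbr] for nbr in conflicts[node] if nbr in color}
            c = 0
            while c in used:
                c += 1
            color[node] = c
        # Order nodes color by color; ties broken by node id for determinism.
        return sorted(color, key=lambda n: (color[n], n))

    # Example: a 4-node line where only adjacent nodes conflict.
    seq = greedy_color_sequence({1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}})  # [2, 4, 1, 3]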

According to the preceding discussion, the execution with sequence Ring is expected to perform the worst, compared to the others, irrespective of network size and actuator influence range in deployments D1 or D2.1. This is substantiated by the sharp increase in the number of delay units required to complete μ iterations in Ring. However, the measured impact of sequence Ring on the application latency, shown in Figures 7 and 8, is not significantly worse than the others, contrary to expectations. In fact, for N = 49 nodes in deployment D2.1, shown in Figure 7(a), it displays almost equivalent or better performance than the other sequences for μ ≤ 30. This effect can be attributed to the 3–5x increase in scheduling delay δ with the other sequences, as shown in Figure 7(b) for N = 49 nodes. It is due to a higher rate of packet collisions at the MAC layer induced by concurrently-executing nodes. This communication overhead offsets the advantages of parallelism in execution schedules and results in parallel slowdown.

However, deployments with low actuator ranges, such as D1 (shown in Figure 8(a)), exhibit better performance with parallel schedules due to reduced collisions in small neighborhoods. Similarly, networks with a large number of nodes, such as N = 100, display lower application latencies with parallel schedules, even in deployments with high actuator ranges, such as D2.1 (Figure 8(b)), simply due to sequential execution with Ring that scales with N.


Fig. 9. The graphs plot expected and measured application latency as a function of μ in execution with PhyLock-Dynamic. They are compared against execution with PhyLock-Static obtained from Figure 8. These simulations also use BCAST for packet exchange.

However, this improvement is also subsequently reduced and eventually eliminated with increasing actuator influences when the size of the network remains constant.

Although the impact of the exact sequence on δ and δC is ignored in Section 7.1, it is now important to state that the sequence Ring was in fact used in that evaluation. Figure 7(b) confirms that the mean value of δ increases with other sequences due to the increased rate of packet collisions, as discussed previously.

7.2.2. PhyLock-Dynamic. The preceding analysis can be extended to PhyLock-Dynamic as well, where the sequence is randomized in each iteration of the algorithm. Figure 9(a) illustrates that the expected execution latency with PhyLock-Dynamic is comparable to some of the selected sequences executed with PhyLock-Static. The execution schedule with PhyLock-Dynamic is extracted from ten different simulation runs, as indicated by groups of similarly styled lines in the figure. However, comparing with PhyLock-Static in Figure 9(b) reconfirms the higher communication overhead imposed by PhyLock-Dynamic, as their difference in measured latency increases multifold.

8. CASE STUDY: PERSONALIZED LIGHT-CONTROL APPLICATION

This section compares the Light-Static, Light-Dynamic, and Light-Token implementations of the personalized light-control application. It extends the analysis from the previous section to incorporate the impact of sequences on the rate of convergence of the incremental subgradient method, which ultimately determines the iteration count μ. In the process, it also evaluates the impact of PhyLock-Token on the performance and cost of the application. The setup described in Section 6.2 is reused in this section. The following results assume the worst-case situation when all the lights are initially off (I0 = (0, . . . , 0)T) and all N users enter the office simultaneously. However, later in Section 8.4, typical usage scenarios are simulated with fewer events to evaluate the impact of the iteration control mechanisms in Light-Static and Light-Dynamic.

8.1. Overall Impact of Execution Sequence

Figure 10(a) confirms that the rate of convergence of the incremental subgradient method is influenced by the choice of execution sequence [Nedic and Bertsekas 2001]. Subsequently, the exact number of iterations required by the optimization algorithm is determined by the value of the user tolerance bound B. For instance, in the following simulations, the application only requires μ < 20 iterations to achieve a light intensity at each sensor i that is within B = 1% of L∗i. Successively lower values of B increase the difference between iteration counts for the various sequences, thereby increasing the importance of μ in selecting the best configuration.

Consequently, the final selection of a sequence in Light-Static or Light-Token must account for its impact on the rate of convergence as well as the mean time required to complete each iteration (time per iteration, TPI).


Fig. 10. (a) Impact of sequences on convergence of the subgradient method; and (b) performance of the application in a network of N = 49 sensors.

Fig. 11. Simulation results for deployments with N = 49 sensors.

In Light-Static, TPI is determined by 〈S, S − Ov〉 and δ. The table in Figure 10(b) demonstrates that neglecting either one of the factors in Light-Static will result in a suboptimal selection of sequence that may increase application latency by as much as 87%. For instance, in deployment D1 with N = 49 sensors, selecting the sequence Ring or Nodeid according to iteration count alone can increase the total time to complete μ iterations by 28% over the best performing sequence, labeled Seq1. On the contrary, in deployment D2.1 with the same number of sensors, selecting the GColor sequence purely on TPI will decrease performance by 87% relative to Ring.

Given a constant number of sensors and actuators in the deployments, the TPI in Light-Token is determined by the average number of hops that messages (tokens) need in order to travel between successive nodes in the ring. Our results demonstrate that, for the range of iteration counts required in the light-control application, selection of the best sequence for Light-Token is more strongly influenced by the network topology (Gcomm) than by the rate of convergence of the algorithm discussed previously. For instance, given a constant network topology, Figure 11(a) illustrates that the performance of Light-Token is the best with sequence Ring, independent of N and M, due to its lowest hop count (and TPI).

8.2. Impact of Coordination Mechanisms on Application Performance

Results presented in this section demonstrate that Light-Static and Light-Dynamic exhibit lower reaction times than Light-Token when the ratio of network size (N) to clique size (|ξ|) is relatively high. Simulations confirm that a higher number of sensors/controllers (N), in combination with a high number of actuators (M), degrades the performance of Light-Token, as discussed in Section 2. Figure 11(b) summarizes these results by listing the two best implementations of the application, along with the respective sequences, for different combinations of deployment and network factors.


Fig. 12. Performance comparison between Light-Static and Light-Token in execution with sequence Ring for a network of N = 49 sensors using different network services.

It confirms that no single configuration has an advantage over the others in all simulated scenarios.

8.2.1. Light-Static vs. Light-Token. Figure 12(a) demonstrates that execution with sequence Ring in Light-Static reduces application latency by 4.9x over Light-Token in deployment D1, despite the absence of parallelism in its execution schedule. This is because the nodes in Light-Static need fewer link packets to exchange messages that contain values of at most four control variables (|β|max = 4 in D1, cf. Figure 5(b)), as opposed to Light-Token, where messages contain values of M = 84 variables. It highlights that the Shared Memory system in Hotline can improve the performance of the application independent of parallel execution with PhyLock.
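A back-of-the-envelope calculation illustrates why the smaller messages matter. The text gives only plink = 96 bytes (Section 6.2) and the variable counts above; the per-variable encoding size used below is an assumption, so the exact packet counts are illustrative rather than measured.

    import math

    # Rough packetization sketch (BYTES_PER_VAR is an assumed encoding size).
    PAYLOAD = 96          # usable bytes per link packet (Section 6.2)
    BYTES_PER_VAR = 8     # assumed: key + value encoding per control variable

    def link_packets(num_vars):
        return math.ceil(num_vars * BYTES_PER_VAR / PAYLOAD)

    print(link_packets(84))  # Light-Token message in D1: all M = 84 variables -> several packets
    print(link_packets(4))   # Light-Static message in D1: at most |beta|max = 4 variables -> one packet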

This difference in performance between Light-Static and Light-Token with Ring is reduced to only 2x in deployment D2.1 (Figure 12(a)) due to a decrease in the number of actuators and an increase in the communication overhead of PhyLock-Static at higher clique sizes. For the same reason, Light-Static demonstrates 1.18x higher latency than Light-Token in D2.2 with the sequence Ring. In addition, results in Balani [2011] show that there is only marginal or no performance improvement using other sequences with Light-Static in deployment D2.2, compared to execution with Ring, despite the higher parallelism in the execution schedules of the former. This is attributed to the parallel slowdown explained in Section 7. However, in deployments with a higher number of sensors (such as N = 100) but the same range of actuator influences, Light-Static again proves to be superior, as it scales with the size of cliques [Balani 2011] rather than with N.

The communication range of nodes in deployments D2.x is decreased in the next set of simulations to force coordination over multiple hops in Light-Static. Accordingly, broadcast is disabled and MHOP is used. Results show that the impact of unicast multihop communication is the highest on Light-Dynamic, followed by Light-Static and Light-Token, in that order, determined by the communication overheads of the respective protocols. Therefore, with sequence Ring, Light-Static performs 5.4x worse than Light-Token in deployment D2.1 with MHOP (Figure 12(b)), whereas it was clearly superior in the same deployment with BCAST (Figure 12(a)). However, due to low clique sizes and high numbers of actuators in D1, Light-Static outperforms Light-Token by a factor of 1.9. As PhyLock-Static relies strongly on group-based communication, we believe that the introduction of efficient multicast protocols will also improve the performance of Light-Static.

8.2.2. Light-Dynamic. Given the low value of the required iteration counts, no significant difference between executions with Light-Static and Light-Dynamic is observed in terms of μ, contrary to prior research. But given the higher overhead of PhyLock-Dynamic, the corresponding implementations always perform worse than Light-Static.


Fig. 13. Cumulative message count per node in the Light-Static and Light-Dynamic implementations (N = 49) when the respective iteration control mechanisms are enabled.

In comparison with Light-Token, Light-Dynamic performs 2.4x better in deployment D1 with low clique sizes and high actuator counts. However, in other deployments, Light-Token is superior due to the sharp increase in the communication overhead of PhyLock-Dynamic at higher clique sizes. From this analysis, it can be concluded that, given a constant range of actuator influence, where Light-Static outperforms Light-Token at certain values of N and M, Light-Dynamic will perform better than Light-Token only in networks with successively larger sizes.

8.3. Impact of Coordination Mechanisms on Application Cost

Ignoring the overhead of routing and of ring formation and maintenance at runtime, the results demonstrate that Light-Token incurs at least 2x lower message cost than the alternative implementations, while the message cost of Light-Dynamic is at least 5x higher than the rest. However, Light-Token requires a higher number of link packets to transport its messages at each hop. Consequently, in deployment D1 with the lowest clique size, Light-Static uses less than 42% of the link packets used by Light-Token, whereas in other deployments the difference in communication cost is reduced. Detailed results are presented in Balani [2011].

8.4. Iteration Control: Impact on Cost and Reaction Time

In the next set of simulations, the iteration control mechanisms are enabled in both Light-Static and Light-Dynamic to evaluate their impact on the communication cost and reaction time of applications, given that the distribution of events varies across applications, space, and time. Individual controllers in Light-Static and Light-Dynamic call the stop() function provided by the IterationControl interface when they detect that their local sensor measurements have stabilized (variance of mean sensor readings ≤ 2) over a period of time, which is configured as 20 seconds for the following results.
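The local stability test that triggers stop() can be sketched as follows; the class name, the windowing details, and the exact statistics are assumptions beyond what the text states (variance of mean readings ≤ 2 over a 20-second period).

    from collections import deque

    # Sketch of a local stability detector (assumed structure, not Hotline's code).
    class StabilityDetector:
        def __init__(self, window_sec=20.0, var_threshold=2.0):
            self.window_sec = window_sec
            self.var_threshold = var_threshold
            self.samples = deque()   # (timestamp, mean sensor reading)

        def add(self, timestamp, mean_reading):
            self.samples.append((timestamp, mean_reading))
            # Drop samples that fall outside the observation window.
            while self.samples and timestamp - self.samples[0][0] > self.window_sec:
                self.samples.popleft()

        def stable(self):
            if len(self.samples) < 2:
                return False
            values = [v for _, v in self.samples]
            mean = sum(values) / len(values)
            variance = sum((v - mean) ** 2 for v in values) / len(values)
            return variance <= self.var_threshold

    # Usage: when stable() first returns True, the controller would call
    # IterationControl.stop() to cease (or slow) its iterations.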

Figure 13(a) plots the cumulative number of messages exchanged by a controller over time in the Light-Static and Light-Dynamic implementations. It confirms that the nodes in Light-Dynamic stop their iterations after achieving stability (marked by a constant cumulative count). On the contrary, nodes in Light-Static slow down their iterations successively until a maximum inter-iteration period of 16 seconds is reached. Although Light-Dynamic starts with a higher communication cost, Light-Static eventually surpasses it at a break-even point during the stable phase of the system. Therefore, it can be concluded that in a deployment where consecutive events are separated by a larger interval, PhyLock-Dynamic exhibits lower message cost over the lifetime of the application.


Fig. 14. Evolution of the cost function in Light-Static and Light-Dynamic after different numbers of user-exit events are introduced. Each curve represents a single simulation run.

However, Figure 13(a) also demonstrates that with increasing clique sizes, the break-even points shift to the right, apparently reducing the utility of PhyLock-Dynamic. A similar effect is also observed in Figure 13(b) when the maximum inter-iteration period is increased to 32 and 64 seconds, respectively.

However, Figure 14 demonstrates that a higher maximum inter-iteration period in Light-Static causes additional delay in the response to new events that are introduced in the post-stability phase. The events, marked by a sharp rise in the value of the cost function (Y-axis), represent the exit of randomly selected users, where the respective light intensities L∗i are changed to the ambient value φi to conserve electrical energy. The delayed response in Light-Static is partly due to the lack of an explicit event-signalling protocol that resets the inter-iteration period at non-event-detecting nodes. In the preceding simulations, this signal is propagated through the changes detected by the local sensors of other nodes when the event-detecting nodes modify actuator control inputs. Although this approach works well for PhyLock-Dynamic, as shown by its quick reaction to the events, it incurs a wake-up delay in PhyLock-Static, depending on the state of iterations at all the nodes in the network. Subsequently, Figure 14(a) clearly illustrates that the delay is longer in Light-Static when only a small number of events (compared to the network size) are introduced in the environment. On the contrary, a larger number of events triggers faster iterations at many nodes in the network, resulting in a faster reaction by the application, as shown in Figure 14(b).

In addition, it is observed that the mean error between desired and measured light intensities at the sensors increases with higher numbers of events. Specifically, it rises from around 2% when only one event is introduced to close to 42% when 25 events are introduced in the simulations. However, the absolute values of the light intensities and cost functions at the end of the simulations differ, because each curve in the figures represents a unique execution in which the spatial distribution of events is selected at random. The low error with a small number of events indicates that only a local subset of actuator control inputs needs to be modified such that the light intensities at the other sensors are unperturbed. This supports the case for selective and efficient execution with iteration control in the PhyLock-Dynamic protocol. In any case, the error between desired and actual intensities in the post-event phase can be attributed to the system implementation, which assigns equal weights to the opposing objectives of meeting user-specified values where users are still present and saving energy where they are not. An appropriate selection of weights could trivially alleviate this issue.

9. RELATED WORK

Numerous programming frameworks have been proposed that aim to support data collection, sharing, and aggregation in WSANs through a variety of language abstractions, explicit APIs, and runtime mechanisms, which greatly reduce the effort involved in application development.


They can be classified into macroprogramming frameworks and distributed programming libraries. In general-purpose macroprogramming frameworks, such as those of Hnat et al. [2008], application developers use the supported language abstractions to write programs for the entire network, assuming centralized access to variables at all the nodes in the network. The macroprogram compiler and runtime cooperatively split the program into node-level code and execute it in a centralized or distributed fashion, whichever is more efficient for the target deployment. On the contrary, distributed programming libraries (such as [Whitehouse et al. 2004; Welsh and Mainland 2004]) provide an explicit interface for access to shared variables that represent raw sensor data or fused information at neighboring nodes.

While these data sharing services are absolutely necessary to support distributed optimization and control in WSANs, they are not sufficient, as they do not (need to) support synchronized access to shared variables in the presence of multiple writers. TeenyLime [Costa et al. 2007] recognizes the insufficient support for actuation but expects users to implement their own mutex protocols for coherent access. Hotline extends prior distributed-programming systems with support for localized lock-arbitration and process-synchronization mechanisms. Subsequently, the added primitives could be utilized by macroprogramming frameworks by generating node-level code that links against the Hotline runtime library for distributed execution. Moreover, none of the proposed systems, to the best of our knowledge, exploit domain-specific characteristics, such as spatial locality, to improve application performance. As a result, implementations of iterative optimization algorithms in Pleiades [Kothari et al. 2007] incur the overhead of deadlock detection and resolution, as discussed before.

Besides the general-purpose macroprogramming systems discussed, other frameworks have been proposed that are customized for respective classes of applications. For instance, database-like systems, such as TinyDB [Madden et al. 2003], are more suitable for data-collection applications, while other systems, such as Regiment [Newton et al. 2007] and XStream [Girod et al. 2008], allow users to define a static set of long-running operations over streams of sensor data. Both classes of systems are therefore more appropriate for centralized control of actuators rather than the distributed control targeted in Hotline.

The design of Hotline is similar to conventional DSM architectures, like Reflective Memory (RM) and Distributed Transactional Memory (DTM) systems. However, its underlying memory coherency and synchronization mechanisms are carefully designed to support the multiple-reader-multiple-writer model in large-scale sensor and actuator networks. For instance, while the Hotline API could in principle be replaced with a lockless DTM interface, the latter lacks support for priorities that define desired execution sequences. Without these priorities, the transaction abort and reexecution policies in DTM systems cannot enforce the correct order of writes to actuator control variables. Moreover, typical conflict resolution mechanisms in these systems require message exchanges between all the nodes, which cannot scale to a large network, or necessitate a centralized directory service for detecting conflicting nodes and reducing message costs. Hotline provides and maintains an equivalent of a scalable distributed directory service through its built-in conflict discovery mechanisms.

The Shared Variable Repositories (SVRs) in the Hotline runtime implement a shared memory architecture that is similar to the Reflective Memory (RM) systems popular in parallel computing [Jovanovic and Milutinovic 1999]. While the granularity of sharing in RM is typically a page, Hotline shares tuples [Gelernter 1985] associated with individual access permissions. Hotline's unique arbitration of locks on coarse sets of these tuples, based on their fine-grained access permissions, eliminates false sharing, which is common in conventional DSM systems [Keleher et al. 1992] that share at the granularity of a memory page.


10. CONCLUSION

This article describes the design and implementation of a distributed programming framework for WSANs that form an integral part of networked cyber-physical systems. It not only promotes widespread adoption of advanced optimization techniques but also provides for their rapid, error-free, and efficient implementation with minimal programming effort. The extensive evaluation of the light-control application, along with the proposed coordination mechanisms, highlights the gap between theoretical results published in prior literature and actual performance in realistic scenarios with varying network properties, spatiotemporal distribution of events, and deployment characteristics. Consequently, applications implemented in this framework neither require overprovisioned nodes nor incur the expensive side-effects caused by delayed or inaccurate reaction to events in the environment. The resulting deployments cost less, enabling many new applications that were previously cost-constrained by their scale.

REFERENCES

R. Balani. 2009. Hotline App. Programming Framework. http://nesl.ee.ucla.edu/projects/hotline.
R. Balani. 2011. Distributed programming framework for fast iterative optimization in networked cyber-physical systems. Ph.D. dissertation. University of California at Los Angeles.
R. Balani, N. H. Chehade, S. Chakraborty, and M. B. Srivastava. 2011a. Distributed coordination for fast iterative optimization in wireless sensor/actuator networks. In Proceedings of the SECON.
R. Balani, K. Lin, L. Wanner, J. Friedman, R. K. Gupta, and M. B. Srivastava. 2011b. Programming support for distributed optimization and control in cyber-physical systems. In Proceedings of the ACM/IEEE 2nd International Conference on Cyber-Physical Systems.
P. Costa, L. Mottola, A. Murphy, and G. Picco. 2007. Programming wireless sensor networks with the TeenyLime middleware. In Middleware'07. Lecture Notes in Computer Science, vol. 4834, Springer, Berlin, 429–449.
W. F. Fung, D. Sun, and J. Gehrke. 2002. Cougar: The network is the database. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 621.
D. Gelernter. 1985. Generative communication in Linda. ACM Trans. Program. Lang. Syst. 7, 1, 80–112.
L. Girod, Y. Mei, R. Newton, S. Rost, A. Thiagarajan, H. Balakrishnan, and S. Madden. 2008. XStream: A signal-oriented data stream management system. In Proceedings of the International Conference on Data Engineering.
T. W. Hnat, T. I. Sookoor, P. Hooimeijer, W. Weimer, and K. Whitehouse. 2008. MacroLab: A vector-based macroprogramming framework for cyber-physical systems. In Proceedings of the Sensys.
B. Johansson, T. Keviczky, M. Johansson, and K. H. Johansson. 2008. Subgradient methods and consensus algorithms for solving convex optimization problems. In Proceedings of the IEEE Conference on Decision and Control.
M. Jovanovic and V. Milutinovic. 1999. An overview of reflective memory systems. IEEE Concurrency 7, 2, 56–64.
A. Kansal, W. Kaiser, G. Pottie, M. Srivastava, and G. Sukhatme. 2006. Virtual high resolution for sensor networks. In Proceedings of the Sensys.
P. Keleher, A. L. Cox, and W. Zwaenepoel. 1992. Lazy release consistency for software distributed shared memory. In Proceedings of the 19th International Symposium on Computer Architecture (ISCA). ACM.
N. Kothari, R. Gummadi, T. Millstein, and R. Govindan. 2007. Reliable and efficient programming abstractions for wireless sensor networks. In Proceedings of the ACM SIGPLAN PLDI.
Leslie Lamport. 1978. Time, clocks and ordering of events in a distributed system. Comm. ACM 21, 7, 558–565.
P. Levis, N. Lee, M. Welsh, and D. Culler. 2003. TOSSIM: Accurate and scalable simulation of entire TinyOS applications. In Proceedings of the Sensys.
Kaisen Lin and Philip Levis. 2008. Data discovery and dissemination with DIP. In Proceedings of the IPSN.
S. Madden, M. J. Franklin, J. M. Hellerstein, and W. Hong. 2003. The design of an acquisitional query processor for sensor networks. In Proceedings of the ACM SIGMOD. 491–502.
M. Maekawa. 1985. An algorithm for mutual exclusion in decentralized systems. ACM Trans. Comput. Syst. 3, 2.
L. Montestruque, M. Lemmon, and LLC EmNet. 2008. CSOnet: A metropolitan scale wireless sensor-actuator network. In Proceedings of the International Workshop on Mobile Device and Urban Sensing (MODUS).
A. Nedic and D. P. Bertsekas. 2001. Incremental subgradient methods for nondifferentiable optimization. SIAM J. Optim. 12, 1.
R. Newton, G. Morrisett, and M. Welsh. 2007. The Regiment macroprogramming system. In Proceedings of the ACM/IEEE IPSN.
K. K. Parhi and D. G. Messerschmitt. 1991. Static rate-optimal scheduling of iterative data-flow programs via optimum unfolding. IEEE Trans. Comput. 40, 2, 178–195.
Y. Park, J. S. Shamma, and T. C. Harmon. 2009. A receding horizon control algorithm for adaptive management of soil moisture and chemical levels during irrigation. Env. Model. Softw. 24, 9.
D. Peleg. 2000. Distributed Computing: A Locality-Sensitive Approach. Society for Industrial Mathematics.
M. Rabbat and R. Nowak. 2004. Distributed optimization in sensor networks. In Proceedings of the IPSN.
G. Ricart and A. K. Agrawala. 1981. An optimal algorithm for mutual exclusion in computer networks. Comm. ACM 24, 1, 9–17.
V. Singhvi, A. Krause, C. Guestrin, J. H. Garrett Jr., and H. S. Matthews. 2005. Intelligent light control using sensor networks. In Proceedings of the ACM Sensys.
M. Welsh and G. Mainland. 2004. Programming sensor networks using abstract regions. In Proceedings of the 1st Symposium on Networked Systems Design and Implementation (NSDI).
K. Whitehouse, C. Sharp, E. Brewer, and D. Culler. 2004. Hood: A neighborhood abstraction for sensor networks. In Proceedings of the MobiSys. 99–110.

Received September 2011; revised October 2012; accepted January 2013
