
SAFETY CRITICAL EMBEDDED SYSTEMS

27 February 2008

Designing and debugging real-time distributed systems

Real-time system designers and embedded software developers are very familiar with the tools and techniques for designing, developing and debugging standalone or loosely coupled embedded systems. UML may be used at the design stage, an IDE during development and debuggers and logic analysers (amongst other tools) at the integration and debug phases. However, as connectivity becomes the norm, systems are becoming ever more distributed; what used to be a few nodes connected together with clear functional separation between the applications on each node is now becoming tens or hundreds of nodes with logical applications spread across them. Worse, such distributed systems are becoming increasingly heterogeneous in terms of both operating systems and executing processors.

The idea of a “platform” for development has long pervaded the real-time embedded design space as a means to define the application development environment separately from the underlying (and often very complex) real-time hardware, protocol stacks and device drivers. Much as the operating system (OS) evolved to provide the fundamental building block of standalone system development platforms, real-time middleware has evolved to address the distributed systems development challenges of real-time network performance, scalability and heterogeneous processor and operating system support. And as has already happened in the evolution of the standard real-time operating system, new tools are becoming available to support development, debug and maintenance of the target environment, in this case real-time applications in large distributed systems. From the individual application developer perspective, if the network is now the target system, what are the basic capabilities which must be provided by an application development platform? These are: communication between threads of execution, synchronisation of events, and controlled latency and efficient use of the network resources.

Communication and synchronisation are fairly obvious distributed platform service requirements and are analogous to the services provided by an OS, only now they have to run transparently across a network infrastructure of heterogeneous operating systems and processors, with all that implies in terms of byte ordering and data representation formats. The platform should ideally use a mechanism that does not require the developer to have an explicit understanding of the location of the intended receiver of a message or synchronising thread, so that the network can be treated as a single target system from an application development perspective. There are several middleware solutions which support this approach, such as JMS and the Data Distribution Service (DDS) from the Object Management Group (OMG). But only solutions such as DDS explicitly address the third point: controlled latency and efficient use of (target) network resources, which is a critical issue in real-time applications. DDS provides messaging and synchronisation similar to JMS, but additionally incorporates a mechanism called Quality of Service (QoS). QoS brings to the application level the means to explicitly define the level of service (priority, performance, reliability etc) required between the originator of a message or synchronisation request and the recipient.
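The byte-ordering concern mentioned above is easy to underestimate, so a small illustration may help. The sketch below uses only Python's standard `struct` module; the two-field message (a 16-bit sensor id and a 32-bit reading) is invented purely for illustration and is not taken from any middleware.

```python
import struct

# Hypothetical message: a 16-bit sensor id followed by a 32-bit reading.
sensor_id, reading = 7, 1025

# The same two fields as laid out by a big-endian and a little-endian node.
big_endian = struct.pack(">HI", sensor_id, reading)
little_endian = struct.pack("<HI", sensor_id, reading)

# A receiver that assumes the wrong byte order gets no error, just wrong values.
misread_id, misread_reading = struct.unpack("<HI", big_endian)

# A receiver that knows the sender's byte order recovers the original values.
decoded = struct.unpack(">HI", big_endian)

print(big_endian.hex(), little_endian.hex())
print(misread_id, misread_reading)
print(decoded)
```

The misread case is the troublesome one: nothing crashes, the values are simply wrong. Handling this conversion inside the platform is what allows application code to ignore which processor sits at the other end.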

DDS treats the target network somewhat like a state machine, recognizing that real-time systems are data driven and it is the arrival, movement, transition and consumption of data that fundamentally defines the operation of a real-time system. Some data is critical and needs to be obtained and processed within controlled/fixed latencies, most especially across the network. Moreover, some data needs to be persisted for defined periods of time so it can be used in computation; other data may need to be reliably delivered but is less time-critical. QoS facilitates all these requirements and more.

By Geoff Revill, RTI

This article identifies the issues of real-time distributed system development and discusses how development platforms and tools have to evolve to address this challenging new environment.

One of the easiest ways to make the networked target environment more debuggable is to define strong interfaces between modules that are independently testable. What good middleware does is allow you to completely specify the data interaction through quality of service, which forms a “contract” for the application. DDS, for example, allows a data source to specify not only the data type, but also whether the data is sent with a “send once” or “retry until” semantic, how big a history to store for late-arriving receivers, the priority of this source as compared to others, and the minimum rate at which the data will be sent, as well as many more possibilities. By setting these explicitly, many of the soft issues that creep up in integration can be addressed quickly by matching the promised behavior to the requested one. DDS middleware will even provide warnings at runtime when contracts are not met.
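To make the idea of matching promised behavior to requested behavior concrete, here is a minimal sketch of such a check. It is a simplified model written for this article, not the DDS API: the `Qos` class, its three fields and the `mismatches` function are all hypothetical names, and only three policies are modelled. The comparison rules follow the general principle that a source must offer at least what a receiver requests.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List


class Reliability(IntEnum):
    BEST_EFFORT = 0  # "send once"
    RELIABLE = 1     # "retry until" delivered


class Ownership(IntEnum):
    SHARED = 0
    EXCLUSIVE = 1


@dataclass(frozen=True)
class Qos:
    """Hypothetical, heavily reduced QoS contract for illustration only."""
    reliability: Reliability
    ownership: Ownership
    deadline_s: float  # source: longest gap it promises; receiver: longest gap it accepts


def mismatches(offered: Qos, requested: Qos) -> List[str]:
    """Return the names of the policies where the offer falls short of the request."""
    problems = []
    if offered.reliability < requested.reliability:
        problems.append("reliability")
    if offered.ownership != requested.ownership:
        problems.append("ownership")
    if offered.deadline_s > requested.deadline_s:
        problems.append("deadline")
    return problems


writer = Qos(Reliability.BEST_EFFORT, Ownership.SHARED, deadline_s=0.1)
reader = Qos(Reliability.RELIABLE, Ownership.EXCLUSIVE, deadline_s=0.05)
print(mismatches(writer, reader))
```

Running a check of this kind at design time, rather than discovering the mismatch during integration, is the point of treating QoS as a contract.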

We do not have a complete application development platform until we have the tools to support the environment. Ask any support or maintenance engineer and they will tell you that they need three things: good documentation, great tools and code written to expose the state and event parameters as easily as possible. Current toolchains that operate on a single node can still be used as normal, and in effect can be used for white box testing in this isolated environment. In addition, the data-driven (state machine) nature of the DDS distributed application development platform means that white box testing can be easily achieved. The real challenge for developers is isolation, identification and correction of the problems that are exhibited at the integration stage, when individual distributed sub-components are connected and the network starts, for the first time, to execute as the target environment.

[Figure: RTI Analyser is a system level debugging tool]

Most engineers are familiar with debugging within a single-board environment, and will have developed a high degree of debug competence in fixing hard faults, i.e. faults that halt or crash the process. These are relatively easy to debug because you can normally work backwards from the state of the crash or, if you are really lucky, you can get it to crash in a debugger and you are home free. The nastiest hard faults to deal with are normally multi-threading related, so it comes as no surprise that as we move to larger, more complex distributed systems you will see more and more of these types of faults; every node will have its own thread(s) of execution, potentially working at the same time on the same data received from across the distributed system architecture.

[Figure: RTI Analyser showing the QoS mismatch error in “ownership” between a DataReader and DataWriter]

Distributed systems are much more likely to be subject to numerous types of soft faults. In these cases, no application crashes, but the warning lights are flashing and the distributed application either performs poorly or not at all. There are numerous types of soft faults, but many of them come down to the synchronisation of data generation and processing across many machines. One example is the effect of a single dropped message; if that message is one sample of an update of data it might not be a big deal, but if it is a transitional event or command, you could suddenly have the system in an unexpected state. Moreover, you may not be able to detect this until some time after the initial fault occurred, leading to a debugging nightmare.
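One common way to notice a dropped message when it happens, rather than some time later, is to have the sender number its messages and let the receiver look for gaps. The sketch below assumes exactly that: each message carries a sequence number that increases by one, which is an assumption of this example and not something the article specifies. The function name `find_gaps` is hypothetical.

```python
from typing import Iterable, List, Tuple


def find_gaps(seq_numbers: Iterable[int]) -> List[Tuple[int, int]]:
    """Return (first_missing, last_missing) ranges seen in a received stream.

    Duplicates and late repeats of already-seen numbers are ignored.
    """
    gaps = []
    expected = None  # next sequence number we expect to receive
    for seq in seq_numbers:
        if expected is not None and seq > expected:
            gaps.append((expected, seq - 1))
        if expected is None or seq >= expected:
            expected = seq + 1
    return gaps


# Messages 4, 7 and 8 never arrived.
print(find_gaps([1, 2, 3, 5, 6, 9]))
```

If the missing message was a command or state transition, the receiver at least knows at that moment that its view of the system may be stale.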

This is just one type of soft fault. Many others occur regularly: high latencies (either sustained or periodic) which cause control loops to lose stability, massive data dropouts, unexpectedly blocking applications, systems that work in the lab but fail when scaled up, data mismatches between what is provided and what is expected etc. Thus for distributed systems, it is vital to be able to get at the state and event information without stopping or significantly slowing the system.
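As a small illustration of telling sustained from periodic high latency using recorded measurements only, the sketch below counts how many samples exceed a latency budget and how long the longest unbroken run of violations is. The function name, the millisecond unit and the sample values are assumptions made for this example.

```python
from typing import Iterable, Tuple


def latency_violations(latencies_ms: Iterable[float], budget_ms: float) -> Tuple[int, int]:
    """Return (samples over budget, longest consecutive run over budget)."""
    total = longest = run = 0
    for latency in latencies_ms:
        if latency > budget_ms:
            total += 1
            run += 1
            longest = max(longest, run)
        else:
            run = 0
    return total, longest


# Hypothetical measurements: one isolated spike, then a run of three.
print(latency_violations([1, 2, 9, 1, 8, 9, 9, 2], budget_ms=5))
```

A high total with a short longest run points towards periodic spikes, while a long run suggests a sustained problem. Because the analysis works on logged values, the running system does not have to be halted to obtain it.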

Starting with the basics: the first thing that you need is a tool that allows you to generate common data types across all your boards and a process that keeps them in synchronisation. If you are using middleware you will normally write your data types in a meta-language (IDL, XML, XDR) and autogenerate the code that handles the data types. Some systems will allow you to create new types on the fly, but beware that this is potentially a source of error since it will be much harder to verify the usage contract on data if the programmer does not know its details.

The next tool you need allows you to design the applications and specify the data and QoS requirements. This tool should be used to design as many of the applications as possible so that the QoS contract between senders and receivers is met at design time (much easier than debugging and fixing it later). In an ideal world, this tool should integrate with your normal design methodology. For instance, UML users may wish to consider SparxUML. This tool has interface description components for middleware such as DDS to make it easier to initially set these up.

Once your applications are deployed you need to make sure that the communications are happening as intended, QoS parameters are meshed and the system is running. One of the first questions you will need to answer at integration is “are these distributed application functions talking properly?” With an appropriate middleware interrogation tool such as RTI Analyser you can determine that the middleware has hooked up the two applications and you can make sure that the designers of the two application functions actually met specification.

Such a tool also needs to show you which objects are talking, or more importantly, not talking, to each other and if not, suggest why not. You can truly appreciate these tools when you have 3 different subcontractors (or even just free-willed developers) each building part of a distributed application and it comes time to integrate. The root cause of most configuration issues can be found quickly, accurately and with a minimum of debate. You now have great up-front design, good interfaces that people are following and yet it still is not working. This is where distributed system-wide state and event analysis becomes key. Typically there are the following three use cases during the debugging:

Monitoring of overall distributed system health. In this case you might want to see the high-level behavior of most of the applications in the system. Tools such as RTView from SL Corporation allow you to build one or many control panel GUIs or data report views by listening to data put out by the middleware as well as your application. By selectively instrumenting key variables in your application this can be a great first step in isolating system issues and ensuring that your system is running properly. Because tools like RTView leverage the DDS middleware, the location of source information for the displays does not need to be known. Merely knowing that it exists and in what format it is available (as defined by your data meta-language) and how the data is made available (QoS) facilitates rapid assimilation of the information needed for such useful system overview displays. Typically the applications leveraging this sort of tool will have lots of different data sources, probably at low time resolution, that need to be combined and displayed together to create a meaningful perspective of the system’s health. Tools like these are often deployed as part of the maintenance environment for the distributed system and as such include easy-to-use GUI builders that allow end-user-oriented displays of system data and health to be generated.

[Figure: RTI protocol analyser allows you to see the on-wire traffic.]

[Figure: RTI Scope showing DDS topic data plotted against time with an oscilloscope-like display]

Getting into the guts of a faulty application. Once you have isolated which nodes are having a problem with the system health tool you may need to get more detailed and higher time resolution data from a few selected applications and their interaction across the network. Tools such as RTI Scope provide this functionality by allowing the user to look at the different data streams into and out of an application graphically, in real-time, without pre-configuration. Think of it as an oscilloscope for the data coming out of an application from anywhere in the network, complete with negative time triggering, multiple plot types (vs time, x vs y), derived signals and the ability to save the data for post processing. RTI Scope still operates at the defined data level, but is designed to capture fewer data sources, in a minimally intrusive manner. It is ideal for capturing data that runs out of bounds, or is delivered outside of its required throughput or performance objectives. Its full knowledge of the underlying middleware implementation means that it can discover the data sources and recipients and connect to them across the network, leveraging the middleware to pull the data through for local analysis and visualisation.
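Negative time triggering, as on an oscilloscope, means keeping the samples that arrived before the trigger condition fired. A minimal sketch of the idea follows, using a fixed-size ring buffer for the pre-trigger history. It illustrates the general technique only and says nothing about how RTI Scope is implemented; the function and parameter names are hypothetical.

```python
from collections import deque
from typing import Callable, Iterable, List, Optional


def capture_around_trigger(
    samples: Iterable[float],
    is_trigger: Callable[[float], bool],
    pre: int,
    post: int,
) -> Optional[List[float]]:
    """Return up to `pre` samples before the first trigger, the trigger
    sample itself, and up to `post` samples after it; None if never triggered."""
    history = deque(maxlen=pre)  # ring buffer holding the most recent samples
    stream = iter(samples)
    for sample in stream:
        if is_trigger(sample):
            captured = list(history) + [sample]
            # range() comes first so zip stops without consuming an extra sample.
            for _, later in zip(range(post), stream):
                captured.append(later)
            return captured
        history.append(sample)
    return None


# Trigger when a value runs out of bounds (above 10).
print(capture_around_trigger([0, 1, 2, 3, 50, 4, 5, 6], lambda v: v > 10, pre=2, post=2))
```

The pre-trigger samples are often the most valuable part of the capture, since they show what the data looked like as the fault developed.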

Network Analysis. Sometimes the middleware is attempting to perform the service requested of it by the application, but it is the underlying network implementation itself that is not delivering as expected. Perhaps the router is not routing properly, or there is an address corruption somewhere or any one of a number of other problems. At this point you are left with no choice but to drill down to the wire and see what is happening. You reach for your protocol analyser and it gives you all the UDP or other packet information you need. But it is meaningless unless you can correlate it back up to the application. Well constructed distributed middleware includes a standardised on-the-wire protocol; DDS for example uses the open standard RTPS (real-time publish subscribe), and as you would expect, such a platform includes the ability to monitor the wire traffic and pull out the associated middleware packets, dissecting them for correlation back to the application layer. RTI can help here too with a dedicated protocol analyser, capable of providing a real-time display of all “on-the-wire” activity.
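As a rough illustration of what dissecting middleware packets involves, the sketch below picks an RTPS message header out of a raw UDP payload. It assumes the 20-byte header layout of the RTPS wire protocol (the four ASCII bytes `RTPS`, a two-byte protocol version, a two-byte vendor id and a twelve-byte GUID prefix), which should be checked against the specification version in use. The sample payload is synthetic and the names are hypothetical; a real analyser would go on to decode the submessages that follow.

```python
import struct
from typing import NamedTuple, Optional, Tuple


class RtpsHeader(NamedTuple):
    version: Tuple[int, int]    # (major, minor)
    vendor_id: Tuple[int, int]  # two raw bytes identifying the implementation
    guid_prefix: bytes          # 12 bytes identifying the sending participant


def parse_rtps_header(payload: bytes) -> Optional[RtpsHeader]:
    """Return the header fields, or None if the payload is not an RTPS message."""
    if len(payload) < 20 or payload[:4] != b"RTPS":
        return None
    major, minor, vendor_hi, vendor_lo = struct.unpack_from("4B", payload, 4)
    return RtpsHeader((major, minor), (vendor_hi, vendor_lo), payload[8:20])


# Synthetic payload: the vendor id and GUID prefix here are made-up values.
sample = b"RTPS" + bytes([2, 1, 0xAB, 0xCD]) + bytes(range(12)) + b"\x00" * 4
header = parse_rtps_header(sample)
print(header)
```

The GUID prefix is what allows a packet seen on the wire to be tied back to a particular participant, and from there to the application that sent it.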
