


CONCURRENCY: PRACTICE AND EXPERIENCE
Concurrency: Pract. Exper., Vol. 11(5), 221–245 (1999)

Composing high-performance schedulers: a case study from real-time simulation

KAUSHIK GHOSH∗, RICHARD M. FUJIMOTO AND KARSTEN SCHWAN

College of Computing, Georgia Institute of Technology, Atlanta, GA 30332, USA

SUMMARY

Dynamic, high-performance or real-time applications require scheduling latencies and throughput not typically offered by current kernel or user-level threads schedulers. Moreover, it is widely accepted that it is important to be able to specialize scheduling policies for specific target applications and their execution environments. This paper presents one solution to the construction of such high-performance, application-specific thread schedulers. Specifically, scheduler implementations are composed from modular components, where individual scheduler modules may be specialized to underlying hardware characteristics or implement precisely the mechanisms and policies desired by application programs. The resulting user-level schedulers' implementations can provide resource guarantees by interaction with kernel-level facilities which provide means of resource reservation.

This paper demonstrates the concept of composable schedulers by construction of several compositions for highly dynamic target applications, where low scheduling latencies are critical to application performance. Claims about the importance and effectiveness of scheduler composition are validated experimentally on a shared-memory multiprocessor. Scheduler compositions are optimized to take advantage of different low-level hardware attributes and of knowledge about application requirements specific to certain applications, including a Time Warp-based real-time discrete event simulator. Experimental evaluations are based on synthetic workloads, on a real-time simulation blending simulated with implemented control system components, and on a dynamic robot control program. Measurements indicate that schedulers can be composed and specialized to offer performance similar to that of dedicated scheduling co-processors. Copyright 1999 John Wiley & Sons, Ltd.

1. INTRODUCTION

Current and future systems must operate in increasingly complex execution environments, where platform characteristics may change rapidly (e.g. changes in resource availability), application behavior may be highly variable (e.g. data-dependent changes in workloads in scientific or real-time applications), and where underlying hardware platforms may be subject to frequent change (e.g. different underlying execution engines in mobile systems). As a result, applications must be able to react quickly to such changes by dynamic creation, deletion and re-scheduling of relevant application tasks. Moreover, for real-time applications, the dynamic nature of their execution environment implies that such tasks have dynamically determined execution times, periods and inter-task dependencies. Therefore, we pose that the performance of future high-performance and real-time applications critically depends on the performance of their task schedulers. Toward this end, we show in this paper that significant improvements in scheduler performance can

∗Correspondence to: Dr. K. Ghosh, College of Computing, Georgia Institute of Technology, Atlanta, GA 30332, USA.

CCC 1040–3108/99/050221–25$17.50
Copyright 1999 John Wiley & Sons, Ltd.
Received 17 June 1997
Accepted 15 February 1998


be attained by implementing them such that we take advantage of underlying platform characteristics and use knowledge about specific application characteristics and needs.

The principal contributions of this paper are (i) the design and implementation of high-performance task schedulers, as well as (ii) their internal decomposition such that the resulting composable real-time schedulers can make scheduling decisions with latencies approximating those of proposed specialized scheduling co-processors. Two implementation principles are employed to attain such improvements in scheduler performance compared to previous work:

• Separation of concerns – past designs of real-time threads packages [1–3] perform schedulability analysis when threads are created, thereby combining the overheads of thread creation, scheduling and dispatching. In contrast, the real-time scheduler presented in this work separates thread management from scheduling and dispatching. This is attained by introduction of events as descriptors of schedulable execution intervals.

• Scheduler configuration – proposed real-time thread packages may permit their schedulers to be replaced in their entirety, or perhaps process alternative scheduling functions [4,5], thereby supporting alternative scheduling policies and scheduler implementations within a single framework. In contrast, this paper explores the attainment of increased performance by exchange of specific scheduler components such that application-level scheduler interfaces need not be affected. Toward this end, we precisely define the interfaces between those components. We hope to derive from this work insights into the suitable construction of future kernel-level or user-level real-time schedulers. Additional techniques for scheduler configuration described in the literature, such as value or time functions [6], may be employed as well, but that is not the subject of this paper.

In the remainder of this paper, we first discuss the reasons for separating thread creation from scheduling, followed by a statement of our approach toward scheduling/dispatching in real-time environments. For illustrative purposes, and to set up a 'control' scenario, we evaluate in detail the costs of real-time scheduling and thread dispatching in a traditional high-performance real-time thread package. We then show how the structure and interfaces of this traditional scheduling framework should be modified to permit scheduler configuration. Toward this end, we introduce and discuss the notion and implementation of scheduling 'events', where multiple such events may be associated with the execution of a single real-time thread. The lightweight event scheduler's interfaces to application programs and to the underlying thread and event dispatchers are defined such that the resulting 'scheduler' may be configured for different target hardware and application requirements. We illustrate this configurability by re-composition of the thread scheduler, and we show that such re-composition results in reduced scheduling latencies.

2. FUNCTIONAL COMPONENTS OF SCHEDULERS IN REAL-TIME ENVIRONMENTS

A real-time application imposes three disparate requirements on any underlying operating system 'kernel'. Its threads (or task) module is used to associate control with pieces of code in the application. The scheduler module reserves time for certain activities performed by those threads on the 'time-line' of the processor. The dispatcher module runs particular



Figure 1. The structure of our framework

activities on the processor in the interval reserved for it by the scheduler module, and essentially ensures temporal security. The interactions among these modules are shown in Figure 1.

The functional distinctions between the various parts of this framework should also be followed up with distinct data structures for threads, scheduler and dispatcher. In the remainder of this paper, we show how existing real-time systems combine these functionalities and structures, and thereby lose performance advantages that would otherwise become available.

2.1. Drawbacks of existing approaches

Extant real-time 'thread packages' combine the to-be-scheduled activities of threads with other (orthogonal) aspects of 'threaded behavior', with thread scheduling and dispatching. For instance, such packages have calls to fork a periodic or sporadic thread on a certain piece of code with appropriate timing characteristics, thereby combining the schedulability analyses performed for threads' activities with thread management and dispatching [2,3]. This might have been justified in somewhat predictable environments – where task periods are not frequently changed – but it cannot be justified for more dynamic environments, e.g. data-dependent events to be scheduled for execution in discrete event simulations or the variable number of iterations to be executed by threads in dynamic robotics applications (for more detail on these applications, see Section 4).

Instead, we submit that forking a thread should be seen as associating an execution context with a code fragment, whereas schedulability analysis must typically be performed for some execution of that code fragment based on certain ambient conditions, including its input data, processor load, etc. For instance, an application might consider such conditions before explicitly trying to determine whether the intended activity can be completed in time. Further, it should be possible for the application to determine the idle time on a processor at any time, not only when a new thread is being created.

In summary, the activities of associating control with code, that of determining whether the code can be executed to completion in time, and of determining when the code should start/end execution should be separated from each other. This is not the case for extant real-time thread packages. In what follows, we show the advantages that can accrue from a clean separation between the notion of threads and the scheduling of certain 'activities' by them.



Figure 2. Example code for events

2.2. Threads and events

Our thesis is that both the performance and the configurability of real-time threads and their schedulers is improved by separating the notion of an executable thread from the scheduling of its activities and by exposing this separation at the external and internal interfaces of high-performance or real-time schedulers. Towards this end, henceforth, we define a schedulable and discernible activity of a thread as an event, which is an abstraction that is already commonly used in simulations [7]. Therefore, a certain code segment is associated with a thread when it is forked, but events performed by that code segment are described and scheduled separately. This has some obvious performance implications, since generating a new event is less costly than forking a new thread, and since the schedulability analysis for events is easily separated from the threads package's (or operating system's) mechanisms for thread creation, scheduling and dispatching. Moreover, this approach is attractive from a configurability/extensibility standpoint, since it separates code concerned with thread creation and representation from the potentially less expensive mechanisms used for scheduling threads' activities. In addition, it permits applications to create threads when needed while employing the less expensive event scheduling mechanism when existing threads can be reused. Finally, the separation of threads from events permits the construction and simultaneous use of multiple implementations of event schedulers with a single underlying thread scheduler/dispatcher. We next present an illustrative example demonstrating the difference between threads and events.
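To make the thread/event distinction concrete, the following sketch shows events as lightweight descriptors that a single worker thread consumes in deadline order. This is our illustration, not the paper's code; names such as `event_post` and `event_next` are hypothetical, and a linear scan stands in for the sorted ED list. The point is that posting new work costs a struct fill, not a thread fork:

```c
#include <stddef.h>

#define MAX_EVENTS 64

typedef struct {
    double deadline;     /* absolute deadline          */
    double exec_time;    /* worst-case execution time  */
    int    type;         /* application-defined type   */
} event_t;

static event_t queue[MAX_EVENTS];
static int n_events = 0;

/* Cheap 'event generation': no thread fork, no new stack --
   just a descriptor of a schedulable activity.            */
int event_post(double deadline, double exec_time, int type)
{
    if (n_events == MAX_EVENTS) return -1;
    queue[n_events].deadline  = deadline;
    queue[n_events].exec_time = exec_time;
    queue[n_events].type      = type;
    return n_events++;
}

/* The single worker thread repeatedly takes the earliest-deadline
   event; a linear scan here stands in for the sorted ED list.   */
int event_next(void)
{
    int best = -1, i;
    for (i = 0; i < n_events; i++)
        if (best < 0 || queue[i].deadline < queue[best].deadline)
            best = i;
    return best;   /* index of next event, or -1 if none */
}
```

Many such events can be posted and consumed by one thread with one stack, which is precisely the reuse the paper advocates.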

2.3. An example

Consider the example shown in Figure 2. The system consists of two periodic tasks P1 and P2. The numbers outside the boxes are the times at which the instances of these periodics are 'released', while the numbers inside the boxes show the execution time of each instance (also called an event).

It may so happen that though the representation of the system (from a modeling point of view) demands two distinct 'tasks', there is no concurrency that can be exploited in the execution of these tasks, or the events of the tasks may have to run in a specified sequential order – as often happens in discrete-event simulation (causality) or database (serializability) applications. Further, the application might itself take care of variables



local to P1 and P2 (perhaps by maintaining them in distinct malloc-ed regions). Therefore, to reduce overheads, it is desirable to associate a single thread with all

such tasks and perform co-routining to switch between them. Due to dynamism in the environment, the time of the next occurrence and the execution time of a task might become fully known only at the end of the immediately previous occurrence (the task is therefore 'self-initiating' rather than a true periodic). It is apparent from this discussion that the requirements of the above activities may not match the functionality of forking a 'periodic thread' for each task, with a well-known period, execution time, etc. We illustrate this mismatch by representing these activities with 'classical' real-time threads.

When using a traditional real-time threads package, an implementor would associate a periodic real-time thread with each of P1 and P2, with the worst-case execution time of each of P1 and P2 as parameters. Each of these threads would have a distinct stack, and it would experience all overheads of thread management. However, such a threads package cannot trivially take into account the changing periods of P1 and P2. As a result, each instance of P1 and P2 would have to be represented as a different sporadic thread, thereby forcing implementors to create and manage a needlessly large number of threads. Modern notions of 'stack-less' threads (e.g. filaments [8]) might alleviate this situation, but we submit that the appropriate approach to addressing this problem is the use of 'events', as described below.

Using both threads and events, the application generates typed events to represent each of the processes P1 and P2, with specific deadlines and execution times associated with those events†, as shown in Figure 2. The thread module is used to associate control with the first of the events on P1 (since that has the earliest start time). Subsequently, the first event on P2 is invoked via a simple function call (instead of a thread context switch). Its timing characteristics are also known by this time, and the scheduler module is used to 'reserve' time for the new event. The scheduler returns a status stating whether the new event is schedulable, and the idle time left if it is not schedulable (which can be used by the application to submit a new event with a smaller execution time, if so indicated). Finally, the 'next event to be run' is available from the event scheduler's data structures, while the event dispatcher assures temporal security by generating and enforcing the start and end times for this event. Note that the event scheduler does not explicitly represent the start and end times for each event, as done for threads in traditional real-time schedulers [9]. This has implications for the configuration of thread schedulers, as shown in Section 3.1.
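The reserve-or-reject interaction described above can be sketched as a uniprocessor admission test in the spirit of this design. The names (`sched_reserve`, `sched_idle_time`) are our inventions, and we make the simplifying assumption that all accepted events are released 'now'; an event is admitted iff its execution time fits in the slack before its deadline and does not exhaust the laxity of any later-deadline event:

```c
#include <stddef.h>

#define MAX_EV 64
static double dl[MAX_EV], ex[MAX_EV];  /* accepted events, sorted by dl[] */
static int nev = 0;

/* Idle time available to a new event with deadline d at time 'now':
   the slack before d, capped by the laxity of every accepted event
   with a later deadline. May be negative.                          */
double sched_idle_time(double d, double now)
{
    double busy = 0.0, idle, later_busy;
    int i;
    for (i = 0; i < nev && dl[i] <= d; i++)
        busy += ex[i];
    idle = d - now - busy;            /* slack before d            */
    later_busy = busy;
    for (; i < nev; i++) {            /* later-deadline events     */
        double lax;
        later_busy += ex[i];
        lax = dl[i] - now - later_busy;
        if (lax < idle) idle = lax;   /* cap by their laxity       */
    }
    return idle;
}

/* Returns 1 and inserts the event if schedulable; else returns 0
   and reports the available idle time, so the application can
   resubmit a smaller event if it wishes.                          */
int sched_reserve(double d, double e, double now, double *idle_left)
{
    double idle = sched_idle_time(d, now);
    int i, j;
    if (e > idle) { *idle_left = idle; return 0; }
    for (i = 0; i < nev && dl[i] <= d; i++) ;
    for (j = nev; j > i; j--) { dl[j] = dl[j-1]; ex[j] = ex[j-1]; }
    dl[i] = d; ex[i] = e; nev++;
    return 1;
}
```

The idle-time feedback on rejection mirrors the scheduler status described in the text.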

3. A COMPOSABLE THREADS SCHEDULER

Having explained the differences between traditional real-time scheduling frameworks and ours, in this Section we discuss the re-organization needed in the scheduler in greater detail.

The threads and event schedulers indicated above are assumed replicated across all nodes of the parallel machine, as typically done in NUMA machines [9,10]. Therefore, concerning threads and event scheduling, each node (processor) is treated as a uniprocessor engine. The experimental implementation of event and threads schedulers/dispatchers evaluated in this paper does not support multiprocessor scheduling, in part due to the cost of accessing remote state and/or task migration in NUMA and CC-COMA machines such as the KSR, and in part due to the logical process and event model offered by the underlying Time

†As is customary in real-time software, we assume that the deadlines and (worst-case) execution times of events are known when the events arrive.



Warp discrete event simulation kernel employed for the execution of real-time simulations constructed as part of this research [11,12].

The specific event scheduler evaluated in this paper uses a variation of the Earliest Deadline (ED) algorithm [13,14] for scheduling events [12,15]. Once accepted, an event is run by the low-level event dispatcher, which simply executes the next available event with the lowest deadline among all those accepted. Applications demand that schedulability analyses be executed with low latency, especially when newly arriving real-time tasks have small laxities‡.

In order to demonstrate the advantages accruing from the proposed scheduler configuration, we first set up a 'control' scenario: we provide experimental evaluations of traditional real-time scheduling latency on the KSR multiprocessor. This latency accrues from the following sources:

• search time: finding the proper place for insertion of a new event, which involves searching the ED scheduler's sorted list

• idle-time calculation: calculating the amount of idle time available to run a newly arriving event

• updating the ED list: inserting the event entry in the sorted list and updating the entries corresponding to events with higher deadlines to show that the idle time on the processor has decreased.

We will return to the costs associated with these items throughout the remainder of thepaper.

A related cost is due to garbage collection of 'old' parts of the scheduler data structures. Scheduling analysis will not involve any event with a deadline earlier than the 'current time'. Garbage-collecting the list is an overhead, but it reduces the list length, which in turn speeds up the search, the idle-time determination, and the updates mentioned earlier with respect to the costs of scheduling. In [16], we derive a relationship between the optimal length of the ED list, taking into account garbage collection and scheduling overhead, thereby obtaining a reconfigurable real-time garbage collection scheme.
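A minimal sketch of such garbage collection, under the assumption of a deadline-sorted slot list (the function name `edlist_gc` and the array layout are ours, not the authors'):

```c
#include <stddef.h>

#define MAX_SLOTS 64
static double slot_deadline[MAX_SLOTS];  /* sorted ED list */
static int n_slots = 0;

/* Drop every slot whose deadline lies before the current time:
   such slots can no longer affect schedulability analysis.
   Returns the number of slots reclaimed.                      */
int edlist_gc(double now)
{
    int dead = 0, i;
    while (dead < n_slots && slot_deadline[dead] < now)
        dead++;                             /* expired prefix  */
    for (i = dead; i < n_slots; i++)        /* compact the list */
        slot_deadline[i - dead] = slot_deadline[i];
    n_slots -= dead;
    return dead;
}
```

The trade-off in the text is visible here: the compaction pass is pure overhead, but every subsequent search and update runs over a shorter list.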

3.1. Chief costs in a traditional scheduler

Real-time operating systems typically use a single task list for both schedulability analysis and task dispatching. An example is shown in Figure 3.

In this example, there are five activities (tasks, in 'classical' real-time schedulers) in the system initially, with deadlines and execution times as shown. The ED schedule for the tasks is built at real time 0. This schedule, showing in particular the times at which each task should start execution and when it should complete execution, appears in the same Figure. Information about start and end times would be used by the low-level task dispatcher to hand off the processor to the thread of the appropriate task.

At real time 1, a new activity arrives in the system. It has the lowest deadline of all tasks, and it should therefore be run before any other task. Scheduling this task involves traversing the part of the ED list corresponding to tasks with deadlines higher than the new

‡We use the term 'laxity' as follows: the 'laxity' of a task T denotes the urgency of T: how long the processor can execute other tasks (or remain idle) before T misses its deadline. If the deadline of a task T is d, the remaining execution time for accepted tasks with deadline less than or equal to d on that processor is e, and the current real time is t, we say that the laxity of task T at real time t is d − (e + t).
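The footnoted definition translates directly into code; for instance:

```c
/* Laxity of task T at real time t, per the definition above:
   d = deadline of T,
   e = remaining execution time of accepted tasks with deadline <= d,
   t = current real time.
   A negative result means T can no longer meet its deadline. */
double laxity(double d, double e, double t)
{
    return d - (e + t);
}
```

For example, a task with deadline 10, facing 4 units of earlier-or-equal-deadline work at time 2, has laxity 10 − (4 + 2) = 4.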



Figure 3. Example showing original ED slot list

task. Now, for each of these 'later' slots in the ED list, we have to increase the start and end time by an amount equal to the execution time of the new task. If this results in the end time of any of these slots becoming greater than the deadline on the slot, then the new task cannot be accepted. In this case, the start and end times of all such 'affected' slots will have to be reverted to their original values. However, if the task is deemed acceptable after such a traversal, nothing further needs to be done. Note that these updates can become costly if the number of slots in the 'later' part of the ED list is large.
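The shift-and-rollback update described here can be sketched as follows. This is illustrative code for the special case in the example, a new task with the lowest deadline of all, so every existing slot is a 'later' slot; the names (`admit_urgent`, `slot_t`) are ours:

```c
#include <stddef.h>

#define MAX_T 64
typedef struct {
    double deadline, exec, start, end;
} slot_t;

static slot_t slots[MAX_T];   /* ED list, sorted by deadline */
static int nt = 0;

/* Try to admit a task that must run before every current slot. */
int admit_urgent(double deadline, double exec, double now)
{
    int i, j;
    if (now + exec > deadline) return 0;   /* cannot meet own deadline */
    /* Shift every later slot by the new task's execution time. */
    for (i = 0; i < nt; i++) {
        slots[i].start += exec;
        slots[i].end   += exec;
        if (slots[i].end > slots[i].deadline) {
            /* A slot would miss its deadline: roll back the
               slots already shifted, including this one.     */
            for (j = 0; j <= i; j++) {
                slots[j].start -= exec;
                slots[j].end   -= exec;
            }
            return 0;
        }
    }
    /* Accepted: make room at the head and insert the new slot. */
    for (j = nt; j > 0; j--) slots[j] = slots[j-1];
    slots[0].deadline = deadline; slots[0].exec = exec;
    slots[0].start = now; slots[0].end = now + exec;
    nt++;
    return 1;
}
```

The two loops make the cost structure explicit: both acceptance and rejection touch every later slot, which is exactly the expense the proposed scheduler avoids.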

We posit that this approach is not suitable for low-latency dynamic scheduling, simply because this single task list contains more information than is needed for low-cost schedulability analysis. Thus, information redundant to the scheduler (but useful for the dispatcher) has to be updated for schedulability analysis of a task. Specifically, an improved task list should keep track of only the information needed to perform schedulability analysis for any incoming task, and to maintain data required to construct information required by the dispatcher as needed§. These data consist of the 'idle time' available on

§At any time, the dispatcher needs to be concerned only with parameters of the immediately next task to run. This is unlike the scheduler, which has to be concerned with all tasks whose deadlines are not over yet.



Figure 4. Basic costs of real-time scheduling (on KSR-1, without a hardware floating-point unit)

the processor to run an incoming task, the execution time of each task, and the deadline of each task.

3.2. Basic scheduler performance

Experimental environment: The measurements reported in this paper were taken on two versions of a Kendall Square Research CC-COMA multiprocessor. The early version, a KSR-1, had a 20 MHz clock, 32 Mbytes of local memory organized as a cache and 256 Kbytes each of instruction and data subcache. The subcache access time is 2 cycles, while access from the local cache requires 23 cycles, and satisfying a cache miss requires 150 cycles. However, 128 contiguous bytes are fetched when any miss is serviced. The interconnect is a slotted ring with a bandwidth of 1 Gbytes/s. The later version, a KSR-2, has the same interconnect, memory capacity and number of cycles to memory as the KSR-1, but it has a 40 MHz clock. Furthermore, the KSR-2 has a hardware floating-point unit, while on the KSR-1 floating-point operations were performed in software. The data shown here are the average costs of scheduling 25,000 tasks.

Performance of 'control' setup: The measurements in Figure 4 were taken on a KSR-1. These measurements corroborate our claim about the cost of scheduling. Namely, the total cost of scheduling is almost identical to the sum of the costs of searching, idle-time calculation and updating the task list, as mentioned in Section 3. Thus, for a single task list, also called a slot list or ED list, the detailed scheduling costs are shown to arise from (i) finding a place in the list for task insertion, (ii) determining the available idle time, and (iii) actually inserting an accepted task and updating the slot list.

As explained earlier, the length of the slot list has implications on the time required to search the ED list for insertion, and the time to update the 'later' entries in the ED list. The measurements in Figure 4 assume that the slot list is garbage-collected after its length becomes equal to the value on the x-axis. In other words, the length of the slot list can never



become greater than the corresponding value on the x-axis. The topmost line denotes the actual (measured) time for schedulability analysis, the three lowermost lines denote the breakdown of times required individually for (i), (ii) and (iii) above, while the second line from the top represents the sum of the three lowermost lines.

From this graph, it should be apparent that the costs in (i) through (iii) form the major part of the aggregate cost for schedulability analysis. In the next Section, we outline the structure of our proposed scheduler, which reduces each of these costs through the use of composable parts instead of monolithically combining real-time scheduling activities with other thread-specific activities.

3.3. Proposed scheduler structure

Our hypothesis concerning the separation of data structures for maintaining scheduling information and for the task dispatcher is motivated by the discussion concerning reduction of the costs for (i) through (iii) as discussed in Section 3.2:

• The cost of updating 'later slots' is a large part of the overall cost of scheduling. Later slots have to be inspected/updated in operations (ii) and (iii).

• Scheduling cost also arises from computing the total idle time available for a new task. This involves finding the minimum laxity of all accepted tasks with higher deadline.

• The third major cost of scheduling arises from the time taken to search the slot list for inserting the slot for a new task.

The first of the costs itemized above arises because start and end times of individual computations have to be updated every time a new computation is accepted for execution. However, a crucial point to note is that the start and end times are used by the dispatcher and not the scheduler per se. Further, the dispatcher is concerned only with the current computation being run and the immediately next one to be run; it does not need the information about all tasks to be updated every time a new computation is accepted. This indicates that we can achieve significant savings by not performing such updates, if we can somehow generate the start and end time for the dispatcher, instead of pre-computing and storing the same.

To this end, we separate the structures used by the slot list and the dispatcher. In Section 5.1, we present an approach for generating the dispatcher-related information on-the-fly, which results in a reduction of scheduling latency.

We now turn to the second of the costs itemized above. A simple search of all the later slots in the ED list, thereby finding the one with least laxity, is costly, as is evident from the 'Time to determine schedulability' part in Figure 4. A crucial insight is that, typically, there are specific 'minima' in the laxities of incoming computations (these are precisely the 'most urgent' computations). A mechanism that allows one to check these minima of laxity, instead of ploughing through the entire slot list, would significantly reduce the cost of idle-time calculation.

To this end, we reorganize the slot list to keep a pointer from each slot to the slot with minimum laxity of all slots with higher deadline. Then, calculation of idle time necessitates a few pointer dereferences, rather than costly traversals of the slot list. The cost of updating idle time is also reduced by the use of such a data structure. Details of this approach are presented in Section 5.2.
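One way to realize such minimum-laxity pointers is as a suffix-minimum index over the deadline-sorted list, rebuilt by a single backward pass. The sketch below is our illustration, not the authors' data layout; the caller supplies the slack it computed for its own deadline, and the function caps it by the least laxity among all later-or-equal-deadline slots via one lookup:

```c
#include <stddef.h>

#define MAX_S 64
static double s_deadline[MAX_S], s_laxity[MAX_S];  /* sorted by deadline */
static int    min_lax[MAX_S];  /* index of min-laxity slot in the suffix */
static int    ns = 0;

/* One backward pass rebuilds every suffix-minimum pointer. */
void build_min_lax(void)
{
    int i;
    for (i = ns - 1; i >= 0; i--) {
        min_lax[i] = i;
        if (i + 1 < ns && s_laxity[min_lax[i + 1]] < s_laxity[i])
            min_lax[i] = min_lax[i + 1];
    }
}

/* Idle time available to an event with deadline d: the caller's own
   slack before d, capped by the laxity of the least-lax slot among
   slots with deadline >= d -- found by dereferencing the pointer
   of the first such slot, not by scanning the list.               */
double idle_time_for(double d, double slack_before_d)
{
    int i = 0;
    double idle = slack_before_d;
    while (i < ns && s_deadline[i] < d) i++;   /* first later slot */
    if (i < ns && s_laxity[min_lax[i]] < idle)
        idle = s_laxity[min_lax[i]];
    return idle;
}
```

After the pointers are built, the idle-time check itself costs a constant number of dereferences, which is the reduction claimed in the text; the pointers must of course be maintained as events are accepted.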



We note that the last of the costs itemized above is 'internal' to the scheduler list (there is no interaction between searching the slot list and any activity in the dispatcher, for instance), and is somewhat orthogonal. This cost can be reduced by efficient data structures and search methods in the scheduler list, and does not directly affect any interaction with the other modules (e.g. dispatcher) that one has to consider in building composable real-time thread schedulers. Thus, we have considered this cost only briefly. We have used two schemes: a hashing scheme, which is expected to reduce search costs in general, and another approach whereby the last point of insertion is used as the starting point of the search. Results are presented in Section 6.1.

To summarize, the organization and functioning of the proposed real-time scheduler is as follows. When a new computation arrives in the system, an appropriate event (rather than a new thread) is created to handle it. The deadline of this computation is also the event’s deadline. Searching through the ED list (which is sorted by deadline), one finds the point at which the event would be inserted if it were accepted for execution. Next, the available idle time to run the new event is determined by following the ‘minimum laxity’ pointer of that entry in the ED list which has a deadline immediately higher than the new event. If there is not enough idle time to run the new event to completion, it is rejected. Otherwise, the ‘minimum laxity’ structures are updated to reduce available laxity by the execution time of the new event, a slot is created for this new event, and its ‘minimum laxity’ pointer is set up. When an event computation completes, the immediately next event (as per the slot list) is dispatched, with its start and end times being generated for the dispatcher from the execution characteristics for the event, as available in the ED list.
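The admission procedure above can be sketched as follows. This is a minimal illustration in Python, not the paper’s implementation: the names (`Slot`, `admit_event`) are ours, and the minimum-laxity lookup is done with a simple scan rather than the MinlaxPtr structure of Section 5.2.

```python
from bisect import insort
from dataclasses import dataclass, field

@dataclass(order=True)
class Slot:
    deadline: float                              # ED-list sort key
    exec_time: float = field(compare=False)      # worst-case execution time
    laxity: float = field(compare=False, default=0.0)

def admit_event(slot_list, deadline, exec_time):
    """Schedulability test for a newly arrived event.

    Two conditions, as in the text: the event itself must fit before its
    deadline, and no later event may lose more laxity than it has.
    """
    earlier_exec = sum(s.exec_time for s in slot_list if s.deadline < deadline)
    if earlier_exec + exec_time > deadline:
        return False                 # the new event cannot finish in time
    later = [s for s in slot_list if s.deadline >= deadline]
    if any(s.laxity < exec_time for s in later):
        return False                 # some later event would miss its deadline
    for s in later:                  # accepted: later slots are 'pushed back'
        s.laxity -= exec_time
    insort(slot_list, Slot(deadline, exec_time,
                           deadline - earlier_exec - exec_time))
    return True
```

Rejecting rather than queueing an infeasible event matches the admission behaviour described above; in the real scheduler, the MinlaxPtr structure of Section 5.2 replaces the linear scan over later slots with a single pointer dereference.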

Before we consider the scheduler structures in greater detail, in the next Section we present sample applications used to evaluate the scheduler. This is because the scheduler structures are optimized for speculative execution and reconfiguration with respect to floating-point hardware availability, application characteristics that we detail in the next Section.

4. APPLICATIONS USED FOR PERFORMANCE EVALUATION

This Section describes the principal applications motivating the novel event schedulers and their composition developed in this research. These applications are (i) hybrid discrete-event simulations blending simulated with real-world events and (ii) the control of robots in dynamic execution environments. Though disparate, both of these application domains effectively utilize a speculative execution paradigm, which has been shown to be generally useful in dynamic systems [12,17,18].

It should be noted that the principles and mechanisms of scheduler configuration discussed in this paper are of general applicability. We have chosen the speculative execution paradigm to demonstrate a sample method of scheduler configuration, and the advantages that accrue from it. The rollback/re-execute nature of speculative execution requires low-latency real-time schedulability analyses. The dynamic nature of the applications discussed hereunder requires the efficient determination and update of idle time. The applications can be easily written as discrete-event systems. All of these attributes afford a fertile ‘application’ ground to test ideas about the configuration of schedulers for dynamic target systems (e.g. dynamic real-time systems, high-performance computations and mobile systems).

Similarly, we discuss the configuration of schedulers vis-à-vis floating-point hardware availability. While some of this research was actually driven by the availability of floating-point hardware in the parallel processor of our choice, this is a general problem in resource-poor systems. Embedded systems often do not have enough on-chip real estate to be able to afford floating-point hardware. In mobile systems, such resource poverty may be aggravated by cost constraints. There might be ways to cleverly use integer arithmetic for deadlines by complicated manipulation of integer representations of time. However, a floating-point representation of deadlines affords greater accuracy and is more natural and prevalent.

4.1. A hybrid real-time system

A hybrid real-time system is one in which some parts of the system consist of actually implemented software or hardware, while the remainder consists of a simulation of the actual software or hardware that will finally form that part of the system. The externally visible real-time constraints are imposed by the actual hardware or software which has to interact with the environment within particular deadlines. This, in turn, produces real-time constraints on the simulated part, which must interact, and keep up, with the implemented parts. Such systems are becoming increasingly useful for current and future real-time applications which operate in uncertain environments, and therefore cannot be developed without substantial simulation activities blending simulated with unimplemented components or mixing reality with simulation (e.g. to train users, to play ‘what if’ games, etc.). This is because it is difficult or expensive to implement the entire system at once. Furthermore, the ‘working environment’ of these systems may be dangerous, expensive or impossible to reproduce at the development stage.

Since meaningful executions of the simulated components of hybrid real-time simulations can be extremely time-consuming, the use of parallel computing techniques is indicated. However, the data-dependence constraints between the tasks in the simulation are often irregular, rendering conservative execution strategies almost useless in speeding up the simulation. While optimistic/speculative execution has been successfully used to speed up such irregular simulations, the use of speculative techniques in conjunction with maintaining desired real-time constraints remains largely unexplored. Theoretical investigations of this issue appear in [12,15]. This paper presents experimental solutions and evaluations.

The difficulties of associating real-time constraints with optimistic simulations derive from the overheads of real-time scheduling of simulation tasks and of re-scheduling rolled-back tasks sufficiently fast to permit the simulation to keep up with the external environment’s timing. At a minimum, for real-time simulations to succeed, the overheads of task scheduling must be kept low. To evaluate these overheads, this paper employs one of the standard models used for evaluation of parallel discrete-event simulations, called the PHold model [19]. PHold is augmented with real-time deadlines, as discussed in detail in [12]. For the purpose of this paper, we note that the PHold application is written as a collection of simulation processes (represented as threads in our implementation) each of which has typed events with deadlines, as stated in the example in Section 2.3. Executing an event on any thread can update the ‘system state’ and/or produce further events on the same or on a different thread.

4.2. A robotics application

This Section presents the salient features of a dynamic robot control application, and points out why speculative execution is indispensable for attaining high performance for this application.

Consider a robot moving from a specific source to a specific goal on a terrain. Information about the terrain is only partially known. As the robot moves, it collects information about obstacles within its ‘visual’ radius, thereby updating the ‘map of terrain’. The control software for the robot consists of two main parts: a planner which generates the best path for the robot to follow depending on the terrain characteristics, and ‘motor’ software controlling the actuators that move the robot and gathering information from the sensors about the part of the terrain within visual radius.

Since the terrain is not completely known, planning must start with incomplete knowledge. Potential obstacles, which are detected by the sensor part of the motor software, transform to potential data dependencies between the motor module and the planner module. Thus, in a classical ‘parallel’ execution, motor and planner modules would execute in lockstep.

We contend that this is overly pessimistic. We propose to speculatively execute tasks concurrently with other tasks that potentially depend on them, thereby harnessing as much parallelism as possible. Special ‘detect-and-recover’ software will detect any dependency violation that actually arises, and recover from it using a rollback-like approach. The problem is to be able to perform this in real time, keeping up with the deadlines imposed by the robot’s speed and the obstacles in its external environment.

The motor software does not execute speculatively, and thus cannot be made to roll back; it consists of ‘actual’ code with timing constraints. The planner, on the other hand, is implemented as a speculatively executed program which interacts with the motor software, and must ‘keep up’ with the timing constraints required by that code. Both motor and planner are written using the event-oriented approach mentioned earlier. The application starts up one thread per processor¶; that thread runs events on the particular processor on an earliest-deadline-next basis, as mentioned in the example in Section 2.3.

The application has several attributes of interest. First, the execution times, start times and deadlines of planning tasks cannot be determined statically due to the robot’s initial lack of terrain knowledge (i.e. an incomplete or inaccurate ‘map’). Second, since planning is not essential to robot safety, it is both possible and useful to pre-execute planning tasks assuming that undiscovered obstacles do not exist along the current path. This permits re-planning to take advantage of idle time made available by speculative execution. Such execution mimics the primary/secondary tasks often used in the implementation of iterative real-time applications. Third, scheduling latency should be low, since excessive overheads in re-scheduling can lead to the advantages of speculative execution being annulled.

5. SCHEDULER: DETAILED IMPLEMENTATION

Having explained the type of dynamic applications that we address, we now present the details of the scheduler implementation in the next two Sections. Specifically, we consider the separation of scheduler and dispatcher data structures and a method to efficiently calculate the amount of idle time available for a new computation. We end this Section with a consideration of the advantages that accrue from the new scheduler interface.

¶The motor activities all run on one processor; the planner can be configured to run on multiple processors.

5.1. Generating the dispatcher information

In this Section we demonstrate an approach to reducing the stored information used by the dispatcher. Instead of storing information about when to start/end particular computations, the dispatcher generates such information on-the-fly. This results in a reduction of scheduling latencies.

When a new computation is accepted, the entries with higher deadline in a slot list have to be updated to reflect new start/end times and the reduction in the available idle time on the processor. The start time and the end time of the slots corresponding to these ‘later’ tasks have to be increased by an amount equal to the execution time of the new task.

The dispatcher starts running the thread for a particular task at a real time equal to the start time of the slot, and pre-empts that thread when real time becomes equal to the end of the slot. However, for the purposes of schedulability analysis, the minimum information required is the deadline and worst-case execution time of the task. Of course, some other information is also typically kept in the slot list for efficient schedulability analysis (e.g. laxity: see Section 5.2).

As shown in Figure 5, instead of maintaining the start and end times of slots for the dispatcher, we can generate such information from the slot list. The real-time scheduler effectively chooses the next computation to execute (the one with the closest deadline); the execution time of that computation is known; therefore, we just need to schedule an interrupt after an interval equal to the worst-case execution time. At this time the computation, if still running, will be pre-empted. In the expository example shown in Figure 5, we assume that every task is ready to be run at the ‘current time’.
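This generate-on-demand step can be made concrete with a small sketch (our illustration: the scheduler’s ED list is reduced here to plain (deadline, worst-case execution time) pairs):

```python
def next_dispatch(slots, now):
    """Generate the dispatcher's (start, end) pair on the fly.

    'slots' is the scheduler's ED list as (deadline, worst_case_exec) pairs;
    no start/end times are stored anywhere. The next computation is the one
    with the earliest deadline; it starts immediately, and a pre-emption
    interrupt is armed one worst-case execution time later.
    """
    if not slots:
        return None
    deadline, wcet = min(slots)   # earliest deadline next
    return now, now + wcet        # dispatch now; pre-empt at 'end' if still running
```

Because the pair is derived when a computation completes, nothing in the slot list needs to be rewritten when earlier tasks finish early or late.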

5.2. Efficiently finding and updating the minimum laxity

Finding and updating the idle time efficiently are the main issues involved in constructing real-time schedulers. It is possible to traverse the slot list and find/update the idle time by modifying the start/end times on each slot. However, recall that the start/end times are floating-point values. Hence, the above approach becomes too costly if floating-point hardware is not available.

In this Section, we show how a separation of scheduler and dispatcher data structures allows a different organization of the slot list – one which allows re-configuration of the slot list for fast calculation of the idle time on machines where floating-point hardware is not available. Specifically, we demonstrate how our scheduler can use pointer manipulation to circumvent the increased latency of scheduling due to the absence of floating-point hardware. We first discuss the general principles, and then re-do the example shown in Section 3.1. This clearly brings out the reduction in cost.

To find and update the minimum laxity efficiently, from every slot S in the ED list it should be possible to find the identity of the slot which has the minimum laxity of all slots with deadline greater than S. This would reduce the time required for idle-time determination, which would then involve reading a single structure. Further, the time for updating later slots (to show that their laxities have reduced) would also be lowered, since that would now involve updating the ‘minimum laxity’ values rather than the individual start and end times of all future slots. The number of such minimum-laxity values is expected to be much smaller than the total number of slots with higher timestamp in the ED list.

Figure 5. Generating information for dispatcher from the ED list

To this effect, we associate a pointer with every slot. This pointer (called MinlaxPtr in Figure 6) points from a slot S to a slot with equal or higher deadline which has the minimum laxity of all the later tasks. In the example in Figure 6, the slot with deadline 15 has the minimum laxity of all the tasks. Therefore, the MinlaxPtr of the slots with deadlines 5, 7, 10 and 15 points to the slot with deadline 15, while the MinlaxPtr of the slot of the task with deadline 19 points to itself (since MinlaxPtr must point to a slot of equal or greater deadline)‖. Now, finding the minimum laxity of all tasks with deadline greater than or equal to the task corresponding to a slot S involves accessing the slot pointed to by the minimum-laxity pointer of S. The array CumExec in Figure 6 is an auxiliary structure used to speed up schedulability analyses. Its elements contain the sum of the execution times of all slots up to the slot pointed to. Schedulability analysis for an arriving event is performed as follows. If the execution time of the event is less than the laxity obtained from the MinlaxPtr (this ensures that no later event will miss its deadline due to the new event), and the sum of the cumulative execution time of all the previous slots and its own execution time is less than its deadline (this ensures that the new event will complete before its deadline), then the new event is accepted.

Figure 6. The use of MinlaxPtr to reduce time for finding idle time, and updating later laxities

‖For expository purposes, all later laxities have been updated to their proper values in Figure 6. In actuality, only the values in slots pointed to by MinlaxPtrs will be updated.
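The MinlaxPtr values can be computed in a single backward pass over the ED-sorted slot list. The sketch below uses the deadlines from the example (5, 7, 10, 15, 19); the laxity values are invented for illustration, with the minimum placed at the slot with deadline 15 as in the text. Note that the actual scheduler maintains these pointers incrementally rather than rebuilding them on every insertion.

```python
def build_minlax_ptrs(slots):
    """For each slot in the ED-sorted list, compute the index of the slot
    with minimum laxity among all slots with equal or greater deadline.

    'slots' is a list of (deadline, laxity) pairs sorted by deadline;
    names are illustrative, not the paper's code.
    """
    n = len(slots)
    ptr = [0] * n
    best = n - 1                        # the last slot points to itself
    for i in range(n - 1, -1, -1):
        if slots[i][1] <= slots[best][1]:
            best = i                    # new equal-or-lower laxity minimum
        ptr[i] = best
    return ptr
```

For the example, the pointers of the slots with deadlines 5, 7, 10 and 15 all resolve to the slot with deadline 15, and the slot with deadline 19 points to itself.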

It should be noted that there may be several minimum-laxity pointers associated with the complete slot list – as many as the number of slots in the worst case. When a new slot is inserted into the ED list, the laxities associated with all the minimum-laxity pointers associated with ‘later’ slots are reduced by an amount equal to the execution time of the new task. In addition, minimum-laxity pointers of slots of tasks with lower deadlines may have to be updated, since they might now have to point to a lower minimum: one corresponding to the minimum-laxity pointer of the new task’s slot. However, the number of such updates is expected to be much lower than the total number of slots in the average case.

Figure 7. Why the MinlaxPtr approach does not work as such for non-speculative execution

Note that speculative execution is crucial for the approach outlined above. This is because the minimum-laxity strategy requires that, when a new task arrives, the task with the minimum laxity of all higher-deadline tasks should determine the amount of idle time available for running the new task. In speculative execution, the start time of each task is essentially the time when the system starts execution. In non-speculative execution, with specific start times on tasks, the condition above may not hold. In Figure 7, assume that task deadlines Di are increasing from tasks T0 to T4, and that task T3 has the minimum laxity of all the tasks. Assume, further, that tasks T1 through T4 are present initially, and later T0 arrives in the system. In the speculative case, schedulability of T0 is determined by T3 (in general, by the task with least laxity of all tasks with greater deadline), since in ‘pushing back’ the later tasks’ slots to schedule T0, T3 must also be pushed back. T3’s laxity, therefore, determines T0’s schedulability. However, in the non-speculative case – where no task Ti begins executing before its start time Si – ‘pushing back’ T1 and T2 might produce enough ‘free’ time to execute T0; we might not need to involve T3 in the scheduling decision at all. This is a direct result of tasks having specific start times.

The Minlax approach has other attractive features. Finding ‘tight’ worst-case execution times for the tasks is difficult, since execution time might depend on input values. Separating the dispatcher and the scheduler, and having the minimum-laxity pointers, helps to quickly ‘absorb’ unused execution time (when a task executes for an interval less than its worst-case execution time) in schedulability analysis. Had the dispatcher and the scheduler used the same slot list, we would have to update the start and end times of the individual slots of all tasks with higher deadlines to accommodate the unused time, as shown through an example in Figure 8.

However, as also shown in Figure 8, in the new slot list we need to update only the laxity values associated with later slots pointed to by minimum-laxity pointers. Such Minlax-pointed slots are expected to be far fewer than the slots of later tasks; thus, this operation too is speeded up. Similarly, if there is an ‘exception condition’ in any of the other schedulers on the same processor (as stated in Section 5.1, when the application needs multiple scheduling policies and multiple schedulers on each node), laxity from one scheduler can be transferred to another. This policy is supported by the mechanism of reducing the laxity values associated with the minimum-laxity pointers, analogously to the mechanism of increasing the values when unused time is to be utilized.

Figure 8. The use of MinlaxPtr to efficiently utilize idle time

Generating the start and end times for the dispatcher can be done exactly as was discussed in Section 5.1. This is shown for the running example with the new slot list in Figure 9.

It is clear that we are performing many more pointer operations (pointer chasing, dereferencing and update) in our new scheduling strategy than in ‘traditional’ real-time scheduling schemes, as discussed in Section 3.1. On the other hand, the traditional scheme performs many more floating-point operations than our scheme. The hardware units that perform floating-point operations and pointer operations are different. As will become evident in Section 6.2, a clear separation between the scheduler and the dispatcher allows one or the other approach to be utilized as required, depending on hardware availability.

Figure 9. Generating information for dispatcher from the slot list

The scheduler accepts two types of queries: one returns a yes/no answer, and the other additionally returns the available idle time. To calculate the amount of idle time on a processor before the deadline of any event E (note that this is slightly different from finding the schedulability of E), we proceed as follows. Find the sum of the execution times of slots with higher deadline than E∗∗ up to the MinlaxPtr slot corresponding to E. Let the MinlaxPtr slot of E be S, and the deadline of S be D. Let the cumulative execution time found above be C2. If D − C2 ≤ the deadline of E, the idle time is given by the laxity of S. If D − C2 > the deadline of E, the idle time is given by the deadline of E minus C1, where C1 is the cumulative execution time of all events with deadline lower than E. Currently (on the KSR-2), returning the available idle time is between 5 and 6% more expensive than a yes/no response to schedulability analysis.
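The two-case rule above can be paraphrased directly. In this sketch (our notation, not the paper’s code), C1, C2, D and the laxity of S are assumed to be already available from the CumExec array and the MinlaxPtr slot:

```python
def idle_time_before(deadline_E, minlax_slot, C1, C2):
    """Idle time available on the processor before the deadline of event E.

    'minlax_slot' is the (deadline D, laxity) pair reached via E's MinlaxPtr;
    C2 is the cumulative execution time of slots with deadline above E up to
    that slot; C1 is the cumulative execution time of all events with
    deadline below E.
    """
    D, laxity = minlax_slot
    if D - C2 <= deadline_E:
        return laxity             # the minimum-laxity slot constrains E
    return deadline_E - C1        # E is constrained only by earlier work
```

Both branches are a constant number of reads once the MinlaxPtr slot is known, which is what makes the idle-time query only slightly more expensive than the yes/no query.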

5.3. Advantages of new interfaces

It should be clear that the only major differences in the new scheduler as compared to existing scheduler interfaces were in (i) the use of events rather than new threads to handle incoming computations and (ii) the separation of the concerns and data structures of the scheduler from those of the dispatcher. Other differences internal to the scheduler (e.g. the minimum-laxity approach) were facilitated by these major changes in the interface. Here we briefly consider the advantages that we expect to see as a result of using these new interfaces.

∗∗This can be obtained quickly using the CumExec array shown in Figure 6.

The advantage of using an event to handle an incoming computation is higher performance. The separation of the concerns of schedulability analysis and dispatching affords a reduction in the cost of updating later entries in the ED list, and in finding and updating available idle time. This is because there will be far fewer floating-point updates necessary in the new scheduler. Real time is usually represented as floating-point quantities for high precision. Floating-point hardware is often unavailable due to power consumption/dissipation and scarce on-chip real-estate concerns in embedded real-time systems. Under such situations, the potentially large number of floating-point updates hampers real-time scheduling performance in traditional schedulers. The separation of concerns of scheduling and dispatching allowed us to easily reconfigure the scheduler to achieve fast real-time scheduling even in the absence of floating-point hardware.

We discuss details of these performance improvements and report measurements in the next Section.

6. PERFORMANCE EVALUATION

In this Section, we present the performance of our framework on the applications mentioned in Section 4. Recall that there are three components to our scheduling framework: the dispatcher, the thread layer and the scheduler. Detailed measurements of our thread support, the performance of which is comparable to that of other thread packages, are reported in [20]. For completeness, here we briefly mention the costs of the thread operations relevant to us.

The costs relevant to traditional scheduler frameworks are as follows. Forking a thread takes approximately 75 µs, forking and detaching a thread requires roughly 380 µs, and yielding a thread requires about 40 µs. Switching context back and forth between two threads on the same processor takes 78 µs. In comparison, the costs relevant to our framework are as follows. A null function call requires about 2 µs, and creating a new event requires approximately 60 µs.

If there are n periodics mapped on a processor, the cost of thread management in a traditional framework implementation would be 75n µs for forking one ‘periodic thread’ per periodic task, and (at least) 78 µs for each context switch between instances of two different periodic tasks thereafter. In contrast, our mechanism requires 75 µs initially, irrespective of the number of periodics (assuming that one thread is enough for all n periodics), and 62 µs for transferring control from one event to another thereafter††. Thus, it is clear that our framework performs better than traditional frameworks at the ‘thread management’ layer.
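The break-even arithmetic is easy to make explicit; the constants below are the measured costs quoted above (75 µs per fork, 78 µs per context switch, 62 µs per event-to-event transfer):

```python
def traditional_cost_us(n_periodics, n_dispatches):
    """Thread-per-periodic: one 75 us fork per periodic plus 78 us per context switch."""
    return 75 * n_periodics + 78 * n_dispatches

def event_cost_us(n_dispatches):
    """Event-based: a single 75 us fork, then 62 us per event-to-event transfer."""
    return 75 + 62 * n_dispatches
```

For 10 periodics and 100 dispatches this gives 8550 µs versus 6275 µs, and the gap widens with every additional periodic or dispatch.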

In the remainder of this Section, we investigate the reduction in each of the components of the scheduler that contribute to real-time scheduling latencies, as mentioned in Section 3.1. We present measurements on the KSR-1 and KSR-2 to show how disparate hardware affects scheduling latencies, and how reconfiguring the scheduler for the appropriate hardware helps reduce latencies in each case.

††It should be noted that the 60 µs time to create a new event includes some of the overheads – such as state saving, and maintaining dependency information – due to the speculative execution paradigm [11,12]. While we have not yet tested our system with non-speculative execution, we believe that the time to create a new event will reduce significantly in non-speculative execution.

Table 1. Comparison of the costs (µs) of insertion into slot list (on KSR-1)

Number of slots    original-cost    hash-cost    guess-cost    hash-time
3000               3350             780          595           32
1000               1130             284          342           30
500                683              220          167           25
100                130              53           69            55

6.1. Reducing the search time

A scheduler in a speculative scenario has to keep track of several ‘processed’ events, which cannot be garbage-collected till real time becomes greater than their deadlines. The slots of these events should not be removed from the slot list, since speculatively executed events might have to be undone and re-done. Removing the processed slots from the slot list would necessitate explicit schedulability analysis, and re-introduction into the slot list, for each of those events when they are undone and re-executed. Retaining the slots in the list essentially ‘reserves’ time for a processed event to be undone and re-done, until we are sure that it will not be undone any more‡‡. Therefore, the slot list can be much larger than the corresponding list in the case of a non-speculative real-time ED scheduler. While we have discussed a reconfigurable garbage-collection scheme for real-time speculative execution in [16], here we discuss the time taken to search the slot list when a new slot has to be inserted into it.

The slot list can be organized in several ways. In traditional real-time systems, where there is no rollback, one often arranges the slot list as a heap or some other tree-like data structure for fast access. However, if there are rollbacks, repeated re-organization of the heap becomes expensive. Therefore, we organize our slot list as a linear list: rollback merely results in the update of a pointer corresponding to the slot of the event from which execution should restart. But a simple linear search is too expensive, as shown in Figure 4. To alleviate costs, we used a hashing strategy whereby, given the deadline of an event, we obtain a pointer to the slot of an event with deadline less than or equal to that of the new event. Thus, bypassing a part of the slot list, we search forward until we reach the point of insertion.
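A minimal sketch of such a bucketed lookup follows, assuming equal-width buckets over a fixed deadline horizon; the paper’s actual scheme is adaptive, resizing the bucket count from the observed spacing of deadlines, so this is an illustration only.

```python
import bisect

def build_buckets(deadlines, n_buckets, horizon):
    """Bucket b stores the number of slots with deadline <= b * bucket width,
    i.e. a starting index from which the forward scan may safely begin."""
    width = horizon / n_buckets
    return [bisect.bisect_right(deadlines, b * width) for b in range(n_buckets)]

def hashed_insert_index(deadlines, buckets, horizon, new_deadline):
    """Hash the new deadline to a bucket, then scan forward to the insertion point."""
    width = horizon / len(buckets)
    b = min(int(new_deadline / width), len(buckets) - 1)
    i = buckets[b]                   # skip the prefix of certainly-earlier slots
    while i < len(deadlines) and deadlines[i] <= new_deadline:
        i += 1                       # short forward scan
    return i
```

Because a bucket’s starting index only ever skips slots with deadlines at or below the bucket’s lower bound, the forward scan can never overshoot the correct insertion point.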

A second search method relies on the basic speculative-execution paradigm: that most of the time ‘guesses’ will be correct, and insertions of a new slot will tend to be close to the point where the last insertion was made. Thus, the scheduler ‘remembers’ the position of the current insertion, and starts searching backward/forward from there for the next insertion (see Table 1). While the variance of this method might be high, it produces the best performance (see Section 6.2).
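The last-insertion-point heuristic amounts to a short bidirectional walk; the sketch below uses illustrative names and a plain Python list in place of the linked slot list:

```python
def guess_insert_index(deadlines, last_idx, new_deadline):
    """Find the insertion index for 'new_deadline' in an ED-sorted list by
    searching outward from the previous insertion point 'last_idx',
    exploiting the temporal locality of speculative workloads."""
    i = min(max(last_idx, 0), len(deadlines))
    while i > 0 and deadlines[i - 1] > new_deadline:
        i -= 1                       # walk backward toward earlier deadlines
    while i < len(deadlines) and deadlines[i] <= new_deadline:
        i += 1                       # walk forward toward later deadlines
    return i
```

When successive insertions cluster, each call touches only a few neighbouring slots; in the worst case it degenerates to a linear scan, which matches the high variance noted in the text.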

Hashing reduces the cost of searches (we use an adaptive hashing scheme: the number of buckets is reconfigured based on the observed average difference between deadlines of events). However, this reduction itself comes at a cost. With very small slot lists, the number of times one has to re-hash (we re-hash the slot list every time garbage collection is performed, since the pointers change) becomes large, thereby increasing the net cost of search through hashing.

‡‡This happens when real time reaches the deadline of that event.

Table 2. Comparison of the costs (µs) of scheduling in original and improved scheme (on KSR-1)

Number of slots    original-cost    new-cost
3000               5300             1337
1000               2722             913
500                1750             534
100                500              330

6.2. Calculating and updating idle time

Table 2 shows the reduced cost of scheduling as a result of using the ‘best-guess’ strategy for search, minimum-laxity access structures for updating laxities, and separating the dispatcher from the scheduler. These measurements were taken on a KSR-1.

Once again, the costs increase with very small slot lists because the overheads associated with maintaining the access structures, and updating them while garbage-collecting, become overwhelming due to the frequency of garbage collection and the relatively small amount of time required to directly search a short slot list.

Since we want to obtain precision in deadlines, all real times were represented as doubles in the system. Thus, hardware support for floating-point operations would be useful for real-time schedulability analyses. However, on-chip real estate being a scarce commodity in embedded systems, such systems are often not equipped with floating-point hardware. Thus, one should be able to perform low-latency scheduling even in the absence of floating-point hardware. The measurements above were taken on the KSR-1, which has a 20 MHz clock and on which floating-point arithmetic was performed in software. Though co-processors have been suggested for fast real-time schedulability analyses, we contend that it is possible to perform fast real-time scheduling using stock hardware. It should be noted that even for fairly large slot lists – a maximum of 3000 slots in the list before garbage collection – we perform within an order of magnitude of what researchers elsewhere have achieved (130 µs with co-processors running at 50 MHz [19]).

In comparison with measurements quoted in [20], our measurements on the KSR-1 suffered on two counts: a much slower processor (20 MHz, rather than the 50 MHz co-processors in [21]) and the absence of hardware floating-point operations. We repeated our measurements on the KSR-2. Processors on the KSR-2 run at 40 MHz, and perform floating-point operations in hardware. Scheduler latencies on the KSR-2 are quoted in Table 3.

It is interesting to compare these costs to those on the KSR-1. The scalar speeds of the processors on the KSR-2 are double those of the KSR-1. Figures 10 and 11 show graphically the costs of finding idle time intervals available for a new event, and of updating laxities upon accepting a new event (on the KSR-1 and KSR-2, respectively).

Thus, schedulability analyses were speeded up in general, because traversals of data structures became faster. Further, the laxity updates in the direct-update scheme were vastly improved by the floating-point hardware. While the speed of the

Copyright 1999 John Wiley & Sons, Ltd. Concurrency: Pract. Exper.,11, 221–245 (1999)


Table 3. Comparison of the cost (µs) of scheduling in the direct-update and minimum-laxity schemes (on KSR-2)

Number of slots   Direct-update   Minimum-laxity
3000              1413            1312
1000               450             536
 500               169             185
 100                70              90

Figure 10. The costs of real-time scheduling (without floating-point hardware, on KSR-1)

Figure 11. The costs of real-time scheduling (with floating-point hardware, on KSR-2)


minlaxity approach increased on the KSR-2, direct-update outperformed the minlaxity scheme (though by a small amount) most of the time. However, when the slot lists were very long, the minlaxity approach outperformed the direct-update strategy (once again, by a small amount), because the number of floating-point updates became too large with long slot lists. We might add that the number of slots present at any time in the slot list will be in the 50–100 range for the applications we are targeting; this was the case in the robot-control application.
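The trade-off behind these numbers can be sketched in a few lines (Python, with illustrative names; this is not the measured implementation). A direct-update list performs arithmetic over every slot, while a minimum-laxity access structure orders slots once on insertion. Since laxity = (deadline − remaining) − now, and `now` is common to all slots, the laxity ordering is time-invariant, so the minimum can be read off without per-slot updates:

```python
# Sketch of the direct-update vs. minimum-laxity trade-off (not the paper's
# data structures). Direct-update touches every slot per query; the heap
# orders slots by (deadline - remaining), a key that does not change as time
# advances, so the minimum-laxity slot is a peek at the heap root.
import heapq

class DirectUpdateList:
    def __init__(self):
        self.slots = []                 # [deadline, remaining] pairs

    def accept(self, deadline, remaining):
        self.slots.append([deadline, remaining])

    def min_laxity(self, now):
        # O(n) arithmetic over the whole slot list on every query.
        return min((d - now) - r for d, r in self.slots)

class MinLaxityHeap:
    def __init__(self):
        self.heap = []                  # keyed on (deadline - remaining)

    def accept(self, deadline, remaining):
        # O(log n) ordering work done once, at acceptance time.
        heapq.heappush(self.heap, (deadline - remaining, deadline, remaining))

    def min_laxity(self, now):
        # O(1) peek: the laxity ordering is invariant in `now`.
        key, _d, _r = self.heap[0]
        return key - now
```

The sketch captures why long slot lists favor the access structure (no per-slot arithmetic per query) while short lists favor direct update (no structure-maintenance overhead).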

6.3. Summary

Thus, we were able to achieve significant performance gains through two chief design changes: using events instead of forking new threads where possible, and separating the scheduler from the dispatcher. This separation, in turn, allowed alternative implementations of scheduler modules, which afforded further performance gains by reconfiguring the scheduler to suit the underlying (floating-point) hardware.
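The separation can be pictured with a minimal sketch (Python; the class and method names are ours, not the framework's interface). The scheduler owns admission and the global schedulability analysis, while each dispatcher touches only its processor-local ready queue, so either side can be replaced independently:

```python
# Minimal sketch of scheduler/dispatcher separation (illustrative names,
# not the framework's API). Events are (deadline, exec_time) pairs in
# integer time units.
from collections import deque

class Scheduler:
    """Global side: admits an event only if it can still meet its deadline."""
    def __init__(self):
        self.ready = {}                      # per-processor ready queues

    def admit(self, cpu, event, now):
        deadline, exec_time = event
        if (deadline - now) - exec_time < 0:
            return False                     # negative laxity: reject
        self.ready.setdefault(cpu, deque()).append(event)
        return True

class Dispatcher:
    """Per-processor side: pops the next admitted event from local state only."""
    def __init__(self, cpu, scheduler):
        self.queue = scheduler.ready.setdefault(cpu, deque())

    def next_event(self):
        return self.queue.popleft() if self.queue else None
```

Because the dispatcher never consults the scheduler's analysis structures, its queue can live in processor-local memory, which is the property exploited in the reconfigurations above.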

7. RELATED WORK

Much of our insight about the framework discussed here arose while using an older real-time thread package developed by our group [2], the interface of which is similar to such efforts elsewhere [1]. The differences between these bodies of work and our current effort have been detailed in Section 2. Separation of concerns for real-time schedulability has been addressed by Musliner et al. [22], but the emphasis in that work is on separating AI and real-time concerns, thereby distinguishing the unpredictable parts of the system from the parts requiring timeliness. In contrast, our distinction is among the real-time entities in the system.

Significant work has been done in reducing the amount of context and other overheads required by threads. In [23], Mohr et al. examine a technique for creating tasks only when resources favor task creation in a fine-grain MIMD programming model. Eager and Zahorjan consider the merits and demerits of thread-based models of user-level concurrency, and propose an extension of the work-heap model, called Chores [24]. Freeh et al. present another extension of a lightweight thread model with minimal context, called Distributed Filaments [8]. The ‘threads’ model used in our framework has similarities with these efforts.

Niehaus et al. have reported the performance of special-purpose hardware for real-time schedulability analyses [21]. As reported in Section 6.2, we have found that it is possible to perform schedulability analyses just as fast in software on general-purpose computers.

8. CONCLUSIONS AND FUTURE WORK

We have discussed a framework for implementing schedulers in a dynamic real-time system. Identifying drawbacks in existing frameworks, we have specified a new framework that allows ease of reconfiguration and affords better performance. We have substantiated our claims by implementing a configurable scheduler for dynamic real-time systems – the dynamism arising from the nature of the applications, as well as from a speculative-execution paradigm.


Our reconfigurations were performed in response to the presence of specific hardware. Through detailed measurements on KSR-1 and KSR-2 machines, we pointed out the basic drawbacks of extant real-time scheduler interfaces. Though both of the KSR machines are now obsolete, the lessons learned from floating-point hardware on those machines are of general applicability in a real-time environment, and are applied in our reconfigurations. It should, for example, be clear that the ‘dispatcher’ data structures can be made local to a processor, while the ‘scheduler’ structures might be more global: this would be useful for reducing remote memory accesses on a NUMA machine.

We proposed a new interface and, through measurements on the same machines, demonstrated the efficiency of our scheduler interface in ease of reconfiguration and low-latency scheduling decisions. Measurements of scheduling latency are very encouraging: the performance of our scheduler is close to that obtained by researchers elsewhere using dedicated hardware co-processors for real-time scheduling analyses.

This is the first implementation of a scheduler that can handle low-latency, real-timescheduling in a speculative-execution (Time Warp-like) environment.

Our immediate future work on the current effort includes formalizing the interfaces at the dispatcher, scheduler and ‘threads’ levels. Our longer-term work is in kernel reconfiguration from the application level, and in extending the speculative-execution paradigm to distributed platforms, with scope for multimedia applications.

ACKNOWLEDGEMENT

Tucker Balch implemented an earlier sequential version of the robot control software on a SPARCstation.

REFERENCES

1. Hideyuki Tokuda, Tatsuo Nakajima and Prithvi Rao, ‘Real-Time Mach: towards predictable real-time systems’, Proceedings of the USENIX 1990 Mach Workshop, October 1990.

2. Karsten Schwan, Hongyi Zhou and Ahmed Gheith, ‘Multiprocessor real-time threads’, Oper. Syst. Rev., 25(4), 35–46 (1991). Also appears in the Jan. 1992 issue of Oper. Syst. Rev.

3. Bill O. Gallmeister and Chris Lanier, ‘Early experience with POSIX 1003.4 and POSIX 1003.4A’, in Proceedings of the Real-Time Systems Symposium, IEEE Computer Society Press, December 1991, pp. 190–198.

4. Bryan Ford and Sai Susarla, ‘CPU inheritance scheduling’, Proceedings of the Second USENIX Symposium on Operating Systems Design and Implementation (OSDI), October 1996, pp. 91–105.

5. Pawan Goyal, Xingang Guo and Harrick M. Vin, ‘A hierarchical CPU scheduler for multimedia operating systems’, Proceedings of the Second USENIX Symposium on Operating Systems Design and Implementation (OSDI), October 1996, pp. 107–121.

6. Carl A. Waldspurger and William E. Weihl, ‘Lottery scheduling: flexible proportional-share resource management’, Proceedings of the First USENIX Symposium on Operating Systems Design and Implementation (OSDI), November 1994, pp. 1–12.

7. J. Misra, ‘Distributed discrete-event simulation’, ACM Comput. Surv., 18(1), 39–65 (1986).

8. Vincent W. Freeh, David K. Lowenthal and Gregory R. Andrews, ‘Distributed Filaments: efficient fine-grain parallelism on a cluster of workstations’, Proceedings of the First USENIX Symposium on Operating Systems Design and Implementation (OSDI), November 1994, pp. 201–214.

9. Hongyi Zhou, ‘Task scheduling and synchronization for multiprocessor real-time systems’, PhD thesis, College of Computing, Georgia Institute of Technology, May 1992.


10. Steve Whitney, John McCalpin, Nawaf Bitar, John L. Richardson and Luis Stevens, ‘The SGI Origin software environment and application performance’, Proceedings of COMPCON ’97, November 1997.

11. R. M. Fujimoto, ‘Time Warp on a shared memory multiprocessor’, Trans. Soc. Comput. Simul., 6(3), 211–239 (1989).

12. Kaushik Ghosh, Kiran Panesar, Richard M. Fujimoto and Karsten Schwan, ‘PORTS: a parallel, optimistic, real-time simulator’, Proceedings of the 8th Workshop on Parallel and Distributed Simulation (PADS), July 1994.

13. Houssine Chetto and Maryline Chetto, ‘Some results of the earliest deadline scheduling algorithm’, IEEE Trans. Softw. Eng., SE-15, 1261–1269 (1989).

14. C. L. Liu and James W. Layland, ‘Scheduling algorithms for multiprogramming in a hard real-time environment’, J. Assoc. Comput. Mach., 20(1), 46–61 (1973).

15. Kaushik Ghosh, Richard M. Fujimoto and Karsten Schwan, ‘Time Warp simulation in time constrained systems’, Proceedings of the 7th Workshop on Parallel and Distributed Simulation (PADS), May 1993. Expanded version available as Technical Report GIT-CC-92/46.

16. Kaushik Ghosh, ‘Reconfigurable garbage collection of data structures in a speculative real-time system’, Technical Report GIT-CC-94-57, College of Computing, Georgia Institute of Technology, May 1994.

17. Azer Bestavros and Spyridon Braoudakis, ‘Timeliness via speculation for real-time databases’, Proceedings of the 15th IEEE Real-Time Systems Symposium, December 1994.

18. M. Younis, T. J. Marlowe and A. D. Stoyenko, ‘Compiler transformations for speculative execution in a real-time system’, Proceedings of the 15th IEEE Real-Time Systems Symposium, December 1994.

19. R. M. Fujimoto, ‘Performance of Time Warp under synthetic workloads’, Proceedings of the SCS Multiconference on Distributed Simulation, January 1990, Vol. 22, No. 1, pp. 23–28.

20. Bodhisattwa Mukherjee, Greg Eisenhauer and Kaushik Ghosh, ‘A machine independent interface for lightweight threads’, Oper. Syst. Rev., 28(1), 33–47 (1994).

21. Douglas Niehaus, Krithi Ramamritham, John A. Stankovic, Gary Wallace and Charles Weems, ‘The Spring scheduling co-processor: design, use and performance’, in Proceedings of the Real-Time Systems Symposium, IEEE Computer Society Press, December 1993, pp. 106–111.

22. David J. Musliner, Edmund H. Durfee and Kang G. Shin, ‘CIRCA: a cooperative intelligent real-time control architecture’, IEEE Trans. Syst. Man Cybern., 23(6), 1561–1574 (1993).

23. Eric Mohr, David A. Kranz and Robert H. Halstead, ‘Lazy task creation: a technique for increasing the granularity of parallel programs’, IEEE Trans. Parallel Distrib. Syst., 2(3), 264–280 (1991).

24. Derek L. Eager and John Zahorjan, ‘Chores: enhanced run-time support for shared-memory parallel computing’, ACM Trans. Comput. Syst., 11(1), 1–32 (1993).
