Seminar "Parallel Computing", Summer term 2008
Seminar paper, Parallel Computing (703525)
Optimisation: Operating System Scheduling on multi-core architectures
Course instructor (Lehrveranstaltungsleiter): T. Fahringer
Author: Thomas Zangerl


Abstract

As multi-core architectures begin to emerge in every area of computing, operating system scheduling that takes the peculiarities of such architectures into account will become mandatory. Due to architectural differences from traditional multi-processors, such as shared caches and memory controllers and the smaller cache size available per computational unit, it does not suffice to simply schedule tasks on multi-core processors in the same way as on SMP systems. Furthermore, current research motivates architectural changes in CPU design, such as multi-core processors with asymmetric core performance and so-called many-core architectures that integrate up to 100 cores in one package. Such architectures will exhibit fundamentally different behaviour with regard to shared-resource utilization and the performance of non-parallelizable code compared to current CPUs. It will be the responsibility of the operating system to spare the programmer as much platform-specific knowledge as possible and to optimize overall performance by employing intelligent and configurable scheduling mechanisms.
Contents

Abstract
1. Introduction
   1.1 Why use multi-core processors at all?
   1.2 What's so different about multi-core scheduling?
2. OS process scheduling: state of the art
   2.1 Linux scheduler
      2.1.1 The Completely Fair Scheduler
      2.1.2 Scheduling domains
   2.2 Windows scheduler
   2.3 Solaris scheduler
3. Ongoing research on the topic
   3.1 Cache-Fairness
   3.2 Balancing core assignment
   3.3 Performance asymmetry
   3.4 Scheduling on many-core architectures
4. Conclusion
5. References
1. Introduction

1.1 Why use multi-core processors at all?

In the last few years, multi-core CPUs have become a standard component in nearly all sorts of computers: not only servers and high-end workstations, but also desktop and laptop PCs for consumers and even game consoles nowadays usually come with CPUs with more than one core. This development is not surprising; already in 1965, Gordon Moore predicted that the number of transistors that can be cost-effectively built onto integrated circuits would double every year ([1]). In 1975, Moore corrected that assumption to a period of two years; nowadays this period is frequently assumed to be 18 months.

Moore's projection has more or less been accurate up to today, and consumers have become used to the constant speedup of computer hardware: buyers expect a new computer to show a significant speedup over a two-year-old model (even though an increase in transistor density does not always lead to an equal increase in computing speed). For chip manufacturers, however, it has become increasingly difficult to keep up with Moore's law. In order to implement the exponential increase of integrated circuits, the transistor structures have to become steadily smaller. On the one hand, the extra transistors were used for the integration of more and more specialized instruction sets on CISC chips. On the other hand, smaller transistor sizes led to higher clock rates of the CPUs, because due to physical factors the gates in the transistors could perform faster state switches.

However, since electronic activity always produces heat as an unwanted by-product, the more transistors are packed together in a small CPU die area, the higher the resulting heat dissipation per unit area becomes ([2]). With the higher switching frequency, the electronic activity was performed in smaller intervals, and hence more and more heat dissipation emerged. The cooling of the processor components became a crucial factor in design considerations, and it became clear that increasing the clock frequency could no longer serve as the primary source of processor speedup.

Hence, there had to be a paradigm shift in order to still make applications run faster. On the one hand, the amazing abundance of transistors on processor chips was used to increase cache sizes. This alone, however, would not result in an adequate performance gain, since it only helps memory-intensive applications to a certain degree. In order to effectively counteract the heat problem while making use of the small structures and high number of transistors on a chip, the notion of multi-core processors for consumer PCs was introduced. Since CMOS technology had met its limits for further increases of the CPU clock frequency, and the number of transistors that could be integrated on a single die allowed for it, the idea emerged that multiple processing cores could be placed on a single processor die.

In 2006, Intel released the Core microprocessor, a die package with two processor cores with their own level 1 caches and a shared level 2 cache ([3]). Also in 2006, AMD, the second major CPU manufacturer for the consumer market, released the Athlon X2, a processor with an architecture quite similar to the Core platform, but additionally featuring the concept of sharing a CPU-integrated memory controller among the cores ([4]).
Both architectures have been improved and sold with a range of consumer desktop and laptop computers (but also servers and workstations) up to today; the presence of multi-core processors in a large number of today's PCs can therefore be assumed.

1.2 What's so different about multi-core scheduling?

One could assume that the scheduling process on such multi-core processors would not differ much from conventional scheduling: intuitively, the run-queue would just have to be replaced by n run-queues, where n is the number of cores, and processes would simply be scheduled to the currently shortest run-queue (perhaps with some additional process-priority treatment); a minimal sketch of this naive policy is shown at the end of this section. While that might seem reasonable, there are some properties of current multi-core architectures that speak strongly against such a naive approach. First, in many multi-core architectures, each core manages its own level 1 cache (Figure 1). By just naively rescheduling interrupted processes to a shorter queue that belongs to another core (task migration), parts of the process's cache working set may be lost unnecessarily and overall performance may suffer. This effect becomes even worse if the underlying architecture is not a multi-core but a NUMA system, where memory access can become very costly if the process is scheduled on the wrong node.

Figure 1: Typical multi-core architecture

A second important point is that the performance of the different cores in a multi-core system may be asymmetric ([5]). This effect can emerge for several reasons:
- Design considerations. Many slow cores can be used to increase throughput in parallel computation, while a few faster cores contribute to the efficient processing of costly tasks which cannot be parallelized ([6]). Even algorithms that are parallelizable contain parts that have to be executed sequentially, which will benefit from the higher speed of the fastest core. Hence performance asymmetry has been shown to be a very efficient approach in multi-core architectures ([7]).

- Transistor failures. Some parts of the CPU may get damaged over time and become automatically disabled by the CPU. Since such components may fail in certain cores independently of the other cores, performance asymmetry may arise in initially symmetric cores over time ([5]).

- Power-saving policies. Different cores may switch to different P- or C-power-states at different times in order to save power. At different P-states, otherwise identical cores show different clock frequencies. If an OS scheduler manages to take this into account for processes not in need of all system resources, the system can remain more energy-efficient over the execution time while giving away only little or no performance at all ([8]).

Hence performance asymmetry, the fact that various CPU components can be shared among cores, and non-uniform access to computation resources such as memory mandate the design of efficient multi-core scheduling mechanisms or scheduling frameworks at the operating system level. Multi-core processors have gone mainstream, and while there may be the demand that they are used efficiently in terms of performance, the currently fashionable term "Green IT" also motivates the energy-efficient use of the CPU cores.
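To make the contrast concrete, the following fragment is a minimal sketch of the naive shortest-run-queue policy described at the beginning of this section. It is not taken from any real scheduler; the core count and queue lengths are invented, and, as the code makes obvious, nothing about cache working sets, shared resources or NUMA distance enters the placement decision.

    /* Naive per-core run-queue placement: always pick the shortest queue. */
    #include <stdio.h>

    #define NUM_CORES 4

    int main(void)
    {
        int queue_len[NUM_CORES] = { 3, 1, 2, 5 };   /* tasks currently queued per core */
        int incoming_tasks = 6;

        for (int t = 0; t < incoming_tasks; t++) {
            int shortest = 0;
            for (int c = 1; c < NUM_CORES; c++)
                if (queue_len[c] < queue_len[shortest])
                    shortest = c;

            queue_len[shortest]++;                   /* no notion of cache working sets */
            printf("task %d -> core %d\n", t, shortest);
        }
        return 0;
    }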
Section 2 will explore how far current operating systems have evolved in support of the new architectures.

2. OS process scheduling: state of the art

2.1 Linux scheduler

2.1.1 The Completely Fair Scheduler

The Linux scheduler in versions prior to 2.6.23 performed its tasks in complexity O(1) by basically just using per-CPU run-queues and priority arrays ([9]). Kernel version 2.6.23, which was released on October 9, 2007, introduced the so-called Completely Fair Scheduler (CFS). The change in scheduler was mainly motivated by the failure of the old scheduler to correctly predict whether applications are interactive (I/O-bound) or CPU-intensive ([10]). The new scheduler therefore completely abandons the notion of different kinds of processes and treats them all equally. A red-black tree is used as the data structure to order the tasks according to their right to use the processor for a predefined interval until the next context switch. The process positioned at the leftmost node of the tree is the one most entitled to use the processor at the time it occupies that position. The position of a process in the tree depends only on the wait time of the process in the run-queue (including the time the process is actually waiting for events) and on the process priority ([11]). This concept is fairly simple, but it works with all kinds of processes, especially interactive ones, since they get a boost just by being credited for their I/O waiting time. However, the total scheduling complexity increases to O(log n), where n is the number of processes in the run-queue, since at every context switch the process has to be reinserted into the red-black tree.

The scheduling algorithm itself was not designed with special consideration for multi-core architectures. When Ingo Molnar, the designer of the scheduler, was asked what the implications for HT/SMP/NUMA architectures would be, he answered that there would inevitably be some effect, and that if it is negative, he will fix it. He admits that the fairness approach can result in increased cache-coldness for processes in some situations ([12]). However, the red-black trees of CFS are managed per run-queue ([13]), which assists in cooperation with the Linux load balancer.
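The following fragment is a deliberately simplified, hypothetical sketch of the CFS idea described above: every runnable task carries a key that grows while it consumes CPU time (weighted by its priority), and the scheduler always runs the task with the smallest key, so tasks that have been waiting, e.g. for I/O, are automatically favoured. The real CFS keeps the tasks in a per-run-queue red-black tree and picks the leftmost node in O(log n); the sketch below merely scans a small array, and all task names and numbers are invented.

    /* Simplified "pick the leftmost task" idea of CFS. */
    #include <stdio.h>

    struct task {
        const char *name;
        double      vruntime;   /* grows while the task runs; low value = waited long */
        double      weight;     /* derived from priority; higher weight = slower growth */
    };

    static struct task *pick_next(struct task *tasks, int n)
    {
        struct task *leftmost = &tasks[0];
        for (int i = 1; i < n; i++)
            if (tasks[i].vruntime < leftmost->vruntime)
                leftmost = &tasks[i];
        return leftmost;                      /* the task most entitled to run */
    }

    int main(void)
    {
        struct task tasks[] = {
            { "io-bound", 0.0, 1.0 },         /* just woke up after waiting for I/O   */
            { "cpu-hog",  3.0, 1.0 },         /* has already consumed plenty of CPU   */
            { "batch",    2.0, 0.5 },         /* lower weight: key grows faster       */
        };
        int n = 3;

        for (int tick = 0; tick < 6; tick++) {
            struct task *t = pick_next(tasks, n);
            printf("tick %d: running %s (key %.2f)\n", tick, t->name, t->vruntime);
            t->vruntime += 1.0 / t->weight;   /* charge the slice; task is "reinserted" */
        }
        return 0;
    }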
2.1.2 Scheduling domains

Linux load balancing takes care of different cache models and computing architectures, but at the moment not necessarily of performance asymmetry. The underlying model of the Linux load balancer is the concept of scheduling domains, which was introduced in kernel version 2.6.7 due to the unsatisfying performance of Linux scheduling on SMP and NUMA systems in prior versions ([14]). Basically, scheduling domains are hierarchical sets of computation units on which scheduling is possible; the scheduling-domain architecture is constructed based on the actual hardware resources of a computing element ([9]). Scheduling domains contain lists of scheduling groups that share common properties.

For example, the way scheduling should be done on the two logical processors of an HT system and on the two physical processors of an SMP system differs: the HT cores share a common cache and memory hierarchy, so task migration is not a problem if one of the logical cores becomes idle. However, in an SMP or multi-core system, in which the cache or parts of the cache are administered by each core separately, migrating tasks with a large working set may become problematic. This applies even more to NUMA machines, where different CPUs may be closer to or more remote from the memory the process is using. Therefore, all these architectures have to be treated differently.

The scheduling-domain concept introduces scheduling domains, logical unions of computing resources that share common properties and which it is reasonable to treat equally, as well as CPU groups within these domains. Those groups contain the hardware-addressable computing resources that are part of the domain, among which the balancer can try to even out the domain load.

Scheduling domains are hierarchically nested: there is a top-level domain containing all other domains of the physical system Linux is running on. Depending on the actual architecture, the sub-domains represent NUMA node groups, physical CPU groups, multi-core groups or SMT groups in a respective hierarchical nesting. This structure is built automatically based on the actual topology of the system, and for reasons of efficiency each CPU keeps a copy of every domain it belongs to. For example, a logical SMT processor that is at the same time a core in a physical multi-core processor on a NUMA node with multiple (SMP) processors would administer four sched_domain structures in total, one for each level of parallel computing it is involved in ([15]).

Figure 2: Example hierarchy in the Linux scheduling domains

Load balancing takes place at scheduling-domain level, between the different groups. Each domain level is sensitive to the constraints set by its properties regarding load balancing. For example, load balancing happens very often between logical simultaneous multithreading cores, but very rarely on the NUMA level, where remote memory access is costly.
The scheduling domain for multi-core processors was added with kernel version 2.6.17 ([16]) and especially considers the shared last-level cache that multi-core architectures frequently possess. Hence, on an SMP machine with two multi-core packages, two tasks will be scheduled on different packages if both packages are currently idle, in order to make use of the overall larger cache.

In recent Linux kernels, the multi-core scheduling domain also offers support for energy-efficient scheduling, which can be used if, for example, the powersave governor is set in the cpufreq tool. Saving energy can be achieved by changing the P- and C-states of the cores in the different packages. However, P-state transitions are made by adjusting the voltage and the frequency of a core, and since there is only one voltage regulator per socket on the mainboard, the P-state depends on the busiest core. So, as long as any core in a package is busy, the P-state will be relatively low, which corresponds to a high frequency and voltage. While the P-states remain relatively fixed, the C-states can be manipulated. Adjusting the C-states means turning off parts of the registers, blocking interrupts to the processor, etc. ([17]), and can be done on each core independently. However, the shared cache features its own C-state regulator and will always stay in the lowest C-state that any of the cores is in. Therefore, energy efficiency is often limited to adjusting the C-state of a non-busy core while leaving the other C-states and the package's P-state low.

Linux scheduling within the multi-core domain with the powersave governor turned on will attempt to schedule multiple tasks on one physical package as long as that is feasible. This way, other multi-core packages are allowed to transition into higher P- and C-states. The author of [9] claims that the performance impact will generally be relatively low and that the performance-loss/power-saving trade-off will be rewarding if the energy-efficient scheduling approach is used.

2.2 Windows scheduler

In Windows, scheduling is conducted on threads. The scheduler is priority-based, with priorities ranging from 0 to 31. Timeslices are allocated to threads in a round-robin fashion; these timeslices are assigned to the highest-priority threads first, and only if no thread of a given priority is ready to run at a certain time may lower-priority threads receive the timeslice. However, if higher-priority threads become ready to run, the lower-priority threads are preempted.

In addition to the base priority of a thread, Windows dynamically changes the priorities of low-prioritized threads in order to ensure the perceived responsiveness of the operating system. For example, the thread associated with the foreground window on the Windows desktop receives a priority boost. After such a boost, the thread priority gradually decays back to the base priority ([21]).

Scheduling on SMP systems is basically the same, except that Windows keeps the notion of a thread's processor affinity and an ideal processor for a thread. The ideal processor is the processor with, for example, the highest cache locality for a certain thread. However, if the ideal processor is not idle at the time of lookup, the thread may just run on another processor. In [21] and other sources, however, no explicit information is given on scheduling mechanisms specific to multi-core architectures.
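The following is a rough, hypothetical sketch of the priority mechanism described above: the highest-priority ready thread is picked, ties within one priority level are broken round-robin, and a dynamic boost decays back towards the base priority after each timeslice. All thread names and priority values are invented, and real Windows details such as anti-starvation boosts and processor affinity are omitted.

    /* Priority-based round-robin selection with a decaying priority boost. */
    #include <stdio.h>

    #define NUM_THREADS 3

    struct thread {
        const char *name;
        int base_priority;        /* 0..31 in Windows terms        */
        int current_priority;     /* base priority + dynamic boost */
    };

    static int pick_highest(struct thread *t, int n, int last)
    {
        int best = -1;
        for (int i = 0; i < n; i++) {
            int idx = (last + 1 + i) % n;          /* round-robin tie breaking */
            if (best < 0 || t[idx].current_priority > t[best].current_priority)
                best = idx;
        }
        return best;
    }

    int main(void)
    {
        struct thread t[NUM_THREADS] = {
            { "background", 4, 4 },
            { "foreground", 8, 8 + 2 },            /* received a foreground boost */
            { "normal",     8, 8 },
        };
        int last = -1;

        for (int slice = 0; slice < 5; slice++) {
            int i = pick_highest(t, NUM_THREADS, last);
            printf("slice %d: %s (prio %d)\n", slice, t[i].name, t[i].current_priority);
            if (t[i].current_priority > t[i].base_priority)
                t[i].current_priority--;           /* boost decays back to base */
            last = i;
        }
        return 0;
    }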
2.3 Solaris scheduler

In Solaris scheduler terminology, processes are called kernel- or user-mode threads, depending on the space in which they run. User threads do not exist only in user space: whenever a user thread is created, a so-called lightweight process is set up that connects the user thread to a kernel thread. These kernel threads are the objects of scheduling.

Solaris 10 offers a number of scheduling classes, which may co-exist on one system. The scheduling classes provide an adaptive model for the specific types of applications that are intended to run on top of the operating system. [18] mentions Solaris 10 scheduling classes for:

- Standard (timeshare) threads, whose priority may be adapted by the scheduler.
- Interactive applications (the currently active window in the graphical user interface).
- Fair sharing of system resources (instead of priority-based sharing).
- Fixed-priority threads; the priority of threads scheduled with this scheduler does not vary over the scheduling time.
- System (kernel) threads.
- Real-time threads, with a fixed priority and time share. Threads in this scheduling class may preempt system threads.

The scheduling classes for timeshare, fixed priority and fair sharing are not recommended for simultaneous use on a system, while other combinations of scheduling classes on a set of CPUs are possible.

The timeshare and interactive schedulers are quite similar to the old Linux scheduler (before CFS) in their attempt to identify I/O-bound processes and provide them with a priority boost. Threads have a fixed time quantum they may use once they get the context, and they receive a new priority based on whether they fully consume their time quantum and on their waiting time for the context. Fair-share scheduling uses a fixed time quantum (share) allocated to processes ([19]) as a base for scheduling. Different processes (actually collections of processes, or, in Solaris 10 terminology, projects) compete for quanta on a computing resource, and their position in that competition depends on how large the value they have been assigned is in relation to the total number of quanta on the computing resource.

Solaris explicitly deals with the scenario that parts of the processor's resources may be shared, as is likely with typical multi-core processors. There is a kernel abstraction called processor group (pg_t) that is built according to the actual system topology and represents logical CPUs that share some physical properties (like caches or a common socket). These groupings can be investigated by the dispatcher, e.g. in order to maintain logical CPU affinity for the purpose of cache-hotness where it is reasonable. Quite similarly to the concept of Linux's scheduling domains, Solaris 10 tries to achieve load balancing on multiple levels simultaneously (for example if there are physical CPUs with multiple cores and SMT) ([20]).
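As a toy illustration of the fair-share idea, the following sketch hands out CPU quanta to projects so that, over time, each project's share of the CPU approaches its assigned share of the total; each quantum goes to the project that is currently furthest below its entitlement. The project names and share values are invented, and the actual FSS implementation in Solaris is considerably more involved.

    /* Proportional fair-share allocation of CPU quanta to projects. */
    #include <stdio.h>

    struct project {
        const char *name;
        int shares;          /* administratively assigned weight */
        double received;     /* quanta received so far           */
    };

    int main(void)
    {
        struct project p[] = { { "web", 60, 0 }, { "batch", 30, 0 }, { "backup", 10, 0 } };
        int n = 3, total_shares = 100;

        for (int q = 1; q <= 10; q++) {            /* hand out 10 quanta */
            int pick = 0;
            double best_deficit = -1e9;
            for (int i = 0; i < n; i++) {
                double entitled = (double)q * p[i].shares / total_shares;
                double deficit  = entitled - p[i].received;  /* how far below its share */
                if (deficit > best_deficit) { best_deficit = deficit; pick = i; }
            }
            p[pick].received += 1.0;
            printf("quantum %2d -> %s\n", q, p[pick].name);
        }
        return 0;
    }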
3. Ongoing research on the topic

Research on multi-core scheduling deals with a number of different topics, many of which are orthogonal (e.g. maximizing fairness and throughput). The purpose of this section is to present an interesting selection of different approaches to multi-core scheduling. Sections 3.1 and 3.2 summarize proposals for improving fairness and load balancing on current multi-core architectures, while sections 3.3 and 3.4 concentrate on approaches for scheduling on promising new computing architectures, such as multi-core processors with asymmetric performance and many-core CPUs.

3.1 Cache-Fairness

Several studies (e.g. [22], [23]) suggest that operating system schedulers deal insufficiently with threads that allocate large parts of the shared level 2 cache and thus slow down threads running on the other core that uses the same cache. The situation is unsatisfactory for several reasons: first, it can lead to unpredictable execution times and throughput, and second, scheduling priorities may lose their effectiveness because of threads running on cores with aggressive co-runners (i.e. threads running on another core in the same package).

Figure 3 shows such a scenario: thread B uses the larger part of the shared cache and may thus negatively influence the cycles per instruction that thread A achieves during its CPU time share. L2 cache misses are more costly than L1 cache misses, because the latency to memory is larger than to the next cache level; however, it is mostly the L2 cache that is shared among different cores.

The authors of [22] try to mitigate the above-mentioned effects by introducing a cache-fairness-aware scheduler for the Solaris 10 operating system.

Figure 3: Unfair cache utilization by thread B

In their scheduling algorithm, the threads on a system are grouped into a best-effort class and a cache-fair class. Best-effort threads are penalized for the sake of the performance stability of cache-fair threads if necessary, but not vice versa. However, care is taken that this does not result in inadequate discrimination of best-effort threads. Fairness is enforced by allocating longer time shares to cache-fair threads that suffer from cache-intensive co-runners, at the expense of those co-runners if they are best-effort threads. Figure 4 illustrates that process.
Figure 4: Restoring fairness by adjusting timeshares

In order to compute the quantum that a thread is entitled to, the algorithm uses an analytical model to estimate a few reference values that would hold if the thread had run under fair circumstances, namely the fair L2 cache miss rate, the fair CPI rate and the fair number of instructions (during a quantum). All those values are based on the assumption of fair circumstances, so the difference to the actual values is computed and a quantum extension is calculated which should serve as compensation. Those calculations are done once for new cache-fair class threads: their L2 cache miss rates and other data are measured with different co-runners. Subsequently, the dispatcher periodically selects best-effort threads from which it takes parts of their quanta and assigns them as compensation to the cache-fair threads. New cache-fair threads are not explicitly scheduled with different co-runners in the analysis phase; instead, whenever new combinations of cache-fair threads with co-runners occur, analytical input is mined from the CPU's hardware counters.

The authors of [22] state that according to their measurements, the penalties on best-effort threads are low, while the algorithm actually enforces priorities better than standard scheduling and improves fairness. These claims are supported by experimental data gathered using the SPEC CPU2000 suite on an UltraSPARC T1 processor. The experiments measure the time it takes a thread to complete a specific benchmark while a cache-intensive second thread is executed in the same package. The execution times under these circumstances are compared with the times of threads running with co-runners with low cache requirements. This comparison shows differences of up to 37% in execution time between the two scenarios on a system with a standard scheduler, while with the cache-fair scheduler the variability decreases to 7%. At the same time, however, measurements of the execution times of threads in the best-effort scheduling class reveal a slowdown of up to 8% for some threads (while some even experience a speed-up).
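The following fragment illustrates the compensation arithmetic in a hypothetical, much simplified form: the instructions a cache-fair thread actually retired during its quantum are compared with the number it would have retired at its estimated fair CPI, and its next quantum is extended (at the expense of a best-effort co-runner) by the cycles needed to make up the shortfall. The fair-CPI estimate and all other numbers are invented; in [22] these values come from an analytical model and hardware performance counters.

    /* Quantum compensation based on fair vs. actual CPI. */
    #include <stdio.h>

    int main(void)
    {
        double quantum_cycles = 1e7;   /* nominal timeslice length in cycles     */
        double fair_cpi       = 1.2;   /* estimated CPI under fair cache sharing */
        double actual_cpi     = 1.8;   /* measured CPI with aggressive co-runner */

        double fair_instructions   = quantum_cycles / fair_cpi;
        double actual_instructions = quantum_cycles / actual_cpi;
        double shortfall           = fair_instructions - actual_instructions;

        /* Extra cycles needed to retire the missing instructions at the observed CPI. */
        double extension = shortfall * actual_cpi;

        printf("shortfall: %.0f instructions\n", shortfall);
        printf("extend cache-fair quantum by %.0f cycles\n", extension);
        printf("shorten best-effort co-runner quantum by %.0f cycles\n", extension);
        return 0;
    }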
3.2 Balancing core assignment

Fedorova et al. ([27]) argue that during operating system scheduling, several aspects have to be considered besides optimal performance. It is shown that scheduling tasks to the cores in an imbalanced way results in jittery performance (i.e. unpredictability of a task's completion time) and, as a consequence, insufficient priority enforcement. It could be part of the operating system scheduler's job to ensure that jobs are evenly assigned to cores. In the approach described in [27], this is done by using a self-tuning scheduling algorithm based on a per-core benefit function. This benefit function is based on three input components:

1) The normalized core preference of a thread, which is based on the instructions per cycle that a thread j can achieve on a certain core i, IPC(j,i), normalized by max_k IPC(j,k) (where k ranges over all CPUs/cores);

2) the cache affinity, a value which is 1 if thread j was scheduled on core i within a tuneable time period and 0 otherwise;

3) the average cache investment of a thread on a core, which is determined by inspecting the hardware cache-miss counter from time to time.

This benefit function can then be used to determine whether it would be beneficial to migrate threads from core i to core j. For each core, a benefit value B(i,0) is computed that represents the case of no thread migration taking place. For each thread k on a core, an updated benefit value B(i,k) is computed for the hypothetical scenario that the thread is migrated away to another core. Of course, this benefit will increase, since fewer threads are executed on the core. But the thread that is taken away in the hypothetical scenario has to be migrated to another core, which influences the benefit value of the target core. Therefore, the updated benefit value of the system core j to which the thread in question would be migrated also has to be computed; it is called B(j,k+). The hypothetical migration of thread k from core i to core j becomes reality if

B(i,k) + B(j,k+) > B(i,0) + B(j,0) + a * DCAB + b * DRTF.

DCAB represents a system-wide balance constraint, while DRTF ensures per-job response-time fairness (i.e. the slowdown that results for the thread in question from the migration does not exceed some maximum value). These two constants, together with the criteria included in the benefit function itself (most notably cache affinity), should help to ensure that the self-tuning fulfils the three goals of optimal performance, core-assignment balance and response-time fairness. However, the authors have not yet actually implemented the scheduling modification in Linux, and hence the statements on its effectiveness remain somewhat theoretical.
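The migration test itself can be written down compactly; the following hypothetical sketch evaluates the inequality above for invented benefit values. In the real algorithm the benefit values would be produced by the per-core benefit function (core preference, cache affinity and cache investment), which is not modelled here.

    /* Migration decision based on the benefit inequality described above. */
    #include <stdio.h>
    #include <stdbool.h>

    static bool should_migrate(double b_i_0, double b_j_0,        /* no migration            */
                               double b_i_k, double b_j_k_plus,   /* after migrating thread k */
                               double a, double dcab,
                               double b, double drtf)
    {
        return (b_i_k + b_j_k_plus) > (b_i_0 + b_j_0 + a * dcab + b * drtf);
    }

    int main(void)
    {
        /* Invented benefit values for source core i and target core j. */
        double b_i_0 = 3.0, b_j_0 = 2.0;        /* current benefit of both cores          */
        double b_i_k = 4.2;                     /* benefit of core i if thread k leaves   */
        double b_j_k_plus = 1.6;                /* benefit of core j if it receives k     */
        double a = 1.0, dcab = 0.3;             /* system-wide balance constraint         */
        double b = 1.0, drtf = 0.2;             /* per-job response-time fairness         */

        if (should_migrate(b_i_0, b_j_0, b_i_k, b_j_k_plus, a, dcab, b, drtf))
            printf("migrate thread k from core i to core j\n");
        else
            printf("keep thread k on core i\n");
        return 0;
    }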
3.3 Performance asymmetry

It has been advocated that building multi-core chips with asymmetric performance of the different cores can have advantages for the overall processing speed of a CPU. For example, it can prove beneficial if one fast core can be used to speed up parts that can hardly be parallelized, while multiple slower cores come into play when parallel code parts are executed. By keeping the cores for parallel execution slower than the core(s) for serial execution, die area and cost can be saved ([24]) while power consumption may be reduced.

[25] closely analyzes the impact of performance asymmetry on the average speedup of an application as the number of cores increases. The paper concludes that "[a]symmetric multicore chips can offer maximum speedups that are much greater than symmetric multicore chips (and never worse)". Hence, performance asymmetry at the processor-core level seems to be a promising approach for future multi-core architectures. [26] suggests that asymmetric multiprocessing CPUs do exceptionally well for moderately parallelized applications, but do not scale much worse with highly parallel programs (see Figure 5).

Figure 5: Comparison of speedup with SMP and AMP using highly parallel programs (left), moderately parallel programs (middle) and highly sequential programs (right). (Image source: http://www.intel.com/technology/magazine/research/power-efficiency-0206.htm)

Apart from explicit design, performance asymmetry can also occur in initially symmetric multi-core processors through power-saving mechanisms (increasing the C-state) or through failures of transistors that lead to the disabling of parts of a core's components. Problems arise from the fact that the programmer usually assumes symmetric performance of the processor cores and designs her programs accordingly. Therefore, the operating system scheduler should support processor performance asymmetry, which is currently not the case for most schedulers. However, it would be imaginable to see this as a Linux scheduling domain in the future.

[5] describes AMPS, an approach to how a performance-asymmetry-aware scheduler for Linux could look. Basically, the scheduler consists of three components: asymmetry-specific load balancing, a faster-core-first scheduler, and a migration mechanism specifically for NUMA machines that will not be covered in detail here. The scheduling mechanism tries to achieve better performance, fairness (with respect to the thread priorities) and repeatability of performance results.

In order to conduct load balancing, the core performance is assessed in a first step. AMPS measures core performance at boot time, using benchmarks, and sets the performance quantifier of the slowest core to 1 and that of faster cores to a number higher than 1. The scaled load of a core is then the run-queue length of the core divided by its performance quantifier. If this scaled load lies within a maximum and a minimum threshold, the system is considered load-balanced; otherwise threads are migrated. Through the scaling factor, faster cores receive more workload than slower ones.

Besides the load balancing, cores with lower scaled load are preferred in thread placement. Whenever new threads are scheduled, the new scaled load that would apply if the thread were scheduled to one specific core is computed. The thread is then scheduled to the core with the least new scaled load; in case of a tie, faster cores are preferred. Additionally, the load balancer may migrate threads even if the source core of the migration becomes idle by doing so. AMPS only migrates threads if the new scaled load of the destination core does not exceed the scaled load on the source core.
This way, the benefits of available fast cores can be fully exploited without overburdening them. It could be expected that frequent core migration results in performance loss through cache misses (e.g. in the last-level cache). However, the experimental results in [5] reveal no excessive performance loss from the fact that task migration among cores occurs more often than in standard schedulers.

Figure 6: Speedup of the AMPS scheduler compared to the Linux scheduler on an AMP with two fast and six slow cores (picture taken from [5])

Instead, performance on SMP systems measurably improves (see Figure 6; standard scheduler speed would be 1, the median speedup is 1.16), while fairness and repeatability are preserved better than with standard Linux (there is less deviation in the execution times of different processes).
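The placement rule of AMPS can be illustrated with a small, hypothetical sketch: each core has a performance quantifier (1 for the slowest core, larger for faster ones), the scaled load is the run-queue length divided by that quantifier, and a new thread is placed on the core with the smallest resulting scaled load, with ties broken in favour of faster cores. The core speeds and run-queue lengths below are invented.

    /* AMPS-style thread placement based on scaled load. */
    #include <stdio.h>

    #define NUM_CORES 4

    int main(void)
    {
        double perf[NUM_CORES]  = { 2.0, 2.0, 1.0, 1.0 };  /* two fast, two slow cores */
        int    rqlen[NUM_CORES] = { 3,   2,   1,   1   };  /* current run-queue length */

        int best = -1;
        double best_scaled = 0.0;

        for (int c = 0; c < NUM_CORES; c++) {
            double scaled = (rqlen[c] + 1) / perf[c];       /* load if placed here      */
            if (best < 0 || scaled < best_scaled ||
                (scaled == best_scaled && perf[c] > perf[best])) {
                best = c;                                   /* prefer smaller scaled load, */
                best_scaled = scaled;                       /* faster core on a tie        */
            }
        }
        printf("place new thread on core %d (scaled load %.2f)\n", best, best_scaled);
        return 0;
    }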
3.4 Scheduling on many-core architectures

Current research points to the potential of future CPU architectures that consist of a multitude of different cores (tens to hundreds) in order to prolong the time the industry can keep up with Moore's prediction. Such CPU architectures are going to require novel approaches to thread scheduling, since the available shared memory per core is potentially smaller, while main-memory access time increases and single-threaded performance decreases. Even more so than with current CMP chips, schedulers that treat such large-scale CMP architectures just like SMP systems are going to fail with respect to performance.

[28] identifies three major challenges in scheduling on many-core architectures: the comparatively small cache sizes, which render memory access costly; the fact that non-specialized programmers will have to write code for them; and the wide range of application scenarios that have to be considered for such chips. The latter two challenges result from the projected general-purpose use of future many-core architectures.

In order to deal with these challenges, an OS-independent experimental runtime environment called McRT is presented in [28]. The runtime environment was built from scratch and is independent of operating system kernels; hence the performed operations occur at user level. The connection to the underlying operating system is established using so-called host adapters, while programming environments like pthreads or OpenMP can invoke the scheduling environment via client adaptors. For programmability it provides high-level transactions (instead of locks), and heterogeneity is alleviated by giving the user the choice among different runtime policies (which influence the scheduling behaviour).

The overall target of the scheduler is to maximize resource utilization and provide the user with flexible scheduler configurability. Basically, the scheduler is configured using three parameters P, Q and T, which respectively denote the number of (logical) processors, task queues and threading abstractions. P and Q change the scheduling behaviour from strict affinity to work stealing, and T can be used to specify different behaviour for different threads. This way, the concept of scheduler domains can be realized.

It is notable that the scheduling system does not use preemption; instead, threads are expected to yield at certain times. This design choice has been motivated by the authors' belief that preemption stems from the need of time-sharing expensive and fast resources, which will become obsolete with many-core architectures. The runtime actively supports constructs such as barriers, which are often needed in parallel programming. Barriers are designed to avoid busy waiting: for example, a thread yields once it has reached the barrier but will not be re-scheduled until all other threads have reached the barrier. With a preemptive scheduling mechanism, the thread would receive the context from time to time just to check whether other threads have reached the barrier; with the integrated barrier support based on the cooperative scheduling approach used in McRT, this does not happen. The client-side adaptor, e.g. for OpenMP, promises to directly translate many OpenMP constructs to the McRT API.

[28] also contains a large section on experimental results from benchmarks of typical desktop and scientific applications, such as the XviD MPEG-4 encoder, singular value decomposition (SVD) and neural networks (SOM). The results were gathered on a cycle-accurate many-core simulator with 32 KByte of L1 cache shared among 4 simultaneously multithreaded cores which form one physical core; 32 such physical cores share 2 MByte of L2 cache and 4 MByte of off-chip L3 cache. The simulator provides a cost-free MWait instruction that allows a thread to tell the processor that it is blocked and which resource it is waiting for; only when the resource becomes available will the CPU execute the thread again. Hence, threads that are waiting on barriers and locks do not consume system resources. It is important to keep in mind that such a mechanism does not exist on current physical processors when viewing the experimental speedup results for McRT.
Figure 7: Speedup of XviD encoding in the McRT framework, compared to single-core performance (picture from [28])

The experiments reveal that XviD encoding scales very well on McRT (Figure 7; 1080p and 768p denote different video resolutions, and the curve labelled "linear" models the ideal speedup). However, the encoding process was explicitly tweaked for the many-core scenario: under the condition that only very little fast memory exists per logical core, parallelization was not conducted at frame level; instead, single frames were partitioned into parts which were encoded in parallel using OpenMP. The scalability of SVD and SOM is quite similar to that of XviD; more figures and evaluations of different scheduling strategies can be found in [28].
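To illustrate the cooperative barrier behaviour described above, the following toy sketch simulates tasks that run until they voluntarily yield; a task that reaches the barrier marks itself as waiting and is simply skipped by the scheduler, so it consumes no CPU time until every task has arrived. This is a single-threaded simulation with invented work amounts and is not the actual McRT API.

    /* Cooperative scheduling with a barrier: waiting tasks are skipped, not polled. */
    #include <stdio.h>
    #include <stdbool.h>

    #define NUM_TASKS 3

    struct task {
        int  work_left;      /* units of work before reaching the barrier */
        bool at_barrier;
    };

    int main(void)
    {
        struct task tasks[NUM_TASKS] = { { 2, false }, { 4, false }, { 3, false } };
        int arrived = 0;

        while (arrived < NUM_TASKS) {
            for (int i = 0; i < NUM_TASKS; i++) {
                if (tasks[i].at_barrier)
                    continue;                    /* waiting tasks consume no CPU      */
                tasks[i].work_left--;            /* run one unit of work, then yield  */
                printf("task %d ran, %d units left\n", i, tasks[i].work_left);
                if (tasks[i].work_left == 0) {
                    tasks[i].at_barrier = true;  /* reached the barrier: block itself */
                    arrived++;
                    printf("task %d reached the barrier (%d/%d)\n", i, arrived, NUM_TASKS);
                }
            }
        }
        printf("all tasks passed the barrier\n");
        return 0;
    }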
4. Conclusion

The probability is high that processor architectures will undergo extensive changes in order to keep up with Moore's law in the future. AMPs and many-core CPUs are just two proposals for innovative new architectures that may help prolong the time horizon within which Moore's law can stay valid. Operating system schedulers are going to have to adapt to the changing underlying architectures.

The scheduler domains of Linux and Solaris add some urgently needed OS flexibility, because computing architectures that exhibit different behaviour with regard to memory access or single-threaded performance can be integrated into the load-balancing hierarchy quite easily; however, it must be argued that in the future probably more will have to be done in the scheduling algorithms themselves. But scheduler domains at least provide the required flexibility at the level of threads.

Sadly, it has to be said that current Windows schedulers will not scale with the number of cores or with performance asymmetry at any rate. Basically, the Windows scheduler treats multi-core architectures like SMP systems and hence cannot give proper care to the peculiarities of such CPUs, like the shared L2 cache (and, possibly, the varying or simply bad single-threaded performance that is going to be a characteristic of emerging future architectures). Ty Carlson, director of technical strategy at Microsoft, even mentioned at a panel discussion that current Windows releases (including Vista) were designed to run on "1, 2, maybe 4 processors" but would not scale beyond that (see http://www.news.com/8301-10784_3-9722524-7.html). He seems to be perfectly right when he says that future versions of Windows would have to be fundamentally redesigned.

Current research shows the road such a redesign could follow. The approach described in section 3.3 seems to perform quite well for multi-core processors with asymmetric performance. The advantage is that the load-balancing algorithm can be (and was) implemented as a modification to current operating system kernels and hence can be made available quickly once asymmetric architectures gain widespread adoption. Experimental evaluation of the scheduling algorithm reveals promising results, also with regard to fairness.

Section 3.4 presents the way in which futuristic schedulers on upcoming many-core architectures could operate. The runtime environment McRT makes use of interesting techniques, and the authors of the paper manage to explain intelligibly why preemptive scheduling is going to be obsolete on many-core architectures. However, their implementation is realized in user space and burdens the programmer/user with a multitude of configuration options and programming decisions that are required in order for the framework to guarantee optimal performance.

[29] introduces an easier-to-use thread scheduling mechanism based on the efforts of McRT; experimental assessments which could testify to its performance, although planned, have not been conducted yet. It will be interesting to keep an eye on the further development of scheduling approaches for many-core architectures, since they might gain fundamentally in importance in the future.

Achieving fairness and repeatability on today's available multi-core architectures are the major design goals of the scheduling techniques detailed in sections 3.1 and 3.2.
The first approach is justified by a number of experimental results that show that priorities are actually enforced much better than with conventional schedulers; however, it remains to be seen whether the complexity of the scheduling approach, and the amount of overhead potentially introduced by it, justify that improvement.
Maybe it would be advantageous to consider implementing such mechanisms already at the hardware level, if possible.

The algorithm mentioned in section 3.2 has not been implemented yet, so it is an open question whether such a rather complicated load-balancing algorithm would be feasible in practice. From the description one can infer that it takes a lot of computation and thread migration to ensure the load balance, and it would be interesting to see the overhead that the computations and the induced cache misses impose on the system. Without any experimental data, those figures are hard to assess.
5. References

[1] G. E. Moore: Cramming more components onto integrated circuits. Electronics, Volume 38, Number 8, 1965.
[2] Why Parallel Processing: http://www.tc.cornell.edu/Services/Education/Topics/Parallel/Concepts/2.+Why+Parallel+Processing.htm
[3] O. Wechsler: Inside Intel Core Microarchitecture. Intel Technology Whitepaper, http://download.intel.com/technology/architecture/new_architecture_06.pdf
[4] Key Architectural Features: AMD Athlon X2 Dual-Core Processors, http://www.amd.com/us-en/Processors/ProductInformation/0,,30_118_9485_13041%5E13043,00.html
[5] Li et al.: Efficient Operating System Scheduling for Performance-Asymmetric Multi-Core Architectures. In: International Conference on High Performance Computing, Networking, Storage, and Analysis, 2007.
[6] Balakrishnan et al.: The Impact of Performance Asymmetry in Emerging Multicore Architectures. In: Proceedings of the 32nd Annual International Symposium on Computer Architecture, pages 506-517, June 2005.
[7] M. Annavaram, E. Grochowski and J. Shen: Mitigating Amdahl's Law through EPI Throttling. In: Proceedings of the 32nd Annual International Symposium on Computer Architecture, pages 298-309, June 2005.
[8] V. Pallipadi, S. B. Siddha: Processor Power Management Features and Process Scheduler: Do We Need to Tie Them Together? In: LinuxConf Europe 2007.
[9] S. B. Siddha: Multi-core and Linux Kernel, http://oss.intel.com/pdf/mclinux.pdf
[10] Linux 2.6.23 release notes: http://kernelnewbies.org/Linux_2_6_23
[11] http://lwn.net/Articles/230574/
[12] J. Andrews: Linux: The Completely Fair Scheduler, http://kerneltrap.org/node/8059
[13] A. Kumar: Multiprocessing with the Completely Fair Scheduler, http://www.ibm.com/developerworks/linux/library/l-cfs/index.html
[14] Kernel 2.6.7 Changelog: http://www.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.7
[15] Scheduling domains: http://lwn.net/Articles/80911
[16] Kernel 2.6.17 Changelog: http://www.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.17
[17] T. Kidd: C-states, C-states and even more C-states, http://softwareblogs.intel.com/2008/03/27/update-c-states-c-states-and-even-more-c-states/
[18] Solaris 10 Process Scheduling: http://www.princeton.edu/~unix/Solaris/troubleshoot/schedule.html
[19] Solaris manpage FSS(7): http://docs.sun.com/app/docs/doc/816-5177/fss-7?l=de&a=view&q=FSS
[20] Eric Saxe: CMT and Solaris Performance, http://blogs.sun.com/esaxe/entry/cmt_performance_enhancements_in_solaris
[21] MSDN section on Windows scheduling: http://msdn.microsoft.com/en-us/library/ms685096%28VS.85%29.aspx
[22] A. Fedorova, M. Seltzer and M. D. Smith: Cache-Fair Thread Scheduling for Multicore Processors. Technical Report TR-17-06, Harvard University, October 2006.
[23] S. Kim, D. Chandra and Y. Solihin: Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture. In: Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2004.
[24] D. Menasce and V. Almeida: Cost-Performance Analysis of Heterogeneity in Supercomputer Architectures. In: Proceedings of the 4th International Conference on Supercomputing, June 1990.
[25] M. D. Hill and M. R. Marty: Amdahl's Law in the Multicore Era. In: IEEE Computer, 2008.
[26] B. Crepps: Improving Multi-Core Architecture Power Efficiency through EPI Throttling and Asymmetric Multiprocessing, Intel Technology Magazine, http://www.intel.com/technology/magazine/research/power-efficiency-0206.htm
[27] A. Fedorova, D. Vengerov and D. Doucette: Operating System Scheduling on Heterogeneous Core Systems. To appear in: Proceedings of the First Workshop on Operating System Support for Heterogeneous Multicore Architectures, 2007.
[28] B. Saha et al.: Enabling Scalability and Performance in a Large Scale CMP Environment. In: Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems, 2007.
[29] M. Rajagopalan, B. T. Lewis and T. A. Anderson: Thread Scheduling for Multi-Core Platforms. In: Proceedings of the Eleventh Workshop on Hot Topics in Operating Systems, 2007.