load-balancing-method-for-embedded-rt-system-20120711-0940

Load-Balancing for Improving User Responsiveness on Multicore Embedded Systems

Jul-11, 2012

Geunsik Lim

Samsung Electronics Co., Ltd., Sungkyunkwan University

2/24

Who am I?

• Full name: Geunsik Lim

• E-Mail : [email protected], [email protected]

• Current : Senior software engineer at Samsung Electronics (http://www.samsung.com)

• Android localization: Korea Android community (http://www.kandorid.org)

• Past: S/W membership manager at Samsung Electronics

Senior engineer at ROBOST company

Systems administrator at Daegu Bank, Ltd.

South Korea

Ottawa

http://www.samsung.com/

3/24

TOC

1. Introduction

2. Existing methods

3. Operation zone based load-balancer

4. Evaluation

5. Further work

6. Conclusions

4/24

Introduction – Linux Features for Multicore

SMP scheduler (load-balancing): schedule(), load_balance(), migration_thread()

Synchronization: semaphore, spin-lock, FUTEX, atomic operations, per-CPU variables, RCU, work-queue

Interrupt load-balancing (or the user-space irqbalance daemon)

Affinity (interface that lets the system administrator prevent tasks from moving to another CPU)

• CPU affinity (shielded CPU)

• I/O affinity

• IRQ affinity

CPUSET (with process containers; cgroups): assign CPUs and memory on NUMA

CPU isolation: isolate a specific CPU (if you don't need load-balancing)

[Figure: tasks distributed over a multi-core system through parallelism and load-balancing]
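The CPU affinity interface listed on this slide can be tried from user space. The snippet below is a minimal sketch, assuming a Linux host and Python 3, that pins the calling process to one CPU with os.sched_setaffinity; the helper name pin_to_one_cpu and the choice of the lowest allowed CPU are illustrative assumptions, not part of the slides.

```python
import os


def pin_to_one_cpu(pid=0):
    """Pin a process (0 = the caller) to a single CPU and return its index."""
    allowed = os.sched_getaffinity(pid)   # CPUs the process may currently use
    target = min(allowed)                 # illustrative choice: lowest allowed CPU
    os.sched_setaffinity(pid, {target})   # the scheduler will no longer migrate it
    return target


if __name__ == "__main__":
    original = os.sched_getaffinity(0)
    cpu = pin_to_one_cpu()
    print(f"pinned to CPU {cpu}; affinity mask is now {os.sched_getaffinity(0)}")
    os.sched_setaffinity(0, original)     # restore the original mask
```

Once the mask contains a single CPU, the load-balancer has no other CPU to move the task to, which is the effect the affinity items above describe.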

5/24

Introduction – SMP Linux

Change-logs of the Linux kernel for SMP and ARM (up-to-date).

• 2.6.00 SMP scalability (per-CPU data structures)

• 2.6.16 SMP IRQ affinity

• 2.6.24 CPU isolation

• 2.6.28 Block: add support for IO CPU affinity

• 2.6.32 Enable rq CPU completion affinity by default (significantly speeds up databases)

• 2.6.33 Includes full support for ARM9 MPCore

• 2.6.37 Big Kernel Lock (BKL) made obsolete

• 2.6.38 Improve cpu-cgroup performance for SMP systems significantly by rewriting tg_shares_up

• 2.6.39 Ext4 SMP scalability - SMP speed-ups

• 3.1.00 Block: dynamic writeback throttling (SMP scaling problem fixed), strict CPU affinity

• 3.4.00 Memory resource controller (with cgroups)

The latest Linux kernels have mature SMP features.

• 2.6.15: SMP support for ARM11 MPCore

• 2.6.18: SMPnice

• 2.6.36: Support for S5PV310 (ARM Cortex-A9 multi-core)

The major features for ARM are merged into the mainline kernel.

6/24

Introduction – Considerable Factors for SMP Environment

• Problem: Corruption of shared resources by concurrent workers (e.g. writers)
  Solution: Use a locking mechanism (e.g. kernel lock facilities, application-level thread library).

• Problem: Synchronization overhead
  Solution: Increase or decrease the level of parallelism as appropriate.

• Problem: Task migration
  Solution: Adjust affinity manually (an ideal OS would schedule tasks automatically).

• Problem: Resource contention
  Solution: Run well-programmed software on a well-designed OS scheduler such as cgroups. Utilize sched_yield().

• Problem: False sharing
  Solution: Align allocated data to the cache-line size (via the compiler where possible).

• Problem: Routines used by many agents
  Solution: Implement thread-safe and re-entrant software.

• Problem: Cache-line-dependent task migration (e.g. ping-pong effect)
  Solution: Affinitize tasks to a specific CPU.

• Problem: Unfair cache requests
  Solution: Affinitize tasks to a specific CPU.
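The first row (protecting a shared resource from concurrent writers with a lock) can be illustrated at application level. This is a minimal sketch using Python's threading library; the function name run_workers and the shared counter are hypothetical stand-ins for any shared resource and are not taken from the slides.

```python
import threading


def run_workers(num_workers, increments_each):
    """Increment one shared counter from several threads under a lock."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments_each):
            with lock:            # only one writer updates the counter at a time
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


if __name__ == "__main__":
    print(run_workers(4, 1000))   # prints 4000
```

The lock keeps the result correct, and the time spent waiting on it is the synchronization overhead named in the second row.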

7/24

Related work - CPU Affinity Policy

This technique pins specific tasks to particular CPUs to avoid load-balancing operations.

• Apparatus and method for improved CPU affinity in a multiprocessor system, RA Alfieri, US Patent 5,745,778, 1998, Citations 167

• Affinity scheduling of processes on symmetric multiprocessing systems, KD Abramson, HB Butts Jr…, US Patent 5,506,987, 1996

• Migration policies for multi-core fair-share scheduling, D Choffnes, M Astley, ACM SIGOPS Operating Systems, 2008

8/24

Related work - Classification of RT & NRT tasks

This technique physically isolates time-critical tasks on a specific CPU.

• Shielded CPUs: real-time performance in standard Linux, ecee.colorado.edu, S Brosky, Linux Journal, 2004, Citations 11

• Shielded processors: Guaranteeing sub-millisecond response in standard Linux, S Brosky, Parallel and Distributed Processing, 2003

• A real-time Linux, V Yodaiken, Proceedings of the Linux Applications, 1997, Citations 167

9/24

Related work - A Partitioning method for Multi-processor

These techniques schedule tasks in kernel space by grouping or partitioning them according to their goals.

• Container-based operating system virtualization: a scalable, high-performance alternative to hypervisors, S Soltesz, H Pötzl, ME Fiuczynski, ACM SIGOPS, 2007, Citations 169

• Task partitioning: An innovation process variable, Eric von Hippel, MIT Sloan School of Management, Cambridge, MA 02139, U.S.A., 1 April 2002

• Process Partitioning for Distributed Embedded Systems, CODES '96 Proceedings of the 4th International Workshop on Hardware/Software Co-Design, 1996

10/24

Related work: Load-balancing on Linux for multicore system

• The load-balancer runs periodically and migrates tasks whenever it finds a load imbalance, aiming at optimal CPU utilization.

• The problem with this mechanism is that it migrates tasks unnecessarily even when the CPUs are not fully (100%) utilized.

• Real-time performance and middleware on multi-core linux platforms, Yuanfang Zhang, Washington University, 2008

• Load balancing control method for a loosely coupled multi-processor system and a device for realizing same, Toshio Hirosawa, Hitachi, Japan, Patent No. 4748558, May-1-1986

• Improve load balancing when tasks have large weight differential, Nikhil Rao, Google, http://lwn.net/Articles/409860

11/24

Problems of the existing load-balancer

In general, more CPU load leads to more frequent task migration and thus incurs higher cost. The cost can be broken down into direct, indirect, and latency costs as follows:

1. Direct cost

• The load-balancing cost of checking the load imbalance of CPUs for utilization and scalability in the multicore system

2. Indirect cost

• Cache invalidation

• Power consumption

3. Latency cost

• Scheduling latency

• Longer non-preemptible period

12/24

Operation zone based load-balancer: Task migration time

The figure shows the points in time at which the scheduler has to check whether task migration is needed to keep the CPU load balanced.

[Figure: task-migration inspection points, labeled (1), (2), and (3)]

13/24

Operation zone based load-balancer : Load-balancing operation zone

• The load-balancing operation zone consists of three scheduling-aware control areas.

• The "Cold zone" policy executes load-balancing loosely, for systems with low CPU utilization.

• The "Hot zone" policy executes load-balancing aggressively, like the existing mechanism.

• The "Warm zone" policy sits at the middle level between the "Cold zone" and the "Hot zone".

[Figure: CPU usage (%) from 0 to 100, divided into Hot Zone, Warm Zone (fluctuation spot with high, mid, and low spots), and Cold Zone; annotations: "Always load-balancing" and "No load-balancing"]
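The three zones can be sketched as a threshold check on CPU usage. The code below is an illustration only: the slides give no numeric boundaries, so COLD_MAX, HOT_MIN, and the warm_spot argument are hypothetical values, and treating the cold zone as "no load-balancing" and the warm zone as "balance only above the selected spot" is one reading of the figure annotations rather than the author's implementation.

```python
COLD_MAX = 30   # hypothetical upper bound of the cold zone (% CPU usage)
HOT_MIN = 80    # hypothetical lower bound of the hot zone (% CPU usage)


def classify_zone(cpu_usage, cold_max=COLD_MAX, hot_min=HOT_MIN):
    """Map a CPU usage percentage to 'cold', 'warm', or 'hot'."""
    if not 0 <= cpu_usage <= 100:
        raise ValueError("cpu_usage must be between 0 and 100")
    if cpu_usage >= hot_min:
        return "hot"
    if cpu_usage <= cold_max:
        return "cold"
    return "warm"


def should_balance(cpu_usage, warm_spot, cold_max=COLD_MAX, hot_min=HOT_MIN):
    """Decide whether to run the load-balancer for this CPU usage."""
    zone = classify_zone(cpu_usage, cold_max, hot_min)
    if zone == "hot":
        return True                # always load-balance
    if zone == "cold":
        return False               # assumption: skip load-balancing entirely
    return cpu_usage > warm_spot   # assumption: balance only above the spot


if __name__ == "__main__":
    for usage in (10, 40, 60, 95):
        print(usage, classify_zone(usage), should_balance(usage, warm_spot=50))
```

With a higher spot selected for the warm zone, fewer usage levels trigger balancing, so fewer task migrations occur.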

14/24

Operation zone based load-balancer : Calculating CPU utilization

• The warm zone consists of three spots that are managed by a score system.

• Controlling tasks is not simple because CPU utilization fluctuates under the "Warm zone" policy. We therefore support weight-based score management.

Please see the paper for details.

Weight-based score management for the warm zone:

• Based on the local CPU (default policy)

• Based on the average of all CPUs
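The two utilization sources named above (local CPU versus the average of all CPUs) and the idea of damping fluctuation with weights can be sketched as follows. The slides defer the actual scoring rules to the paper, so both functions are hypothetical illustrations: the policy names and the weighted mean over recent samples are assumptions, not the published algorithm.

```python
def utilization(per_cpu_usage, local_cpu, policy="local"):
    """Pick the utilization value that the warm-zone decision is based on."""
    if policy == "local":      # default policy on the slide: local CPU only
        return per_cpu_usage[local_cpu]
    if policy == "average":    # alternative policy: average of all CPUs
        return sum(per_cpu_usage) / len(per_cpu_usage)
    raise ValueError(f"unknown policy: {policy}")


def weighted_score(samples, weights):
    """Weighted mean of recent utilization samples (assumed scoring scheme)."""
    if not samples or len(samples) != len(weights):
        raise ValueError("samples and weights must be non-empty and equal length")
    return sum(s * w for s, w in zip(samples, weights)) / sum(weights)


if __name__ == "__main__":
    usage = [90, 10, 20, 40]                          # per-CPU usage in percent
    print(utilization(usage, local_cpu=0))            # prints 90
    print(utilization(usage, 0, policy="average"))    # prints 40.0
    print(round(weighted_score([40, 60, 80], [1, 2, 3]), 2))
```

Giving larger weights to newer samples lets the score follow a sustained change while smoothing short spikes.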

15/24

Latency factors in kernel-space

• The major factors that cause latency in kernel-space

[Figure: Latency Factors in Linux Kernel: hardware latency, interrupt latency, preemption latency, switching latency, wakeup latency, scheduling latency, per-CPU latency, misc. latency]

16/24

Evaluation environment

[Figure: timeline of an RT task that goes to sleep (1000 usec) alongside NRT/lower-priority tasks (5,000 usec); the measured latency comprises interrupt latency, wakeup latency, preemption latency, and switching latency]
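The measurement idea in this environment (a task sleeps for a fixed interval and the overshoot on wakeup is recorded as latency) can be sketched in user space. This is a rough illustration in the spirit of cyclictest, not a replacement for it: the function name is hypothetical, no real-time priority is set, and interpreter overhead inflates the numbers.

```python
import time


def measure_wakeup_latency(interval_us=1000, loops=100):
    """Sleep repeatedly and return the wakeup overshoot of each loop in usec."""
    interval_ns = interval_us * 1000
    latencies = []
    for _ in range(loops):
        start = time.monotonic_ns()
        time.sleep(interval_us / 1_000_000)          # "go to sleep (1000 usec)"
        elapsed_ns = time.monotonic_ns() - start
        latencies.append((elapsed_ns - interval_ns) / 1000)
    return latencies


if __name__ == "__main__":
    samples = measure_wakeup_latency()
    print(f"avg {sum(samples) / len(samples):.1f} usec, max {max(samples):.1f} usec")
```

Running it while background load is active shows how wakeup latency grows when the CPU is contended.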

17/24

Evaluation scenario for worst-case

FOREGROUND: Evaluate scheduling latency of an urgent task

# Evaluate latency of 1 user-space thread with static priority 99
# ps -eo comm,pid,tid,class,rtprio,wchan:35 | grep 99 | awk '{print $2}'
# (Optionally add -a 0 to pin the thread to CPU0.)
time ./cyclictest -t1 -p 99 -i 1000 -n -l 1000000

BACKGROUND: Stress conditions

http://rt.wiki.kernel.org

# Create 50 threads as background tasks.
time ./cyclictest -t50 -p 80 -i 10000 -n -l 100000

# To maximize I/O load
cd /opt
tar cvzf test1.tgz ./linux-2.6.X &
tar cvzf test2.tgz ./linux-2.6.X &
tar cvzf test3.tgz ./linux-2.6.X &
tar cvzf test4.tgz ./linux-2.6.X &

# To maximize CPU load
/bin/ping -l 100000 -q -s 10 -f localhost &
/bin/ping -l 100000 -q -s 10 -f localhost &
/bin/ping -l 100000 -q -s 10 -f localhost &
/bin/ping -l 100000 -q -s 10 -f localhost &
/bin/ping -l 100000 -q -s 10 -f localhost &

# To get the highest CPU stress with Ingo Molnar's dohell.
#!/bin/sh
while true; do /bin/dd if=/dev/zero of=bigfile bs=1024000 count=1024; done &
while true; do /usr/bin/killall hackbench; sleep 5; done &
while true; do /sbin/hackbench 20; done &
( cd ./ltp-full-20120401; while true; do ./runalltests.sh -x 40; done & )

# Calculate the usage of disk for CPU & I/O load
/bin/du / &

18/24

Evaluation on CPU affinity based system 1/2

• Test Scenario: The foreground task is pinned to CPU0. The background stress tasks are pinned to CPU1~3.

• Test Environment: Intel Q9400, Linux 2.6.32

• Test Utilities: LTP-FULL-20120401, cyclictest from the rt-tests package

• Load-balancer setting: With Warm Zone (High spot) Policy

The scheduling latency of our test thread is reduced by more than a factor of three: from 53 microseconds to 16 microseconds on average.

19/24

Evaluation on CPU non-affinity based system 2/2

• Test Scenario: The foreground task is pinned to CPU0. The background stress tasks have no CPU affinity.

• Test Environment: Intel Q9400, Linux 2.6.32

• Test Utilities: LTP-FULL-20120401, cyclictest from the rt-tests package

• Load-balancer setting: With Warm Zone (High spot) Policy

The scheduling latency of our test thread is reduced by more than a factor of two: from 72 microseconds to 31 microseconds on average.
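The "more than three times" and "more than two times" claims follow directly from the reported averages (53 to 16 microseconds with affinity, 72 to 31 microseconds without). A short check of that arithmetic; the helper name improvement is only for illustration:

```python
def improvement(before_us, after_us):
    """Return (reduction factor, percentage reduction) between two latencies."""
    factor = before_us / after_us
    reduction_pct = (before_us - after_us) / before_us * 100
    return factor, reduction_pct


if __name__ == "__main__":
    for name, before, after in (("affinity", 53, 16), ("non-affinity", 72, 31)):
        factor, pct = improvement(before, after)
        print(f"{name}: {factor:.2f}x lower, {pct:.1f}% reduction")
    # affinity: 3.31x lower, 69.8% reduction
    # non-affinity: 2.32x lower, 56.9% reduction
```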

20/24

Evaluation - Task migration of sync command

• Test Environment: Android device, Linux 2.6.32

• Test Scenario: sync (to synchronize the files of a storage device such as a micro-SD card)

• Load-balancer policy: With Warm Zone (Mid spot) Policy

Skip unnecessary task migration to preserve real-time characteristics.

Before

Performance counter stats for 'sync':
3.837029 task-clock # 0.012 CPUs utilized
13 context-switches # 0.003 M/sec
3 CPU-migrations # 0.005 M/sec
140 page-faults # 0.036 M/sec
9,594,609 cycles # 2.501 GHz
<not counted> stalled-cycles-frontend
<not counted> stalled-cycles-backend
2,221,867 instructions # 0.23 insns per cycle
404,846 branches # 105.510 M/sec
14,400 branch-misses # 3.56% of all branches
0.321459666 seconds time elapsed

Tracing with Ftrace:

sync-2389 [001] 325.763989: wakeup: 2389:120:0 ==+ 620:120:0 [000]
sync-2389 [001] 325.764012: context_switch: 2389:120:2 ==> 0:120:0 [001]
sync-2389 [001] 325.764076: wakeup: 2389:120:0 ==+ 394:120:0 [002]
sync-2389 [001] 325.764082: wakeup: 2389:120:0 ==+ 620:120:0 [000]
sync-2389 [001] 325.764089: context_switch: 2389:120:2 ==> 0:120:0 [001]
sync-2389 [001] 325.764108: wakeup: 2389:120:0 ==+ 2342:120:0 [000]
sync-2389 [001] 325.764116: context_switch: 2389:120:2 ==> 0:120:0 [001]
sync-2389 [001] 325.764134: wakeup: 2389:120:0 ==+ 2343:120:0 [000]
sync-2389 [001] 325.764136: context_switch: 2389:120:2 ==> 0:120:0 [001]
sync-2389 [001] 325.764157: context_switch: 2389:120:2 ==> 0:120:0 [001]
sync-2389 [001] 325.799064: context_switch: 2389:120:2 ==> 0:120:0 [001]
sync-2389 [001] 325.799200: context_switch: 2389:120:2 ==> 0:120:0 [001]
sync-2389 [002] 325.799329: context_switch: 2389:120:2 ==> 0:120:0 [002]
sync-2389 [001] 325.799456: context_switch: 2389:120:2 ==> 0:120:0 [001]
sync-2389 [001] 325.799580: context_switch: 2389:120:2 ==> 0:120:0 [001]
sync-2389 [001] 325.799661: wakeup: 2389:120:0 ==+ 620:120:0 [000]
sync-2389 [001] 325.799663: context_switch: 2389:120:2 ==> 0:120:0 [001]
sync-2389 [001] 325.917879: wakeup: 2389:120:0 ==+ 394:120:0 [003]
(remaining lines omitted)

After

Performance counter stats for 'sync':
3.837029 task-clock # 0.012 CPUs utilized
13 context-switches # 0.003 M/sec
0 CPU-migrations # 0.000 M/sec
140 page-faults # 0.036 M/sec
9,594,609 cycles # 2.501 GHz
<not counted> stalled-cycles-frontend
<not counted> stalled-cycles-backend
2,221,867 instructions # 0.23 insns per cycle
404,846 branches # 105.510 M/sec
14,400 branch-misses # 3.56% of all branches
0.321459666 seconds time elapsed

Tracing with Ftrace:

sync-2389 [001] 325.763989: wakeup: 2389:120:0 ==+ 620:120:0 [000]
sync-2389 [001] 325.764012: context_switch: 2389:120:2 ==> 0:120:0 [001]
sync-2389 [001] 325.764076: wakeup: 2389:120:0 ==+ 394:120:0 [002]
sync-2389 [001] 325.764082: wakeup: 2389:120:0 ==+ 620:120:0 [000]
sync-2389 [001] 325.764089: context_switch: 2389:120:2 ==> 0:120:0 [001]
sync-2389 [001] 325.764108: wakeup: 2389:120:0 ==+ 2342:120:0 [000]
sync-2389 [001] 325.764116: context_switch: 2389:120:2 ==> 0:120:0 [001]
sync-2389 [001] 325.764134: wakeup: 2389:120:0 ==+ 2343:120:0 [000]
sync-2389 [001] 325.764136: context_switch: 2389:120:2 ==> 0:120:0 [001]
sync-2389 [001] 325.764157: context_switch: 2389:120:2 ==> 0:120:0 [001]
sync-2389 [001] 325.799064: context_switch: 2389:120:2 ==> 0:120:0 [001]
sync-2389 [001] 325.799200: context_switch: 2389:120:2 ==> 0:120:0 [001]
sync-2389 [001] 325.799329: context_switch: 2389:120:2 ==> 0:120:0 [002]
sync-2389 [001] 325.799456: context_switch: 2389:120:2 ==> 0:120:0 [001]
sync-2389 [001] 325.799580: context_switch: 2389:120:2 ==> 0:120:0 [001]
sync-2389 [001] 325.799661: wakeup: 2389:120:0 ==+ 620:120:0 [000]
sync-2389 [001] 325.799663: context_switch: 2389:120:2 ==> 0:120:0 [001]
sync-2389 [001] 325.917879: wakeup: 2389:120:0 ==+ 394:120:0 [003]
(remaining lines omitted)
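Migrations can be spotted in a trace like the ones on this slide by watching the CPU field in square brackets after the task name (for example "sync-2389 [001]"). The parser below is a minimal sketch that assumes exactly that line layout; the regular expression and the function name count_cpu_changes are illustrative and not part of Ftrace itself.

```python
import re

# Matches the prefix "<comm>-<pid> [<cpu>] <timestamp>:" of a trace line.
LINE_RE = re.compile(
    r"^(?P<comm>\S+)-(?P<pid>\d+)\s+\[(?P<cpu>\d+)\]\s+(?P<ts>\d+\.\d+):"
)


def count_cpu_changes(trace_lines, pid):
    """Count how often consecutive events of one task report a different CPU."""
    changes = 0
    last_cpu = None
    for line in trace_lines:
        match = LINE_RE.match(line.strip())
        if not match or int(match.group("pid")) != pid:
            continue                      # skip unrelated or malformed lines
        cpu = int(match.group("cpu"))
        if last_cpu is not None and cpu != last_cpu:
            changes += 1
        last_cpu = cpu
    return changes


if __name__ == "__main__":
    sample = [
        "sync-2389 [001] 325.799200: context_switch: 2389:120:2 ==> 0:120:0 [001]",
        "sync-2389 [002] 325.799329: context_switch: 2389:120:2 ==> 0:120:0 [002]",
        "sync-2389 [001] 325.799456: context_switch: 2389:120:2 ==> 0:120:0 [001]",
    ]
    print(count_cpu_changes(sample, pid=2389))   # prints 2
```

The count covers only the lines supplied, so it will not necessarily match the CPU-migrations figure that perf reports for the whole run.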

21/24

Evaluation – Migration handling of a single-threaded application

• Test Environment: Android device, Linux 2.6.32

• Test Scenario: Scheduling of a CPU-intensive, single-threaded application

• Test Example: tar xvf *** ./

• System Interface: /proc/sys/kernel/balance_one_threaded_app (ON=1, OFF=0)

[Figure: CPU usage over time (start to end) on CPU 0 to CPU 3. Before: busy intervals of 95%, 94%, 89%, 91%, 86%, 92%, 97%, 91%, 89%, and 84% are spread across the CPUs, with the remaining intervals idle. After: one CPU runs at 92% (CPU usage of one process) and the other three CPUs stay idle.]
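The system interface named on this slide is a /proc file that accepts 1 (ON) or 0 (OFF). The sketch below shows how such a switch could be toggled from user space. The path comes from the slide and exists only on a kernel carrying the author's change, so the helpers take the path as a parameter and the demo writes to a temporary file instead; the function names are hypothetical.

```python
import tempfile
from pathlib import Path

# Path taken from the slide; not present on a stock kernel.
SYSCTL_PATH = Path("/proc/sys/kernel/balance_one_threaded_app")


def set_balance_one_threaded_app(on, path=SYSCTL_PATH):
    """Write 1 (ON) or 0 (OFF) to the control file."""
    Path(path).write_text("1\n" if on else "0\n")


def get_balance_one_threaded_app(path=SYSCTL_PATH):
    """Return True if the control file currently holds 1 (ON)."""
    return Path(path).read_text().strip() == "1"


if __name__ == "__main__":
    # Demonstrate against a temporary file rather than the real /proc entry.
    with tempfile.TemporaryDirectory() as tmp:
        fake = Path(tmp) / "balance_one_threaded_app"
        set_balance_one_threaded_app(True, path=fake)
        print(get_balance_one_threaded_app(path=fake))   # prints True
```

On a patched kernel, writing to the real path requires root privileges.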

22/24

Further work

• If guaranteeing deadlines under worst-case conditions is critical for a real-time system, this approach has a technical limitation: it cannot bound the maximum latency of running tasks at all times.

• We need to find the best method, such as a hybrid design that combines our technique with the physical CPU shielding technique.

• To achieve low power consumption on mobile devices, we need further experimental research to design an ideal task migration algorithm that accounts for CPUs going on-line and off-line.

• We have to evaluate various scenarios covering direct cost, indirect cost, and latency cost to improve our load-balancer as a next-generation SMP scheduler.

23/24

Conclusion

• No modification of user-space is needed because this approach is implemented entirely in the operating system.

• Our design reduces the non-preemptible intervals that always incur a double-locking cost for task migration among the CPUs.

• Our approach suppresses the "task migration" kernel thread, which executes inefficient CPU instructions to move a task to another CPU.

• Our idea aggressively reduces the cost of CPU cache invalidation and of the synchronization caused by updates to the local cache.

24/24

Thank you for your attention! Any questions?
