Upload: benoit

Post on 12-Feb-2016

Page 1: Operating System Design

Operating System Design

LINUX KERNEL DESIGN (2.6/3.X)

Dr. C.C. Lee

Ref: Linux Kernel Development by R. Love

Ref: Operating System Concepts by Silberschatz…

Page 2: Operating System Design

Introduction

Monolithic kernel with dynamically loadable kernel modules
SMP support (run queue per CPU, load balancing)
Kernel is preemptive and schedulable, with thread support
CPU affinity (soft & hard)
Kernel memory is not pageable
Source is in GNU C (not ANSI C) with extensions, e.g. inline functions for efficiency
Kernel source tree: architecture-independent and architecture-dependent parts
Portable to different architectures

Page 3: Operating System Design

CPU Affinity

CPU affinity: less overhead, because the task's data stays in the cache
Soft affinity means that processes do not frequently migrate between processors.
Hard affinity means that processes run on the processors you specify.
  Reason 1: You have a hunch – computations
  Reason 2: Testing complex applications – linear scalability?
  Reason 3: Running time-sensitive, deterministic processes
sched_setaffinity(…) sets the CPU affinity mask

Page 4: Operating System Design

Process (Task) Basics

Process States
  TASK_RUNNING (running or ready)
  TASK_INTERRUPTIBLE (sleeping or blocked, may be woken by a signal)
  TASK_UNINTERRUPTIBLE (sleeping or blocked, only the awaited event can wake this task)
  TASK_STOPPED (SIGSTOP, SIGTTIN, SIGTTOU signals)
  TASK_ZOMBIE (terminated, waiting for the parent task to issue wait)

Page 5: Operating System Design

Process (Task) Basics - Continued

Context
  Process context – user code, or kernel code entered through a system call
  Interrupt context – kernel interrupt handling
Task (Process) Creation
  fork (may be implemented with COW, i.e. Copy On Write)
  vfork: same as fork, but the page tables are shared and the parent waits for the child
  The clone system call is used to implement fork and vfork
  Threads are created the same way as normal tasks, except that the clone system call is passed flags specifying which resources are shared
Task (Process) Termination
  Memory/files/timers/semaphores are released and the parent is notified

Page 6: Operating System Design

Process (Task) Scheduling

Preemptive
Scheduler classes (with priority among classes)
  Real-time: FIFO and RR (timeslice), fixed priority
  Normal (SCHED_NORMAL)
SMP (run queue/structure per CPU, why?)
  Processor affinity (soft & hard)
  Load balancing

Page 7: Operating System Design

Process (Task) Scheduling Cont.

Two process-scheduling classes:
  Normal time-sharing (dynamic priority)
    Nice value: 19 to -20, with default 0 (internal priority 120)
  Real-time algorithm (FIFO/RR) - soft real-time
    Absolute priorities (static): 0-99
    FIFO: runs until it exits, yields, or blocks
    RR: runs with a time slice
    Preemption by a higher-priority task is possible
Normal processes: to be studied here

Page 8: Operating System Design

Early Kernel 2.6 - O(1) Scheduler

Improved scheduler with O(1) operations, using bitmap operations to find the highest-priority queue
Active and expired arrays (run queues per CPU)
Scalable
Heuristics for CPU-bound/IO-bound tasks and interactivity

Page 9: Operating System Design

O(1) Scheduler Priority Array

[Figure: O(1) scheduler priority arrays. Source: Operating System Concepts, Silberschatz, Galvin and Gagne ©2005, slide 21.9]

Page 10: Operating System Design

O(1) Scheduler Summary

Implements a priority-based array of task entries that enables the highest-priority task to be found quickly (by using a priority bitmap with a fast instruction).

Recalculates the timeslice and priority of an expired task before placing it on the expired queue. When all the tasks have expired, the scheduler simply swaps the active and expired queue pointers and schedules the next task. Long scans of run queues are thus eliminated.

Scheduling takes the same amount of processing irrespective of the number of tasks n in the system: the cost no longer depends on n, but is a fixed constant.
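The bitmap lookup described above can be sketched in user-space C. This is a minimal illustration, not the kernel's code: the type `prio_bitmap` and the helper names are hypothetical, the 140 priority levels follow the 0-99 real-time plus 40 nice levels layout, and `__builtin_ctz` (a GCC/Clang builtin) stands in for the "fast instruction".

```c
#include <assert.h>
#include <stdint.h>

#define NUM_PRIOS 140   /* 0-99 real-time, 100-139 normal (assumed layout) */
#define WORD_BITS 32
#define NUM_WORDS ((NUM_PRIOS + WORD_BITS - 1) / WORD_BITS)

/* Hypothetical, simplified priority bitmap: one bit per priority level. */
struct prio_bitmap {
    uint32_t words[NUM_WORDS];
};

/* Mark priority level `prio` as having at least one runnable task. */
static void prio_set(struct prio_bitmap *b, int prio)
{
    assert(prio >= 0 && prio < NUM_PRIOS);
    b->words[prio / WORD_BITS] |= (uint32_t)1 << (prio % WORD_BITS);
}

/* Mark priority level `prio` as having no runnable task. */
static void prio_clear(struct prio_bitmap *b, int prio)
{
    assert(prio >= 0 && prio < NUM_PRIOS);
    b->words[prio / WORD_BITS] &= ~((uint32_t)1 << (prio % WORD_BITS));
}

/*
 * Return the lowest-numbered set bit (lower number = higher priority),
 * or -1 if no level is runnable. The loop is bounded by NUM_WORDS, so the
 * cost does not depend on how many tasks exist.
 */
static int prio_first(const struct prio_bitmap *b)
{
    for (int w = 0; w < NUM_WORDS; w++) {
        if (b->words[w] != 0)
            return w * WORD_BITS + __builtin_ctz(b->words[w]);
    }
    return -1;
}
```

With levels 120 and 35 set, `prio_first` returns 35; after clearing 35 it returns 120. The work is at most five word tests plus one bit scan, however many tasks are queued.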

Page 11: Operating System Design

O(1) Scheduler Problems

Although the O(1) scheduler performed well and scaled effortlessly for large systems with many tens or hundreds of processors, it fails on:
  Slow response for latency-sensitive applications, i.e. interactive processes on typical desktop systems
  Not achieving fair (equal) CPU allocation

Page 12: Operating System Design

Current: Completely Fair Scheduler (CFS)

Since kernel 2.6.23
CFS aims at:
  Giving each task a fair share (portion) of the processor time (hence Completely Fair)
  Improving on the interactive performance of the O(1) scheduler for desktops, whereas the O(1) scheduler is ideal for large server workloads
  Introducing a simple, efficient algorithmic approach (red-black tree) with O(log N) operations, whereas the O(1) scheduler relies on heuristics and its code is large and lacks algorithmic substance

Page 13: Operating System Design

Completely Fair Scheduler (CFS)

Page 14: Operating System Design

CFS – Processor Time Allocation

Select next the task that has run the least. Rather than assigning each process a fixed time slice, CFS calculates how long a process should run as a function of the total number of runnable processes and its niceness (default minimum granularity: 1 ms).

Nice values are used to weight the portion of the processor a process is to receive (not by additive increases, but by geometric differences). Each process runs for a "timeslice" proportional to its weight divided by the total weight of all runnable processes.

Assume TARGETED_LATENCY = 20 ms and two threads with niceness 0 and 5 (or, equivalently, 10 and 15):
  CFS assigns relative weights of approximately 3 : 1 (the exact values depend on the particular algorithm)
  The thread with niceness 0 (or 10) receives 15 ms and the thread with niceness 5 (or 15) receives 5 ms
  Here, the CPU portion is determined only by the relative difference in niceness.
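The 15 ms / 5 ms split above can be reproduced with a short calculation. The sketch below assumes a nice-0 weight of 1024 and a factor of about 1.25 per nice level, which approximates the kernel's weight table; the function names are hypothetical and chosen for illustration.

```c
#include <assert.h>

#define NICE_0_WEIGHT 1024.0   /* assumed weight of a nice-0 task */

/* Approximate weight: each nice step changes the weight by a factor of ~1.25. */
static double nice_to_weight(int nice)
{
    assert(nice >= -20 && nice <= 19);
    double w = NICE_0_WEIGHT;
    for (int i = 0; i < nice; i++)
        w /= 1.25;             /* positive nice: lighter weight */
    for (int i = 0; i > nice; i--)
        w *= 1.25;             /* negative nice: heavier weight */
    return w;
}

/* Slice of one task = latency * (its weight / total weight of runnable tasks). */
static double timeslice_ms(double latency_ms, int nice,
                           const int *runnable_nice, int n)
{
    assert(n > 0);
    double total = 0.0;
    for (int i = 0; i < n; i++)
        total += nice_to_weight(runnable_nice[i]);
    return latency_ms * nice_to_weight(nice) / total;
}
```

For niceness {0, 5} and a 20 ms latency this gives roughly 15.06 ms and 4.94 ms, and the pair {10, 15} gives the same split, which matches the observation that only the relative difference matters.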

Page 15: Operating System Design

CFS – The Virtual Runtime (vruntime)

The virtual runtime (vruntime) is the actual runtime (the amount of time spent running) weighted by the task's niceness:
  nice = 0, factor = 1: vruntime is the same as the real run time spent by the task
  nice < 0, factor < 1: vruntime is less than the real run time spent, so it grows more slowly than the real run time used
  nice > 0, factor > 1: vruntime is more than the real run time spent, so it grows faster than the real run time used
(The virtual runtime is measured in nanoseconds.)

Every time a thread runs for t ns, vruntime += t × factor (i.e. t weighted by the task's niceness, its priority)

The vruntime is used to account for how long a process has run. CFS then picks the process with the smallest vruntime.
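A minimal sketch of this accounting follows. It assumes the factor is about 1.25 per nice level (so factor = 1 at nice 0); `struct task`, `account_runtime` and `pick_next` are hypothetical names, and a linear scan stands in for the red-black tree lookup.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical task record holding only what the sketch needs. */
struct task {
    int nice;
    uint64_t vruntime_ns;
};

/* Assumed factor: 1 at nice 0, about 1.25x per nice level above or below. */
static double nice_factor(int nice)
{
    assert(nice >= -20 && nice <= 19);
    double f = 1.0;
    for (int i = 0; i < nice; i++)
        f *= 1.25;
    for (int i = 0; i > nice; i--)
        f /= 1.25;
    return f;
}

/* Charge delta_ns of real run time to the task, scaled by its factor. */
static void account_runtime(struct task *t, uint64_t delta_ns)
{
    t->vruntime_ns += (uint64_t)((double)delta_ns * nice_factor(t->nice));
}

/* Return the index of the task with the smallest vruntime. */
static int pick_next(const struct task *tasks, int n)
{
    assert(n > 0);
    int best = 0;
    for (int i = 1; i < n; i++) {
        if (tasks[i].vruntime_ns < tasks[best].vruntime_ns)
            best = i;
    }
    return best;
}
```

After each of three tasks with niceness 0, 5 and -5 runs for the same real time, the nice -5 task has the smallest vruntime and is picked next, which is how a higher-priority task ends up with more CPU time.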

Page 16: Operating System Design

CFS – Task Selection

CFS selects the task with the minimum virtual runtime, i.e. vruntime

CFS uses a red-black tree (rbtree, a type of self-balancing binary search tree) to manage the list of runnable processes and to find the process with the smallest vruntime efficiently

The selected task, the one with the smallest vruntime, is the leftmost node in the rbtree.

Page 17: Operating System Design

CFS – Task Just Created or Awakened

A new task is created
  vruntime = current min_vruntime (with some adjustment), and the task is inserted into the rbtree

A task is awakened from blocking
  vruntime = maximum(old vruntime, min_vruntime – targeted_latency)
  If the awakened task now has the smallest vruntime, the currently running task is preempted and the awakened task is scheduled.
  Using "min_vruntime – sched_latency" as a lower bound on an awakened task's vruntime prevents a task that blocked for a long time from monopolizing the CPU
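The wake-up rule above amounts to a clamp. The sketch below is an illustration under simplifying assumptions: the function name is hypothetical, times are plain unsigned nanosecond counters, and counter wrap-around (which real scheduler code has to handle) is ignored.

```c
#include <assert.h>
#include <stdint.h>

/*
 * vruntime for a task that has just been woken:
 * max(old vruntime, min_vruntime - latency), with the subtraction
 * clamped at zero so the unsigned arithmetic cannot underflow.
 */
static uint64_t place_woken_task(uint64_t old_vruntime,
                                 uint64_t min_vruntime,
                                 uint64_t latency_ns)
{
    uint64_t floor_ns = (min_vruntime > latency_ns)
                            ? min_vruntime - latency_ns
                            : 0;
    uint64_t v = (old_vruntime > floor_ns) ? old_vruntime : floor_ns;

    assert(v >= old_vruntime);   /* a task's vruntime never moves backwards */
    return v;
}
```

A task that slept for a long time (old vruntime far below min_vruntime) is lifted to within one latency period of the other tasks, so it gets a modest head start rather than an unbounded one. A task that slept only briefly keeps its old value.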

Page 18: Operating System Design

CFS – Group Scheduling

In plain CFS, if there are 25 runnable processes, CFS allocates 4% to each (assuming equal weights). If 20 belong to user A and 5 belong to user B, then user A receives 80% of the CPU and user B only 20%, so user B is at an inherent disadvantage.

Group scheduling first tries to be fair to each group and then to the individual tasks within the group, i.e. 50% to user A and 50% to user B.

Thus A's 50% is divided fairly among A's 20 tasks, and B's 50% is divided fairly among B's 5 tasks.
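The arithmetic of the example above can be checked with two small helpers. This is a sketch assuming equal weights for all groups and for all tasks within a group; both function names are hypothetical.

```c
#include <assert.h>

/* Plain CFS: every runnable task gets an equal share (percent of one CPU). */
static double flat_task_share(int total_tasks)
{
    assert(total_tasks > 0);
    return 100.0 / total_tasks;
}

/* Group scheduling: split evenly across groups first, then within the group. */
static double group_task_share(int num_groups, int tasks_in_group)
{
    assert(num_groups > 0 && tasks_in_group > 0);
    return 100.0 / num_groups / tasks_in_group;
}
```

With two groups, each of A's 20 tasks receives 2.5% and each of B's 5 tasks receives 10%, compared with 4% per task under plain CFS.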

Page 19: Operating System Design

CFS – Run Queue (Red-Black Tree)

Tasks are maintained in a time-ordered (i.e. vruntime-ordered) red-black tree for each CPU

Red-black tree: self-balancing binary search tree
  Balance is preserved by painting each node with one of two colors in a way that satisfies certain properties. When the tree is modified, it is rearranged and repainted to restore the coloring properties.
  The balancing guarantees that no leaf is more than twice as deep as any other, so tree operations (searching/insertion/deletion/recoloring) can be performed in O(log N) time

CFS switches to the leftmost task in the tree, that is, the one with the lowest virtual runtime (the greatest need for CPU), to maintain fairness.

Page 20: Operating System Design

CFS – Red-Black Tree

(www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/)

Page 21: Operating System Design

Interrupt Handling

Interrupts (hardware): asynchronous
  Device -> Interrupt Controller -> CPU -> Interrupt Handlers
  Each device has a unique value for its interrupt line: the IRQ (Interrupt ReQuest) number
  On a PC, IRQ 0 is the timer interrupt and IRQ 1 is the keyboard interrupt

Exceptions (soft interrupts): synchronous
  Fault (segment fault, page fault, …)
  Trap (system call)
  Programming exception

Page 22: Operating System Design

Top Halves and Bottom Halves

Top half
  Interrupts disabled (on the line, or on the local CPU)
  Runs immediately
  Acknowledges and resets the hardware, copies data from the hardware buffer

Bottom half
  Interrupts enabled
  Runs later (deferred)
  Performs the detailed processing

Example: network card
  Top half: alerts the kernel (to optimize network throughput), copies packets to memory, acknowledges the network hardware and readies the network card for more packets
  The rest is left to the bottom half

Page 23: Operating System Design

Top Half

Writing an interrupt handler (for the vectored interrupt table)

Registering an interrupt handler:
  int request_irq(irq#, *handler, irqflags, *devname, *dev_id)

When the kernel receives an interrupt:
  It looks up the interrupt table (by IRQ number)
  It sequentially invokes each handler registered on the line (until the interrupting device is found)
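The dispatch sequence on this slide can be modelled in user-space C. This is a toy model, not kernel code: `register_handler`, `dispatch_irq`, `flag_handler` and the table sizes are all hypothetical, and the stop-at-first-match behaviour follows the slide's wording. The mainline kernel's dispatch differs in detail; for example, it does not necessarily stop at the first handler that claims the interrupt.

```c
#include <assert.h>
#include <stddef.h>

#define NR_IRQS     16   /* assumed number of interrupt lines */
#define MAX_SHARED   4   /* assumed maximum handlers per shared line */

/* A handler returns 1 if its device raised the interrupt, 0 otherwise. */
typedef int (*irq_handler_fn)(int irq, void *dev_id);

struct irq_action {
    irq_handler_fn handler;
    void *dev_id;        /* identifies the device on a shared line */
};

static struct irq_action irq_table[NR_IRQS][MAX_SHARED];
static int irq_count[NR_IRQS];

/* Register a handler on a line; returns 0 on success, -1 if the line is full. */
static int register_handler(int irq, irq_handler_fn handler, void *dev_id)
{
    assert(irq >= 0 && irq < NR_IRQS && handler != NULL);
    if (irq_count[irq] == MAX_SHARED)
        return -1;
    irq_table[irq][irq_count[irq]].handler = handler;
    irq_table[irq][irq_count[irq]].dev_id = dev_id;
    irq_count[irq]++;
    return 0;
}

/* Invoke handlers in registration order; return the index that claimed it, or -1. */
static int dispatch_irq(int irq)
{
    assert(irq >= 0 && irq < NR_IRQS);
    for (int i = 0; i < irq_count[irq]; i++) {
        if (irq_table[irq][i].handler(irq, irq_table[irq][i].dev_id))
            return i;
    }
    return -1;
}

/* Example handler: dev_id points to an int that is nonzero while an interrupt is pending. */
static int flag_handler(int irq, void *dev_id)
{
    int *pending = dev_id;
    (void)irq;
    if (!*pending)
        return 0;        /* not our device */
    *pending = 0;        /* acknowledge */
    return 1;
}
```

Registering `flag_handler` twice on the same line with two different flags shows the role of `dev_id`: each invocation checks only its own device, and the dispatcher moves on until one of them claims the interrupt.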

Page 24: Operating System Design

Bottom Halves and Deferring Work

Softirqs: routines that run in interrupt context (cannot block)
  Handle work that is time-critical and highly concurrent
  Handling routines run right after the top half that raised the softirq

Tasklets: special softirqs, intended for work with less demanding time-critical/concurrency/locking requirements
  Simpler interface and implementation

Work queues: a different form of deferring work
  Work queues are run by kernel threads in process context and are thus schedulable
  Therefore, if the deferred work needs to sleep (to allocate a lot of memory, obtain semaphores, …), work queues should be used. Otherwise, softirqs/tasklets are used.

Page 25: Operating System Design

Bottom Halves - ksoftirqd

When the system is overwhelmed with softirq activity, low-priority user processes cannot run and may starve. Thus:

A per-CPU kernel thread, ksoftirqd (run with the lowest priority, i.e. nice value = 19), is awakened.

Because ksoftirqd handles the softirqs at this low priority while the system is busy, user processes are relieved from starvation.

Page 26: Operating System Design

Which Bottom Half to Use

Bottom Half    Context      Inherent Serialization
Softirq        Interrupt    None
Tasklet        Interrupt    Against the same tasklet
Work queue     Process      None

If the deferred work needs to run in process context: work queue
Highest overhead: work queue (kernel thread, context switch)
Ease of use: work queue
Fastest, highly threaded, timing-critical use: softirq
Same as softirq, but with a simpler interface and easier to use: tasklet

Normal driver writers have two choices:
  Is a schedulable entity needed to perform the work (e.g. to sleep for events)? If so, a work queue is the only choice.
  Otherwise, tasklets are preferred, unless scalability is a concern, in which case softirqs (highly threaded) are used.

Page 27: Operating System Design

Kernel Synchronization

The kernel has concurrency (threads) and needs synchronization

Code safe from concurrent access: terminology
  Interrupt safe (from interrupt handlers)
  SMP safe
  Preempt safe (kernel preemption)

Mechanisms: spinlock, R/W spinlock, semaphore, R/W semaphore, sequential lock, completion variables

Page 28: Operating System Design

Spin Locks

Spin locks: lightweight
  For short durations, to save context-switch overhead

Spin locks and the top half
  The kernel must disable local interrupts before obtaining a spin lock that an interrupt handler also uses. Otherwise the interrupt handler (IH) may interrupt the kernel and attempt to acquire the same lock while the kernel still holds it, and would then spin forever (deadlock).

Spin locks and bottom halves
  The kernel must disable bottom halves before obtaining a spin lock that a bottom half also uses. Otherwise the bottom half may preempt the kernel code and attempt to acquire the same lock while the kernel still holds it, and would then spin forever (deadlock).

Page 29: Operating System Design

Reader-Writer Spin Locks

Shared/exclusive locks

Reader path                  Writer path
read_lock(&my_rwlock)        write_lock(…)
CR (critical region)         CR (critical region)
read_unlock(…)               write_unlock(…)

Linux 2.6 favors readers over writers for reader-writer spin locks (writers can starve)

Page 30: Operating System Design

Semaphores

Semaphores are for long waits
Semaphores are for process context (they can sleep)
A spin lock cannot be held while acquiring a semaphore (the acquisition may sleep)
Kernel code holding a semaphore can be interrupted or preempted
Using semaphores: down, up

Page 31: Operating System Design

Reader-Writer Semaphore

Reader-writer flavor of semaphores
Reader-writer semaphores are mutexes
Reader-writer semaphore locks use uninterruptible sleep
As with semaphores, the following are provided:
  down_read_trylock(), down_write_trylock()
  down_read(), down_write(), up_read(), up_write()

Page 32: Operating System Design

Completion Variables

One task signals another task that an event has occurred

One task waits on the completion variable while the other task performs the work. When the working task finishes, it uses the completion variable to wake up the waiting task.

  init_completion(struct completion *) or DECLARE_COMPLETION(mr_comp)
  wait_for_completion(struct completion *)
  complete(struct completion *)

Page 33: Operating System Design

Sequential Locks

Simple mechanism for reading and writing shared data by maintaining a sequence counter
  Obtaining the write lock increments the sequence number; unlocking increments it again
  The sequence number is read before and after each read of the data
  A read is valid only if the sequence number was even before the read and is unchanged at the end

Writers always succeed (if there are no other writers); readers never block
Favors writers over readers
Readers do not affect the writer's locking
Seq locks provide a very lightweight and scalable lock for use with many readers and few writers

Page 34: Operating System Design

Sequential Locks (Cont.)

Example (the comment after each call shows what the call does internally):

  seqlock_t mr_seq_lock, *s1 = &mr_seq_lock;

  WRITE:
    write_seqlock(s1);      /* { spin_lock(&s1->lock); ++s1->sequence; smp_wmb(); } */
    /* write data */
    write_sequnlock(s1);    /* { smp_wmb(); s1->sequence++; spin_unlock(&s1->lock); } */

  READ:
    do {
        seq = read_seqbegin(s1);        /* { ret = s1->sequence; smp_rmb(); return ret; } */
        /* read data */
    } while (read_seqretry(s1, seq));   /* { smp_rmb(); return (seq & 1) | (s1->sequence ^ seq); } */

Pending writers continually cause the read loop to repeat until the writers are done.
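The retry test in the read loop can be exercised in isolation. The sketch below is a single-threaded user-space model of the sequence counter only: it has no spin lock and no memory barriers, so it illustrates the even/odd logic rather than a usable lock. All names are hypothetical.

```c
#include <assert.h>

/* Hypothetical stand-in for the counter part of a seqlock. */
struct seq_counter {
    unsigned sequence;   /* even: no writer active, odd: write in progress */
};

/* Writer enters: the counter becomes odd. */
static void write_begin(struct seq_counter *s)
{
    s->sequence++;
}

/* Writer leaves: the counter becomes even again. */
static void write_end(struct seq_counter *s)
{
    assert(s->sequence & 1u);   /* must pair with write_begin */
    s->sequence++;
}

/* Reader samples the counter before reading the data. */
static unsigned read_begin(const struct seq_counter *s)
{
    return s->sequence;
}

/*
 * Reader must retry if a write was in progress at the start (odd sample)
 * or if the counter changed while the data was being read.
 */
static int read_retry(const struct seq_counter *s, unsigned start)
{
    return ((start & 1u) | (s->sequence ^ start)) != 0;
}
```

A reader that samples 0, sees no writer activity and re-checks gets "no retry". A reader whose sample is odd, or whose sample no longer matches the counter after a write, is told to loop again.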

Page 35: Operating System Design

Ordering and Barriers

Both the compiler and the CPU can reorder reads/writes:
  Compiler: for optimization; CPU: for performance, i.e. pipelining

Memory barriers instruct the CPU not to reorder reads/writes; the barrier() call instructs the compiler not to reorder them

Memory barrier and compiler barrier methods
  barrier()   // compiler barrier - load/store
  smp_rmb(), wmb(), mb()

Intel x86 processors never reorder writes

Page 36: Operating System Design

Memory Management

Main memory: three (3) parts
  Kernel memory (never paged out)
  Kernel memory for the memory map (never paged out)
  Pageable page frames (user pages, paging cache, etc.)

Memory map: mem_map
  Array of page descriptors, one for each page frame in the system
  Each descriptor has pointers to the address space its frame belongs to (if not free), or links the frame into the list of free frames

Page 37: Operating System Design

Memory Management

Physical memory
  For the kernel (never paged out)
  For the memory map table (never paged out)
    For page frame to virtual page mapping
    For maintaining the free page list
  For pageable page frames
    User pages and paging caches

Arbitrary-size, contiguous kernel memory
  kmalloc(…)

Page 38: Operating System Design

Memory Allocation Mechanisms

Page allocator: buddy algorithm (chunks of 2**i pages are split or combined)
  A request for a 65-page chunk is satisfied with a 128-page chunk

Slab allocator: carves chunks (from the buddy algorithm) into slabs, each made of one or more physically contiguous pages
  A cache (one for each kernel data structure) consists of one or more slabs and is populated with kernel objects (TCBs, semaphores)
  Example: to allocate a new task_struct, the kernel looks in the object cache. It tries a partially full slab first, then an empty slab, and only then a new slab.

kmalloc(): similar to user-space malloc. It returns a pointer to a region of (physically contiguous) memory that is at least the requested 'size' bytes in length.

vmalloc(): allocates a chunk of physical memory (that may not be contiguous) and fixes up the page tables to map the memory into a contiguous chunk of virtual address space.
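The buddy allocator's rounding (65 pages requested, 128 pages handed out) is a round-up to the next power of two. A minimal sketch follows; the helper names are hypothetical, and it covers only the size calculation, not the free lists or the splitting and merging of chunks.

```c
#include <assert.h>

/* Smallest order i such that 2**i pages cover the request. */
static unsigned int buddy_order(unsigned long pages)
{
    assert(pages > 0);
    unsigned int order = 0;
    while ((1UL << order) < pages)
        order++;
    return order;
}

/* Number of pages in the chunk the allocator would hand out. */
static unsigned long buddy_chunk_pages(unsigned long pages)
{
    return 1UL << buddy_order(pages);
}
```

A request for 65 pages yields order 7, i.e. a 128-page chunk, so 63 pages are lost to internal fragmentation. That waste is one reason the slab allocator exists for small kernel objects.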

Page 39: Operating System Design

Virtual Memory

Virtual address space
  Homogeneous, contiguous, page-aligned areas (text, mapped files)
  Page size: 4 KB (Pentium), 8 KB (Alpha); Linux also supports 4 MB pages

Memory descriptor
  A process address space is represented by mm_struct (pointed to by the mm field of task_struct)

struct mm_struct {
    struct vm_area_struct *mmap;   // list of memory areas – text, data, …
    pgd_t *pgd;                    // page global directory
    atomic_t mm_users;             // addr. space users – 2 for 2 threads
    atomic_t mm_count;             // primary reference count
    struct list_head mmlist;       // list of all mm_struct
    …                              // lock, semaphore, …
    …                              // start/end addr. of code, data, heap, stack
};

Page 40: Operating System Design

Virtual Memory - Paging

Four-level paging (for 64-bit architectures)
  Global/upper/middle directories, and page table

The Pentium uses two-level paging (the global directory points to the page table)

Demand paging (no pre-paging)
  Only the user structure (PCB) and the page tables need to be in memory

Page daemon (process 2): awakened (periodically or on demand) to check 'free'
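To make the four-level walk concrete, the sketch below splits a virtual address into the four table indices plus the page offset. The bit layout is an assumption not stated on the slide: 4 KB pages and 9 index bits per level, as on x86-64 with 48-bit addresses. `struct va_parts` and `split_va` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT  12                        /* assumed 4 KB pages */
#define INDEX_BITS   9                        /* assumed 512 entries per table */
#define INDEX_MASK  ((1u << INDEX_BITS) - 1)

/* Indices into the global, upper and middle directories and the page table. */
struct va_parts {
    unsigned pgd, pud, pmd, pte, offset;
};

static struct va_parts split_va(uint64_t va)
{
    struct va_parts p;

    assert(va < (1ULL << 47));   /* sketch covers lower-half (user) addresses only */
    p.offset = (unsigned)(va & ((1u << PAGE_SHIFT) - 1));
    p.pte = (unsigned)((va >> PAGE_SHIFT) & INDEX_MASK);
    p.pmd = (unsigned)((va >> (PAGE_SHIFT + INDEX_BITS)) & INDEX_MASK);
    p.pud = (unsigned)((va >> (PAGE_SHIFT + 2 * INDEX_BITS)) & INDEX_MASK);
    p.pgd = (unsigned)((va >> (PAGE_SHIFT + 3 * INDEX_BITS)) & INDEX_MASK);
    return p;
}
```

Each index selects one entry in its table, and that entry points to the next-level table. On an architecture with fewer hardware levels, such as the two-level Pentium, the middle levels are effectively folded away.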

Page 41: Operating System Design

Page Replacement

Modified version of the LRU scheme
  One particular failure of the LRU strategy (besides its cost of implementation) is that many files are accessed once and then never again. Putting them at the top of the LRU list is thus not optimal.
  In general, the kernel has no way of knowing that a file is going to be accessed only once.
  However, it does know how many times the file has been accessed in the past. This leads to a modified version of LRU, i.e. the two-list strategy, as follows:

Page 42: Operating System Design

Page Replacement (Cont.)

Two-list strategy (modified version of LRU)
  Active list (hot) and inactive list (reclaim candidates)
  Pages are placed on the inactive list when first allocated
  A page referenced while on the inactive list is moved to the active list
  Both lists are maintained in a pseudo-LRU manner: items are added to the tail and removed from the head, as in a queue
  The lists are kept balanced: if the active list grows too large, items are moved from the active list back to the inactive list for potential eviction. The scan starts from the head item:
    The reference bit is checked. If it is set, it is reset, the item is moved back onto the active list, and the next page is checked. Otherwise the item is moved to the inactive list (this resembles a Clock algorithm).

Page 43: Operating System Design

Page Replacement (Cont.)

A global policy
  All reclaimable pages are contained in just two lists, and pages belonging to any process may be reclaimed, rather than just those belonging to the faulting process
  The two-list strategy enables simpler, pseudo-LRU semantics to perform well
  It solves the only-used-once failure of a classical LRU scheme

Page 44: Operating System Design

The Filesystem

To the user, Linux's file system appears as a hierarchical directory tree obeying UNIX semantics

Internally, the kernel hides implementation details and manages the multiple different file systems via an abstraction layer, that is, the virtual file system (VFS)

The Linux VFS is designed around object-oriented principles:
  write() -> sys_write() (VFS) -> the filesystem's write method -> physical media

VFS objects
  Primary: superblock, inode (cached), dentry (cached), and file objects
  An operations object is contained within each primary object: super_operations, inode_operations, dentry_operations, file_operations
  Other VFS objects: file_system_type, vfsmount, and three per-process structures: files_struct, fs_struct and namespace

Page 45: Operating System Design

File System and Device Drivers

[Figure: layered diagram. User mode: user applications, libraries. Kernel mode: file subsystem, buffer/page cache, character device driver and block device driver, hardware control.]

Page 46: Operating System Design

Virtual File System