Windows Operating System Internals - by David A. Solomon and Mark E. Russinovich with Andreas Polze
Unit OS B: Comparing the Linux and Windows Kernels
2
Copyright Notice
© 2000-2005 David A. Solomon and Mark Russinovich
These materials are part of the Windows Operating System Internals Curriculum Development Kit, developed by David A. Solomon and Mark E. Russinovich with Andreas Polze
Microsoft has licensed these materials from David Solomon Expert Seminars, Inc. for distribution to academic organizations solely for use in academic environments (and not for commercial use)
3
Roadmap for Section B
A Brief History of Windows and Linux
Comparing the Windows and Linux kernel architectures
Linux: becoming more like Windows
Benchmarks and other lies
What does the future hold?
4
Scope
We’re going to look at the technology of the kernels
We’re not going to look at:
Cost
Support
Applications
Management
Use as a desktop system
5
The History of Linux
The real history of Linux starts in 1969, when Ken Thompson developed the first version of UNIX at Bell Labs
After Dennis Ritchie, designer of the C programming language, joined the project, it debuted to the research community in an academic paper in 1974
Bell Labs released the first commercial version in 1976 as UNIX Version 6 (V6)
UNIX spread throughout universities, and in 1978 Bell Labs released UNIX Time-Sharing System, a version designed with portability in mind
6
Linux History Continued
Because Bell Labs distributed UNIX with source code, the early 1980s saw three major branches grow on the UNIX tree:
UNIX System III from Bell Labs' UNIX Support Group (USG)
UNIX Berkeley Software Distribution (BSD) from the University of California at Berkeley
Microsoft’s XENIX
The UNIX market fragmented further in the 1980s, despite the IEEE’s POSIX standard and the X/Open Group’s Portability Guide
7
Linus and Linux
In 1991 Linus Torvalds took a college computer science course that used the MINIX operating system
MINIX is a “toy” UNIX-like OS written by Andrew Tanenbaum as a learning workbench
Linus wanted to make MINIX more usable, but Tanenbaum wanted to keep it ultra-simple
Linus went in his own direction and began working on Linux
In October 1991 he announced Linux v0.02
In March 1994 he released Linux v1.0
8
The History of Windows (NT)
The history of Windows really begins in the mid-1970s, when Dick Hustvedt, Peter Lipman and David Cutler designed the VMS operating system for Digital’s 32-bit VAX processor
Digital shipped VMS v1.0 in 1978
Cutler moved to Seattle to open DECWest and worked on the Digital Mica OS for a new CPU codenamed Prism
12 engineers went with him and the facility grew to 200
In 1988 Digital cancelled the project
9
The History of Windows Continued
Bill Gates wanted a UNIX rival
He hired Cutler and 20 Digital engineers in 1989
The new project was called NT OS/2 because it focused on OS/2 backward compatibility
After the success of the 1990 release of Windows 3.0, Gates refocused the project on Windows compatibility
The project was renamed Windows NT
Microsoft released Windows NT 3.1 in August 1993
10
Windows and Linux
Both Linux and Windows are based on foundations developed in the mid-1970s
[Figure: two timelines, 1970 to 2000. UNIX/Linux: UNIX born, UNIX public, UNIX V6, Linux v1.0, v2.0, v2.1, v2.2, v2.3, v2.4, v2.6. VMS/Windows: VMS v1.0, Windows NT 3.1, NT 4.0, Windows 2000, Windows XP, Server 2003]
11
Comparing the Architectures
Both Linux and Windows are monolithic
All core operating system services run in a shared address space in kernel mode
All core operating system services are part of a single module
Linux: vmlinuz
Windows: ntoskrnl.exe
Windowing is handled differently:
Windows has a kernel-mode windowing subsystem
Linux has a user-mode X Window System
12
Kernel Architectures
[Figure: side-by-side layer diagrams of the two kernels]
Linux
User mode: Application, X-Windows
Kernel mode: System Services; Process Management, Memory Management, I/O Management, etc.; Device Drivers; Hardware Dependent Code
Windows
User mode: Application
Kernel mode: System Services; Win32 Windowing; Process Management, Memory Management, I/O Management, etc.; Device Drivers; Hardware Dependent Code
13
Linux Kernel
Linux is a monolithic but modular system
All kernel subsystems form a single piece of code with no protection between them
Modularity is supported in two ways:
Compile-time options
Most kernel components can be built as a dynamically loadable kernel module (DLKM)
DLKMs
Built separately from the main kernel
Loaded into the kernel at runtime and on demand (infrequently used components take up kernel memory only when needed)
Kernel modules can be upgraded incrementally
Support for minimal kernels that automatically adapt to the machine and load only those kernel components that are used
14
Windows Kernel
Windows is a monolithic but modular system
No protection among pieces of kernel code and drivers
Support for modularity is somewhat weak:
Windows drivers allow for dynamic extension of kernel functionality
Windows XP Embedded has special tools and packaging rules that allow coarse-grained configuration of the OS
Windows drivers are dynamically loadable kernel modules
A significant amount of code runs as drivers (including network stacks such as TCP/IP and many services)
Built independently from the kernel
Can be loaded on demand
Dependencies among drivers can be specified
15
Comparing Portability
Both Linux and Windows kernels are portable
Mainly written in C
Have been ported to a range of processor architectures
Windows
i486, MIPS, PowerPC, Alpha, IA-64, x86-64
Only x86-64 and IA-64 currently supported
> 64MB memory required
Linux
Alpha, ARM, ARM26, CRIS, H8300, i386, IA-64, M68000, MIPS, PA-RISC, PowerPC, S/390, SuperH, SPARC, VAX, v850, x86-64
DLKMs allow for minimal kernels for microcontrollers
> 4MB memory required
16
Comparing Layering, APIs, Complexity
Windows
Kernel exports about 250 system calls (accessed via ntdll.dll)
Layered Windows/POSIX subsystems
Rich Windows API (17,500 functions on top of native APIs)
Linux
Kernel supports about 200 different system calls
Layered BSD, UNIX System V, POSIX shared system libraries
Compact APIs (1742 functions in Single UNIX Specification Version 3; not including X Window APIs)
17
Comparing Architectures
Processes and scheduling
SMP support
Memory management
I/O
File caching
Security
18
Process Management
Windows
Process
Address space, handle table, statistics and at least one thread
No inherent parent/child relationship
Threads
Basic scheduling unit
Fibers - cooperative user-mode threads
Linux
Process is called a Task
Basic address space, handle table, statistics
Parent/child relationship
Basic scheduling unit
Threads
No threads per se
Tasks can act like Windows threads by sharing handle table, PID and address space
PThreads - cooperative user-mode threads
19
Scheduling Priorities
Windows
Two scheduling classes
“Real time” (fixed) - priority 16-31
Dynamic - priority 1-15
Higher priorities are favored
Priorities of dynamic threads get boosted on wakeups
Thread priorities are never lowered below their base priority
[Figure: Windows priority scale, 0 to 31: dynamic levels up to 15 (raised by I/O boosts), fixed levels 16-31]
20
Scheduling Priorities (cont)
Linux
Has 3 scheduling classes:
Normal - priority 100-139
Fixed Round Robin - priority 0-99
Fixed FIFO - priority 0-99
Lower priorities are favored
Priorities of normal threads go up (decay) as they use CPU
Priorities of interactive threads go down (boost)
21
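Scheduling Priorities (cont)
[Figure: priority scales side by side. Windows: 0 to 31, with dynamic levels up to 15 (raised by I/O boosts) and fixed levels 16-31. Linux: 0 to 140, with Fixed FIFO and Fixed Round-Robin at 0-99 and Normal at 100-140; CPU use moves a normal thread toward 140, I/O moves it toward 100]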
22
Linux Scheduling Details
Most threads use a dynamic priority policy
Normal class - similar to the classic UNIX scheduler
A newly created thread starts with a base priority
Threads that block frequently (I/O bound) will have their priority gradually increased
Threads that always exhaust their time slice (CPU bound) will have their priority gradually decreased
“Nice value” sets a thread’s base priority
Larger values = less priority, lower values = higher priority
Valid nice values are in the range of -20 to +19
Nonprivileged users can only specify positive nice values
Dynamic priority policy threads have static priority zero
Execute only when there are no runnable real-time threads
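The relationship between nice values and the Normal class range (100-139) can be sketched as a simple offset. This is a minimal illustration, assuming the conventional nice range of -20 to +19 and the 0-99 real-time range given on the priority slides; the function and constant names are made up for the example and are not kernel identifiers.

```python
# Illustrative sketch: names are hypothetical, not kernel identifiers.
RT_LEVELS = 100                 # priorities 0-99 belong to the fixed (real-time) classes
NICE_MIN, NICE_MAX = -20, 19    # assumed nice range


def nice_to_normal_priority(nice: int) -> int:
    """Map a nice value onto the Normal class range 100-139.

    Lower numbers are favored, so nice -20 lands on 100 and nice +19 on 139.
    """
    if not NICE_MIN <= nice <= NICE_MAX:
        raise ValueError(f"nice must be in [{NICE_MIN}, {NICE_MAX}]")
    return RT_LEVELS + (nice - NICE_MIN)


if __name__ == "__main__":
    for n in (-20, 0, 19):
        print(n, "->", nice_to_normal_priority(n))
```

With this mapping a default task (nice 0) sits in the middle of the Normal range, which leaves room for the boost and decay described above.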
23
Real-Time Scheduling on Linux
Linux supports two static priority scheduling policies:
Round-robin and FIFO (first in, first out)
Selected with the sched_setscheduler() system call
Use static priority values in the range of 1 to 99
Executed strictly in order of decreasing static priority
FIFO policy lets a thread run to completion
Thread needs to indicate completion by calling sched_yield()
Round-robin lets threads run for up to one time slice
Then switches to the next thread with the same static priority
RT threads can easily starve lower-priority threads from executing
Root privileges or the CAP_SYS_NICE capability are required for the selection of a real-time scheduling policy
Long-running system calls can cause priority inversion
Same as in Windows; but compare RTLinux
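As a hedged illustration of the calls named above, the sketch below uses Python's `os` wrappers around the POSIX scheduling API (`os.sched_get_priority_min`, `os.sched_get_priority_max`, `os.sched_setscheduler`). The helper names are invented for the example, and the code falls back gracefully on platforms where these wrappers are missing.

```python
import os


def rt_priority_range(policy_name: str = "SCHED_FIFO"):
    """Return (min, max) static priority for a policy, or None if unsupported here."""
    policy = getattr(os, policy_name, None)
    if policy is None or not hasattr(os, "sched_get_priority_min"):
        return None
    return os.sched_get_priority_min(policy), os.sched_get_priority_max(policy)


def try_switch_to_fifo(priority: int) -> bool:
    """Try to move the calling process to SCHED_FIFO.

    Returns False when the API is missing or the caller lacks the privilege
    (root or CAP_SYS_NICE on Linux). Use with care: a runaway FIFO process
    can starve everything below it.
    """
    if not hasattr(os, "sched_setscheduler"):
        return False
    try:
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
    except OSError:
        return False
    return True


if __name__ == "__main__":
    print("SCHED_FIFO priority range:", rt_priority_range())
```

Only the range query runs by default; `try_switch_to_fifo` is left for the reader to call deliberately, since it changes how the calling process is scheduled.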
24
Windows Scheduling Details
Most threads run in variable priority levels
Priorities 1-15
A newly created thread starts with a base priority
Threads that complete I/O operations experience priority boosts (but never higher than 15)
A thread’s priority will never be below base priority
The Windows API function SetThreadPriority() sets the priority value for a specified thread
This value, together with the priority class of the thread's process, determines the thread's base priority level
Windows will dynamically adjust priorities for non-realtime threads
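How the process priority class and the per-thread value combine into a base priority can be sketched as a lookup plus an offset. The numbers follow the priority table in the Windows scheduling documentation as best recalled here; the string keys are hypothetical stand-ins for the `*_PRIORITY_CLASS` and `THREAD_PRIORITY_*` constants, and the function does not call any Windows API.

```python
# Base level contributed by the process priority class (hypothetical string keys).
CLASS_BASE = {
    "idle": 4,
    "below_normal": 6,
    "normal": 8,
    "above_normal": 10,
    "high": 13,
    "realtime": 24,
}

# Offset contributed by the thread's relative priority.
RELATIVE_OFFSET = {
    "lowest": -2,
    "below_normal": -1,
    "normal": 0,
    "above_normal": 1,
    "highest": 2,
}


def base_priority(priority_class: str, relative: str) -> int:
    """Combine a process priority class and a thread relative priority."""
    realtime = priority_class == "realtime"
    # The two saturating values pin the thread to the edge of its range.
    if relative == "idle":
        return 16 if realtime else 1
    if relative == "time_critical":
        return 31 if realtime else 15
    return CLASS_BASE[priority_class] + RELATIVE_OFFSET[relative]


if __name__ == "__main__":
    print(base_priority("normal", "normal"))
    print(base_priority("high", "highest"))
```

Note that every dynamic-class result stays within 1-15 and every realtime-class result within 16-31, matching the two ranges on the priority slides.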
25
Real-Time Scheduling on Windows
Windows supports a static round-robin scheduling policy for threads with priorities in the real-time range (16-31)
Threads run for up to one quantum
Quantum is reset to a full turn on preemption
Priorities never get boosted
RT threads can starve important system services
Such as CSRSS.EXE
SeIncreaseBasePriorityPrivilege is required to elevate a thread’s priority into the real-time range (this privilege is assigned to members of the Administrators group)
System calls and DPC/APC handling can cause priority inversion
26
Scheduling Timeslices
Windows
The thread timeslice (quantum) is 10ms-120ms
When quanta can vary, a quantum has one of 2 values
Reentrant and preemptible
[Figure: Windows quanta. Fixed: 120ms. Variable: Background 20ms, Foreground 60ms]
Linux
The thread quantum is 10ms-200ms
Default is 100ms
Varies across the entire range based on priority, which is based on interactivity level
Reentrant and preemptible
[Figure: Linux quantum scale from 10ms to 200ms, with the default at 100ms]
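The slide gives only the Linux quantum range (10ms-200ms) and the default (about 100ms). One way to picture "varies across the entire range based on priority" is a linear interpolation over the nice range. The sketch below is exactly that assumption, not the kernel's formula, and it ignores the interactivity adjustment; all names are illustrative.

```python
MIN_TIMESLICE_MS = 10    # from the slide
MAX_TIMESLICE_MS = 200   # from the slide
NICE_MIN, NICE_MAX = -20, 19   # assumed nice range


def timeslice_ms(nice: int) -> int:
    """Linearly interpolate a quantum between 10ms (nice +19) and 200ms (nice -20).

    Assumption for illustration only: the real scheduler's formula may differ.
    """
    if not NICE_MIN <= nice <= NICE_MAX:
        raise ValueError(f"nice must be in [{NICE_MIN}, {NICE_MAX}]")
    steps = NICE_MAX - NICE_MIN              # 39 steps across the range
    span = MAX_TIMESLICE_MS - MIN_TIMESLICE_MS
    return MIN_TIMESLICE_MS + span * (NICE_MAX - nice) // steps


if __name__ == "__main__":
    for n in (-20, 0, 19):
        print(f"nice {n:+d}: {timeslice_ms(n)} ms")
```

Under this assumption a nice-0 task gets roughly 100ms, which is consistent with the default quoted above.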
27
Multiprocessor Support
Windows
Supports symmetric multiprocessing (SMP)
Up to 32 processors on 32-bit Windows
Up to 64 processors on 64-bit Windows
All CPUs can take interrupts
Supports Non-Uniform Memory Access systems
Scheduler favors the node a thread prefers to run on
Memory manager tries to allocate memory on the node a thread prefers to run on
Supports Hyperthreading
Scheduler favors idle physical processors when it has a choice
Doesn’t count logical CPUs against licensing limits
[Figure: Physical CPU 0 and Physical CPU 1, each exposing logical CPUs, with a ready thread waiting to be placed]
28
Multiprocessor Support (cont)
Linux
Supports SMP
No fixed upper CPU limit: the maximum is set as a kernel build constant
All CPUs can take interrupts
Supports Non-Uniform Memory Access systems
Scheduler favors the node a thread last ran on
Memory manager tries to allocate memory on the node a thread is running on
Supports Hyperthreading
Scheduler favors idle physical processors when it has a choice
29
Virtual Memory Management
Windows
32-bit versions split the address space between user mode and kernel mode, from 2GB/2GB to 3GB/1GB
Demand-paged virtual memory
32 or 64 bits
Copy-on-write
Shared memory
Memory mapped files
[Figure: 32-bit Windows address space. User: 0 to 2GB. System: 2GB to 4GB]
Linux
Splits the address space between user mode and kernel mode, from 1GB/3GB to 3GB/1GB
2.6 has a “4/4 split” option where the kernel has its own address space
Demand-paged virtual memory
32 bits and/or 64 bits
Copy-on-write
Shared memory
Memory mapped files
[Figure: 32-bit Linux address space. User: 0 to 3GB. System: 3GB to 4GB]
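Memory-mapped files and copy-on-write can be observed from user mode on either system. The sketch below uses Python's standard `mmap` module with `ACCESS_COPY`, which requests a private copy-on-write mapping: writes go to the mapping's own pages and the underlying file is left unchanged. The function name and the temporary file are illustrative only.

```python
import mmap
import os
import tempfile


def copy_on_write_demo():
    """Map a file copy-on-write, modify the mapping, and show the file is untouched.

    Returns (bytes seen through the mapping, bytes still on disk).
    """
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"hello world")
        # Length 0 maps the whole file; ACCESS_COPY makes the mapping private.
        with mmap.mmap(fd, 0, access=mmap.ACCESS_COPY) as view:
            view[0:5] = b"HELLO"          # touches only the private pages
            through_mapping = bytes(view[:])
        os.lseek(fd, 0, os.SEEK_SET)
        on_disk = os.read(fd, 64)
        return through_mapping, on_disk
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    print(copy_on_write_demo())
```

Switching the access mode to `mmap.ACCESS_WRITE` would instead give a shared mapping whose changes are written through to the file.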
30
Physical Memory Management
Windows
Per-process working sets
Working set tuner adjusts sets according to memory needs using the “clock” algorithm
No “swapper”
[Figure: a process's own LRU list supplying the reused page]
Linux
Global working set management
Uses the “clock” algorithm
No “swapper” (the working set trimmer code is called the swap daemon, however)
[Figure: a global LRU list, so the reused page may come from another process]
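The “clock” algorithm mentioned for both systems approximates LRU with one referenced bit per page frame and a rotating hand. The following is a textbook-style sketch of that idea, not either kernel's implementation; the function name and the small reference string are made up for illustration.

```python
def clock_replace(reference_string, frame_count):
    """Simulate clock (second-chance) page replacement.

    Returns (number of page faults, final frame contents).
    """
    frames = [None] * frame_count        # page held by each frame
    referenced = [False] * frame_count   # one "referenced" bit per frame
    hand = 0
    faults = 0
    for page in reference_string:
        if page in frames:
            # Hit: just mark the frame as recently used.
            referenced[frames.index(page)] = True
            continue
        faults += 1
        # Sweep: give referenced frames a second chance by clearing their bit.
        while frames[hand] is not None and referenced[hand]:
            referenced[hand] = False
            hand = (hand + 1) % frame_count
        frames[hand] = page              # reuse the victim frame
        referenced[hand] = True
        hand = (hand + 1) % frame_count
    return faults, frames


if __name__ == "__main__":
    print(clock_replace([1, 2, 3, 1, 4, 5], 3))
```

A per-process policy would run one such clock per working set, while a global policy runs a single clock over all frames, which is the difference the two columns above describe.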
31
I/O Management
Windows
Centered around the file object
Layered driver architecture throughout driver types
Most I/O supports asynchronous operation
Internal interrupt request level (IRQL) controls interruptibility
Interrupts are split between an Interrupt Service Routine (ISR) and a Deferred Procedure Call (DPC)
Supports plug-and-play
Linux
Centered around the vnode
No layered I/O model
Most I/O is synchronous
Only sockets and direct disk I/O support asynchronous I/O
Internal interrupt request level (IRQL) controls interruptibility
Interrupts are split between an ISR and soft IRQ or tasklet
Supports plug-and-play
[Figure: IRQL scale showing masked interrupt levels]
32
File Caching
Windows
Single global common cache
Virtual file cache
Caching is at the file level rather than the disk block level
Files are memory mapped into kernel memory
Cache allows for zero-copy file serving
[Figure: layer stack. File Cache, File System Driver, Disk Driver]
Linux
Single global common cache
Virtual file cache
Caching is at the file level rather than the disk block level
Files are memory mapped into kernel memory
Cache allows for zero-copy file serving
[Figure: layer stack. File Cache, File System Driver, Disk Driver]
33
Security
Windows
Very flexible security model based on Access Control Lists
Users are defined with
Privileges
Member groups
Security can be applied to any Object Manager object
Files, processes, synchronization objects, …
Supports auditing
Linux
Two models:
Standard UNIX model
Access Control Lists (SELinux)
Users are defined with:
Capabilities (privileges)
Member groups
Security is implemented on an object-by-object basis
Has no built-in auditing support
Version 2.6 includes the Linux Security Module framework for add-on security models
34
Monitoring - Linux procfs
Linux supports a number of special filesystems
Like special files, they are of a more dynamic nature and tend to have side effects when accessed
Prime example is procfs (mounted at /proc)
Provides access to and control over various aspects of Linux (e.g., scheduling and memory management)
/proc/meminfo contains detailed statistics on the current memory usage of Linux
Content changes as memory usage changes over time
Services for Unix implements procfs on Windows
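Because procfs entries are read like ordinary text files, monitoring tools can be very small. The sketch below parses the `Key: value kB` lines of /proc/meminfo into a dictionary; the function names are illustrative, and the exact set of fields varies between kernel versions, so the code makes no assumption about which keys are present.

```python
def parse_meminfo(text: str) -> dict:
    """Parse '/proc/meminfo'-style text into {field name: integer value}.

    Most values are in kB; a few fields (such as page counts) carry no unit.
    """
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def read_meminfo(path: str = "/proc/meminfo") -> dict:
    """Read and parse the live file (Linux only)."""
    with open(path) as f:
        return parse_meminfo(f.read())


if __name__ == "__main__":
    sample = "MemTotal:       16384 kB\nMemFree:         4096 kB\n"
    print(parse_meminfo(sample))
```

Calling `read_meminfo()` twice a few seconds apart on a Linux machine shows the content changing as memory usage changes.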
35
Windows’ Evolution Towards Linux
Services for Unix 3.5 - really targeted at POSIX, not Linux
POSIX threads, full POSIX subsystem (Interix)
X Window clients+server (X-Win32 LX)
NFS, NIS, PAM
proc file system for Windows
Configurability / Module Management
Windows XP Embedded
Target Designer/Component Designer/Component Management Database
Editions targeting new Application Domains
Windows Compute Cluster Server 2003
POSIX compatibility in Windows actually predates Linux and was one of the original design goals
36
Linux’s Evolution Towards Windows
I/O processing
Kernel reentrancy
Kernel preemptibility
Per-processor memory allocation
O(1) scheduler and per-CPU ready queues
Zero-Copy SendFile
Wake-One socket semantics
Asynchronous I/O
Light-weight synchronization
37
I/O Processing
Linux 2.2 had the notion of bottom halves (BH) for low-priority interrupt processing
Fixed number of BHs
Only one BH of a given type could be active on an SMP system
Linux 2.4 introduced tasklets, which are non-preemptible procedures called with interrupts enabled
Tasklets are the equivalent of Windows Deferred Procedure Calls (DPCs)
38
Kernel Reentrancy
Mark Russinovich’s April 1999 Windows NT Magazine article, “Linux and the Enterprise”, pointed out that much of the Linux 2.2 kernel was not reentrant
Ingo Molnar stated in rebuttal:
“his example is a clear red herring.”
A month later he made all major paths reentrant
[Figure: execution timelines for cpu 1 and cpu 2 in the non-reentrant and reentrant cases, showing the time saved with reentrancy]
39
Kernel Preemptibility
A preemptible kernel is more responsive to high-priority tasks
Through the base release of v2.4, Linux was only cooperatively preemptible
There are well-defined safe places where a thread running in the kernel can be preempted
The kernel is preemptible in v2.4 patches and v2.6
Windows NT has always been preemptible
40
Per-CPU Memory Allocation
Keeping accesses to memory localized to a CPU minimizes CPU cache thrashing
Cache thrashing hurts performance on enterprise SMP workloads
Linux 2.4 introduced per-CPU kernel memory buffers
Windows introduced per-CPU buffers in an NT 4 Service Pack in 1997
[Figure: CPUs 0 and 1, each with its own buffer cache (Buffer Cache 0, Buffer Cache 1)]
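The per-CPU idea can be modeled in a few lines. This is only a user-space toy, not how either kernel implements it: the class name `PerCpuBufferPool`, the `cpu` index argument, and the `global_hits` counter are all made up for illustration. It shows the point of the slide: allocations served from the local list never touch shared state, and only an empty local list forces a trip through the shared, locked pool.

```python
import threading

class PerCpuBufferPool:
    """Toy model: each CPU index owns a private free list of buffers."""

    def __init__(self, n_cpus, per_cpu_buffers, buf_size=64):
        self.buf_size = buf_size
        # One private free list per CPU (hypothetical layout).
        self.local = [[bytearray(buf_size) for _ in range(per_cpu_buffers)]
                      for _ in range(n_cpus)]
        self.global_lock = threading.Lock()
        self.global_free = []
        self.global_hits = 0  # how often the shared, locked path was taken

    def alloc(self, cpu):
        free = self.local[cpu]
        if free:
            return free.pop()          # fast path: only this CPU's list is touched
        with self.global_lock:         # slow path: shared pool, needs the lock
            self.global_hits += 1
            if self.global_free:
                return self.global_free.pop()
            return bytearray(self.buf_size)

    def free(self, cpu, buf):
        self.local[cpu].append(buf)    # return the buffer to the freeing CPU's list
```

With two buffers per CPU, the first two `alloc(0)` calls stay on the fast path; the third falls through to the shared pool and increments `global_hits`.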
41
Scheduling
The Linux 2.4 scheduler is O(n)
If there are 10 active tasks, it scans all 10 of them in a list in order to decide which should execute next
This means long scans and long durations under the scheduler lock
[Figure: a single ready list holding tasks labeled 103, 112, 112, and 101, with the highest-priority task marked]
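A minimal sketch of what an O(n) pick looks like, using the numbers from the figure. The function name `pick_next_linear`, the dictionary layout, and the rule that a larger number means higher priority are assumptions made for the example; the real 2.4 scheduler computes a per-task weighting rather than comparing a stored number.

```python
def pick_next_linear(ready_list):
    """Scan every ready task; return (best_task, number_of_tasks_scanned)."""
    best = None
    scanned = 0
    for task in ready_list:
        scanned += 1
        # Assumption for this sketch: larger number = higher priority.
        if best is None or task["priority"] > best["priority"]:
            best = task
    return best, scanned


ready = [
    {"name": "t0", "priority": 103},
    {"name": "t1", "priority": 112},
    {"name": "t2", "priority": 112},
    {"name": "t3", "priority": 101},
]
best, scanned = pick_next_linear(ready)
```

The work done per decision grows with the length of the list, and in a kernel all of it happens while the scheduler lock is held.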
42
Scheduling
Linux 2.6 has a revamped O(1) scheduler from Ingo Molnar that:
Calculates a task’s priority at the time it makes a scheduling decision
Has per-CPU ready queues where the tasks are pre-sorted by priority
[Figure: per-priority ready queues holding tasks labeled 112, 112, 101, and 103, with the highest-priority non-empty queue marked]
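The pre-sorted queue idea can be sketched with one queue per priority level plus a bitmap of non-empty levels. This is an illustrative model, not the kernel's code: the class name `O1RunQueue` and the convention that a higher index means higher priority are assumptions for the example, and a real system would keep one such structure per CPU.

```python
from collections import deque

class O1RunQueue:
    """Toy ready queue: one FIFO per priority level plus a bitmap of non-empty levels."""

    def __init__(self, levels=140):
        self.queues = [deque() for _ in range(levels)]
        self.bitmap = 0  # bit p is set while queues[p] is non-empty

    def enqueue(self, task, priority):
        self.queues[priority].append(task)
        self.bitmap |= 1 << priority

    def pick_next(self):
        if self.bitmap == 0:
            return None
        # Index of the highest set bit = highest-priority non-empty queue.
        top = self.bitmap.bit_length() - 1
        queue = self.queues[top]
        task = queue.popleft()
        if not queue:
            self.bitmap &= ~(1 << top)  # level drained: clear its bit
        return task
```

Because the number of levels is fixed, finding the next task costs the same whether 4 or 4,000 tasks are ready; tasks at the same level run in FIFO order.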
43
Scheduling
Windows NT has always had an O(1) scheduler based on pre-sorted thread priority queues
Server 2003 introduced per-CPU ready queues
Linux load balances queues
Windows does not
Not seen as an issue in performance testing by Microsoft
Applications where it might be an issue are expected to use affinity
44
Zero-Copy Sendfile
Linux 2.2 introduced Sendfile to efficiently send file data over a socket
I pointed out that the initial implementation incurred a copy operation, even if the file data was cached
Linux 2.4 introduced zero-copy Sendfile
Windows NT pioneered zero-copy file sending with TransmitFile, the Sendfile equivalent, in Windows NT 4
[Figure: 1-copy path (file data buffer, network adapter buffer, network driver, network) versus 0-copy path (file data buffer, network driver, network)]
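From user space, the Sendfile style of call looks roughly like the sketch below. It uses Python's standard `socket.sendfile()` method, which hands the transfer to the operating system's sendfile call where one is available and otherwise falls back to ordinary sends; whether the transfer is truly zero-copy depends on the kernel and the driver. The wrapper name `send_file_over_socket` is hypothetical.

```python
import socket

def send_file_over_socket(sock: socket.socket, path: str) -> int:
    """Send a whole file over a connected stream socket; return the bytes sent."""
    with open(path, "rb") as f:
        # The file contents are never read into a buffer owned by this program:
        # the kernel moves the data from the file to the socket.
        return sock.sendfile(f)
```

The contrast is with a read/send loop, where every block is copied into a user-space buffer and back out again.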
45
Wake-one Socket Semantics
Linux 2.2 kernel had the thundering herd or overscheduling problem
In a network server application there are typically several threads waiting for a new connection
In v2.2 when a new connection came in all the waiters would race to get it
Ingo Molnar’s response:
5/2/99: “here he again forgets to _prove_ that overscheduling happens in Linux.”
5/7/99: “as of 2.3.1 my wake-one implementation and waitqueues rewrite went in”
In Linux 2.4 only one thread wakes up to claim the new connection
Windows NT has always had wake-1 semantics
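The difference between waking every waiter and waking one can be modeled with ordinary threads. This is a user-space toy under stated assumptions, not kernel wait-queue code: the function name `deliver_one_connection`, the explicit `wait_queue` list, and the counters are invented for the example. One connection arrives while `n_waiters` threads are parked; the function reports how many threads were woken and how many actually got the connection.

```python
import threading

def deliver_one_connection(n_waiters, wake_all):
    """Hand one connection to parked threads; return (wakeups, served)."""
    lock = threading.Lock()
    all_parked = threading.Condition(lock)
    wait_queue = []  # (wake_event, thread) for every parked waiter
    state = {"pending": 0, "wakeups": 0, "served": 0, "stop": False}

    def waiter():
        wake = threading.Event()
        with lock:
            wait_queue.append((wake, threading.current_thread()))
            all_parked.notify()
        wake.wait()                    # sleep until something wakes this thread
        with lock:
            if state["stop"]:
                return                 # shutdown wake, not a connection
            state["wakeups"] += 1      # woken to compete for the connection
            if state["pending"]:       # only the first thread to run wins
                state["pending"] -= 1
                state["served"] += 1

    threads = [threading.Thread(target=waiter) for _ in range(n_waiters)]
    for t in threads:
        t.start()

    with lock:
        all_parked.wait_for(lambda: len(wait_queue) == n_waiters)
        state["pending"] = 1           # a new connection arrives
        woken = wait_queue[:] if wake_all else wait_queue[:1]
        del wait_queue[:len(woken)]
    for wake, _ in woken:
        wake.set()
    for _, thread in woken:
        thread.join()                  # let the woken threads finish racing

    with lock:
        state["stop"] = True           # release any threads still parked
        rest = wait_queue[:]
    for wake, _ in rest:
        wake.set()
    for t in threads:
        t.join()
    return state["wakeups"], state["served"]
```

With eight waiters, waking all of them schedules eight threads to serve a single connection, and seven of them find nothing to do; waking one schedules exactly the thread that will do the work.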
46
Asynchronous I/O
Linux 2.2 only supported asynchronous I/O on socket connect operations and ttys
Linux 2.6 adds asynchronous I/O for direct-disk access
AIO model includes efficient management of asynchronous I/O
Also added alternate epoll model
Useful for database servers managing their database on a dedicated raw partition
Database servers that manage a file-based database suffer from synchronous I/O
Windows I/O is inherently asynchronous
Windows has had completion ports since NT 3.5
More advanced form of AIO
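The epoll style of I/O mentioned above is readiness-based: the program registers many descriptors and asks the kernel which ones can be read without blocking. A small sketch using Python's standard `selectors` module, which picks epoll on Linux where it is available; the function name `collect_ready` and the use of socket pairs as stand-ins for client connections are assumptions for the example. Completion ports work differently (the kernel reports operations that have finished rather than descriptors that are ready), and this sketch does not model them.

```python
import selectors
import socket

def collect_ready(messages):
    """Send each non-empty payload on its own socket pair; read whatever is ready."""
    sel = selectors.DefaultSelector()   # epoll on Linux, where available
    writers = []
    for name, payload in messages.items():
        reader, writer = socket.socketpair()
        reader.setblocking(False)
        sel.register(reader, selectors.EVENT_READ, data=name)
        if payload:
            writer.sendall(payload)
        writers.append(writer)

    received = {}
    # One call reports every descriptor that is readable right now.
    for key, _events in sel.select(timeout=1.0):
        received[key.data] = key.fileobj.recv(4096)

    for key in list(sel.get_map().values()):
        sel.unregister(key.fileobj)
        key.fileobj.close()
    for writer in writers:
        writer.close()
    sel.close()
    return received
```

A single thread can watch thousands of connections this way, which is the subject of the C10K page listed under Further Reading.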
47
Light-Weight Synchronization
Linux 2.6 introduces Futexes
There’s only a transition to kernel-mode when there’s contention
Windows has always had CriticalSections
Same behavior
Futexes go further:
Allow for prioritization of waits
Works interprocess as well
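The two-path behavior can be illustrated with a toy lock. This is only a model: Python cannot express the user-space atomic operation a real futex or CriticalSection relies on, so a non-blocking try on a standard lock stands in for the fast path, and a blocking acquire stands in for the kernel wait. The class name `LightweightLock` and the `slow_path_entries` counter are hypothetical.

```python
import threading

class LightweightLock:
    """Toy model of a lock with an uncontended fast path and a blocking slow path."""

    def __init__(self):
        self._lock = threading.Lock()
        self.slow_path_entries = 0  # stands in for "transitions to kernel mode"

    def acquire(self):
        # Fast path: nobody holds the lock, so take it without blocking.
        if self._lock.acquire(blocking=False):
            return
        # Slow path: contention, so block until the holder releases.
        self.slow_path_entries += 1
        self._lock.acquire()

    def release(self):
        self._lock.release()
```

An uncontended acquire and release leaves `slow_path_entries` at zero; the counter only moves when a second thread arrives while the lock is held.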
48
A Look at the Future
The kernel architectures are fundamentally similar
There are differences in the details
Linux implementation is adopting more of the good ideas used in Windows
For the next 2-4 years Windows has and will maintain an edge
Linux is still behind on the cutting edge of performance tricks
Large performance team and lab at Microsoft has direct ties into the kernel developers
As time goes on the technological gap will narrow
Open Source Development Labs (OSDL) will feed performance test results to the kernel team
IBM and other vendors have Linux technology centers
Squeezing performance out of the OS gets much harder as the OS gets more tuned
49
Linux Technology Unknowns
Linux kernel forking
Red Hat has already done it: Red Hat Enterprise Server v3.0 is Linux 2.4 with some Linux 2.6 features
Backward compatibility philosophy
Linus Torvalds makes decisions on kernel APIs and architecture based on technical reasons, not business reasons
50
Further Reading
Transaction Processing Performance Council: www.tpc.org
SPEC: www.spec.org
NT vs Linux benchmarks: www.kegel.com/nt-linux-benchmarks.html
The C10K problem: http://www.kegel.com/c10k.html
Linus Torvalds’s home: http://www.osdl.org/
Linux Kernel Archives: http://www.kernel.org/
Linux history: http://www.firstmonday.dk/issues/issue5_11/moon/
Veritest Netbench result: http://www.veritest.com/clients/reports/microsoft/ms_netbench.pdf
Mark Russinovich’s 1999 article, “Linux and the Enterprise”: http://www.winntmag.com/Articles/Index.cfm?ArticleID=5048
The Open Group's Single UNIX Specification: http://www.unix.org/version3/