UNIX Internals (Focusing on the Linux Kernel)


Upload: dooley

Post on 13-Jan-2016


DESCRIPTION

UNIX Internals (Focusing on the LINUX Kernel). Part I. UNIX Operating System: 1. Introduction, 2. Process Management, 3. Memory Management, 4. File System, 5. Synchronization & IPC, 6. I/O System (Device Driver). Part II. Detailed Study: LINUX Kernel Internals: 1. Where is everything?

TRANSCRIPT

  • UNIX Internals

    (Focusing on the LINUX Kernel)

    Contents

    Part I. UNIX Operating System
      1. Introduction
      2. Process Management
      3. Memory Management
      4. File System
      5. Synchronization & IPC
      6. I/O System (Device Driver)

    Part II. Detailed Study: LINUX Kernel Internals
      1. Where is everything?
         - System call implementation
         - Device driver using module programming
      2. Linux internals

    References
      - U. Vahalia, UNIX Internals: The New Frontiers, Prentice Hall, 1996.
      - H. M. Deitel, Operating Systems, 2nd edition, Addison-Wesley, 1990.
      - Silberschatz and Galvin, Operating System Concepts, 5th edition, Addison-Wesley, 1998.
      - Mukesh Singhal and Niranjan G. Shivaratri, Advanced Concepts in Operating Systems, McGraw-Hill, 1994.
      - Maurice J. Bach, The Design of the UNIX Operating System, Prentice Hall, 1986.
      - M. Beck et al., Linux Kernel Internals, 2nd edition, Addison-Wesley, 1997.
      - Marshall K. McKusick, K. Bostic, M. Karels, and J. Quarterman, The Design and Implementation of the 4.4BSD Operating System, Addison-Wesley, 1996.
      - Berny Goodheart and James Cox, The Magic Garden Explained, Prentice Hall, 1994.

    I. Introduction
      - What is the UNIX Operating System?
      - Brief History
      - Kernel Architecture
      - Features of the UNIX Operating System

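    What is the UNIX Operating System?

    What's the similarity between an onion and UNIX?
    [Figure: UNIX drawn in layers like an onion. Labels: hardware; kernel; commands (csh, vi, who, a.out, du, telnet, grep, ps, ls, gcc, sort, wc); packages (X Window, RDBMS, Network Admin.).]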

    What is the UNIX Operating System? (Cont'd)
    [Figure: block diagram of the kernel (Source: The Design of the UNIX OS).
      - User level: user programs, libraries
      - Trap into the kernel through the system call interface
      - Kernel level: file system management (buffer cache, device drivers); process management (IPC, context, memory management); hardware control (interrupt handling, etc.)
      - HW level: hardware]

    What is the UNIX Operating System? (Cont'd)
      - The UNIX operating system is a resource manager
          - Physical resources: CPU, memory, disk, network
          - Abstract resources: process, thread, page, file, inode, message, security, ...
      - The UNIX operating system is a computing environment
          - provides resource services to users
          - system call, API
      - An abstraction is just a set of data structures at the kernel level

    Brief History
      - Before UNIX
          - Multics: 1965, AT&T (Bell Labs), General Electric, MIT
      - Epoch
          - 1969: Ken Thompson, Space Travel on the PDP-7
          - Dennis Ritchie
          - s5fs, ed, shell (Bourne shell)
          - 1973: "The UNIX Time-Sharing System" in CACM
      - BSD
          - Bill Joy, Chuck Haley
          - ex, csh, paging-based virtual memory system, TCP/IP, ffs, socket
          - 1993: 4.4BSD (final version, BSDI)
      - AT&T System V
          - Version 1, 2, ..., 7, System III, System V, SVR4.2/ESMP
          - region-based virtual memory, IPC, remote file sharing, STREAMS, ...

    Brief History (Cont'd)
      - Commercial UNIX
          - XENIX (MS, SCO), SCO UNIX (SCO), AIX (IBM, journaling FS), HP-UX (HP), ULTRIX (DEC, MP), OSF/1 (Digital), ...
          - SunOS (Sun Microsystems, VFS, NFS), Solaris, UnixWare (Novell)
      - Mach micro-kernel, Chorus, Exo-kernel, SPIN, L4, ...
          - http://ssrnet.snu.ac.kr/~choijm/current_os.html
      - Standards
          - SVID (System V Interface Definition), POSIX (IEEE), X/Open (Inc.)
          - UI (Sun, AT&T: Solaris), OSF (OSF/1)
      - Linux
          - performance oriented
          - philosophy of COPYLEFT

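    Kernel Architecture
      - Monolithic kernel: traditional UNIX, SVR4, Solaris, Linux, ...
    [Figure: processes enter an integrated kernel through the system call layer (the OS personality); all OS functionality lives inside the integrated kernel, which runs on the hardware.]

    Kernel Architecture (Cont'd)
      - Monolithic kernel
    [Figure: call paths inside a monolithic kernel. A process calling read() goes through the system call layer to sys_read() in the file system, bread() in the buffer cache, and hd_request() and do_hd_io() in the disk device driver. A process calling fork() goes to sys_fork() in process management, copy_mm() in the memory manager, and copy_thread() for the CPU state.]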

    Kernel Architecture (Cont'd)
      - Micro-kernel: Mach, Chorus, L3/L4, SPIN, QNX, Windows NT
    [Figure: processes issue system calls to a small microkernel on top of the hardware; OS functionality is provided by separate servers.]

    Kernel Architecture (Cont'd)
      - Micro-kernel
      - What is the advantage of a micro-kernel?
    [Figure: a process calls read(); the microkernel passes the request to the file system server, where sys_read() and hd_request() run; process management lives in a separate process server.]

    Windows NT Architecture
    [Figure: Windows NT architecture (Source: Inside Windows NT).
      - User mode: applications and clients (logon process, OS/2 client, Win32 client, POSIX client) exchange messages with the protected subsystems (servers): Win32 server, security server, OS/2 server, POSIX server
      - Kernel mode (entered by trap): system services; NT Executive (object manager, security reference monitor, process manager, LPC facility, VM management, I/O manager with file systems, cache manager, device drivers, network drivers); kernel; Hardware Abstraction Layer (HAL)
      - HW control: hardware]

    Features
      - What is good about UNIX
          - Open system: free
          - "Small is beautiful" philosophy
              - file: just a stream of bytes
          - Simple and coherent
              - data, device, pipe, socket, memory, process, ... can be treated as a single abstraction (file)
          - Portability
              - high-level language
              - new paradigms: OO, client-server model, clustering, PDA, MM server
          - True parallelism
              - multitasking (time sharing), multiprogramming, multiprocessor, MPP

    Features (Cont'd)
      - What is wrong with UNIX
          - Too many variants: dumping ground
          - Not small and simple any more: uncontrolled growth
          - Building-block approach: inappropriate for beginners
          - Lack of GUI: not now

    Ritchie's words: "It takes a genius to understand and appreciate UNIX's simplicity."

    II. Process Management

    Overview
      - What is a process?
      - process state transition
      - context
      - scheduling
      - kernel entry point: interrupt, trap, system call
      - signal

    What is a Process?
      - Definition
          - an instance of a running program (runnable program)
          - an execution environment of a program
          - scheduling entity
          - a control flow and address space
          - PCB (Process Control Block): proc table and U area
      - Manipulation of a process
          - create, destroy
          - context
          - state transition
          - dispatch (context switch)
          - sleep, wakeup
          - swap

    Process State Transition
    [Figure: process state diagram (Source: UNIX Internals).
      - States: initial (idle), ready to run, kernel running, user running, asleep, suspended ready, suspended asleep, zombie
      - Transition labels: fork; swtch; syscall, interrupt; return from syscall or interrupt; sleep, lock; wakeup, unlock; swap; exit; wait]

    Process State Transition (Cont'd)
      - Flow of execution: execution mode (cf. address space)
      - An interrupt or trap causes a change of execution mode
    [Figure: timeline alternating between kernel execution and the execution of processes A, B, and C, including the creation of process B (Source: Magic Garden).]

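    Context
      - context: system context, address (memory) context, H/W context
    [Figure: the proc table entry and U area lead to the segment table and page tables (pages in memory, or swapped to disk), to the fd array and the file table, and to the saved registers (TSS): eip, sp, eflags, eax, cs, ...]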

    Context: system context
      - proc table
          - identification: pid, process group id, ...
          - family relation
          - state
          - sleep channel: sleep queue
          - scheduling information: p_cpu, p_pri, p_nice, ...
          - signal handling information
          - address (memory) information
      - U area
          - stores the hardware context when the process is not currently running
          - UID, GID
          - arguments, return values, and error status for system calls
          - signal catch function
          - file descriptors
          - usage statistics
      - The exact contents may differ according to the version and variant of UNIX

    Context: address context
      - fork example
      - Guess what we get from this program.

        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/types.h>
        #include <unistd.h>

        int  glob = 6;                      /* lives in the data segment */
        char buf[] = "a write to stdout\n";

        int main(void)
        {
            int   var;                      /* lives on the stack */
            pid_t pid;

            var = 88;
            write(STDOUT_FILENO, buf, sizeof(buf) - 1);
            printf("before fork\n");

            if ((pid = fork()) == 0) {      /* child */
                glob++;
                var++;
            } else {
                sleep(2);                   /* parent */
            }

            printf("pid = %d, glob = %d, var = %d\n", (int)getpid(), glob, var);
            exit(0);
        }
        (Source: Adv. Programming in the UNIX Env., pgm 8.1)

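    Context: address context (Cont'd)
      - fork internal: compile results
    [Figure: gcc compiles test.c into a.out in ELF (Executable and Linking Format), which holds a header, text, data, bss, and stack information. From the user's perspective (virtual addresses), the process occupies 0x0 to 0xbfffffff (text; data holding glob and buf; stack holding var and pid) and the kernel sits above it, up to 0xffffffff. The text holds the code for glob++:
        movl %eax, [glob]
        addl %eax, 1
        movl [glob], %eax
        ...]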

    Context: address context (Cont'd)
      - fork internal: before fork (after running a.out)
      - cf) we assume that there is no paging mechanism in this figure.
    [Figure: the proc table entry (pid = 11) points to a segment table whose entries map the text, data (glob, buf), and stack (var, pid) segments in memory.]

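    Context: address context (Cont'd)
      - fork internal: after fork
      - address space: the basic protection barrier
    [Figure: two proc table entries (pid = 11 and pid = 12), each with its own segment table. Memory holds one text segment and two copies each of the data (glob, buf) and stack (var, pid) segments, one copy per process.]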

    Context: address context (Cont'd)
      - fork internal: with the COW (Copy on Write) mechanism
    [Figure: two panels. After fork with COW: the segment tables of pid = 11 and pid = 12 point to the same text, stack, and data in memory. After the glob++ operation: memory holds an additional copy of the data, so the two processes no longer share it.]

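    Context: address context (Cont'd)
      - execve internal
    [Figure: the proc table entry (pid = 11) and its segment table; memory shows two sets of text, data, and stack, the old image and the image of the newly executed program.]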

    Context: hardware context
      - time sharing (multitasking)
    [Figure: processes 1, 2, and 3 take turns on the CPU, one time quantum each. A process that gets the CPU back must ask: "Where am I?"]

    Context: hardware context (Cont'd)
      - brief reminder of the 80x86 architecture
    [Figure: CPU with ALU, control unit, registers, IN, and OUT. Registers: eip, eflags; eax, ebx, ecx, edx, esi, edi; cs, ds, ss, es, ...; cr0, cr1, cr2, cr3, GDTR, TR, ...]

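    Context: hardware context (Cont'd)
      - context switch (swtch)
    [Figure: the CPU saves the context of the outgoing process into its U area and proc table entry, then restores the context of the incoming process from its U area and proc table entry.]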

    Context: hardware context (Cont'd)
      - context switch (swtch): pseudo-code in UNIX
      - trick: register (e.g., eax in the 80x86 CPU)
      - Think about the difference between a context switch and a system call.

        /* need context swtch */
        if (save_context()) {
            /* pick another process to run from ready queue */
            ...
            restore_context(new process);
            /* The control does not arrive here, NEVER !!! */
        }
        /* resuming process executes from here !!! */
        ...
        (Source: The Design of the UNIX OS)
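
    The "returns twice" behavior of save_context() can be tried in user space with setjmp()/longjmp() from standard C. This is only an analogy, not kernel code: setjmp() saves the register context and returns 0, and a later longjmp() makes the same setjmp() call return again with a nonzero value, so the return convention is the reverse of the pseudo-code above. The name switch_demo is hypothetical.

```c
#include <setjmp.h>

static jmp_buf saved_context;   /* plays the role of the context saved in the U area */

/* Returns how many times control passed the setjmp() call. */
int switch_demo(void)
{
    volatile int passes = 0;    /* volatile so the value survives longjmp() */

    if (setjmp(saved_context) == 0) {
        /* First return: the context has just been saved. */
        passes++;
        longjmp(saved_context, 1);   /* "restore": jump back into setjmp() */
        /* Control never arrives here. */
    }
    /* Second return: the "resuming" path continues from here. */
    passes++;
    return passes;
}
```

    Calling switch_demo() passes the setjmp() call twice, once when saving and once when resuming, which is the same control-flow shape as the kernel pseudo-code.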

    Process Scheduling
      - Process scheduling
          - allocates the CPU resource among the competing processes
          - criteria: fairness, efficiency (response time vs. throughput)
      - types of processes
          - interactive
          - batch (computation-intensive)
          - real-time: video, hospital

      - types of scheduling
          - preemptive scheduling: other processes can take the CPU away from the currently running process
          - non-preemptive scheduling (e.g., 16-bit applications under Windows 98): other processes cannot take the CPU away from the currently running process

      - metrics: utilization, throughput, turnaround time, waiting time, response time

    Process Scheduling (Cont'd)
      - Existing policies
          - FCFS (First Come First Served)
          - RR (Round-Robin): time quantum (10-100 milliseconds)
          - SJF (Shortest Job First)
          - Multilevel Feedback Queue
          - EDF (Earliest Deadline First)
          - RM (Rate Monotonic)
          - Fair Queuing
          - Gang Scheduling
          - Causality Scheduling
          - Process migration

    Process Scheduling (Cont'd)
      - UNIX: round-robin with a multilevel feedback queue
      - Round-robin

    Process Scheduling (Cont'd)
      - Multilevel feedback queue
    [Figure: ready queues 1, 2, ..., n, each feeding the CPU; labels: higher priority, less time quantum.]

    Process Scheduling (Cont'd)
      - Round-robin: real implementation
          - scheduling information in the proc table: p_pri, p_cpu, p_nice
          - every clock tick: increment p_cpu for the currently running process
          - every second: p_cpu = p_cpu * decay factor (generally 1/2)
                          p_pri = PUSER + p_cpu/2 + p_nice
      - Example of System III
          - 3 processes, PUSER = 50, p_nice = 0, 60 clock ticks per second
    [Table: per-second values for the three processes; the values did not survive extraction.]
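
    The two update rules above can be sketched in a few lines of C. This is a minimal illustration under the slide's parameters (PUSER = 50, decay factor 1/2), not the System III source; the struct and function names are hypothetical.

```c
enum { PUSER = 50 };            /* base user priority, as in the example above */

struct sched_info {
    int p_cpu;                  /* recent CPU usage, in clock ticks */
    int p_nice;                 /* user-controlled bias */
    int p_pri;                  /* priority value; larger means less favored */
};

/* Called on every clock tick for the currently running process. */
void clock_tick(struct sched_info *running)
{
    running->p_cpu++;
}

/* Called once per second for every process: decay usage, recompute priority. */
void recompute_priority(struct sched_info *p)
{
    p->p_cpu = p->p_cpu / 2;                       /* decay factor 1/2 */
    p->p_pri = PUSER + p->p_cpu / 2 + p->p_nice;
}
```

    With 60 ticks per second, a process that ran for the whole second ends it with p_cpu = 30 and p_pri = 65, while a process that did not run stays at p_pri = 50 and is therefore preferred in the next second.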

    Process Scheduling (Cont'd)
      - Example of BSD
          - decay factor: (2*load_average) / (2*load_average + 1)
          - p_pri = PUSER + (p_cpu/4) + (2*p_nice)
          - a clock tick is 10 msec
          - the time quantum is 10 clock ticks
      - Example of Mach
          - decay factor: 5/8
          - p_usrpri = PUSER + (3.8*(max(1,M/P)) * p_cpu)/T + 0.5 * p_nice
      - Example of SVR4
          - supports REAL-TIME class processes
          - class-independent scheduler / class-dependent scheduler
      - Example of LINUX
          - supports REAL-TIME processes
          - selects the process that has the highest value of priority + counter
          - the counter of the current process decreases at each clock tick

    Process Scheduling (Cont'd)
      - Range of process priorities (Source: The Design of the UNIX OS)
          - Kernel mode priorities: swapper, waiting for disk I/O, waiting for buffer, waiting for inode, waiting for TTY I/O, waiting for child exit
          - User mode priorities: user level 0 (50), user level 1, ..., user level n

    Kernel Entry Point
      - Interrupt
      - Trap
      - System call
    [Figure: processes and devices enter the kernel, whose parts are labeled PM, FS, MM, DD, and HWM.]

    Interrupt Handling
      - Interrupt: a mechanism by which peripheral devices inform the UNIX operating system of an asynchronous event
      - What's the difference between polling and interrupts?
    [Figure: devices (real-time clock, disk, tty, network, cdrom) signal the PIC, which interrupts the CPU. The kernel indexes the IVT (entries 0, 1, 2, 3, 4, ...) to find the interrupt handler: clock(), nmi(), tty_intr(), disk_intr(), net_intr(), ...]

    Interrupt Handling (Cont'd)
      - interrupt handling mechanism: similar to the steps of receiving a letter while on the telephone
      - steps
          - if in user mode, change to kernel mode
          - save the context of the current process (make a new context layer)
          - determine the interrupt source
          - find the interrupt vector and call the interrupt handler
          - ... interrupt handling ...
          - restore the saved context
      - What if another interrupt is triggered while handling an interrupt?

    Interrupt Handling (Cont'd)
      - clock interrupt handler (timer_interrupt() in Linux)

        clock()
        {
            restart clock;    /* will interrupt again */
            if (callout table not empty)    /* e.g., timer_list in LINUX */
                adjust time and schedule callout function if necessary;
            if (profiling on)
                count program counter at time of interrupt;
            gather statistics per process and system;
            update CPU usage for the current running process;
            if (one second elapsed) {
                alarm handling;
                calculate the p_pri for all processes;
                reschedule if necessary;
                wake up swapper or page daemon if necessary;
            }
        }
        (Source: The Design of the UNIX OS)

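    Trap Handling
      - trap: a synchronous software event (it is caused by the instruction being executed)
    [Figure: IVT layout. Entries 0, 1, 2, 3, 4, ...: div_by_zero(), invalid_opcode(), overflow(), segment_fault(), page_fault(), ...; entries 20, 21, 22, 23, 24, ...: clock(), nmi(), tty_intr(), disk_intr(), net_intr(), ...; entry 80: system_call().]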

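    System Call Handling
      - system call: an example of a trap
    [Figure: a trap from a process reaches system_call() through IVT entry 80 (entries 0, 1, 2, 3, 4, ... hold div_by_zero(), invalid_opcode(), overflow(), segment_fault(), page_fault(), ...). system_call() indexes sys_call_table (sysent[]), whose entries 0, 1, 2, 3, 4, ... hold sys_no_syscall(), sys_exit(), sys_fork(), sys_read(), sys_write(), ..., followed by sys_getpid(), ..., sys_no_syscall() (further index labels: 47, 255), and calls the kernel function, e.g., sys_fork() or sys_read().]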

    System Call Handling (Cont'd)
      - invoking a system call
    [Figure: main() in the process calls fork(). The fork() stub in libc.a loads the system call number and traps:
        fork()
        {
            ...
            movl $2, eax
            trap $80
            ...
        }
    read() has a similar stub. The trap enters the kernel through IVT entry 80; system_call() indexes sys_call_table (sysent[]) and runs sys_fork(), or sys_read() for read().]
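
    On a current Linux system with glibc, the same idea (pass a system call number, then trap into the kernel) can be tried without writing assembly by using the generic syscall() wrapper. This is a sketch assuming Linux and glibc; raw_getpid is a hypothetical name, and the trap instruction actually used depends on the architecture.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>   /* SYS_getpid: the system call number */
#include <unistd.h>        /* syscall(), getpid() */

/* Invoke getpid by number instead of through its libc stub. */
long raw_getpid(void)
{
    return syscall(SYS_getpid);
}
```

    The value returned by raw_getpid() matches what the ordinary getpid() stub returns, because both end up in the same kernel function.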

    System Call Handling (Cont'd)
      - how to make a new system call
          - code the new system call function in kernel space
          - allocate a syscall number (and an empty slot in sys_call_table[]) and register the function
          - rebuild the kernel
          - reconfigure the library: ar, ranlib
          - write your program using the new system call

    Signal
      - a mechanism to inform a process of an asynchronous event
      - types of signal: SIGKILL, SIGINT, SIGBUS, SIGUSR1, ...
      - action: abort, exit, ignore, stop, user-level catch function
      - What's the difference among interrupt, trap, and signal?

        #include <signal.h>
        #include <stdio.h>
        #include <unistd.h>

        void sig_handler(int signo)
        {
            signal(SIGUSR1, sig_handler);             /* reinstall */
            printf("received signal %d\n", signo);    /* handle the signal */
            /* ... */
        }

        int main(void)
        {
            signal(SIGUSR1, sig_handler);             /* install the handler */
            /* ... */
            for ( ; ; )
                pause();
        }

    Signal (Cont'd)
      - register the signal handler (signal catch function)
      - send the signal
      - signal detection: on the state transition from kernel running to user running
      - call the signal handler

      - variables for signals in the task structure in LINUX
          - int sigpending: has a signal been received or not?
          - struct signal_struct *sig
          - sigset_t signal, blocked

        struct signal_struct        /* sched.h */
            count
            action[_NSIG]
            siglock

        struct sigaction            /* asm-i386/signal.h */
            sa_handler
            sa_flags
            sa_restorer
            sa_mask

        typedef struct {
            unsigned long sig[_NSIG_WORDS];
        } sigset_t;                 /* asm-i386/signal.h */

    III. Memory Management

    Memory Hierarchy
      - hierarchy: register, CPU cache, main memory, secondary storage, server (or INTERNET)
      - going down the hierarchy: larger capacity, lower speed, lower cost
      - caching is more and more important (how to keep consistency?)

    Memory Management Strategy
      - Three strategies
          - Fetch strategy: when is a process (page) brought into memory?
              - demand fetch
              - prefetch (agent in Web)
          - Placement strategy: where is a process (page) put in memory?
              - first fit, best fit, worst fit
          - Replacement strategy: which process (page) is evicted from memory?
              - LRU, LFU, MRU, ...

    History of Memory Management Systems
      - single-user system (stone age of memory management)
          - overlay
      - fixed partition multiprogramming system
          - absolute assembler, relocating assembler
      - variable partition multiprogramming system
          - coalescing, compaction
      - virtual memory system
          - paging
          - segmentation (segment, region, vm_object)
          - paging/segmentation
    [Figure: overlay example with a 2-pass program; labels: 20K, 30K, 10K, pass 1 (70K), pass 2 (80K).]

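    History (Cont'd)
      - variable partition multiprogramming system
      - Scenario: fork P1 (40K), fork P2 (20K), fork P3 (10K), fork P4 (20K), fork P5 (40K), fork P6 (20K), fork P7 (70K), exit P1, exit P3, exit P4, exit P6
    [Figure: memory and kernel internals. Memory layout (in K): kernel 0-100, P1 100-140, P2 140-160, P3 160-170, P4 170-190, P5 190-230, P6 230-250, P7 250-320, free 320-400.]

        free memory map after the exits (in K)
        start   end   size
          100   140     40
          160   190     30
          230   250     20
          320   400     80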

    Memory Management Strategy: Placement
      - Scenario: fork P1 (40K), fork P2 (20K), fork P3 (10K), fork P4 (20K), fork P5 (40K), fork P6 (20K), fork P7 (70K), exit P1, exit P3, exit P4, exit P6, fork P8 (25K)
      - free memory map (start, end, size in K): (100, 140, 40), (160, 190, 30), (230, 250, 20), (320, 400, 80)
      - Where should P8 go?
    [Figure: memory and kernel internals.]

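    Memory Management Strategy: Placement
      - issue: fragmentation
      - employed in swap management and the KMA (kernel memory allocator)
      - Scenario: fork P8 (25K)
      - free memory map (start, end, size in K): (100, 140, 40), (160, 190, 30), (230, 250, 20), (320, 400, 80)
    [Figure: memory and kernel internals, with the holes chosen by first fit, best fit, and worst fit marked.]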
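
    The three placement policies can be compared on the free memory map of this scenario with a short sketch. The struct and function names below are hypothetical; each function only chooses a hole and does not split it or update the map.

```c
#include <stddef.h>

struct hole { int start; int size; };   /* one free-map entry, in K */

/* Each function returns the index of the chosen hole, or -1 if none fits. */
int first_fit(const struct hole *map, size_t n, int need)
{
    for (size_t i = 0; i < n; i++)
        if (map[i].size >= need)
            return (int)i;              /* first hole that is large enough */
    return -1;
}

int best_fit(const struct hole *map, size_t n, int need)
{
    int best = -1;
    for (size_t i = 0; i < n; i++)
        if (map[i].size >= need &&
            (best < 0 || map[i].size < map[best].size))
            best = (int)i;              /* smallest hole that is large enough */
    return best;
}

int worst_fit(const struct hole *map, size_t n, int need)
{
    int worst = -1;
    for (size_t i = 0; i < n; i++)
        if (map[i].size >= need &&
            (worst < 0 || map[i].size > map[worst].size))
            worst = (int)i;             /* largest hole */
    return worst;
}
```

    For the map (100, 40), (160, 30), (230, 20), (320, 80) and a 25K request, first fit picks the hole at 100, best fit the hole at 160, and worst fit the hole at 320.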

    Virtual Memory
      - virtual memory: separates the virtual address from the physical address
    [Figure: virtual address space from 0x0 to 0xffffffff, divided into pages: user text, data, bss, and stack, and the kernel region (kernel text, kernel data, kernel bss, kernel stack).]

    Virtual Memory (Cont'd)
      - virtual address: Linux case (Source: Linux Internals)
    [Figure: user space runs from 0x0 to 0xc0000000 and the kernel from 0xc0000000 to 0xffffffff. User space holds the program text, data, and bss (marked by start_code, end_code, end_data, end_bss, brk), the shared C library and other shared libraries (each with text, data, bss), shared memory, and the stack (start_stack, arg_start, arg_end, env_end).]

    Virtual Memory (Cont'd)
      - physical memory: consists of the kernel and a set of processes
    [Figure: physical memory from 0x0 to 0x4ffffff holding the kernel and P1, P2, P3, P4.]

    Virtual Memory (Cont'd)
      - physical memory: a collection of page frames (4K or 8K)
    [Figure: physical memory as page frames 1, 2, 3, 4, 5, ..., n-1, n, with the frames of P1, P2, and P3 scattered among them.]

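    Virtual Memory (Cont'd)
      - address translation
    [Figure: a virtual address v = (s, p, d) consists of segment number s, page number p, and offset d. The segment table origin register b plus s selects the segment table entry, which gives s'. s' plus p selects the page table entry, which gives the page frame number p'. The physical address is formed from p' and the offset d.]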

    Virtual Memory (Cont'd)
      - address translation: table structure
          - segment table entry: V, segment start address (s), L, R, W, E, A
          - page table entry: V, page frame number (p), D, R, U, W, COW
          - cf) one disk block descriptor per page table entry: swap (fs) number, block number, type (fill 0, demand fill)

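    Virtual Memory (Cont'd)
      - execve (final)
    [Figure: a.out (header, text, data, stack) is mapped through the proc table entry, the segment table, and the page tables into page frames in memory. Labels: virtual addresses 0 K to 48 K in 4 K steps; pages T1, T2, D1, S1; page frames at 4 K, 12 K, 20 K, 28 K, ..., (n-1) K, n K; page table valid bits 1 1 0 1 0 0 0 0 0 0 1 0.]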

    Virtual Memory (Cont'd)
      - SVR 4.0 virtual memory structure
    [Figure: struct proc (p_as) points to struct as (seg_list, hint, struct hat), which heads a list of struct seg, one each for the text, data, stack, and u area of the virtual address space (fields: base, size, as_ptr, s_ops, private). The private data of a segment is a struct segvn_data (anon_map, vnode), which leads to the anonymous pages of the segment and the resident pages of the file.]

    Virtual Memory (Cont'd)
      - BSD (Mach) virtual memory structure
    [Figure: struct task (vm_map) points to struct vm_map (first, last, hint, struct pmap), which heads a list of struct vm_map_entry; each entry refers to a struct vm_object, whose resident page list links struct vm_page.]

    Virtual Memory (Cont'd)
      - Linux virtual memory structure
    [Figure: task_struct (mm) points to mm_struct (count, pgd, mmap), which heads a list of vm_area_struct (vm_start, vm_end, vm_flag, vm_inode), one for the code region and one for the data region.]

    Virtual Memory (Cont'd)
      - advantages of virtual memory
          - large address space
          - no need for a placement strategy
          - flexible sharing of memory objects among processes
      - no free lunch: disadvantage of virtual memory
          - address translation
    [Figure: P1 and P2 each have their own segment table and page table pointing into memory; labels: P1 frames at 4 K, 28 K, 20 K with valid bits 1 1 0 1 0; P2 frames at 8 K, 28 K, 40 K with valid bits 1 1 0 1.]

    Virtual Memory (Cont'd)
      - address translation with a TLB (Translation Lookaside Buffer)
    [Figure: a virtual address v = (s, p, d) is translated through the segment table (origin register b, entry s giving s') and the page table (entry p giving page frame number p') into the physical address formed from p' and d. The TLB, an associative memory, maps (s, p) directly to p'.]

    Virtual Memory (Cont'd)
      - HAT (Hardware Address Translation)
          - isolates all hardware-dependent code
          - HAT in SVR4, pmap in BSD, pgd in Linux, ...
          - responsible for all address translation, transparently
      - case study: 80x86 CPU, segment translation
    [Figure: a virtual address is a 16-bit segment selector plus a 32-bit offset. The selector picks a descriptor from a segment descriptor table (GDT or LDT), and segment translation yields the 32-bit linear address.]
      - cf) 80x86 reminder
          - GDT
              - available to all tasks
              - segments for OS code and data
              - descriptors for LDT, TSS
          - LDT: for a specific task
          - IDT: interrupt service routines

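    Virtual Memory (Cont'd)
      - HAT (Hardware Address Translation): paging
      - case study: 80x86 CPU
    [Figure: a 32-bit linear address is split into DIR (bits 31-22), PAGE (bits 21-12), and offset (bits 11-0). The control register CR3 (page directory base register) locates the page directory; DIR selects a page directory entry, whose PFN locates a page table; PAGE selects a page table entry, whose PFN (bits 31-12) combined with the offset (bits 11-0) forms the physical address.]
      - page table entry: PFN (bits 31-12) plus flag bits
          - D: dirty (bit 6)
          - R: referenced (bit 5)
          - U: user/supervisor (bit 2)
          - W: read/write (bit 1)
          - P: present (valid) (bit 0)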
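
    The bit fields above translate directly into shifts and masks. The sketch below assumes 4 KB pages and the two-level layout just described; the function names are hypothetical, and the page table entry is passed in as a plain number instead of being read from a real page table.

```c
#include <stdint.h>

/* Field extraction for a 32-bit linear address with 4 KB pages. */
uint32_t dir_index(uint32_t linear)   { return linear >> 22; }             /* bits 31-22 */
uint32_t page_index(uint32_t linear)  { return (linear >> 12) & 0x3FFu; }  /* bits 21-12 */
uint32_t page_offset(uint32_t linear) { return linear & 0xFFFu; }          /* bits 11-0  */

/* Combine the page frame number held in a page table entry with the offset. */
uint32_t physical_address(uint32_t pte, uint32_t linear)
{
    return (pte & 0xFFFFF000u) | page_offset(linear);
}
```

    For the linear address 0xC0101234, the directory index is 0x300, the page table index is 0x101, and the offset is 0x234.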

    Replacement Strategy
      - Which page can be evicted from memory?
      - goal: reduce the number of page faults and avoid thrashing
    [Figure: memory holds p7, p3, p1, p4, p2; a page fault occurs for p8, which is on disk, and the replacement policy must choose a victim.]

    Replacement Strategy (Cont'd)
      - basic principle of replacement: locality
          - temporal locality: stack, tree traversal, counting variable
          - spatial locality: array, sequential code, file reference
      - replacement policies
          - FIFO (First In First Out)
          - LRU (Least Recently Used)
          - LFU (Least Frequently Used)
          - NUR (Not Used Recently)
          - MRU (Most Recently Used)
          - Working Set
          - Second Chance (FIFO + reference bit)

    Replacement Strategy (Cont'd)
      - example: FIFO, LRU, LFU
      - Guess which page will be evicted from memory under the LRU policy.
      - Which policy is the best policy?
      - scenario: page reference order p1, p2, p3, p1, p4, p2, p1, p3, p4, p7, p8
    [Figure: system internals. Memory holds p7, p3, p1, p4, p2; p8 is on disk.]
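
    One way to check a guess for the LRU case is a small simulation. The sketch below assumes five page frames (the figure shows five resident pages) and uses a linear scan over last-use times, which is enough for so few frames; lru_last_victim is a hypothetical name.

```c
#include <stddef.h>

#define NFRAMES 5   /* assumption: five page frames, as in the figure */

/* Simulate LRU over refs[0..n-1]. Returns the page evicted by the last
 * reference, or -1 if the last reference caused no eviction. */
int lru_last_victim(const int *refs, size_t n)
{
    int    page[NFRAMES];
    size_t last_use[NFRAMES];
    size_t used = 0;
    int    victim = -1;

    for (size_t t = 0; t < n; t++) {
        size_t i, slot;

        victim = -1;
        for (i = 0; i < used; i++)
            if (page[i] == refs[t])
                break;
        if (i < used) {                 /* hit: just refresh the time stamp */
            last_use[i] = t;
            continue;
        }
        if (used < NFRAMES) {           /* miss, but a frame is still free */
            slot = used++;
        } else {                        /* miss: evict the least recently used */
            slot = 0;
            for (i = 1; i < NFRAMES; i++)
                if (last_use[i] < last_use[slot])
                    slot = i;
            victim = page[slot];
        }
        page[slot] = refs[t];
        last_use[slot] = t;
    }
    return victim;
}
```

    Feeding it the reference order of the scenario, with page pN written as the integer N, reports which page the final reference to p8 pushes out under LRU.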

    Replacement Strategy (Cont'd)
      - Project I: program a simulator for the FIFO, LRU, and LFU policies and compare their performance.
      - assume
          - memory consists of 20 page frames
          - page numbers range from 0 to 49
          - the number of references is 300
      - program the 3 policies
          - use a linked list for FIFO and LRU
          - use a priority tree for LFU if possible
          - use a hash to find a page quickly
      - compare the performance and discuss it

    Replacement Strategy (Cont'd)
      - Example of a real implementation in UNIX: buffer cache (Source: The Design of the UNIX OS)
    [Figure: five hash queue headers, (page_no % 5) = 0 through 4, each chaining the cached buffers that hash to it; the buffers are also linked on an LRU list with a head and a tail. The buffer numbers in the figure did not survive extraction.]

    Replacement Strategy (Cont'd)
      - example: NUR
      - used by the pagedaemon (two-handed clock algorithm)
      - page table entry: V, page frame number (p), D, R, U, W, COW
      - possible (D, R) combinations: (0, 0), (0, 1), (1, 0), (1, 1)
      - replace a page having the (0, 0) combination first

    Swapper vs. PageDaemon
      - swapping and paging: evict some object from memory when memory is almost full
      - swapping
          - object: process
          - swap in / swap out
          - swap space management: similar to variable partition multiprogramming
      - paging
          - object: page
          - page fault handling

    IV. File System

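    Overview of File System
    [Figure: processes 1, 2, ..., n run in user mode. In system mode, the file system consists of the Virtual File System on top of concrete file systems (ffs, nfs, ext2fs, ntfs, ..., mmfs, procfs), which sit on the buffer cache and the device drivers.]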

    User Interface
      - System calls: open, read/write, close, dup, link, pipe, mkfifo, mkdir, readdir, mknod, stat, mount, sync, fsck

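    User Interface (Cont'd)
      - file descriptor, file table, inode (vnode)
    [Figure: the proc table entry and U area (segment table, fd array, TSS); each fd points to a file table entry, which points to a vnode and its inode.]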

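    User Interface (Cont'd)
      - fork vs. open
    [Figure: two panels. fork: the parent's and the child's proc table entries have separate fd arrays that point to the same file table entry, which points to the vnode. open of the same file: the fd entries point to separate file table entries, which point to the same vnode.]
      - How about dup?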

    Disk system
      - physical view
          - platter, arm, head
          - cylinder, track, sector
          - seek time, rotational latency, transmission time
      - logical view (the viewpoint of UNIX)
          - a disk is a collection of disk blocks
          - the disk block size is usually equal to the page frame size
    [Figure: disk blocks numbered 0, 1, 2, ..., 14, ...]

    Structure of File
      - disk block allocation
          - we want to create a file of size 14 K
          - assume the disk block size is 4 K
      - sequential allocation
      - non-sequential allocation: block chain, indexed block, FAT
    [Figure: disk blocks numbered 0, 1, 2, ..., 16, ...]

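    Structure of File (Cont'd)
      - non-sequential allocation: block chain
    [Figure: the entry for the new file name points to the first disk block; each block points to the next one.]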

    Structure of File (Cont'd)
      - non-sequential allocation: index block
      - What if the index block is full?
    [Figure: the entry for the new file name points to an index block, which lists the file's disk blocks.]

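    Structure of File (Cont'd)
      - non-sequential allocation: FAT (File Allocation Table)
      - What are the advantages and disadvantages of block chain, index block, and FAT?
    [Figure: the entry for the new file name points into the FAT, whose entries chain the file's blocks. FAT contents shown: 11, 12, NIL, 5, 4, NIL, 34, 21, 9, 6, 7, NIL, UN, UN.]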

    Structure of File (Cont'd)
      - sequential allocation
      - What are the advantages and disadvantages of sequential and non-sequential allocation?
    [Figure: the entry for the new file name records a start block and a size.]

    Structure of File (Cont'd)
      - inode in the UNIX file system
          - i_inode_number
          - i_mode
          - i_nlink
          - i_uid, gid
          - i_rdev
          - i_atime, ctime, mtime
          - direct
          - indirect
          - ...
      - i_mode: type (4 bits), then u g s, then r w x r w x r w x
          - types: S_IFSOCK, S_IFLNK, S_IFREG, S_IFBLK, S_IFDIR, S_IFCHR, S_IFIFO

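    Structure of File (Cont'd)
      - inode in the UNIX file system: finding a block
          - assume the size of a disk block is 4 K
          - which block is used if f_offset is 10000? (or 47000?)
    [Figure: f_offset in the file table selects an entry among the inode's direct and indirect block pointers. Block-number labels (spacing lost in extraction): 74 1218 24 3341 165169.]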
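
    The lookup comes down to integer division. The sketch below assumes 4 K blocks as on the slide, plus two values the slide does not state: 10 direct pointers in the inode and 4-byte block numbers (1024 per indirect block). All names are hypothetical, and double and triple indirect blocks are left out.

```c
enum {
    BLK_SIZE  = 4096,             /* 4 K disk blocks, as assumed on the slide */
    NDIRECT   = 10,               /* assumption: 10 direct pointers in the inode */
    NINDIRECT = BLK_SIZE / 4      /* assumption: 4-byte block numbers */
};

struct block_pos {
    int  level;          /* 0 = direct, 1 = single indirect, -1 = not handled */
    long index;          /* slot in the direct array or in the indirect block */
    long byte_in_block;  /* offset inside the selected disk block */
};

struct block_pos locate(long f_offset)
{
    struct block_pos pos;
    long lbn = f_offset / BLK_SIZE;       /* logical block number in the file */

    pos.byte_in_block = f_offset % BLK_SIZE;
    if (lbn < NDIRECT) {
        pos.level = 0;
        pos.index = lbn;
    } else if (lbn < NDIRECT + NINDIRECT) {
        pos.level = 1;
        pos.index = lbn - NDIRECT;
    } else {
        pos.level = -1;                   /* double/triple indirect: omitted */
        pos.index = -1;
    }
    return pos;
}
```

    Under these assumptions, offset 10000 falls in direct slot 2 at byte 1808, and offset 47000 falls in logical block 11, which is slot 1 of the single indirect block, at byte 1944.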

    Structure of Directory
      - connects a file name to disk block(s)
      - provides a hierarchical structure for the file system
      - directory entry in UNIX FS: inode number, file name
      - directory entry in DOS: file name, extension, attributes, time, first block number
    [Figure: directory blocks, each entry being (inode number, file name).
      - inode 1 (i_mode, time, ..., block 1), disk block 1: 1 ., 1 .., 3 usr, 4 dev, 5 etc, 6 vmunix, 7 var, 9 mnt
      - inode 3 (i_mode, time, ..., block 7), disk block 7: 3 ., 1 .., 12 src, 16 include, 17 lib, 20 bin, 23 member, 25 local
      - inode 23 (i_mode, time, ..., block 39), disk block 39: 23 ., 3 .., 32 jim, 33 tom, 37 mark, 41 sooni, 42 mjc]

    Structure of Directory (Cont'd)
      - hierarchical view

        /            : usr, dev, etc, var, mnt, vmunix
        /usr         : src, include, lib, bin, member, local
        /usr/member  : jim, tom, mark, sooni, mjc

    Structure of Directory (Cont'd)
      - open example: open("/usr/member/sooni/test.c", O_RDONLY)
          - find the inode using the directory structure (namei())
          - allocate an fd and a file table entry and initialize them
    [Figure: proc table, fd, file table entry (f_offset), inode.]

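    Structure of File System
      - file system: boot block, super block, inode blocks, data blocks
    [Figure: a system has disks /dev/hda and /dev/hdb; /dev/hda is divided into partitions /dev/hda1, /dev/hda2, /dev/hda3; a partition is laid out as boot block, super block, i-nodes, disk blocks.]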

    Structure of File System (Cont'd)
      - super block: manages information for the file system (cf. an inode does so for a file)
      - related routines: iget, iput, balloc, bfree
      - struct superblock: s_type, s_flag, s_dev, s_blocksize, s_magic, s_name, ..., s_free_inode[], s_free_disk block[]
          - s_free_inode[]: free inode list (map)
          - s_free_disk block[]: free disk block list (map)

    Structure of File System (Cont'd)
      - super block
      - struct superblock: s_type, s_flag, s_dev, s_blocksize, s_magic, s_name, ..., s_free_inode[], s_free_disk block[]
    [Figure: the free disk block list continues from the super block into disk blocks 29 and 61. Numbers shown: 29 27 26 24 21 20 19, then 61 57 56 54 51 50 48 46 45 43 42 41 39 38 37 34.]

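    Structure of File System (Cont'd)
      - mount: mount /dev/hda3 /mnt
      - open("/mnt/test.c", O_RDONLY)
    [Figure: the inode for /mnt is the mount point. The vfsmntlist chains vfsmount entries; the entry for /dev/hda3 points through mnt_sb to the super block for /dev/hda3 (s_dev, s_blocksize, mounted point, root inode, ...), which leads to the inode for the root of the file system on /dev/hda3.]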

    Inode for special file
      - inode structure for a special file
      - pipe
          - no indirect blocks (unnamed pipe)
          - readers, writers, read pointer, write pointer
      - special device file
          - no direct or indirect blocks
          - device number: major number + minor number
          - major number: the device type; used as an index into the device switch table
          - minor number: the device unit; passed as an argument to the device driver

    Existing File System
      - S5FS
          - the first and conventional UNIX file system
      - FFS
          - supports 255-character file names
          - cylinder groups
          - fragments
          - directory entry for ffs: i_no, size, file_name
      - LFS
          - optimized for small writes
          - suitable for RAID storage systems
      - VxFS (journaling file system)
          - fast recovery using internal logging
    [Figure: fast file system structure: boot block, super block, cylinder group 1 (inodes, disk blocks), cylinder group 2, ...]

    Existing File System
      - ext2 file system
          - Linux default file system
          - similar to Berkeley's FFS
          - inode: 12 direct blocks
          - uses bitmaps for free block and inode management
          - fault-tolerant features
    [Figure: ext2 file system structure. The disk holds a boot block followed by block group 0, block group 1, ..., block group n. Each block group holds a super block, group descriptors, a block bitmap, an inode bitmap, an inode table, and data blocks.]

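    Existing File System
      - NFS
          - stateless protocol
          - XDR (External Data Representation)
      - AFS, Coda file system
          - disconnected operation
      - Sprite file system
          - strong consistency
      - VFS
          - supports various file systems
      - mfs
      - procfs
    [Figure: on the NFS client, an application's system call passes through VFS to NFS, then to the RPC stub and XDR. On the NFS server, the RPC stub hands the request to nfsd, which goes through VFS to NFS and the local UFS.]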

    Swap space management
    [Figure: processes P1 to P6 in memory (0 to 400) and a swap space from 0 to 64M. The text, data, and stack of P1 and P2 must be placed in the swap space: where should they go?]

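    Swap space management
      - swap used map
      - Why does UNIX manage swap space differently from the file system?
      - Scenario: swap out P1 (3M), swap out P2 (3M), swap out P3 (2M), swap out P4 (1M), swap out P5 (3M), swap out P6 (4M), swap in P2, swap in P4, swap in P5
    [Figure: P1 to P6 in memory (0 to 400), a swap space from 0 to 64M, and the map after the scenario. Map entries (start, end, size in M): (3, 6, 3), (8, 12, 4), (16, 64, 48), which are the extents left free once P2, P4, and P5 have been swapped back in.]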

    V. Inter-Process Communication

    Inter-Process Communication (IPC)
      - synchronization
      - pipes
      - communication via files
      - signal
      - System V IPC
          - message queue
          - shared memory
          - semaphore
      - IPC with sockets

    synchronization
      - parallelism
          - multiprocessor (true parallelism) or time sharing (quasi-parallelism)
      - race condition: more than one process wants to access the same resource
          - shared resource
      - mutual exclusion
          - only one process can access a shared resource at a time
          - critical section: a portion of a program that accesses a shared resource
          - representative mechanisms: ipl, lock, semaphore, test&set
      - deadlock

    synchronization (Cont'd)
      - example of race condition I
      - Guess what the results are.

        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/types.h>
        #include <unistd.h>

        static void charatatime(char *str);

        int main(void)
        {
            pid_t pid;

            if ((pid = fork()) == 0) {    /* child */
                charatatime("output from child\n");
            } else {
                charatatime("output from parent\n");
            }
            exit(0);
        }

        static void charatatime(char *str)
        {
            char *ptr;
            int   c;

            setbuf(stdout, NULL);         /* set unbuffered */
            for (ptr = str; (c = *ptr++) != 0; )
                putc(c, stdout);
        }
        (Source: Adv. Programming in the UNIX Env., pgm 8.7)

      - a possible output:
            outpuot utfprut froom chmild parent

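    synchronization (Cont'd)
      - system internals
    [Figure: the task structures of the two processes have fd entries that point to the same file structure (f_pos), which points to the inode; this shared file structure is the shared resource.]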

    synchronization (Cont'd)
      - example of race condition II
      - scenario
          - process P1 is currently being dispatched (removed from the ready queue)
          - a disk interrupt occurs
          - the disk interrupt handler wakes up process P2 and wants to insert it into the ready queue

    synchronization (Cont'd)
      - ipl (interrupt priority level)

    synchronization (Cont'd)
      - lock
          - associate a lock variable with each shared resource
          - lock before (unlock after) the critical section
      - spin_lock primitive

        void spin_lock(spinlock_t *s)
        {
            while (test_and_set(s) != 0)
                ;    /* busy-wait until the lock is free */
        }

        void spin_unlock(spinlock_t *s)
        {
            *s = 0;
        }
        (Source: UNIX Internals)
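
    In portable user-space C, the test_and_set() primitive is available as atomic_flag_test_and_set() from C11 <stdatomic.h>. The sketch below mirrors the spin lock above under that assumption; the demo_ names are hypothetical, and a trylock variant is added only to make the behavior easy to observe.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_flag held; } demo_spinlock_t;

#define DEMO_SPINLOCK_INIT { ATOMIC_FLAG_INIT }

void demo_spin_lock(demo_spinlock_t *s)
{
    /* test-and-set returns the previous value: true means somebody
     * else holds the lock, so keep spinning. */
    while (atomic_flag_test_and_set_explicit(&s->held, memory_order_acquire))
        ;
}

/* Returns true if the lock was acquired, false if it was already held. */
bool demo_spin_trylock(demo_spinlock_t *s)
{
    return !atomic_flag_test_and_set_explicit(&s->held, memory_order_acquire);
}

void demo_spin_unlock(demo_spinlock_t *s)
{
    atomic_flag_clear_explicit(&s->held, memory_order_release);
}
```

    The acquire and release orderings keep the accesses made inside the critical section from being reordered outside of it; a kernel spin lock additionally has to deal with interrupts, which this sketch ignores.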

    synchronization (Cont'd)
      - sleep_lock
      - design choices: spin lock or sleep lock, lock granularity, rw_lock (try_lock)
    [Figure: flowchart with two Yes/No decisions. Boxes: process wants resource; sleep on resource; awakened by any process; lock the resource; use resource; unlock resource; does anyone want it?; wake up all waiting processes; continue other processing.]

    synchronization (Cont'd)
      - semaphore: an object that can be accessed only through the P and V (and sem_initialize) methods
      - semaphore primitives

        void initsem(semaphore_t *sem, int val)
        {
            *sem = val;
        }

        void P(semaphore_t *sem)
        {
            *sem -= 1;
            while (*sem < 0)
                sleep;
        }

        void V(semaphore_t *sem)
        {
            *sem += 1;
            if (processes slept on sem queue)
                wake up the processes slept on sem;
        }
        (Source: UNIX Internals)

    synchronization (Cont'd)
      - semaphore: example
    [Figure: a client and a server share memory. Client: produce an item, put the item into shared memory. Server: remove an item from shared memory, consume the item.]

    synchronization (Cont'd)
      - semaphore: example
      - semaphores sem1, sem2: initsem(sem1, 5), initsem(sem2, 0)
      - client: produce an item; P(sem1); put the item into shared memory; V(sem2)
      - server: P(sem2); remove an item from shared memory; V(sem1); consume the item

    synchronization (Cont'd)
      - semaphore in the Linux kernel
          - widely used to wait until a condition is met (e.g., reading disk blocks)
          - semaphore    /* include/asm-i386/semaphore.h, kernel/sched.c */
          - declare a semaphore for each shared resource

        struct semaphore {
            atomic_t count;
            struct wait_queue *wait;
        };

        void down(struct semaphore *sem)
        {
            while (sem->count <= 0)
                sleep_on(&sem->wait);
            sem->count--;
        }

        void up(struct semaphore *sem)
        {
            sem->count++;
            wake_up(&sem->wait);
        }

      - usage: with a shared struct semaphore *x, process 1 and process 2 each run
            down(x);
            /* critical section */
            up(x);

    synchronization (Cont`)
    semaphore in the linux kernel
    sleep, wakeup /* include/linux/wait.h, kernel/sched.c */
    cf) interruptible_sleep_on(), wake_up_interruptible()

    struct wait_queue {
        struct task_struct *task;
        struct wait_queue *next;
    };

    void sleep_on(struct wait_queue **queue) {
        struct wait_queue entry = {current, NULL};
        current->state = TASK_UNINTERRUPTIBLE;
        add_wait_queue(queue, &entry);
        schedule();
        remove_wait_queue(queue, &entry);
    }

    void wake_up(struct wait_queue **queue) {
        struct wait_queue *p = *queue;
        do {
            p->task->state = TASK_RUNNING;
            add_runqueue(p->task);
            p = p->next;
        } while (p != *queue);
    }

    synchronization (Cont`)
    Deadlock
    - a system state in which processes wait for events that will never occur

    [Figure: processes 1 to 4 and resources 1 to 4]

    synchronization (Cont`)
    Deadlock
    - deadlock prevention
    - deadlock avoidance
    - deadlock detection and correction

    reduction of resource allocation graph

    [Figure: four successive steps of reducing a resource allocation graph with processes P1, P2, P3 and resources R1, R2]
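
    Reduction of a resource allocation graph can be sketched as follows, assuming every resource has a single unit and every process waits for at most one resource. The array encoding (holder[], wants[]) and the function name are hypothetical, chosen only for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPROC 3                     /* P1..P3 are indices 0..2 */
#define NRES  2                     /* R1..R2 are indices 0..1 */
#define NONE  (-1)

/*
 * holder[r] = process that holds resource r, or NONE
 * wants[p]  = resource that process p is waiting for, or NONE
 * Repeatedly removes every process whose request can be satisfied and
 * releases what it holds. Returns the number of processes that could not
 * be removed: 0 means no deadlock. holder[] is modified during reduction.
 */
int reduce_graph(int holder[NRES], const int wants[NPROC])
{
    bool done[NPROC] = { false };
    int remaining = NPROC;
    bool progress = true;

    assert(holder != NULL && wants != NULL);
    while (progress) {
        progress = false;
        for (int p = 0; p < NPROC; p++) {
            int r = wants[p];
            if (done[p])
                continue;
            if (r != NONE && holder[r] != NONE && holder[r] != p)
                continue;           /* still blocked by another holder */
            for (int i = 0; i < NRES; i++)
                if (holder[i] == p)
                    holder[i] = NONE;   /* p finishes and releases */
            done[p] = true;
            remaining--;
            progress = true;
        }
    }
    return remaining;
}
```

    If the graph cannot be reduced completely, the processes that remain are exactly the deadlocked ones.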

    pipe
    - named pipe, unnamed pipe
    - pipe(fd[]), mkfifo(path, mode), mknod(path, mode, dev_t)
    - no indirect blocks in the inode
    - the inode keeps rd_pointer, wr_pointer, number of readers, number of writers
    - file types: S_IFREG, S_IFCHR, S_IFBLK, S_IFIFO

    [Figure: process 1 (write fd) and process 2 (read fd, write fd) connected through a pipe inside the kernel]

    pipe
    pipe (unnamed pipe): limits
    - cannot broadcast
    - no object boundaries
    - cannot direct data to a specific reader
    FIFO (named pipe)
    - FIFO file
    - must be explicitly deleted (unlink)
    - named: less secure than a pipe

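    pipe (Cont`)
    example of pipe : % ls -l | more

    for (;;) {
        read_command();
        parsing_command();
        pipe(fd);
        if (fork() == 0) {                  /* child: will become "more" */
            if (fork() == 0) {              /* grandchild: becomes "ls" */
                close(stdout);
                dup(fd[1]);                 /* stdout -> write end of the pipe */
                exec("ls", ...);
            }
            close(stdin);
            dup(fd[0]);                     /* stdin <- read end of the pipe */
            close(fd[1]);                   /* otherwise "more" never sees EOF */
            exec("more", ...);
        }
        close(fd[0]);
        close(fd[1]);                       /* the shell itself uses neither end */
        wait();
    }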
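
    The same pipe()/fork() pattern can be exercised without exec. The sketch below sends a string from a child process to its parent through an unnamed pipe; pipe_roundtrip() is a hypothetical helper name, and only POSIX calls are assumed:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * The child writes msg into the pipe, the parent reads it back into out.
 * Returns the number of bytes received, or -1 on error.
 */
ssize_t pipe_roundtrip(const char *msg, char *out, size_t outsz)
{
    int fd[2];
    pid_t pid;
    ssize_t n = 0, total = 0;

    assert(msg != NULL && out != NULL && outsz > 0);
    if (pipe(fd) == -1)
        return -1;

    pid = fork();
    if (pid == -1) {
        close(fd[0]);
        close(fd[1]);
        return -1;
    }

    if (pid == 0) {                         /* child: writer */
        size_t len = strlen(msg);
        close(fd[0]);                       /* read end is not needed */
        n = write(fd[1], msg, len);
        close(fd[1]);
        _exit(n == (ssize_t)len ? 0 : 1);
    }

    close(fd[1]);                           /* parent must close the write end to see EOF */
    while ((size_t)total < outsz - 1 &&
           (n = read(fd[0], out + total, outsz - 1 - (size_t)total)) > 0)
        total += n;
    out[total] = '\0';
    close(fd[0]);
    waitpid(pid, NULL, 0);                  /* reap the child */
    return total;
}
```

    Closing the unused ends matters: a reader only sees end-of-file once every write descriptor of the pipe has been closed.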

    Communication via files
    - the oldest way of exchanging data among processes
    - a race condition may occur: one process reads the data before the other has finished modifying it
    - mandatory or advisory locking
    - lockf, flock, fcntl
    - fcntl(fd, cmd, arg)
        cmd: F_GETLK, F_SETLK, ...
        arg: flock structure (l_type, l_whence, l_start, l_len, l_pid)
        l_type: F_RDLCK, F_WRLCK, F_UNLCK, F_SHLCK, F_EXLCK

    [Figure: two processes (P) sharing one file]
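
    A minimal sketch of advisory record locking with fcntl(), assuming a POSIX system. The helper names set_file_lock() and conflicting_lock_type() are hypothetical; the flock fields and the F_SETLK / F_GETLK commands are the ones listed above:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Locks (F_RDLCK, F_WRLCK) or unlocks (F_UNLCK) the whole file without blocking. */
int set_file_lock(int fd, short type)
{
    struct flock fl;

    assert(fd >= 0);
    memset(&fl, 0, sizeof fl);
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;                   /* 0 means "up to end of file" */
    return fcntl(fd, F_SETLK, &fl);
}

/*
 * Asks whether a write lock on the whole file could be placed.
 * Returns F_UNLCK if nothing conflicts, the conflicting lock type
 * otherwise, or -1 on error.
 */
int conflicting_lock_type(int fd)
{
    struct flock fl;

    memset(&fl, 0, sizeof fl);
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    if (fcntl(fd, F_GETLK, &fl) == -1)
        return -1;
    return fl.l_type;
}
```

    These locks are advisory and belong to the process: locks held by the calling process itself never show up as a conflict in F_GETLK, and a process that does not ask for a lock can still read or write the file.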

    Communication via files (Cont`)
    A deadlock scenario with file locking

    In Linux, fcntl() returns the error EDEADLOCK

    [Figure: two processes (P) and one file]

    Signal
    - register signal handler (signal catch function)
    - send signal
    - signal detection : on the state transition from kernel running to user running
    - call signal handler

    variables for signal in task structure
    - int sigpending : is signal received or not?
    - struct signal_struct *sig
    - sigset_t signal, blocked

    struct signal_struct /* sched.h */ : count, action[_NSIG], siglock
    struct sigaction /* asm-i386/signal.h */ : sa_handler, sa_flags, sa_restorer, sa_mask
    typedef struct {
        unsigned long sig[_NSIG_WORDS];
    } sigset_t; /* asm-i386/signal.h */
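
    Seen from user level, the register / send / detect / call sequence looks like the sketch below. It assumes POSIX sigaction() and raise(); the handler and the catch_own_signal() helper are hypothetical names:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t last_signal = 0;

/* signal handler (signal catch function) */
static void on_signal(int signo)
{
    last_signal = signo;
}

/*
 * Registers on_signal for signo, sends the signal to the calling process
 * and returns the signal number recorded by the handler (-1 on error).
 */
int catch_own_signal(int signo)
{
    struct sigaction sa;

    assert(signo > 0);
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_signal;      /* register signal handler */
    sigemptyset(&sa.sa_mask);       /* block nothing extra while it runs */
    sa.sa_flags = 0;
    if (sigaction(signo, &sa, NULL) == -1)
        return -1;

    last_signal = 0;
    if (raise(signo) != 0)          /* send signal */
        return -1;
    return last_signal;             /* handler has run before raise() returned */
}
```

    In a single-threaded program raise() does not return until the handler has finished, which is why the recorded value can be read immediately afterwards.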

    System V IPC
    Message, Shared Memory, and Semaphore

    Common properties
    - Key => id (cf: file name => fd)
    - In kernel, ***id_ds for System V IPC (eg: msqid_ds)
    - ipc_perm: key, uid, cuid, access mode
    - ipcs, ipcrm

    Difference
    - message : suitable for object-oriented concepts
    - shared memory : fast
    - semaphore : for user level synchronization

    System V IPC (Cont`)
    message queue
    - msqid = sys_msgget (key, flag)                      /* create */
    - sys_msgsnd (msqid, msgp, msgsz, flag)               /* send */
    - sys_msgrcv (msqid, msgp, msgsz, msgtype, flag)      /* receive */
    - sys_msgctl (msqid, cmd, msqid_ds)                   /* control */

    [Figure: sender processes (P) append msg entries to the queue headed by struct msqid_ds; receiver processes (P) remove them]

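    System V IPC (Cont`)
    struct msqid_ds : msg_perm, msg_first, msg_last, msg_stime, msg_rtime, msg_ctime, wwait_queue, rwait_queue, msg_cbytes, msg_qnum, msg_qbytes, msg_lspid, msg_lrpid
    struct msg : msg_next, msg_type, msg_spot, msg_ts

    msgtype in sys_msgrcv()
    - = 0 : receive the first msg in the queue
    - > 0 : receive the first msg of the given type in the queue
    - < 0 : receive the first msg of the lowest type that is less than or equal to the absolute value of msgtype

    ... > 0) V() operation, else P() operation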

    socket
    - socket : common interface for IPC and networking
    - Protocol family: UNIX, INET, AX25, IPX, Appletalk

    layer structure of a network (top to bottom)
    - BSD socket
    - INET
    - TCP, UDP
    - IP
    - PLIP, SLIP, ETHERNET (with ARP)
    - parallel port, serial port, Ethernet card

    socket (Cont`)
    information for communication
    - 5-tuple {protocol, local-addr, local-process, foreign-addr, foreign-process}

    C library routines
    - socket() : protocol, make socket structure
    - bind() : assign local-addr and local-process
    - connect() : foreign-addr, foreign-process
    - listen() : waiting in server
    - accept() : make connection to a client
    - read(), write()
    - send(), sendto(), recv(), recvfrom()

    cf) system call: sys_socketcall /* net/socket.c */

    socket (Cont`)
    socket structure

    struct socket {                 /* include/linux/net.h */
        state
        type
        flags
        ops                         /* proto_ops */
        sk                          /* struct sock */
        files, inodes
        next, wait, ...
    }

    struct sock { ... }             /* include/net/sock.h */

    struct proto_ops {              /* for INET operation, include/linux/net.h */
        family
        dup, release, bind, connect, accept, listen, ...
        getsockopt
        setsockopt
        sendmsg
        recvmsg
    }

    file (f_dentry, f_pos, f_op) -> socket file operations /* net/socket.c */ :
        sock_lseek, sock_read, sock_write, NULL, sock_poll, sock_ioctl, NULL, sock_no_open, ...

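    socket (Cont`)
    connection oriented protocol

    server: socket() -> bind() -> listen() -> accept() (blocks until connection from a client) -> read() -> processing request -> write()
    client: socket() -> connect() -> write() -> read()
    exchange: connect() and accept() establish the connection; data (request) goes from the client write() to the server read(); data (reply) goes from the server write() to the client read()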
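
    The server and client call sequences can be tried out inside one process over the loopback interface. This is a sketch under stated assumptions: tcp_echo_once() is a hypothetical helper, the "processing request" step is a plain echo, and the messages are small enough to arrive in a single read():

```c
#define _POSIX_C_SOURCE 200809L
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns the number of reply bytes the client received, or -1 on error. */
int tcp_echo_once(const char *request, char *reply, size_t replysz)
{
    int rc = -1, lfd = -1, cfd = -1, sfd = -1;
    struct sockaddr_in addr;
    socklen_t alen = sizeof addr;
    char buf[128];
    size_t len;
    ssize_t n;

    assert(request != NULL && reply != NULL && replysz > 0);
    len = strlen(request);
    if (len > sizeof buf)
        len = sizeof buf;

    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                              /* kernel picks a free port */

    /* server: socket(), bind(), listen() */
    if ((lfd = socket(AF_INET, SOCK_STREAM, 0)) == -1) goto out;
    if (bind(lfd, (struct sockaddr *)&addr, sizeof addr) == -1) goto out;
    if (listen(lfd, 1) == -1) goto out;
    if (getsockname(lfd, (struct sockaddr *)&addr, &alen) == -1) goto out;

    /* client: socket(), connect(); the kernel queues the connection */
    if ((cfd = socket(AF_INET, SOCK_STREAM, 0)) == -1) goto out;
    if (connect(cfd, (struct sockaddr *)&addr, sizeof addr) == -1) goto out;

    /* server: accept() takes the queued connection */
    if ((sfd = accept(lfd, NULL, NULL)) == -1) goto out;

    /* client write() -> server read() -> server write() -> client read() */
    if (write(cfd, request, len) != (ssize_t)len) goto out;
    if ((n = read(sfd, buf, sizeof buf)) <= 0) goto out;
    if (write(sfd, buf, (size_t)n) != n) goto out;
    if ((n = read(cfd, reply, replysz - 1)) < 0) goto out;
    reply[n] = '\0';
    rc = (int)n;
out:
    if (sfd != -1) close(sfd);
    if (cfd != -1) close(cfd);
    if (lfd != -1) close(lfd);
    return rc;
}
```

    Running both ends in one thread works because the kernel completes the TCP handshake for a listening socket before accept() is called; a real server and client would be separate processes.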

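    socket (Cont`)
    connectionless protocol

    server: socket() -> bind() -> recvfrom() (blocks until data received from a client) -> processing request -> sendto()
    client: socket() -> bind() -> sendto() -> recvfrom()
    exchange: data (request) goes from the client sendto() to the server recvfrom(); data (reply) goes from the server sendto() to the client recvfrom()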

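    TLI
    connection oriented protocol

    server: t_open() -> t_bind() -> t_listen() (wait for connection) -> t_accept() -> t_rcv() -> processing request -> t_snd()
    client: t_open() -> t_bind() -> t_connect() -> t_snd() -> t_rcv()
    exchange: connection request from t_connect() to t_listen(); data (request) from the client t_snd() to the server t_rcv(); data (reply) from the server t_snd() to the client t_rcv()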

    VI. I/O System (Device Driver)

    Role of a device driver
    - handle data movement between memory and peripheral devices
    - usually written by a third party

    layering (top to bottom): processes (P) -> system call interface -> kernel file system -> device driver interface (through devsw table) -> tty driver, disk driver, network driver

    Peripheral Device: General Structure
    H/W configuration
    - extremely hardware dependent
    - controller
        - CSR (Control and Status Register): the driver writes to the CSRs to issue commands to the device and reads the CSRs to obtain completion status or error conditions
        - memory mapped I/O, or special in/out instructions (eg: the 80x86 in/out instructions)
        - programmed I/O (tty, modem, printer), DMA (disk)
        - internal buffer
    - device itself

    Disk Driver
    Disk I/O handling
    - convert logical disk block number into physical sector(s)
    - handle read/write requests, handle interrupt
    - disk scheduling: FCFS, SSTF (Shortest Seek Time First), SCAN, C-SCAN
    - DMA (channel)
    - RAID
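
    The difference between the scheduling policies shows up in the total head movement. A minimal sketch comparing FCFS and SSTF on a queue of cylinder numbers; the function names and the MAX_REQ limit are assumptions made for this example:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define MAX_REQ 64

/* FCFS: serve the requests in arrival order. */
int fcfs_seek(int head, const int *req, int n)
{
    int total = 0;

    for (int i = 0; i < n; i++) {
        total += abs(req[i] - head);
        head = req[i];
    }
    return total;
}

/* SSTF: always serve the pending request closest to the current head position. */
int sstf_seek(int head, const int *req, int n)
{
    bool served[MAX_REQ] = { false };
    int total = 0;

    assert(n >= 0 && n <= MAX_REQ);
    for (int k = 0; k < n; k++) {
        int best = -1;
        for (int i = 0; i < n; i++) {
            if (served[i])
                continue;
            if (best < 0 || abs(req[i] - head) < abs(req[best] - head))
                best = i;
        }
        total += abs(req[best] - head);
        head = req[best];
        served[best] = true;
    }
    return total;
}
```

    For the request queue 98, 183, 37, 122, 14, 124, 65, 67 with the head at cylinder 53, FCFS moves the head 640 cylinders and SSTF 236. SSTF can starve requests that are far from the head, which is what SCAN and C-SCAN address.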

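    Terminal Driver
    - interactive : line discipline
    - canonical mode, raw mode (stty)
    - buffers: xbuf, rbuf, cblock
    - queues (clists): raw queue, canon queue, out queue

    [Figure: a process calls tty_read / tty_write; the tty driver moves characters between the raw, canon and out queues and the device CSR using in/out instructions and interrupts]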

    General structure of Device Driver
    - well defined entry points
    - top half, bottom half

    character device driver entry points: open, close, read, write, ioctl, mmap, in/out, intr
    block device driver entry points: open, close, strategy, size, in/out, intr

    what is the difference between a character and a block device driver?

    Device Switch Table
    devsw: table for registering the entry points of device drivers (Source : UNIX Internals)

    struct cdevsw {
        int (*d_open) ();
        int (*d_close) ();
        int (*d_read) ();
        int (*d_write) ();
        int (*d_ioctl) ();
        int (*d_mmap) ();
        int (*d_segmap) ();
        int (*d_xpoll) ();
        int (*d_xhalt) ();
        struct streamtab *d_str;
        struct ttytab *d_tty;
        ...
    } cdevsw[];

    struct bdevsw {
        int (*d_open) ();
        int (*d_close) ();
        int (*d_strategy) ();
        int (*d_size) ();
        int (*d_xhalt) ();
        ...
    } bdevsw[];

    Device Switch Table (Cont`)
    Example of switch table

    bdevsw
        hd_open    hd_close    hd_strategy
        ht_open    ht_close    ht_strategy
        cd_open    cd_close    cd_strategy

    cdevsw
        con_open   con_close   con_read   con_write   con_ioctl
        tty_open   tty_close   tty_read   tty_write   tty_ioctl
        ed_open    ed_close    ed_read    ed_write    ed_ioctl
        nulldev    nulldev     mm_read    mm_write    nulldev
        hd_open    hd_close    hd_read    hd_write    nulldev

    dev file (type, major, minor, name)
    # ls -l /dev/
    brw-r--r--   0   1    hda1
    brw-r--r--   0   2    hda2
    ...
    brw-r--r--   0   11   hdb1
    brw-r--r--   1   0    tape
    ...
    crw-r--r--   1   0    tty0
    crw-r--r--   1   1    tty1
    ...
    crw-r--r--   5   0    rhda1

    why do we access disks through character interface?

    Device Switch Table (Cont`)
    example : open
    - open("/dev/tty0", O_RDONLY)
    - proc table fd -> file table -> inode (i_dev : c, 1, 0)
    - (*cdevsw[getmajor(dev)].d_open) (dev, ...)

    cdevsw
        con_open   con_close   con_read   con_write   con_ioctl
        tty_open   tty_close   tty_read   tty_write   tty_ioctl
        ed_open    ed_close    ed_read    ed_write    ed_ioctl
        nulldev    nulldev     mm_read    mm_write    nulldev
        gd_open    gd_close    gd_read    gd_write    nulldev
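
    The dispatch through cdevsw[getmajor(dev)] can be mimicked in user space with a table of function pointers. Everything below is a hypothetical simulation: the dev number encoding (major in the high bits, minor in the low byte), the *_sim names and the return values exist only to make the dispatch visible:

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int sim_dev_t;             /* hypothetical device number */
#define SIM_MKDEV(major, minor) (((major) << 8) | (minor))
#define SIM_GETMAJOR(dev)       ((dev) >> 8)
#define SIM_GETMINOR(dev)       ((dev) & 0xffu)

struct cdevsw_sim {
    int (*d_open)(sim_dev_t dev);
    int (*d_close)(sim_dev_t dev);
};

/* Fake drivers: the return value tells which driver was entered. */
static int con_open_sim(sim_dev_t dev) { return 100 + (int)SIM_GETMINOR(dev); }
static int tty_open_sim(sim_dev_t dev) { return 200 + (int)SIM_GETMINOR(dev); }
static int nulldev_sim(sim_dev_t dev)  { (void)dev; return 0; }

static const struct cdevsw_sim cdevsw_table[] = {
    { con_open_sim, nulldev_sim },          /* major 0: console */
    { tty_open_sim, nulldev_sim },          /* major 1: tty     */
};
#define NCHRDEV (sizeof cdevsw_table / sizeof cdevsw_table[0])

/* Mirrors (*cdevsw[getmajor(dev)].d_open)(dev, ...). */
int dev_open_sim(sim_dev_t dev)
{
    unsigned int major = SIM_GETMAJOR(dev);

    if (major >= NCHRDEV)
        return -1;                          /* no driver registered for this major */
    assert(cdevsw_table[major].d_open != NULL);
    return (*cdevsw_table[major].d_open)(dev);
}
```

    The kernel never names a driver function directly: the major number selects the table row, and the minor number is passed through for the driver to interpret.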

    Device Switch Table (Cont`)
    install new device driver
    - write the new device driver and link it into the kernel: my_open(), my_read(), my_write(), my_close(), ...
    - register it in the devsw table
    - make the special file
        # mknod /dev/mydrv [b|c] major_number minor_number

    Device Switch Table (Cont`)
    control flow

    user mode: read()
    kernel: devsw table -> driver -> queue (sleep)
    device: interrupt -> IVT -> interrupt handler (wakeup)

    where does the requesting process sleep?

    STREAM
    - full-duplex data transfer and processing path
    - consists of a pair of queues

    [Figure: user application (user level) -> STREAM head -> STREAM modules -> STREAM driver (kernel level) -> hardware; each component has a write queue (W) and a read queue (R)]

    STREAM (Cont`)
    Reusable Module
    - user -> STREAM head -> TCP -> IP -> token ring
    - user -> STREAM head -> UDP -> IP -> ethernet
    Multiplexing
    - three users, each with its own STREAM head, share TCP and UDP over one IP, which feeds ATM and DQDB

    STREAM (Cont`)
    STREAM features
    - transparency among the queues
    - reusable
    - multiplexing
    - message based communication
    - virtual copying
    - STREAM scheduler : priority bands

    Part II. Detailed Study: Linux Kernel Internals

    Contents
    - why Linux?
    - where is everything (kernel source code)?
    - kernel configure and compile
    - system call implementation
    - module programming
    - some important kernel data structures

    References
    - M. Beck, H. Bohme, M. Dziadzka, U. Kunitz, R. Magnus, D. Verworner, Linux Kernel Internals, 2nd Ed, Addison-Wesley, 1997
    - Fred Butzen, Christopher Hilton, The LINUX Network, The M&T Books Slackware Series, 1998
    - Remy Card, etc, The LINUX KERNEL Book, John Wiley & Sons, 1998
    - A. Rubini, LINUX Device Drivers, O'Reilly, 1998
    - Anonymous, Maximum Linux Security (A Hacker's Guide To Protecting Your Linux Server and WS), SAMS Publishing, 1999
    - http://www.linux.org/
    - http://www.kernel.org/
    - http://kldp.org/
    - /usr/src/linux

    Why Linux?
    freely available
    - Linus Torvalds, Copyleft
    - 1991: version 0.01 (November 1999: version 2.2.13)
    - distributions: Redhat, Debian, Slackware, Alzza
    - supported by many companies
    Main characteristics
    - multi-tasking
    - multi-user access
    - multi-processor
    - supports various architectures (80x86, sparc, mips, alpha, smp, ..)
    - demand-loaded executables
    - paging
    - dynamic cache for hard disk

    Why Linux? (Cont`)
    main characteristics (cont`)
    - shared libraries
    - support for POSIX 1003.1
    - various formats for executable files
    - true 386 protected mode
    - emulation of the maths co-processor
    - support for national keyboards and fonts
    - supports diverse file systems (ext2, ..)
    - TCP/IP, SLIP, PPP
    - BSD sockets
    - System V IPC
    - Virtual Console

    Why Linux? (Cont`)
    drawbacks
    - monolithic kernel (much current research aims to micro-kernelize it)
    - not for beginners (for system programmers)
    - not well structured (performance-oriented)

    Key attraction
    - experimenting with the system (handle the kernel by yourself)
    - supported by many companies
    - free: solution business & add-on features
    - thanks to the INTERNET & GNU (special thanks to anti-MS feeling)

    Where is everything?
    Linux Operating System Structure (Source : the LINUX KERNEL book)

    user level: application
    kernel level:
    - System Calls Interface
    - Central kernel: task management, scheduler, signals, memory management, loadable modules
    - File System: ext2fs, xiafs, proc, minix, nfs, msdos, iso9660; Buffer Cache
    - Network Manager: ipv4, ethernet, ...
    - Peripheral Manager: block, character, hd, cdrom, isdn, network, scsi, pci
    - Machine Interface
    H/W level: Machine

    Where is everything? (Cont`)
    source structure
    - based on version 2.2.5
    - under development : the contents described below may change

    /usr/src/linux
        Documentation
        arch:     alpha arm m68k mips ppc sparc i386
            i386: boot kernel lib math-emu mm
        include:  asm-alpha asm-arm asm-i386 ... linux net scsi video
        init
        fs:       coda ext2 hpfs msdos nfs ntfs ... ufs
        kernel
        ipc
        lib
        mm
        net:      802 appletalk decnet ethernet ipv6 unix sunrpc x25 ...
        scripts
        drivers:  block cdrom char net pci pnp sbus scsi ... sound video

    Where is everything? (Cont`)
    main subdirectory
    - arch/
        - architecture dependent code : arch/i386, arch/alpha, ...
        - arch/i386/boot/ : bootstrapping; configure devices, memory
        - arch/i386/kernel/ : kernel entry point handling (trap/interrupt handling), context switch
        - arch/i386/mm/ : machine dependent memory management code
    - init/
        - all the functions needed to start the kernel
        - hand-made process 0 (init_task or task[0])
        - fork process 1, 2, 3, ...

    Where is everything? (Cont`)
    main subdirectory
    - kernel/ (arch/i386/kernel)
        - central section of the kernel
        - main system call implementation (fork, exit, etc.)
        - time management
        - scheduler
        - signal handling
    - mm/
        - virtual memory interface
        - paging, kernel memory management
    - fs/
        - virtual file system interface
        - implementations of the various file systems (ext2, nfs, ...)

    Where is everything? (Cont`)
    main subdirectory
    - drivers/
        - drivers for hardware components
        - drivers/block/ : block-oriented drivers (hard disks)
        - drivers/cdrom/ : proprietary CD-ROM drives
        - drivers/char/ : character-oriented drivers (serial ports, tty, modem, ..)
        - drivers/net/ : network cards
        - drivers/pci/ : PCI bus access and control
        - drivers/scsi/ : SCSI interface
        - drivers/sound/ : sound card drivers
    - ipc/
        - classical inter-process communication
        - semaphores, shared memory, message queues

    Where is everything? (Cont`)
    main subdirectory
    - net/
        - various network protocol implementations : TCP/IP, ARP, ...
        - code for sockets to the UNIX and Internet domains
    - lib/
        - some standard kernel library functions (printk)
    - modules/
        - kernel module files
        - modules can be added to the kernel later (insmod, rmmod)
    - include/
        - commonly included kernel-specific header files
        - include/asm-i386/ : architecture-dependent header files for Intel CPU
        - include/linux/ : Linux kernel internal structure (task, inode)

    Kernel Configuration and Compile
    new kernel is generated in three steps
    1. configure (Documentation/Configuration.help, see chapter 3 of The LINUX Network)
        make config (menuconfig, xconfig)
        make oldconfig
    2. depend
        make dep (make clean: optional)
    3. compile
        make zImage

    cf) - make zdisk (# dd bs=8192 if=$(BOOTIMAGE) of=/dev/fd0)
        - make zlilo (# cat $(BOOTIMAGE) > $(INSTALL_PATH)/vmlinuz), /etc/lilo.conf
        - # mkbootdisk --device /dev/fd0 zImage

    Add New System Call
    System Call : Control flow in Linux

    user process
    - do system call
    libc.a
    - push args
    - save system call number
    - make trap
    Kernel
    - idt_table /* arch/i386/kernel/traps.c */ : catch trap through IDT
    - system call handler, system_call() /* arch/i386/kernel/entry.S */ : call real handler function using sys_call_table
    - sys_call_table /* arch/i386/kernel/entry.S */
    - real system call function

    Add New System Call (Cont`)
    IDT (Interrupt Descriptor Table)
    - define : include/asm-i386/desc.h, arch/i386/kernel/traps.c, irq.h
    - constructed during kernel initialization /* arch/i386/kernel/traps.c, irq.c */

    idt_table
        0x0     divide_error(), debug(), nmi(), ..., segment_not_present(), ..., page_fault(), ...     common trap handlers for 80x86
        0x20    FIRST_EXTERNAL_VECTOR: timer_interrupt(), hd_interrupt(), ...                            device interrupt handlers (IRQ)
        0x80    SYSCALL_VECTOR: system_call()
        0xff    last entry

    Add New System Call (Cont`)
    sys_call_table
    - syscall number : include/asm-i386/unistd.h
        #define __NR_exit     1
        #define __NR_fork     2
        #define __NR_read     3
        ...
        #define __NR_vfork    190
    - sys_call_table : arch/i386/kernel/entry.S
        ENTRY(sys_call_table)
            .long SYMBOL_NAME(sys_ni_syscall)     /* 0 */
            .long SYMBOL_NAME(sys_exit)           /* 1 */
            .long SYMBOL_NAME(sys_fork)           /* 2 */
            .long SYMBOL_NAME(sys_read)           /* 3 */
            ...
            .long SYMBOL_NAME(sys_vfork)          /* 190 */
            .rept NR_syscalls-190

    table layout: entry 0 sys_ni_syscall(), then sys_exit(), sys_fork(), sys_read(), sys_write(), ..., entry 190 sys_vfork(), ..., up to entry 255

    Add New System Call (Cont`)
    put them altogether : example of fork

    user process
        main() { ... fork() ... }
    libc.a
        fork() {
            ...
            movl $2, %eax
            int $0x80
            ...
        }
    Kernel
        IVT: 0x0 divide_error(), debug(), nmi(), ...; 0x80 system_call()
        ENTRY(system_call) /* entry.S */
            SAVE_ALL
            ...
            call *SYMBOL_NAME(sys_call_table)(,%eax,4)
            ...
        sys_call_table: 1 sys_exit(), 2 sys_fork(), 3 sys_read(), 4 sys_write(), ...
        sys_fork() /* arch/i386/kernel/process.c */ /* kernel/fork.c */

    Add New System Call (Cont`)
    Syntax of real system call handler in Linux

    asmlinkage int sys_fork(regs)               /* arch/i386/kernel/process.c */
    {
        return do_fork(..);
    }

    int do_fork(..)                             /* kernel/fork.c */
    {
        ...                                     /* create new process */
    }

    asmlinkage int sys_read(fd, buf, count)     /* fs/read_write.c */
    {
        ...                                     /* read data */
    }

    Add New System Call (Cont`)
    Example: add new system call1 (too simple example)
    1. kernel modification
    1-1. allocate syscall number : include/asm-i386/unistd.h
        #define __NR_exit         1
        ...
        #define __NR_vfork        190
        #define __NR_mysyscall    191

    1-2. register in sys_call_table : arch/i386/kernel/entry.S
        ENTRY(sys_call_table)
            ...
            .long SYMBOL_NAME(sys_mysyscall)      /* 191 */
            .rept NR_syscalls-191

    Add New System Call (Cont`)
    1-3. coding new system call handler
        asmlinkage int sys_mysyscall()
        {
            printk("Hello Linux, I'm in Kernel\n");
            return 0;
        }

    1-4. kernel rebuild
    - if you add a new source file, the make utility must be told about it
    - eg) for kernel/test.c, modify the following field in the Makefile of the kernel directory
        O_OBJS = sched.o dma.o fork.o ... capability.o test.o

    Add New System Call (Cont`)
    2. make user program with new system call
    2-1. make user program
        #include <linux/unistd.h>
        _syscall0(int, mysyscall);
        main() {
            int i;
            i = mysyscall();
        }

    2-2. make library if possible
        # ar, ranlib

    Just Do It ()

    #define _syscall0(type, name) \
    type name(void) \
    { \
        long __res; \
        __asm__ volatile ("int $0x80" \
            : "=a" (__res) \
            : "0" (__NR_##name)); \
        __syscall_return(type, __res); \
    } /* include/asm-i386/unistd.h */
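
    On current Linux systems the _syscallN macros are no longer provided to user programs; the usual way to issue a system call by number is the syscall() library function. A minimal sketch, assuming glibc on Linux; raw_getpid() is a hypothetical wrapper name, and an already existing call (getpid) is used because the new call from the example only exists in a modified kernel:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Invokes the getpid system call by number instead of through its libc wrapper. */
long raw_getpid(void)
{
    long pid = syscall(SYS_getpid);     /* the number is placed in a register, then the kernel is entered */

    assert(pid > 0);
    return pid;
}
```

    A call added to the kernel as in the example could be reached the same way by passing its number, for instance syscall(191) for the number allocated above, provided the running kernel was built with that entry.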

    Add New System Call (Cont`)
    add new system call2 : arguments passing
    1. kernel modification
    1-1 #define __NR_show_mult 192

    1-2 .long SYMBOL_NAME(sys_show_mult)      /* 192 */
        .rept NR_syscalls-192

    1-3 asmlinkage int sys_show_mult(int x, int y, int *res)
        {
            int error, compute;

            /* verify_area(), put_user(): include/asm-i386/uaccess.h */
            if ((error = verify_area(VERIFY_WRITE, res, sizeof(*res))))
                return error;
            compute = x * y;
            put_user(compute, res);
            return (0);
        }
        cf) copy_to_user(), copy_from_user() /* include/asm-i386/uaccess.h */

    Add New System Call (Cont`)
    add new system call2 : arguments passing
    2-1. make user program
        #include <linux/unistd.h>
        _syscall3(int, show_mult, int, x, int, y, int *, result);
        main() {
            int ret = 0;
            show_mult(2, 5, &ret);
            printf("Result : %d * %d = %d\n", 2, 5, ret);
        }

    expansion of _syscall3 /* include/asm-i386/unistd.h */
        int show_mult(int x, int y, int *result) {
            long __res;
            __asm__ volatile ("int $0x80"
                : "=a" (__res)
                : "0" (__NR_show_mult), "b" ((long) (x)), "c" ((long) (y)), "d" ((long) (result)));
            if (__res >= 0)
                return (int) __res;
            errno = -__res;
            return -1;
        }

    Add New System Call (Cont`)
    add new system call3 : some general system calls

    getpid
        asmlinkage int sys_getpid() {
            return current->pid;
        }
    nice
        asmlinkage int sys_nice(new_priority) {
            ...
            current->priority = new_priority;
        }
    pause
        asmlinkage int sys_pause() {
            current->state = TASK_INTERRUPTIBLE;
            schedule();
        }

    task list
    - NR_TASKS: total number of concurrent tasks
    - all tasks are connected in a doubly linked list (next_task, next_run)
    - global variables: init_task, current
    - task[0]: init_task, task[1]: init process

    Add New System Call (Cont`)
    fork
    sys_fork() /* arch/i386/kernel/process.c */ calls do_fork() /* kernel/fork.c */
    - p = alloc_task_struct()
    - task structure initialize
    - copy_mm(), ...
    - copy_thread()
    - wake_up_process(p)
    - return (p->pid)
    copy_thread() /* arch/i386/kernel/process.c */
    - ...
    - p->tss.eax = 0;
    - p->tss.eip = ret_from_fork;
    wake_up_process() /* kernel/sched.c */
    - add_to_runqueue(p);
    - current->need_resched = 1
    ret_from_sys_call() /* arch/i386/kernel/entry.S */ calls schedule() /* kernel/sched.c */
    - if (schedule parent) ... else (schedule child) ...

    Add New System Call (Cont`)
    exit
    sys_exit() /* kernel/exit.c */ calls do_exit() /* kernel/exit.c */
    - ...
    - handling each child process
    - current->state = TASK_ZOMBIE
    - schedule()
    functions called from do_exit()
    - notify_parent() /* kernel/signal.c */
    - sem_exit()
    - exit_mmap()
    - free_page_tables()
    - exit_files()
    - exit_thread()
    - ...

    Add New System Call (Cont`)
    Project II: add a new system call to get kernel information
    - we want to know the process id, state, process execution time (system time and user time separately), the number of page faults, the number of open files, and so on

    1. kernel modification
        asmlinkage int sys_process_statistics(...)
        {
            ...
            current->pid, min_flt, maj_flt, times.tms_utime, times.tms_stime
            ...
        }

    2. user program

    Motivation of Module in LINUX
    why do we use modules?
    - Linux is a monolithic kernel
    - trivial modifications require the kernel to be recompiled
    - the kernel keeps growing in size as new features are added
    - many components occupy permanent space in memory though they are rarely used (eg: backup tape driver)

    module: steps toward micro-kernelized Linux
    - small and compact kernel
    - clean kernel
    - rapid kernel
    - solution business: components-based Linux

    What can be Modules ?
    - possibly anything
    - in the current version (kind : registration functions : typical operations)
        file system                : register_filesystem, unregister_filesystem     : read_super, put_super
        block device driver        : register_blkdev, unregister_blkdev             : open, release
        character device driver    : register_chrdev, unregister_chrdev             : open, release
        network device driver      : register_netdev, unregister_netdev             : open, close
        exec domain                : register_exec_domain, unregister_exec_domain   : load_binary, personality
        binary format              : register_binfmt, unregister_binfmt             : load_binary
        ...
    cf: /lib/modules/x.x.x/*.o

    How to manipulate modules?
    - compilation
        # gcc -D__KERNEL__ -D_LINUX -DMODULE -c new_module.c
    - insmod, lsmod, rmmod
    - kerneld: for on-demand loading
        eg: mount -t msdos /dev/fd0 /mnt => transparently loads the fat & msdos modules

    kernel configuration
        Enable loadable module support (CONFIG_MODULES) [Y/n/?]
        MSDOS fs support (CONFIG_MSDOS_FS) [M/n/y/?]

    # insmod fat
    # lsmod
    Module:    #pages:    Used by
    fat        6          0
    # rmmod fat

    How to implement modules?
    Module
    - two basic interfaces: init_module(), cleanup_module()

    [Figure: insmod calls init_module(), which registers the module with the kernel through register_filesystem(), register_blkdev(), register_netdev() or sock_register(); rmmod calls cleanup_module()]

    How to implement modules? (Cont`)
    example1 : Hello world!!

    /* hello.c */
    #include <linux/module.h>
    #include <linux/kernel.h>

    int init_module() {
        printk("Hello world!! - I'm in kernel\n");
        return 0;
    }

    void cleanup_module() {
        printk("Bye world - I'm in kernel\n");
    }

    # gcc -D__KERNEL__ -D_LINUX -DMODULE -c hello.c
    # insmod hello.o
    # rmmod hello

    How to implement modules? (Cont`)
    example2 : simple device driver

    /* time.c */
    #include <linux/module.h>
    #include <linux/fs.h>
    #define HOUR_MAJOR 60
    #define HOUR_MINOR 0

    struct file_operations time_fops = {
        NULL, time_read, NULL, NULL, NULL, NULL, NULL, time_open, NULL, NULL
    };

    int time_init() {
        register_chrdev(HOUR_MAJOR, "time", &time_fops);
        printk("time module loaded (major=%d)\n", HOUR_MAJOR);
        return 0;
    }

    int time_read(fd, buf, size) {
        ...
        copy_to_user(buf, CURRENT_TIME, ...);   /* destination (user buffer) comes first */
    }

    int time_open(..) {
    }

    int init_module() {
        return time_init();
    }

    void cleanup_module() {
        unregister_chrdev(HOUR_MAJOR, "time");
        printk("time module unloaded\n");
    }

    How to implement modules? (Cont`)
    example2 : simple device driver

    # gcc -D__KERNEL__ -D_LINUX -DMODULE -c time.c
    # mknod /dev/time c 60 0
    # insmod time
    # lsmod
    Module:    #pages:    Used by:
    time       1
    # cat /dev/time             /* print current time */
    # rmmod time

    how can the cat command invoke the time_read() function ?

    How to implement modules? (Cont`)
    example2 : simple device driver
    registration flow: init_module -> time_init() -> register_chrdev() (cf. register_blkdev())
    register_chrdev(HOUR_MAJOR, "time", &time_fops);    /* major numbers: include/linux/major.h */
    - chrdevs[major].name = "time"
    - chrdevs[major].fops = &time_fops

    How to implement modules? (Cont`)
    example2 : simple device driver open

    sys_open() -> filp_open() -> open_namei() /* fs/namei.c */
    - get_unused_fd()
    - fd_install(fd, f)
    - struct file initialize
    - f->f_op->open(): one of pipe_open(), socket_open(), nfs_open(), blkdev_open(), chrdev_open()
    chrdev_open() /* fs/devices.c */
    - filp->f_op = get_chrfops(MAJOR(inode->i_rdev));    /* filp->f_op = chrdevs[major].fops */
    - filp->f_op->open;                                  /* time_open() */

    How to implement modules? (Cont`)
    example2 : simple device driver read

    sys_read() /* fs/read_write.c */
    - f->f_op->read: one of block_read() /* fs/block_dev.c */, nfs_read(), time_read(), tty_read(), pipe_read()

    How to implement modules? (Cont`)
    example3 : system call wrapper

    #include #include #include #include #include

    extern void *sys_call_table[];
    int uid;
    asmlinkage int (*original_call) (const char *, int, int);
    asmlinkage int (*getuid_call) ();
    asmlinkage int our_sys_open(const char *fname, int flags, int mode);

    int init_module() {
        original_call = sys_call_table[__NR_open];
        sys_call_table[__NR_open] = our_sys_open;
        printk("Spying on UID: %d\n", uid);
        getuid_call = sys_call_table[__NR_getuid];
        return 0;
    }

    void cleanup_module() {
        if (sys_call_table[__NR_open] != our_sys_open)
            printk("the open entry was changed by somebody else\n");
        sys_call_table[__NR_open] = original_call;      /* always restore */
    }

    How to implement modules? (Cont`)
    example3 : system call wrapper

    asmlinkage int our_sys_open(const char *fname, int flags, int mode) {
        int i = 0;
        char ch;

        if (uid == getuid_call()) {
            printk("opened file by %d: ", uid);
            do {
                get_user(ch, fname + i);                /* copy one byte from user space */
                i++;
                printk("%c", ch);
            } while (ch != 0);
            printk("\n");
        }
        return original_call(fname, flags, mode);
    }

    How to implement modules? (Cont`)
    example4 : new file system
    - design super block
    - program file operations, program inode operations
    - registering : register_filesystem()
    - mount

    struct file_system_type {
        struct super_block *(*read_super) ();
        char *name;
        int requires_dev;
        struct file_system_type *next;
    } *file_system;

    #ifdef CONFIG_MINIX_FS
        register_filesystem(&(struct file_system_type) {minix_read_super, "minix", 1, NULL});
    #endif

    How to implement modules? (Cont`)
    Project III
    implement your own modules
    - make file operations
    - make module interface
    - make driver
    - mknod (use pseudo device such as memory)

    mydrv: mydrv_open(), mydrv_release(), mydrv_read(), mydrv_write(), mydrv_ioctl(), mydrv_init(), mydrv_interrupt(), mydrv_out()
    module interface: init_module(), cleanup_module()

    How to implement modules? (Cont`)
    system call for modules
    - create_module
        - memory allocation for module (return load address)
        - a new element for module_list
    - init_module
        - physical loading of the requested module (module functions become an integral part of kernel)
        - relocating module functions and resolving references to kernel symbols
        - call module specific init_module function
    - delete_module
    - get_kernel_syms
        - to get kernel symbols

    How to implement modules? (Cont`)
    Kernel data structure for create_module()

    [Figure: module_list points to a linked list of module entries (next, ref, symtab, name, size, ...); each entry points to the symbol table for this module and to its references]

    Control flow of FS system call

    file access under Linux /* include/linux/sched.h, fs.h */

    task structure: fs, files, ...
    - fs -> fs_struct: count, umask, *root, *pwd (root and pwd point to inodes)
    - files -> files_struct: count, close_on_exec, fd[0], fd[1], ..., fd[255]
    - each fd[] entry -> file -> inode, file operation routines

    why do we need the file data structure ?

    Control flow of FS system call (Cont`)
    Why do we need file data structure
    => to support various types of files with a single coherent interface

    open

    sys_open() /* fs/open.c */ -> filp_open() /* fs/open.c */ -> open_namei() /* fs/namei.c */
    - get_unused_fd()
    - fd_install(fd, f)
    - struct file initialize
    - f->f_op->open() /* to support various file */

    Control flow of FS system call (Cont`)
    struct file /* include/linux/fs.h */
    - f_next, f_prev
    - f_dentry      /* to access inode */
    - f_op
    - f_mode        /* access type */
    - f_pos         /* file offset */
    - f_count       /* reference count */
    - f_flags
    - f_reada, f_ramax, ...

    f_op -> struct file_operations /* include/linux/fs.h */ :
        lseek(), read(), write(), readdir(), poll(), ioctl(), mmap(), open(), flush(), release(), fsync(), fasync(), ...

    file operation example (entries in the order above)
    - fs/ext2/file.c : ext2_file_lseek, generic_file_read, ext2_file_write, NULL, NULL, ext2_file_ioctl, generic_file_mmap, NULL, ...
    - fs/ufs/file.c : ufs_file_lseek, generic_file_read, ufs_file_write, NULL, NULL, NULL, generic_file_mmap, NULL, ...
    - fs/nfs/file.c : NULL, nfs_file_read, nfs_file_write, NULL, NULL, NULL, nfs_file_mmap, nfs_file_open, ...
    - fs/pipe.c : pipe_lseek, pipe_read, pipe_write, NULL, pipe_poll, pipe_ioctl, NULL, pipe_rdwr_open, ...
    - fs/devices.c : NULL, NULL, NULL, NULL, NULL, NULL, NULL, blkdev_open, ...
    - net/socket.c : sock_lseek, sock_read, sock_write, NULL, sock_poll, sock_ioctl, NULL, sock_no_open, ...

    where is create()?

    Control flow of FS system call (Cont`)
    open
    - System call layer: sys_open() /* fs/open.c */
    - VFS layer: filp_open() /* fs/open.c */, open_namei() /* fs/namei.c */
        - get_unused_fd()
        - fd_install(fd, f)
        - struct file initialize
        - f->f_op->open()
    - Specific File layer: pipe_rdwr_open(), blkdev_open(), chrdev_open(), nfs_file_open(), sock_no_open(); iget(), bread()

    Control flow of FS system call (Cont`)
    read
    - System call handling layer: sys_read() /* fs/read_write.c */
    - VFS layer: f->f_op->read
    - Specific File layer: generic_file_read() /* mm/filemap.c */, nfs_file_read(), sock_read(), block_read(), tty_read(), pipe_read()
    - generic_file_read()
        - try to find page in page cache, if (hit) OK.
        - get_free_page()
        - inode->i_op->readpage()

    Control flow of FS system call (Cont`)
    inode structure in Linux /* include/linux/fs.h, ext2_fs_i.h */

    task.fd[] -> file (f_dentry, f_pos, f_op) -> dentry (d_inode) -> inode

    inode
    - i_ino, i_dev, i_count, i_mode, i_nlink, i_uid, gid, i_atime, ...
    - i_rdev -> device driver
    - i_op -> inode operation routines
    - File specific information: i_data[15], i_flags, i_...

    Control flow of FS system call (Cont`)
    inode operation example

    i_op -> inode operations /* include/linux/fs.h */ :
        def_file_operation, create(), lookup(), link(), unlink(), symlink(), mkdir(), rmdir(), mknod(), rename(), readlink(), followlink(), readpage(), writepage(), bmap(), truncate(), ...

    - fs/ufs/file.c : ufs_file_operations, NULL, NULL, NULL, NULL, ..., generic_readpage, NULL, ufs_bmap, ...
    - fs/ext2/file.c : ext2_file_operations, NULL, NULL, NULL, NULL, ..., generic_readpage, NULL, ext2_bmap, ...
    - fs/nfs/file.c : nfs_file_operations, NULL, NULL, NULL, NULL, ..., nfs_readpage, nfs_writepage, NULL, ...
    - fs/pipe.c : rdwr_pipe_fops, NULL, NULL, NULL, NULL, ...
    - fs/dos/files.c : dos_file_operations, NULL, NULL, NULL, NULL, dos_readpage, dos_writepage, NULL, ...
    - fs/devices.c : def_blk_fops, NULL, NULL, NULL, NULL, ...

    Control flow of FS system call (Cont`)
    read
    - System call handling layer: sys_read() /* fs/read_write.c */
        - f->f_op->read
    - VFS layer / Specific File layer: generic_file_read() /* mm/filemap.c */, sock_read(), block_read(), tty_read(), pipe_read()
        - try to find page in cache, if (hit) OK.
        - inode->i_op->readpage()
    - Specific FS layer: generic_readpage() /* fs/buffer.c */, nfs_readpage(), dos_readpage(), coda_readpage()
        - ext2_bmap() /* fs/ext2/inode.c */, ufs_bmap() /* fs/ufs/inode.c */
    - Device Driver layer: ll_rw_block() /* drivers/block/ll_rw_blk.c */ -> hd_request /* drivers/block/hd.c */

    Device Driver Implementation in Linux
    data structure
    - blkdevs, chrdevs for devsw
    - blk_dev_struct for block driver only

    struct device_struct {          /* fs/devices.c */
        name;
        fops;                       /* file_operations: lseek, read, write, readdir, poll, ioctl, mmap, open, flush, release, fsync, fasync, ... */
    } chrdevs[], blkdevs[];

    struct blk_dev_struct {         /* include/linux/blkdev.h */
        request_fn;
        queue;
        request;
        ...
    } blk_dev[];

    Driver Implementation in Linux (Cont`)
    data structure (cont`)

    [Figure: chrdevs[] entries (name, fops) point to file_operations; a blk_dev entry (request_fn, current_request) points to a linked list of request structures (rq_status, rq_dev, cmd, sem, bh, tail, next); each request points to buffer_head structures (b_dev, b_blocknr, b_state, b_count, b_size, ..., b_next, b_data)]

    Driver Implementation in Linux (Cont`)
    Example of structure of driver: IDE disks /* drivers/block/hd.c */

    functions: hd_open(), hd_release(), hd_ioctl(), hd_init(), hd_request(), hd_interrupt(), hd_out(), check_status()

    struct file_operations hd_ops :
        NULL, block_read, block_write, NULL, NULL, hd_ioctl, NULL, hd_open, NULL, hd_release, block_fsync

    Driver Implementation in Linux (Cont`)
    major number /* include/linux/major.h */

    Major   Character devices           Block devices
    0
    1       mem                         RAM disk
    2                                   floppy (fd*)
    3                                   IDE hard disk (hd*)
    4       terminal
    5       terminal & AUX
    6       Parallel Interface
    7       virtual console (vcs*)
    8                                   SCSI hard disk (sd*)
    9       SCSI tapes (st*)
    23                                  Mitsumi CD-ROM (mcd*)
    ...

    Driver Implementation in Linux (Cont`)
    initialization of disk driver
    init process / init_module -> hd_init() /* drivers/block/hd.c */
    - register_blkdev(HD_MAJOR, "hd", &hd_fops);     /* major numbers: include/linux/major.h */
    - blk_dev[HD_MAJOR].request_fn = hd_request
    register_blkdev() /* fs/devices.c */
    - blkdevs[major].name = device name
    - blkdevs[major].fops = fops

    Driver Implementation in Linux (Cont`)disk driver open

    filp_open() /* fs/open.c */ sys_open()/* fs/open.c */ open_namei()/* fs/namei.c */- get_unused_fd()- fd_install(fd, f)- struct file initialize- f->f_op->open()/* fs/device.c */- filp->f_op = get_blkfops(MAJOR (inode->i_rdev)); /* filp->f_op = blkdevs[major].fops */- filp->f_op->open; /* hd_open */pipe_open()socket_open()nfs_open()chrdev_open()blkdev_open() hd_open()/* driver/block/hd.c */

    Driver Implementation in Linux (Cont`)
    disk driver read

    sys_read()   /* fs/read_write.c */
    - f->f_op->read: one of nfs_read(), generic_file_read() /* mm/filemap.c */, tty_read(), pipe_read(), block_read()
    block_read()   /* fs/block_dev.c */
    - getblk();   /* buffer header */
    ll_rw_block()   /* driver/block/ll_rw_blk.c */
    - request structure initialize
    - make_request()
    - add_request()
    - call blk_dev[major].request_fn
    hd_request()   /* driver/block/hd.c */
    - hd_out()

    Driver Implementation in Linux (Cont`)
    - queue and requests (similar to message queue)
    - requests are sorted by sector number
    - inb, outb

    struct blk_dev_struct { request_fn; queue; request; ... } blk_dev[];   /* include/linux/blkdev.h */
    struct request { rq_status; rq_dev; cmd /* R/W */; error; sector, nr_sector; buffer, bh; sem; next; ... }

    [Figure: buffer cache -> bread / block_read -> ll_rw_block -> make_request -> queue (req, req, req) -> request_fn -> hd_request -> do I/O, inside the block device driver]
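    The "sorted by sector number" rule can be sketched as an ordered insert into a singly linked list. This is a simplified illustration: the reduced struct request and the _sketch function names are assumptions, and the real queueing code handles more cases (for example, a request that is already being serviced).

```c
#include <assert.h>
#include <stddef.h>

/* Reduced stand-in for struct request: only the fields the sort needs. */
struct request {
    unsigned long sector;    /* first sector of the transfer */
    struct request *next;    /* next request in the queue */
};

/* Insert req so that the list stays in ascending sector order. */
void add_request_sketch(struct request **head, struct request *req)
{
    struct request **pp = head;

    assert(head != NULL && req != NULL);
    while (*pp != NULL && (*pp)->sector <= req->sector)
        pp = &(*pp)->next;
    req->next = *pp;
    *pp = req;
}

/* Remove and return the request at the head, or NULL if the queue is empty. */
struct request *next_request_sketch(struct request **head)
{
    struct request *req = *head;

    if (req != NULL) {
        *head = req->next;
        req->next = NULL;
    }
    return req;
}
```

    Requests queued for sectors 300, 100 and 200 come back out as 100, 200, 300, so the disk head moves in one direction instead of seeking back and forth.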

    Driver Implementation in Linux (Cont`)
    various disks and partitions: gendisk

    struct gendisk: major, name, minor_shift, max_p, part, real_devices, next

    [Figure: gendisk_head -> gendisk (major 8, name sd) -> gendisk (major 3, name ide0); part points to an array of hd_struct (start_sect, nr_sects)]
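    The role of minor_shift can be sketched as follows: the high bits of the minor number select the drive and the low bits select the partition. The function name decode_minor, the struct disk_pos, and the shift values used in the usage note are assumptions for illustration.

```c
#include <assert.h>

/* Position of a block device inside a gendisk partition table. */
struct disk_pos {
    unsigned int drive;      /* which physical disk under this major */
    unsigned int partition;  /* 0 = whole disk, 1.. = partitions */
};

/* Split a minor number using the gendisk minor_shift field. */
struct disk_pos decode_minor(unsigned int minor, unsigned int minor_shift)
{
    struct disk_pos pos;

    assert(minor_shift < 8);  /* minors are 8 bits wide in this sketch */
    pos.drive = minor >> minor_shift;
    pos.partition = minor & ((1u << minor_shift) - 1u);
    return pos;
}
```

    Assuming a shift of 6 for IDE disks, minor 65 decodes to drive 1, partition 1; assuming a shift of 4 for SCSI disks, minor 18 decodes to drive 1, partition 2.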

    Driver Implementation in Linux (Cont`)
    tty driver: register_chrdev()

    called from: init process, init_module
    tty_init()   /* driver/char/tty_io.c */
    - register_chrdev(TTY_MAJOR, "tty", &tty_fops);   /* TTY_MAJOR: include/linux/major.h */
    register_chrdev()   /* fs/devices.c */
    - chrdevs[major].name = device name
    - chrdevs[major].fops = fops

    tty_fops = { tty_lseek, tty_read, tty_write, NULL, tty_poll, tty_ioctl, NULL, tty_open, NULL, tty_release, NULL, tty_fasync }   /* driver/char/tty_io.c */

    Driver Implementation in Linux (Cont`)
    Example of network driver: 3c509
    - different from disk and tty drivers
    - does not interface directly with VFS

    driver functions   /* driver/net/3c509.c */: el3_init(), el3_open(), el3_stop(), el3_release(), el3_start_xmit(), el3_out(), el3_interrupt()
    upper layer: ip_output(), ip_rcv()

    Driver Implementation in Linux (Cont`)
    Example of network driver: 3c509

    struct device {   /* include/linux/netdevice.h */
        name
        mem_end, mem_start
        base addr   /* port number */
        irq
        init, destructor
        ...
        device_addr
        qdisc   /* sk_buff */
        ...
        open, stop
        hard_start_xmit, hard_header
    }

    el3_open()
    - request_irq(dev->irq, el3_interrupt ...

    init_module() in 3c509   /* driver/net/3c509.c */   /* register_netdev() */
    - init port, irq, make dev structure
    - dev->init = el3_init
    - dev->open = el3_open
    - dev->hard_start_xmit = el3_start_xmit
    - ...

    Task Scheduling
    LINUX scheduling
    - clock tick is 10 msec, time quantum is 10 clock ticks
    - supports REAL-TIME tasks

    variables for scheduling in task structure   /* include/linux/sched.h */
    - p_policy: task type (SCHED_FIFO, SCHED_RR, SCHED_OTHER)
    - p_priority: set to DEF_PRIORITY (20); can be changed using sys_nice() or sys_setpriority()
    - p_counter: decreased on each clock tick; reset to priority when the counters of all tasks are zero
    - need_resched: rescheduling is needed on return from a syscall or interrupt
    - rt_priority: set using the sched_setscheduler(pid, policy, sched_param) system call; used for real-time tasks (static priority)

    Task Scheduling (Cont`)
    schedule() function   /* kernel/sched.c */

    entered via: need_resched, sleep_on
    schedule()
    - schedule real time task first (rt_priority)
    - select the task which has the highest value of counter + priority (using goodness function)
      - give advantage to the task which ran on this_cpu
      - give slight advantage to the task which has the same mm object
    - if (p_counter == 0) for all tasks, p_counter = p_priority
    - context switch: switch_to(current, next)   /* arch/i386/kernel/process.c */

    Task Scheduling (Cont`)
    Example of scheduling: 3 tasks

    time (millisecond)   T1 p_pri  T1 p_count   T2 p_pri  T2 p_count   T3 p_pri  T3 p_count
    0                    20        20           20        20           20        20
    10                   20        10           20        20           20        20
    20                   20        10           20        10           20        20
    30                   20        10           20        10           20        10
    40                   20        0            20        10           20        10
                         20        0            20        0            20        10
                         20        20           20        20           20        20
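    The selection rule from the schedule() slide can be sketched in user space. This is a simplified model, not the kernel function: it ignores real-time tasks and the this_cpu and mm bonuses, and treating a task with a zero counter as weight 0 is an assumption about how the goodness function excludes exhausted tasks. The names goodness_sketch and pick_next are hypothetical.

```c
#include <assert.h>

#define NTASKS 3

struct task {
    int priority;  /* p_priority: quantum in ticks */
    int counter;   /* p_counter: ticks left in the current quantum */
};

/* Weight of a task: 0 once its quantum is used up, else counter + priority. */
int goodness_sketch(const struct task *p)
{
    if (p->counter == 0)
        return 0;
    return p->counter + p->priority;
}

/* Return the index of the task to run next. When every counter is zero,
 * recharge all counters from the priorities and select again. */
int pick_next(struct task tasks[], int n)
{
    int pass, i;

    assert(n > 0);
    for (pass = 0; pass < 2; pass++) {
        int best = -1, best_weight = 0;

        for (i = 0; i < n; i++) {
            int w = goodness_sketch(&tasks[i]);
            if (w > best_weight) {
                best_weight = w;
                best = i;
            }
        }
        if (best >= 0)
            return best;
        for (i = 0; i < n; i++)
            tasks[i].counter = tasks[i].priority;
    }
    return 0;  /* only reached if every priority is zero */
}
```

    With three tasks at priority 20, the task whose counter has dropped loses to the others, and once all counters reach zero every task is recharged to 20.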

    Signal
    - a mechanism for notifying a process of an asynchronous event
    - types of signal: SIGKILL, SIGINT, SIGBUS, SIGUSR1, ...
    - action: abort, exit, ignore, stop, user-level catch function

    What is the difference among interrupt, trap, and signal?

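    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    void sig_handler(int signo)
    {
        signal(SIGUSR1, sig_handler);            /* reinstall */
        printf("received signal %d\n", signo);   /* handle the signal */
        /* ... */
    }

    int main(void)
    {
        signal(SIGUSR1, sig_handler);            /* install the handler */
        /* ... */
        for ( ; ; )
            pause();                             /* sleep until a signal arrives */
    }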

    Signal (Cont`)
    - register signal handler (signal catch function)
    - send signal
    - signal detection: state transition from kernel running to user running
    - call signal handler

    variables for signal in task structure
    - int sigpending: is a signal received or not?
    - struct signal_struct *sig
    - sigset_t signal, blocked

    struct signal_struct { count; action[_NSIG]; siglock; }   /* sched.h */
    struct sigaction { sa_handler; sa_flags; sa_restorer; sa_mask; }   /* asm-i386/signal.h */
    typedef struct { unsigned long sig[_NSIG_WORDS]; } sigset_t;   /* asm-i386/signal.h */

    Signal (Cont`)
    register signal catch function

    sys_signal(sig, handler)   /* kernel/signal.c */
    - do_sigaction(sig, new_sa, old_sa)

    [Figure: task (sig, signal, blocked, sigpending) -> signal_struct (count, action[_NSIG], siglock) -> sigaction (sa_handler, sa_flags, sa_restorer, sa_mask); signal and blocked are sigset_t bitmaps, bits 63..0]

    Signal (Cont`)
    send signal

    sys_kill(pid, sig)   /* kernel/signal.c */
    - kill_proc_info(sig, info, pid)
    - send_sig_info(sig, info, *t)
      - sigaddset(t->signal, sig);
      - t->sigpending = 1;

    [Figure: task (sig, signal, blocked, sigpending), signal_struct, sigaction and sigset_t bitmaps (bits 63..0); the sent signal sets one bit in task.signal]
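    The two assignments in send_sig_info() can be sketched with a small bitmap type. This is a user-space illustration under assumptions: 64 signals, signal numbers starting at 1, and names ending in _sketch that are not kernel identifiers. It follows the slide in setting sigpending unconditionally.

```c
#include <assert.h>
#include <limits.h>

#define NSIG_SKETCH   64
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define NSIG_WORDS    ((NSIG_SKETCH + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* Bitmap with one bit per signal number, like sigset_t. */
typedef struct {
    unsigned long sig[NSIG_WORDS];
} sigset_sketch_t;

/* The signal-related fields of the task structure. */
struct task_sketch {
    int sigpending;
    sigset_sketch_t signal;
    sigset_sketch_t blocked;
};

/* Set the bit for signal number sig (1-based). */
void sigaddset_sketch(sigset_sketch_t *set, int sig)
{
    assert(sig >= 1 && sig <= NSIG_SKETCH);
    set->sig[(sig - 1) / BITS_PER_WORD] |= 1UL << ((sig - 1) % BITS_PER_WORD);
}

/* Return 1 if the bit for sig is set, else 0. */
int sigismember_sketch(const sigset_sketch_t *set, int sig)
{
    assert(sig >= 1 && sig <= NSIG_SKETCH);
    return (int)((set->sig[(sig - 1) / BITS_PER_WORD]
                  >> ((sig - 1) % BITS_PER_WORD)) & 1UL);
}

/* Mark sig as received by task t, as on the send-signal slide. */
void send_sig_sketch(struct task_sketch *t, int sig)
{
    sigaddset_sketch(&t->signal, sig);
    t->sigpending = 1;
}

/* A received signal is deliverable only if it is not blocked. */
int deliverable_sketch(const struct task_sketch *t, int sig)
{
    return sigismember_sketch(&t->signal, sig) &&
           !sigismember_sketch(&t->blocked, sig);
}
```

    Sending only records the event; the handler runs later, when the task returns to user mode and the pending, unblocked bit is noticed.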

    Signal (Cont`)
    signal handling

    if (current->sigpending) do_signal();   /* arch/i386/kernel/entry.S */
    do_signal(regs, oldset)   /* arch/i386/kernel/signal.c */
    - signr = dequeue_signal()
    - handle SIG_IGN or SIG_DFL
    - handle_signal(): set up stack frame for signal handler

    [Figure: task (sig, signal, blocked, sigpending), signal_struct, sigaction and sigset_t bitmaps (bits 63..0)]

    Signal (Cont`)
    signal handling: state of stack for handling signal

    [Figure: user stack before and after the signal frame is set up]
    - before: return address, arguments
    - after: return address, arguments, plus return address to kernel, return address to sighandler, arguments

    Thread
    Motivation (golf course)
    - Possibility of parallel processing
    - process is too heavy

    [Figure: process model; processes (P), address space, CPU, time (Source : UNIX internals)]

    Thread (Cont`)
    thread model
    - task: a set of threads and a collection of resources (passive)
    - thread: hardware context, stack, thread information (id, scheduling, ..)

    [Figure: thread model; threads, address space, CPU, time (Source : UNIX internals)]

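    Thread (Cont`)
    types of threads
    - kernel thread
    - LWP (lightweight process): a kernel-supported user thread
    - user thread: C-thread, P-thread

    [Figure: processes (or tasks) containing user threads (U) run by a user level scheduler, mapped onto LWPs (L) and kernel threads (K), which the thread scheduler places on the CPUs]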

    Thread (Cont`)
    threads in Linux
    - struct thread: currently only one in the task structure
    - sys_clone(): fully shares the address context such as the page directory
    - under development

    can use user level threads (P thread)   /* /usr/include/pthread.h */
    - pthread_create()
    - pthread_join()
    - pthread_mutex_init()

    Thread (Cont`)
    Example of thread programming

    #include ...

    typedef struct {
        double volatile *p_s;
        pthread_mutex_t *p_s_lock;
        int n;
    } DATA;

    #define L 9
    double x[L], y[L];
    /* gcc -lpthread */

    int main(int argc, char *argv[])
    {
        pthread_t *thread;
        void *retval;
        int cpu, i;
        DATA *A;
        volatile double s = 0;
        pthread_mutex_t s_lock;

        if (argc != 2) {                 /* exactly one argument: the CPU count */
            printf("USAGE: %s, CPU number", argv[0]);
            exit(1);
        }
        cpu = atoi(argv[1]);
        thread = (pthread_t *)calloc(cpu, sizeof(pthread_t));   /* one handle per CPU */
        A = (DATA *)calloc(cpu, sizeof(DATA));                  /* one argument block per thread */

    Thread (Cont`)
    Example of thread programming

    void *SMP_scalprod(void *arg)
    {
        register double localsum;
        long i;
        DATA D = *(DATA *)arg;

        localsum = 0.0;
        for (i=D.n; i
    [the rest of this example and the start of the socket section are missing from the transcript]

    ...family] = ops;}   /* net/socket.c */

    struct net_proto_family inet_family_ops = { PF_INET, inet_create }

    inet_proto_init() { sock_register(&inet_family_ops) ... }   /* net/ipv4/af_inet.c */
    other families: /* net/unix/af_unix.c */, /* net/ipx/af_ipx.c */

    Socket Create (cont`)
    socket create

    sys_socket(family, type, protocol)   /* net/socket.c */
    - family: AF_UNIX, AF_INET, AF_IPX, ...   /* include/linux/socket.h */
    - type: SOCK_STREAM, SOCK_DGRAM, SOCK_RAW, ...
    sock_create()
    - sock_alloc()
    - net_families[family]->create(): unix_create(), inet_create()
    inet_create()
    - sk_alloc()
    - switch (type): sock->ops = &inet_stream_ops or sock->ops = &inet_dgram_ops
    - sk->prot = &tcp_prot

    struct socket { state; type; flags; ops /* proto_ops */; sk /* struct sock */; files, inodes; next, wait, ... }   /* include/linux/net.h */
    struct sock { ...; prot; net_pinfo; tp_pinfo; socket; sk_buff; ... }   /* include/net/sock.h */
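    The family dispatch in sock_create() can be sketched as a table of create callbacks indexed by family number. This is a user-space illustration: the reduced socket structure, the enum standing in for the ops pointers, the numeric constants, and all names ending in _sketch are assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define NPROTO_SKETCH      32  /* assumed table size */
#define PF_UNIX_SKETCH     1
#define PF_INET_SKETCH     2
#define SOCK_STREAM_SKETCH 1
#define SOCK_DGRAM_SKETCH  2

/* Stand-in for the proto_ops pointer chosen at creation time. */
enum ops_kind { OPS_NONE, OPS_INET_STREAM, OPS_INET_DGRAM };

struct socket_sketch {
    int type;
    enum ops_kind ops;
};

struct net_proto_family_sketch {
    int family;
    int (*create)(struct socket_sketch *sock, int protocol);
};

static const struct net_proto_family_sketch *net_families[NPROTO_SKETCH];

/* Store a family's create callback under its family number. */
int sock_register_sketch(const struct net_proto_family_sketch *ops)
{
    assert(ops != NULL && ops->create != NULL);
    if (ops->family < 0 || ops->family >= NPROTO_SKETCH)
        return -1;
    net_families[ops->family] = ops;
    return 0;
}

/* Initialise sock and hand it to the create callback of the family. */
int sock_create_sketch(int family, int type, int protocol,
                       struct socket_sketch *sock)
{
    if (family < 0 || family >= NPROTO_SKETCH || net_families[family] == NULL)
        return -1;
    sock->type = type;
    sock->ops = OPS_NONE;
    return net_families[family]->create(sock, protocol);
}

/* Pick the operations table from the socket type. */
static int inet_create_sketch(struct socket_sketch *sock, int protocol)
{
    (void)protocol;
    switch (sock->type) {
    case SOCK_STREAM_SKETCH: sock->ops = OPS_INET_STREAM; return 0;
    case SOCK_DGRAM_SKETCH:  sock->ops = OPS_INET_DGRAM;  return 0;
    default:                 return -1;
    }
}

const struct net_proto_family_sketch inet_family_ops_sketch = {
    PF_INET_SKETCH, inet_create_sketch
};
```

    Creating a socket for a family that was never registered fails, which mirrors why each protocol registers itself during initialization before any socket of that family can exist.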

    Socket Create (cont`)
    socket create

    sys_socket(family, type, protocol)   /* net/socket.c */
    - sock_create()
    - get_fd()
      - get_empty_filp()
      - file->f_op = &socket_file_ops
      - associate d_inode with socket structure

    struct file_operations socket_file_ops = {   /* net/socket.c */
        sock_lseek,
        sock_read,
        sock_write,
        NULL,   /* readdir */
        sock_poll,
        sock_ioctl,
        NULL,   /* mmap */
        sock_no_open,
        NULL,   /* flush */
        sock_close,
        NULL,   /* fsync */
        sock_fasync
    }

    Socket Create (cont`)
    after socket creation

    task.fd[] -> file (f_dentry, f_pos, f_op) -> dentry (d_inode) -> struct socket -> struct sock

    struct socket { state; type; flags; ops /* proto_ops */; sk /* struct sock */; files, inodes; next, wait, ... }   /* include/linux/net.h */
    struct sock {   /* include/net/sock.h */
        next, prev
        daddr, dport
        rcv_saddr, sport
        ...
        rmem_alloc
        receive_queue   /* sk_buff */
        wmem_alloc
        send_queue
        ...
        pair       /* struct sock */
        prot       /* struct proto */
        tp_pinfo
        dst_cache  /* struct dst_entry */
        ...
    }

    layers: VFS layer, INET layer, TCP layer, IP layer, Driver layer

    Send Data
    sending data through socket
    - compare with FS control flow, that is a piece of pizza

    sys_write()   /* fs/read_write.c */
    - f->f_op->write
    sock_write()   /* net/socket.c */
    sock_sendmsg()
    - socki_lookup(d_inode)
    - make msg
    - sock->ops->sendmsg
    inet_sendmsg()   /* net/ipv4/af_inet.c */
    - sk->prot->sendmsg

    struct proto_ops inet_stream_ops = {   /* net/ipv4/af_inet.c */
        PF_INET, sock_no_dup, inet_release, inet_bind, inet_stream_connect,
        sock_no_socketpair, inet_accept, inet_getname, inet_poll, inet_ioctl,
        inet_listen, inet_shutdown, inet_getsockopt, inet_setsockopt,
        sock_no_fcntl, inet_sendmsg, inet_recvmsg
    }

    Send Data (cont`)
    sending data through socket

    inet_sendmsg()   /* net/ipv4/af_inet.c */
    - sk->prot->sendmsg
    tcp_v4_sendmsg()
    tcp_do_sendmsg()   /* net/ipv4/tcp.c */
    - copy data from user to sk_buff
    tcp_send_skb()
    tcp_transmit_skb()   /* net/ipv4/tcp_output.c */
    - make tcp header
    - sk->tp_pinfo.af_tcp.af_specific->queue_xmit(skb)

    struct proto tcp_prot = {   /* net/ipv4/tcp_ipv4.c */
        next, prev, tcp_close, tcp_v4_connect, tcp_accept, NULL /* retransmit */,
        tcp_write_wakeup, tcp_read_wakeup, tcp_poll, tcp_ioctl,
        tcp_v4_init_sock, tcp_v4_destroy_sock, tcp_shutdown,
        tcp_getsockopt, tcp_setsockopt, tcp_v4_sendmsg, tcp_recvmsg, "TCP", ...
    }

    Send Data (cont`)
    sending data through socket

    tcp_transmit_skb()   /* net/ipv4/tcp_output.c */
    - sk->tp_pinfo.af_tcp.af_specific->queue_xmit(skb)
    ip_queue_xmit()   /* net/ipv4/ip_output.c */
    - build IP header
    - fragment handling
    - call ip_route_output()   /* dst_cache.output = ip_output in ip_route_output */
    - sk->dst_cache->output()
    ip_output()   /* net/ipv4/ip_output.c */
    - ip_finish_output(skb)

    struct tcp_func ipv4_specific = {
        ip_queue_xmit, tcp_v4_send_check, tcp_v4_rebuild_header, tcp_v4_conn_request,
        tcp_v4_syn_recv_sock, tcp_v4_get_sock, sizeof(struct iphdr),
        ip_setsockopt, ip_getsockopt, v4_addr2sockaddr, sizeof(struct sockaddr_in)
    }

    sk_alloc() => tcp_v4_init_sock()
    tcp_v4_init_sock() { sk->tp_pinfo.af_tcp.af_specific = &ipv4_specific; .. }   /* net/ipv4/tcp_ipv4.c */

    Send Data (cont`)
    sending data through socket

    ip_finish_output()   /* include/net/ip.h */
    - hh->hh_output(skb)   /* hh->output = neigh_ops->output = dev_queue_xmit, set in net/ipv4/arp.c */
    dev_queue_xmit()   /* net/core/dev.c */
    - input pkt into dev->qdisc
    - dev->hard_start_xmit()
    el3_start_xmit()   /* driver/net/3c509.c */
    - make ethernet frame
    - send frame using inb(), outb(), ...

    struct hh_cache { hh_refcnt; hh_type; hh_output }

    struct device {   /* include/linux/netdevice.h */
        name; rmem_end, rmem_start; mem_end, mem_start; base addr; irq;
        init, destructor; ...; device_addr; qdisc; ...; open, stop;
        hard_start_xmit, hard_header; ...
    }

    init_module() in 3c509   /* driver/net/3c509.c */
    - init port, irq, make dev structure
    - dev->open = el3_open
    - dev->hard_start_xmit = el3_start_xmit
    - ...

    Send Data (cont`)
    sending data through socket

    [Figure: Protocol Layer: struct sock with a send queue of sk_buff (headers, data, ...); Device Layer: struct device with a qdisc queue of sk_buff (headers, data, ...)]

    Send Data (Cont`)
    Sending all together (TCP/IP & Ethernet)
    cf) compared with the control flow of the FS, this path is far more involved (FS is a piece of cake)

    Layer         Function            Source
    VFS           sys_write()         /* fs/read_write.c */
    BSD socket    sock_write()        /* net/socket.c */
    inet socket   inet_sendmsg()      /* net/ipv4/af_inet.c */
    TCP           tcp_send_skb()      /* net/ipv4/tcp_output.c */
    IP            ip_queue_xmit()     /* net/ipv4/ip_output.c */
    Device        el3_start_xmit()    /* driver/net/3c509.c */
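    The send path above is a chain of calls through function pointers, where each layer knows only the table of the layer below it. The sketch below models that chain in user space and records the order of the calls. All structures are reduced stand-ins and every name ending in _sketch, plus note() and send_trace(), is an assumption for illustration.

```c
#include <assert.h>
#include <string.h>

#define TRACE_MAX 128

static char trace[TRACE_MAX];   /* layers visited, joined by '>' */

/* Append a layer name to the trace. */
static void note(const char *layer)
{
    size_t used = strlen(trace);

    assert(used + strlen(layer) + 2 <= TRACE_MAX);
    if (used > 0)
        strcat(trace, ">");
    strcat(trace, layer);
}

struct sk_sketch;

struct dev_sketch   { int (*hard_start_xmit)(int len); };
struct proto_sketch { int (*sendmsg)(struct sk_sketch *sk, int len); };

struct sk_sketch {
    const struct proto_sketch *prot;                   /* like sk->prot */
    int (*queue_xmit)(struct sk_sketch *sk, int len);  /* like af_specific->queue_xmit */
    const struct dev_sketch *dev;
};

struct socket_sketch {
    int (*sendmsg)(struct socket_sketch *sock, int len);  /* like sock->ops->sendmsg */
    struct sk_sketch *sk;
};

static int el3_start_xmit_sketch(int len)
{
    note("driver");
    return len;                 /* pretend the whole frame was sent */
}

int ip_queue_xmit_sketch(struct sk_sketch *sk, int len)
{
    note("ip");
    return sk->dev->hard_start_xmit(len);
}

static int tcp_sendmsg_sketch(struct sk_sketch *sk, int len)
{
    note("tcp");
    return sk->queue_xmit(sk, len);
}

int inet_sendmsg_sketch(struct socket_sketch *sock, int len)
{
    note("inet");
    return sock->sk->prot->sendmsg(sock->sk, len);
}

/* Entry point: reset the trace and start at the socket layer. */
int sock_write_sketch(struct socket_sketch *sock, int len)
{
    trace[0] = '\0';
    note("socket");
    return sock->sendmsg(sock, len);
}

const struct dev_sketch   el3_dev_sketch  = { el3_start_xmit_sketch };
const struct proto_sketch tcp_prot_sketch = { tcp_sendmsg_sketch };

const char *send_trace(void) { return trace; }
```

    Because each layer reaches the next only through a pointer stored at socket creation, swapping tcp_prot_sketch for another table would change the protocol without touching the socket or driver code.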

    Receive Data
    receiving data through socket

    el3_open()
    - request_irq(dev->irq, el3_interrupt ...
    el3_interrupt()   /* driver/net/3c509.c */
    - mark_bh(NET_BH)
    net_bh()   /* net/core/dev.c */
    - make sk_buff in device structure
    - ptype->func()
    ip_rcv()   /* net/ipv4/ip_input.c */
    - ip_forward(), ip_defrag()
    - skb->dst->input()   /* dst.input = ip_local_deliver in ip_route_input() */
    ip_local_deliver()   /* net/ipv4/ip_input.c */

    struct packet_type { type; dev; func; ... }   /* include/linux/netdevice.h */
    struct packet_type ip_packet_type = { ETH_P_IP, NULL, ip_rcv, ... }   /* net/ipv4/ip_output.c */
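    The ptype->func() step can be sketched as a lookup in a list of registered packet types, keyed by the Ethernet protocol number. This is a user-space illustration: the reduced packet_type, the list handling, and all names ending in _sketch are assumptions; 0x0800 and 0x0806 are the standard EtherType values for IP and ARP.

```c
#include <assert.h>
#include <stddef.h>

#define ETH_P_IP_SKETCH  0x0800   /* EtherType for IPv4 */
#define ETH_P_ARP_SKETCH 0x0806   /* EtherType for ARP */

/* Reduced stand-in for struct packet_type. */
struct packet_type_sketch {
    unsigned short type;
    int (*func)(const unsigned char *data, size_t len);
    struct packet_type_sketch *next;
};

static struct packet_type_sketch *ptype_list;

/* Put a protocol handler at the front of the list. */
void add_ptype_sketch(struct packet_type_sketch *pt)
{
    assert(pt != NULL && pt->func != NULL);
    pt->next = ptype_list;
    ptype_list = pt;
}

/* Hand a received frame to the handler registered for its type.
 * Returns the handler's result, or -1 when no protocol claims it. */
int net_bh_deliver_sketch(unsigned short type,
                          const unsigned char *data, size_t len)
{
    struct packet_type_sketch *pt;

    for (pt = ptype_list; pt != NULL; pt = pt->next)
        if (pt->type == type)
            return pt->func(data, len);
    return -1;
}

/* Hypothetical IP input routine: reports the number of bytes accepted. */
static int ip_rcv_sketch(const unsigned char *data, size_t len)
{
    (void)data;
    return (int)len;
}

struct packet_type_sketch ip_packet_type_sketch = {
    ETH_P_IP_SKETCH, ip_rcv_sketch, NULL
};
```

    The driver and the bottom half never name ip_rcv directly; a frame whose type has no registered handler is simply not delivered.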

    Receive Data (cont`)
    receiving data through socket

    ip_local_deliver()   /* net/ipv4/ip_input.c */
    - ipprot->handler()
    tcp_v4_rcv()
    tcp_v4_do_rcv()   /* net/ipv4/tcp_ipv4.c */
    - call tcp_rcv_established or call tcp_rcv_state_process
    tcp_rcv_state_process()   /* net/ipv4/tcp_input.c */
    - check consistency, tcp_data()
    tcp_data()
    - tcp_data_queue()   /* sk_buff into sk */
    - wake up process

    struct inet_protocol { handler; err_handler; ...; name }   /* include/net/protocol.h */
    struct inet_protocol tcp_protocol { tcp_v4_rcv, tcp_v4_err, ..., "TCP" }   /* net/ipv4/protocol.c */

    Receive Data (cont`)
    receiving data through socket

    sys_read()   /* fs/read_write.c */
    - f->f_op->read
    sock_read()   /* net/socket.c */
    sock_recvmsg()
    - socki_lookup(d_inode)
    - make msg header
    - sock->ops->recvmsg
    inet_recvmsg()   /* net/ipv4/af_inet.c */
    - sk->prot->recvmsg
    tcp_recvmsg()   /* net/ipv4/tcp.c */
    - add_wait_queue(sk->sleep, {current, NULL})
    woken up by: tcp_data()

    Receive Data (cont`)
    Receiving all together (TCP/IP & Ethernet)

    process side (top down, then sleep):
    Layer         Function            Source
    VFS           sys_read()          /* fs/read_write.c */
    BSD socket    sock_read()         /* net/socket.c */
    inet socket   inet_recvmsg()      /* net/ipv4/af_inet.c */
    TCP           tcp_recvmsg()       /* net/ipv4/tcp.c */

    interrupt side (bottom up, then wake up):
    Layer         Function                   Source
    Device        el3_interrupt()            /* driver/net/3c509.c */
    Device        net_bh()                   /* net/core/dev.c */
    IP            ip_rcv()                   /* net/ipv4/ip_input.c */
    TCP           tcp_rcv_state_process()    /* net/ipv4/tcp_input.c */

    Conclusion in Network
    Add new features

    existing path in the Linux kernel:
    - sys_write()   /* fs/read_write.c */
    - sock_write()   /* net/socket.c */
    - inet_sendmsg()   /* net/ipv4/af_inet.c */
    - tcp_send_skb()   /* net/ipv4/tcp_output.c */
    - ip_queue_xmit()   /* net/ipv4/ip_output.c */
    - el3_start_xmit()   /* driver/net/3c509.c */
    possible additions: virtual_ip(), secure_tcp(), compress_net()

    Conclusion of Linux
    abstraction is just a set of data structures at kernel level

    Abstraction     Data structure                  Source
    process         struct task_struct              /* include/linux/sched.h */
                    struct user                     /* include/asm-i386/user.h */
    memory          struct vm_area_struct           /* include/linux/sched.h, include/asm-i386/page.h */
    file            struct file, struct inode       /* include/linux/fs.h, ext2_fs_i.h */
    file system     struct super_block              /* include/linux/fs.h */
    buffer          struct buffer_head              /* include/linux/fs.h */
    device driver   struct device_struct            /* fs/devices.c, driver/* */
    IPC                                             /* include/linux/ipc.h, sem.h, msg.h, shm.h */
    TCP/IP                                          /* include/linux/tcp.h, ip.h */

    E. p414: thrashing

    Working Set: 2 6 1 5 7 7 7 5 1 -> {1, 2, 5, 6, 7}
    Problem: size
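    The working-set note can be checked with a short routine that collects the distinct pages among the most recent references. This is a sketch under assumptions: page numbers are small non-negative integers below MAX_PAGE, and the function name working_set and the window parameter are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PAGE 64   /* assumed upper bound on page numbers */

/* Mark in in_set[] every page referenced in the last `window` references
 * ending at index t (inclusive) and return how many distinct pages that is. */
size_t working_set(const int refs[], size_t t, size_t window,
                   unsigned char in_set[MAX_PAGE])
{
    size_t start, i, count = 0;

    for (i = 0; i < MAX_PAGE; i++)
        in_set[i] = 0;
    start = (t + 1 > window) ? t + 1 - window : 0;
    for (i = start; i <= t; i++) {
        assert(refs[i] >= 0 && refs[i] < MAX_PAGE);
        if (!in_set[refs[i]]) {
            in_set[refs[i]] = 1;
            count++;
        }
    }
    return count;
}
```

    Over the whole reference string 2 6 1 5 7 7 7 5 1 the set is {1, 2, 5, 6, 7}; with a window of only the last four references it shrinks to {1, 5, 7}, which shows how strongly the result depends on the window size.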

    FIFO: p1
    LFU: p2 or p3 or p7

    FIFO: time
    LRU: reference
    LFU: freq.

    E.151
    ref: shm1.c shm2.c
    ref: sem_lock.c
    simple program
    AT&T Transport Interface
    http://www.rrzn.uni-hannover.de/ZentralSys/Vektor/manual/manlib/C/ni/ni01/ni000009.htm