UNIX Internals (Focusing on the Linux Kernel)
Contents
  Part I. UNIX Operating System
    1. Introduction
    2. Process Management
    3. Memory Management
    4. File System
    5. Synchronization & IPC
    6. I/O System (Device Driver)
  Part II. Detailed Study: LINUX Kernel Internals
    1. Where is everything?
       System call implementation
       Device driver using module programming
    2. Linux internals
References
  U. Vahalia, Unix Internals: The New Frontiers, Prentice Hall, 1996.
  H. M. Deitel, Operating Systems, 2nd edition, Addison-Wesley, 1990.
  Silberschatz and Galvin, Operating System Concepts, 5th edition, Addison-Wesley, 1998.
  Mukesh Singhal and Niranjan G. Shivaratri, Advanced Concepts in Operating Systems, McGraw-Hill, 1994.
  Maurice J. Bach, The Design of the UNIX Operating System, Prentice Hall, 1986.
  M. Beck et al., Linux Kernel Internals, 2nd edition, Addison-Wesley, 1997.
  Marshall K. McKusick, K. Bostic, M. Karels and J. Quarterman, The Design and Implementation of the 4.4BSD Operating System, Addison-Wesley, 1996.
  Berny Goodheart and James Cox, The Magic Garden Explained, Prentice Hall, 1994.
I. Introduction
  What is UNIX Operating System?
  Brief History
  Kernel Architecture
  Features of UNIX Operating System
What is UNIX Operating System?
  What is the similarity between an onion and UNIX?
  [Figure: onion-style layer diagram. Labels: Hardware, kernel, utilities (csh, vi, who, a.out, du, telnet, grep, ps, ls, gcc, sort, wc), packages (X Window, RDBMS, network administration).]

What is UNIX Operating System? (Cont'd)
  [Figure: block diagram of the kernel (Source: The Design of the UNIX OS).
   User level: user programs, libraries.
   Kernel level: system call interface (entered by trap); file system management with buffer cache and device drivers; process management with IPC, context, and memory management; hardware control (interrupt handling, etc.).
   HW level: hardware.]
What is UNIX Operating System? (Cont'd)
  The UNIX operating system is a resource manager
    physical resources: CPU, memory, disk, network
    abstract resources: process, thread, page, file, inode, message, security, ...
  The UNIX operating system is a computing environment
    it provides resource services to users
    through system calls and APIs
  An abstraction is just a set of data structures at the kernel level
Brief History
  Before UNIX
    Multics: 1965, AT&T (Bell Labs), General Electric, MIT
  Epoch
    1969: Ken Thompson, Space Travel on the PDP-7
    Dennis Ritchie
    s5fs, ed, shell (Bourne shell)
    1973: "The UNIX Time-Sharing System" in CACM
  BSD
    Bill Joy, Chuck Haley
    ex, csh, paging-based virtual memory system, TCP/IP, ffs, socket
    1993: 4.4BSD (final version, BSDI)
  AT&T System V
    Version 1, 2, ..., 7, System III, System V, SVR4.2/ES/MP
    region-based virtual memory, IPC, remote file sharing, STREAMS, ...

Brief History (Cont'd)
  Commercial UNIX
    XENIX (MS, SCO), SCO UNIX (SCO), AIX (IBM, journaling FS), HP-UX (HP), ULTRIX (DEC, MP), OSF/1 (Digital), ...
    SunOS (Sun Microsystems, VFS, NFS), Solaris, UnixWare (Novell)
  Mach microkernel, Chorus, Exokernel, SPIN, L4, ...
    http://ssrnet.snu.ac.kr/~choijm/current_os.html
  Standards
    SVID (System V Interface Definition), POSIX (IEEE), X/Open (Inc.)
    UI (Sun, AT&T: Solaris), OSF (OSF/1)
  Linux
    performance oriented
    philosophy of COPYLEFT
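Kernel Architecture
  Monolithic kernel: traditional UNIX, SVR4, Solaris, Linux, ...
  [Figure: processes enter one integrated kernel through the system call layer (the OS personality); all OS functionality lives in that kernel, directly above the hardware.]

Kernel Architecture (Cont'd)
  Monolithic kernel: call paths inside the kernel
  [Figure: a process calling read() goes through the system call layer to sys_read() in the file system, then to bread() in the buffer cache, then to hd_request() and do_hd_io() in the disk device driver. A process calling fork() goes to sys_fork() in process management, which calls copy_mm() in the memory manager and copy_thread() for the CPU state.]

Kernel Architecture (Cont'd)
  Microkernel: Mach, Chorus, L3/L4, SPIN, QNX, Windows NT
  [Figure: processes enter a small microkernel through the system call layer; OS functionality is provided by separate servers running above the microkernel.]

Kernel Architecture (Cont'd)
  Microkernel: what is the advantage of a microkernel?
  [Figure: a process calling read() reaches the microkernel, which forwards the request to the file system server (sys_read(), hd_request()); a process server runs beside it.]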
Windows NT Architecture
  [Figure: Windows NT structure (Source: Inside Windows NT).
   User mode: applications (Win32 client, OS/2 client, POSIX client, logon process) and protected subsystems (servers): Win32 server, OS/2 server, POSIX server, security server. Clients reach servers by message passing and enter the kernel by trap.
   Kernel mode: NT Executive, consisting of system services, object manager, security reference monitor, process manager, LPC facility, VM management, and I/O manager (file system, cache manager, device drivers, network drivers); below it the kernel and the Hardware Abstraction Layer (HAL) for hardware control.
   Hardware.]
Features
  What is good about UNIX
    Open system
      free
    "Small is beautiful" philosophy
      file: just a stream of bytes
    Simple and coherent
      data, device, pipe, socket, memory, and process can all be treated as a single abstraction (file)
    Portability
      high-level language
      new paradigms: OO, client-server model, clustering, PDA, MM server
    True parallelism
      multitasking (time sharing), multiprogramming, multiprocessor, MPP

Features (Cont'd)
  What is wrong with UNIX
    Too many variants
      dumping ground
    Not small and simple any more
      uncontrolled growth
    Building-block approach
      inappropriate for beginners
    Lack of GUI
      not now
  Ritchie's words: "It takes a genius to understand and appreciate UNIX's simplicity."
II. Process Management
Overview
  What is a process?
  process state transition
  context
  scheduling
  kernel entry points: interrupt, trap, system call
  signal

What is a Process?
  Definition
    an instance of a running program (runnable program)
    an execution environment of a program
    a scheduling entity
    a control flow and an address space
    PCB (Process Control Block): proc table and U area
  Manipulation of processes
    create, destroy
    context
    state transition
      dispatch (context switch)
      sleep, wakeup
      swap
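Process State Transition
  [Figure: process state diagram (Source: UNIX Internals). States: initial (idle), ready to run, kernel running, user running, asleep, suspended ready, suspended asleep, zombie. Transitions: fork (initial to ready to run), swtch (ready to run to kernel running and back), syscall or interrupt (user running to kernel running), return from syscall or interrupt (kernel running to user running), sleep or lock (kernel running to asleep), wakeup or unlock (asleep to ready to run), swap (ready and asleep to their suspended counterparts), exit (kernel running to zombie), wait (zombie removed).]

Process State Transition (Cont'd)
  Flow of execution: execution mode (cf. address space)
  [Figure: timeline alternating between kernel execution and the execution of processes A, B, and C, including the creation of process B (Source: Magic Garden).]
  An interrupt or trap causes a change of execution mode.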
Context
  context: system context, address (memory) context, H/W context
  [Figure: the proc table entry points to the U area; the U area leads to the segment table and page tables (pages in memory, or on disk after swap), to the fd array and file table, and to the saved registers (TSS): eip, sp, eflags, eax, cs, ...]

Context: system context
  proc table
    identification: pid, process group id, family relations
    state
    sleep channel: sleep queue
    scheduling information: p_cpu, p_pri, p_nice, ...
    signal handling information
    address (memory) information
  U area
    stores the hardware context while the process is not running
    UID, GID
    arguments, return values, and error status of the system call
    signal catch functions
    file descriptors
    usage statistics
  The details may differ according to the version and variant of UNIX.
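Context: address context
  fork example
  Guess what this program prints.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <unistd.h>

    int  glob = 6;                         /* external variable in initialized data */
    char buf[] = "a write to stdout\n";

    int main(void)
    {
        int   var;                         /* automatic variable on the stack */
        pid_t pid;

        var = 88;
        write(STDOUT_FILENO, buf, sizeof(buf) - 1);
        printf("before fork\n");

        if ((pid = fork()) < 0) {
            perror("fork");
            exit(1);
        } else if (pid == 0) {             /* child */
            glob++;
            var++;
        } else {
            sleep(2);                      /* parent */
        }

        printf("pid = %d, glob = %d, var = %d\n", (int)getpid(), glob, var);
        exit(0);
    }
  (Source: Adv. Programming in the UNIX Env., pgm 8.1)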
Context: address context (Cont'd)
  fork internal: compile results
  [Figure: gcc compiles test.c into a.out (ELF: Executable and Linking Format) with header, text, data, bss, and stack information. From the user's perspective (virtual address) the process image runs from 0x0 up to 0xbfffffff, with the kernel above it up to 0xffffffff. The text holds the code for glob++ (movl %eax, [glob]; addl %eax, 1; movl [glob], %eax; ...), the data holds glob and buf, and the stack holds var and pid.]

Context: address context (Cont'd)
  fork internal: before fork (after running a.out)
  cf) we assume that there is no paging mechanism in this figure.
  [Figure: the proc table entry for pid 11 points to a segment table, which maps text, data (glob, buf), and stack (var, pid) in memory.]

Context: address context (Cont'd)
  fork internal: after fork
  address space: the basic protection barrier
  [Figure: pid 11 and pid 12 each have a proc table entry and a segment table. Both map the same text, but each has its own copy of the data (glob, buf) and the stack (var, pid).]

Context: address context (Cont'd)
  fork internal: with the COW (Copy on Write) mechanism
  [Figure, left: after fork with COW, pid 11 and pid 12 share text, data, and stack. Right: after the glob++ operation, only the data is copied; text and stack are still shared.]

Context: address context (Cont'd)
  execve internal
  [Figure: the segment table of pid 11 is switched from the old text, data, and stack to a new text, data, and stack.]
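Context: hardware context
  time sharing (multitasking)
  [Figure: processes 1, 2, and 3 take turns on the CPU, one time quantum each. When a process gets the CPU back it asks: "Where am I?"]

Context: hardware context (Cont'd)
  a brief reminder of the 80x86 architecture
  [Figure: CPU with ALU, control unit, IN/OUT, and registers: eip, eflags; eax, ebx, ecx, edx, esi, edi; cs, ds, ss, es, ...; cr0, cr1, cr2, cr3, GDTR, TR, ...]

Context: hardware context (Cont'd)
  context switch
  [Figure: the CPU state is saved into the U area of the outgoing process (save context) and reloaded from the U area of the incoming process (restore context); each U area is reached from its proc table entry.]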
Context: hardware context (Cont'd)
  context switch: pseudo-code in UNIX
  trick: the return value register (e.g., eax on the 80x86 CPU)
  Think about the difference between a context switch and a system call.

    /* need a context switch */
    if (save_context()) {
        /* pick another process to run from the ready queue */
        ...
        restore_context(new process);
        /* control NEVER arrives here */
    }
    /* the resuming process executes from here */
    ...
  (Source: The Design of the UNIX OS)
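The "returns twice" behavior of save_context() can be tried in user space with setjmp() and longjmp() from standard C. The sketch below is only an analogy, not kernel code: the names saved_context and context_switch_demo are made up for this example, and note that setjmp() uses the opposite convention from the pseudo-code (it returns 0 when the context is saved and non-zero when it is resumed).

```c
#include <assert.h>
#include <setjmp.h>

/* Hypothetical stand-in for the register context saved in the U area. */
static jmp_buf saved_context;

/* Returns a bit mask: bit 0 = the "save" path ran, bit 1 = the "resume" path ran. */
int context_switch_demo(void)
{
    volatile int trace = 0;           /* volatile so the value survives longjmp() */

    if (setjmp(saved_context) == 0) { /* first return: context has just been saved */
        trace |= 1;
        /* a kernel would pick another process here ... */
        longjmp(saved_context, 1);    /* restore: control never comes back to this line */
    }

    /* second return from setjmp(): the resumed flow continues from here */
    assert(trace == 1);
    trace |= 2;
    return trace;
}
```

Calling context_switch_demo() walks through both paths once, which mirrors how a process that called the switch routine later "wakes up" right after its own save point.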
Process Scheduling
  Process scheduling
    allocates the CPU among competing processes
    criteria: fairness, efficiency (response time vs. throughput)
  Types of processes
    interactive
    batch (computation-intensive)
    real-time: video, hospital
  Types of scheduling
    preemptive scheduling
      other processes can take the CPU away from the currently running process
    non-preemptive scheduling (Windows98)
      other processes cannot take the CPU away from the currently running process
  Measures: utilization, throughput, turnaround time, waiting time, response time
Process Scheduling (Cont'd)
  Existing policies
    FCFS (First Come First Served)
    RR (Round-Robin): time quantum (10-100 msec)
    SJF (Shortest Job First)
    Multilevel Feedback Queue
    EDF (Earliest Deadline First)
    RM (Rate Monotonic)
    Fair Queuing
    Gang Scheduling
    Causality Scheduling
    Process migration

Process Scheduling (Cont'd)
  UNIX: Round-Robin with Multilevel Feedback Queue
  Round-Robin
    [Figure: round-robin ready queue.]

Process Scheduling (Cont'd)
  Multilevel Feedback Queue
  [Figure: ready queues 1, 2, ..., n feeding the CPU; queues nearer the top have higher priority and a shorter time quantum.]
Process Scheduling (Cont'd)
  Round-Robin: real implementation
    scheduling information in the proc table: p_pri, p_cpu, p_nice
    every clock tick: increment p_cpu of the currently running process
    every second:
      p_cpu = p_cpu * decay factor (generally 1/2)
      p_pri = PUSER + p_cpu/2 + p_nice
  Example of System III
    3 processes, PUSER = 50, p_nice = 0, 60 clock ticks per second
    [Table: p_pri and p_cpu of each process, second by second.]
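The System III rule above can be traced with a few lines of C. This is a sketch under the slide's stated values (PUSER = 50, decay factor 1/2, 60 ticks per second); the struct and function names are illustrative, not taken from any kernel source, and it assumes the traditional UNIX convention that a smaller p_pri means a higher priority.

```c
#include <assert.h>
#include <stddef.h>

#define PUSER 50   /* base user priority from the slide's example */

/* Hypothetical subset of the proc table entry used by the scheduler. */
struct sched_info {
    int p_cpu;     /* recent CPU usage, in clock ticks */
    int p_pri;     /* priority: smaller value = higher priority */
    int p_nice;
};

/* Called on every clock tick for the currently running process. */
void clock_tick(struct sched_info *running)
{
    assert(running != NULL);
    running->p_cpu++;
}

/* Called once per second: decay p_cpu by 1/2, then recompute p_pri. */
void recompute_priorities(struct sched_info *procs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        procs[i].p_cpu /= 2;
        procs[i].p_pri = PUSER + procs[i].p_cpu / 2 + procs[i].p_nice;
    }
}

/* Picks the process with the numerically smallest p_pri (first one wins ties). */
size_t pick_next(const struct sched_info *procs, size_t n)
{
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (procs[i].p_pri < procs[best].p_pri)
            best = i;
    return best;
}
```

With three processes starting at p_pri = 50, the first one runs for 60 ticks, is decayed to p_cpu = 30 and p_pri = 65, and therefore loses the CPU to the second process in the next second.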
Process Scheduling (Cont'd)
  Example of BSD
    decay factor: (2*load_average) / (2*load_average + 1)
    p_pri = PUSER + (p_cpu/4) + (2*p_nice)
    clock tick is 10 msec
    time quantum is 10 clock ticks
  Example of Mach
    decay factor: 5/8
    p_usrpri = PUSER + (3.8 * max(1, M/P) * p_cpu)/T + 0.5 * p_nice
  Example of SVR4
    supports REAL-TIME class processes
    class-independent scheduler / class-dependent scheduler
  Example of LINUX
    supports REAL-TIME processes
    selects the process that has the highest value of priority + counter
    the counter of the current process decreases at each clock tick
Process Scheduling (Cont'd)
  Range of process priorities (Source: The Design of the UNIX OS)
    Kernel mode priorities
      Swapper
      Waiting for Disk I/O
      Waiting for Buffer
      Waiting for Inode
      Waiting for TTY I/O
      Waiting for Child Exit
    User mode priorities
      User Level 0 (50)
      User Level 1
      ...
      User Level n

Kernel Entry Point
  Interrupt
  Trap
  System call
  [Figure: processes and devices entering the kernel, which consists of PM, FS, MM, DD, and HWM.]
Interrupt Handling
  Interrupt
    a mechanism by which peripheral devices inform the UNIX operating system of an asynchronous event
  What is the difference between polling and interrupt?
  [Figure: devices (real-time clock, disk, tty, network, cdrom) raise interrupts through the PIC to the CPU. The kernel looks up the IVT (slot 0: clock(), 1: nmi(), 2: tty_intr(), 3: disk_intr(), 4: net_intr(), ...) and calls the matching interrupt handler, e.g., clock() or disk_intr().]

Interrupt Handling (Cont'd)
  Interrupt handling mechanism
    similar to the steps of receiving a letter while on the telephone
  Steps
    if in user mode, change to kernel mode
    save the context of the current process (make a new context layer)
    determine the interrupt source
    find the interrupt vector and call the interrupt handler
    ... interrupt handling ...
    restore the saved context
  What if another interrupt is triggered while an interrupt is being handled?
Interrupt Handling (Cont'd)
  clock interrupt handler (timer_interrupt() in Linux)

    clock()
    {
        restart clock                       /* will interrupt again */
        if (callout table not empty)        /* e.g., timer_list in LINUX */
            adjust time and schedule callout function if necessary
        if (profiling on)
            count program counter at time of interrupt
        gather statistics per process and system
        update CPU usage for the current running process
        if (one second elapsed) {
            alarm handling
            calculate p_pri for all processes
            reschedule if necessary
            wake up swapper or page daemon if necessary
        }
    }
  (Source: The Design of the UNIX OS)
Trap Handling
  trap: a synchronous event raised by the instruction the CPU is currently executing (a software exception), in contrast to an asynchronous device interrupt
  [Figure: IVT. Slots 0-4: div_by_zero(), invalid_opcode(), overflow(), segment_fault(), page_fault(), ...; slots 20-24: clock(), nmi(), tty_intr(), disk_intr(), net_intr(), ...; slot 0x80: system_call().]

System Call Handling
  system call: an example of a trap
  [Figure: a trap through IVT slot 0x80 enters system_call() in the kernel, which indexes sys_call_table (sysent[]): 0: sys_no_syscall(), 1: sys_exit(), 2: sys_fork(), 3: sys_read(), 4: sys_write(), ..., sys_getpid(), ..., 255: sys_no_syscall(). The selected entry, e.g., sys_fork() or sys_read(), is then called.]
System Call Handling (Cont'd)
  invoking a system call
  [Figure: the process calls fork() from main(). The fork() stub in libc.a loads the system call number into eax and traps (movl $2, %eax; int $0x80). The trap enters system_call() through IVT slot 0x80; system_call() uses eax to index sys_call_table (sysent[]) and calls sys_fork(). read() follows the same path to sys_read().]
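The library stub in the figure can be bypassed to see the same mechanism from user space. The sketch below assumes Linux with glibc, where syscall() and the SYS_getpid constant are available; raw_getpid is a name invented for this example. It issues the getpid system call by number instead of going through the getpid() wrapper.

```c
#define _GNU_SOURCE            /* exposes the syscall() declaration in glibc */
#include <assert.h>
#include <sys/syscall.h>       /* SYS_getpid: the system call number */
#include <sys/types.h>
#include <unistd.h>

/* Enter the kernel directly with the system call number, as the libc stub would. */
pid_t raw_getpid(void)
{
    long ret = syscall(SYS_getpid);   /* number goes into the call register, then trap */
    assert(ret > 0);                  /* getpid cannot fail */
    return (pid_t)ret;
}
```

The value returned by raw_getpid() matches getpid(), since both end up in the same kernel function selected through the system call table.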
System Call Handling (Cont'd)
  How to make a new system call
    code the new system call function in kernel space
    allocate a syscall number (an empty slot in sys_call_table[]) and register the function
    rebuild the kernel
    reconfigure the library: ar, ranlib
    code your program with the new system call
Signal
  a mechanism to inform a process of an asynchronous event
  types of signal: SIGKILL, SIGINT, SIGBUS, SIGUSR1, ...
  action: abort, exit, ignore, stop, user-level catch function
  What is the difference among interrupt, trap, and signal?

    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    void sig_handler(int signo)
    {
        signal(SIGUSR1, sig_handler);             /* reinstall */
        /* printf() is not async-signal-safe; it is used here only for illustration */
        printf("received signal %d\n", signo);    /* handle the signal */
        ...
    }

    int main(void)
    {
        signal(SIGUSR1, sig_handler);             /* install the handler */
        ...
        for ( ; ; )
            pause();
    }
Signal (Cont'd)
  Steps
    register the signal handler (signal catch function)
    send the signal
    signal detection: on the state transition from kernel running to user running
    call the signal handler
  Variables for signals in the task structure in LINUX
    int sigpending: has a signal been received or not?
    struct signal_struct *sig
    sigset_t signal, blocked

    struct signal_struct          /* sched.h */
        count
        action[_NSIG]
        siglock
    struct sigaction              /* asm-i386/signal.h */
        sa_handler
        sa_flags
        sa_restorer
        sa_mask
    typedef struct {
        unsigned long sig[_NSIG_WORDS];
    } sigset_t;                   /* asm-i386/signal.h */
III. Memory Management
Memory Hierarchy
  hierarchy: register, CPU cache, main memory, secondary storage, server (or INTERNET)
    going down the hierarchy: larger capacity, lower speed, lower cost
  caching is more and more important (how to keep consistency?)

Memory Management Strategy
  Three strategies
    Fetch strategy: when is a process (page) brought into memory?
      demand fetch
      prefetch (agent in Web)
    Placement strategy: where is a process (page) put in memory?
      first fit, best fit, worst fit
    Replacement strategy: which process (page) is evicted from memory?
      LRU, LFU, MRU, ...

History of Memory Management Systems
  single-user system (the stone age of memory management)
    overlay
  fixed-partition multiprogramming system
    absolute assembler, relocating assembler
  variable-partition multiprogramming system
    coalescing, compaction
  virtual memory system
    paging
    segmentation (segment, region, vm_object)
    paging/segmentation
  [Figure: overlay example with a 2-pass program. Parts of 20K, 30K, and 10K; pass 1 (70K) and pass 2 (80K) share the same memory area.]
History (Cont'd)
  variable-partition multiprogramming system
  Scenario
    fork P1 (40K), fork P2 (20K), fork P3 (10K), fork P4 (20K), fork P5 (40K), fork P6 (20K), fork P7 (70K)
    exit P1, exit P3, exit P4, exit P6
  Memory layout before the exits (addresses in K)
    kernel 0-100, P1 100-140, P2 140-160, P3 160-170, P4 170-190, P5 190-230, P6 230-250, P7 250-320, free 320-400
  Free memory map kept by the kernel after the exits
    start   end   size
    100     140    40
    160     190    30
    230     250    20
    320     400    80

Memory Management Strategy: Placement
  Scenario
    same as above, followed by fork P8 (25K)
  With the free memory map above: where should P8 go?

Memory Management Strategy: Placement
  issue: fragmentation
  employed in swap management and the KMA (kernel memory allocator)
  Scenario: fork P8 (25K) against the free memory map above
    first fit
    best fit
    worst fit
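The three placement policies can be compared directly on the free memory map from the scenario. The following is a minimal sketch; struct hole, enum fit, and place() are names chosen for this example, and sizes are in K as on the slide.

```c
#include <assert.h>
#include <stddef.h>

/* One entry of the free memory map (units: K). */
struct hole {
    int start;
    int size;
};

enum fit { FIRST_FIT, BEST_FIT, WORST_FIT };

/* Returns the start address of the hole chosen for `request` K, or -1 if none fits. */
int place(const struct hole *map, size_t n, int request, enum fit policy)
{
    int chosen = -1;

    assert(map != NULL && request > 0);
    for (size_t i = 0; i < n; i++) {
        if (map[i].size < request)
            continue;                      /* hole too small */
        if (chosen < 0) {
            chosen = (int)i;               /* first hole that fits */
            if (policy == FIRST_FIT)
                break;
        } else if (policy == BEST_FIT && map[i].size < map[chosen].size) {
            chosen = (int)i;               /* smaller leftover is better */
        } else if (policy == WORST_FIT && map[i].size > map[chosen].size) {
            chosen = (int)i;               /* larger leftover is better */
        }
    }
    return chosen < 0 ? -1 : map[chosen].start;
}
```

For P8 (25K), first fit picks the hole at 100, best fit the hole at 160 (leaving only 5K unused), and worst fit the hole at 320.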
Virtual Memory
  virtual memory: separates the virtual address from the physical address
  [Figure: virtual address space from 0x0 to 0xffffffff. User part: text, data, bss, stack. Kernel part: kernel text, kernel data, kernel bss, kernel stack. The space is divided into pages.]

Virtual Memory (Cont'd)
  virtual address: Linux case (Source: Linux Internals)
  [Figure: user space from 0x0 up to 0xc0000000, kernel space from 0xc0000000 to 0xffffffff. The program's text, data, and bss are delimited by start_code, end_code, end_data, end_bss, and brk; above them sit the shared C library and other shared libraries (each with text, data, bss) and shared memory; the stack area is delimited by start_stack, arg_start, arg_end, env_end.]

Virtual Memory (Cont'd)
  physical memory
    consists of the kernel and a set of processes
  [Figure: physical memory from 0x0 to 0x4ffffff holding the kernel and P1, P2, P3, P4.]

Virtual Memory (Cont'd)
  physical memory
    a collection of page frames (4K or 8K)
  [Figure: physical memory as page frames 1, 2, ..., n, with the pages of P1, P2, and P3 scattered among them.]
Virtual Memory (Cont'd)
  address translation
    virtual address v = (s, p, d): segment number s, page number p, offset d
    the segment table origin register holds b
    b + s selects a segment table entry, which gives s' (the page table of the segment)
    s' + p selects a page table entry, which gives the page frame number p'
    physical address = (p', d): page frame number p', offset d

Virtual Memory (Cont'd)
  address translation: table structure
    segment table entry: V | segment start address (s) | L | R | W | E | A
    page table entry:    V | page frame number (p) | D | R | U | W | COW
    cf) one disk block descriptor per page table entry:
        swap (fs) number | block number | type (fill 0, demand fill)
Virtual Memory (Cont'd)
  execve (final)
  [Figure: a.out (header, text, data, stack) is mapped through the proc table and the segment table to page tables. The virtual pages (T1, T2, D1, S1, ...) at 0K-48K carry valid bits (1 1 0 1 0 0 0 0 0 0 1 0); only the valid ones point to page frames in memory (e.g., at 4K, 12K, 20K, 28K).]

Virtual Memory (Cont'd)
  SVR 4.0 virtual memory structure
  [Figure: struct proc (p_as) points to struct as (seg_list, hint, struct hat). The as heads a list of struct seg, one per region of the virtual address space (text, data, stack, u area), each with base, size, as_ptr, s_ops, and private data. The private data is a struct segvn_data with a vnode (resident pages of the file) and an anon_map (anonymous pages of the segment).]

Virtual Memory (Cont'd)
  BSD (Mach) virtual memory structure
  [Figure: struct task (vm_map) points to struct vm_map (first, last, hint, struct pmap). The map heads a list of struct vm_map_entry, each pointing to a struct vm_object with its resident page list of struct vm_page.]

Virtual Memory (Cont'd)
  Linux virtual memory structure
  [Figure: task_struct (mm) points to mm_struct (count, pgd, mmap). mmap heads a list of vm_area_struct (vm_start, vm_end, vm_flag, vm_inode), e.g., one for Code and one for Data.]

Virtual Memory (Cont'd)
  advantages of virtual memory
    large address space
    no need for a placement strategy
    flexible sharing of memory objects among processes
  no free lunch: disadvantage of virtual memory
    address translation
  [Figure: P1 and P2 each have their own segment table and page table; some of their valid page table entries point to the same page frames in memory (e.g., at 20K and 28K).]
Virtual Memory (Cont'd)
  address translation with TLB (Translation Lookaside Buffer)
    virtual address v = (s, p, d): segment number s, page number p, offset d
    the TLB (associative memory) is searched with (s, p); on a hit it delivers the page frame number p' directly
    on a miss, the tables are walked: segment table origin register b, b + s gives s', s' + p gives p'
    physical address = (p', d): page frame number p', offset d
Virtual Memory (Cont'd)
  HAT (Hardware Address Translation)
    isolates all hardware-dependent code
    HAT in SVR4, pmap in BSD, pgd in Linux, ...
    responsible for all address translation, transparently
  case study: 80x86 CPU, segment translation
    virtual address = segment selector (16 bit) + offset (32 bit)
    the selector picks a segment descriptor from a segment descriptor table (GDT, LDT)
    the result is a 32-bit linear address
  cf) 80x86 reminder
    GDT: available to all tasks; segments for OS code and data; descriptors for LDT and TSS
    LDT: for a specific task
    IDT: interrupt service routines

Virtual Memory (Cont'd)
  HAT (Hardware Address Translation): paging
  case study: 80x86 CPU
    linear address: DIR (bits 31-22) | PAGE (bits 21-12) | offset (bits 11-0)
    control register CR3: Page Directory Base Register
    CR3 + DIR selects a page directory entry, which holds the PFN of a page table
    page table + PAGE selects a page table entry, which holds the PFN of the page
    physical address: PFN (bits 31-12) | offset (bits 11-0)
    page table entry: PFN (bits 31-12) | D (bit 6) | R (bit 5) | U (bit 2) | W (bit 1) | P (bit 0)
      D: Dirty, R: Referenced, U: User/Supervisor, W: Read/Write, P: Present (valid)
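The bit fields of the 80x86 linear address can be extracted with shifts and masks. The sketch below covers only the classic 32-bit two-level scheme with 4K pages described above; the function names are chosen for this example and the table walk itself (reading the directory and page table from memory) is left out.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT  12        /* 4K pages: offset is bits 11-0 */
#define INDEX_MASK  0x3FFu    /* DIR and PAGE are 10 bits each */
#define OFFSET_MASK 0xFFFu

/* DIR: bits 31-22, index into the page directory located through CR3. */
uint32_t dir_index(uint32_t linear)
{
    return linear >> 22;
}

/* PAGE: bits 21-12, index into the page table selected by DIR. */
uint32_t table_index(uint32_t linear)
{
    return (linear >> PAGE_SHIFT) & INDEX_MASK;
}

/* offset: bits 11-0, carried unchanged into the physical address. */
uint32_t page_offset(uint32_t linear)
{
    return linear & OFFSET_MASK;
}

/* Combines the PFN found in the page table entry with the offset. */
uint32_t physical_address(uint32_t pfn, uint32_t offset)
{
    assert(offset <= OFFSET_MASK);
    return (pfn << PAGE_SHIFT) | offset;
}
```

For example, the linear address 0xC0101234 splits into DIR 0x300, PAGE 0x101, and offset 0x234.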
Replacement Strategy
  Which page can be evicted from memory?
  goal: reduce the number of page faults and avoid thrashing
  [Figure: memory holds p7, p3, p1, p4, p2; p8 is on disk. A page fault for p8 asks the replacement policy which page to evict.]

Replacement Strategy (Cont'd)
  basic principle of replacement: locality
    temporal locality: stack, tree traversal, counting variable
    spatial locality: array, sequential code, file reference
  replacement policies
    FIFO (First In First Out)
    LRU (Least Recently Used)
    LFU (Least Frequently Used)
    NUR (Not Used Recently)
    MRU (Most Recently Used)
    Working Set
    Second Chance (FIFO + reference bit)

Replacement Strategy (Cont'd)
  example: FIFO, LRU, LFU
  scenario: page reference order p1, p2, p3, p1, p4, p2, p1, p3, p4, p7, p8
  memory holds p7, p3, p1, p4, p2 when p8 (on disk) is referenced
  Guess which page will be evicted from memory under the LRU policy.
  Which policy is the best policy?
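The scenario above is small enough to replay in code. The sketch below assumes five page frames (the number of pages shown in memory on the slide) and uses a plain array scan instead of the linked lists or hash tables a real kernel would use; last_victim() and struct frame are names made up for this example.

```c
#include <assert.h>
#include <stddef.h>

#define NFRAMES 5   /* assumption: memory holds five pages, as in the figure */

enum policy { FIFO, LRU, LFU };

struct frame {
    int page;      /* resident page number, -1 if the frame is empty */
    int loaded;    /* time the page was brought in (FIFO key) */
    int last_use;  /* time of the latest reference (LRU key) */
    int count;     /* references while resident (LFU key) */
};

/* Replays refs[0..n-1]; returns the page evicted by the last fault, or -1 if none. */
int last_victim(enum policy pol, const int *refs, int n)
{
    struct frame f[NFRAMES];
    int victim_page = -1;

    assert(refs != NULL);
    for (int i = 0; i < NFRAMES; i++)
        f[i].page = -1;

    for (int t = 0; t < n; t++) {
        int hit = -1, slot = -1;
        for (int i = 0; i < NFRAMES; i++) {
            if (f[i].page == refs[t])
                hit = i;
            else if (f[i].page == -1 && slot < 0)
                slot = i;
        }
        if (hit >= 0) {                      /* page already resident */
            f[hit].last_use = t;
            f[hit].count++;
            continue;
        }
        if (slot < 0) {                      /* page fault with memory full */
            slot = 0;
            for (int i = 1; i < NFRAMES; i++) {
                int better = pol == FIFO ? f[i].loaded < f[slot].loaded
                           : pol == LRU  ? f[i].last_use < f[slot].last_use
                           :               f[i].count < f[slot].count;
                if (better)
                    slot = i;
            }
            victim_page = f[slot].page;
        }
        f[slot] = (struct frame){ refs[t], t, t, 1 };
    }
    return victim_page;
}
```

On the slide's reference string the three policies disagree: FIFO evicts p1 (loaded first), LRU evicts p2 (its last reference is the oldest), and LFU evicts p7 (referenced only once).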
Replacement Strategy (Cont'd)
  Project I: program a simulator for the FIFO, LRU, and LFU policies and compare their performance.
  assume
    - memory consists of 20 page frames
    - page numbers range from 0 to 49
    - the number of references is 300
  program the 3 policies
    - use a linked list for FIFO and LRU
    - use a priority tree for LFU if possible
    - use a hash to find a page quickly
  compare the performance and discuss it
Replacement Strategy (Cont'd)
  Example of a real implementation in UNIX: buffer cache (Source: The Design of the UNIX OS)
  [Figure: hash queue headers for (page_no % 5) = 0, 1, 2, 3, 4, each chaining the cached buffers that hash to it; all buffers are also linked on an LRU list with a head and a tail.]

Replacement Strategy (Cont'd)
  example: NUR
    used by the pagedaemon (two-handed clock algorithm)
    page table entry: V | page frame number (p) | D | R | U | W | COW
    possible combinations of the two bits R and D: (0,0), (0,1), (1,0), (1,1)
    replace a page having the (0,0) combination first

Swapper vs. PageDaemon
  swapping and paging
    replace some object from memory when memory is almost full
  swapping
    object: process
    swap in / swap out
    swap space management is similar to variable-partition multiprogramming
  paging
    object: page
    page fault handling
IV. File System
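Overview of File System
  [Figure: in user mode, processes 1, 2, ..., n. In system mode, the Virtual File System on top of the concrete file systems (ffs, nfs, ext2fs, ntfs, mmfs, procfs, ...), then the buffer cache, then the device drivers.]

User Interface
  System calls
    open
    read/write
    close
    dup
    link
    pipe, mkfifo
    mkdir, readdir
    mknod
    stat
    mount
    sync, fsck

User Interface (Cont'd)
  file descriptor, file table, inode (vnode)
  [Figure: proc table entry, U area (segment table, fd array, TSS), file table, vnode, inode.]

User Interface (Cont'd)
  fork vs. open
  [Figure, left (fork): parent and child each have their own fd array, but both point to the same file table entry, which points to the vnode.
   Figure, right (open of the same file): each open creates its own file table entry; both entries point to the same vnode.]
  How about dup?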
Disk system
  physical view
    platter, arm, head
    cylinder, track, sector
    seek time, rotational latency, transmission time
  logical view (the viewpoint of UNIX)
    a disk is a collection of disk blocks
    the disk block size is usually equal to the page frame size
  [Figure: disk blocks numbered 0, 1, 2, ..., 14, ...]

Structure of File
  disk block allocation
    we want to create a file with a size of 14K
    assume the disk block size is 4K
  sequential allocation
  non-sequential allocation
    block chain, indexed block, FAT
  [Figure: disk blocks numbered 0, 1, 2, ..., 16.]

Structure of File (Cont'd)
  non-sequential allocation: block chain
  [Figure: the directory entry (new file name) points to the first block; each block points to the next.]

Structure of File (Cont'd)
  non-sequential allocation: index block
  [Figure: the directory entry (new file name) points to an index block, which lists the data blocks.]
  What if the index block is full?

Structure of File (Cont'd)
  non-sequential allocation: FAT (File Allocation Table)
  [Figure: the directory entry (new file name) points to the first block; the FAT holds, for each block, the number of the next block, NIL at the end of a chain, or UN for an unused block.]
  What are the advantages and disadvantages of block chain, index block, and FAT?

Structure of File (Cont'd)
  sequential allocation
  [Figure: the directory entry (new file name) records the start block and the size.]
  What are the advantages and disadvantages of sequential and non-sequential allocation?
Structure of File (Cont'd)
  inode in the UNIX file system
    i_inode_number
    i_mode: type (4 bit) | u g s | r w x | r w x | r w x
      types: S_IFSOCK, S_IFLNK, S_IFREG, S_IFBLK, S_IFDIR, S_IFCHR, S_IFIFO
    i_nlink
    i_uid, i_gid
    i_rdev
    i_atime, i_ctime, i_mtime
    direct block pointers
    indirect block pointers
    ...

Structure of File (Cont'd)
  inode in the UNIX file system: finding a block
    assume the size of a disk block is 4K
    which block is involved if f_offset is 10000? (or 47000?)
  [Figure: the file table entry holds f_offset and points to the inode; the inode's direct and indirect pointers lead to the data blocks.]
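The arithmetic behind the question can be sketched as follows. The number of direct pointers is not readable from the slide, so the code assumes NDIRECT = 10 as in the classic System V inode (ext2, for comparison, has 12); locate_block() and struct blkpos are hypothetical names, and only the direct and single-indirect cases are resolved.

```c
#include <assert.h>

#define BLKSZ   4096   /* disk block size from the slide: 4K */
#define NDIRECT 10     /* assumption: classic System V layout with 10 direct pointers */

/* Where a byte offset lands inside the inode's block map. */
struct blkpos {
    int  level;        /* 0 = direct pointer, 1 = single indirect block */
    long index;        /* slot in the direct array or in the indirect block */
    long offset;       /* byte offset inside the data block */
};

struct blkpos locate_block(long f_offset)
{
    struct blkpos pos;
    long lbn;

    assert(f_offset >= 0);
    lbn = f_offset / BLKSZ;              /* logical block number within the file */
    pos.offset = f_offset % BLKSZ;
    if (lbn < NDIRECT) {
        pos.level = 0;
        pos.index = lbn;
    } else {
        pos.level = 1;                   /* deeper indirection is not handled here */
        pos.index = lbn - NDIRECT;
    }
    return pos;
}
```

Under these assumptions, offset 10000 falls in logical block 2 (direct pointer 2, byte 1808), while offset 47000 falls in logical block 11, which is slot 1 of the single indirect block (byte 1944).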
Structure of Directory
  connects a file name to disk block(s)
  provides the hierarchical structure of the file system
  directory entry in the UNIX FS: inode number | file name
  directory entry in DOS: file name | extension | attributes | time | first block number
  [Figure: three directories, each an inode (i_mode, time, ..., block pointer) plus one disk block of (inode number, name) entries.
   inode 1 -> disk block 1:   5 etc, 4 dev, 3 usr, 1 ., 1 .., 9 mnt, 7 var, 6 vmunix
   inode 3 -> disk block 7:   17 lib, 16 include, 12 src, 3 ., 1 .., 23 member, 20 bin, 25 local
   inode 23 -> disk block 39: 37 mark, 33 tom, 32 jim, 23 ., 3 .., 42 mjc, 41 sooni]

Structure of Directory (Cont'd)
  hierarchical view
    /
      usr
        src, include, lib, bin, local
        member
          jim, tom, mark, sooni, mjc
      dev
      etc
      var
      mnt
      vmunix

Structure of Directory (Cont'd)
  open example
    open("/usr/member/sooni/test.c", O_RDONLY)
    find the inode using the directory structure (namei())
    allocate an fd and a file table entry and initialize them
  [Figure: proc table, fd array, file table entry (f_offset), inode.]
Structure of File System
  file system: boot block, super block, inodes, data blocks
  [Figure: a system with disks /dev/hda and /dev/hdb; /dev/hda is divided into /dev/hda1, /dev/hda2, /dev/hda3. Each partition holds a file system laid out as boot block, super block, i-nodes, disk blocks.]

Structure of File System (Cont'd)
  super block: manages information for the file system (cf. inode for a file)
    iget, iput
    balloc, bfree
  struct superblock
    s_type
    s_flag
    s_dev
    s_blocksize
    s_magic
    s_name
    ...
    s_free_inode[]        free inode list (map)
    s_free_disk_block[]   free disk block list (map)

Structure of File System (Cont'd)
  super block
  [Figure: s_free_disk_block[] holds free block numbers; its last entry names a disk block (e.g., disk block 29, then disk block 61) that itself holds the next part of the free block list.]

Structure of File System (Cont'd)
  mount
    mount /dev/hda3 /mnt
    open("/mnt/test.c", O_RDONLY)
  [Figure: the inode for /mnt is marked as a mount point; the vfsmntlist of vfsmount entries leads through mnt_sb to the super block for /dev/hda3 (s_dev, s_blocksize, mounted point, root inode, ...), which leads to the inode for the root of the file system on /dev/hda3.]
Inode for special file
  inode structure for a special file
  pipe
    no indirect blocks (unnamed pipe)
    readers, writers, read pointer, write pointer
  special device file
    no direct or indirect blocks
    device number: major number + minor number
    major number: identifies the device type; used as the index into the device switch table
    minor number: identifies the device unit; passed as an argument to the device driver
Existing File System
  S5FS
    the first and conventional UNIX file system
  FFS
    supports file names of up to 255 characters
    cylinder groups
    fragments
    directory entry for ffs: i_no | size | file_name
    fast file system structure: boot block | super block | cylinder group 1 (inodes, disk blocks) | cylinder group 2 | ...
  LFS
    optimized for small writes
    suitable for RAID storage systems
  VxFS (journaling file system)
    fast recovery using internal logging

Existing File System
  ext2 file system
    the default Linux file system
    similar to Berkeley's FFS
    inode: 12 direct blocks
    uses bitmaps for free block and inode management
    fault-tolerant features
  ext2 file system structure
    boot block | block group 0 | block group 1 | ... | block group n
    each block group: super block | group descriptors | block bitmap | inode bitmap | inode table | data blocks

Existing File System
  NFS
    stateless protocol
    XDR (External Data Representation)
  AFS, Coda File System
    disconnected operation
  Sprite File System
    strong consistency
  VFS
    supports various file systems
  mfs
  procfs
  [Figure: on the NFS client, application -> system call -> VFS -> NFS -> RPC stub / XDR; on the NFS server, RPC stub / XDR -> nfsd -> NFS -> VFS -> UFS.]
swap space management
  [Figure: swap space from 0 to 64M holding the images (text, data, stack) of P1, P2, P3, P4, P5, P6. Where should the next process go?]

swap space management
  swap used map
  Scenario
    swap out P1 (3M), swap out P2 (3M), swap out P3 (2M), swap out P4 (1M), swap out P5 (3M), swap out P6 (4M)
    swap in P2, swap in P4, swap in P5
  Map entries after the scenario (in M)
    start   end   size
    3       6      3
    8       12     4
    16      64     48
  Why does UNIX manage swap space differently from the FS?
V. Inter-Process Communication
Inter-Process Communication (IPC)
  synchronization
  pipes
  communication via files
  signal
  System V IPC
    message queue
    shared memory
    semaphore
  IPC with sockets

synchronization
  parallelism
    multiprocessor (true parallelism) or time sharing (quasi-parallelism)
  race condition: more than one process wants to access the same shared resource
  mutual exclusion
    only one process can access a shared resource at a time
    critical section: a portion of a program that accesses a shared resource
    representative mechanisms: ipl, lock, semaphore, test&set
  deadlock
synchronization (Cont'd)
  example of race condition I
  Guess what the results are.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <unistd.h>

    static void charatatime(char *str);

    int main(void)
    {
        pid_t pid;

        if ((pid = fork()) == 0) {              /* child */
            charatatime("output from child\n");
        } else {
            charatatime("output from parent\n");
        }
        exit(0);
    }

    static void charatatime(char *str)
    {
        char *ptr;
        int   c;

        setbuf(stdout, NULL);                   /* unbuffered: one write per character */
        for (ptr = str; (c = *ptr++) != 0; )
            putc(c, stdout);
    }
  (Source: Adv. Programming in the UNIX Env., pgm 8.7)
  possible output: outpuot utfprut froom chmild parent
synchronization (Cont'd)
  system internals
  [Figure: the two task structures each have an fd entry pointing to the same file structure (f_pos), which points to the inode. The shared file structure is the shared resource.]

synchronization (Cont'd)
  example of race condition II
  scenario
    process P1 is currently being dispatched (removed from the ready queue)
    a disk interrupt occurs
    the disk interrupt handler wakes up process P2 and wants to insert it into the ready queue

synchronization (Cont'd)
  ipl (interrupt priority level)

synchronization (Cont'd)
  lock
    associate a lock variable with each shared resource
    lock before (and unlock after) the critical section
  spin_lock primitive

    void spin_lock(spinlock_t *s)
    {
        while (test_and_set(s) != 0)
            ;                                   /* busy-wait until the lock is free */
    }

    void spin_unlock(spinlock_t *s)
    {
        *s = 0;
    }
  (Source: UNIX Internals)
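The primitive above relies on an atomic test_and_set instruction. In portable user-space C the same idea can be sketched with C11 atomics, where atomic_flag_test_and_set() plays that role. This is an illustration only, not the kernel's implementation; spin_trylock() and SPINLOCK_INIT are additions made for this example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    atomic_flag held;          /* set while some thread owns the lock */
} spinlock_t;

#define SPINLOCK_INIT { ATOMIC_FLAG_INIT }

/* One atomic test-and-set; returns true if the caller now owns the lock. */
bool spin_trylock(spinlock_t *s)
{
    assert(s != NULL);
    return !atomic_flag_test_and_set_explicit(&s->held, memory_order_acquire);
}

/* Busy-wait until the test-and-set finds the flag clear. */
void spin_lock(spinlock_t *s)
{
    while (!spin_trylock(s))
        ;                      /* spin */
}

/* Clear the flag so that another waiter's test-and-set can succeed. */
void spin_unlock(spinlock_t *s)
{
    assert(s != NULL);
    atomic_flag_clear_explicit(&s->held, memory_order_release);
}
```

A critical section is then bracketed by spin_lock(&lock) and spin_unlock(&lock); spinning is only reasonable when the lock is held for a very short time.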
synchronization (Cont'd)
  sleep_lock
  [Flowchart: a process wants the resource. If the resource is locked, it sleeps on the resource until it is awakened and then checks again; otherwise it locks the resource, uses it, and unlocks it. After unlocking: does anyone want the resource? If yes, wake up all waiting processes; then continue other processing.]
  issues: spin lock or sleep lock, lock granularity, rw_lock (try_lock)

synchronization (Cont'd)
  semaphore
    an object that can be accessed only through the P and V (and sem_initialize) methods
  semaphore primitives

    void initsem(semaphore_t *sem, int val)
    {
        *sem = val;
    }

    void P(semaphore_t *sem)
    {
        *sem -= 1;
        while (*sem < 0)
            sleep;
    }

    void V(semaphore_t *sem)
    {
        *sem += 1;
        if (processes slept on sem queue)
            wake up the processes slept on sem;
    }
  (Source: UNIX Internals)
synchronization (Cont'd)
  semaphore: example
  [Figure: a client and a server communicate through shared memory.]
    client: produce an item, then put the item into shared memory
    server: remove an item from shared memory, then consume the item

synchronization (Cont'd)
  semaphore: example (with semaphores sem1 and sem2)
    initsem(sem1, 5)
    initsem(sem2, 0)
    client: produce an item; P(sem1); put the item into shared memory; V(sem2)
    server: P(sem2); remove an item from shared memory; V(sem1); consume the item
synchronization (Cont'd)
  semaphore in the Linux kernel
    widely used to wait until a condition is met (e.g., reading disk blocks)
    semaphore        /* include/asm-i386/semaphore, kernel/sched.c */
    declare a semaphore for each shared resource

    struct semaphore {
        atomic_t count;
        struct wait_queue *wait;
    };

    void down(struct semaphore *sem)
    {
        while (sem->count <= 0)
            sleep_on(&sem->wait);
        sem->count--;
    }

    void up(struct semaphore *sem)
    {
        sem->count++;
        wake_up(&sem->wait);
    }

  usage: process 1 and process 2 share a resource guarded by struct semaphore *x
    down(x); critical section; up(x);

synchronization (Cont'd)
  semaphore in the Linux kernel
    sleep, wakeup    /* include/linux/wait.h, kernel/sched.c */
    interruptible_sleep_on(), wake_up_interruptible()

    struct wait_queue {
        struct task_struct *task;
        struct wait_queue *next;
    };

    /* simplified */
    void sleep_on(struct wait_queue **queue)
    {
        struct wait_queue entry = { current, NULL };

        current->state = TASK_UNINTERRUPTIBLE;
        add_wait_queue(queue, &entry);
        schedule();
        remove_wait_queue(queue, &entry);
    }

    /* simplified */
    void wake_up(struct wait_queue **queue)
    {
        struct wait_queue *p = *queue;

        do {
            p->task->state = TASK_RUNNING;
            add_runqueue(p->task);
            p = p->next;
        } while (p != *queue);
    }
synchronization (Cont'd)
  Deadlock
    a system state in which processes wait for events that never occur
  [Figure: resource allocation graph with processes 1-4 and resources 1-4 forming a cycle.]

synchronization (Cont'd)
  Deadlock
    deadlock prevention
    deadlock avoidance
    deadlock detection and correction
  reduction of the resource allocation graph
  [Figure: four successive stages of a graph with processes P1, P2, P3 and resources R1, R2, reduced step by step.]
pipe
named pipe, unnamed pipe
pipe(fd[]), mkfifo(path, mode), mknod(path, mode, dev_t)
inode file types: S_IFREG, S_IFCHR, S_IFBLK, S_IFIFO
no indirect blocks in the inode
the kernel keeps rd_pointer, wr_pointer, number of readers, number of writers
[Figure: process 1 writes through its write fd into the pipe inside the kernel; process 2 reads through its read fd]
pipe
pipe (unnamed pipe): limits
- cannot broadcast
- no object boundaries
- cannot direct data to a specific reader
FIFO (named pipe)
- FIFO file
- must be explicitly deleted (unlink)
- named, so less secure than a pipe
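pipe (Cont'd)
example of pipe : % ls -l | more
for (;;) {
    read_command();
    parsing_command();
    pipe(fd);
    if (fork() == 0) {               /* child: will become "more" */
        close(stdin);
        dup(fd[0]);                  /* stdin now reads from the pipe */
        if (fork() == 0) {           /* grandchild: will become "ls" */
            close(stdout);
            dup(fd[1]);              /* stdout now writes to the pipe */
            exec("ls", ...);
        }
        close(fd[1]);                /* otherwise "more" never sees end of file */
        exec("more", ...);
    }
    wait();
}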
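The same fork/pipe pattern can be tried without a shell. The following is a minimal sketch, assuming a POSIX system; `pipe_roundtrip` is a hypothetical helper in which the child plays the writer ("ls") and the parent plays the reader ("more").

```c
#define _XOPEN_SOURCE 700
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* The child writes msg to its stdout, which has been redirected into the pipe;
   the parent reads it back. Returns the number of bytes read, or -1 on error. */
long pipe_roundtrip(const char *msg, char *out, size_t cap) {
    int fd[2];
    pid_t pid;
    size_t total = 0;
    ssize_t n;

    if (cap == 0 || pipe(fd) < 0) return -1;
    pid = fork();
    if (pid < 0) { close(fd[0]); close(fd[1]); return -1; }

    if (pid == 0) {                          /* child: the writer */
        close(fd[0]);
        dup2(fd[1], STDOUT_FILENO);          /* stdout now refers to the pipe */
        close(fd[1]);
        if (write(STDOUT_FILENO, msg, strlen(msg)) < 0) _exit(1);
        _exit(0);
    }

    close(fd[1]);                            /* parent keeps only the read end */
    while (total < cap - 1 &&
           (n = read(fd[0], out + total, cap - 1 - total)) > 0)
        total += (size_t)n;
    out[total] = '\0';
    close(fd[0]);
    waitpid(pid, NULL, 0);
    return (long)total;
}
```

Closing the unused write end in the reader matters: read() returns 0 (end of file) only when every write descriptor of the pipe has been closed.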
Communication via files
the oldest way of exchanging data among processes
a race condition may occur: one process reads the data before the other has finished modifying it
mandatory or advisory locking
lockf, flock, fcntl
fcntl(fd, cmd, arg)
  cmd: F_GETLK, F_SETLK, ...
  arg: flock structure with l_type, l_whence, l_start, l_len, l_pid
  l_type: F_RDLCK, F_WRLCK, F_UNLCK, F_SHLCK, F_EXLCK
[Figure: two processes (P) accessing the same file]
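Advisory record locking with fcntl() can be sketched as follows. The helper names `lock_region` and `lock_demo` are hypothetical, and the temporary file under /tmp is an assumption made for the demonstration. Because record locks belong to a process, the conflict is probed from a forked child.

```c
#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Apply `type` (F_RDLCK, F_WRLCK or F_UNLCK) to bytes [start, start+len) of fd.
   F_SETLK does not block: it fails at once if another process holds a conflicting lock. */
int lock_region(int fd, short type, off_t start, off_t len) {
    struct flock fl;
    memset(&fl, 0, sizeof fl);
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = start;
    fl.l_len = len;
    return fcntl(fd, F_SETLK, &fl);
}

/* The parent write-locks 10 bytes; the child asks with F_GETLK whether it could
   lock the same bytes and expects to be told that the parent is in the way.
   Returns 0 if everything behaved as expected, -1 otherwise. */
int lock_demo(void) {
    char path[] = "/tmp/lock_demo_XXXXXX";
    int fd = mkstemp(path);
    int status, rc;
    pid_t pid;

    if (fd < 0) return -1;
    unlink(path);                            /* file disappears when fd is closed */
    if (lock_region(fd, F_WRLCK, 0, 10) < 0) { close(fd); return -1; }

    pid = fork();
    if (pid < 0) { close(fd); return -1; }
    if (pid == 0) {                          /* child: locks are not inherited */
        struct flock probe;
        int blocked;
        memset(&probe, 0, sizeof probe);
        probe.l_type = F_WRLCK;
        probe.l_whence = SEEK_SET;
        probe.l_start = 0;
        probe.l_len = 10;
        blocked = fcntl(fd, F_GETLK, &probe) == 0 &&
                  probe.l_type == F_WRLCK && probe.l_pid == getppid();
        _exit(blocked ? 0 : 1);
    }
    waitpid(pid, &status, 0);
    rc = lock_region(fd, F_UNLCK, 0, 10);
    close(fd);
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0 && rc == 0) ? 0 : -1;
}
```

The lock is advisory: a process that never calls fcntl() can still read and write the locked bytes.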
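Communication via files (Cont'd)
A deadlock scenario with file locking
In Linux, fcntl() returns the error EDEADLK (also spelled EDEADLOCK) when a blocking lock request would cause a deadlock
[Figure: two processes (P) locking the same file]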
Signal
1. register signal handler (signal catch function)
2. send signal
3. signal detection : on the state transition from kernel running to user running
4. call signal handler
variables for signals in the task structure
  int sigpending : has a signal been received or not?
  struct signal_struct *sig
  sigset_t signal, blocked
struct signal_struct /* sched.h */ : count, action[_NSIG], siglock
struct sigaction /* asm-i386/signal.h */ : sa_handler, sa_flags, sa_restorer, sa_mask
typedef struct { unsigned long sig[_NSIG_WORDS]; } sigset_t; /* asm-i386/signal.h */
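From user space, the pending and blocked sets can be observed with the POSIX signal API. The sketch below is illustrative only; `signal_demo` and `on_usr1` are hypothetical names, and it assumes a single-threaded process in which SIGUSR1 is not already blocked.

```c
#define _XOPEN_SOURCE 700
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t caught;     /* written by the handler */

static void on_usr1(int signo) { caught = signo; }

/* Blocks SIGUSR1, sends it, checks that it is pending but not yet handled,
   then unblocks it and checks that the handler ran.
   Returns 1 if both observations hold, 0 if not, -1 on error. */
int signal_demo(void) {
    struct sigaction sa, old;
    sigset_t block, prev, pend;
    int was_pending, delivered;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_usr1;             /* register the signal catch function */
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, &old) < 0) return -1;

    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &block, &prev) < 0) return -1;

    caught = 0;
    raise(SIGUSR1);                      /* send: stays pending while blocked */
    sigpending(&pend);
    was_pending = (sigismember(&pend, SIGUSR1) == 1) && (caught == 0);

    sigprocmask(SIG_SETMASK, &prev, NULL);   /* unblock: the handler is called now */
    delivered = (caught == SIGUSR1);

    sigaction(SIGUSR1, &old, NULL);      /* restore the previous disposition */
    return was_pending && delivered;
}
```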
System V IPC
Message, Shared Memory, and Semaphore
Common properties
- key => id (cf: file name => fd)
- in the kernel, ***id_ds for each System V IPC object (e.g. msqid_ds)
- ipc_perm: key, uid, cuid, access mode
- commands: ipcs, ipcrm
Differences
- message : suitable for object-oriented concepts
- shared memory : fast
- semaphore : for user-level synchronization
System V IPC (Cont'd)
message queue
  msqid = sys_msgget(key, flag)                          /* create */
  sys_msgsnd(msqid, msgp, msgsz, flag)                   /* send */
  sys_msgrcv(msqid, msgp, msgsz, msgtype, flag)          /* receive */
  sys_msgctl(msqid, cmd, msqid_ds)                       /* control */
[Figure: sender processes (P) append msgs to the queue anchored at struct msqid_ds; receiver processes (P) take them off]
System V IPC (Cont'd)
struct msqid_ds : msg_perm, msg_first, msg_last, msg_stime, msg_rtime, msg_ctime, wwait_queue, rwait_queue, msg_cbytes, msg_qnum, msg_qbytes, msg_lspid, msg_lrpid
each queued msg : msg_next, msg_type, msg_spot, msg_ts
msgtype in sys_msgrcv()
  = 0 : receive the first msg in the queue
  > 0 : receive the first msg of the given type in the queue
  < 0 : receive the first msg of the lowest type that is <= |msgtype|
semaphore (fragment) : if (... > 0) V() operation, else P() operation
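The effect of the msgtype argument can be checked from user space with the library wrappers msgget(), msgsnd(), msgrcv() and msgctl(). This is a sketch, assuming System V IPC is enabled on the running kernel; `struct demo_msg` and `msgq_demo` are hypothetical names.

```c
#define _XOPEN_SOURCE 700
#include <string.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>

struct demo_msg {
    long mtype;                          /* message type, must be > 0 */
    char mtext[32];
};

/* Queues a type 1 message and then a type 2 message.
   Receiving with msgtype 2 must skip the older type 1 message;
   receiving with msgtype 0 then returns the first message left in the queue.
   Returns 0 if that order is observed, -1 otherwise. */
int msgq_demo(void) {
    struct demo_msg m, r;
    int ok = 0;
    int id = msgget(IPC_PRIVATE, IPC_CREAT | 0600);

    if (id < 0) return -1;

    m.mtype = 1;
    strcpy(m.mtext, "first");
    if (msgsnd(id, &m, sizeof m.mtext, 0) == 0) {
        m.mtype = 2;
        strcpy(m.mtext, "second");
        if (msgsnd(id, &m, sizeof m.mtext, 0) == 0 &&
            msgrcv(id, &r, sizeof r.mtext, 2, IPC_NOWAIT) >= 0 &&
            r.mtype == 2 && strcmp(r.mtext, "second") == 0 &&
            msgrcv(id, &r, sizeof r.mtext, 0, IPC_NOWAIT) >= 0 &&
            r.mtype == 1 && strcmp(r.mtext, "first") == 0)
            ok = 1;
    }
    msgctl(id, IPC_RMID, NULL);          /* queues outlive processes: remove explicitly */
    return ok ? 0 : -1;
}
```

Like a FIFO file, the queue must be removed explicitly; from the shell the same is done with ipcrm.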
socket
socket : common interface for IPC and networking
Protocol family : UNIX, INET, AX25, IPX, Appletalk
layer structure of a network (top to bottom)
  BSD socket
  INET
  TCP, UDP
  IP
  PLIP, SLIP, ETHERNET, ARP
  parallel port, serial port, Ethernet card
socket (Cont'd)
information for communication
  5-tuple {protocol, local-addr, local-process, foreign-addr, foreign-process}
C library routines
  socket() : protocol, make socket structure
  bind() : assign local-addr and local-process
  connect() : foreign-addr, foreign-process
  listen() : waiting in server
  accept() : make connection to a client
  read(), write()
  send(), sendto(), recv(), recvfrom()
cf) system call : sys_socketcall /* net/socket.c */
socket (Cont'd)
socket structure
struct socket {                      /* include/linux/net.h */
    state
    type
    flags
    ops                              /* proto_ops */
    sk                               /* struct sock */
    files, inodes
    next, wait, ...
}
struct sock { ... }                  /* include/net/sock.h */
struct proto_ops {                   /* for INET operation, include/linux/net.h */
    family
    dup, release, bind, connect, accept, listen, ...
    getsockopt
    setsockopt
    sendmsg
    recvmsg
}
file -> f_dentry, f_pos, f_op
f_op for sockets /* net/socket.c */ : sock_lseek, sock_read, sock_write, NULL, sock_poll, sock_ioctl, NULL, sock_no_open, ...
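socket (Cont'd)
connection oriented protocol
  server: socket() -> bind() -> listen() -> accept() [blocks until connection from a client] -> read() -> processing request -> write()
  client: socket() -> connect() -> write() -> read()
  connection established between connect() and accept()
  data (request): client write() -> server read()
  data (reply): server write() -> client read()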
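The connection-oriented call sequence can be exercised inside one process over the loopback interface. This is a minimal sketch, assuming IPv4 loopback is available; `tcp_echo_demo` is a hypothetical helper that plays both roles, and it relies on a short message arriving in a single read().

```c
#define _XOPEN_SOURCE 700
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Server side: socket, bind, listen, accept, read, write.
   Client side: socket, connect, write, read.
   Returns the number of bytes the client got back, or -1 on error. */
int tcp_echo_demo(const char *req, char *reply, size_t cap) {
    int ls = -1, cs = -1, as = -1, got = -1;
    struct sockaddr_in addr;
    socklen_t len = sizeof addr;
    char buf[128];
    size_t n = strlen(req);
    ssize_t r;

    if (n == 0 || n > sizeof buf || cap < n + 1) return -1;

    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                               /* let the kernel pick a port */

    ls = socket(AF_INET, SOCK_STREAM, 0);            /* server: socket() */
    if (ls < 0) goto out;
    if (bind(ls, (struct sockaddr *)&addr, sizeof addr) < 0) goto out;
    if (listen(ls, 1) < 0) goto out;
    if (getsockname(ls, (struct sockaddr *)&addr, &len) < 0) goto out;

    cs = socket(AF_INET, SOCK_STREAM, 0);            /* client: socket() */
    if (cs < 0) goto out;
    /* connect() completes against the listen backlog before accept() is called */
    if (connect(cs, (struct sockaddr *)&addr, sizeof addr) < 0) goto out;
    as = accept(ls, NULL, NULL);                     /* server: accept() */
    if (as < 0) goto out;

    if (write(cs, req, n) != (ssize_t)n) goto out;   /* data (request) */
    r = read(as, buf, sizeof buf);
    if (r <= 0) goto out;
    if (write(as, buf, (size_t)r) != r) goto out;    /* data (reply) */
    r = read(cs, reply, cap - 1);
    if (r < 0) goto out;
    reply[r] = '\0';
    got = (int)r;
out:
    if (as >= 0) close(as);
    if (cs >= 0) close(cs);
    if (ls >= 0) close(ls);
    return got;
}
```

After accept() the 5-tuple is complete: protocol TCP, the server's address and port, and the client's address and port.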
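socket (Cont'd)
connectionless protocol
  server: socket() -> bind() -> recvfrom() [blocks until data received from a client] -> processing request -> sendto()
  client: socket() -> bind() -> sendto() -> recvfrom()
  data (request): client sendto() -> server recvfrom()
  data (reply): server sendto() -> client recvfrom()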
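TLI
connection oriented protocol
  server: t_open() -> t_bind() -> t_listen() [wait for connection] -> t_accept() -> t_rcv() -> processing request -> t_snd()
  client: t_open() -> t_bind() -> t_connect() -> t_snd() -> t_rcv()
  connection request: client t_connect() -> server t_listen()
  data (request): client t_snd() -> server t_rcv()
  data (reply): server t_snd() -> client t_rcv()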
VI. I/O System (Device Driver)
Role of a device driver
- handles data movement between memory and peripheral devices
- usually written by a third party
[Figure: processes (P) -> system call interface -> kernel file system -> device driver interface (through devsw table) -> tty driver, disk driver, network driver]
Peripheral Device: General Structure
H/W configuration : extremely hardware dependent
controller
- CSR (Control and Status Register)
  - the driver writes to the CSRs to issue commands to the device and reads the CSRs to obtain completion status or error conditions
  - memory mapped I/O, or special in/out instructions (e.g. the in/out instructions of the 80x86)
  - programmed I/O (tty, modem, printer), DMA (disk)
- internal buffer
device itself
Disk Driver
Disk I/O handling
- convert logical disk block number into physical sector(s)
- handle read/write requests, handle interrupt
- disk scheduling
  - FCFS
  - SSTF (Shortest Seek Time First)
  - SCAN
  - C-SCAN
- DMA (channel)
- RAID
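The difference between two of the listed policies can be made concrete by adding up head movement. A small sketch with hypothetical helper names `fcfs_seek` and `sstf_seek`; cylinder numbers are plain integers and ties in SSTF go to the request that was queued first.

```c
#include <stdlib.h>

#define MAXREQ 64

/* FCFS: serve the requests in arrival order. Returns total cylinders moved. */
long fcfs_seek(int head, const int req[], int n) {
    long total = 0;
    for (int i = 0; i < n; i++) {
        total += labs((long)req[i] - head);
        head = req[i];
    }
    return total;
}

/* SSTF: always serve the pending request closest to the current head position.
   Returns total cylinders moved, or -1 if n exceeds MAXREQ. */
long sstf_seek(int head, const int req[], int n) {
    int done[MAXREQ] = {0};
    long total = 0;

    if (n > MAXREQ) return -1;
    for (int served = 0; served < n; served++) {
        int best = -1;
        for (int i = 0; i < n; i++) {
            if (done[i]) continue;
            if (best < 0 || abs(req[i] - head) < abs(req[best] - head))
                best = i;
        }
        total += abs(req[best] - head);
        head = req[best];
        done[best] = 1;
    }
    return total;
}
```

For the request queue 98, 183, 37, 122, 14, 124, 65, 67 with the head at cylinder 53, FCFS moves the head 640 cylinders and SSTF 236.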
Terminal Driver
interactive : line discipline
canonical mode, raw mode (stty)
[Figure: process <-> tty_read / tty_write <-> raw queue, canon queue, out queue (clists built from cblocks) <-> tty driver with xbuf and rbuf <-> CSR through in/out and interrupt]
General structure of Device Driver
well defined entry points
top half, bottom half
character device driver entry points : open, close, read, write, ioctl, mmap, in/out, intr
block device driver entry points : open, close, strategy, size, in/out, intr
what's the difference between a character and a block device driver?
Device Switch Table
devsw : table for registering the entry points of device drivers
struct cdevsw {
    int (*d_open) ();
    int (*d_close) ();
    int (*d_read) ();
    int (*d_write) ();
    int (*d_ioctl) ();
    int (*d_mmap) ();
    int (*d_segmap) ();
    int (*d_xpoll) ();
    int (*d_xhalt) ();
    struct streamtab *d_str;
    struct ttytab *d_tty;
    ...
} cdevsw[];

struct bdevsw {
    int (*d_open) ();
    int (*d_close) ();
    int (*d_strategy) ();
    int (*d_size) ();
    int (*d_xhalt) ();
    ...
} bdevsw[];
(Source : UNIX Internals)
Device Switch Table (Cont'd)
Example of switch table
bdevsw
  hd_open  hd_close  hd_strategy
  ht_open  ht_close  ht_strategy
  cd_open  cd_close  cd_strategy
cdevsw
  con_open  con_close  con_read  con_write  con_ioctl
  tty_open  tty_close  tty_read  tty_write  tty_ioctl
  ed_open   ed_close   ed_read   ed_write   ed_ioctl
  nulldev   nulldev    mm_read   mm_write   nulldev
  hd_open   hd_close   hd_read   hd_write   nulldev
dev file
  # ls -l /dev/
  brw-r--r--  0  1  hda1
  brw-r--r--  0  2  hda2
  ...
  brw-r--r--  0 11  hdb1
  brw-r--r--  1  0  tape
  ...
  crw-r--r--  1  0  tty0
  crw-r--r--  1  1  tty1
  ...
  crw-r--r--  5  0  rhda1
why do we access disks through the character interface?
Device Switch Table (Cont'd)
example : open
  open("/dev/tty0", O_RDONLY)
  (*cdevsw[getmajor(dev)].d_open) (dev, ...)
[Figure: proc table -> fd -> file table -> inode (i_dev : c, 1, 0) -> cdevsw entry selected by the major number]
cdevsw
  con_open  con_close  con_read  con_write  con_ioctl
  tty_open  tty_close  tty_read  tty_write  tty_ioctl
  ed_open   ed_close   ed_read   ed_write   ed_ioctl
  nulldev   nulldev    mm_read   mm_write   nulldev
  gd_open   gd_close   gd_read   gd_write   nulldev
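The dispatch through the switch table can be modelled in ordinary C. Everything below is a user-space illustration with hypothetical names (`dev_t_demo`, `makedev_demo`, `struct cdevsw_demo`, `dev_open`); the stub drivers only return recognizable numbers, and the major number is assumed to sit in the high byte of the device number.

```c
#include <stddef.h>

typedef unsigned int dev_t_demo;                 /* hypothetical device number */
#define makedev_demo(maj, min) ((dev_t_demo)(((maj) << 8) | ((min) & 0xff)))
#define getmajor(dev) (((dev) >> 8) & 0xff)
#define getminor(dev) ((dev) & 0xff)

/* Reduced switch table entry: one row of entry points per major number. */
struct cdevsw_demo {
    int (*d_open)(dev_t_demo dev);
    int (*d_close)(dev_t_demo dev);
};

/* Stub drivers: the return value shows which driver was reached. */
static int con_open(dev_t_demo dev) { return 100 + (int)getminor(dev); }
static int tty_open(dev_t_demo dev) { return 200 + (int)getminor(dev); }
static int nulldev(dev_t_demo dev)  { (void)dev; return 0; }

static struct cdevsw_demo cdevsw[] = {
    { con_open, nulldev },                       /* major 0: console */
    { tty_open, nulldev },                       /* major 1: tty     */
    { nulldev,  nulldev },                       /* major 2: no-op   */
};
#define NCDEV (sizeof cdevsw / sizeof cdevsw[0])

/* What the kernel does on open of a character special file:
   index the table by major number and call through the function pointer. */
int dev_open(dev_t_demo dev) {
    unsigned int maj = getmajor(dev);
    if (maj >= NCDEV) return -1;                 /* no such driver */
    return (*cdevsw[maj].d_open)(dev);
}
```

The generic code never names a driver function; installing a driver amounts to filling in one row of the table.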
Device Switch Table (Cont'd)
install new device driver
1. write the new device driver and link it into the kernel : my_open(), my_read(), my_write(), my_close(), ...
2. register it in the devsw table
3. make the special file
   # mknod /dev/mydrv [b|c] major_number minor_number
Device Switch Table (Cont'd)
control flow
[Figure: user mode read() -> kernel -> devsw table -> driver -> queue (sleep) -> device; device interrupt -> IVT -> interrupt handler -> wakeup]
where does the requesting process sleep?
STREAM
full-duplex data transfer and processing path
consists of a pair of queues (W: write queue, R: read queue)
[Figure: user application | user/kernel boundary | STREAM head (W, R) <-> STREAM module (W, R) <-> STREAM driver (W, R) | hardware]
STREAM (Cont'd)
Reusable module
  user - STREAM head - TCP - IP - token ring
  user - STREAM head - UDP - IP - ethernet
Multiplexing
  several users, each with its own STREAM head, over TCP and UDP, multiplexed onto IP, which is multiplexed onto ATM and DQDB
STREAM (Cont'd)
STREAM features
- transparency among the queues
- reusable
- multiplexing
- message based communication
- virtual copying
- STREAM scheduler : priority bands
Part II. Detailed Study: Linux Kernel Internals
Contents
- why Linux?
- where is everything (kernel source code)?
- kernel configure and compile
- system call implementation
- module programming
- some important kernel data structures
References
M. Beck, H. Bohme, M. Dziadzka, U. Kunitz, R. Magnus, D. Verworner, Linux Kernel Internals, 2nd Ed, Addison-Wesley, 1997
Fred Butzen, Christopher Hilton, The Linux Network, The M&T Books Slackware Series, 1998
Remy Card, etc, The Linux Kernel Book, John Wiley & Sons, 1998
A. Rubini, Linux Device Drivers, O'Reilly, 1998
Anonymous, Maximum Linux Security (A Hacker's Guide to Protecting Your Linux Server and WS), SAMS Publishing, 1999
http://www.linux.org/
http://www.kernel.org/
http://kldp.org/
/usr/src/linux
Why Linux?
freely available
- Linus Torvalds, Copyleft
- 1991 version 0.01 (November 1999, version 2.2.13)
- Redhat, Debian, Slackware, Alzza
- supported by many companies
Main characteristics
- multi-tasking
- multi-user access
- multi-processor
- supports various architectures (80x86, sparc, mips, alpha, smp, ..)
- demand load executables
- paging
- dynamic cache for hard disk
Why Linux? (Cont'd)
main characteristics (cont'd)
- shared library
- support for POSIX 1003.1
- various formats for executable files
- true 386 protected mode
- emulating maths co-processor
- support for national keyboards and fonts
- supports diverse file systems (ext2, ..)
- TCP/IP, SLIP, PPP
- BSD sockets
- System V IPC
- Virtual Console
Why Linux? (Cont'd)
drawbacks
- monolithic kernel (many research projects are currently micro-kernelizing it)
- not for beginners (for system programmers)
- not well structured (performance-oriented)
Key attraction
- experimenting with the system (handle the kernel by yourself)
- supported by many companies
- free : solution business & add-on features
- thanks to the INTERNET & GNU (special thanks to anti-MS feeling)
Where is everything?
Linux Operating System Structure (Source : The Linux Kernel Book)
user level
  application
kernel level
  System Calls Interface
  Central kernel : task management, scheduler, signals, memory management, loadable modules
  File System : ext2fs, xiafs, proc, minix, nfs, msdos, iso9660; Buffer Cache
  Network Manager : ipv4, ethernet, ...
  Peripheral Manager : block, character, hd, cdrom, isdn, network, scsi, pci
  Machine Interface
H/W level
  Machine
Where is everything? (Cont'd)
source structure
- based on version 2.2.5
- under development : the contents described below may change
/usr/src/linux
  Doc
  arch : alpha, arm, m68k, mips, ppc, sparc, i386 (boot, kernel, lib, math-emu, mm)
  include : asm-alpha, asm-arm, asm-i386, ..., linux, net, scsi, video
  init
  fs : coda, ext2, hpfs, msdos, nfs, ntfs, ..., ufs
  kernel
  ipc
  lib
  mm
  net : 802, appletalk, decnet, ethernet, ipv6, unix, sunrpc, x25, ...
  scripts
  drivers : block, cdrom, char, net, pci, pnp, sbus, scsi, ..., sound, video
Where is everything? (Cont'd)
main subdirectory
arch/
- architecture dependent code : arch/i386, arch/alpha, ...
- arch/i386/boot/ : bootstrapping; configure devices, memory
- arch/i386/kernel/ : kernel entry point handling (trap/interrupt handling), context switch
- arch/i386/mm/ : machine dependent memory management code
init/
- all the functions needed to start the kernel
- hand-made process 0 (init_task or task[0])
- fork process 1, 2, 3, ...
Where is everything? (Cont'd)
main subdirectory
kernel/ (arch/i386/kernel)
- central section of the kernel
- main system call implementation (fork, exit, etc.)
- time management
- scheduler
- signal handling
mm/
- virtual memory interface
- paging, kernel memory management
fs/
- virtual file system interface
- implementations of the various file systems (ext2, nfs, ...)
Where is everything? (Cont'd)
main subdirectory
drivers/
- drivers for hardware components
- drivers/block/ : block-oriented drivers (hard disks)
- drivers/cdrom/ : proprietary CD-ROM drives
- drivers/char/ : character-oriented drivers (serial ports, tty, modem, ..)
- drivers/net/ : network cards
- drivers/pci/ : PCI bus access and control
- drivers/scsi/ : SCSI interface
- drivers/sound/ : sound card drivers
ipc/
- classical inter-process communication : semaphores, shared memory, message queues
Where is everything? (Cont'd)
main subdirectory
net/
- various network protocol implementations : TCP/IP, ARP, ...
- code for sockets to the UNIX and Internet domains
lib/
- some standard kernel library functions (printk)
modules/
- kernel module files
- modules can be added to the kernel later (insmod, rmmod)
include/
- commonly included kernel-specific header files
- include/asm-i386/ : architecture-dependent header files for Intel CPUs
- include/linux/ : Linux kernel internal structures (task, inode)
Kernel Configuration and Compile
a new kernel is generated in three steps
1. configure (Documentation/Configuration.help, see chapter 3 of The Linux Network)
   make config (menuconfig, xconfig)
   make oldconfig
2. depend
   make dep (make clean : optional)
3. compile
   make zImage
cf)
- make zdisk (# dd bs=8192 if=$(BOOTIMAGE) of=/dev/fd0)
- make zlilo (# cat $(BOOTIMAGE) > $(INSTALL_PATH)/vmlinuz), /etc/lilo.conf
- # mkbootdisk --device /dev/fd0 zImage
Add New System Call
System Call : Control flow in Linux
user process
  do system call
  libc.a : push args, save system call number, make trap
Kernel
  idt_table /* arch/i386/kernel/traps.c */
  system call handler : system_call() /* arch/i386/kernel/entry.S */
    catch trap through IDT
    call real handler function using sys_call_table /* arch/i386/kernel/entry.S */
  real system call function
Add New System Call (Cont'd)
IDT (Interrupt Descriptor Table)
- defined in : include/asm-i386/desc.h, arch/i386/kernel/traps.c, irq.h
- constructed during kernel initialization /* arch/i386/kernel/traps.c, irq.c */
idt_table
  0x0  : divide_error(), debug(), nmi(), ..., segment_not_present(), ..., page_fault(), ...   (common trap handlers for 80x86)
  0x20 : FIRST_EXTERNAL_VECTOR : timer_interrupt(), hd_interrupt(), ...   (device interrupt handlers, IRQ)
  0x80 : SYSCALL_VECTOR : system_call()
  ...
  0xff
Add New System Call (Cont'd)
sys_call_table
syscall number : include/asm-i386/unistd.h
    #define __NR_exit    1
    #define __NR_fork    2
    #define __NR_read    3
    ...
    #define __NR_vfork   190
sys_call_table : arch/i386/kernel/entry.S
    ENTRY(sys_call_table)
        .long SYMBOL_NAME(sys_ni_syscall)    /* 0 */
        .long SYMBOL_NAME(sys_exit)          /* 1 */
        .long SYMBOL_NAME(sys_fork)          /* 2 */
        .long SYMBOL_NAME(sys_read)          /* 3 */
        ...
        .long SYMBOL_NAME(sys_vfork)         /* 190 */
        .rept NR_syscalls-190
        ...
[Figure: sys_call_table, entries 0 to 255 : sys_ni_syscall(), sys_exit(), sys_fork(), sys_read(), sys_write(), ..., sys_vfork() at 190, ...]
Add New System Call (Cont'd)
put them all together : example of fork
user process
    main() { ... fork(); }
libc.a
    fork() {
        ...
        movl $2, %eax            /* __NR_fork */
        int $0x80
        ...
    }
Kernel
    IVT (idt_table) : 0x0 divide_error(), debug(), nmi(), ..., 0x80 system_call(), ...
    ENTRY(system_call)           /* entry.S */
        SAVE_ALL
        ...
        call *SYMBOL_NAME(sys_call_table)(,%eax,4)
        ...
    sys_call_table : 1 sys_exit(), 2 sys_fork(), 3 sys_read(), 4 sys_write(), ...
    sys_fork() /* arch/i386/kernel/process.c */ -> do_fork() /* kernel/fork.c */
Add New System Call (Cont'd)
Syntax of real system call handler in Linux
    asmlinkage int sys_fork(struct pt_regs regs)    /* arch/i386/kernel/process.c */
    {
        return do_fork(..);
    }

    int do_fork(..)                                 /* kernel/fork.c */
    {
        ...                                         /* create new process */
    }

    asmlinkage int sys_read(fd, buf, count)         /* fs/read_write.c */
    {
        ...                                         /* read data */
    }
Add New System Call (Cont'd)
Example: add new system call 1 (too simple example)
1. kernel modification
1-1. allocate syscall number : include/asm-i386/unistd.h
    #define __NR_exit        1
    ...
    #define __NR_vfork       190
    #define __NR_mysyscall   191
1-2. register in sys_call_table : arch/i386/kernel/entry.S
    ENTRY(sys_call_table)
        ...
        .long SYMBOL_NAME(sys_mysyscall)    /* 191 */
        .rept NR_syscalls-191
Add New System Call (Cont'd)
1-3. coding new system call handler
    asmlinkage int sys_mysyscall()
    {
        printk("Hello Linux, I'm in Kernel\n");
        return 0;
    }
1-4. kernel rebuild
    if you add a new source file, you have to tell the make utility about it
    e.g. for kernel/test.c, modify the following variable in the Makefile of the kernel directory:
    O_OBJS = sched.o dma.o fork.o ... capability.o test.o
Add New System Call (Cont'd)
2. make user program with new system call
2-1. make user program
    #include <...>
    _syscall0(int, mysyscall);
    main()
    {
        int i;
        i = mysyscall();
    }
2-2. make library if possible
    # ar, ranlib
Just Do It ()
    #define _syscall0(type, name) \
    type name(void) \
    { \
        long __res; \
        __asm__ volatile ("int $0x80" \
            : "=a" (__res) \
            : "0" (__NR_##name)); \
        __syscall_return(type, __res); \
    }    /* include/asm-i386/unistd.h */
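On current systems the same idea, passing the system call number and trapping into the kernel, is usually reached through the C library's syscall() function instead of the _syscallN macros. A minimal sketch, assuming Linux with glibc; `raw_getpid` is a hypothetical name.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

/* Invoke getpid by its system call number, bypassing the getpid() wrapper.
   syscall() loads the number and arguments into registers and makes the trap. */
long raw_getpid(void) {
    return syscall(SYS_getpid);
}
```

The value returned by raw_getpid() is the same as the one returned by the ordinary getpid() wrapper.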
Add New System Call (Cont'd)
add new system call 2 : arguments passing
1. kernel modification
1-1
    #define __NR_show_mult 192
1-2
    .long SYMBOL_NAME(sys_show_mult)    /* 192 */
    .rept NR_syscalls-192
1-3
    asmlinkage int sys_show_mult(int x, int y, int *res)
    {
        int error, compute;

        /* res points into user space: check it before writing */
        if ((error = verify_area(VERIFY_WRITE, res, sizeof(*res))))    /* include/asm-i386/uaccess.h */
            return error;
        compute = x * y;
        put_user(compute, res);                                        /* include/asm-i386/uaccess.h */
        return (0);
    }
cf) copy_to_user(), copy_from_user() /* include/asm-i386/uaccess.h */
Add New System Call (Cont'd)
add new system call 2 : arguments passing
2-1. make user program
    #include <...>
    _syscall3(int, show_mult, int, x, int, y, int *, result);
    main()
    {
        int ret = 0;
        show_mult(2, 5, &ret);
        printf("Result : %d * %d = %d\n", 2, 5, ret);
    }
_syscall3 expands roughly to /* include/asm-i386/unistd.h */
    int show_mult(int x, int y, int *result)
    {
        long __res;
        __asm__ volatile ("int $0x80"
            : "=a" (__res)
            : "0" (__NR_show_mult), "b" ((long)(x)), "c" ((long)(y)), "d" ((long)(result)));
        if (__res < 0) {        /* the kernel returns a negative error code */
            errno = -__res;
            return -1;
        }
        return __res;
    }
Add New System Call (Cont'd)
add new system call 3 : some general system calls
getpid
    asmlinkage int sys_getpid() { return current->pid; }
nice
    asmlinkage int sys_nice(new_priority) { ... current->priority = new_priority; }
pause
    asmlinkage int sys_pause() { current->state = TASK_INTERRUPTIBLE; schedule(); }
NR_TASKS : total number of concurrent tasks
all tasks are connected in a doubly linked list (next_task, next_run)
global variables : init_task, current
task[0] : init_task, task[1] : init process
Add New System Call (Cont'd)
fork
sys_fork() /* arch/i386/kernel/process.c */ -> do_fork() /* kernel/fork.c */
  - p = alloc_task_struct()
  - task structure initialize
  - copy_mm()
  - ...
  - copy_thread()
  - wake_up_process(p)
  - return (p->pid)
copy_thread() /* arch/i386/kernel/process.c */
  - ...
  - p->tss.eax = 0;
  - p->tss.eip = ret_from_fork;
wake_up_process() /* kernel/sched.c */
  - add_to_runqueue(p);
  - current->need_resched = 1
ret_from_sys_call() /* arch/i386/kernel/entry.S */ -> schedule() /* kernel/sched.c */
  - if (schedule parent) ... else (schedule child) ...
Add New System Call (Cont'd)
exit
sys_exit() /* kernel/exit.c */ -> do_exit() /* kernel/exit.c */
  - sem_exit()
  - exit_mmap()
  - free_page_tables()
  - exit_files()
  - exit_thread()
  - ...
  - notify_parent() /* kernel/signal.c */
  - handling each child process
  - current->state = TASK_ZOMBIE
  - schedule()
Add New System Call (Cont'd)
Project II : add a new system call that gets kernel information
we want to know the process id, state, process execution time (system time and user time separately), the number of page faults, the number of open files, and so on
1. kernel modification
    asmlinkage int sys_process_statistics(...)
    {
        ...
        current->pid, min_flt, maj_flt, times.tms_utime, times.tms_stime
        ...
    }
2. user program
Motivation of Module in LINUX
why do we use modules?
- Linux is a monolithic kernel
- trivial modifications require the kernel to be recompiled
- the kernel keeps growing in size as new features are added
- many components occupy memory permanently although they are rarely used (e.g. backup tape driver)
module : steps toward a micro-kernelized Linux
- small and compact kernel
- clean kernel
- rapid kernel
- solution business : components-based Linux
What can be Modules ?
possibly anything; in the current version:
- file system : register_filesystem, unregister_filesystem; read_super, put_super
- block device driver : register_blkdev, unregister_blkdev; open, release
- character device driver : register_chrdev, unregister_chrdev; open, release
- network device driver : register_netdev, unregister_netdev; open, close
- exec domain : register_exec_domain, unregister_exec_domain; load_binary, personality
- binary format : register_binfmt, unregister_binfmt; load_binary
- ...
cf : /lib/modules/x.x.x/*.o
How to manipulate modules?
compilation
    # gcc -D__KERNEL__ -D_LINUX -DMODULE -c new_module.c
insmod, lsmod, rmmod
    Enable loadable module support (CONFIG_MODULES) [Y/n/?]
    MSDOS fs support (CONFIG_MSDOS_FS) [M/n/y/?]
    # insmod fat
    # lsmod
    Module:  #pages:  Used by
    fat      6        0
    # rmmod fat
kerneld : for on-demand loading
    e.g. mount -t msdos /dev/fd0 /mnt => transparently loads the fat and msdos modules
How to implement modules?
Module : two basic interfaces
- init_module() : called by insmod; registers the module with the kernel through register_filesystem(), register_blkdev(), register_netdrv(), sock_register(), ...
- cleanup_module() : called by rmmod
How to implement modules? (Cont'd)
example1 : Hello world!!
    /* hello.c */
    #include <...>
    #include <...>

    int init_module()
    {
        printk("Hello world!! - I'm in kernel\n");
        return 0;
    }

    void cleanup_module()
    {
        printk("Bye world - I'm in kernel\n");
    }

    # gcc -D__KERNEL__ -D_LINUX -DMODULE -c hello.c
    # insmod hello.o
    # rmmod hello
How to implement modules? (Cont'd)
example2 : simple device driver
    /* time.c */
    #include <...>
    #include <...>
    #define HOUR_MAJOR 60
    #define HOUR_MINOR 0

    int time_open(..)
    {
        ...
    }

    int time_read(fd, buf, size)
    {
        ...
        /* copy CURRENT_TIME to the user buffer; the destination comes first */
        copy_to_user(buf, ..., ...);
    }

    struct file_operations time_fops = {
        NULL, time_read, NULL, NULL, NULL, NULL, NULL, time_open, NULL, NULL
    };

    int time_init()
    {
        register_chrdev(HOUR_MAJOR, "time", &time_fops);
        printk("time module loaded (major=%d)\n", HOUR_MAJOR);
        return 0;
    }

    int init_module()
    {
        return time_init();
    }

    void cleanup_module()
    {
        unregister_chrdev(HOUR_MAJOR, "time");
        printk("time module unloaded\n");
    }
How to implement modules? (Cont'd)
example2 : simple device driver
    # gcc -D__KERNEL__ -D_LINUX -DMODULE -c time.c
    # mknod /dev/time c 60 0
    # insmod time
    # lsmod
    Module:  #pages:  Used by:
    time     1
    # cat /dev/time        /* print current time */
    # rmmod time
how can the cat command invoke the time_read() function?
How to implement modules? (Cont'd)
example2 : simple device driver : register_chrdev()
init_module -> time_init() -> register_chrdev(HOUR_MAJOR, "time", &time_fops);    /* major numbers : include/linux/major.h */
  - chrdevs[major].name = "time"
  - chrdevs[major].fops = time_fops
  (block drivers use register_blkdev() in the same way)
How to implement modules? (Cont'd)
example2 : simple device driver : open
sys_open() -> filp_open() -> open_namei() /* fs/namei.c */
  - get_unused_fd()
  - fd_install(fd, f)
  - struct file initialize
  - f->f_op->open() : one of pipe_open(), socket_open(), nfs_open(), blkdev_open(), chrdev_open()
chrdev_open() /* fs/devices.c */
  - filp->f_op = get_chrfops(MAJOR(inode->i_rdev));    /* filp->f_op = chrdevs[major].fops */
  - filp->f_op->open    /* time_open() */
How to implement modules? (Cont'd)
example2 : simple device driver : read
sys_read() /* fs/read_write.c */
  - f->f_op->read : one of block_read() /* fs/block_dev.c */, nfs_read(), tty_read(), pipe_read(), time_read()
How to implement modules? (Cont'd)
example3 : system call wrapper
    #include <...>
    #include <...>
    #include <...>
    #include <...>
    #include <...>

    extern void *sys_call_table[];
    int uid;
    asmlinkage int (*original_call) (const char *, int, int);
    asmlinkage int (*getuid_call) ();

    int init_module()
    {
        original_call = sys_call_table[__NR_open];
        sys_call_table[__NR_open] = our_sys_open;      /* redirect open() to the wrapper */
        printk("Spying on UID: %d\n", uid);
        getuid_call = sys_call_table[__NR_getuid];
        return 0;
    }

    void cleanup_module()
    {
        if (sys_call_table[__NR_open] != our_sys_open) {
            ...
        }
        sys_call_table[__NR_open] = original_call;     /* restore the original handler */
    }
How to implement modules? (Cont'd)
example3 : system call wrapper
    asmlinkage int our_sys_open(const char *fname, int flags, int mode)
    {
        int i = 0;
        char ch;

        if (uid == getuid_call()) {
            printk("opened file by %d: ", uid);
            do {
                get_user(ch, fname + i);               /* fname lives in user space */
                i++;
                printk("%c", ch);
            } while (ch != 0);
        }
        printk("\n");
        return original_call(fname, flags, mode);
    }
How to implement modules? (Cont'd)
example4 : new file system
- design super block
- program file operations, program inode operations
- registering : register_filesystem()
- mount
    #ifdef CONFIG_MINIX_FS
        register_filesystem(&(struct file_system_type) {minix_read_super, "minix", 1, NULL});
    #endif

    struct file_system_type {
        struct super_block *(*read_super) ();
        char *name;
        int requires_dev;
        struct file_system_type *next;
    } *file_system;
How to implement modules? (Cont'd)
Project III : implement your own module
- make file operations
- make module interface
- make driver
- mknod (use a pseudo device such as memory)
mydrv : mydrv_init(), mydrv_open(), mydrv_release(), mydrv_read(), mydrv_write(), mydrv_ioctl(), mydrv_out(), mydrv_interrupt()
module interface : init_module(), cleanup_module()
How to implement modules? (Cont'd)
system calls for modules
create_module
- memory allocation for the module (returns the load address)
- a new element in module_list
init_module
- physical loading of the requested module (module functions become an integral part of the kernel)
- relocating module functions and resolving references to kernel symbols
- calls the module specific init_module function
delete_module
get_kernel_syms
- to get kernel symbols
How to implement modules? (Cont'd)
Kernel data structure for create_module()
[Figure: module_list -> struct module (next, ref, symtab, name, size, ...) -> next struct module (next, ref, symtab, name, size, ...); each module points to the symbol table for this module and to its references]
Control flow of FS system call
file access under Linux /* include/linux/sched.h, fs.h */
task structure : fs, files, ...
fs_struct : count, umask, *root, *pwd (root and pwd point to inodes)
files_struct : count, close_on_exec, fd[0], fd[1], ..., fd[255]
fd[i] -> file -> inode -> file operation routines
why do we need the file data structure?
Control flow of FS system call (Cont'd)
Why do we need the file data structure?
=> to support various types of files with a single coherent interface
open
sys_open() /* fs/open.c */ -> filp_open() /* fs/open.c */ -> open_namei() /* fs/namei.c */
  - get_unused_fd()
  - fd_install(fd, f)
  - struct file initialize
  - f->f_op->open()    /* to support various file types */
Control flow of FS system call (Cont'd)
struct file /* include/linux/fs.h */
  f_next, f_prev
  f_dentry      /* to access inode */
  f_op
  f_mode        /* access type */
  f_pos         /* file offset */
  f_count       /* reference count */
  f_flags
  f_reada, f_ramax, ...
file operations /* include/linux/fs.h */
  lseek(), read(), write(), readdir(), poll(), ioctl(), mmap(), open(), flush(), release(), fsync(), fasync(), ...
file operation examples (entries in the order above)
  fs/ext2/file.c : ext2_file_lseek, generic_file_read, ext2_file_write, NULL, NULL, ext2_file_ioctl, generic_file_mmap, NULL, ...
  fs/ufs/file.c  : ufs_file_lseek, generic_file_read, ufs_file_write, NULL, NULL, NULL, generic_file_mmap, NULL, ...
  fs/nfs/file.c  : NULL, nfs_file_read, nfs_file_write, NULL, NULL, NULL, nfs_file_mmap, nfs_file_open, ...
  fs/pipe.c      : pipe_lseek, pipe_read, pipe_write, NULL, pipe_poll, pipe_ioctl, NULL, pipe_rdwr_open, ...
  fs/devices.c   : NULL, NULL, NULL, NULL, NULL, NULL, NULL, blkdev_open, ...
  net/socket.c   : sock_lseek, sock_read, sock_write, NULL, sock_poll, sock_ioctl, NULL, sock_no_open, ...
where is create()?
Control flow of FS system call (Cont'd)
open
System call layer : sys_open() /* fs/open.c */
VFS layer : filp_open() /* fs/open.c */ -> open_namei() /* fs/namei.c */ : iget(), bread()
  - get_unused_fd()
  - fd_install(fd, f)
  - struct file initialize
  - f->f_op->open()
Specific File layer : pipe_rdwr_open(), blkdev_open(), chrdev_open(), nfs_file_open(), sock_no_open()
Control flow of FS system call (Cont'd)
read
System call handling layer : sys_read() /* fs/read_write.c */
  - f->f_op->read
VFS layer : generic_file_read() /* mm/filemap.c */
  - try to find the page in the page cache; if (hit) OK
  - get_free_page()
  - inode->i_op->readpage()
Specific File layer : sock_read(), block_read(), tty_read(), pipe_read(), nfs_file_read()
Control flow of FS system call (Cont'd)
inode structure in Linux /* include/linux/fs.h, ext2_fs_i.h */
task -> fd[] -> file (f_dentry, f_pos, f_op) -> dentry (d_inode) -> inode
inode
  i_ino
  i_dev
  i_count
  i_mode
  i_nlink
  i_uid, gid
  i_atime, ...
  i_rdev         -> device driver
  i_op           -> inode operation routines
  i_data[15]
  i_flags
  i_...
  file specific information
Control flow of FS system call (Cont'd)
inode operation example
inode operations (i_op) /* include/linux/fs.h */
  def_file_operation
  create(), lookup()
  link(), unlink(), symlink()
  mkdir(), rmdir()
  mknod(), rename(), readlink(), followlink()
  readpage(), writepage()
  bmap(), truncate(), ...
inode operation examples
  fs/ufs/file.c  : ufs_file_operations, NULL, NULL, NULL, NULL, ..., generic_readpage, NULL, ufs_bmap, ...
  fs/ext2/file.c : ext2_file_operations, NULL, NULL, NULL, NULL, ..., generic_readpage, NULL, ext2_bmap, ...
  fs/nfs/file.c  : nfs_file_operations, NULL, NULL, NULL, NULL, ..., nfs_readpage, nfs_writepage, NULL, ...
  fs/pipe.c      : rdwr_pipe_fops, NULL, NULL, NULL, NULL, ...
  fs/dos/files.c : dos_file_operations, NULL, NULL, NULL, NULL, dos_readpage, dos_writepage, NULL, ...
  fs/devices.c   : def_blk_fops, NULL, NULL, NULL, NULL, ...
Control flow of FS system call (Cont'd)
read
System call handling layer : sys_read() /* fs/read_write.c */
  - f->f_op->read
VFS layer : generic_file_read() /* mm/filemap.c */
  - try to find the page in the cache; if (hit) OK
  - inode->i_op->readpage()
Specific File layer : sock_read(), block_read(), tty_read(), pipe_read()
Specific FS layer : generic_readpage() /* fs/buffer.c */, nfs_readpage(), dos_readpage(), coda_readpage()
  - generic_readpage() -> ext2_bmap() /* fs/ext2/inode.c */ or ufs_bmap() /* fs/ufs/inode.c */
  - ll_rw_block() /* drivers/block/ll_rw_blk.c */
Device Driver layer : hd_request /* drivers/block/hd.c */
Device Driver Implementation in Linux
data structure
- blkdevs, chrdevs for devsw
- blk_dev_struct for block drivers only
    struct device_struct {        /* fs/devices.c */
        name;
        fops;                     /* file_operations */
    } chrdevs[], blkdevs[];
    file_operations : lseek, read, write, readdir, poll, ioctl, mmap, open, flush, release, fsync, fasync, ...
    struct blk_dev_struct {       /* include/linux/blkdev.h */
        request_fn;
        queue;
        request;
        ...
    } blk_dev[];
Driver Implementation in Linux (Cont'd)
data structure (cont'd)
[Figure: chrdevs[] entry (name, fops) -> file_operations; blk_dev entry (request_fn, current_request) -> request -> request -> ...; each request (rq_status, rq_dev, cmd, sem, bh, tail, next) points to a buffer_head (b_dev, b_blocknr, b_state, b_count, b_size, ..., b_next, b_data)]
Driver Implementation in Linux (Cont'd)
Example of structure of driver : IDE disks /* drivers/block/hd.c */
struct file_operations hd_ops : NULL, block_read, block_write, NULL, NULL, hd_ioctl, NULL, hd_open, NULL, hd_release, block_fsync
driver functions : hd_init(), hd_open(), hd_release(), hd_ioctl(), hd_request(), hd_out(), hd_interrupt(), check_status()
Driver Implementation in Linux (Cont'd)
major number /* include/linux/major.h */
Major   Character devices          Block devices
0
1       mem                        RAM disk
2                                  floppy (fd*)
3                                  IDE hard disk (hd*)
4       terminal
5       terminal & AUX
6       Parallel Interface
7       virtual console (vcs*)
8                                  SCSI hard disk (sd*)
9       SCSI tapes (st*)
23                                 Mitsumi CD-ROM (mcd*)
...
Driver Implementation in Linux (Cont'd)
initialization of disk driver : register_blkdev()
init process or init_module -> hd_init() /* drivers/block/hd.c */
  - register_blkdev(HD_MAJOR, "hd", &hd_fops);    /* major numbers : include/linux/major.h */
  - blk_dev[HD_MAJOR].request_fn = hd_request
register_blkdev() /* fs/devices.c */
  - blkdevs[major].name = device name
  - blkdevs[major].fops = fops
Driver Implementation in Linux (Cont'd)
disk driver open
sys_open() /* fs/open.c */ -> filp_open() /* fs/open.c */ -> open_namei() /* fs/namei.c */
  - get_unused_fd()
  - fd_install(fd, f)
  - struct file initialize
  - f->f_op->open() : one of pipe_open(), socket_open(), nfs_open(), chrdev_open(), blkdev_open()
blkdev_open() /* fs/devices.c */
  - filp->f_op = get_blkfops(MAJOR(inode->i_rdev));    /* filp->f_op = blkdevs[major].fops */
  - filp->f_op->open    /* hd_open() in drivers/block/hd.c */
Driver Implementation in Linux (Cont'd)
disk driver read
sys_read() /* fs/read_write.c */
  - f->f_op->read : one of nfs_read(), generic_file_read() /* mm/filemap.c */, tty_read(), pipe_read(), block_read()
block_read() /* fs/block_dev.c */
  - getblk();    /* buffer header */
ll_rw_block() /* drivers/block/ll_rw_blk.c */
  - request structure initialize
  - make_request() -> add_request()
  - call blk_dev[major].request_fn
hd_request() /* drivers/block/hd.c */
  - hd_out()
Driver Implementation in Linux (Cont'd)
queue and requests (similar to a message queue)
- requests are sorted by sector number
- inb, outb
    struct blk_dev_struct {       /* include/linux/blkdev.h */
        request_fn;
        queue;
        request;
        ...
    } blk_dev[];
    struct request {
        rq_status
        rq_dev
        cmd                       /* R/W */
        error
        sector, nr_sector
        buffer, bh
        sem
        next
        ...
    }
[Figure: buffer cache -> bread / block_read -> ll_rw_block -> make_request -> queue (req, req, req) -> request_fn = hd_request -> block device driver does the I/O]
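Keeping the request list ordered by sector number can be sketched with a plain linked-list insertion. This is a simplified model rather than the kernel's add_request(): `struct request_demo` and `add_request_sorted` are hypothetical names, and only the sector field is kept.

```c
#include <stddef.h>

/* Reduced request: just the starting sector and the queue link. */
struct request_demo {
    unsigned long sector;
    struct request_demo *next;
};

/* Insert req so that the queue stays in ascending sector order.
   Requests for equal sectors keep their arrival order.
   Returns the (possibly new) head of the queue. */
struct request_demo *add_request_sorted(struct request_demo *head,
                                        struct request_demo *req) {
    struct request_demo **pp = &head;

    while (*pp != NULL && (*pp)->sector <= req->sector)
        pp = &(*pp)->next;               /* skip requests that sort before req */
    req->next = *pp;
    *pp = req;
    return head;
}
```

With the queue sorted, the driver's request function can serve neighbouring sectors back to back instead of seeking in arrival order.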
Driver Implementation in Linux (Cont'd)
various disks and partitions : gendisk
struct gendisk : major, name, minor_shift, max_p, part, ..., real_devices, next
[Figure: gendisk_head -> gendisk (major 8, name sd) -> gendisk (major 3, name ide0); part points to an array of hd_struct (start_sect, nr_sects), one entry per partition]
Driver Implementation in Linux (Cont'd)
tty driver : register_chrdev()
init process or init_module -> tty_init() /* drivers/char/tty_io.c */
  - register_chrdev(TTY_MAJOR, "tty", &tty_fops);    /* major numbers : include/linux/major.h */
register_chrdev() /* fs/devices.c */
  - chrdevs[major].name = device name
  - chrdevs[major].fops = fops
tty_fops /* drivers/char/tty_io.c */ : tty_lseek, tty_read, tty_write, NULL, tty_poll, tty_ioctl, NULL, tty_open, NULL, tty_release, NULL, tty_fasync
Driver Implementation in Linux (Cont'd)
Example of network driver : 3c509
- different from disk and tty drivers
- does not interface directly with the VFS
driver functions /* drivers/net/3c509.c */ : el3_init(), el3_open(), el3_stop(), el3_release(), el3_start_xmit(), el3_out(), el3_interrupt()
upper layer : ip_output() on the send side, ip_rcv() on the receive side
Driver Implementation in Linux (Cont'd)
Example of network driver : 3c509
    struct device {               /* include/linux/netdevices.h */
        name
        mem_end, mem_start
        base addr                 /* port number */
        init, destructor
        ...
        device_addr
        qdisc                     /* sk_buff */
        ...
        open, stop
        hard_start_xmit, hard_header
        irq
    }
init_module() in 3c509 /* drivers/net/3c509.c */
    /* register_netdev() */
    init port, irq, make dev structure
    dev->init = el3_init
    dev->open = el3_open
    dev->hard_start_xmit = el3_start_xmit
    ...
el3_open()
    ...
    request_irq(dev->irq, el3_interrupt, ...
Task Scheduling
LINUX scheduling
- clock tick is 10 msec, time quantum is 10 clock ticks
- supports REAL-TIME tasks

Variables for scheduling in the task structure /* include/linux/sched.h */
- p_policy: task type, one of SCHED_FIFO, SCHED_RR, SCHED_OTHER
- p_priority: set to DEF_PRIORITY (20); can be changed using sys_nice() or sys_setpriority()
- p_counter: decreased at each clock tick; reset to priority when the counters of all tasks are zero
- need_resched: re-scheduling is needed on return from a syscall or interrupt
- rt_priority: set using the sched_setscheduler(pid, policy, sched_param) system call; used for real-time tasks (static priority)
Task Scheduling (Cont`)
schedule() function /* kernel/sched.c */
- entered when need_resched is set or from sleep_on()
- schedule real-time tasks first (rt_priority)
- otherwise select the task with the highest value of counter + priority (using the goodness function)
  - gives an advantage to a task that ran on this CPU (this_cpu)
  - gives a slight advantage to a task that has the mm object
- if p_counter == 0 for all tasks, set p_counter = p_priority for every task
- context switch: switch_to(current, next) /* arch/i386/kernel/process.c */
Task Scheduling (Cont`)
Example of scheduling: 3 tasks

time (millisecond)   T1 p_pri p_count   T2 p_pri p_count   T3 p_pri p_count
0                    20       20        20       20        20       20
10                   20       10        20       20        20       20
20                   20       10        20       10        20       20
30                   20       10        20       10        20       10
40                   20       0         20       10        20       10
                     20       0         20       0         20       10
                     20       20        20       20        20       20
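The example can be replayed with a small user-space simulation. It is a deliberately simplified sketch: it ignores real-time tasks and the goodness bonuses, picks the task with the largest counter (lowest index on ties), and recharges every counter from the priority once all of them are zero. The names struct task, pick_next() and run_ticks() are hypothetical.

```c
#include <assert.h>

struct task {
    int priority;   /* p_priority: value the counter is recharged to */
    int counter;    /* p_counter: time left before the task must yield */
};

/* Index of the task with the largest counter (lowest index wins ties). */
static int max_counter_index(const struct task *t, int n)
{
    int i, best = 0;

    for (i = 1; i < n; i++)
        if (t[i].counter > t[best].counter)
            best = i;
    return best;
}

/* Choose the next task to run. When every counter is zero, recharge all
 * counters from the priorities first, then choose again. */
int pick_next(struct task *t, int n)
{
    int i, best;

    assert(n > 0);
    best = max_counter_index(t, n);
    if (t[best].counter == 0) {
        for (i = 0; i < n; i++)
            t[i].counter = t[i].priority;
        best = max_counter_index(t, n);
    }
    return best;
}

/* Charge the running task for the time it used, never going below zero. */
void run_ticks(struct task *t, int idx, int ticks)
{
    t[idx].counter -= ticks;
    if (t[idx].counter < 0)
        t[idx].counter = 0;
}
```

Starting from three tasks with priority 20 and counter 20, and charging 10 per slice as in the table, the tasks run in the order T1, T2, T3, T1, T2, T3; the next call to pick_next() finds all counters at zero and recharges them to 20.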
Signal
- a mechanism to inform a process of an asynchronous event
- types of signal: SIGKILL, SIGINT, SIGBUS, SIGUSR1, ...
- action: abort, exit, ignore, stop, user-level catch function

What's the difference among interrupt, trap, and signal?

    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    void sig_handler(int signo)
    {
        signal(SIGUSR1, sig_handler);            /* reinstall */
        printf("received signal %d\n", signo);   /* handle the signal */
        /* ... */
    }

    int main(void)
    {
        signal(SIGUSR1, sig_handler);            /* install the handler */
        /* ... */
        for ( ; ; )
            pause();
    }
Signal (Cont`)
- register signal handler (signal catch function)
- send signal
- signal detection: on the state transition from kernel running to user running
- call signal handler

Variables for signal in the task structure
- int sigpending: is a signal received or not?
- struct signal_struct *sig
- sigset_t signal, blocked

struct signal_struct { count; action[_NSIG]; siglock }   /* sched.h */
struct sigaction { sa_handler; sa_flags; sa_restorer; sa_mask }   /* asm-i386/signal.h */
typedef struct { unsigned long sig[_NSIG_WORDS]; } sigset_t;   /* asm-i386/signal.h */
Signal (Cont`)
register signal catch function

sys_signal(sig, handler) /* kernel/signal.c */
- do_sigaction(sig, new_sa, old_sa)
[Figure: task fields sig, signal, blocked, sigpending; sig points to signal_struct (count, action[_NSIG], siglock); each action entry is a sigaction (sa_handler, sa_flags, sa_restorer, sa_mask); signal and blocked are sigset_t bitmaps, bits 63 .. 0]

Signal (Cont`)
send signal

sys_kill(pid, sig) /* kernel/signal.c */
- kill_proc_info(sig, info, pid)
- send_sig_info(sig, info, *t)
  - sigaddset(t->signal, sig);
  - t->sigpending = 1;
[Figure: same task / signal_struct / sigaction / sigset_t layout]

Signal (Cont`)
signal handling

if (current->sigpending) do_signal();   /* arch/i386/kernel/entry.S */
do_signal(regs, oldset) /* arch/i386/kernel/signal.c */
- signr = dequeue_signal()
- handle SIG_IGN or SIG_DFL
- handle_signal(): set up the stack frame for the signal handler
[Figure: same task / signal_struct / sigaction / sigset_t layout]
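The send and dequeue steps can be modelled in user space with two bitmaps and a flag. This is a sketch under simplifying assumptions: a single 64-bit word stands in for sigset_t, signals are numbered 1..64, and sigpending is raised only when an unblocked signal is waiting. All names ending in _sketch are hypothetical.

```c
#include <assert.h>

#define NSIG_SKETCH 64

/* Only the signal-related fields of a task, with one 64-bit word
 * standing in for each sigset_t. */
struct task_sketch {
    unsigned long long signal;    /* pending signals, bit (sig - 1) */
    unsigned long long blocked;   /* blocked signals, bit (sig - 1) */
    int sigpending;               /* 1 if a deliverable signal is waiting */
};

static unsigned long long sigbit(int sig)
{
    return 1ULL << (sig - 1);
}

/* Mark sig as pending; raise sigpending if it can be delivered now. */
void send_sig_sketch(struct task_sketch *t, int sig)
{
    assert(sig >= 1 && sig <= NSIG_SKETCH);
    t->signal |= sigbit(sig);
    if (t->signal & ~t->blocked)
        t->sigpending = 1;
}

/* Take the lowest-numbered pending, unblocked signal off the bitmap.
 * Returns the signal number, or 0 if nothing is deliverable. */
int dequeue_signal_sketch(struct task_sketch *t)
{
    unsigned long long ready = t->signal & ~t->blocked;
    int sig;

    for (sig = 1; sig <= NSIG_SKETCH; sig++) {
        if (ready & sigbit(sig)) {
            t->signal &= ~sigbit(sig);
            t->sigpending = (t->signal & ~t->blocked) != 0;
            return sig;
        }
    }
    t->sigpending = 0;
    return 0;
}
```

A blocked signal stays in the pending bitmap without raising sigpending; it becomes deliverable as soon as its bit is cleared from blocked.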
Signal (Cont`)
signal handling: state of the stack for handling a signal
[Figure: stack in memory before signal handling: return address, arguments; stack in memory while handling the signal: return address to kernel, return address to sighandler, arguments]
Thread
Motivation (golf course)
- possibility of parallel processing
- a process is too heavy
[Figure: process model; processes (P), each with its own address space, using the CPU over time] (Source: UNIX internals)

Thread (Cont`)
thread model
- task: a set of threads and a collection of resources (passive)
- thread: hardware context, stack, thread information (id, scheduling, ...)
[Figure: thread model; threads within one address space using the CPU over time] (Source: UNIX internals)

Thread (Cont`)
types of threads
- kernel thread
- LWP (lightweight process): a kernel-supported user thread
- user thread: C-thread, P-thread
[Figure: CPUs, thread scheduler, kernel threads (K), LWPs (L), user threads (U) inside a process (or task), user-level scheduler]

Thread (Cont`)
threads in Linux
- struct thread: currently only one per task structure
- sys_clone(): fully shares the address context, such as the page directory
- under development
- user-level threads (P thread) can be used, /usr/include/pthread.h: pthread_create(), pthread_join(), pthread_mutex_init()
Thread (Cont`)
Example of thread programming

    #include ...

    typedef struct {
        double volatile *p_s;        /* pointer to the shared sum */
        pthread_mutex_t *p_s_lock;   /* pointer to the lock for the sum */
        int n;
    } DATA;

    #define L 9
    double x[L], y[L];
    /* gcc -lpthread */

    int main(int argc, char *argv[])
    {
        pthread_t *thread;
        void *retval;
        int cpu, i;
        DATA *A;
        volatile double s = 0;
        pthread_mutex_t s_lock;

        /* exactly one argument is expected: the number of CPUs */
        if (argc != 2) {
            printf("USAGE: %s CPU_number\n", argv[0]);
            exit(1);
        }
        cpu = atoi(argv[1]);
        thread = (pthread_t *)calloc(cpu, sizeof(pthread_t));
        A = (DATA *)calloc(cpu, sizeof(DATA));
Thread (Cont`)
Example of thread programming

    void *SMP_scalprod(void *arg)
    {
        register double localsum;
        long i;
        DATA D = *(DATA *)arg;

        localsum = 0.0;
        for (i = D.n; i [...]

[...]
    ...family] = ops;
    }   /* net/socket.c */

    struct net_proto_family inet_family_ops = { PF_INET, inet_create };

    inet_proto_init()
    {
        sock_register(&inet_family_ops);
        ...
    }   /* net/ipv4/af_inet.c */

Other protocol families register themselves in the same way: /* net/unix/af_unix.c */, /* net/ipx/af_ipx.c */
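The registration pattern, where each protocol family stores a pointer to its operations in a table indexed by family number and socket creation is dispatched through that table, can be sketched in user space as follows. All names ending in _sketch, PF_DEMO, and demo_create() are hypothetical; the real structures carry more fields.

```c
#include <assert.h>
#include <stddef.h>

#define NPROTO_SKETCH 32
#define PF_DEMO 2   /* hypothetical family number for the demo */

struct socket_sketch {
    int family;
    int type;
    int created_by;   /* set by the family's create() for the demo */
};

struct net_proto_family_sketch {
    int family;
    int (*create)(struct socket_sketch *sock, int protocol);
};

/* One slot per protocol family. */
static const struct net_proto_family_sketch *net_families_sketch[NPROTO_SKETCH];

/* Store the family's operations in its slot. Returns 0, or -1 if invalid. */
int sock_register_sketch(const struct net_proto_family_sketch *ops)
{
    if (ops == NULL || ops->family < 0 || ops->family >= NPROTO_SKETCH)
        return -1;
    net_families_sketch[ops->family] = ops;
    return 0;
}

/* Fill in the generic part of the socket, then let the registered family
 * finish the job. Returns -1 if the family is unknown. */
int sock_create_sketch(int family, int type, int protocol,
                       struct socket_sketch *sock)
{
    if (family < 0 || family >= NPROTO_SKETCH ||
        net_families_sketch[family] == NULL)
        return -1;
    sock->family = family;
    sock->type = type;
    return net_families_sketch[family]->create(sock, protocol);
}

/* Hypothetical family implementation. */
static int demo_create(struct socket_sketch *sock, int protocol)
{
    (void)protocol;
    sock->created_by = PF_DEMO;
    return 0;
}

static const struct net_proto_family_sketch demo_family_ops = {
    PF_DEMO, demo_create
};
```

Creating a socket for a family fails until that family has registered; afterwards the generic code reaches the family-specific create() purely through the table.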
Socket Create (cont`)
socket create

sys_socket(family, type, protocol) /* net/socket.c */
- family: AF_UNIX, AF_INET, AF_IPX, ... /* include/linux/socket.h */
- type: SOCK_STREAM, SOCK_DGRAM, SOCK_RAW, ...
sock_create()
- sock_alloc()
- net_families[family]->create(): unix_create(), inet_create(), ...
inet_create()
- sk_alloc()
- switch (type): sock->ops = &inet_stream_ops or sock->ops = &inet_dgram_ops
- sk->prot = &tcp_prot

struct socket { state; type; flags; ops /* proto_ops */; sk /* struct sock */; files, inodes; next, wait; ... }   /* include/linux/net.h */
struct sock { ...; prot; net_pinfo; tp_pinfo; socket; sk_buff; ... }   /* include/net/sock.h */

Socket Create (cont`)
socket create

sys_socket(family, type, protocol) /* net/socket.c */
- sock_create()
- get_fd()
  - get_empty_filp()
  - file->f_op = &socket_file_ops
  - associate d_inode with the socket structure

struct file_operations socket_file_ops = {   /* net/socket.c */
    sock_lseek, sock_read, sock_write, NULL /* readdir */, sock_poll, sock_ioctl,
    NULL /* mmap */, sock_no_open, NULL /* flush */, sock_close, NULL /* fsync */, sock_fasync
}

Socket Create (cont`)
after socket creation

[Figure: task.fd[] -> file (f_dentry, f_pos, f_op) -> dentry (d_inode) -> struct socket -> struct sock; layers from top to bottom: VFS layer, INET layer, TCP layer, IP layer, Driver layer]

struct socket { state; type; flags; ops /* proto_ops */; sk /* struct sock */; files, inodes; next, wait; ... }   /* include/linux/net.h */
struct sock {   /* include/net/sock.h */
    next, prev
    daddr, dport
    rcv_saddr, sport
    ...
    rmem_alloc
    receive_queue   /* sk_buff */
    wmem_alloc
    send_queue
    ...
    pair   /* struct sock */
    prot   /* struct proto */
    tp_pinfo
    dst_cache   /* struct dst_entry */
    ...
}
Send Data
sending data through socket (compare with the FS control flow, which is a piece of cake)

sys_write() /* fs/read_write.c */
- f->f_op->write
sock_write() /* net/socket.c */
- socki_lookup(d_inode)
- make msg
- sock_sendmsg(): sock->ops->sendmsg
inet_sendmsg() /* net/ipv4/af_inet.c */
- sk->prot->sendmsg

struct proto_ops inet_stream_ops = {   /* net/ipv4/af_inet.c */
    PF_INET, sock_no_dup, inet_release, inet_bind, inet_stream_connect, sock_no_socketpair,
    inet_accept, inet_getname, inet_poll, inet_ioctl, inet_listen, inet_shutdown,
    inet_getsockopt, inet_setsockopt, sock_no_fcntl, inet_sendmsg, inet_recvmsg
}

Send Data (cont`)
sending data through socket

inet_sendmsg() /* net/ipv4/af_inet.c */
- sk->prot->sendmsg
tcp_v4_sendmsg()
- tcp_do_sendmsg() /* net/ipv4/tcp.c */: copy data from user to sk_buff
- tcp_send_skb()
- tcp_transmit_skb() /* net/ipv4/tcp_output.c */
  - make tcp header
  - sk->tp_pinfo.af_tcp.af_specific->queue_xmit(skb)

struct proto tcp_prot = {   /* net/ipv4/tcp_ipv4.c */
    next, prev, tcp_close, tcp_v4_connect, tcp_accept, NULL /* retransmit */,
    tcp_write_wakeup, tcp_read_wakeup, tcp_poll, tcp_ioctl, tcp_v4_init_sock,
    tcp_v4_destroy_sock, tcp_shutdown, tcp_getsockopt, tcp_setsockopt,
    tcp_v4_sendmsg, tcp_recvmsg, "TCP", ...
}
Send Data (cont`)
sending data through socket

tcp_transmit_skb() /* net/ipv4/tcp_output.c */
- sk->tp_pinfo.af_tcp.af_specific->queue_xmit(skb)
ip_queue_xmit() /* net/ipv4/ip_output.c */
- build IP header
- fragment handling
- call ip_route_output()   /* dst_cache.output = ip_output in ip_route_output */
- sk->dst_cache->output()
ip_output() /* net/ipv4/ip_output.c */
- ip_finish_output(skb)

struct tcp_func ipv4_specific = {
    ip_queue_xmit, tcp_v4_send_check, tcp_v4_rebuild_header, tcp_v4_conn_request,
    tcp_v4_syn_recv_sock, tcp_v4_get_sock, sizeof(struct iphdr), ip_setsockopt,
    ip_getsockopt, v4_addr2sockaddr, sizeof(struct sockaddr_in)
}

sk_alloc() => tcp_v4_init_sock()
tcp_v4_init_sock() { sk->tp_pinfo.af_tcp.af_specific = &ipv4_specific; ... }   /* net/ipv4/tcp_ipv4.c */
Send Data (cont`)
sending data through socket

ip_finish_output() /* include/net/ip.h */
- hh->hh_output(skb)   /* hh->hh_output = neigh_ops->output = dev_queue_xmit, net/ipv4/arp.c */
dev_queue_xmit() /* net/core/dev.c */
- put the packet into dev->qdisc
- dev->hard_start_xmit()
el3_start_xmit() /* driver/net/3c509.c */
- make ethernet frame
- send frame using inb(), outb(), ...

struct hh_cache { hh_refcnt; hh_type; hh_output }
struct device {   /* include/linux/netdevice.h */
    name
    rmem_end, rmem_start
    mem_end, mem_start
    base addr
    irq
    init, destructor
    ...
    device_addr
    qdisc
    ...
    open, stop
    hard_start_xmit, hard_header
    ...
}

init_module() in 3c509 /* driver/net/3c509.c */
- init port, irq, make dev structure
- dev->open = el3_open
- dev->hard_start_xmit = el3_start_xmit
- ...
Send Data (cont`)
sending data through socket

[Figure: Protocol Layer: struct sock with a send queue of sk_buff entries (headers, data, ...); Device Layer: struct device with a qdisc queue of sk_buff entries (headers, data, ...)]
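Both the send queue of a sock and the qdisc of a device are, at heart, FIFO queues of packet buffers. The sketch below shows such a queue in user space and is only an illustration: struct skb_sketch keeps just a data pointer and a length, and the function names are hypothetical rather than the kernel's sk_buff API.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for sk_buff: payload pointer, length, queue link. */
struct skb_sketch {
    const char *data;
    size_t len;
    struct skb_sketch *next;
};

/* FIFO queue of buffers, as used for a send queue or a device queue. */
struct skb_queue_sketch {
    struct skb_sketch *head;
    struct skb_sketch *tail;
    unsigned int qlen;
};

void skb_queue_init_sketch(struct skb_queue_sketch *q)
{
    q->head = NULL;
    q->tail = NULL;
    q->qlen = 0;
}

/* Append a buffer at the tail of the queue. */
void skb_enqueue_sketch(struct skb_queue_sketch *q, struct skb_sketch *skb)
{
    assert(skb != NULL);
    skb->next = NULL;
    if (q->tail != NULL)
        q->tail->next = skb;
    else
        q->head = skb;
    q->tail = skb;
    q->qlen++;
}

/* Remove and return the buffer at the head (NULL if the queue is empty). */
struct skb_sketch *skb_dequeue_sketch(struct skb_queue_sketch *q)
{
    struct skb_sketch *skb = q->head;

    if (skb == NULL)
        return NULL;
    q->head = skb->next;
    if (q->head == NULL)
        q->tail = NULL;
    skb->next = NULL;
    q->qlen--;
    return skb;
}
```

Handing a packet from the protocol layer to the device layer then means dequeuing it from one queue and enqueuing it on the other; the buffer itself is never copied.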
Send Data (Cont`)
Sending all together (TCP/IP & Ethernet)
cf) compared with the control flow of the FS, this path is far more involved (the FS is a piece of cake)

Layers in the Linux kernel, top to bottom:
- VFS: sys_write() /* fs/read_write.c */
- BSD socket: sock_write() /* net/socket.c */
- inet socket: inet_sendmsg() /* net/ipv4/af_inet.c */
- TCP: tcp_send_skb() /* net/ipv4/tcp_output.c */
- IP: ip_queue_xmit() /* net/ipv4/ip_output.c */
- Device: el3_start_xmit() /* driver/net/3c509.c */
Receive Data
receiving data through socket

el3_open() /* driver/net/3c509.c */
- ...
- request_irq(dev->irq, el3_interrupt ...
el3_interrupt() /* driver/net/3c509.c */
- mark_bh(NET_BH)
net_bh() /* net/core/dev.c */
- make sk_buff in device structure
- ptype->func()
ip_rcv() /* net/ipv4/ip_input.c */
- ip_forward(), ip_defrag()
- skb->dst->input()   /* dst.input = ip_local_deliver in ip_route_input() */
ip_local_deliver() /* net/ipv4/ip_input.c */

struct packet_type { type; dev; func; ... }   /* include/linux/netdevice.h */
struct packet_type ip_packet_type = { ETH_P_IP, NULL, ip_rcv, ... }   /* net/ipv4/ip_output.c */
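The ptype->func() step is a lookup by protocol type followed by an indirect call. A small user-space sketch of that dispatch is given below. The names ending in _sketch are hypothetical; the values 0x0800 and 0x0806 are the standard EtherType numbers for IPv4 and ARP.

```c
#include <assert.h>
#include <stddef.h>

#define ETH_P_IP_SKETCH  0x0800   /* EtherType for IPv4 */
#define ETH_P_ARP_SKETCH 0x0806   /* EtherType for ARP */

/* Received packet reduced to what the dispatch needs. */
struct pkt_sketch {
    unsigned short protocol;   /* EtherType taken from the frame header */
    int handled_by;            /* set by the handler, for the demo */
};

/* One registered protocol handler. */
struct packet_type_sketch {
    unsigned short type;
    int (*func)(struct pkt_sketch *pkt);
    struct packet_type_sketch *next;
};

static struct packet_type_sketch *ptype_list_sketch;

/* Register a handler by pushing it on the list. */
void dev_add_pack_sketch(struct packet_type_sketch *pt)
{
    pt->next = ptype_list_sketch;
    ptype_list_sketch = pt;
}

/* Hand the packet to the handler registered for its protocol type.
 * Returns the handler's result, or -1 if nobody claims the packet. */
int deliver_sketch(struct pkt_sketch *pkt)
{
    struct packet_type_sketch *pt;

    for (pt = ptype_list_sketch; pt != NULL; pt = pt->next)
        if (pt->type == pkt->protocol)
            return pt->func(pkt);
    return -1;
}

/* Hypothetical IP receive routine. */
static int ip_rcv_sketch(struct pkt_sketch *pkt)
{
    pkt->handled_by = ETH_P_IP_SKETCH;
    return 0;
}

static struct packet_type_sketch ip_packet_type_sketch = {
    ETH_P_IP_SKETCH, ip_rcv_sketch, NULL
};
```

Once the IP entry is registered, an IPv4 frame reaches ip_rcv_sketch() while an ARP frame, for which nothing is registered in this sketch, is rejected.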
Receive Data (cont`)
receiving data through socket

ip_local_deliver() /* net/ipv4/ip_input.c */
- ipprot->handler()
tcp_v4_rcv() /* net/ipv4/tcp_ipv4.c */
- tcp_v4_do_rcv(): call tcp_rcv_established or call tcp_rcv_state_process
tcp_rcv_state_process() /* net/ipv4/tcp_input.c */
- check consistency, tcp_data()
tcp_data()
- tcp_data_queue()   /* sk_buff into sk */
- wake up process

struct inet_protocol { handler; err_handler; ...; name }   /* include/net/protocol.h */
struct inet_protocol tcp_protocol = { tcp_v4_rcv, tcp_v4_err, ..., "TCP" }   /* net/ipv4/protocol.c */

Receive Data (cont`)
receiving data through socket

sys_read() /* fs/read_write.c */
- f->f_op->read
sock_read() /* net/socket.c */
- socki_lookup(d_inode)
- make msg header
- sock_recvmsg(): sock->ops->recvmsg
inet_recvmsg() /* net/ipv4/af_inet.c */
- sk->prot->recvmsg
tcp_recvmsg() /* net/ipv4/tcp.c */
- add_wait_queue(sk->sleep, {current, NULL})
- woken up by tcp_data()
Receive Data (cont`)
Receiving all together (TCP/IP & Ethernet)

Reader side in the Linux kernel, top to bottom (ends in sleep):
- VFS: sys_read() /* fs/read_write.c */
- BSD socket: sock_read() /* net/socket.c */
- inet socket: inet_recvmsg() /* net/ipv4/af_inet.c */
- TCP: tcp_recvmsg() /* net/ipv4/tcp.c */
Interrupt side, bottom to top (ends in wake up):
- Device: el3_interrupt() /* driver/net/3c509.c */
- net_bh() /* net/core/dev.c */
- IP: ip_rcv() /* net/ipv4/ip_input.c */
- TCP: tcp_rcv_state_process() /* net/ipv4/tcp_input.c */

Conclusion in Network
Add new features

Send path in the Linux kernel:
- sys_write() /* fs/read_write.c */
- sock_write() /* net/socket.c */
- inet_sendmsg() /* net/ipv4/af_inet.c */
- tcp_send_skb() /* net/ipv4/tcp_output.c */
- ip_queue_xmit() /* net/ipv4/ip_output.c */
- el3_start_xmit() /* driver/net/3c509.c */
New features to add to this path: virtual_ip(), secure_tcp(), compress_net()
Conclusion of Linux
An abstraction is just a set of data structures at the kernel level:
- process: struct task_struct /* include/linux/sched.h */, struct user /* include/asm-i386/user.h */
- memory: struct vm_area_struct /* include/linux/sched.h, include/asm-i386/page.h */
- file: struct file, struct inode /* include/linux/fs.h, ext2_fs_i.h */
- file system: struct super_block /* include/linux/fs.h */
- buffer: struct buffer_head /* include/linux/fs.h */
- device driver: struct device_struct /* fs/devices.c, driver/* */
- IPC: /* include/linux/ipc.h, sem.h, msg.h, shm.h */
- TCP/IP: /* include/linux/tcp.h, ip.h */
E. p414
thrashing: ...

Working Set: 2 6 1 5 7 7 7 5 1 -> {1,2,5,6,7}
Problem: size

FIFO: p1
LFU: p2 or p3 or p7

FIFO: time
LRU: reference
LFU: freq.

E.151
ref: shm1.c shm2.c
ref: sem_lock.c
simple program
AT&T Transport Interface
http://www.rrzn.uni-hannover.de/ZentralSys/Vektor/manual/manlib/C/ni/ni01/ni000009.htm