Duarte's Internals Papers


1. Motherboard Chipsets and the Memory Map
In a motherboard the CPU's gateway to the world is the front-side bus connecting it to the northbridge. Whenever the CPU needs to read or write memory it does so via this bus. It uses some pins to transmit the physical memory address it wants to write or read, while other pins send the value to be written or receive the value being read. An Intel Core 2 QX6600 has 33 pins to transmit the physical memory address (so there are 2^33 choices of memory locations) and 64 pins to send or receive data (so data is transmitted in a 64-bit data path, or 8-byte chunks). This allows the CPU to physically address 64 gigabytes of memory (2^33 locations * 8 bytes), although most chipsets only handle up to 8 gigs of RAM.

Now comes the rub. We're used to thinking of memory only in terms of RAM, the stuff programs read from and write to all the time. And indeed most of the memory requests from the processor are routed to RAM modules by the northbridge. But not all of them. Physical memory addresses are also used for communication with assorted devices on the motherboard (this communication is called memory-mapped I/O). These devices include video cards, most PCI cards (say, a scanner or SCSI card), and also the flash memory that stores the BIOS.

When the northbridge receives a physical memory request it decides where to route it: should it go to RAM? Video card maybe? This routing is decided via the memory address map. For each region of physical memory addresses, the memory map knows the device that owns that region. The bulk of the addresses are mapped to RAM, but when they aren't the memory map tells the chipset which device should service requests for those addresses. This mapping of memory addresses away from RAM modules causes the classic hole in PC memory between 640KB and 1MB. A bigger hole arises when memory addresses are reserved for video cards and PCI devices. This is why 32-bit OSes have problems using 4 gigs of RAM. In Linux the file /proc/iomem neatly lists these address range mappings. The diagram below shows a typical memory map for the first 4 gigs of physical memory addresses in an Intel PC:


    Memory layout for the first 4 gigabytes in an Intel system.

Actual addresses and ranges depend on the specific motherboard and devices present in the computer, but most Core 2 systems are pretty close to the above. All of the brown regions are mapped away from RAM. Remember that these are physical addresses that are used on the motherboard buses. Inside the CPU (for example, in the programs we run and write), the memory addresses are logical and they must be translated by the CPU into a physical address before memory is accessed on the bus.
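On Linux you can inspect that routing table yourself; here is a minimal sketch that just dumps /proc/iomem, mentioned above (recent kernels hide the real addresses unless you run it as root):

    #include <stdio.h>

    /* Print the physical address map the kernel exposes in /proc/iomem.
     * Each line is "start-end : owner", e.g. "00000000-0009ffff : System RAM". */
    int main(void)
    {
        FILE *f = fopen("/proc/iomem", "r");
        if (!f) { perror("/proc/iomem"); return 1; }

        char line[256];
        while (fgets(line, sizeof line, f))
            fputs(line, stdout);

        fclose(f);
        return 0;
    }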

The rules for translation of logical addresses into physical addresses are complex and they depend on the mode in which the CPU is running (real mode, 32-bit protected mode, and 64-bit protected mode). Regardless of the translation mechanism, the CPU mode determines how much physical memory can be accessed. For example, if the CPU is running in 32-bit mode, then it is only capable of physically addressing 4 GB (well, there is an exception called physical address extension, but ignore it for now).


Since the top 1 GB or so of physical addresses are mapped to motherboard devices, the CPU can effectively use only ~3 GB of RAM (sometimes less; I have a Vista machine where only 2.4 GB are usable). If the CPU is in real mode, then it can only address 1 megabyte of physical RAM (this is the only mode early Intel processors were capable of). On the other hand, a CPU running in 64-bit mode can physically access 64 GB (few chipsets support that much RAM, though). In 64-bit mode it is possible to use physical addresses above the total RAM in the system to access the RAM regions that correspond to physical addresses stolen by motherboard devices. This is called reclaiming memory and it's done with help from the chipset.

That's all the memory we need for the next post, which describes the boot process from power up until the boot loader is about to jump into the kernel. If you'd like to learn more about this stuff, I highly recommend the Intel manuals. I'm big into primary sources overall, but the Intel manuals in particular are well written and accurate. Here are some:

- Datasheet for Intel G35 Chipset (http://download.intel.com/design/chipsets/datashts/31760701.pdf) documents a representative chipset for Core 2 processors. This is the main source for this post.

- Datasheet for Intel Core 2 Quad-Core Q6000 Sequence (http://download.intel.com/design/processor/datashts/31559205.pdf) is a processor datasheet. It documents each pin in the processor (there aren't that many actually, and after you group them there's really not a lot to it). Fascinating stuff, though some bits are arcane.

- The Intel Software Developer's Manuals (http://www.intel.com/products/processor/manuals/index.htm) are outstanding. Far from arcane, they explain beautifully all sorts of things about the architecture. Volumes 1 and 3A have the good stuff (don't be put off by the name, the volumes are small and you can read selectively).

- Pádraig Brady (http://www.pixelbeat.org/) suggested that I link to Ulrich Drepper's excellent paper on memory (http://people.redhat.com/drepper/cpumemory.pdf). It's great stuff. I was waiting to link to it in a post about memory, but the more the merrier.


2. How Computers Boot Up

The previous post described motherboards and the memory map in Intel computers to set the scene for the initial phases of boot. Booting is an involved, hacky, multi-stage affair; fun stuff. Here's an outline of the process:

    An outline of the boot sequence

Things start rolling when you press the power button on the computer (no! do tell!). Once the motherboard is powered up it initializes its own firmware (the chipset and other tidbits) and tries to get the CPU running. If things fail at this point (e.g., the CPU is busted or missing) then you will likely have a system that looks completely dead except for rotating fans. A few motherboards manage to emit beeps for an absent or faulty CPU, but the zombie-with-fans state is the most common scenario based on my experience. Sometimes USB or other devices can cause this to happen: unplugging all non-essential devices is a possible cure for a system that was working and suddenly appears dead like this. You can then single out the culprit device by elimination.

If all is well the CPU starts running. In a multi-processor or multi-core system one CPU is dynamically chosen to be the bootstrap processor (BSP) that runs all of the BIOS and kernel initialization code. The remaining processors, called application processors (AP) at this point, remain halted until later on when they are explicitly activated by the kernel. Intel CPUs have been evolving over the years but they're fully backwards compatible, so modern CPUs can behave like the original 1978 Intel 8086, which is exactly what they do after power up. In this primitive power-up state the processor is in real mode with memory paging disabled. This is like ancient MS-DOS where only 1 MB of memory can be addressed and any code can write to any place in memory; there's no notion of protection or privilege.

Most registers in the CPU have well-defined values after power up, including the instruction pointer (EIP) which holds the memory address for the instruction being executed by the CPU. Intel CPUs use a hack whereby even though only 1MB of memory can be addressed at power up, a hidden base address (an offset, essentially) is applied to EIP so that the first instruction executed is at address 0xFFFFFFF0 (16 bytes short of the end of 4 gigs of memory and well above one megabyte). This magical address is called the reset vector and is standard for modern Intel CPUs.
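A small worked example of how that hidden base produces the reset vector (the values are the architectural reset values documented in the Intel manuals; the code itself is only illustrative arithmetic):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        /* Architectural values after reset: the visible CS selector is 0xF000,
         * but the hidden CS base is 0xFFFF0000, not 0xF0000. */
        uint32_t hidden_cs_base = 0xFFFF0000u;
        uint16_t ip             = 0xFFF0;

        printf("first fetch at 0x%08X\n", (unsigned)(hidden_cs_base + ip));  /* 0xFFFFFFF0 */
        return 0;
    }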

The motherboard ensures that the instruction at the reset vector is a jump to the memory location mapped to the BIOS entry point. This jump implicitly clears the hidden base address present at power up. All of these memory locations have the right contents needed by the CPU thanks to the memory map kept by the chipset. They are all mapped to flash memory containing the BIOS, since at this point the RAM modules have random crap in them. An example of the relevant memory regions is shown below:

    Important memory regions during boot

The CPU then starts executing BIOS code, which initializes some of the hardware in the machine. Afterwards the BIOS kicks off the Power-On Self Test (POST), which tests various components in the computer. Lack of a working video card fails the POST and causes the BIOS to halt and emit beeps to let you know what's wrong, since messages on the screen aren't an option. A working video card takes us to a stage where the computer looks alive: manufacturer logos are printed, memory starts to be tested, angels blare their horns. Other POST failures, like a missing keyboard, lead to halts with an error message on the screen.


The POST involves a mixture of testing and initialization, including sorting out all the resources (interrupts, memory ranges, I/O ports) for PCI devices. Modern BIOSes that follow the Advanced Configuration and Power Interface (ACPI) build a number of data tables that describe the devices in the computer; these tables are later used by the kernel.

After the POST the BIOS wants to boot up an operating system, which must be found somewhere: hard drives, CD-ROM drives, floppy disks, etc. The actual order in which the BIOS seeks a boot device is user configurable. If there is no suitable boot device the BIOS halts with a complaint like "Non-System Disk or Disk Error". A dead hard drive might present with this symptom. Hopefully this doesn't happen and the BIOS finds a working disk allowing the boot to proceed.

The BIOS now reads the first 512-byte sector (sector zero) of the hard disk. This is called the Master Boot Record and it normally contains two vital components: a tiny OS-specific bootstrapping program at the start of the MBR followed by a partition table for the disk. The BIOS however does not care about any of this: it simply loads the contents of the MBR into memory location 0x7c00 and jumps to that location to start executing whatever code is in the MBR.

    Master Boot Record

The specific code in the MBR could be a Windows MBR loader, code from Linux loaders such as LILO or GRUB, or even a virus. In contrast the partition table is standardized: it is a 64-byte area with four 16-byte entries describing how the disk has been divided up (so you can run multiple operating systems or have separate volumes in the same disk). Traditionally Microsoft MBR code takes a look at the partition table, finds the (only) partition marked as active, loads the boot sector for that partition, and runs that code. The boot sector is the first sector of a partition, as opposed to the first sector for the whole disk. If something is wrong with the partition table you would get messages like "Invalid Partition Table" or "Missing Operating System". This message does not come from the BIOS but rather from the MBR code loaded from disk. Thus the specific message depends on the MBR flavor.
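The 512-byte layout described above maps naturally onto a packed struct; the sketch below is my own illustration of it (field names are made up, not taken from any standard header):

    #include <stdio.h>
    #include <stdint.h>

    /* Classic MBR layout: 446 bytes of bootstrap code, four 16-byte partition
     * entries, and the 0xAA55 signature, adding up to one 512-byte sector. */
    #pragma pack(push, 1)
    struct mbr_partition_entry {
        uint8_t  status;          /* 0x80 = active (bootable), 0x00 = inactive */
        uint8_t  chs_first[3];    /* CHS address of first sector (legacy)      */
        uint8_t  type;            /* partition type, e.g. 0x83 for Linux       */
        uint8_t  chs_last[3];     /* CHS address of last sector (legacy)       */
        uint32_t lba_first;       /* LBA of first sector                       */
        uint32_t sector_count;    /* number of sectors in the partition        */
    };

    struct mbr {
        uint8_t  bootstrap[446];                /* loader code, loaded at 0x7c00 */
        struct mbr_partition_entry part[4];     /* the 64-byte partition table   */
        uint16_t signature;                     /* 0xAA55                        */
    };
    #pragma pack(pop)

    _Static_assert(sizeof(struct mbr) == 512, "MBR must be one 512-byte sector");

    int main(void)
    {
        printf("MBR is %zu bytes\n", sizeof(struct mbr));
        return 0;
    }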

Boot loading has gotten more sophisticated and flexible over time. The Linux boot loaders LILO and GRUB can handle a wide variety of operating systems, file systems, and boot configurations. Their MBR code does not necessarily follow the "boot the active partition" approach described above. But functionally the process goes like this:


1. The MBR itself contains the first stage of the boot loader. GRUB calls this stage 1.

2. Due to its tiny size, the code in the MBR does just enough to load another sector from disk that contains additional bootstrap code. This sector might be the boot sector for a partition, but could also be a sector that was hard-coded into the MBR code when the MBR was installed.

3. The MBR code plus the code loaded in step 2 then read a file containing the second stage of the boot loader. In GRUB this is GRUB Stage 2, and in Windows Server this is c:\NTLDR. If step 2 fails in Windows you'd get a message like "NTLDR is missing". The stage 2 code then reads a boot configuration file (e.g., grub.conf in GRUB, boot.ini in Windows). It then presents boot choices to the user or simply goes ahead in a single-boot system.

4. At this point the boot loader code needs to fire up a kernel. It must know enough about file systems to read the kernel from the boot partition. In Linux this means reading a file like vmlinuz-2.6.22-14-server containing the kernel, loading the file into memory and jumping to the kernel bootstrap code. In Windows Server 2003 some of the kernel start-up code is separate from the kernel image itself and is actually embedded into NTLDR. After performing several initializations, NTLDR loads the kernel image from file c:\Windows\System32\ntoskrnl.exe and, just as GRUB does, jumps to the kernel entry point.

There's a complication worth mentioning (aka, I told you this thing is hacky). The image for a current Linux kernel, even compressed, does not fit into the 640K of RAM available in real mode. My vanilla Ubuntu kernel is 1.7 MB compressed. Yet the boot loader must run in real mode in order to call the BIOS routines for reading from the disk, since the kernel is clearly not available at that point. The solution is the venerable unreal mode. This is not a true processor mode (I wish the engineers at Intel were allowed to have fun like that), but rather a technique where a program switches back and forth between real mode and protected mode in order to access memory above 1MB while still using the BIOS. If you read GRUB source code, you'll see these transitions all over the place (look under stage2/ for calls to real_to_prot and prot_to_real). At the end of this sticky process the loader has stuffed the kernel in memory, by hook or by crook, but it leaves the processor in real mode when it's done.

We're now at the jump from "Boot Loader" to "Early Kernel Initialization" as shown in the first diagram. That's when things heat up as the kernel starts to unfold and set things in motion. The next post will be a guided tour through the Linux kernel initialization with links to sources at the Linux Cross Reference. I can't do the same for Windows but I'll point out the highlights.

    [Update: cleared up discussion of NTLDR.]


3. The Kernel Boot Process

The previous post explained how computers boot up right up to the point where the boot loader, after stuffing the kernel image into memory, is about to jump into the kernel entry point. This last post about booting takes a look at the guts of the kernel to see how an operating system starts life. Since I have an empirical bent I'll link heavily to the sources for Linux kernel 2.6.25.6 at the Linux Cross Reference. The sources are very readable if you are familiar with C-like syntax; even if you miss some details you can get the gist of what's happening. The main obstacle is the lack of context around some of the code, such as when or why it runs or the underlying features of the machine. I hope to provide a bit of that context. Due to brevity (hah!) a lot of fun stuff like interrupts and memory gets only a nod for now. The post ends with the highlights for the Windows boot.

At this point in the Intel x86 boot story the processor is running in real mode, is able to address 1 MB of memory, and RAM looks like this for a modern Linux system:

    RAM contents after boot loader is done

The kernel image has been loaded to memory by the boot loader using the BIOS disk I/O services. This image is an exact copy of the file in your hard drive that contains the kernel, e.g. /boot/vmlinuz-2.6.22-14-server. The image is split into two pieces: a small part containing the real-mode kernel code is loaded below the 640K barrier; the bulk of the kernel, which runs in protected mode, is loaded after the first megabyte of memory.

The action starts in the real-mode kernel header pictured above. This region of memory is used to implement the Linux boot protocol between the boot loader and the kernel. Some of the values there are read by the boot loader while doing its work. These include amenities such as a human-readable string containing the kernel version, but also crucial information like the size of the real-mode kernel piece. The boot loader also writes values to this region, such as the memory address for the command-line parameters given by the user in the boot menu. Once the boot loader is finished it has filled in all of the parameters required by the kernel header. It's then time to jump into the kernel entry point. The diagram below shows the code sequence for the kernel initialization, along with source directories, files, and line numbers:

    Architecture-specific Linux Kernel Initialization

The early kernel start-up for the Intel architecture is in file arch/x86/boot/header.S. It's in assembly language, which is rare for the kernel at large but common for boot code. The start of this file actually contains boot sector code, a leftover from the days when Linux could work without a boot loader. Nowadays this boot sector, if executed, only prints a bugger_off_msg to the user and reboots. Modern boot loaders ignore this legacy code.

After the boot sector code we have the first 15 bytes of the real-mode kernel header; these two pieces together add up to 512 bytes, the size of a typical disk sector on Intel hardware. After these 512 bytes, at offset 0x200, we find the very first instruction that runs as part of the Linux kernel: the real-mode entry point. It's in header.S:110 and it is a 2-byte jump written directly in machine code as 0x3aeb. You can verify this by running hexdump on your kernel image and seeing the bytes at that offset, just a sanity check to make sure it's not all a dream. The boot loader jumps into this location when it is finished, which in turn jumps to header.S:229 where we have a regular assembly routine called start_of_setup. This short routine sets up a stack, zeroes the bss segment (the area that contains static variables, so they start with zero values) for the real-mode kernel and then jumps to good old C code at arch/x86/boot/main.c:122.
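If you want to do that hexdump sanity check programmatically, a small sketch like the one below reads the bytes at offset 0x200 of a kernel image along with the "HdrS" boot-protocol magic that follows them (the exact jump displacement byte, 0x3a here, varies between kernel versions):

    #include <stdio.h>
    #include <stdint.h>

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s /boot/vmlinuz-...\n", argv[0]);
            return 1;
        }

        FILE *f = fopen(argv[1], "rb");
        if (!f) { perror(argv[1]); return 1; }

        uint8_t buf[8];
        fseek(f, 0x200, SEEK_SET);                 /* 512 bytes in: real-mode entry point */
        if (fread(buf, 1, sizeof buf, f) != sizeof buf) {
            fprintf(stderr, "short read\n");
            fclose(f);
            return 1;
        }
        fclose(f);

        /* buf[0] should be 0xEB (short jmp); buf[1] is the jump displacement,
         * 0x3A in the 2.6.25.6 kernel discussed here.  buf[2..5] should spell
         * the boot-protocol magic "HdrS". */
        printf("entry: %02x %02x  magic: %.4s\n", buf[0], buf[1], (char *)&buf[2]);
        return 0;
    }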


main() does some housekeeping like detecting memory layout, setting a video mode, etc. It then calls go_to_protected_mode(). Before the CPU can be set to protected mode, however, a few tasks must be done. There are two main issues: interrupts and memory. In real mode the interrupt vector table for the processor is always at memory address 0, whereas in protected mode the location of the interrupt vector table is stored in a CPU register called IDTR. Meanwhile, the translation of logical memory addresses (the ones programs manipulate) to linear memory addresses (a raw number from 0 to the top of the memory) is different between real mode and protected mode. Protected mode requires a register called GDTR to be loaded with the address of a Global Descriptor Table for memory. So go_to_protected_mode() calls setup_idt() and setup_gdt() to install a temporary interrupt descriptor table and global descriptor table.

We're now ready for the plunge into protected mode, which is done by protected_mode_jump, another assembly routine. This routine enables protected mode by setting the PE bit in the CR0 CPU register. At this point we're running with paging disabled; paging is an optional feature of the processor, even in protected mode, and there's no need for it yet. What's important is that we're no longer confined to the 640K barrier and can now address up to 4GB of RAM. The routine then calls the 32-bit kernel entry point, which is startup_32 for compressed kernels. This routine does some basic register initializations and calls decompress_kernel(), a C function to do the actual decompression.

decompress_kernel() prints the familiar "Decompressing Linux" message. Decompression happens in-place and once it's finished the uncompressed kernel image has overwritten the compressed one pictured in the first diagram. Hence the uncompressed contents also start at 1MB. decompress_kernel() then prints "done." and the comforting "Booting the kernel." By "Booting" it means a jump to the final entry point in this whole story, given to Linus by God himself atop Mountain Halti, which is the protected-mode kernel entry point at the start of the second megabyte of RAM (0x100000). That sacred location contains a routine called, uh, startup_32. But this one is in a different directory, you see.

The second incarnation of startup_32 is also an assembly routine, but it contains 32-bit mode initializations. It clears the bss segment for the protected-mode kernel (which is the true kernel that will now run until the machine reboots or shuts down), sets up the final global descriptor table for memory, builds page tables so that paging can be turned on, enables paging, initializes a stack, creates the final interrupt descriptor table, and finally jumps to the architecture-independent kernel start-up, start_kernel(). The diagram below shows the code flow for the last leg of the boot:


    Architecture-independent Linux Kernel Initialization

start_kernel() looks more like typical kernel code, which is nearly all C and machine independent. The function is a long list of calls to initializations of the various kernel subsystems and data structures. These include the scheduler, memory zones, time keeping, and so on. start_kernel() then calls rest_init(), at which point things are almost all working. rest_init() creates a kernel thread passing another function, kernel_init(), as the entry point. rest_init() then calls schedule() to kickstart task scheduling and goes to sleep by calling cpu_idle(), which is the idle thread for the Linux kernel. cpu_idle() runs forever and so does process zero, which hosts it. Whenever there is work to do (a runnable process) process zero gets booted out of the CPU, only to return when no runnable processes are available.

But here's the kicker for us. This idle loop is the end of the long thread we followed since boot, it's the final descendant of the very first jump executed by the processor after power up. All of this mess, from reset vector to BIOS to MBR to boot loader to real-mode kernel to protected-mode kernel, all of it leads right here, jump by jump by jump it ends in the idle loop for the boot processor, cpu_idle(). Which is really kind of cool. However, this can't be the whole story otherwise the computer would do no work.

At this point, the kernel thread started previously is ready to kick in, displacing process 0 and its idle thread. And so it does, at which point kernel_init() starts running since it was given as the thread entry point. kernel_init() is responsible for initializing the remaining CPUs in the system, which have been halted since boot. All of the code we've seen so far has been executed in a single CPU, called the boot processor. As the other CPUs, called application processors, are started they come up in real mode and must run through several initializations as well. Many of the code paths are common, as you can see in the code for startup_32, but there are slight forks taken by the late-coming application processors.

Finally, kernel_init() calls init_post(), which tries to execute a user-mode process in the following order: /sbin/init, /etc/init, /bin/init, and /bin/sh. If all fail, the kernel will panic. Luckily init is usually there, and starts running as PID 1. It checks its configuration file to figure out which processes to launch, which might include X11 Windows, programs for logging in on the console, network daemons, and so on. Thus ends the boot process as yet another Linux box starts running somewhere. May your uptime be long and untroubled.
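The fallback logic in init_post() boils down to trying each candidate with an exec until one sticks; here is a user-space sketch of that pattern (it is not the kernel's actual code, which uses its own helpers, but the control flow is the same):

    #include <stdio.h>
    #include <unistd.h>

    /* Try each candidate init in order; execv only returns on failure. */
    int main(void)
    {
        char *const argv_init[] = { "init", NULL };
        const char *candidates[] = { "/sbin/init", "/etc/init", "/bin/init", "/bin/sh" };

        for (unsigned i = 0; i < sizeof candidates / sizeof candidates[0]; i++)
            execv(candidates[i], argv_init);    /* replaces this process on success */

        /* The kernel panics here; a user-space sketch can only complain. */
        fprintf(stderr, "No init found.\n");
        return 1;
    }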

The process for Windows is similar in many ways, given the common architecture. Many of the same problems are faced and similar initializations must be done. When it comes to boot one of the biggest differences is that Windows packs all of the real-mode kernel code, and some of the initial protected-mode code, into the boot loader itself (C:\NTLDR). So instead of having two regions in the same kernel image, Windows uses different binary images. Plus Linux completely separates boot loader and kernel; in a way this automatically falls out of the open source process. The diagram below shows the main bits for the Windows kernel:

Windows Kernel Initialization

The Windows user-mode start-up is naturally very different. There's no /sbin/init, but rather Csrss.exe and Winlogon.exe. Winlogon spawns Services.exe, which starts all of the Windows Services, and Lsass.exe, the local security authentication subsystem. The classic Windows login dialog runs in the context of Winlogon.

This is the end of this boot series. Thanks everyone for reading and for feedback. I'm sorry some things got superficial treatment; I've gotta start somewhere and only so much fits into blog-sized bites. But nothing like a day after the next; my plan is to do regular Software Illustrated posts like this series along with other topics. Meanwhile, here are some resources:

- The best, most important resource is source code for real kernels, either Linux or one of the BSDs.

- Intel publishes excellent Software Developer's Manuals (http://www.intel.com/products/processor/manuals/index.htm), which you can download for free.

- Understanding the Linux Kernel is a good book and walks through a lot of the Linux kernel sources. It's getting outdated and it's dry, but I'd still recommend it to anyone who wants to grok the kernel. Linux Device Drivers is more fun, teaches well, but is limited in scope. Finally, Patrick Moroney suggested Linux Kernel Development by Robert Love in the comments for this post. I've heard other positive reviews for that book, so it sounds worth checking out.


- For Windows, the best reference by far is Windows Internals by David Solomon and Mark Russinovich, the latter of Sysinternals fame. This is a great book, well-written and thorough. The main downside is the lack of source code.

[Update: In a comment below, Nix covered a lot of ground on the initial root file system that I glossed over. Thanks to Marius Barbu for catching a mistake where I wrote "CR3" instead of GDTR.]


4. Memory Translation and Segmentation

This post is the first in a series about memory and protection in Intel-compatible (x86) computers, going further down the path of how kernels work. As in the boot series, I'll link to Linux kernel sources but give Windows examples as well (sorry, I'm ignorant about the BSDs and the Mac, but most of the discussion applies). Let me know what I screw up.

In the chipsets that power Intel motherboards, memory is accessed by the CPU via the front side bus, which connects it to the northbridge chip. The memory addresses exchanged in the front side bus are physical memory addresses, raw numbers from zero to the top of the available physical memory. These numbers are mapped to physical RAM sticks by the northbridge. Physical addresses are concrete and final: no translation, no paging, no privilege checks. You put them on the bus and that's that. Within the CPU, however, programs use logical memory addresses, which must be translated into physical addresses before memory access can take place. Conceptually address translation looks like this:

    Memory address translation in x86 CPUs with paging enabled

This is not a physical diagram, only a depiction of the address translation process, specifically for when the CPU has paging enabled. If you turn off paging, the output from the segmentation unit is already a physical address; in 16-bit real mode that is always the case. Translation starts when the CPU executes an instruction that refers to a memory address. The first step is translating that logical address into a linear address. But why go through this step instead of having software use linear (or physical) addresses directly? For roughly the same reason humans have an appendix whose primary function is getting infected. It's a wrinkle of evolution. To really make sense of x86 segmentation we need to go back to 1978.

The original 8086 had 16-bit registers and its instructions used mostly 8-bit or 16-bit operands. This allowed code to work with 2^16 bytes, or 64K of memory, yet Intel engineers were keen on letting the CPU use more memory without expanding the size of registers and instructions. So they introduced segment registers as a means to tell the CPU which 64K chunk of memory a program's instructions were going to work on. It was a reasonable solution: first you load a segment register, effectively saying "here, I want to work on the memory chunk starting at X"; afterwards, 16-bit memory addresses used by your code are interpreted as offsets into your chunk, or segment. There were four segment registers: one for the stack (ss), one for program code (cs), and two for data (ds, es). Most programs were small enough back then to fit their whole stack, code, and data each in a 64K segment, so segmentation was often transparent.

Nowadays segmentation is still present and is always enabled in x86 processors. Each instruction that touches memory implicitly uses a segment register. For example, a jump instruction uses the code segment register (cs) whereas a stack push instruction uses the stack segment register (ss). In most cases you can explicitly override the segment register used by an instruction. Segment registers store 16-bit segment selectors; they can be loaded directly with instructions like MOV. The sole exception is cs, which can only be changed by instructions that affect the flow of execution, like CALL or JMP. Though segmentation is always on, it works differently in real mode versus protected mode.

In real mode, such as during early boot, the segment selector is a 16-bit number specifying the physical memory address for the start of a segment. This number must somehow be scaled, otherwise it would also be limited to 64K, defeating the purpose of segmentation. For example, the CPU could use the segment selector as the 16 most significant bits of the physical memory address (by shifting it 16 bits to the left, which is equivalent to multiplying by 2^16). This simple rule would enable segments to address 4 gigs of memory in 64K chunks. Sadly Intel made a bizarre decision to multiply the segment selector by only 2^4 (or 16), which in a single stroke confined memory to about 1MB and unduly complicated translation. Here's an example showing a jump instruction where cs contains 0x1000:

Real mode segmentation

Real mode segment starts range from 0 all the way to 0xFFFF0 (16 bytes short of 1 MB) in 16-byte increments. To these values you add a 16-bit offset (the logical address) between 0 and 0xFFFF. It follows that there are multiple segment/offset combinations pointing to the same memory location, and physical addresses fall above 1MB if your segment is high enough (see the infamous A20 line). Also, when writing C code in real mode a far pointer is a pointer that contains both the segment selector and the logical address, which allows it to address 1MB of memory. Far indeed. As programs started getting bigger and outgrowing 64K segments, segmentation and its strange ways complicated development for the x86 platform. This may all sound quaintly odd now but it has driven programmers into the wretched depths of madness.
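The real-mode translation just described is simply segment * 16 + offset; a tiny sketch of the arithmetic, reusing the cs = 0x1000 example (the offsets are made up for illustration):

    #include <stdio.h>
    #include <stdint.h>

    /* Real-mode address translation: physical = segment * 16 + offset. */
    static uint32_t real_mode_linear(uint16_t segment, uint16_t offset)
    {
        return ((uint32_t)segment << 4) + offset;
    }

    int main(void)
    {
        /* cs = 0x1000, a 16-bit offset of 0x5678. */
        printf("0x1000:0x5678 -> 0x%05X\n", real_mode_linear(0x1000, 0x5678));

        /* Two different segment:offset pairs can name the same byte... */
        printf("0x1567:0x0008 -> 0x%05X\n", real_mode_linear(0x1567, 0x0008));

        /* ...and the highest combination lands just above 1 MB (the A20 issue). */
        printf("0xFFFF:0xFFFF -> 0x%06X\n", real_mode_linear(0xFFFF, 0xFFFF));
        return 0;
    }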

In 32-bit protected mode, a segment selector is no longer a raw number, but instead it contains an index into a table of segment descriptors. The table is simply an array containing 8-byte records, where each record describes one segment and looks thus:

    Segment descriptor

There are three types of segments: code, data, and system. For brevity, only the common features in the descriptor are shown here. The base address is a 32-bit linear address pointing to the beginning of the segment, while the limit specifies how big the segment is. Adding the base address to a logical memory address yields a linear address. DPL is the descriptor privilege level; it is a number from 0 (most privileged, kernel mode) to 3 (least privileged, user mode) that controls access to the segment.

These segment descriptors are stored in two tables: the Global Descriptor Table (GDT) and the Local Descriptor Table (LDT). Each CPU (or core) in a computer contains a register called gdtr which stores the linear memory address of the first byte in the GDT. To choose a segment, you must load a segment register with a segment selector in the following format:

    Segment Selector

The TI bit is 0 for the GDT and 1 for the LDT, while the index specifies the desired segment descriptor within the table. We'll deal with RPL, Requested Privilege Level, later on. Now, come to think of it, when the CPU is in 32-bit mode registers and instructions can address the entire linear address space anyway, so there's really no need to give them a push with a base address or other shenanigan. So why not set the base address to zero and let logical addresses coincide with linear addresses? Intel docs call this "flat model" and it's exactly what modern x86 kernels do (they use the basic flat model, specifically). Basic flat model is equivalent to disabling segmentation when it comes to translating memory addresses. So in all its glory, here's the jump example running in 32-bit protected mode, with real-world values for a Linux user-mode app:


    Protected Mode Segmentation
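Since a selector is just 13 bits of index, a table indicator, and an RPL, it is easy to pick apart; the sketch below decodes two values that, on a 32-bit Linux kernel of this era, typically correspond to the user and kernel code segments (per segment.h; treat the specific values as an assumption):

    #include <stdio.h>
    #include <stdint.h>

    /* Decode a 16-bit segment selector: bits 15..3 index, bit 2 TI, bits 1..0 RPL. */
    static void decode_selector(uint16_t sel)
    {
        unsigned index = sel >> 3;
        unsigned ti    = (sel >> 2) & 1;      /* 0 = GDT, 1 = LDT */
        unsigned rpl   = sel & 3;             /* requested privilege level */

        printf("selector 0x%04x: index=%u table=%s rpl=%u\n",
               sel, index, ti ? "LDT" : "GDT", rpl);
    }

    int main(void)
    {
        decode_selector(0x73);   /* index 14, GDT, RPL 3: typically the user code segment   */
        decode_selector(0x60);   /* index 12, GDT, RPL 0: typically the kernel code segment */
        return 0;
    }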

The contents of a segment descriptor are cached once they are accessed, so there's no need to actually read the GDT in subsequent accesses, which would kill performance. Each segment register has a hidden part to store the cached descriptor that corresponds to its segment selector. For more details, including more info on the LDT, see chapter 3 of the Intel System Programming Guide Volume 3A. Volumes 2A and 2B, which cover every x86 instruction, also shed light on the various types of x86 addressing operands: 16-bit, 16-bit with segment selector (which can be used by far pointers), 32-bit, etc.

In Linux, only 3 segment descriptors are used during boot. They are defined with the GDT_ENTRY macro and stored in the boot_gdt array. Two of the segments are flat, addressing the entire 32-bit space: a code segment loaded into cs and a data segment loaded into the other segment registers. The third segment is a system segment called the Task State Segment. After boot, each CPU has its own copy of the GDT. They are all nearly identical, but a few entries change depending on the running process. You can see the layout of the Linux GDT in segment.h and its instantiation in arch/x86/kernel/cpu/common.c. There are four primary GDT entries: two flat ones for code and data in kernel mode, and another two for user mode. When looking at the Linux GDT, notice the holes inserted on purpose to align data with CPU cache lines, an artifact of the von Neumann bottleneck that has become a plague. Finally, the classic "Segmentation fault" Unix error message is not due to x86-style segments, but rather invalid memory addresses normally detected by the paging unit; alas, a topic for an upcoming post.

Intel deftly worked around their original segmentation kludge, offering a flexible way for us to choose whether to segment or go flat. Since coinciding logical and linear addresses are simpler to handle, they became standard, such that 64-bit mode now enforces a flat linear address space. But even in flat mode segments are still crucial for x86 protection, the mechanism that defends the kernel from user-mode processes and every process from each other. It's a dog eat dog world out there! In the next post, we'll take a peek at protection levels and how segments implement them.


5. CPU Rings, Privilege, and Protection

You probably know intuitively that applications have limited powers in Intel x86 computers and that only operating system code can perform certain tasks, but do you know how this really works? This post takes a look at x86 privilege levels, the mechanism whereby the OS and CPU conspire to restrict what user-mode programs can do. There are four privilege levels, numbered 0 (most privileged) to 3 (least privileged), and three main resources being protected: memory, I/O ports, and the ability to execute certain machine instructions. At any given time, an x86 CPU is running in a specific privilege level, which determines what code can and cannot do. These privilege levels are often described as protection rings, with the innermost ring corresponding to highest privilege. Most modern x86 kernels use only two privilege levels, 0 and 3:

    x86 Protection Rings

About 15 machine instructions, out of dozens, are restricted by the CPU to ring zero. Many others have limitations on their operands. These instructions can subvert the protection mechanism or otherwise foment chaos if allowed in user mode, so they are reserved to the kernel. An attempt to run them outside of ring zero causes a general-protection exception, like when a program uses invalid memory addresses. Likewise, access to memory and I/O ports is restricted based on privilege level. But before we look at protection mechanisms, let's see exactly how the CPU keeps track of the current privilege level, which involves the segment selectors from the previous post. Here they are:


Segment Selectors: Data and Code

The full contents of data segment selectors are loaded directly by code into various segment registers such as ss (stack segment register) and ds (data segment register). This includes the contents of the Requested Privilege Level (RPL) field, whose meaning we tackle in a bit. The code segment register (cs) is, however, magical. First, its contents cannot be set directly by load instructions such as mov, but rather only by instructions that alter the flow of program execution, like call. Second, and importantly for us, instead of an RPL field that can be set by code, cs has a Current Privilege Level (CPL) field maintained by the CPU itself. This 2-bit CPL field in the code segment register is always equal to the CPU's current privilege level. The Intel docs wobble a little on this fact, and sometimes online documents confuse the issue, but that's the hard and fast rule. At any time, no matter what's going on in the CPU, a look at the CPL in cs will tell you the privilege level code is running with.
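You can see this for yourself: reading cs is allowed from any ring, so a user-mode program can print its own CPL. A small sketch for x86 Linux with GCC, which should print 3:

    #include <stdio.h>

    int main(void)
    {
        unsigned short cs;

        /* Reading cs is allowed at any privilege level; writing it is not. */
        __asm__ volatile ("mov %%cs, %0" : "=r" (cs));

        printf("cs = 0x%04x, CPL = %d\n", cs, cs & 3);   /* CPL is the low 2 bits */
        return 0;
    }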

Keep in mind that the CPU privilege level has nothing to do with operating system users. Whether you're root, Administrator, guest, or a regular user, it does not matter. All user code runs in ring 3 and all kernel code runs in ring 0, regardless of the OS user on whose behalf the code operates. Sometimes certain kernel tasks can be pushed to user mode, for example user-mode device drivers in Windows Vista, but these are just special processes doing a job for the kernel and can usually be killed without major consequences.

Due to restricted access to memory and I/O ports, user mode can do almost nothing to the outside world without calling on the kernel. It can't open files, send network packets, print to the screen, or allocate memory. User processes run in a severely limited sandbox set up by the gods of ring zero. That's why it's impossible, by design, for a process to leak memory beyond its existence or leave open files after it exits. All of the data structures that control such things (memory, open files, etc.) cannot be touched directly by user code; once a process finishes, the sandbox is torn down by the kernel. That's why our servers can have 600 days of uptime: as long as the hardware and the kernel don't crap out, stuff can run forever. This is also why Windows 95 / 98 crashed so much: it's not because "M$ sucks" but because important data structures were left accessible to user mode for compatibility reasons. It was probably a good trade-off at the time, albeit at high cost.

    The CPU protects memory at two crucial points: when a segment selector is loaded and when a page of memory is accessed with a linear address. Protection thus mirrors memory address translation, where both segmentation and paging are involved. When a data segment selector is being loaded, the check below takes place:


    x86 Segment Protection

    Since a higher number means less privilege, MAX() above picks the least privileged of CPL and RPL, and compares it to the descriptor privilege level (DPL). If the DPL is higher or equal, then access is allowed. The idea behind RPL is to allow kernel code to load a segment using lowered privilege. For example, you could use an RPL of 3 to ensure that a given operation uses segments accessible to user mode. The exception is the stack segment register ss, for which all three of CPL, RPL, and DPL must match exactly.
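
    To make the rule concrete, here is a minimal C sketch of the check (illustrative pseudologic only; cpl, rpl, and dpl are just the values from the diagram, not a real API):

        /* Sketch of the data segment load check described above. */
        #include <stdbool.h>

        #define MAX(a, b) ((a) > (b) ? (a) : (b))

        /* Returns true if loading a data segment selector succeeds,
         * false if the CPU would raise a general-protection exception. */
        bool data_segment_load_ok(unsigned cpl, unsigned rpl, unsigned dpl)
        {
            /* Higher number = less privilege; the effective privilege is the
             * least privileged (numerically largest) of CPL and RPL. */
            return dpl >= MAX(cpl, rpl);
        }

        /* The ss register is stricter: CPL, RPL, and DPL must all match. */
        bool stack_segment_load_ok(unsigned cpl, unsigned rpl, unsigned dpl)
        {
            return cpl == rpl && rpl == dpl;
        }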

    In truth, segment protection scarcely matters because modern kernels use a flat address space where the user-mode segments can reach the entire linear address space. Useful memory protection is done in the paging unit when a linear address is converted into a physical address. Each memory page is a block of bytes described by a page table entry containing two fields related to protection: a supervisor flag and a read/write flag. The supervisor flag is the primary x86 memory protection mechanism used by kernels. When it is on, the page cannot be accessed from ring 3. While the read/write flag isn't as important for enforcing privilege, it's still useful. When a process is loaded, pages storing binary images (code) are marked as read only, thereby catching some pointer errors if a program attempts to write to these pages. This flag is also used to implement copy on write when a process is forked in Unix. Upon forking, the parent's pages are marked read only and shared with the forked child. If either process attempts to write to the page, the processor triggers a fault and the kernel knows to duplicate the page and mark it read/write for the writing process.
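
    A rough sketch of these page-level checks, with made-up flag names (real x86 page table entries pack these bits into a fixed hardware layout, and the fault paths are more involved):

        /* Rough sketch of the page-level protection described above. */
        #include <stdbool.h>

        struct pte {
            bool present;   /* page is mapped                         */
            bool writable;  /* read/write flag                        */
            bool user;      /* supervisor flag: true = user allowed   */
        };

        /* Returns true if the access is allowed, false if the CPU would
         * raise a page fault for the kernel to sort out (for example a
         * copy-on-write page, or a plain bug). */
        bool page_access_ok(const struct pte *pte, unsigned cpl, bool is_write)
        {
            if (!pte->present)
                return false;            /* not mapped: page fault            */
            if (cpl == 3 && !pte->user)
                return false;            /* ring 3 touching a kernel-only page */
            if (is_write && !pte->writable)
                return false;            /* write to a read-only page          */
            return true;
        }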

    Finally, we need a way for the CPU to switch between privilege levels. If ring 3 code could transfer control to arbitrary spots in the kernel, it would be easy to subvert the operating system by jumping into the wrong (right?) places. A controlled transfer is necessary. This is accomplished via gate descriptors and via the sysenter instruction. A gate descriptor is a segment descriptor of type system, and comes in four sub-types: call-gate descriptor, interrupt-gate descriptor, trap-gate descriptor, and task-gate descriptor. Call gates provide a kernel entry point that can be used with ordinary call and jmp instructions, but they aren't used much so I'll ignore them. Task gates aren't so hot either (in Linux, they are only used in double faults, which are caused by either kernel or hardware problems).


    That leaves two juicier ones: interrupt and trap gates, which are used to handle hardware interrupts (e.g., keyboard, timer, disks) and exceptions (e.g., page faults, divide by zero). I'll refer to both as an interrupt. These gate descriptors are stored in the Interrupt Descriptor Table (IDT). Each interrupt is assigned a number between 0 and 255 called a vector, which the processor uses as an index into the IDT when figuring out which gate descriptor to use when handling the interrupt. Interrupt and trap gates are nearly identical. Their format is shown below along with the privilege checks enforced when an interrupt happens. I filled in some values for the Linux kernel to make things concrete.

    Interrupt Descriptor with Privilege Check

    Both the DPL and the segment selector in the gate regulate access, while segment selector plus offset together nail down an entry point for the interrupt handler code. Kernels normally use the segment selector for the kernel code segment in these gate descriptors. An interrupt can never transfer control from a more-privileged to a less-privileged ring. Privilege must either stay the same (when the kernel itself is interrupted) or be elevated (when user-mode code is interrupted). In either case, the resulting CPL will be equal to the DPL of the destination code segment; if the CPL changes, a stack switch also occurs. If an interrupt is triggered by code via an instruction like int n, one more check takes place: the gate DPL must be at the same or lower privilege as the CPL. This prevents user code from triggering random interrupts. If these checks fail then, you guessed it, a general-protection exception happens. All Linux interrupt handlers end up running in ring zero.
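
    A minimal sketch of those rules, with illustrative names (the real checks involve the full gate and segment descriptors):

        /* Sketch of the privilege rules for delivering an interrupt through
         * an interrupt/trap gate. Field names are illustrative, not an API. */
        #include <stdbool.h>

        struct gate {
            unsigned dpl;           /* gate DPL, e.g. 0 or 3                 */
            unsigned dest_code_dpl; /* DPL of the target code segment,
                                       normally 0 (kernel code)              */
        };

        /* software_int is true for "int n" issued by code, false for hardware
         * interrupts and CPU exceptions. Returns the new CPL, or -1 if the
         * CPU would raise a general-protection exception instead. */
        int deliver_interrupt(const struct gate *g, unsigned cpl, bool software_int)
        {
            /* int n only: the gate DPL must be at the same or lower privilege
             * (numerically >=) than the CPL. */
            if (software_int && g->dpl < cpl)
                return -1;

            /* Control never moves to a less privileged ring. */
            if (g->dest_code_dpl > cpl)
                return -1;

            /* The resulting CPL equals the destination code segment's DPL;
             * if it differs from the old CPL, a stack switch also happens. */
            return (int)g->dest_code_dpl;
        }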


    During initialization, the Linux kernel first sets up an IDT in setup_idt() that ignores all interrupts. It then uses functions in include/asm-x86/desc.h to flesh out common IDT entries in arch/x86/kernel/traps_32.c. In Linux, a gate descriptor with "system" in its name is accessible from user mode and its set function uses a DPL of 3. A "system gate" is an Intel trap gate accessible to user mode. Otherwise, the terminology matches up. Hardware interrupt gates are not set here, however, but instead in the appropriate drivers.

    Three gates are accessible to user mode: vectors 3 and 4 are used for debugging and checking for numeric overflows, respectively. Then a system gate is set up for the SYSCALL_VECTOR, which is 0x80 for the x86 architecture. This was the mechanism for a process to transfer control to the kernel, to make a system call, and back in the day I applied for an "int 0x80" vanity license plate. Starting with the Pentium Pro, the sysenter instruction was introduced as a faster way to make system calls. It relies on special-purpose CPU registers that store the code segment, entry point, and other tidbits for the kernel system call handler. When sysenter is executed the CPU does no privilege checking, going immediately into CPL 0 and loading new values into the registers for code and stack (cs, eip, ss, and esp). Only ring zero can load the sysenter setup registers, which is done in enable_sep_cpu().
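
    For concreteness, here is a sketch of how a 32-bit trap gate for vector 0x80 with a DPL of 3 could be packed. The field layout follows the standard IA-32 gate descriptor format; the selector value and handler address are placeholders, not values from any particular kernel:

        /* Sketch: packing a 32-bit trap gate ("system gate" in Linux terms).
         * Layout follows the IA-32 gate descriptor format; KERNEL_CS and the
         * handler address are placeholders. */
        #include <stdint.h>

        struct idt_gate {
            uint16_t offset_low;   /* handler address bits 15:0    */
            uint16_t selector;     /* kernel code segment selector */
            uint8_t  zero;         /* unused                       */
            uint8_t  type_attr;    /* P, DPL, and gate type        */
            uint16_t offset_high;  /* handler address bits 31:16   */
        } __attribute__((packed));

        #define KERNEL_CS 0x10     /* placeholder kernel code selector */

        static struct idt_gate make_trap_gate(uint32_t handler, uint8_t dpl)
        {
            struct idt_gate g;
            g.offset_low  = handler & 0xFFFF;
            g.selector    = KERNEL_CS;
            g.zero        = 0;
            /* 0x80 = present bit, DPL in bits 6:5, 0xF = 32-bit trap gate */
            g.type_attr   = 0x80 | (dpl << 5) | 0x0F;
            g.offset_high = handler >> 16;
            return g;
        }

        /* e.g. idt[0x80] = make_trap_gate((uint32_t)syscall_handler, 3);
         * where syscall_handler is a hypothetical kernel entry point. */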

    Finally, when it's time to return to ring 3, the kernel issues an iret or sysexit instruction to return from interrupts and system calls, respectively, thus leaving ring 0 and resuming execution of user code with a CPL of 3. Vim tells me I'm approaching 1,900 words, so I/O port protection is for another day. This concludes our tour of x86 rings and protection. Thanks for reading!


    6. What Your Computer Does While You Wait

    This post takes a look at the speed (latency and throughput) of various subsystems in a modern commodity PC, an Intel Core 2 Duo at 3.0GHz. I hope to give a feel for the relative speed of each component and a cheatsheet for back-of-the-envelope performance calculations. I've tried to show real-world throughputs (the sources are posted as a comment) rather than theoretical maximums. Time units are nanoseconds (ns, 10^-9 seconds), milliseconds (ms, 10^-3 seconds), and seconds (s). Throughput units are in megabytes and gigabytes per second. Let's start with CPU and memory, the north of the northbridge:

    The first thing that jumps out is how absurdly fast our processors are. Most simple instructions on the Core 2 take one clock cycle to execute, hence a third of a nanosecond at 3.0GHz. For reference, light only travels ~4 inches (10 cm) in the time taken by a clock cycle. It's worth keeping this in mind when you're thinking of optimization: instructions are comically cheap to execute nowadays.
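
    The numbers are easy to reproduce; a rough back-of-the-envelope sketch, assuming a 3.0GHz clock and one simple instruction per cycle:

        /* Back-of-the-envelope numbers for a 3.0GHz clock. */
        #include <stdio.h>

        int main(void)
        {
            double hz = 3.0e9;                 /* 3.0GHz              */
            double cycle_ns = 1e9 / hz;        /* ~0.33 ns per cycle  */
            double c_m_per_s = 299792458.0;    /* speed of light      */
            double light_cm = c_m_per_s * (cycle_ns / 1e9) * 100.0;

            printf("cycle time: %.2f ns\n", cycle_ns);               /* ~0.33 ns */
            printf("light travels: %.1f cm per cycle\n", light_cm);  /* ~10 cm   */
            return 0;
        }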


    As the CPU works away, it must read from and write to system memory, which it accesses via the L1 and L2 caches. The caches use static RAM, a much faster (and more expensive) type of memory than the DRAM used as the main system memory. The caches are part of the processor itself and for the pricier memory we get very low latency. One way in which instruction-level optimization is still very relevant is code size. Due to caching, there can be massive performance differences between code that fits wholly into the L1/L2 caches and code that needs to be marshalled into and out of the caches as it executes.

    Normally when the CPU needs to touch the contents of a memory region they must either

    be in the L1/L2 caches already or be brought in from the main system memory. Here we

    see our first major hit, a massive ~250 cycles of latency that often leads to a stall, when

    the CPU has no work to do while it waits. To put this into perspective, reading from L1

    cache is like grabbing a piece of paper from your desk (3 seconds), L2 cache is picking up a

    book from a nearby shelf (14 seconds), and main system memory is taking a 4-minute walk

    down the hall to buy a Twix bar.

    The exact latency of main memory is variable and depends on the application and many other factors. For example, it depends on the CAS latency and specifications of the actual RAM stick that is in the computer. It also depends on how successful the processor is at prefetching (guessing which parts of memory will be needed based on the code that is executing, and having them brought into the caches ahead of time).

    Looking at L1/L2 cache performance versus main memory performance, it is clear how much there is to gain from larger L2 caches and from applications designed to use it well. For a discussion of all things memory, see Ulrich Drepper's What Every Programmer Should Know About Memory (pdf), a fine paper on the subject.

    People refer to the bottleneck between CPU and memory as the von Neumann bottleneck.

    Now, the front side bus bandwidth, ~10GB/s, actually looks decent. At that rate, you could read all of 8GB of system memory in less than one second or read 100 bytes in 10ns. Sadly this throughput is a theoretical maximum (unlike most others in the diagram) and cannot be achieved due to delays in the main RAM circuitry. Many discrete wait periods are required when accessing memory. The electrical protocol for access calls for delays after a memory row is selected, after a column is selected, before data can be read reliably, and so on. The use of capacitors calls for periodic refreshes of the data stored in memory lest some bits get corrupted, which adds further overhead. Certain consecutive memory accesses may happen more quickly but there are still delays, and more so for random access. Latency is always present.
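
    Both figures fall straight out of the bandwidth number; a quick sketch, assuming the ~10GB/s theoretical maximum:

        /* Quick check of the front side bus figures quoted above. */
        #include <stdio.h>

        int main(void)
        {
            double fsb_bytes_per_s = 10e9;   /* ~10 GB/s             */
            double ram_bytes = 8e9;          /* 8GB of system memory */

            printf("time to read 8GB: %.2f s\n",
                   ram_bytes / fsb_bytes_per_s);          /* ~0.8 s */
            printf("time to read 100 bytes: %.1f ns\n",
                   100.0 / fsb_bytes_per_s * 1e9);        /* 10 ns  */
            return 0;
        }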

    Down in the southbridge we have a number of other buses (e.g., PCIe, USB) and

    peripherals connected:


    Sadly the southbridge hosts some truly sluggish performers, for even main memory is blazing fast compared to hard drives. Keeping with the office analogy, waiting for a hard drive seek is like leaving the building to roam the earth for one year and three months. This is why so many workloads are dominated by disk I/O and why database performance can drive off a cliff once the in-memory buffers are exhausted. It is also why plentiful RAM (for buffering) and fast hard drives are so important for overall system performance.

    While the sustained disk throughput is real in the sense that it is actually achieved by the disk in real-world situations, it does not tell the whole story. The bane of disk performance is seeks, which involve moving the read/write heads across the platter to the right track and then waiting for the platter to spin around to the right position so that the desired sector can be read. Disk RPMs refer to the speed of rotation of the platters: the faster the RPMs, the less time you wait on average for the rotation to give you the desired sector, hence higher RPMs mean faster disks. A cool place to read about the impact of seeks is the paper where a couple of Stanford grad students describe the Anatomy of a Large-Scale Hypertextual Web Search Engine (pdf).
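
    As a rough sketch of why RPMs matter: the average rotational wait is half a revolution, and seek time (which varies by drive) comes on top of that:

        /* Rough sketch: average rotational latency is half a revolution.
         * Seek time is added on top of this and varies by drive. */
        #include <stdio.h>

        static double avg_rotational_latency_ms(double rpm)
        {
            double ms_per_rev = 60.0 * 1000.0 / rpm;
            return ms_per_rev / 2.0;
        }

        int main(void)
        {
            printf("7200 RPM: %.2f ms\n", avg_rotational_latency_ms(7200));   /* ~4.17 ms */
            printf("15000 RPM: %.2f ms\n", avg_rotational_latency_ms(15000)); /* ~2.00 ms */
            return 0;
        }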

    When the disk is reading one large continuous file it achieves greater sustained read speeds due to the lack of seeks. Filesystem defragmentation aims to keep files in continuous chunks on the disk to minimize seeks and boost throughput. When it comes to how fast a computer feels, sustained throughput is less important than seek times and the number of random I/O operations (reads/writes) that a disk can do per time unit. Solid state disks can make for a great option here.

    Hard drive caches also help performance. Their tiny size (a 16MB cache in a 750GB drive covers only 0.002% of the disk) suggests they're useless, but in reality their contribution is allowing a disk to queue up writes and then perform them in one bunch, thereby allowing the disk to plan the order of the writes in a way that, surprise, minimizes seeks. Reads can also be grouped in this way for performance, and both the OS and the drive firmware engage in these optimizations.

    Finally, the diagram has various real-world throughputs for networking and other buses. Firewire is shown for reference but is not available natively in the Intel X48 chipset. It's fun to think of the Internet as a computer bus. The latency to a fast website (say, google.com) is about 45ms, comparable to hard drive seek latency. In fact, while hard drives are 5 orders of magnitude removed from main memory, they're in the same order of magnitude as the Internet. Residential bandwidth still lags behind that of sustained hard drive reads, but the network is the computer in a pretty literal sense now. What happens when the Internet is faster than a hard drive?

    I hope this diagram is useful. It's fascinating for me to look at all these numbers together and see how far we've come. Sources are posted as a comment. I posted a full diagram showing both north and south bridges here if you're interested.


    7. Cache: a place for concealment and safekeeping

    This post shows briefly how CPU caches are organized in modern Intel processors. Cache discussions often lack concrete examples, obfuscating the simple concepts involved. Or maybe my pretty little head is slow. At any rate, here's half the story on how a Core 2 L1 cache is accessed:

    The unit of data in the cache is the line, which is just a contiguous chunk of bytes in

    memory. This cache uses 64-byte lines. The lines are stored in cache banks or ways, and

    each way has a dedicated directory to store its housekeeping information. You can imagine

    each way and its directory as columns in a spreadsheet, in which case the rows are the sets.

    Then each cell in the way column contains a cache line, tracked by the corresponding cell in

    the directory. This particular cache has 64 sets and 8 ways, hence 512 cells to store cache

    lines, which adds up to 32KB of space.

    In this cache's view of the world, physical memory is divided into 4KB physical pages. Each page has 4KB / 64 bytes == 64 cache lines in it. When you look at a 4KB page, bytes 0 through 63 within that page are in the first cache line, bytes 64-127 in the second cache line, and so on. The pattern repeats for each page, so the 3rd line in page 0 is different than the 3rd line in page 1.

    In a fully associative cache any line in memory can be stored in any of the cache cells.

    This makes storage flexible, but it becomes expensive to search for cells when accessing

    them. Since the L1 and L2 caches operate under tight constraints of power consumption,


    physical space, and speed, a fully associative cache is not a good trade-off in most scenarios.

    Instead, this cache is set associative, which means that a given line in memory can only be stored in one specific set (or row) shown above. So the first line of any physical page (bytes 0-63 within a page) must be stored in row 0, the second line in row 1, etc. Each row has 8 cells available to store the cache lines it is associated with, making this an 8-way associative set. When looking at a memory address, bits 11-6 determine the line number within the 4KB page and therefore the set to be used. For example, physical address 0x800010a0 has 000010 in those bits so it must be stored in set 2.

    But we still have the problem of finding which cell in the row holds the data, if any. That's where the directory comes in. Each cached line is tagged by its corresponding directory cell; the tag is simply the number for the page where the line came from. The processor can address 64GB of physical RAM, so there are 64GB / 4KB == 2^24 of these pages and thus we need 24 bits for our tag. Our example physical address 0x800010a0 corresponds to page number 524,289. Here's the second half of the story:

    Since we only need to look in one set of 8 ways, the tag matching is very fast; in fact, electrically all tags are compared simultaneously, which I tried to show with the arrows. If there's a valid cache line with a matching tag, we have a cache hit. Otherwise, the request is forwarded to the L2 cache, and failing that to main system memory. Intel builds large L2 caches by playing with the size and quantity of the ways, but the design is the same. For example, you could turn this into a 64KB cache by adding 8 more ways. Then increase the number of sets to 4096 and each way can store 256KB. These two modifications would deliver a 4MB L2 cache. In this scenario, you'd need 18 bits for the tags and 12 for the set index; the physical page size used by the cache is equal to its way size.
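
    The index/tag split can be captured in a few lines. Here is a sketch using the example L1 geometry above (64-byte lines, 64 sets) and the physical address from the text:

        /* Sketch of the index/tag split for the example L1 cache above:
         * bits 5:0 are the offset within the 64-byte line, bits 11:6 the
         * set index, and the remaining high bits the tag (the physical
         * page number). */
        #include <stdio.h>
        #include <stdint.h>

        #define LINE_BITS 6   /* 64-byte lines */
        #define SET_BITS  6   /* 64 sets       */

        int main(void)
        {
            uint64_t addr = 0x800010a0;

            uint64_t set = (addr >> LINE_BITS) & ((1 << SET_BITS) - 1);
            uint64_t tag = addr >> (LINE_BITS + SET_BITS);   /* page number */

            printf("set = %llu, tag (page) = %llu\n",
                   (unsigned long long)set, (unsigned long long)tag);
            /* prints: set = 2, tag (page) = 524289 */

            /* Addresses exactly 4KB apart land in the same set, which is
             * the source of the conflict misses discussed next. */
            uint64_t set2 = ((addr + 4096) >> LINE_BITS) & ((1 << SET_BITS) - 1);
            printf("same set for addr+4KB? %s\n", set == set2 ? "yes" : "no");
            return 0;
        }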

    If a set fills up, then a cache line must be evicted before another one can be stored. To avoid this, performance-sensitive programs try to organize their data so that memory accesses are evenly spread among cache lines. For example, suppose a program has an array of 512-byte objects such that some objects are 4KB apart in memory. Fields in these objects fall into the same lines and compete for the same cache set. If the program frequently accesses a given field (e.g., the vtable by calling a virtual method), the set will likely fill up and the cache will start thrashing as lines are repeatedly evicted and later reloaded. Our example L1 cache can only hold the vtables for 8 of these objects due to set size.


    This is the cost of the set associativity trade-off: we can get cache misses due to set conflicts even when overall cache usage is not heavy. However, due to the relative speeds in a computer, most apps don't need to worry about this anyway.

    A memory access usually starts with a linear (virtual) address, so the L1 cache relies on the paging unit to obtain the physical page address used for the cache tags. By contrast, the set index comes from the least significant bits of the linear address and is used without translation (bits 11-6 in our example). Hence the L1 cache is physically tagged but virtually indexed, helping the CPU to parallelize lookup operations. Because the L1 way is never bigger than an MMU page, a given physical memory location is guaranteed to be associated with the same set even with virtual indexing. L2 caches, on the other hand, must be physically tagged and physically indexed because their way size can be bigger than MMU pages. But then again, by the time a request gets to the L2 cache the physical address was already resolved by the L1 cache, so it works out nicely.

    Finally, a directory cell also stores the state of its corresponding cached line. A line in the L1 code cache is either Invalid or Shared (which means valid, really). In the L1 data cache and the L2 cache, a line can be in any of the 4 MESI states: Modified, Exclusive, Shared, or Invalid. Intel caches are inclusive: the contents of the L1 cache are duplicated in the L2 cache. These states will play a part in later posts about threading, locking, and that kind of stuff. Next time we'll look at the front side bus and how memory access really works. This is going to be memory week.

    Update: Dave brought up direct-mapped caches in a comment below. They're basically a special case of set-associative caches that have only one way. In the trade-off spectrum, they're the opposite of fully associative caches: blazing fast access, lots of conflict misses.


    8. Getting Physical with Memory

    When trying to understand complex systems, you can often learn a lot by stripping away abstractions and looking at their lowest levels. In that spirit we take a look at memory and I/O ports at their simplest and most fundamental level: the interface between the processor and the bus. These details underlie higher level topics like thread synchronization and the need for the Core i7. Also, since I'm a programmer I ignore things EE people care about. Here's our friend the Core 2 again:

    A Core 2 processor has 775 pins, about half of which only provide power and carry no data. Once you group the pins by functionality, the physical interface to the processor is surprisingly simple. The diagram shows the key pins involved in a memory or I/O port operation: address lines, data pins, and request pins. These operations take place in the context of a transaction on the front side bus. FSB transactions go through 5 phases: arbitration, request, snoop, response, and data. Throughout these phases, different roles are played by the components on the FSB, which are called agents. Normally the agents are all the processors plus the northbridge.


    We only look at the request phase in this post, in which 2 packets are output by the request agent, who is usually a processor. Here are the juiciest bits of the first packet, output by the address and request pins:

    The address lines output the starting physical memory address for the transaction. We have 33 bits but they are interpreted as bits 35-3 of an address in which bits 2-0 are zero. Hence we have a 36-bit address, aligned to 8 bytes, for a total of 64GB of addressable physical memory. This has been the case since the Pentium Pro. The request pins specify what type of transaction is being initiated; in I/O requests the address pins specify an I/O port rather than a memory address. After the first packet is output, the same pins transmit a second packet in the subsequent bus clock cycle:


    The attribute signals are interesting: they reflect the 5 types of memory caching behavior available in Intel processors. By putting this information on the FSB, the request agent lets other processors know how this transaction affects their caches, and how the memory controller (northbridge) should behave. The processor determines the type of a given memory region mainly by looking at page tables, which are maintained by the kernel.

    Typically kernels treat all RAM memory as write-back, which yields the best performance. In write-back mode the unit of memory access is the cache line, 64 bytes in the Core 2. If a program reads a single byte in memory, the processor loads the whole cache line that contains that byte into the L2 and L1 caches. When a program writes to memory, the processor only modifies the line in the cache, but does not update main memory. Later, when it becomes necessary to post the modified line to the bus, the whole cache line is written at once. So most requests have 11 in their length field, for 64 bytes. Here's a read example in which the data is not in the caches:

    Some of the physical memory range in an Intel computer is mapped to devices like hard drives and network cards instead of actual RAM memory. This allows drivers to communicate with their devices by writing to and reading from memory. The kernel marks these memory regions as uncacheable in the page tables. Accesses to uncacheable memory regions are reproduced in the bus exactly as requested by a program or driver. Hence it's possible to read or write single bytes, words, and so on. This is done via the byte enable mask in packet B above.
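
    A small sketch ties these ideas together: the address pins carry only bits 35-3, and (on my reading of the byte enable mask) each bit selects one of the 8 bytes in the addressed chunk for a sub-8-byte uncacheable access. The address and length below are arbitrary examples, and the sketch assumes the access does not cross an 8-byte boundary:

        /* Sketch: mapping a byte-granular access onto the 8-byte-aligned
         * FSB address (bits 35-3) plus a byte enable mask. Illustrative
         * only; the real request packets carry more fields. */
        #include <stdio.h>
        #include <stdint.h>

        int main(void)
        {
            uint64_t addr = 0xFED00042ULL;   /* arbitrary example address   */
            unsigned len  = 2;               /* a 2-byte access             */

            uint64_t bus_addr = addr & ~0x7ULL;  /* what the address pins carry */
            unsigned first    = addr & 0x7;      /* offset within the 8 bytes   */
            uint8_t  byte_enable = (uint8_t)(((1u << len) - 1) << first);

            printf("bus address: 0x%09llx, byte enables: 0x%02x\n",
                   (unsigned long long)bus_addr, byte_enable);
            return 0;
        }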

    The primitives discussed here have many implications.