Duarte's Internals Papers
TRANSCRIPT
8/6/2019
In a motherboard the CPU's gateway to the world is the front-side bus connecting it to the northbridge. Whenever the CPU needs to read or write memory it does so via this bus. It uses some pins to transmit the physical memory address it wants to write or read, while other pins send the value to be written or receive the value being read. An Intel Core 2 QX6600 has 33 pins to transmit the physical memory address (so there are 2^33 choices of memory locations) and 64 pins to send or receive data (so data is transmitted in a 64-bit data path, or 8-byte chunks). This allows the CPU to physically address 64 gigabytes of memory (2^33 locations * 8 bytes), although most chipsets only handle up to 8 gigs of RAM.
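The pin arithmetic above can be sanity-checked in a few lines of C; addressable_bytes is just an illustrative helper, not anything from the datasheet:

```c
#include <stdint.h>

/* 33 address pins select one of 2^33 memory locations; 64 data pins
 * move 8 bytes per transfer, so the ceiling is 2^33 * 8 = 64 GiB. */
static uint64_t addressable_bytes(unsigned address_pins, unsigned data_pins)
{
    uint64_t locations = 1ULL << address_pins;
    return locations * (data_pins / 8);
}
```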
Now comes the rub. We're used to thinking of memory only in terms of RAM, the stuff programs read from and write to all the time. And indeed most of the memory requests from the processor are routed to RAM modules by the northbridge. But not all of them. Physical memory addresses are also used for communication with assorted devices on the motherboard (this communication is called memory-mapped I/O). These devices include video cards, most PCI cards (say, a scanner or SCSI card), and also the flash memory that stores the BIOS.
When the northbridge receives a physical memory request it decides where to route it: should it go to RAM? The video card, maybe? This routing is decided via the memory address map. For each region of physical memory addresses, the memory map knows the device that owns that region. The bulk of the addresses are mapped to RAM, but when they aren't, the memory map tells the chipset which device should service requests for those addresses. This mapping of memory addresses away from RAM modules causes the classic hole in PC memory between 640KB and 1MB. A bigger hole arises when memory addresses are reserved for video cards and PCI devices. This is why 32-bit OSes have problems using 4 gigs of RAM. In Linux the file /proc/iomem neatly lists these address range mappings. The diagram below shows a typical memory map for the first 4 gigs of physical memory addresses in an Intel PC:
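The chipset's routing decision amounts to a lookup over address ranges. A toy sketch in C; the regions below are illustrative placeholders, not the map of any real board (check /proc/iomem for your machine's actual ranges):

```c
#include <stdint.h>
#include <stddef.h>

enum owner { DEV_RAM, DEV_LEGACY, DEV_MAPPED, DEV_NONE };

/* Toy memory address map: each physical address region belongs to a
 * device. The ranges are made up for illustration. */
static const struct region {
    uint64_t start, end;
    enum owner owner;
} memory_map[] = {
    { 0x00000000, 0x0009FFFF, DEV_RAM    },  /* low 640K of RAM           */
    { 0x000A0000, 0x000FFFFF, DEV_LEGACY },  /* the classic 640K-1MB hole */
    { 0x00100000, 0xBFFFFFFF, DEV_RAM    },
    { 0xC0000000, 0xFFFFFFFF, DEV_MAPPED },  /* PCI, video, BIOS flash    */
};

/* Return the device that services a physical address. */
static enum owner route(uint64_t phys)
{
    for (size_t i = 0; i < sizeof memory_map / sizeof memory_map[0]; i++)
        if (phys >= memory_map[i].start && phys <= memory_map[i].end)
            return memory_map[i].owner;
    return DEV_NONE;
}
```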
Memory layout for the first 4 gigabytes in an Intel system.
Actual addresses and ranges depend on the specific motherboard and devices present in the computer, but most Core 2 systems are pretty close to the above. All of the brown regions are mapped away from RAM. Remember that these are physical addresses that are used on the motherboard buses. Inside the CPU (for example, in the programs we run and write), the memory addresses are logical and they must be translated by the CPU into a physical address before memory is accessed on the bus.

The rules for translation of logical addresses into physical addresses are complex and they depend on the mode in which the CPU is running (real mode, 32-bit protected mode, and 64-bit protected mode). Regardless of the translation mechanism, the CPU mode determines how much physical memory can be accessed. For example, if the CPU is running in 32-bit mode, then it is only capable of physically addressing 4 GB (well, there is an exception called physical address extension, but ignore it for now). Since the top 1 GB or so
of physical addresses are mapped to motherboard devices, the CPU can effectively use only ~3 GB of RAM (sometimes less; I have a Vista machine where only 2.4 GB are usable). If the CPU is in real mode, then it can only address 1 megabyte of physical RAM (this is the only mode early Intel processors were capable of). On the other hand, a CPU running in 64-bit mode can physically access 64 GB (few chipsets support that much RAM, though). In 64-bit mode it is possible to use physical addresses above the total RAM in the system to access the RAM regions that correspond to physical addresses stolen by motherboard devices. This is called reclaiming memory and it's done with help from the chipset.
That's all the memory we need for the next post, which describes the boot process from power up until the boot loader is about to jump into the kernel. If you'd like to learn more about this stuff, I highly recommend the Intel manuals. I'm big into primary sources overall, but the Intel manuals in particular are well written and accurate. Here are some:
- Datasheet for Intel G35 Chipset documents a representative chipset for Core 2 processors. This is the main source for this post.
- Datasheet for Intel Core 2 Quad-Core Q6000 Sequence is a processor datasheet. It documents each pin in the processor (there aren't that many actually, and after you group them there's really not a lot to it). Fascinating stuff, though some bits are arcane.
- The Intel Software Developer's Manuals are outstanding. Far from arcane, they explain beautifully all sorts of things about the architecture. Volumes 1 and 3A have the good stuff (don't be put off by the name, the volumes are small and you can read selectively).
- Pádraig Brady suggested that I link to Ulrich Drepper's excellent paper on memory. It's great stuff. I was waiting to link to it in a post about memory, but the more the merrier.
2. How Computers Boot Up

The previous post described motherboards and the memory map in Intel computers to set the scene for the initial phases of boot. Booting is an involved, hacky, multi-stage affair; fun stuff. Here's an outline of the process:
An outline of the boot sequence
Things start rolling when you press the power button on the computer (no! do tell!). Once the motherboard is powered up it initializes its own firmware (the chipset and other tidbits) and tries to get the CPU running. If things fail at this point (e.g., the CPU is busted or missing) then you will likely have a system that looks completely dead except for rotating fans. A few motherboards manage to emit beeps for an absent or faulty CPU, but the zombie-with-fans state is the most common scenario based on my experience. Sometimes USB or other devices can cause this to happen: unplugging all non-essential devices is a possible cure for a system that was working and suddenly appears dead like this. You can then single out the culprit device by elimination.
If all is well the CPU starts running. In a multi-processor or multi-core system one CPU is dynamically chosen to be the bootstrap processor (BSP) that runs all of the BIOS and kernel initialization code. The remaining processors, called application processors (AP) at this point, remain halted until later on when they are explicitly activated by the kernel. Intel CPUs have been evolving over the years but they're fully backwards compatible, so modern CPUs can behave like the original 1978 Intel 8086, which is exactly what they do after power up. In this primitive power-up state the processor is in real mode with memory paging disabled. This is like ancient MS-DOS where only 1 MB of memory can be addressed and any code can write to any place in memory; there's no notion of protection or privilege.
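Real mode addresses memory through segment:offset pairs, and the translation is what caps the address space at roughly 1 MB. A sketch of the arithmetic:

```c
#include <stdint.h>

/* In real mode a logical segment:offset pair becomes a linear address
 * as segment * 16 + offset, which is how 16-bit registers reach a
 * 20-bit (1 MB) address space. */
static uint32_t real_mode_linear(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}
```

Note that 0xFFFF:0xFFFF yields 0x10FFEF, slightly past 1 MB; on the 8086 such addresses wrapped around, a quirk later controlled by the A20 line.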
Most registers in the CPU have well-defined values after power up, including the instruction pointer (EIP) which holds the memory address for the instruction being executed by the CPU. Intel CPUs use a hack whereby even though only 1 MB of memory can be addressed at power up, a hidden base address (an offset, essentially) is applied to EIP so that the first instruction executed is at address 0xFFFFFFF0 (16 bytes short of the end of 4 gigs of
memory and well above one megabyte). This magical address is called the reset vector and is standard for modern Intel CPUs.
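The reset address falls out of the register values the CPU loads at power up: EIP starts at 0xFFF0 and the hidden code-segment base at 0xFFFF0000, so the first fetch lands 16 bytes below the 4 GiB mark. A quick check:

```c
#include <stdint.h>

/* Reset values on modern Intel CPUs: the hidden CS base plus the
 * initial EIP gives the reset vector, 16 bytes below 4 GiB. */
static uint64_t first_fetch_address(void)
{
    uint64_t hidden_cs_base = 0xFFFF0000ULL;
    uint64_t initial_eip    = 0x0000FFF0ULL;
    return hidden_cs_base + initial_eip;
}
```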
The motherboard ensures that the instruction at the reset vector is a jump to the memory location mapped to the BIOS entry point. This jump implicitly clears the hidden base address present at power up. All of these memory locations have the right contents needed by the CPU thanks to the memory map kept by the chipset. They are all mapped to flash memory containing the BIOS since at this point the RAM modules have random crap in them. An example of the relevant memory regions is shown below:
Important memory regions during boot
The CPU then starts executing BIOS code, which initializes some of the hardware in the machine. Afterwards the BIOS kicks off the Power-On Self-Test (POST), which tests various components in the computer. Lack of a working video card fails the POST and causes the BIOS to halt and emit beeps to let you know what's wrong, since messages on the screen aren't an option. A working video card takes us to a stage where the computer looks alive: manufacturer logos are printed, memory starts to be tested, angels blare their horns. Other POST failures, like a missing keyboard, lead to halts with an error message on the screen.
The POST involves a mixture of testing and initialization, including sorting out all the resources (interrupts, memory ranges, I/O ports) for PCI devices. Modern BIOSes that follow the Advanced Configuration and Power Interface build a number of data tables that describe the devices in the computer; these tables are later used by the kernel.
After the POST the BIOS wants to boot up an operating system, which must be found somewhere: hard drives, CD-ROM drives, floppy disks, etc. The actual order in which the BIOS seeks a boot device is user configurable. If there is no suitable boot device the BIOS halts with a complaint like "Non-System Disk or Disk Error". A dead hard drive might present with this symptom. Hopefully this doesn't happen and the BIOS finds a working disk allowing the boot to proceed.
The BIOS now reads the first 512-byte sector (sector zero) of the hard disk. This is called the Master Boot Record and it normally contains two vital components: a tiny OS-specific bootstrapping program at the start of the MBR followed by a partition table for the disk. The BIOS however does not care about any of this: it simply loads the contents of the MBR into memory location 0x7c00 and jumps to that location to start executing whatever code is in the MBR.
Master Boot Record
The specific code in the MBR could be a Windows MBR loader, code from Linux loaders such as LILO or GRUB, or even a virus. In contrast the partition table is standardized: it is a 64-byte area with four 16-byte entries describing how the disk has been divided up (so you can run multiple operating systems or have separate volumes in the same disk). Traditionally Microsoft MBR code takes a look at the partition table, finds the (only) partition marked as active, loads the boot sector for that partition, and runs that code. The boot sector is the first sector of a partition, as opposed to the first sector for the whole disk. If something is wrong with the partition table you would get messages like "Invalid Partition Table" or "Missing Operating System". This message does not come from the BIOS but rather from the MBR code loaded from disk. Thus the specific message depends on the MBR flavor.
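The classic MBR layout described above can be written down as a C struct. The field names are mine, but the sizes (446 bytes of code, four 16-byte partition entries, and a 2-byte 0xAA55 signature, 512 bytes in all) are the standard ones:

```c
#include <stdint.h>

/* One of the four 16-byte partition table entries. */
struct partition_entry {
    uint8_t  status;        /* 0x80 marks the active (bootable) partition */
    uint8_t  chs_first[3];  /* CHS address of the first sector */
    uint8_t  type;          /* partition type ID */
    uint8_t  chs_last[3];   /* CHS address of the last sector */
    uint32_t lba_first;     /* LBA of the first sector */
    uint32_t sector_count;  /* partition size in sectors */
} __attribute__((packed));

/* The 512-byte sector zero that the BIOS loads at 0x7c00. */
struct mbr {
    uint8_t  code[446];                    /* bootstrap code, OS-specific  */
    struct partition_entry partitions[4];  /* standardized 64-byte table   */
    uint16_t signature;                    /* 0xAA55; bytes 55 AA on disk  */
} __attribute__((packed));
```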
Boot loading has gotten more sophisticated and flexible over time. The Linux boot loaders LILO and GRUB can handle a wide variety of operating systems, file systems, and boot configurations. Their MBR code does not necessarily follow the "boot the active partition" approach described above. But functionally the process goes like this:
1. The MBR itself contains the first stage of the boot loader. GRUB calls this stage 1.
2. Due to its tiny size, the code in the MBR does just enough to load another sector from disk that contains additional bootstrap code. This sector might be the boot sector for a partition, but could also be a sector that was hard-coded into the MBR code when the MBR was installed.
3. The MBR code plus code loaded in step 2 then read a file containing the second stage of the boot loader. In GRUB this is GRUB Stage 2, and in Windows Server this is c:\NTLDR. If step 2 fails in Windows you'd get a message like "NTLDR is missing". The stage 2 code then reads a boot configuration file (e.g., grub.conf in GRUB, boot.ini in Windows). It then presents boot choices to the user or simply goes ahead in a single-boot system.
4. At this point the boot loader code needs to fire up a kernel. It must know enough about file systems to read the kernel from the boot partition. In Linux this means reading a file like vmlinuz-2.6.22-14-server containing the kernel, loading the file into memory and jumping to the kernel bootstrap code. In Windows Server 2003 some of the kernel start-up code is separate from the kernel image itself and is actually embedded into NTLDR. After performing several initializations, NTLDR loads the kernel image from file c:\Windows\System32\ntoskrnl.exe and, just as GRUB does, jumps to the kernel entry point.
There's a complication worth mentioning (aka, I told you this thing is hacky). The image for a current Linux kernel, even compressed, does not fit into the 640K of RAM available in real mode. My vanilla Ubuntu kernel is 1.7 MB compressed. Yet the boot loader must run in real mode in order to call the BIOS routines for reading from the disk, since the kernel is clearly not available at that point. The solution is the venerable unreal mode. This is not a true processor mode (I wish the engineers at Intel were allowed to have fun like that), but rather a technique where a program switches back and forth between real mode and protected mode in order to access memory above 1 MB while still using the BIOS. If you read GRUB source code, you'll see these transitions all over the place (look under stage2/ for calls to real_to_prot and prot_to_real). At the end of this sticky process the loader has stuffed the kernel in memory, by hook or by crook, but it leaves the processor in real mode when it's done.
We're now at the jump from "Boot Loader" to "Early Kernel Initialization" as shown in the first diagram. That's when things heat up as the kernel starts to unfold and set things in motion. The next post will be a guided tour through the Linux kernel initialization with links to sources at the Linux Cross Reference. I can't do the same for Windows but I'll point out the highlights.
[Update: cleared up discussion of NTLDR.]
3. The Kernel Boot Process

The previous post explained how computers boot up right up to the point where the boot loader, after stuffing the kernel image into memory, is about to jump into the kernel entry point. This last post about booting takes a look at the guts of the kernel to see how an operating system starts life. Since I have an empirical bent I'll link heavily to the sources for Linux kernel 2.6.25.6 at the Linux Cross Reference. The sources are very readable if you are familiar with C-like syntax; even if you miss some details you can get the gist of what's happening. The main obstacle is the lack of context around some of the code, such as when or why it runs or the underlying features of the machine. I hope to provide a bit of that context. Due to brevity (hah!) a lot of fun stuff like interrupts and memory gets only a nod for now. The post ends with the highlights for the Windows boot.
At this point in the Intel x86 boot story the processor is running in real-mode, is able to
address 1 MB of memory, and RAM looks like this for a modern Linux system:
RAM contents after boot loader is done
The kernel image has been loaded to memory by the boot loader using the BIOS disk I/O
services. This image is an exact copy of the file in your hard drive that contains the kernel,
e.g. /boot/vmlinuz-2.6.22-14-server. The image is split into two pieces: a small part
containing the real-mode kernel code is loaded below the 640K barrier; the bulk of the
kernel, which runs in protected mode, is loaded after the first megabyte of memory.
The action starts in the real-mode kernel header pictured above. This region of memory is used to implement the Linux boot protocol between the boot loader and the kernel. Some of the values there are read by the boot loader while doing its work. These include amenities such as a human-readable string containing the kernel version, but also crucial information like the size of the real-mode kernel piece. The boot loader also writes values to this region, such as the memory address for the command-line parameters given by the user in the boot menu. Once the boot loader is finished it has filled in all of the parameters required by the kernel header. It's then time to jump into the kernel entry point. The diagram below shows the code sequence for the kernel initialization, along with source directories, files, and line numbers:
Architecture-specific Linux Kernel Initialization
The early kernel start-up for the Intel architecture is in file arch/x86/boot/header.S. It's in assembly language, which is rare for the kernel at large but common for boot code. The start of this file actually contains boot sector code, a leftover from the days when Linux could work without a boot loader. Nowadays this boot sector, if executed, only prints a bugger_off_msg to the user and reboots. Modern boot loaders ignore this legacy code. After the boot sector code we have the first 15 bytes of the real-mode kernel header; these two pieces together add up to 512 bytes, the size of a typical disk sector on Intel hardware. After these 512 bytes, at offset 0x200, we find the very first instruction that runs as part of the Linux kernel: the real-mode entry point. It's in header.S:110 and it is a 2-byte jump written directly in machine code as 0x3aeb. You can verify this by running hexdump on your kernel image and seeing the bytes at that offset; just a sanity check to make sure it's not all a dream. The boot loader jumps into this location when it is finished, which in turn jumps to header.S:229 where we have a regular assembly routine called start_of_setup. This short routine sets up a stack, zeroes the bss segment (the area that contains static variables, so they start with zero values) for the real-mode kernel and then jumps to good old C code at arch/x86/boot/main.c:122.
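The word 0x3aeb is little-endian for the bytes eb 3a: opcode 0xEB is the x86 short relative jump, and 0x3A is a signed 8-bit displacement measured from the next instruction. A decoder sketch (the displacement varies between kernel builds; 0x3a is the one quoted above):

```c
#include <stdint.h>

/* Decode a 2-byte x86 short jump (opcode 0xEB, signed rel8). The
 * target is relative to the address of the *next* instruction. */
static uint32_t short_jump_target(uint32_t instr_offset, uint16_t le_word)
{
    uint8_t opcode = le_word & 0xFF;            /* low byte comes first in memory */
    int8_t  rel8   = (int8_t)(le_word >> 8);    /* signed displacement */
    return opcode == 0xEB ? instr_offset + 2 + (uint32_t)(int32_t)rel8
                          : instr_offset;       /* not a short jump */
}
```

For an entry point at offset 0x200 with displacement 0x3a the jump lands at offset 0x23c, wherever start_of_setup happens to sit in that particular build.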
main() does some housekeeping like detecting memory layout, setting a video mode, etc. It then calls go_to_protected_mode(). Before the CPU can be set to protected mode, however, a few tasks must be done. There are two main issues: interrupts and memory. In real mode the interrupt vector table for the processor is always at memory address 0, whereas in protected mode the location of the interrupt vector table is stored in a CPU register called IDTR. Meanwhile, the translation of logical memory addresses (the ones programs manipulate) to linear memory addresses (a raw number from 0 to the top of the memory) is different between real mode and protected mode. Protected mode requires a register called GDTR to be loaded with the address of a Global Descriptor Table for memory. So go_to_protected_mode() calls setup_idt() and setup_gdt() to install a temporary interrupt descriptor table and global descriptor table.
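Both GDTR and IDTR are loaded (via the lgdt/lidt instructions) with a 6-byte pseudo-descriptor: a 16-bit table limit followed by a 32-bit linear base address. Its shape, in C:

```c
#include <stdint.h>

/* The 6-byte operand of lgdt/lidt in 32-bit mode: the table's size
 * minus one, then the table's linear base address. */
struct desc_ptr {
    uint16_t limit;  /* size of the descriptor table - 1 */
    uint32_t base;   /* linear address of the table */
} __attribute__((packed));
```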
We're now ready for the plunge into protected mode, which is done by protected_mode_jump, another assembly routine. This routine enables protected mode by setting the PE bit in the CR0 CPU register. At this point we're running with paging disabled; paging is an optional feature of the processor, even in protected mode, and there's no need for it yet. What's important is that we're no longer confined to the 640K barrier and can now address up to 4 GB of RAM. The routine then calls the 32-bit kernel entry point, which is startup_32 for compressed kernels. This routine does some basic register initializations and calls decompress_kernel(), a C function to do the actual decompression.

decompress_kernel() prints the familiar "Decompressing Linux" message. Decompression happens in-place and once it's finished the uncompressed kernel image has overwritten the compressed one pictured in the first diagram. Hence the uncompressed contents also start at 1 MB. decompress_kernel() then prints "done." and the comforting "Booting the kernel." By "Booting" it means a jump to the final entry point in this whole story, given to Linus by God himself atop Mountain Halti, which is the protected-mode kernel entry point at the start of the second megabyte of RAM (0x100000). That sacred location contains a routine called, uh, startup_32. But this one is in a different directory, you see.
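The mode switch itself boils down to flipping bits in CR0: bit 0 (PE) turns on protected mode, while bit 31 (PG), paging, stays off at this stage. A sketch of the bit manipulation (the real work is a couple of assembly instructions in protected_mode_jump):

```c
#include <stdint.h>

#define CR0_PE (1UL << 0)   /* protection enable */
#define CR0_PG (1UL << 31)  /* paging enable */

/* What protected_mode_jump effectively does to CR0: set PE, leave PG
 * untouched (paging is turned on later, in the protected-mode kernel). */
static uint32_t enable_protected_mode(uint32_t cr0)
{
    return cr0 | CR0_PE;
}
```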
The second incarnation of startup_32 is also an assembly routine, but it contains 32-bit mode initializations. It clears the bss segment for the protected-mode kernel (which is the true kernel that will now run until the machine reboots or shuts down), sets up the final global descriptor table for memory, builds page tables so that paging can be turned on, enables paging, initializes a stack, creates the final interrupt descriptor table, and finally jumps to the architecture-independent kernel start-up, start_kernel(). The diagram below shows the code flow for the last leg of the boot:
Architecture-independent Linux Kernel Initialization
start_kernel() looks more like typical kernel code, which is nearly all C and machine independent. The function is a long list of calls to initializations of the various kernel subsystems and data structures. These include the scheduler, memory zones, time keeping, and so on. start_kernel() then calls rest_init(), at which point things are almost all working. rest_init() creates a kernel thread passing another function, kernel_init(), as the entry point. rest_init() then calls schedule() to kickstart task scheduling and goes to sleep by calling cpu_idle(), which is the idle thread for the Linux kernel. cpu_idle() runs forever and so does process zero, which hosts it. Whenever there is work to do (a runnable process) process zero gets booted out of the CPU, only to return when no runnable processes are available.

But here's the kicker for us. This idle loop is the end of the long thread we followed since boot, it's the final descendant of the very first jump executed by the processor after power up. All of this mess, from reset vector to BIOS to MBR to boot loader to real-mode kernel to protected-mode kernel, all of it leads right here, jump by jump by jump it ends in the idle loop for the boot processor, cpu_idle(). Which is really kind of cool. However, this can't be the whole story otherwise the computer would do no work.
At this point, the kernel thread started previously is ready to kick in, displacing process 0 and its idle thread. And so it does, at which point kernel_init() starts running since it was given as the thread entry point. kernel_init() is responsible for initializing the remaining CPUs in the system, which have been halted since boot. All of the code we've seen so far has been executed in a single CPU, called the boot processor. As the other CPUs, called application processors, are started they come up in real mode and must run through several initializations as well. Many of the code paths are common, as you can see in the code for startup_32, but there are slight forks taken by the late-coming application processors.
Finally, kernel_init() calls init_post(), which tries to execute a user-mode process in the following order: /sbin/init, /etc/init, /bin/init, and /bin/sh. If all fail, the kernel will panic. Luckily init is usually there, and starts running as PID 1. It checks its configuration file to figure out which processes to launch, which might include X11 Windows, programs for
logging in on the console, network daemons, and so on. Thus ends the boot process as yet
another Linux box starts running somewhere. May your uptime be long and untroubled.
The process for Windows is similar in many ways, given the common architecture. Many of
the same problems are faced and similar initializations must be done. When it comes to
boot, one of the biggest differences is that Windows packs all of the real-mode kernel code,
and some of the initial protected-mode code, into the boot loader itself (C:\NTLDR). So
instead of having two regions in the same kernel image, Windows uses different binary
images. Plus Linux completely separates boot loader and kernel; in a way this automatically
falls out of the open source process. The diagram below shows the main bits for the
Windows kernel:
Windows Kernel Initialization

The Windows user-mode start-up is naturally very different. There's no /sbin/init, but rather
Csrss.exe and Winlogon.exe. Winlogon spawns Services.exe, which starts all of the
Windows services, and Lsass.exe, the local security authentication subsystem. The classic
Windows login dialog runs in the context of Winlogon.
This is the end of this boot series. Thanks everyone for reading and for feedback. I'm sorry
some things got superficial treatment; I've gotta start somewhere and only so much fits into
blog-sized bites. But nothing like a day after the next; my plan is to do regular Software
Illustrated posts like this series along with other topics. Meanwhile, here are some
resources:

- The best, most important resource is source code for real kernels, either Linux or one
of the BSDs.
- Intel publishes excellent Software Developer's Manuals, which you can download for
free.
- Understanding the Linux Kernel is a good book and walks through a lot of the Linux
kernel sources. It's getting outdated and it's dry, but I'd still recommend it to anyone
who wants to grok the kernel. Linux Device Drivers is more fun, teaches well, but is
limited in scope.
- Finally, Patrick Moroney suggested Linux Kernel Development by Robert Love in the
comments for this post. I've heard other positive reviews for that book, so it sounds
worth checking out.
- For Windows, the best reference by far is Windows Internals by David Solomon and
Mark Russinovich, the latter of Sysinternals fame. This is a great book, well-written
and thorough. The main downside is the lack of source code.
[Update: In a comment below, Nix covered a lot of ground on the initial root file system that
I glossed over. Thanks to Marius Barbu for catching a mistake where I wrote "CR3" instead
of "GDTR".]
4. Memory Translation and Segmentation

This post is the first in a series about memory and protection in Intel-compatible (x86)
computers, going further down the path of how kernels work. As in the boot series, I'll link
to Linux kernel sources but give Windows examples as well (sorry, I'm ignorant about the
BSDs and the Mac, but most of the discussion applies). Let me know what I screw up.
In the chipsets that power Intel motherboards, memory is accessed by the CPU via the front
side bus, which connects it to the northbridge chip. The memory addresses exchanged on
the front side bus are physical memory addresses, raw numbers from zero to the top of
the available physical memory. These numbers are mapped to physical RAM sticks by the
northbridge. Physical addresses are concrete and final: no translation, no paging, no
privilege checks; you put them on the bus and that's that. Within the CPU, however,
programs use logical memory addresses, which must be translated into physical
addresses before memory access can take place. Conceptually, address translation looks like
this:
Memory address translation in x86 CPUs with paging enabled

This is not a physical diagram, only a depiction of the address translation process,
specifically for when the CPU has paging enabled. If you turn off paging, the output from
the segmentation unit is already a physical address; in 16-bit real mode that is always the
case. Translation starts when the CPU executes an instruction that refers to a memory
address. The first step is translating that logical address into a linear address. But why go
through this step instead of having software use linear (or physical) addresses directly? For
roughly the same reason humans have an appendix whose primary function is getting
infected. It's a wrinkle of evolution. To really make sense of x86 segmentation we need to
go back to 1978.
The original 8086 had 16-bit registers and its instructions used mostly 8-bit or 16-bit
operands. This allowed code to work with 2^16 bytes, or 64K, of memory, yet Intel engineers
were keen on letting the CPU use more memory without expanding the size of registers and
instructions. So they introduced segment registers as a means to tell the CPU which 64K
chunk of memory a program's instructions were going to work on. It was a reasonable
solution: first you load a segment register, effectively saying here, I want to work on the
memory chunk starting at X; afterwards, 16-bit memory addresses used by your code are
interpreted as offsets into your chunk, or segment. There were four segment registers: one
for the stack (ss), one for program code (cs), and two for data (ds, es). Most programs
were small enough back then to fit their whole stack, code, and data each in a 64K
segment, so segmentation was often transparent.
Nowadays segmentation is still present and is always enabled in x86 processors. Each
instruction that touches memory implicitly uses a segment register. For example, a jump
instruction uses the code segment register (cs) whereas a stack push instruction uses the
stack segment register (ss). In most cases you can explicitly override the segment register
used by an instruction. Segment registers store 16-bit segment selectors; they can be
loaded directly with instructions like MOV. The sole exception is cs, which can only be
changed by instructions that affect the flow of execution, like CALL or JMP. Though
segmentation is always on, it works differently in real mode versus protected mode.
In real mode, such as during early boot, the segment selector is a 16-bit number specifying
the physical memory address for the start of a segment. This number must somehow be
scaled, otherwise it would also be limited to 64K, defeating the purpose of segmentation.
For example, the CPU could use the segment selector as the 16 most significant bits of the
physical memory address (by shifting it 16 bits to the left, which is equivalent to multiplying
by 2^16). This simple rule would enable segments to address 4 gigs of memory in 64K
chunks. Sadly Intel made a bizarre decision to multiply the segment selector by only 2^4 (or
16), which in a single stroke confined memory to about 1MB and unduly complicated
translation. Here's an example showing a jump instruction where cs contains 0x1000:
Real mode segmentation

Real-mode segment starts range from 0 all the way to 0xFFFF0 (16 bytes short of 1MB) in
16-byte increments. To these values you add a 16-bit offset (the logical address) between 0
and 0xFFFF. It follows that there are multiple segment/offset combinations pointing to the
same memory location, and physical addresses fall above 1MB if your segment is high
enough (see the infamous A20 line). Also, when writing C code in real mode a far pointer is
a pointer that contains both the segment selector and the logical address, which allows it to
address 1MB of memory. Far indeed. As programs started getting bigger and outgrowing
64K segments, segmentation and its strange ways complicated development for the x86
platform. This may all sound quaintly odd now but it has driven programmers into the
wretched depths of madness.
In 32-bit protected mode, a segment selector is no longer a raw number, but instead it
contains an index into a table of segment descriptors. The table is simply an array
containing 8-byte records, where each record describes one segment and looks thus:
Segment descriptor
There are three types of segments: code, data, and system. For brevity, only the common
features in the descriptor are shown here. The base address is a 32-bit linear address
pointing to the beginning of the segment, while the limit specifies how big the segment is.
Adding the base address to a logical memory address yields a linear address. DPL is the
descriptor privilege level; it is a number from 0 (most privileged, kernel mode) to 3 (least
privileged, user mode) that controls access to the segment.
These segment descriptors are stored in two tables: the Global Descriptor Table (GDT)
and the Local Descriptor Table (LDT). Each CPU (or core) in a computer contains a
register called gdtr which stores the linear memory address of the first byte in the GDT. To
choose a segment, you must load a segment register with a segment selector in the
following format:
Segment Selector
The TI bit is 0 for the GDT and 1 for the LDT, while the index specifies the desired segment
descriptor within the table. We'll deal with RPL, Requested Privilege Level, later on. Now,
come to think of it, when the CPU is in 32-bit mode registers and instructions can address
the entire linear address space anyway, so there's really no need to give them a push with a
base address or other shenanigan. So why not set the base address to zero and let logical
addresses coincide with linear addresses? Intel docs call this flat model and it's exactly
what modern x86 kernels do (they use the basic flat model, specifically). Basic flat model is
equivalent to disabling segmentation when it comes to translating memory addresses. So in
all its glory, here's the jump example running in 32-bit protected mode, with real-world
values for a Linux user-mode app:
Protected Mode Segmentation
The contents of a segment descriptor are cached once they are accessed, so there's no need
to actually read the GDT in subsequent accesses, which would kill performance. Each
segment register has a hidden part to store the cached descriptor that corresponds to its
segment selector. For more details, including more info on the LDT, see chapter 3 of the
Intel System Programming Guide Volume 3a. Volumes 2a and 2b, which cover every x86
instruction, also shed light on the various types of x86 addressing operands: 16-bit, 16-bit
with segment selector (which can be used by far pointers), 32-bit, etc.
In Linux, only 3 segment descriptors are used during boot. They are defined with the
GDT_ENTRY macro and stored in the boot_gdt array. Two of the segments are flat,
addressing the entire 32-bit space: a code segment loaded into cs and a data segment
loaded into the other segment registers. The third segment is a system segment called the
Task State Segment. After boot, each CPU has its own copy of the GDT. They are all nearly
identical, but a few entries change depending on the running process. You can see the
layout of the Linux GDT in segment.h and its instantiation is here. There are four primary
GDT entries: two flat ones for code and data in kernel mode, and another two for user
mode. When looking at the Linux GDT, notice the holes inserted on purpose to align data
with CPU cache lines, an artifact of the von Neumann bottleneck that has become a
plague. Finally, the classic "Segmentation fault" Unix error message is not due to x86-style
segments, but rather invalid memory addresses normally detected by the paging unit;
alas, a topic for an upcoming post.
Intel deftly worked around their original segmentation kludge, offering a flexible way for us
to choose whether to segment or go flat. Since coinciding logical and linear addresses are
simpler to handle, they became standard, such that 64-bit mode now enforces a flat linear
address space. But even in flat mode segments are still crucial for x86 protection, the
mechanism that defends the kernel from user-mode processes and every process from each
other. It's a dog-eat-dog world out there! In the next post, we'll take a peek at protection
levels and how segments implement them.
5. CPU Rings, Privilege, and Protection

You probably know intuitively that applications have limited powers in Intel x86 computers
and that only operating system code can perform certain tasks, but do you know how this
really works? This post takes a look at x86 privilege levels, the mechanism whereby the
OS and CPU conspire to restrict what user-mode programs can do. There are four privilege
levels, numbered 0 (most privileged) to 3 (least privileged), and three main resources being
protected: memory, I/O ports, and the ability to execute certain machine instructions. At
any given time, an x86 CPU is running in a specific privilege level, which determines what
code can and cannot do. These privilege levels are often described as protection rings, with
the innermost ring corresponding to highest privilege. Most modern x86 kernels use only
two privilege levels, 0 and 3:
x86 Protection Rings
About 15 machine instructions, out of dozens, are restricted by the CPU to ring zero. Many
others have limitations on their operands. These instructions can subvert the protection
mechanism or otherwise foment chaos if allowed in user mode, so they are reserved to the
kernel. An attempt to run them outside of ring zero causes a general-protection exception,
like when a program uses invalid memory addresses. Likewise, access to memory and I/O
ports is restricted based on privilege level. But before we look at protection mechanisms,
let's see exactly how the CPU keeps track of the current privilege level, which involves the
segment selectors from the previous post. Here they are:
Segment Selectors - Data and Code

The full contents of data segment selectors are loaded directly by code into various segment
registers such as ss (stack segment register) and ds (data segment register). This includes
the contents of the Requested Privilege Level (RPL) field, whose meaning we tackle in a bit.
The code segment register (cs) is, however, magical. First, its contents cannot be set
directly by load instructions such as mov, but rather only by instructions that alter the flow
of program execution, like call. Second, and importantly for us, instead of an RPL field that
can be set by code, cs has a Current Privilege Level (CPL) field maintained by the CPU
itself. This 2-bit CPL field in the code segment register is always equal to the CPU's
current privilege level. The Intel docs wobble a little on this fact, and sometimes online
documents confuse the issue, but that's the hard and fast rule. At any time, no matter
what's going on in the CPU, a look at the CPL in cs will tell you the privilege level code is
running with.
Keep in mind that the CPU privilege level has nothing to do with operating system
users. Whether you're root, Administrator, guest, or a regular user, it does not matter. All
user code runs in ring 3 and all kernel code runs in ring 0, regardless of the OS user
on whose behalf the code operates. Sometimes certain kernel tasks can be pushed to user
mode, for example user-mode device drivers in Windows Vista, but these are just special
processes doing a job for the kernel and can usually be killed without major consequences.

Due to restricted access to memory and I/O ports, user mode can do almost nothing to the
outside world without calling on the kernel. It can't open files, send network packets, print
to the screen, or allocate memory. User processes run in a severely limited sandbox set up
by the gods of ring zero. That's why it's impossible, by design, for a process to leak memory
beyond its existence or leave open files after it exits. All of the data structures that control
such things (memory, open files, etc.) cannot be touched directly by user code; once a
process finishes, the sandbox is torn down by the kernel. That's why our servers can have
600 days of uptime: as long as the hardware and the kernel don't crap out, stuff can run
for ever. This is also why Windows 95 / 98 crashed so much: it's not because "M$ sucks"
but because important data structures were left accessible to user mode for compatibility
reasons. It was probably a good trade-off at the time, albeit at high cost.
The CPU protects memory at two crucial points: when a segment selector is loaded and
when a page of memory is accessed with a linear address. Protection thus mirrors memory
address translation, where both segmentation and paging are involved. When a data
segment selector is being loaded, the check below takes place:
x86 Segment Protection

Since a higher number means less privilege, MAX() above picks the least privileged of CPL
and RPL, and compares it to the descriptor privilege level (DPL). If the DPL is higher or
equal, then access is allowed. The idea behind RPL is to allow kernel code to load a segment
using lowered privilege. For example, you could use an RPL of 3 to ensure that a given
operation uses segments accessible to user mode. The exception is for the stack segment
register ss, for which all three of CPL, RPL, and DPL must match exactly.
In truth, segment protection scarcely matters because modern kernels use a flat address
space where the user-mode segments can reach the entire linear address space. Useful
memory protection is done in the paging unit when a linear address is converted into a
physical address. Each memory page is a block of bytes described by a page table entry
containing two fields related to protection: a supervisor flag and a read/write flag. The
supervisor flag is the primary x86 memory protection mechanism used by kernels. When it
is on, the page cannot be accessed from ring 3. While the read/write flag isn't as important
for enforcing privilege, it's still useful. When a process is loaded, pages storing binary
images (code) are marked as read only, thereby catching some pointer errors if a program
attempts to write to these pages. This flag is also used to implement copy on write when a
process is forked in Unix. Upon forking, the parent's pages are marked read only and shared
with the forked child. If either process attempts to write to the page, the processor triggers
a fault and the kernel knows to duplicate the page and mark it read/write for the writing
process.
Finally, we need a way for the CPU to switch between privilege levels. If ring 3 code could
transfer control to arbitrary spots in the kernel, it would be easy to subvert the operating
system by jumping into the wrong (right?) places. A controlled transfer is necessary. This is
accomplished via gate descriptors and via the sysenter instruction. A gate descriptor is a
segment descriptor of type system, and comes in four sub-types: call-gate descriptor,
interrupt-gate descriptor, trap-gate descriptor, and task-gate descriptor. Call gates provide
a kernel entry point that can be used with ordinary call and jmp instructions, but they aren't
used much so I'll ignore them. Task gates aren't so hot either (in Linux, they are only used
in double faults, which are caused by either kernel or hardware problems).
That leaves two juicier ones: interrupt and trap gates, which are used to handle hardware
interrupts (e.g., keyboard, timer, disks) and exceptions (e.g., page faults, divide by zero).
I'll refer to both as an "interrupt". These gate descriptors are stored in the Interrupt
Descriptor Table (IDT). Each interrupt is assigned a number between 0 and 255 called a
vector, which the processor uses as an index into the IDT when figuring out which gate
descriptor to use when handling the interrupt. Interrupt and trap gates are nearly identical.
Their format is shown below along with the privilege checks enforced when an interrupt
happens. I filled in some values for the Linux kernel to make things concrete.
Interrupt Descriptor with Privilege Check

Both the DPL and the segment selector in the gate regulate access, while segment selector
plus offset together nail down an entry point for the interrupt handler code. Kernels
normally use the segment selector for the kernel code segment in these gate descriptors.
An interrupt can never transfer control from a more-privileged to a less-privileged ring.
Privilege must either stay the same (when the kernel itself is interrupted) or be elevated
(when user-mode code is interrupted). In either case, the resulting CPL will be equal to
the DPL of the destination code segment; if the CPL changes, a stack switch also occurs. If
an interrupt is triggered by code via an instruction like int n, one more check takes place:
the gate DPL must be at the same or lower privilege as the CPL. This prevents user code
from triggering random interrupts. If these checks fail, you guessed it, a general-
protection exception happens. All Linux interrupt handlers end up running in ring zero.
During initialization, the Linux kernel first sets up an IDT in setup_idt() that ignores all
interrupts. It then uses functions in include/asm-x86/desc.h to flesh out common IDT
entries in arch/x86/kernel/traps_32.c. In Linux, a gate descriptor with "system" in its name
is accessible from user mode and its set function uses a DPL of 3. A "system gate" is an
Intel trap gate accessible to user mode. Otherwise, the terminology matches up. Hardware
interrupt gates are not set here however, but instead in the appropriate drivers.

Three gates are accessible to user mode: vectors 3 and 4 are used for debugging and
checking for numeric overflows, respectively. Then a system gate is set up for the
SYSCALL_VECTOR, which is 0x80 for the x86 architecture. This was the mechanism for a
process to transfer control to the kernel, to make a system call, and back in the day I
applied for an "int 0x80" vanity license plate. Starting with the Pentium Pro, the
sysenter instruction was introduced as a faster way to make system calls. It relies on
special-purpose CPU registers that store the code segment, entry point, and other tidbits for
the kernel system call handler. When sysenter is executed the CPU does no privilege
checking, going immediately into CPL 0 and loading new values into the registers for code
and stack (cs, eip, ss, and esp). Only ring zero can load the sysenter setup registers, which
is done in enable_sep_cpu().
Finally, when it's time to return to ring 3, the kernel issues an iret or sysexit instruction to
return from interrupts and system calls, respectively, thus leaving ring 0 and resuming
execution of user code with a CPL of 3. Vim tells me I'm approaching 1,900 words, so I/O
port protection is for another day. This concludes our tour of x86 rings and protection.
Thanks for reading!
6. What Your Computer Does While You Wait

This post takes a look at the speed (latency and throughput) of various subsystems in a
modern commodity PC, an Intel Core 2 Duo at 3.0GHz. I hope to give a feel for the relative
speed of each component and a cheatsheet for back-of-the-envelope performance
calculations. I've tried to show real-world throughputs (the sources are posted as a
comment) rather than theoretical maximums. Time units are nanoseconds (ns, 10^-9
seconds), milliseconds (ms, 10^-3 seconds), and seconds (s). Throughput units are in
megabytes and gigabytes per second. Let's start with CPU and memory, the north of the
northbridge:
The first thing that jumps out is how absurdly fast our processors are. Most simple
instructions on the Core 2 take one clock cycle to execute, hence a third of a
nanosecond at 3.0GHz. For reference, light only travels ~4 inches (10 cm) in the time
taken by a clock cycle. It's worth keeping this in mind when you're thinking of optimization:
instructions are comically cheap to execute nowadays.
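Both figures are easy to verify with a quick back-of-the-envelope calculation (a sketch; the speed of light is rounded to 3.0e8 m/s):

```python
# Back-of-the-envelope check: duration of one clock cycle at 3.0 GHz,
# and how far light travels in that time.
CLOCK_HZ = 3.0e9
SPEED_OF_LIGHT_M_S = 3.0e8   # rounded

cycle_ns = 1e9 / CLOCK_HZ                       # ~0.33 ns per cycle
light_cm = SPEED_OF_LIGHT_M_S / CLOCK_HZ * 100  # distance per cycle, in cm

print(f"one cycle: {cycle_ns:.2f} ns; light travels ~{light_cm:.0f} cm per cycle")
```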
As the CPU works away, it must read from and write to system memory, which it accesses
via the L1 and L2 caches. The caches use static RAM, a much faster (and more expensive) type of
memory than the DRAM used as the main system memory. The caches are part of
the processor itself, and the pricier memory gives us very low latency. One way in which
instruction-level optimization is still very relevant is code size. Due to caching, there can be
massive performance differences between code that fits wholly into the L1/L2 caches and
code that needs to be marshalled into and out of the caches as it executes.
Normally when the CPU needs to touch the contents of a memory region they must either
be in the L1/L2 caches already or be brought in from the main system memory. Here we
see our first major hit, a massive ~250 cycles of latency that often leads to a stall, when
the CPU has no work to do while it waits. To put this into perspective, reading from L1
cache is like grabbing a piece of paper from your desk (3 seconds), L2 cache is picking up a
book from a nearby shelf (14 seconds), and main system memory is taking a 4-minute walk
down the hall to buy a Twix bar.
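The office analogy comes from scaling one clock cycle to one second. A small sketch, using the approximate cycle counts above (these are ballpark Core 2 figures, not exact specs):

```python
# Scale latencies to human time: map one clock cycle to one second.
# Cycle counts are the approximate figures used in the text.
CYCLE_NS = 1.0 / 3.0   # one cycle at 3.0 GHz

latency_cycles = {"L1 cache": 3, "L2 cache": 14, "main memory": 250}

for name, cycles in latency_cycles.items():
    real_ns = cycles * CYCLE_NS
    print(f"{name}: ~{real_ns:.0f} ns real, feels like {cycles} s in the analogy")

# 250 cycles at 1/3 ns each is roughly 83 ns of real stall time,
# and 250 "seconds" is about the four-minute walk in the analogy.
main_memory_ns = latency_cycles["main memory"] * CYCLE_NS
```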
The exact latency of main memory is variable and depends on the application and many
other factors. For example, it depends on the CAS latency and specifications of the actual
RAM stick that is in the computer. It also depends on how successful the processor is at
prefetching: guessing which parts of memory will be needed based on the code that is
executing, and having them brought into the caches ahead of time.
Looking at L1/L2 cache performance versus main memory performance, it is clear how
much there is to gain from larger L2 caches and from applications designed to use them well.
For a discussion of all things memory, see Ulrich Drepper's What Every Programmer Should
Know About Memory (PDF), a fine paper on the subject.
People refer to the bottleneck between CPU and memory as the von Neumann bottleneck.
Now, the front side bus bandwidth, ~10GB/s, actually looks decent. At that rate, you could
read all of 8GB of system memory in less than one second or read 100 bytes in 10ns. Sadly
this throughput is a theoretical maximum (unlike most others in the diagram) and cannot be
achieved due to delays in the main RAM circuitry. Many discrete wait periods are required
when accessing memory. The electrical protocol for access calls for delays after a memory row is selected, after a column is selected, before data can be read reliably, and so on. The
use of capacitors calls for periodic refreshes of the data stored in memory lest some bits get
corrupted, which adds further overhead. Certain consecutive memory accesses may happen
more quickly but there are still delays, and more so for random access. Latency is always
present.
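The bandwidth numbers above can be sanity-checked in a couple of lines (a sketch; 10 GB/s is taken as the theoretical ceiling stated in the text):

```python
# Sanity check on the ~10 GB/s front side bus ceiling:
# time to read 8 GB of RAM, and bytes moved in 10 ns.
FSB_BYTES_PER_SEC = 10e9

seconds_for_8gb = (8 * 2**30) / FSB_BYTES_PER_SEC   # just under a second
bytes_in_10ns = FSB_BYTES_PER_SEC * 10e-9           # 100 bytes

print(f"8 GB in {seconds_for_8gb:.2f} s; {bytes_in_10ns:.0f} bytes in 10 ns")
```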
Down in the southbridge we have a number of other buses (e.g., PCIe, USB) and
peripherals connected:
Sadly the southbridge hosts some truly sluggish performers, for even main memory is
blazing fast compared to hard drives. Keeping with the office analogy, waiting for a hard
drive seek is like leaving the building to roam the earth for one year and three months.
This is why so many workloads are dominated by disk I/O and why database performance
can drive off a cliff once the in-memory buffers are exhausted. It is also why plentiful RAM
(for buffering) and fast hard drives are so important for overall system performance.

While the sustained disk throughput is real in the sense that it is actually achieved by the
disk in real-world situations, it does not tell the whole story. The bane of disk performance
is seeks, which involve moving the read/write heads across the platter to the right track
and then waiting for the platter to spin around to the right position so that the desired
sector can be read. Disk RPMs refer to the speed of rotation of the platters: the faster the
RPMs, the less time you wait on average for the rotation to give you the desired sector,
hence higher RPMs mean faster disks. A cool place to read about the impact of seeks is the
paper where a couple of Stanford grad students describe the Anatomy of a Large-Scale
Hypertextual Web Search Engine (PDF).
When the disk is reading one large continuous file it achieves greater sustained read speeds
due to the lack of seeks. Filesystem defragmentation aims to keep files in continuous
chunks on the disk to minimize seeks and boost throughput. When it comes to how fast a
computer feels, sustained throughput is less important than seek times and the number of
random I/O operations (reads/writes) that a disk can do per time unit. Solid state disks can
make for a great option here.
Hard drive caches also help performance. Their tiny size (a 16MB cache in a 750GB drive
covers only 0.002% of the disk) suggests they're useless, but in reality their contribution is
allowing a disk to queue up writes and then perform them in one bunch, thereby allowing
the disk to plan the order of the writes in a way that (surprise) minimizes seeks. Reads
can also be grouped in this way for performance, and both the OS and the drive firmware
engage in these optimizations.
Finally, the diagram has various real-world throughputs for networking and other buses.
Firewire is shown for reference but is not available natively in the Intel X48 chipset. It's fun
to think of the Internet as a computer bus. The latency to a fast website (say, google.com)
is about 45ms, comparable to hard drive seek latency. In fact, while hard drives are 5
orders of magnitude removed from main memory, they're in the same magnitude as the
Internet. Residential bandwidth still lags behind that of sustained hard drive reads, but the
network is the computer in a pretty literal sense now. What happens when the Internet is
faster than a hard drive?
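The orders-of-magnitude claim checks out with rough ballpark figures (assumed here: ~100 ns for main memory, ~10 ms for a disk seek, the ~45 ms website latency from the text):

```python
import math

# Rough latency gaps between subsystems, using ballpark figures:
# main memory (~100 ns), a disk seek (~10 ms), a fast website (~45 ms).
MEMORY_S = 100e-9
DISK_SEEK_S = 10e-3
WEBSITE_S = 45e-3

memory_to_disk = math.log10(DISK_SEEK_S / MEMORY_S)  # ~5 orders of magnitude
disk_to_web = math.log10(WEBSITE_S / DISK_SEEK_S)    # same order of magnitude

print(f"memory->disk: {memory_to_disk:.0f} orders; disk->web: {disk_to_web:.2f} orders")
```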
I hope this diagram is useful. It's fascinating for me to look at all these numbers together
and see how far we've come. Sources are posted as a comment. I posted a full diagram
showing both north and south bridges here if you're interested.
7. Cache: a place for concealment and safekeeping

This post shows briefly how CPU caches are organized in modern Intel processors. Cache
discussions often lack concrete examples, obfuscating the simple concepts involved. Or
maybe my pretty little head is slow. At any rate, here's half the story on how a Core 2 L1
cache is accessed:
The unit of data in the cache is the line, which is just a contiguous chunk of bytes in
memory. This cache uses 64-byte lines. The lines are stored in cache banks or ways, and
each way has a dedicated directory to store its housekeeping information. You can imagine
each way and its directory as columns in a spreadsheet, in which case the rows are the sets.
Then each cell in the way column contains a cache line, tracked by the corresponding cell in
the directory. This particular cache has 64 sets and 8 ways, hence 512 cells to store cache
lines, which adds up to 32KB of space.
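The geometry arithmetic above can be written out directly:

```python
# Geometry of the example Core 2 L1 cache: 64-byte lines, 64 sets, 8 ways.
LINE_BYTES = 64
SETS = 64
WAYS = 8

cells = SETS * WAYS                    # slots available for cache lines
total_kb = cells * LINE_BYTES // 1024  # total capacity in KB

print(f"{cells} cells, {total_kb} KB total")
```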
In this cache's view of the world, physical memory is divided into 4KB physical pages. Each
page has 4KB / 64 bytes == 64 cache lines in it. When you look at a 4KB page, bytes 0
through 63 within that page are in the first cache line, bytes 64-127 in the second cache
line, and so on. The pattern repeats for each page, so the 3rd line in page 0 is different from
the 3rd line in page 1.
In a fully associative cache any line in memory can be stored in any of the cache cells.
This makes storage flexible, but it becomes expensive to search for cells when accessing
them. Since the L1 and L2 caches operate under tight constraints of power consumption,
physical space, and speed, a fully associative cache is not a good trade-off in most
scenarios.
Instead, this cache is set associative, which means that a given line in memory can only
be stored in one specific set (or row) shown above. So the first line of any physical page
(bytes 0-63 within a page) must be stored in row 0, the second line in row 1, etc. Each row
has 8 cells available to store the cache lines it is associated with, making this an 8-way
associative set. When looking at a memory address, bits 11-6 determine the line number
within the 4KB page and therefore the set to be used. For example, physical address
0x800010a0 has 000010 in those bits, so it must be stored in set 2.
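The set selection is a simple shift-and-mask; a sketch using the example address:

```python
# Extract the set index from a physical address: bits 11-6 give the
# line number within the 4KB page, which is also the set number.
def set_index(phys_addr):
    return (phys_addr >> 6) & 0x3F   # drop 6 offset bits, keep 6 set bits

print(f"0x800010a0 -> set {set_index(0x800010a0)}")
```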
But we still have the problem of finding which cell in the row holds the data, if any. That's
where the directory comes in. Each cached line is tagged by its corresponding directory cell;
the tag is simply the number of the page where the line came from. The processor can
address 64GB of physical RAM, so there are 64GB / 4KB == 2^24 of these pages, and thus we
need 24 bits for our tag. Our example physical address 0x800010a0 corresponds to page
number 524,289. Here's the second half of the story:
Since we only need to look in one set of 8 ways, the tag matching is very fast; in fact,
electrically all tags are compared simultaneously, which I tried to show with the arrows. If
there's a valid cache line with a matching tag, we have a cache hit. Otherwise, the request
is forwarded to the L2 cache, and failing that to main system memory. Intel builds large L2
caches by playing with the size and quantity of the ways, but the design is the same. For
example, you could turn this into a 64KB cache by adding 8 more ways. Then increase the
number of sets to 4096 and each way can store 256KB. These two modifications would
deliver a 4MB L2 cache. In this scenario, you'd need 18 bits for the tags and 12 for the set
index; the physical page size used by the cache is equal to its way size.
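The scaled-up geometry can be checked the same way (a sketch, using the 36-bit physical address space from this processor):

```python
# Scaling the design: a 4MB L2 with 64-byte lines, 4096 sets, 16 ways.
# With 36 physical address bits, derive the tag and index widths.
PHYS_ADDR_BITS = 36   # 64GB of addressable physical memory
LINE_BYTES = 64
SETS = 4096
WAYS = 16

offset_bits = (LINE_BYTES - 1).bit_length()   # 6 bits select a byte in a line
index_bits = (SETS - 1).bit_length()          # 12 bits select a set
tag_bits = PHYS_ADDR_BITS - index_bits - offset_bits

total_mb = LINE_BYTES * SETS * WAYS // 2**20
way_kb = LINE_BYTES * SETS // 1024            # one way's worth of lines

print(f"{total_mb} MB cache, {way_kb} KB per way, "
      f"{tag_bits}-bit tag, {index_bits}-bit set index")
```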
If a set fills up, then a cache line must be evicted before another one can be stored. To
avoid this, performance-sensitive programs try to organize their data so that memory
accesses are evenly spread among cache lines. For example, suppose a program has an
array of 512-byte objects such that some objects are 4KB apart in memory. Fields in these
objects fall into the same lines and compete for the same cache set. If the program
frequently accesses a given field (e.g., the vtable, by calling a virtual method), the set will
likely fill up and the cache will start thrashing as lines are repeatedly evicted and later
reloaded. Our example L1 cache can only hold the vtables for 8 of these objects due to set
size. This is the cost of the set associativity trade-off: we can get cache misses due to set
conflicts even when overall cache usage is not heavy. However, due to the relative speeds
in a computer, most apps don't need to worry about this anyway.
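The 4KB-stride conflict is easy to demonstrate with the set-index arithmetic (the base address below is hypothetical, chosen only for illustration):

```python
# Why 4KB-strided objects collide: addresses a multiple of 4KB apart
# have identical bits 11-6, so their lines all map to the same set.
def set_index(phys_addr):
    return (phys_addr >> 6) & 0x3F   # bits 11-6

base = 0x10000                                # hypothetical array start
addrs = [base + i * 4096 for i in range(10)]  # 10 objects, 4KB apart
sets_used = {set_index(a) for a in addrs}

print(f"10 addresses, {len(sets_used)} distinct set(s)")  # all in one set
```

With only 8 ways per set, the 9th and 10th objects' lines cannot coexist with the first 8, so the set thrashes.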
A memory access usually starts with a linear (virtual) address, so the L1 cache relies on the
paging unit to obtain the physical page address used for the cache tags. By contrast, the set
index comes from the least significant bits of the linear address and is used without
translation (bits 11-6 in our example). Hence the L1 cache is physically tagged but
virtually indexed, helping the CPU to parallelize lookup operations. Because the L1 way is
never bigger than an MMU page, a given physical memory location is guaranteed to be
associated with the same set even with virtual indexing. L2 caches, on the other hand, must
be physically tagged and physically indexed because their way size can be bigger than MMU
pages. But then again, by the time a request gets to the L2 cache the physical address was
already resolved by the L1 cache, so it works out nicely.
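The reason virtual indexing is safe for this L1 can be shown directly: the set-index bits fall inside the page offset, which paging never changes. Both addresses below are hypothetical, sharing only the same low 12 bits:

```python
# The set index (bits 11-6) lies entirely within the page offset
# (bits 11-0), which is identical in the virtual and physical address.
def set_index(addr):
    return (addr >> 6) & 0x3F

PAGE_MASK = 0xFFF    # low 12 bits survive translation unchanged

virt = 0xB7700123    # hypothetical virtual address
phys = 0x01865123    # hypothetical physical mapping, same page offset

assert virt & PAGE_MASK == phys & PAGE_MASK
print("same set either way:", set_index(virt) == set_index(phys))
```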
Finally, a directory cell also stores the state of its corresponding cached line. A line in the L1
code cache is either Invalid or Shared (which means valid, really). In the L1 data cache and
the L2 cache, a line can be in any of the 4 MESI states: Modified, Exclusive, Shared, or
Invalid. Intel caches are inclusive: the contents of the L1 cache are duplicated in the L2
cache. These states will play a part in later posts about threading, locking, and that kind of
stuff. Next time we'll look at the front side bus and how memory access really works. This is
going to be memory week.
Update: Dave brought up direct-mapped caches in a comment below. They're basically a
special case of set-associative caches that have only one way. In the trade-off spectrum,
they're the opposite of fully associative caches: blazing fast access, lots of conflict misses.
8. Getting Physical with Memory

When trying to understand complex systems, you can often learn a lot by stripping away
abstractions and looking at their lowest levels. In that spirit we take a look at memory and
I/O ports at their simplest and most fundamental level: the interface between the processor
and bus. These details underlie higher-level topics like thread synchronization and the need
for the Core i7. Also, since I'm a programmer I ignore things EE people care about. Here's
our friend the Core 2 again:
A Core 2 processor has 775 pins, about half of which only provide power and carry no data.
Once you group the pins by functionality, the physical interface to the processor is
surprisingly simple. The diagram shows the key pins involved in a memory or I/O portoperation: address lines, data pins, and request pins. These operations take place in thecontext of a transaction on the front side bus. FSB transactions go through 5 phases:arbitration, request, snoop, response, and data. Throughout these phases, different rolesare played by the components on the FSB, which are called agents. Normally the agentsare all the processors plus the northbridge.
We only look at the request phase in this post, in which 2 packets are output by the
request agent, who is usually a processor. Here are the juiciest bits of the first packet,
output by the address and request pins:
The address lines output the starting physical memory address for the transaction. We have
33 bits, but they are interpreted as bits 35-3 of an address in which bits 2-0 are zero. Hence
we have a 36-bit address, aligned to 8 bytes, for a total of 64GB of addressable physical
memory. This has been the case since the Pentium Pro. The request pins specify what type
of transaction is being initiated; in I/O requests the address pins specify an I/O port rather
than a memory address. After the first packet is output, the same pins transmit a second
packet in the subsequent bus clock cycle:
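The address-width arithmetic above is worth spelling out:

```python
# 33 address lines carry bits 35-3; bits 2-0 are implied zero, so
# addresses are 8-byte aligned and 36 bits wide in total.
ADDRESS_LINES = 33
IMPLIED_ZERO_BITS = 3

address_bits = ADDRESS_LINES + IMPLIED_ZERO_BITS
addressable_gb = 2**address_bits // 2**30

print(f"{address_bits}-bit physical addresses -> {addressable_gb} GB")
```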
The attribute signals are interesting: they reflect the 5 types of memory caching behavior
available in Intel processors. By putting this information on the FSB, the request agent lets
other processors know how this transaction affects their caches, and how the memory
controller (northbridge) should behave. The processor determines the type of a given
memory region mainly by looking at page tables, which are maintained by the kernel.
Typically kernels treat all RAM memory as write-back, which yields the best
performance. In write-back mode the unit of memory access is the cache line, 64 bytes in
the Core 2. If a program reads a single byte in memory, the processor loads the whole
cache line that contains that byte into the L2 and L1 caches. When a program writes to
memory, the processor only modifies the line in the cache, but does not update main
memory. Later, when it becomes necessary to post the modified line to the bus, the whole
cache line is written at once. So most requests have 11 (binary) in their length field, for 64
bytes. Here's a read example in which the data is not in the caches:
Some of the physical memory range in an Intel computer is mapped to devices like hard
drives and network cards instead of actual RAM memory. This allows drivers to
communicate with their devices by writing to and reading from memory. The kernel marks
these memory regions as uncacheable in the page tables. Accesses to uncacheable
memory regions are reproduced on the bus exactly as requested by a program or driver.
Hence it's possible to read or write single bytes, words, and so on. This is done via the byte
enable mask in packet B above.
The primitives discussed here have many implications.