Post on 12-Jan-2016


Linux Virtual Memory for Intel Processor

Debzani Deb


Overview

• Overview of virtual memory.
• What support the Intel architecture provides for virtual memory.
• How Linux uses that hardware support to implement virtual memory.
• Process address space.
• Page fault handler.
• Additional improvements in kernel 2.6.
• References.


Introduction

• In a “Virtual Memory” environment, a large logical address space is simulated with a small amount of physical memory (RAM) and some disk storage (swap space).
• Logical addresses issued by the processor are converted to physical addresses during program execution.
• The implementation requires extensive hardware assistance, a lot of complex OS code, and run-time overhead.
• Virtual memory can be implemented as
– Paging: fixed-size memory blocks.
– Segmentation: variable-size memory blocks.
• Fetch technique: demand paging.
• Replacement technique: Least Recently Used (LRU) algorithm.


Why Virtual Memory?

[Figure: RAM holding the OS (8 MB), Process 1 (50 MB), Process 2 (50 MB) and Process 3 (30 MB)]

A process may be too big for physical memory.

There may be more active processes than physical memory can hold.

Solution: “Virtual Memory”, where a large virtual address space (4 GB) for each process is simulated with a small amount of physical memory (RAM) and some disk storage (swap space).


Virtual Memory

[Figure: pages 1 to 7 of Process 1 and of Process 2, and the contents of RAM (the OS, 8 MB, plus a few resident pages) as Process 1 runs and then sleeps, Process 2 is scheduled and runs, and Process 2 faults.]

The system works because the principle of locality holds.

Thrashing: the system swaps pages in and out all the time, and no real work is done.
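The effect of locality and thrashing can be illustrated with a small simulation. The following is a minimal sketch, not kernel code: it counts page faults for a reference string under strict LRU replacement, and the function name `lru_count_faults` and the `MAX_FRAMES` bound are assumptions made for the example.

```c
#include <stddef.h>

#define MAX_FRAMES 16

/* Count page faults for a reference string under strict LRU replacement.
 * nframes must be between 1 and MAX_FRAMES. */
int lru_count_faults(const int *refs, size_t nrefs, int nframes)
{
    int page[MAX_FRAMES];      /* page currently held by each frame */
    size_t last[MAX_FRAMES];   /* time of the most recent use of each frame */
    int used = 0, faults = 0;

    for (size_t t = 0; t < nrefs; t++) {
        int hit = -1;
        for (int i = 0; i < used; i++) {
            if (page[i] == refs[t]) {
                hit = i;
                break;
            }
        }
        if (hit >= 0) {            /* page is resident: just record the use */
            last[hit] = t;
            continue;
        }

        faults++;
        int victim;
        if (used < nframes) {
            victim = used++;       /* a free frame is still available */
        } else {
            victim = 0;            /* evict the least recently used frame */
            for (int i = 1; i < used; i++)
                if (last[i] < last[victim])
                    victim = i;
        }
        page[victim] = refs[t];
        last[victim] = t;
    }
    return faults;
}
```

For a loop over three pages (1, 2, 3 repeated three times), three frames give only the three initial faults, while two frames make every one of the nine references fault, which is the thrashing case.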


IA-32 Virtual Memory

• The IA-32 architecture supports either pure segmentation or segmentation combined with paging.
• Logical address
– Consists of a segment selector (16 bits) and an offset (32 bits).
• Linear address (LA) or virtual address (VA)
– The base address of the segment plus the offset. This 32-bit address can address 4 GB of memory.
• Physical address (PA)
– 32-bit address in RAM.

Logical Address → Segmentation Unit → Linear Address → Paging Unit → Physical Address
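The segmentation step (linear address = segment base + offset) can be sketched as follows. This is a simplified illustration: `struct segment` and `logical_to_linear` are hypothetical names, and the limit is treated as a plain byte count, ignoring the granularity bit of a real descriptor.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical, simplified view of a segment. */
struct segment {
    uint32_t base;    /* linear address where the segment starts */
    uint32_t limit;   /* highest valid offset within the segment */
};

/* Stores base + offset in *linear and returns true when the offset is
 * within the segment limit; returns false otherwise (real hardware would
 * raise a protection exception instead). */
bool logical_to_linear(const struct segment *seg, uint32_t offset,
                       uint32_t *linear)
{
    if (offset > seg->limit)
        return false;
    *linear = seg->base + offset;   /* wraps modulo 2^32 */
    return true;
}
```

With a flat segment (base 0, limit 0xFFFFFFFF) the linear address equals the offset, which is the arrangement the later slides describe for Linux.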


IA-32 Virtual Memory


IA-32 Segmentation (1)

• Segment registers (6)
– Hold segment selectors so that they can be retrieved quickly.
• CS (code segment register) points to a segment containing program instructions. It also includes the Current Privilege Level (CPL) field, which denotes the privilege level: 0 means kernel mode and 3 means user mode.
• DS (data segment register) points to a segment containing static and external data.
• SS (stack segment register) points to a segment containing the current program stack.
• ES, FS and GS are general-purpose segment registers and may refer to arbitrary data segments.


IA-32 Segmentation (2)

• Segment descriptors (8 bytes)
– Unique segment identifier.
– Stored in the Global Descriptor Table (GDT).
– Contain
• 32-bit base address of the segment.
• 20-bit limit.
• 4-bit type that denotes the segment type and access rights.
• DPL (Descriptor Privilege Level) field: 0 means use is restricted to kernel mode only, 3 means both modes.
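The fields listed above are scattered across the 8-byte descriptor. The sketch below extracts them from a descriptor held in a `uint64_t`, following the IA-32 descriptor layout; the helper names (`desc_base`, `desc_limit`, `desc_type`, `desc_dpl`) are made up for this example.

```c
#include <stdint.h>

/* Base address: bits 16..39 hold base[23:0], bits 56..63 hold base[31:24]. */
uint32_t desc_base(uint64_t d)
{
    return (uint32_t)(((d >> 16) & 0xFFFFFFu) | (((d >> 56) & 0xFFu) << 24));
}

/* Limit: bits 0..15 hold limit[15:0], bits 48..51 hold limit[19:16]. */
uint32_t desc_limit(uint64_t d)
{
    return (uint32_t)((d & 0xFFFFu) | (((d >> 48) & 0xFu) << 16));
}

/* Type: the 4-bit field in bits 40..43. */
uint32_t desc_type(uint64_t d)
{
    return (uint32_t)((d >> 40) & 0xFu);
}

/* DPL: the 2-bit field in bits 45..46. */
uint32_t desc_dpl(uint64_t d)
{
    return (uint32_t)((d >> 45) & 0x3u);
}
```

As an illustration, the value 0x00cf9a000000ffff decodes to base 0, limit 0xFFFFF, type 0xA (code, read/execute) and DPL 0, while 0x00cffa000000ffff differs only in having DPL 3. These two constants are example flat code-segment descriptors chosen for the demonstration.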


IA-32 Protection

• Protection
– Intel uses 4 privilege levels, 0 to 3, with 0 being the most privileged.
– The privilege level of the executing program is determined by the privilege level of the code segment currently executing.
• CPL (Current Privilege Level): bits 0 and 1 of the CS (code segment) register.
• The processor changes the CPL when program control is transferred to a code segment with a different privilege level.
– DPL (Descriptor Privilege Level): bits in the segment descriptor. When the currently executing code segment attempts to access a segment, the DPL is compared to the CPL of CS.
– Programs executing at a less privileged level (numerically higher CPL) cannot access segments with a more privileged (numerically lower) DPL, while programs at privilege level 0 can access all segments.
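The CPL/DPL comparison can be written down in a few lines. This is a deliberately simplified sketch for data-segment access: it ignores the requested privilege level (RPL) of the selector, and the function names are assumptions of the example.

```c
#include <stdint.h>
#include <stdbool.h>

/* The CPL is held in the two low-order bits of the CS selector. */
unsigned selector_cpl(uint16_t cs)
{
    return cs & 0x3u;
}

/* Simplified data-segment check: access is allowed when the current
 * privilege level is numerically less than or equal to the DPL.
 * The RPL field of the selector is ignored here. */
bool may_access(unsigned cpl, unsigned dpl)
{
    return cpl <= dpl;
}
```

Code running at CPL 0 passes the check for DPL 0 and DPL 3 segments; code at CPL 3 passes it only for DPL 3 segments.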


Segmentation in Linux

• There is no mode bit to disable segmentation.
• Linux prefers paging over segmentation because of simplicity and portability.
• The pages are divided among 4 segments.
• All processes use the same logical addresses and segment descriptors.
• The GDT is implemented in /arch/i386/kernel/head.S
• Each time the CPL in CS changes, DS and SS are changed correspondingly.
• SS points to the same segment as DS.

Segments used by Linux

Segment      Type                 DPL  Accessed by
Kernel Code  Code, Read, Execute  0    Kernel
Kernel Data  Data, Read, Write    0    Kernel
User Code    Code, Read, Execute  3    Both
User Data    Data, Read, Write    3    Both


Protection in Linux

• Segments overlap in linear address space (/arch/i386/kernel/head.S).
• Thus access is effectively allowed to the entire virtual address space using any of the above segments.
• All processes have two segments
– 0 - 3GB: user segment
– 3GB - 4GB: kernel segment
– The boundary is determined by PAGE_OFFSET = 0xC0000000.
– A process in user mode (CPL = 3) can only access addresses lower than 3 GB (only segments with DPL = 3).
– A process in kernel mode (e.g. after a system call) can access both. When CPL = 0, it can access segments with DPL = 0 or 3.
– Any distinction between code and data is enforced at the page level, not at the segment level: the R/W and U/S bits of the page.
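The 3 GB split reduces to a single comparison against PAGE_OFFSET. A minimal sketch follows; the value of `PAGE_OFFSET` comes from the slide, while `is_kernel_address` and `address_allowed` are hypothetical helper names.

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_OFFSET 0xC0000000u   /* 3 GB boundary between user and kernel space */

/* True when the linear address lies in the kernel part (3 GB to 4 GB). */
bool is_kernel_address(uint32_t la)
{
    return la >= PAGE_OFFSET;
}

/* User mode may touch only user addresses; kernel mode may touch both. */
bool address_allowed(uint32_t la, bool kernel_mode)
{
    return kernel_mode || !is_kernel_address(la);
}
```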


IA-32 Paging

• Paging
– RAM is partitioned into fixed-size page frames.
– The linear address space is divided into pages of the same size.
– The processor uses information contained in page directories and page tables (stored in RAM) to map linear addresses to physical addresses and to generate page fault exceptions.
– Translation Lookaside Buffers (TLBs) store the most recently accessed page directory and page table entries to reduce access time.
• Intel supports 4 KB, 2 MB and 4 MB page sizes.
• Paging is controlled by three flags in the processor’s control registers, which are set by the OS during initialization.
– PG (paging): available in all Intel processors starting from the 80386. Enables paging.
– PSE (page size extensions): introduced in the Pentium processor. Permits large pages (4 MB, or 2 MB when PAE is set).
– PAE (physical address extension): introduced in the Pentium Pro processor. Provides a method of extending physical addresses to 36 bits (64 GB). Supports page sizes of 4 KB and 2 MB.


Page Table and Directories

• The 32-bit linear address is divided into 3 fields (4 KB page)
– Page directory: most significant 10 bits (1024 entries)
– Page table: the intermediate 10 bits (1024 entries)
– Offset: least significant 12 bits (each page is 4 KB)
• In the case of a 4 MB page, the most significant 10 bits select the page directory entry and the remaining 22 bits are the page offset. Page tables are not used. (A 2 MB page, available with PAE, uses a 21-bit offset.)
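The field split for non-PAE paging is plain shifting and masking. A minimal sketch, with helper names invented for the example:

```c
#include <stdint.h>

/* Page directory index: most significant 10 bits. */
unsigned la_dir_index(uint32_t la)
{
    return la >> 22;
}

/* Page table index: the intermediate 10 bits. */
unsigned la_table_index(uint32_t la)
{
    return (la >> 12) & 0x3FFu;
}

/* Offset within a 4 KB page: least significant 12 bits. */
unsigned la_offset_4k(uint32_t la)
{
    return la & 0xFFFu;
}

/* Offset within a 4 MB page: least significant 22 bits. */
unsigned la_offset_4m(uint32_t la)
{
    return la & 0x3FFFFFu;
}
```

For the linear address 0xC0101234 this gives directory index 768, table index 257 and offset 0x234; treated as part of a 4 MB page, the offset is 0x101234.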


Page Directory and Page Table Entries

• When 32-bit addresses and 4 KB pages are used
– 20-bit base address: bits 12 through 31.
– Present: when set, the page is in RAM.
– Read/Write: when set, the page can be both read and written; otherwise it is read-only.
– User/Supervisor: when set, the page is accessible from user mode as well as supervisor mode; when clear, from supervisor mode only.
– Accessed: set each time the paging unit accesses the entry.
– PCD (page-level cache disable) and PWT (page-level write-through).
– Dirty: applies to page table entries only. Set when the page is written to.
– Global: introduced in the Pentium Pro. Applies to page table entries only. When set, it indicates a global page and prevents the entry from being flushed from the TLB when a context switch occurs.
– Page size: applies to page directory entries only. When 1, the entry points directly to a 2 MB/4 MB page frame; when 0, it points to a page table of 4 KB pages.
• These flags are checked by hardware to see whether the requested kind of access can be performed.
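The flag checks can be mimicked in software. The sketch below uses the IA-32 bit positions of the flags listed above; the macro and function names are local to the example, and the check assumes that write protection is also enforced in supervisor mode (CR0.WP set), which is a simplification.

```c
#include <stdint.h>
#include <stdbool.h>

#define PTE_PRESENT  0x001u   /* bit 0: page is in RAM */
#define PTE_RW       0x002u   /* bit 1: writable */
#define PTE_USER     0x004u   /* bit 2: accessible from user mode */
#define PTE_PWT      0x008u   /* bit 3: page-level write-through */
#define PTE_PCD      0x010u   /* bit 4: page-level cache disable */
#define PTE_ACCESSED 0x020u   /* bit 5 */
#define PTE_DIRTY    0x040u   /* bit 6 */
#define PTE_PS       0x080u   /* bit 7: page size (directory entries) */
#define PTE_GLOBAL   0x100u   /* bit 8 */

/* Physical base address of the page frame: bits 12 through 31. */
uint32_t pte_frame(uint32_t entry)
{
    return entry & 0xFFFFF000u;
}

/* Would this access succeed without a page fault?
 * Assumption: write protection applies in supervisor mode too. */
bool pte_access_ok(uint32_t entry, bool user_mode, bool write)
{
    if (!(entry & PTE_PRESENT))
        return false;
    if (user_mode && !(entry & PTE_USER))
        return false;
    if (write && !(entry & PTE_RW))
        return false;
    return true;
}
```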


Paging in Linux (1)

• Linux uses 3-level paging so that it can adapt to 64-bit architectures.
– Page Global Directory (PGD)
– Page Middle Directory (PMD)
– Page table
• The linear address is divided into four parts: three table indexes and a page offset.
• What happens on IA-32, which uses only two-level page tables?
– Linux makes the PMD entry point back to the PGD.
– On IA-32 there are 1024 entries in the PGD, one entry in the PMD and 1024 entries in the page table.
• Each process has its own PGD. During a context switch, the PGD base address of the process executing next is loaded into CR3 and the TLB is flushed.
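The two-level IA-32 walk that the hardware performs, starting from CR3, can be simulated in user space. This is a sketch under stated assumptions: physical memory is modelled by a small word array (`phys`), the helpers `read_phys`, `write_phys` and `walk` are invented for the example, and only the Present and Page Size flags are interpreted.

```c
#include <stdint.h>
#include <stdbool.h>

#define P_PRESENT 0x001u
#define P_PS      0x080u          /* 4 MB page flag in a directory entry */

#define PHYS_WORDS (64 * 1024)    /* simulated RAM: 256 KB of 32-bit words */
uint32_t phys[PHYS_WORDS];

uint32_t read_phys(uint32_t pa)
{
    return phys[(pa / 4) % PHYS_WORDS];
}

void write_phys(uint32_t pa, uint32_t value)
{
    phys[(pa / 4) % PHYS_WORDS] = value;
}

/* Translate a linear address using the directory whose physical base is in
 * cr3. Returns false where the hardware would raise a page fault. */
bool walk(uint32_t cr3, uint32_t la, uint32_t *pa)
{
    uint32_t pde = read_phys((cr3 & 0xFFFFF000u) + 4 * (la >> 22));
    if (!(pde & P_PRESENT))
        return false;
    if (pde & P_PS) {             /* one-level translation, 4 MB page */
        *pa = (pde & 0xFFC00000u) | (la & 0x003FFFFFu);
        return true;
    }

    uint32_t pte = read_phys((pde & 0xFFFFF000u) + 4 * ((la >> 12) & 0x3FFu));
    if (!(pte & P_PRESENT))
        return false;
    *pa = (pte & 0xFFFFF000u) | (la & 0x00000FFFu);
    return true;
}
```

Loading a different value into `cr3` selects a different directory and therefore a different address space, which is what happens on a context switch.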


Paging in Linux (2)

• Linux uses PAE, but doesn’t use PSE.
• It also uses the page size (PS) flag of a PGD entry to select a different page size for that specific entry.
• Mixing 4 MB and 4 KB page sizes
– The kernel uses large pages (4 MB) and one-level translation to reduce TLB entries and memory.
– Applications use 4 KB pages.

PAE  PS of PGD  Page size  Physical address size
0    0          4 KB       32 bit
0    1          4 MB       32 bit
1    0          4 KB       36 bit
1    1          2 MB       36 bit


Paging in Linux (3)

• include/asm-i386/page.h
    #define PAGE_SHIFT 12
    #define PAGE_SIZE (1UL << PAGE_SHIFT)
    #define PAGE_MASK (~(PAGE_SIZE-1))
• include/asm-i386/pgtable.h, include/asm-i386/pgtable-2level.h
• Page table lookup code: mm/memory.c
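The three macros from page.h are typically combined as shown below. The macro definitions are the ones quoted on the slide; the helper functions `page_base`, `page_align_up` and `page_frame_number` are hypothetical names used only for this illustration.

```c
#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define PAGE_MASK  (~(PAGE_SIZE - 1))

/* Start address of the page that contains addr. */
unsigned long page_base(unsigned long addr)
{
    return addr & PAGE_MASK;
}

/* Round addr up to the next page boundary (unchanged if already aligned). */
unsigned long page_align_up(unsigned long addr)
{
    return (addr + PAGE_SIZE - 1) & PAGE_MASK;
}

/* Number of the page frame that contains addr. */
unsigned long page_frame_number(unsigned long addr)
{
    return addr >> PAGE_SHIFT;
}
```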


Paging in Linux (4)

• The linear address space is split into two parts.
– User space (0-3GB) can be addressed in both modes.
– Kernel space (3GB-4GB) can be accessed only in kernel mode.
– PAGE_OFFSET is defined as 0xc0000000 (3 GB).
• Kernel paging (4 MB pages)
– Kernel code and data are stored in a group of reserved page frames.
– These frames are never dynamically assigned or swapped to disk.
– The kernel maintains a set of page tables rooted at the Master Kernel Page Global Directory.
• How does the kernel initialize its own page tables?
– swapper_pg_dir is initialized during kernel compilation.
– Phase 1: the kernel can address the first 8 MB of RAM either through linear addresses identical to the physical addresses or through the 8 MB of linear addresses starting at 0xc0000000.
– Phase 2: only linear addresses starting at 0xc0000000 are translated to physical addresses starting at 0.
• Where does paging start? /arch/i386/kernel/head.S
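Because kernel linear addresses starting at 0xc0000000 map to physical addresses starting at 0, the conversion for this directly mapped range is a fixed offset. A minimal sketch follows; the function names are made up for the example, and the conversion is only meaningful for the directly mapped part of kernel space.

```c
#include <stdint.h>

#define PAGE_OFFSET 0xC0000000u

/* Kernel linear address -> physical address (directly mapped range only). */
uint32_t kernel_virt_to_phys(uint32_t va)
{
    return va - PAGE_OFFSET;
}

/* Physical address -> kernel linear address (directly mapped range only). */
uint32_t phys_to_kernel_virt(uint32_t pa)
{
    return pa + PAGE_OFFSET;
}
```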


Physical Memory Management

• Physical memory is divided into three zones: DMA, Normal and HighMem.
• Page frames are assigned from these zones.
• Each physical page is associated with a page descriptor.
• All page descriptors are stored in the mem_map array.
• Requesting page frames: alloc_pages() allocates groups of contiguous page frames using the buddy system.
• If alloc_pages() can’t find a free page frame, it calls try_to_free_pages() to reclaim some.
• try_to_free_pages() reclaims pages according to an LRU algorithm.
• Memory for small data structures is allocated by the slab allocator.
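The buddy system hands out blocks of 2^order contiguous page frames, and two blocks of the same order are buddies when their starting indexes differ only in bit `order`. The arithmetic can be sketched as follows; `order_for_pages` and `buddy_index` are hypothetical helpers, not the kernel's own functions.

```c
#include <stddef.h>

/* Smallest order such that a block of 2^order pages covers npages. */
unsigned order_for_pages(size_t npages)
{
    unsigned order = 0;
    while (((size_t)1 << order) < npages)
        order++;
    return order;
}

/* Page index of the buddy of the 2^order block that starts at page idx.
 * idx is assumed to be aligned to 2^order. */
size_t buddy_index(size_t idx, unsigned order)
{
    return idx ^ ((size_t)1 << order);
}
```

A request for 3 pages is rounded up to an order-2 block (4 pages); the order-2 block starting at page 8 has its buddy at page 12, and the two can be merged into one order-3 block when both are free.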


Process Address Space

• The linear address space is split into two parts.
– User space (0-3GB) changes with each context switch and is accessed in both modes.
– Kernel space (3GB-4GB) remains constant and is accessed only while in kernel mode.
• Memory descriptor: mm_struct.
– One structure exists for each process and is shared among its threads.
– Memory descriptor for kernel threads.

[Figure: process address space. Kernel code & data lie above PAGE_OFFSET (0xC0000000); user code & data lie below it and comprise the file name and environment, arguments, stack, shared libs, heap, data, code and header.]


Memory Regions

• The full address space is rarely used.
• Each address space consists of several non-overlapping, page-aligned regions that are in use.
– Each region contains pages with the same protection and purpose.
– The list of mapped regions is available through /proc/PID/maps.
– Regions are described by vm_area_struct.
– If a file is memory mapped, the file pointer is available through vm_file.
– do_mmap(), find_vma(), get_unmapped_area()
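A region lookup in the spirit of find_vma() can be sketched over a simple sorted linked list. This is an illustration only: `struct region` is a stripped-down stand-in for vm_area_struct, and `find_region` is a hypothetical name. It returns the first region that ends above the address, which may start above the address too when the address falls into a hole.

```c
#include <stddef.h>

/* Hypothetical, simplified memory region covering [start, end),
 * kept in a list sorted by address. */
struct region {
    unsigned long start;
    unsigned long end;
    struct region *next;
};

/* First region whose end lies above addr, or NULL if there is none. */
struct region *find_region(struct region *head, unsigned long addr)
{
    for (struct region *r = head; r != NULL; r = r->next)
        if (addr < r->end)
            return r;
    return NULL;
}
```

The caller still has to compare `addr` with the `start` of the returned region to decide whether the address is actually mapped.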


Process Address Space

[Figure: the memory descriptor (fields mmap and mmap_cache) and a linked list of memory regions, each with Start, End and Next fields, laid over the linear address space.]


Page Faulting

• Demand fetching
– A page is fetched from swap space only when the hardware raises a page fault exception, which the OS traps before allocating a page.
– A number of pages after the faulting page are prefetched.
• Two types of page fault
– Major: has to read from disk; expensive.
– Minor: page in swap cache, or protection fault.
• Architecture-specific function do_page_fault()
– Basically decides what type of fault occurred and how it can be handled.
– If it is a valid page fault in a valid memory region, it calls the architecture-independent function handle_mm_fault().
• It allocates the required page table entries and calls handle_pte_fault().


Do_page_fault() flow diagram


handle_mm_fault() Call Graph

• handle_mm_fault: allocates the required page table entries, if they don’t exist.
• handle_pte_fault: based on the properties of the entry, the corresponding handler is called.
– do_no_page: first-time allocation.
– do_swap_page: page swapped out to disk.
– do_wp_page: Copy on Write (COW) page.
– do_anonymous_page: handles anonymous access.
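The decision made in handle_pte_fault can be summarised as a small dispatch function. The sketch below is not the kernel source: `struct pte_state`, `enum fault_handler` and `pick_handler` are invented for the illustration, and the handler names in the enum merely mirror the call graph above.

```c
#include <stdbool.h>

enum fault_handler { DO_NO_PAGE, DO_SWAP_PAGE, DO_WP_PAGE, NO_HANDLER };

/* Hypothetical summary of the state of a page table entry. */
struct pte_state {
    bool present;    /* page is in RAM */
    bool none;       /* entry was never filled in */
    bool writable;   /* Read/Write flag is set */
};

/* Choose the handler for a fault on the given entry. */
enum fault_handler pick_handler(struct pte_state pte, bool write_access)
{
    if (!pte.present)
        return pte.none ? DO_NO_PAGE      /* first-time allocation */
                        : DO_SWAP_PAGE;   /* page was swapped out to disk */
    if (write_access && !pte.writable)
        return DO_WP_PAGE;                /* write to a protected page: COW */
    return NO_HANDLER;                    /* nothing left but bookkeeping */
}
```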


Copy on Write (COW)

• During fork the kernel duplicates the parent’s address space for the child. This requires
– Allocating page frames for the page tables of the child process.
– Allocating page frames for the pages of the child process.
– Copying the pages of the parent process to the pages of the child process.
• Linux uses an efficient copy-on-write approach
– The pages and page table entries are shared between the parent and child processes and can’t be modified.
– Whenever either one tries to write, a write fault occurs.
– The kernel then duplicates the page into a new page frame and marks it as writable.
– The original page frame remains write-protected. When the other process tries to write, the kernel checks whether it is the only owner. If so, the page becomes writable.
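The copy-on-write steps can be modelled with a reference count per page frame. This is a user-space sketch under simplifying assumptions: `struct frame`, `struct mapping`, `cow_share` and `cow_write_fault` are hypothetical, and a real kernel tracks sharing through page descriptors and page table flags rather than through these structures.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define FRAME_SIZE 4096

/* Hypothetical page frame with a count of the mappings that share it. */
struct frame {
    int refcount;
    unsigned char data[FRAME_SIZE];
};

/* Hypothetical mapping of one page in one process. */
struct mapping {
    struct frame *frame;
    bool writable;
};

/* At fork: parent and child share the frame, both write-protected. */
void cow_share(struct mapping *parent, struct mapping *child)
{
    parent->writable = false;
    child->frame = parent->frame;
    child->writable = false;
    parent->frame->refcount++;
}

/* On a write fault: copy the frame if it is still shared, otherwise the
 * faulting process is the only owner and the page just becomes writable.
 * Returns false if no memory is available for the copy. */
bool cow_write_fault(struct mapping *m)
{
    if (m->frame->refcount > 1) {
        struct frame *copy = malloc(sizeof *copy);
        if (copy == NULL)
            return false;
        memcpy(copy->data, m->frame->data, FRAME_SIZE);
        copy->refcount = 1;
        m->frame->refcount--;
        m->frame = copy;
    }
    m->writable = true;
    return true;
}
```

The first writer gets a private copy; when the other process later faults on the original frame, its reference count is back to 1, so no copy is made and the page is simply marked writable.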


What’s Different in 2.6

• The big change is Linux's new support for NUMA servers: high-end systems with multiple processors, each with a separate memory pool directly connected to it.
• Support for Intel's PAE (Physical Address Extension) allows access to up to 64 GB of RAM in paged mode. Linux can now run applications that access large blocks of memory.
– For example, bigger databases are now supported on Linux.
• Reverse mapping
– Multiple virtual pages (pages shared by different processes) might point to the same physical page. Reverse mapping lets the kernel find, for a given physical page, the page table entries that map it.
– The technique is useful when the kernel wants to free a particular physical page.


References

• IA-32 Intel® Architecture Software Developer’s Manual, Volume 3: System Programming Guide (Document 253668), Chapters 3 & 4.
• Bovet, D., and Cesati, M. Understanding the Linux Kernel. O'Reilly, 2001. (Chapters 2, 7, 8 & 16)
• Virtual memory management for Linux 2.4 kernel: Description, Code documentation.
• http://home.earthlink.net/~jknapka/linux-mm/vmoutline.html
• Deitel & Deitel, Operating Systems, Prentice Hall, 2004.
• The Wonderful World of Linux 2.6, by Joseph Pranevich.