Page reclaim

Linux Kernel: Page Reclaim — 吉田雅徳 @siburu — 2014/7/27 (Sun)

Uploaded by siburu on 30-Jun-2015

DESCRIPTION

An investigation of the basics of Linux's page reclaim mechanism.

TRANSCRIPT

Page 1: Page reclaim

Linux Kernel

Page Reclaim — 吉田雅徳 @siburu

2014/7/27 (Sun)

Page 2: Page reclaim

1. Recap of the previous session

Page 3: Page reclaim

What’s a Page Frame

❖ page frame = a page-sized, page-aligned piece of RAM

❖ struct page = a kernel structure in one-to-one correspondence with each page frame

❖ mem_map

❖ A single array of struct page entries covering all the RAM that the kernel manages.

❖ But in a CONFIG_SPARSEMEM environment:

❖ There is no single mem_map.

❖ Instead, there is a list of 2MB-sized arrays of struct page entries.

❖ You must use __pfn_to_page(), __page_to_pfn(), or wrappers around them.
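The flat (non-SPARSEMEM) lookup can be sketched as a tiny userspace model: mem_map is one array of struct page indexed by page frame number, so the two conversions are just array indexing and pointer arithmetic. The helper names `pfn_to_page_flat`/`page_to_pfn_flat`, the array size, and the one-field struct page are illustrative, not the kernel's own definitions.

```c
#include <assert.h>

/* Minimal model: one struct page per page frame. */
struct page { unsigned long flags; };

#define NR_PAGES 1024
static struct page mem_map[NR_PAGES];

/* pfn -> struct page *: plain array indexing into mem_map */
static struct page *pfn_to_page_flat(unsigned long pfn)
{
    return &mem_map[pfn];
}

/* struct page * -> pfn: pointer arithmetic against the array base */
static unsigned long page_to_pfn_flat(struct page *page)
{
    return (unsigned long)(page - mem_map);
}
```

Under SPARSEMEM this no longer works, which is why the real __pfn_to_page()/__page_to_pfn() macros must be used instead of raw mem_map arithmetic.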

Page 4: Page reclaim

What’s NUMA

❖ NUMA (Non-Uniform Memory Access)

❖ The system is composed of nodes.

❖ Each node is defined by a set of CPUs and one physical memory range.

❖ Memory access latency differs depending on the source and destination nodes.

❖ NUMA configuration

❖ ACPI provides the NUMA configuration:

❖ SRAT (System Resource Affinity Table)

❖ Describes which CPUs and memory ranges belong to which NUMA node.

❖ SLIT (System Locality Information Table)

❖ Describes how far each NUMA node is from every other node.
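The SLIT is essentially an N×N matrix of relative distances; by ACPI convention a node's distance to itself is 10, and remote distances are scaled relative to that. A minimal sketch (the sample remote value 20 and the two-node layout are made up for illustration):

```c
#include <assert.h>

#define NR_NODES 2

/* Toy SLIT: slit[from][to] = relative distance.
 * Local distance is 10 by ACPI convention; 20 here means
 * "remote access costs twice as much as local". */
static const unsigned char slit[NR_NODES][NR_NODES] = {
    { 10, 20 },   /* node 0 -> {node 0, node 1} */
    { 20, 10 },   /* node 1 -> {node 0, node 1} */
};

static int node_distance(int from, int to)
{
    return slit[from][to];
}
```

The kernel exposes the same information through its own node_distance() helper, which allocation and scheduling code uses to prefer nearby memory.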

Page 5: Page reclaim

What’s a Memory Zone

❖ Physical memory is divided by address range:

❖ ZONE_DMA: <16MB

❖ ZONE_DMA32: <4GB

❖ ZONE_NORMAL: the rest

❖ ZONE_MOVABLE: empty by default.

❖ This is used to define a hot-removable physical memory range.
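The address-range split above can be expressed as a simple classifier (a sketch using the x86-64 boundaries from the slide; the function name `phys_to_zone` is illustrative, and ZONE_MOVABLE is omitted since it is empty by default):

```c
#include <assert.h>

enum zone_type { ZONE_DMA, ZONE_DMA32, ZONE_NORMAL };

/* Classify a physical address by the zone boundaries above:
 * DMA below 16MB, DMA32 below 4GB, NORMAL for the rest. */
static enum zone_type phys_to_zone(unsigned long long phys)
{
    if (phys < (16ULL << 20))
        return ZONE_DMA;
    if (phys < (4ULL << 30))
        return ZONE_DMA32;
    return ZONE_NORMAL;
}
```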

Page 6: Page reclaim

Memory node, zone

struct pglist_data {
	struct zone node_zones[MAX_NR_ZONES];
	…
};

[Diagram: two physical address ranges (Range1, Range2) and CPUs 1–4 grouped into NUMA node1 and NUMA node2, each described by its own struct pglist_data]

❖ Every pglist_data provides a zone structure for each ZONE (DMA through MOVABLE), although some of them may be empty.

Page 7: Page reclaim

Memory Allocation

1. First, check thresholds for each zone (threshold = watermark and dirty ratio).

❖ If all zones fail the check, the kernel enters the page reclaim path (today’s topic).

2. If some zone is OK, allocate a page from that zone’s buddy system.

❖ An order-0 page is allocated from the per-CPU cache.

❖ A higher-order page is taken from the per-order free lists.
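In the buddy system, order n means 2^n contiguous pages, so an allocation size must first be rounded up to an order. A sketch of that computation (the helper name `size_to_order` is illustrative; the kernel's equivalent is get_order()):

```c
#include <assert.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Round a byte size up to a buddy-system order:
 * order 0 = one 4KB page, order n = 2^n contiguous pages. */
static unsigned int size_to_order(unsigned long size)
{
    unsigned long pages = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
    unsigned int order = 0;

    while ((1UL << order) < pages)
        order++;
    return order;
}
```

Only order-0 requests hit the per-CPU cache; anything that rounds to order 1 or higher goes straight to the per-order free lists.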

Page 8: Page reclaim

Memory Deallocation

❖ A page is returned to the buddy system.

❖ An order-0 page is returned to the per-CPU cache via free_hot_cold_page().

❖ Cold page: a page estimated not to be in the CPU cache

❖ Linked to the tail of the per-CPU cache’s LRU list.

❖ Hot page: a page estimated to be in the CPU cache

❖ Linked to the head of the per-CPU cache’s LRU list.

❖ A higher-order page is returned directly to the per-order free lists.
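The hot/cold distinction is just head-vs-tail insertion into the per-CPU list, so hot (cache-warm) pages are handed out again first. A toy model (the array-backed list, its capacity, and the helper name `free_hot_cold` are illustrative; the kernel uses a real linked list and spills overflow back to the buddy system):

```c
#include <assert.h>

#define CACHE_CAP 8

struct pcp_cache {
    int pfns[CACHE_CAP]; /* pfns[0] is the list head */
    int count;
};

static void free_hot_cold(struct pcp_cache *c, int pfn, int cold)
{
    int i;

    if (c->count == CACHE_CAP)
        return; /* a real kernel would spill to the buddy system */
    if (cold) {
        c->pfns[c->count] = pfn;          /* cold: append at the tail */
    } else {
        for (i = c->count; i > 0; i--)    /* hot: shift, insert at head */
            c->pfns[i] = c->pfns[i - 1];
        c->pfns[0] = pfn;
    }
    c->count++;
}
```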

Page 9: Page reclaim

Buddy System

[Diagram: the per-CPU cache holds order-0 4KB pages on HOT (head) and COLD (tail) lists and serves order-0 (de)allocations; the per-zone buddy system holds per-order free lists from order 0 (4KB) up to order 10 (4MB).]

Page 10: Page reclaim

2. Page Reclaim

2.1 Direct reclaim

2.2 Daemon reclaim

Page 11: Page reclaim

Review of the page allocation flow

❖ __alloc_pages_nodemask (the core page allocation function)

❖ get_page_from_freelist (1st try: local zones, low wmark) → get_page_from_freelist (2nd try: all zones)

❖ __alloc_pages_slowpath

1. wake_all_kswapds (wake up the kswapds)

2. get_page_from_freelist (3rd try: all zones, min wmark)

3. if {__GFP,PF}_MEMALLOC → __alloc_pages_high_priority

4. __alloc_pages_direct_compact (asynchronous)

5. __alloc_pages_direct_reclaim (reclaim pages directly in this context)

6. if no progress was made (did_some_progress == 0) → __alloc_pages_may_oom

7. Retry (back to 2), or __alloc_pages_direct_compact (synchronous)

Page 12: Page reclaim

2.1 Direct Reclaim (reclaim performed by the allocation requester itself)

Page 13: Page reclaim

__alloc_pages_direct_reclaim()

❖ __perform_reclaim

❖ current->flags |= PF_MEMALLOC

❖ So that the emergency reserves can be used if a page allocation becomes necessary in the course of page reclaim

❖ try_to_free_pages

❖ throttle_direct_reclaim

❖ if !pfmemalloc_watermark_ok → wait until kswapd makes it OK

❖ do_try_to_free_pages

❖ current->flags &= ~PF_MEMALLOC

❖ get_page_from_freelist

❖ drain_all_pages

❖ get_page_from_freelist

Page 14: Page reclaim

pfmemalloc_watermark_ok()

❖ ARGS

❖ pgdat (type: struct pglist_data)

❖ RETURN

❖ type: bool

❖ node’s free_pages > 0.5 * node’s min_wmark

❖ DESC

❖ Per node (not per zone), compares the amount of free pages against half of the min watermark; OK if above it.

❖ If below, returns false and wakes up that node’s kswapd.

❖ This function sets the threshold at which a memory-strained node gives up on direct reclaim and leaves the work to kswapd.
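The check itself is a one-line comparison over node-wide (summed) counters. A sketch (the helper name `pfmemalloc_wmark_ok` and its flattened arguments are illustrative; the real function walks the node's zones to produce the two totals):

```c
#include <assert.h>

/* Per node: direct reclaim may proceed only while free pages
 * exceed half of the node-wide min watermark. */
static int pfmemalloc_wmark_ok(unsigned long node_free_pages,
                               unsigned long node_min_wmark)
{
    return node_free_pages > node_min_wmark / 2;
}
```

When this returns false, the caller throttles: it wakes kswapd and sleeps until the node recovers, instead of piling more direct reclaimers onto an already strained node.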

Page 15: Page reclaim

do_try_to_free_pages()

❖ The core function for page reclaim, called from three different places

❖ try_to_free_pages() → global reclaim path via __alloc_pages_nodemask()

❖ try_to_free_mem_cgroup_pages() → per-memcg reclaim path

❖ Right before per-memcg slab allocation

❖ Right before per-memcg file page allocation

❖ Right before per-memcg anon page allocation

❖ Right before per-memcg swapin allocation

❖ shrink_all_memory() → hibernation path

❖ Arguments: (1) struct zonelist *zonelist (2) struct scan_control *sc

Page 16: Page reclaim

struct scan_control

struct scan_control {
	unsigned long nr_scanned;
	unsigned long nr_reclaimed;
	unsigned long nr_to_reclaim;
	…
	int swappiness; // 0..100
	…
	struct mem_cgroup *target_mem_cgroup;
	…
	nodemask_t *nodemask;
};

Page 17: Page reclaim

What do_try_to_free_pages does

❖ A loop over the following two:

❖ shrink_zones()

❖ Described later

❖ wakeup_flusher_threads()

❖ Called each time shrink_zones has scanned at least 1.5x the reclaim target (scan_control::nr_to_reclaim) in pages.

❖ Requests all block devices (bdi) to write back up to as many pages as were scanned.
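The 1.5x trigger reduces to a single integer comparison, computable without floating point. A sketch (the helper name `should_wakeup_flushers` is illustrative; the real code also tracks how much has already been handed to the flushers):

```c
#include <assert.h>

/* Wake the flusher threads once at least 1.5x the reclaim target
 * has been scanned: nr_scanned >= nr_to_reclaim * 3 / 2, written
 * so that it cannot overflow for realistic page counts. */
static int should_wakeup_flushers(unsigned long nr_scanned,
                                  unsigned long nr_to_reclaim)
{
    return nr_scanned >= nr_to_reclaim + nr_to_reclaim / 2;
}
```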

Page 18: Page reclaim

shrink_zones()

1. for_each_zone_zonelist_nodemask:

1. mem_cgroup_soft_limit_reclaim

❖ while mem_cgroup_largest_soft_limit_node:

❖ mem_cgroup_soft_reclaim

❖ Before proceeding to shrink_zone, finishes reclaiming pages from memcgs that use this zone and have exceeded their soft limit

2. shrink_zone

❖ foreach mem_cgroup_iter:

❖ shrink_lruvec

❖ In the global-reclaim case, this iteration reclaims starting from the root memcg

2. shrink_slab

❖ Slabs will be covered in a later session…

Page 19: Page reclaim

shrink_lruvec()

❖ The per-zone page freer

1. get_scan_count

❖ Determines the target number of pages to reclaim

2. while the target is not yet met:

❖ shrink_list(LRU_INACTIVE_ANON)

❖ shrink_list(LRU_ACTIVE_ANON)

❖ shrink_list(LRU_INACTIVE_FILE)

❖ shrink_list(LRU_ACTIVE_FILE)

3. if inactive anonymous memory alone is insufficient:

❖ shrink_active_list

Page 20: Page reclaim

shrink_list()

❖ Calls shrink_{active or inactive}_list; however, an active list is shrunk only when it is larger than its paired inactive list

1. if an ACTIVE list is specified:

❖ if size of lru(ACTIVE) > size of lru(INACTIVE):

❖ shrink_active_list

2. else:

❖ shrink_inactive_list
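The decision above can be sketched as one predicate (the enum and the helper name `should_shrink` are illustrative, collapsing the kernel's per-LRU-type logic into a single function):

```c
#include <assert.h>

enum lru_kind { LRU_INACTIVE, LRU_ACTIVE };

/* shrink_list() rule: an active list is shrunk only while it is
 * larger than its paired inactive list; inactive lists are always
 * eligible for shrinking. */
static int should_shrink(enum lru_kind which,
                         unsigned long active_size,
                         unsigned long inactive_size)
{
    if (which == LRU_ACTIVE)
        return active_size > inactive_size;
    return 1;
}
```

This keeps the two lists roughly balanced: pages drain from active to inactive only while the active side dominates.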

Page 21: Page reclaim

shrink_{active,inactive}_list

❖ shrink_active_list()

1. Traverses the pages in an active list

2. Finds inactive pages in the list and moves them to an inactive list

❖ shrink_inactive_list()

❖ foreach page:

1. if page_mapped(page) → try_to_unmap(page)

2. if PageDirty(page) → pageout(page)

Page 22: Page reclaim

What counts as an inactive page

❖ If !laptop_mode

❖ Simply take the specified number of pages from the tail of the active LRU list as inactive pages

❖ If laptop_mode

❖ Take the specified number of clean pages from the tail of the active LRU list as inactive pages
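The laptop_mode difference is a filter during isolation: dirty pages are skipped so reclaim does not force writeback (and thus disk spin-up). A toy model (the dirty-flag array, the tail-first index convention, and the helper name `take_inactive` are all illustrative):

```c
#include <assert.h>

/* Take up to nr_to_take pages from the tail of an LRU list,
 * modeled as an array of dirty flags where index list_len-1 is
 * the tail. In laptop_mode, dirty pages are skipped. Returns
 * the number taken; taken[] receives their indices, tail first. */
static int take_inactive(const int *dirty, int list_len,
                         int nr_to_take, int laptop_mode,
                         int *taken)
{
    int i, n = 0;

    for (i = list_len - 1; i >= 0 && n < nr_to_take; i--) {
        if (laptop_mode && dirty[i])
            continue;  /* laptop_mode: clean pages only */
        taken[n++] = i;
    }
    return n;
}
```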

Page 23: Page reclaim

try_to_unmap()

❖ Unmaps a specified page from all of its mappings

1. Set up a struct rmap_walk_control.

2. rmap_walk_{file, anon, or ksm}

❖ An rmap walk iterates over the VMAs and unmaps the page from each

A. file: traverse the address_space::i_mmap tree

B. anon: traverse the anon_vma tree

C. ksm: traverse all merged anon_vma trees

❖ Each operation is similar to that for anon

Page 24: Page reclaim

A. rmap_walk_file

[Diagram: page → address_space (inode), i_mmap (type: rb_root) → VMAs → page tables; the page is unmapped from each page table]

Page 25: Page reclaim

B. rmap_walk_anon

[Diagram: page → anon_vma, rb_root (type: rb_root) → VMAs → page tables; the page is unmapped from each page table]

Page 26: Page reclaim

C. rmap_walk_ksm

[Diagram: page → stable_node hlist → the anon_vmas of all merged mappings → VMAs → page tables; the page is unmapped from each page table]

Page 27: Page reclaim

2.2 Daemon Reclaim (reclaim performed on the requester's behalf by kswapd)

Page 28: Page reclaim

kswapd

❖ Processing overview

1. Wake up

2. balance_pgdat()

3. Sleep

❖ balance_pgdat()

❖ Works until all zones of the pgdat are at or above the high watermark.

❖ Reclaim function: kswapd_shrink_zone()
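The stopping condition of balance_pgdat() can be sketched as a loop over the node's zones that keeps reclaiming until every zone clears its high watermark. Everything here is a model: `reclaim_some` stands in for kswapd_shrink_zone(), and the fixed per-pass yield is invented for illustration.

```c
#include <assert.h>

#define NR_ZONES 3

static void reclaim_some(unsigned long *free)
{
    *free += 4; /* pretend each pass frees a few pages */
}

/* Model of balance_pgdat(): reclaim in each zone until it reaches
 * its high watermark; returns the number of reclaim passes run. */
static int balance_node(unsigned long free[NR_ZONES],
                        const unsigned long high_wmark[NR_ZONES])
{
    int i, passes = 0;

    for (i = 0; i < NR_ZONES; i++) {
        while (free[i] < high_wmark[i]) {
            reclaim_some(&free[i]);
            passes++;
        }
    }
    return passes;
}
```

Because kswapd targets the high watermark (while allocation only demands the min/low watermarks), it builds a cushion of free pages so that subsequent allocations can stay on the fast path.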