
ECE5658, Operating Systems Design, Fall 2019, Dongkun Shin ([email protected])

File Systems – F2FS

Dongkun Shin ([email protected])

Embedded Software Laboratory

Sungkyunkwan University

http://nyx.skku.ac.kr/


Log-Structured File System

• Treats the whole disk space as one big contiguous area

• Writes all data sequentially

– An application's random I/O is converted to sequential I/O by LFS

• Frequent metadata updates are a key challenge in LFS

• Recovers quickly by using checkpoints


Log-Structured File System


Log-Structured File System

• Garbage Collection (Cleaner)

– Reuse of segments while writing

– A key challenge in LFS with snapshot


Critical Issues of LFS

• Wandering tree problem

• High cleaning overhead


LFS cleaning

• To make free segments

– The LFS cleaner copies valid blocks to another free segment

• Victim segment selection

– Utilization: how much is to be gained by cleaning

– Age: how likely the segment is to change soon anyway

• On-demand cleaning

– Overall performance decreases

• Background cleaning

– Runs in the background, so it does not affect foreground performance

[Figure: the cleaner copies valid blocks A, B, C, D into a free segment (Segments 1-3)]


Hole-plugging

• Matthews et al. employed hole-plugging in LFS [Matthews et al., ACM OSR ’97]

• The cleaner copies valid blocks into the holes of other segments

• Incurs random writes

[Figure: hole-plugging; valid blocks are copied into the holes of other segments (Segments 1-3)]


Slack Space Recycling (SSR)

• Directly recycles slack space to avoid on-demand cleaning

• Slack space is the invalid area in a used segment

• Incurs random writes

[Figure: SSR; new blocks E, F, G, H held in the segment buffer are written into the slack space of used segments (Segments 1-3)]


F2FS

• A mainlined filesystem (since kernel 3.8)

• Article: An f2fs teardown

– http://lwn.net/Articles/518988/

• Flash-Friendly File System

– Log-structured approach

– Various parameters for adjusting to the geometry of flash memory

• Developed by Samsung

https://www.sammobile.com/news/galaxy-note-10-uses-f2fs-not-ext4-file-system-whats-the-difference/


F2FS Features

• Flash Awareness

– Log-structured approach

– Adjust to the geometry of flash memory

– Align FS data structures to the FTL operation units.


F2FS Features

• Solved wandering tree problem

– Fixed Location Area and New Indexing Scheme (Over-writable)

– Avoiding Metadata Update Propagation


F2FS Features

• Solved High Cleaning Overhead of LFS

• Multi-head Logs and Hot/Cold Data Separation

– Write-time data separation → more chances to get a bimodal distribution of segment utilization

– Two different victim selection policies for foreground and background cleaning

• Automatic background cleaning

• Adaptive Write Policy for High Utilization

– Switches the write policy to threaded logging at the right time (logging to FTL overprovision space)

– Graceful performance degradation at high utilization


File Structure


Disk Layout

• Superblock

– Basic information of the file system

– Disk layout parameters

– Pointer to valid checkpoint

– 1 copy block

• Block (4KB)

• Segment (2MB)

– 512 blocks

• Random Write Area

– Over-writes: allowed

– Updated by: checkpoint operations

– Areas: Checkpoint Area, Node Address Table, Segment Information Table, Segment Summary Area

• Sequential Write Area

– Over-writes: not allowed

– Updated by: file operations and checkpoint operations

– Areas: Main Area (file contents, including directories, and file nodes)
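The block and segment sizes on this slide imply simple address arithmetic. The following is a minimal sketch, assuming block addresses are counted in 4KB blocks from the start of the Main Area; the macro and helper names are hypothetical and are not kernel APIs.

```c
#include <stdint.h>

#define DEMO_BLOCK_SIZE      4096u  /* 4KB block, from the slide */
#define DEMO_BLOCKS_PER_SEG  512u   /* 2MB segment = 512 blocks */

/* Hypothetical helper: number of the segment that contains a block. */
static uint32_t seg_of_block(uint32_t blkaddr)
{
    return blkaddr / DEMO_BLOCKS_PER_SEG;
}

/* Hypothetical helper: offset of the block inside its segment. */
static uint32_t off_in_seg(uint32_t blkaddr)
{
    return blkaddr % DEMO_BLOCKS_PER_SEG;
}

/* Segment size in bytes: 512 * 4096 = 2MB. */
static uint32_t seg_bytes(void)
{
    return DEMO_BLOCKS_PER_SEG * DEMO_BLOCK_SIZE;
}
```

For example, block 1025 falls in segment 2 at offset 1 under these assumptions.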


Disk Layout

• Sections: collection of segments (count configurable as a power of 2)

– Cleaning unit: one section at a time

– Aligned to the zone size

• Six open sections

– Hot / Warm / Cold for nodes and data

– File content (data) is kept separate from indexing information (nodes)

– Hotness is based on various heuristics

• Zones: collection of sections (2MB)

– Keep the six open sections in different zones
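The six open sections can be pictured as a mapping from the kind of block being written to one of six logs. The sketch below is illustrative only: the heuristics it encodes (directory blocks are hot, indirect nodes and multimedia or cleaned data are cold) are assumptions taken from the published F2FS design, and the struct and function names are hypothetical.

```c
#include <stdbool.h>

/* The six logs: hot/warm/cold for data and for nodes. */
enum log_type {
    HOT_DATA, WARM_DATA, COLD_DATA,
    HOT_NODE, WARM_NODE, COLD_NODE
};

/* Hypothetical description of a block about to be written. */
struct blk_info {
    bool is_node;       /* node (index) block vs. data block */
    bool is_dir;        /* belongs to a directory */
    bool is_indirect;   /* indirect node block (nodes only) */
    bool is_cold_file;  /* e.g. multimedia file, or block moved by cleaning */
};

/* Assumed heuristic: choose the log a block should be appended to. */
static enum log_type pick_log(const struct blk_info *b)
{
    if (b->is_node) {
        if (b->is_indirect)
            return COLD_NODE;
        return b->is_dir ? HOT_NODE : WARM_NODE;
    }
    if (b->is_dir)
        return HOT_DATA;
    return b->is_cold_file ? COLD_DATA : WARM_DATA;
}
```

Separating blocks this way at write time is what gives each log a more uniform update frequency, which in turn reduces the valid data the cleaner has to move.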


Segment Type

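[Figure: segment types. Blocks reached through the NAT (directory inode and directory data, file inode and file data, direct and indirect nodes, files such as .jpg, small-size writes) are classified into HOT, WARM, and COLD segments]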


Random Write Area (Metadata)

• Checkpoint (CP)

– File system information at the moment of the checkpoint

– Bitmaps for valid NAT/SIT sets, inode lists, and summary entries of the current active segments

• Node Address Table (NAT)

– Block address table for all the node blocks stored in the main area

– Inode number, pointer to block address of node block

• Segment Information Table (SIT)

– Segment information such as the valid block count and a bitmap for the validity of all the blocks

• Segment Summary Area (SSA)

– Summary entries which contain the owner information (back pointers to NAT-id, inode) of all the data and node blocks


Node Address Table (NAT)

• NAT structure

struct f2fs_nat_entry {
    __u8 version;       /* latest version of cached nat entry */
    __le32 ino;         /* inode number */
    __le32 block_addr;  /* block address */
} __packed;

struct f2fs_nat_block {
    struct f2fs_nat_entry entries[NAT_ENTRY_PER_BLOCK];
} __packed;
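To see how a node id could be located in the table, the sketch below mirrors the entry layout in user space and computes which NAT block and slot a node id falls into. It assumes a 4KB block and the GCC/Clang `packed` attribute; the `demo_` names and both helpers are hypothetical, and little-endian conversion of the `__le32` fields is ignored.

```c
#include <stdint.h>

#define DEMO_BLOCK_SIZE 4096u

/* User-space mirror of the on-disk entry: 1 + 4 + 4 = 9 bytes when packed. */
struct demo_nat_entry {
    uint8_t  version;
    uint32_t ino;
    uint32_t block_addr;
} __attribute__((packed));

/* 4096 / 9 = 455 entries fit in one 4KB NAT block. */
#define DEMO_NAT_ENTRY_PER_BLOCK \
    ((uint32_t)(DEMO_BLOCK_SIZE / sizeof(struct demo_nat_entry)))

/* Hypothetical helper: index of the NAT block that holds a node id. */
static uint32_t nat_block_of(uint32_t nid)
{
    return nid / DEMO_NAT_ENTRY_PER_BLOCK;
}

/* Hypothetical helper: slot of the node id inside that NAT block. */
static uint32_t nat_slot_of(uint32_t nid)
{
    return nid % DEMO_NAT_ENTRY_PER_BLOCK;
}
```

Because the NAT sits at a fixed location and is overwritten in place, updating one 9-byte entry is enough to relocate a node block, which is how the indexing scheme avoids propagating updates up the tree.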


Segment summary area (SSA)

• Summary block

– Has the information of one segment

– 512 (=ENTRIES_IN_SUM) summary entries

• Summary entry

– Node id, node version, offset in node

– Data block: nid of its direct node

– Node block: nid of the node

– Referred to during cleaning and crash recovery

• Spare area

– Journal for NAT/SIT entries

struct f2fs_summary_block {
    struct f2fs_summary entries[ENTRIES_IN_SUM];
    struct f2fs_journal journal;
    struct summary_footer footer;
} __packed;
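A short size calculation shows why a summary block has a spare area at all. The sketch assumes a 4KB block and a packed 7-byte summary entry (4-byte node id, 1-byte version, 2-byte offset); those field widths and the `demo_` names are assumptions made for illustration, not definitions taken from the slide.

```c
#include <stdint.h>

#define DEMO_BLOCK_SIZE      4096u
#define DEMO_ENTRIES_IN_SUM  512u

/* Assumed layout of one summary entry: 4 + 1 + 2 = 7 bytes when packed. */
struct demo_summary {
    uint32_t nid;          /* parent node id */
    uint8_t  version;      /* node version */
    uint16_t ofs_in_node;  /* block index within the parent node */
} __attribute__((packed));

/* Bytes used by the 512 summary entries. */
static uint32_t sum_entries_bytes(void)
{
    return DEMO_ENTRIES_IN_SUM * (uint32_t)sizeof(struct demo_summary);
}

/* Bytes left in the 4KB block for the journal and footer (the spare area). */
static uint32_t sum_spare_bytes(void)
{
    return DEMO_BLOCK_SIZE - sum_entries_bytes();
}
```

Under these assumptions the entries take 3584 bytes and 512 bytes remain, which is the space the journal for NAT/SIT entries and the footer share.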


Cleaning

• Cleaning process

– Reclaims obsolete data scattered across the whole storage to make new empty log space

– Gets victim segments by referencing the segment usage table

– Loads the parent index structures of the data therein, identified from segment summary blocks

– Moves valid data after checking their cross-references

• Goal

– Hide cleaning latencies from users

– Reduce the amount of valid data to be moved

– Move data quickly

• Issues

– Hot and cold data separation

– Victim selection policy


Cleaning

• Six active logs for static hot and cold data separation

• Supports a background cleaning process using a kernel thread

• Background Cleaning (BG Cleaning)

– Triggers the cleaning job when the system is idle

– Cost-benefit policy

• Foreground Cleaning (FG Cleaning)

– Triggered when there are not enough free segments to serve VFS calls

– Greedy policy

• Victim segment selection by policy

– Cost-benefit: according to the number of valid blocks and the segment age

– Greedy: the segment having the smallest number of valid blocks
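The two victim selection policies can be contrasted in a few lines. This is a sketch under stated assumptions: the `seg_info` struct is hypothetical, and the cost-benefit score uses the classic LFS formula (1 - u) * age / (1 + u), which is not necessarily the exact expression F2FS computes. Both pickers assume at least one segment is passed in.

```c
#include <stddef.h>

#define DEMO_BLOCKS_PER_SEG 512

/* Hypothetical per-segment bookkeeping, in the spirit of SIT entries. */
struct seg_info {
    unsigned valid_blocks;  /* number of valid blocks, 0..512 */
    unsigned age;           /* time since last modification, arbitrary units */
};

/* Greedy: pick the segment with the fewest valid blocks. */
static size_t pick_greedy(const struct seg_info *s, size_t n)
{
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (s[i].valid_blocks < s[best].valid_blocks)
            best = i;
    return best;
}

/* Classic LFS cost-benefit score, where u is the segment utilization. */
static double cb_score(const struct seg_info *s)
{
    double u = (double)s->valid_blocks / DEMO_BLOCKS_PER_SEG;
    return (1.0 - u) * s->age / (1.0 + u);
}

/* Cost-benefit: pick the segment with the highest score. */
static size_t pick_cost_benefit(const struct seg_info *s, size_t n)
{
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (cb_score(&s[i]) > cb_score(&s[best]))
            best = i;
    return best;
}
```

Greedy is cheap and frees space fastest, which suits foreground cleaning; cost-benefit also weighs age, so an old segment with somewhat more valid data can still be preferred during background cleaning.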


GC Condition

• Background GC triggering condition

1. Operated by a kernel thread

2. GC is not currently being conducted

3. There are enough invalid blocks

• Adjust sleep time (3min ~ 6min) based on # of invalid blocks

4. The I/O subsystem is idle, judged by the # of writeback pages

5. The I/O subsystem is idle, judged by the # of requests in the bdev's request list

• Foreground GC triggering condition

– When there are not enough free segments to serve VFS calls

• Free Sections ≤ Reserved Sections + Dirty node

• $\text{Reserved Sections} = \left(\frac{1}{\text{overprovision ratio}} + 5\right) \times \text{segs\_per\_sec}$ (Default = 25)
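The foreground trigger can be written out as a small check. This is a sketch, assuming the overprovision ratio is passed as a nonzero integer percentage (so 1 / ratio becomes 100 / percent) and that dirty nodes are counted in sections; the function names are hypothetical.

```c
#include <stdbool.h>

/* Reserved sections per the slide: (1 / ratio + 5) * segs_per_sec.
 * ovp_percent is the ratio as an integer percentage and must be nonzero. */
static unsigned reserved_sections(unsigned ovp_percent, unsigned segs_per_sec)
{
    return (100u / ovp_percent + 5u) * segs_per_sec;
}

/* Foreground GC is needed when free sections drop to the reserve
 * plus the sections required by dirty nodes. */
static bool need_fg_gc(unsigned free_secs, unsigned reserved_secs,
                       unsigned dirty_node_secs)
{
    return free_secs <= reserved_secs + dirty_node_secs;
}
```

With a 5% ratio and one segment per section, the first function yields the default of 25 quoted on the slide.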


Segment Allocation

• Copy-and-compaction scheme (LFS Alloc)

– Good for sequential write performance

– Causes cleaning overhead under high utilization

• Threaded log scheme (SSR Alloc)

– No cleaning process is needed

– Causes random writes

[Figure: LFS Alloc allocates a free segment; SSR Alloc allocates a used segment. Legend: FREE, VALID, INVALID blocks]


Segment Allocation

• Adaptive logging

– Normally, copy-and-compaction is adopted

– If there is not enough free space, the policy is dynamically changed to threaded logging

[Figure: allocation policy as free space shrinks. Above the over-provisioning threshold: LFS Alloc. Between the over-provisioning threshold and the reserved-section threshold: SSR Alloc + LFS Alloc (warm node). Below the reserved-section threshold: LFS Alloc + SSR Alloc + FG Cleaning. At idle time, the BG cleaning process runs. Warm node segments are NOT selected as SSR victims]


SSR Condition

• Exception

– Warm Node

• For recovery

• LFS Allocation

– Utilization == 100%

• In-place update (Linux 3.8)

• FG Cleaning: triggered when there are not enough free segments to serve VFS calls

– Free Sections ≤ Reserved Sections + Dirty Node

– Reserved Sections → depend on the overprovision ratio (Default = 25)

• SSR: the SSR condition is checked before allocating a segment

– Free Sections ≤ Overprovision Sections

– Default overprovision ratio = 5 (format option)
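The SSR condition and the warm-node exception combine into a simple decision made before each segment allocation. The following is a minimal sketch with hypothetical names; it leaves out the 100% utilization case (in-place update) and anything else the real allocator considers.

```c
/* Allocation modes from the slides: append to a free segment (LFS)
 * or reuse invalid blocks of a used segment (SSR). */
enum alloc_mode { ALLOC_LFS, ALLOC_SSR };

/* log_is_warm_node: nonzero for the warm node log, which always uses
 * LFS allocation (the "for recovery" exception on the slide). */
static enum alloc_mode pick_alloc(unsigned free_secs, unsigned ovp_secs,
                                  int log_is_warm_node)
{
    if (log_is_warm_node)
        return ALLOC_LFS;
    return (free_secs <= ovp_secs) ? ALLOC_SSR : ALLOC_LFS;
}
```

Keeping the warm node log strictly sequential is an assumption-free reading of the slide's exception; the reason given there is recovery.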


Adaptive Write Policy

• Experiment

– Embedded system with eMMC 12GB partition

– Iozone random write tests on several 1GB files

• Results

– Sustained performance is improved by adaptive write.


Performance


Block Trace


Performance Drop under Aging


F2FS Version History

4.0 ~ 4.19


4.0, 4.1

• Batched trim

– Submits split discard commands

– Avoids the long latency caused by huge trim commands

• Trim invokes a checkpoint

• Add an optional rb-tree based extent cache

– An improvement over the original extent info cache

– An extent maps a contiguous logical address range to a contiguous physical address range

– Lower memory consumption

• Enable inline data by default

– Small (< 3.4KB) files can be stored directly in the inode
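The extent idea from this slide, one record standing for a whole run of contiguous blocks, can be sketched as follows. The struct and function are hypothetical illustrations; the rb-tree that indexes many such extents is omitted.

```c
#include <stdbool.h>
#include <stdint.h>

/* One extent: `len` consecutive file blocks starting at file offset `fofs`
 * are stored in consecutive disk blocks starting at `blk`. */
struct demo_extent {
    uint32_t fofs;
    uint32_t blk;
    uint32_t len;
};

/* Translate a file block offset to a disk block if the extent covers it.
 * Returns false (and leaves *blk_out untouched) on a miss. */
static bool extent_lookup(const struct demo_extent *e, uint32_t fofs,
                          uint32_t *blk_out)
{
    if (fofs < e->fofs || fofs - e->fofs >= e->len)
        return false;
    *blk_out = e->blk + (fofs - e->fofs);
    return true;
}
```

A hit answers the mapping question without reading any node block, and a single 12-byte record can cover thousands of blocks, which is where the lower memory consumption comes from.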


4.3

• ioctl F2FS_IOC_GARBAGE_COLLECT

– triggers a cleaning job explicitly by users

• Enhance multithread performance

– Protect submit_bio operation with writepages mutex lock

– writepages mutex lock serializes all block address allocation and page submitting pairs from different inodes

• Introduce a shrinker

– Reclaims memory consumed by a number of in-memory f2fs data structures


4.5, 4.7

• data_flush mount option

– Enables data flushing before checkpoint in order to persist data of regular files and symlinks

– Flushes all dirty pages as well as write-submitted data

• /proc entry that shows the valid block bitmap, so that users can be aware of fragmentation

• Speed up fallocate

– Improves the expand_inode speed in fallocate by allocating as many data blocks as possible within a single locked node page


4.8

• Add lazytime mount option

– Keeps atime updates in a file's in-memory inode until there is some other reason to write the inode, or until the inode itself is being pushed out of memory

• Add discard/nodiscard mount option

– Enables/disables real-time discard in f2fs; if discard is enabled, f2fs issues discard/TRIM commands when a segment is cleaned

• Support the move_range ioctl

– Moves a range of data blocks from one file to another

• mode=lfs mount option

– Turns IPU/SSR off

• flush_merge option by default

– Merges concurrent cache_flush commands as much as possible to eliminate redundant command issues

– Recommended if the underlying device handles the cache_flush command relatively slowly


4.9

• Support async discard

– All discard commands can be issued, and their endio waited for, in batches to improve performance


4.11

• inline_xattr/noinline_xattr

– Stores xattr entries in each inode

• Support IO alignment for DATA and NODE writes

– Fills dummy blocks in write bios

– Eliminates the dummy page writes that the underlying FTL would otherwise conduct in order to close partially written MLC or TLC pages

– BIO_MAX_PAGES = 256

• Add a kernel thread to issue discard commands asynchronously

xattr: http://man7.org/linux/man-pages/man7/xattr.7.html


4.12, 4.13

• Enable small discard by default

– 4K granularity small discard

– discard_granularity in sysfs

• Controls the granularity of the discard command size

• Discard commands are issued if the size is larger than the given granularity

• Its unit size is 4KB, and 4 (=16KB) is set by default

• The maximum value is 128 (=512KB)

• Write small-sized IO to the hot log

– Splits small and large IOs in order to get more consecutive big writes

– Small-sized IO threshold: min_hot_blocks in sysfs

• Add ioctl to do GC with target block address

– f2fs_ioc_gc_range() moves the blocks located in the given range
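The discard_granularity units can be made concrete with a small calculation. In this sketch the helper names are hypothetical, and the comparison treats a run whose length equals the granularity as large enough, which is an assumption about how "larger than" is meant on the slide.

```c
#include <stdbool.h>

#define DEMO_DISCARD_UNIT_KB  4u    /* unit size is 4KB */
#define DEMO_DISCARD_DEFAULT  4u    /* 4 units = 16KB */
#define DEMO_DISCARD_MAX      128u  /* 128 units = 512KB */

/* Convert a discard_granularity value (in 4KB units) to kilobytes. */
static unsigned granularity_kb(unsigned granularity)
{
    return granularity * DEMO_DISCARD_UNIT_KB;
}

/* Assumed rule: a run of invalid blocks is discarded once it is at least
 * `granularity` blocks long (each block being one 4KB unit). */
static bool should_discard(unsigned run_blocks, unsigned granularity)
{
    return run_blocks >= granularity;
}
```

A smaller granularity sends more, smaller discard commands to the device; a larger one batches them at the cost of leaving short invalid runs undiscarded.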


4.14

• Support F2FS_IOC_FS{GET,SET}XATTR

• Support inode checksum

– Inode read verifies checksum

• Introduce gc_urgent mode for background GC

– Via sysfs

– gc_urgent = 0 [default]

– gc_urgent = 1: the background thread starts to do GC at the given gc_urgent_sleep_time interval


GC-related sysfs files

• gc_urgent

• gc_urgent_sleep_time

– Controls the sleep time for gc_urgent. 500 ms is set by default.

• gc_min_sleep_time, gc_max_sleep_time

– Control the minimum/maximum sleep time for the garbage collection thread

• gc_no_gc_sleep_time

– Controls the default sleep time for the garbage collection thread

• gc_idle: selects the victim policy for garbage collection

– gc_idle = 0 (default) disables this option (adaptive)

– gc_idle = 1: Cost Benefit, gc_idle = 2: Greedy

• reclaim_segments

– If the number of prefree segments > reclaim_segments, f2fs tries to conduct a checkpoint to turn the prefree segments into free segments

– By default, 5% of the total # of segments

• A prefree segment has only pre-invalid blocks and invalid blocks

• A checkpoint changes prefree segments into free segments


4.15

• Support flexible inline xattr size

– Expands the inline xattr size flexibly according to the user's requirement

• Export SSR allocation threshold in sysfs

– min_ssr_sections


4.16

• sysfs interface readdir_ra

– Enables/disables readahead of inode blocks

• Add an ioctl to disable GC for a specific file

– Useful when the user wants to keep the file's block map

• Add F2FS_IOC_PRECACHE_EXTENTS ioctl

– Precaches extent info, like ext4

– Eliminates synchronous waiting for mapping info


4.17

• Add block allocation policies to pass down write hints given by user

– whint_mode=off (default). F2FS only passes down WRITE_LIFE_NOT_SET

– whint_mode=user-based. F2FS tries to pass down hints given by users

– whint_mode=fs-based. F2FS passes down hints with its policy.

• Expose extension_list to user and introduce hot file extension

– Via sysfs


write hints

Write-hint Policy
-----------------

1) whint_mode=off. F2FS only passes down WRITE_LIFE_NOT_SET.

2) whint_mode=user-based. F2FS tries to pass down hints given by users.

User                  F2FS                  Block
----                  ----                  -----
                      META                  WRITE_LIFE_NOT_SET
                      HOT_NODE              "
                      WARM_NODE             "
                      COLD_NODE             "
ioctl(COLD)           COLD_DATA             WRITE_LIFE_EXTREME
extension list        "                     "

-- buffered io
WRITE_LIFE_EXTREME    COLD_DATA             WRITE_LIFE_EXTREME
WRITE_LIFE_SHORT      HOT_DATA              WRITE_LIFE_SHORT
WRITE_LIFE_NOT_SET    WARM_DATA             WRITE_LIFE_NOT_SET
WRITE_LIFE_NONE       "                     "
WRITE_LIFE_MEDIUM     "                     "
WRITE_LIFE_LONG       "                     "

-- direct io
WRITE_LIFE_EXTREME    COLD_DATA             WRITE_LIFE_EXTREME
WRITE_LIFE_SHORT      HOT_DATA              WRITE_LIFE_SHORT
WRITE_LIFE_NOT_SET    WARM_DATA             WRITE_LIFE_NOT_SET
WRITE_LIFE_NONE       "                     WRITE_LIFE_NONE
WRITE_LIFE_MEDIUM     "                     WRITE_LIFE_MEDIUM
WRITE_LIFE_LONG       "                     WRITE_LIFE_LONG

3) whint_mode=fs-based. F2FS passes down hints with its policy.

User                  F2FS                  Block
----                  ----                  -----
                      META                  WRITE_LIFE_MEDIUM
                      HOT_NODE              WRITE_LIFE_NOT_SET
                      WARM_NODE             "
                      COLD_NODE             WRITE_LIFE_NONE
ioctl(COLD)           COLD_DATA             WRITE_LIFE_EXTREME
extension list        "                     "

-- buffered io
WRITE_LIFE_EXTREME    COLD_DATA             WRITE_LIFE_EXTREME
WRITE_LIFE_SHORT      HOT_DATA              WRITE_LIFE_SHORT
WRITE_LIFE_NOT_SET    WARM_DATA             WRITE_LIFE_LONG
WRITE_LIFE_NONE       "                     "
WRITE_LIFE_MEDIUM     "                     "
WRITE_LIFE_LONG       "                     "

-- direct io
WRITE_LIFE_EXTREME    COLD_DATA             WRITE_LIFE_EXTREME
WRITE_LIFE_SHORT      HOT_DATA              WRITE_LIFE_SHORT
WRITE_LIFE_NOT_SET    WARM_DATA             WRITE_LIFE_NOT_SET
WRITE_LIFE_NONE       "                     WRITE_LIFE_NONE
WRITE_LIFE_MEDIUM     "                     WRITE_LIFE_MEDIUM
WRITE_LIFE_LONG       "                     WRITE_LIFE_LONG

Write lifetime hints: short, medium, long, extreme; plus not_set and none


write hints

f2fs_io_type_to_rw_hint()

whint_mode == WHINT_MODE_USER

Type   Temp         Hint
----   ----         ----
Data   Hot          WRITE_LIFE_SHORT
Data   Warm         WRITE_LIFE_NOT_SET
Data   Cold         WRITE_LIFE_EXTREME
Node   (all)        WRITE_LIFE_NOT_SET
Meta   (all)        WRITE_LIFE_NOT_SET

whint_mode == WHINT_MODE_FS

Type   Temp         Hint
----   ----         ----
Data   Hot          WRITE_LIFE_SHORT
Data   Warm         WRITE_LIFE_LONG
Data   Cold         WRITE_LIFE_EXTREME
Node   Hot, Warm    WRITE_LIFE_NOT_SET
Node   Cold         WRITE_LIFE_NONE
Meta   (all)        WRITE_LIFE_MEDIUM
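The mapping on this slide can be expressed as one small function. The sketch below is a user-space illustration written from the tables here, not the kernel's f2fs_io_type_to_rw_hint() itself; all enum and function names are hypothetical stand-ins for the kernel's types.

```c
/* Stand-ins for the kernel's write-life hints, block types,
 * temperatures, and whint_mode values. */
enum demo_hint { LIFE_NOT_SET, LIFE_NONE, LIFE_SHORT,
                 LIFE_MEDIUM, LIFE_LONG, LIFE_EXTREME };
enum demo_type { T_DATA, T_NODE, T_META };
enum demo_temp { TEMP_HOT, TEMP_WARM, TEMP_COLD };
enum demo_mode { MODE_OFF, MODE_USER, MODE_FS };

/* Map (mode, block type, temperature) to the hint passed to the block layer. */
static enum demo_hint io_type_to_hint(enum demo_mode mode,
                                      enum demo_type type,
                                      enum demo_temp temp)
{
    if (mode == MODE_OFF)
        return LIFE_NOT_SET;            /* off: only NOT_SET is passed down */
    if (type == T_DATA) {
        if (temp == TEMP_HOT)
            return LIFE_SHORT;
        if (temp == TEMP_COLD)
            return LIFE_EXTREME;
        /* warm data: LONG in fs-based mode, NOT_SET in user-based mode */
        return mode == MODE_FS ? LIFE_LONG : LIFE_NOT_SET;
    }
    if (mode == MODE_USER)
        return LIFE_NOT_SET;            /* all node and meta blocks */
    if (type == T_META)
        return LIFE_MEDIUM;
    /* fs-based nodes: cold nodes get NONE, hot and warm get NOT_SET */
    return temp == TEMP_COLD ? LIFE_NONE : LIFE_NOT_SET;
}
```

The only differences between the two modes are warm data and the node/meta rows, which is why the function shares the hot and cold data branches.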


4.17

• Add mount option for segment allocation policy

– "alloc_mode=reuse" case

• Allocates segments from the 0th segment all the time in order to reassign segments

• Useful for small-sized eMMC parts

– "alloc_mode=default"

• Heap-based

• nowait aio support

– Non-blocking AIO

• Support large NAT bitmap

– Supports a larger number of nodes


4.18, 4.19

• Enable -o discard by default

– Enables real-time discard by default

– Issues discard at every segment cleaning

• Add fsync_mode=nobarrier for non-atomic files

– No flush command

• fsync_mode=%s

– Controls the policy of fsync. Currently supports posix, strict, and nobarrier.

– posix mode (default)

• fsync follows POSIX semantics and does a light operation to improve the filesystem performance

– strict mode

• fsync is heavy and behaves in line with xfs, ext4 and btrfs, where xfstest generic/342 passes, but the performance regresses

– nobarrier is based on posix

• Doesn't issue a flush command for non-atomic files, like the "nobarrier" mount option