ZNS+: Advanced Zoned Namespace Interface for Supporting In-Storage
Zone Compaction
Kyuhwa Han1,2, Hyunho Gwak1, Dongkun Shin1, and Joo-Young Hwang2
1Sungkyunkwan University, 2Samsung Electronics
Zoned Namespace (ZNS) Storage
• The logical address space is divided into fixed-size zones.
• Each zone must be written sequentially and reset explicitly before reuse (see the sketch below).
[Figure: the storage LBA range is divided into Zone 0 through Zone N, each of the same zone size; within a zone, the write pointer separates the written blocks from the remaining blocks, and writes advance it sequentially.]
Sources: https://zonedstorage.io/, https://nvmexpress.org/
[Photo: QLC (4-bit) ZNS SSD, June 2021]
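To make the write-pointer rule concrete, here is a minimal C model of a zone's state machine; the struct and function names are illustrative, not a real driver API:

#include <stdbool.h>
#include <stdint.h>

struct zone {
    uint64_t start;   /* first LBA of the zone */
    uint64_t size;    /* zone size in blocks */
    uint64_t wp;      /* write pointer: next writable LBA */
};

/* Writes are accepted only at the write pointer and advance it. */
bool zone_write(struct zone *z, uint64_t lba, uint32_t nblocks)
{
    if (lba != z->wp || z->wp + nblocks > z->start + z->size)
        return false;          /* non-sequential, or past the zone end */
    z->wp += nblocks;
    return true;
}

/* Reset rewinds the write pointer; all prior zone data becomes invalid. */
void zone_reset(struct zone *z) { z->wp = z->start; }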
SSD Architecture 101
• A flash block is the erase unit; it contains pages (4KB~16KB) that must be written sequentially and cannot be overwritten before erase.
• The SSD controller couples a processor, DRAM (write buffer and L2P map), and an NVMe controller running the FTL firmware with multiple flash controllers.
• 16 flash chips (4 channels x 4 ways) can be accessed in parallel over the parallel channels and ways.
Garbage Collection (GC)
[Figure: GC example. The initial logical-to-physical (L2P) mapping points to pages (A)-(D); out-of-place updates re-map the addresses and mark the old copies invalid (X). GC copies the remaining valid pages to a free block, optionally via in-chip copyback, re-maps their addresses again, and erases the invalidated blocks.]
Why ZNS? Isolation, hot/cold separation
[Figure: placement comparison. A regular SSD uses arrival order-based placement, so incoming Data Streams 1-3 are interleaved across Flash Block Groups 1-4. A ZNS SSD uses zone-based placement: each stream writes to its own zone (Zone 1, 2, 3), so every flash block group holds data from a single stream.]
Why ZNS? Small L2P Translation Table
Regular SSD (random write): writes arrive in the order B, F, C, A, H, E, D, G, so an LBN-level L2P translation table must record a flash location for every logical block.

  LBN | Fblock/Fpage
  ----+-------------
   0  |     0/1
   1  |     1/1
   2  |     0/2
   3  |     0/0
   4  |     1/3
   5  |     1/0
   6  |     0/3
   7  |     1/2

ZNS SSD (sequential write): writes arrive in order, A through H, filling Zone 0 (LBN 0-3, Fblock 0) and Zone 1 (LBN 4-7, Fblock 1) sequentially, so a zone-level L2P translation table suffices; the in-zone page offset follows from the sequential write order.

  Zone | Fblock
  -----+-------
    0  |   0
    1  |   1

LBN: Logical Block Number. Logical Block Size = 4KB; Logical Zone Size = xxMB.
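A back-of-the-envelope comparison of the two table sizes; the 1TB capacity, 4-byte entries, and 256MB zones are assumed for illustration (the slide leaves the zone size as xxMB):

$\text{page-level: } \frac{1\,\mathrm{TB}}{4\,\mathrm{KB}} \times 4\,\mathrm{B} = 1\,\mathrm{GB} \qquad \text{zone-level: } \frac{1\,\mathrm{TB}}{256\,\mathrm{MB}} \times 4\,\mathrm{B} = 16\,\mathrm{KB}$

Under these assumptions the zone-level table is smaller by the zone-to-block ratio (65536x), small enough to pin in controller SRAM rather than DRAM.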
Why ZNS? GC-less, Predictable
[Figure: overwrite example. Regular SSD (random write): updated pages A1, C1, E1, G1 are written out of place to Fblock 2 and re-mapped in the LBN-level L2P table (entries 2/0, 2/1, 2/2, 2/3); reclaiming the invalidated pages requires GC valid-page copies into the over-provisioned Fblock 3 (OP), so writes are amplified and delays are unpredictable. ZNS SSD (sequential write): the host rewrites Zone 0 sequentially (A1-D1 into Fblock 2), so the device stays GC-less with no OP and a write amplification factor (WAF) of 1.]
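The write amplification factor referenced here has the standard definition, stated for completeness:

$\mathrm{WAF} = \frac{\text{host writes} + \text{GC copy writes}}{\text{host writes}}$

For example, if the device copies 4 valid pages while servicing 8 host-written pages, WAF = (8 + 4) / 8 = 1.5; the ZNS path keeps WAF = 1 because the device itself performs no copies.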
New IO Stack for ZNS
Log-Structured File System (LFS) – Append Logging (Sequential Write)
[Figure: ZNS IO stack. The LFS maps segments one-to-one onto zones (Segment 0 = Zone 0, Segment 1 = Zone 1, Segment 2 = Zone 2); the block layer maintains a per-zone in-order queue with a zone scheduler; the ZNS SSD maps each zone onto a flash block group (Flash block group 0, 1, 2) through its zone mapping.]
F2FS (Flash-Friendly File System)
• One of the actively maintained log-structured file systems (LFS)
• Six types of segments: hot, warm, and cold segments for each of node and data
• Multi-head logging
• Supports both append logging (AL) and threaded logging (TL)
• A patched version for ZNS is available (threaded logging is disabled)
[Figure: F2FS on-disk layout: checkpoint and metadata regions followed by segments for Node (hot), Node (warm), Node (cold), Data (hot), Data (warm), and Data (cold). Threaded logging overwrites the obsolete blocks of a dirty segment in place with new data; segment compaction copies the valid blocks out of the segment and then append-logs new writes into the cleaned space.]
Segment Compaction
1. Victim segment selection: choose the segment with the lowest compaction cost.
2. Destination block allocation: reserve contiguous free space.
3. Valid data copy: move all valid data in the victim segment to the destination segments via host-initiated read and write requests, leaving many idle intervals on the flash chips.
4. Metadata & checkpoint update.
[Figure: normal segment compaction via host-level copy. After victim selection and destination block allocation, the file system and block layer issue multiple read requests (R) over the host bus, wait for DMA completions and interrupts, allocate pages, then issue write requests (W), and finally update metadata; Flash Chips 0-3 sit idle between the read and write phases.]
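Step 3 above is what makes host-level compaction expensive: every valid block round-trips over the host bus. A minimal sketch of that copy loop using POSIX pread/pwrite on the zoned block device; the segment geometry and valid bitmap are illustrative assumptions, not the F2FS code:

#include <stdint.h>
#include <unistd.h>

#define BLK_SIZE   4096   /* logical block size (4KB, per the L2P slide) */
#define SEG_BLOCKS 512    /* blocks per segment: illustrative */

/* Copy every valid block of the victim segment to the destination zone's
 * write pointer. Each block costs a D2H read plus an H2D write. */
int compact_segment(int fd, off_t victim_off, off_t dest_wp,
                    const uint8_t *valid_bitmap)
{
    char buf[BLK_SIZE];
    for (int i = 0; i < SEG_BLOCKS; i++) {
        if (!(valid_bitmap[i / 8] & (1 << (i % 8))))
            continue;                                /* skip obsolete blocks */
        if (pread(fd, buf, BLK_SIZE, victim_off + (off_t)i * BLK_SIZE) < 0)
            return -1;                               /* device-to-host DMA */
        if (pwrite(fd, buf, BLK_SIZE, dest_wp) < 0)
            return -1;                               /* host-to-device DMA */
        dest_wp += BLK_SIZE;                         /* sequential: advance WP */
    }
    return 0;
}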
Robbing Peter to Pay Paul?
[Figure: regular SSD vs. ZNS SSD. Regular SSD: the host (EXT4, XFS, ...) issues random overwrites with small traffic and no host-side GC, but the device needs GC, over-provisioning (OP) space, and a large L2P mapping table. ZNS SSD: the host (LFS, F2FS, Btrfs, ...) performs append logging plus segment compaction, producing large traffic, while the device runs no GC (no OP) and keeps a small L2P mapping table.]
• Host-side GC is the price of using a GC-less ZNS SSD
• Host-side GC overhead > device-side GC overhead
  ▪ IO request handling, H2D data transfer, page allocation, metadata update
SSD: "Can I help you?" Host: "Yes, copy blocks yourself. You can do better! And can I use threaded logging?"
ZNS+: LFS-Aware ZNS
• Internal Zone Compaction (IZC)
  ▪ zone_compaction command
  ▪ Accelerates zone compaction
  ▪ Copies blocks within the SSD
  ▪ Reduces host-to-device traffic
  ▪ Lets the SSD schedule flash operations efficiently
• Sparse Sequential Write
  ▪ TL_open command
  ▪ Avoids zone compaction by enabling threaded logging
  ▪ The host can overwrite a zone sparsely (see the sketch below)
    ✓ The block addresses of consecutive writes must be in increasing order
  ▪ Transformed into dense requests by the SSD's internal plugging
  ▪ Hides latency by utilizing idle flash chips
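A minimal sketch of that ordering rule, assuming the device tracks a per-zone minimum next address while a zone is TL-opened; the struct and names are illustrative, not the ZNS+ interface:

#include <stdbool.h>
#include <stdint.h>

struct tl_zone {
    uint64_t next_min_lba;   /* lowest LBA the next write may target */
};

/* Sparse sequential write check: each write's start address must not go
 * backward; gaps are allowed and later densified by internal plugging. */
bool sparse_write_ok(struct tl_zone *z, uint64_t lba, uint32_t nblocks)
{
    if (lba < z->next_min_lba)
        return false;                 /* out of increasing order: rejected */
    z->next_min_lba = lba + nblocks;  /* skipped range left for plugging */
    return true;
}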
[Figure: ZNS+ stack and copy-offloading timeline. The host (LFS, F2FS, Btrfs, ...) uses copy offloading and threaded logging; the ZNS+ SSD provides zone compaction and internal plugging, so host traffic stays small with no GC (no OP) and a small L2P mapping table. Timeline: after victim selection and destination block allocation, the host sends a single zone_compaction request; the flash controller performs the copies largely as in-chip copybacks, schedules the remaining reads (R) and writes (W) efficiently across Flash Chips 0-3, and raises one interrupt, after which the host updates metadata.]
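For a feel of what issuing the new command could look like from the host, here is a hedged sketch using the Linux NVMe passthrough ioctl; the opcode value and the cdw10-13 layout are assumptions for illustration only, since the actual ZNS+ command encoding is defined by the authors' firmware:

#include <linux/nvme_ioctl.h>
#include <string.h>
#include <sys/ioctl.h>

#define OPC_ZONE_COMPACTION 0x9a   /* hypothetical vendor-specific opcode */

/* Ask the device to copy the victim zone's valid blocks itself: one
 * request replaces the many read/write pairs of host-level copy. */
int zone_compaction(int fd, unsigned int nsid,
                    unsigned long long victim_slba,
                    unsigned long long dest_slba)
{
    struct nvme_passthru_cmd cmd;
    memset(&cmd, 0, sizeof(cmd));
    cmd.opcode = OPC_ZONE_COMPACTION;
    cmd.nsid   = nsid;
    cmd.cdw10  = victim_slba & 0xffffffff;   /* assumed field layout */
    cmd.cdw11  = victim_slba >> 32;
    cmd.cdw12  = dest_slba & 0xffffffff;
    cmd.cdw13  = dest_slba >> 32;
    /* A valid-block bitmap could be attached via cmd.addr/cmd.data_len. */
    return ioctl(fd, NVME_IOCTL_IO_CMD, &cmd);
}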
ZNS+-aware LFS
• Hybrid segment recycling (HSR): chooses between segment cleaning and threaded logging per victim segment
• Copyback-aware block allocation: maximizes in-chip copyback operations
[Figure: the host-side LFS adds HSR and copyback-aware block allocation (BA) on top of copy offloading and threaded logging toward the ZNS+ SSD. SSD: "Do you know I can copy data quickly within a flash chip? It's copyback." Host: "I see. Then, I will use a copyback-aware block allocation."]
[Figure: copyback example. Zones 0 and 1 are striped across Chip 0 and Chip 1 (blocks 1, 3, 5 on Chip 0; blocks 2, 4, 6 on Chip 1). A logical copy whose source and destination land on the same chip can use copyback (fast, no transfer through the flash controller); a copy that crosses chips needs read & program through the controller (slow).]
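A minimal sketch of the allocation rule the figure implies, assuming blocks are striped round-robin across chips; the names and striping function are illustrative, not the F2FS implementation:

#define NUM_CHIPS 2   /* two chips, as in the figure */

/* Chip holding a given block offset under round-robin striping. */
static int chip_of(unsigned int block_off) { return block_off % NUM_CHIPS; }

/* Prefer a destination block on the same chip as the source so the
 * device can use fast in-chip copyback instead of read & program. */
int pick_dest(const unsigned int *free_blocks, int n_free, unsigned int src)
{
    for (int i = 0; i < n_free; i++)
        if (chip_of(free_blocks[i]) == chip_of(src))
            return i;           /* copyback-friendly destination */
    return n_free ? 0 : -1;     /* fall back: cross-chip read & program */
}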
Experimental Setup
• ZNS+ emulator based on FEMU (https://github.com/ucare-uchicago/femu), a QEMU-based, DRAM-backed NVMe SSD emulator
• Real ZNS+ implemented on the Cosmos+ OpenSSD platform
• Modified F2FS (Linux 4.10)
• Comparison
  ▪ ZNS vs. IZC (internal zone compaction, no TL) vs. ZNS+ (IZC and TL)
Performance
• IZC
  ▪ 28.2–51.7% reduction in compaction time
  ▪ 1.3–1.9x higher throughput than ZNS
• ZNS+
  ▪ 1.3–2.9x higher throughput than ZNS
  ▪ > 86% of the reclaimed segments are handled by threaded logging
  ▪ 48% reduction in metadata write traffic
[Figure: compaction time comparison.]
Flash Chip Parallelism
• ZNS+ performance scales faster with the number of flash chips
  ▪ The increased compaction cost is hidden in the background by internal plugging
• Copyback-aware (CAB) vs. copyback-unaware (CUB) block allocation
  ▪ CUB: the copyback ratio decreases linearly as chips are added
  ▪ CAB: > 80% of the copy requests are processed by copyback
  ▪ ZNS+ further raises the copyback ratio with threaded logging (internal plugging)
[Figure: left, performance (Kops/s) vs. number of flash chips (2-32) for ZNS, IZC-CUB, IZC-CAB, ZNS+-CUB, and ZNS+-CAB; with high internal bandwidth the gap widens (ZNS << ZNS+), and it narrows with low internal bandwidth. Right, copyback (cpbk) vs. read-and-program (R/P) distribution for IZC and ZNS+ under CUB and CAB. More flash chips imply a larger zone size and thus a higher compaction cost.]
Conclusion
• SSD evolution
  ▪ Black-box model – regular SSD: log-on-log problem, unpredictable delays
  ▪ Gray-box model – multi-streamed SSD: the host specifies a stream ID per write request ➔ GC optimization
  ▪ White-box model – Open-Channel SSD, ZNS: exposes the SSD hardware geometry to the host
• Current ZNS imposes a high storage-reclaiming overhead on the host to simplify SSDs
• ZNS+
  ▪ Places each storage management task in the most appropriate location
  ▪ Makes the host and the SSD cooperate
• Code is available: https://github.com/eslab-skku/ZNSplus