Lustre, ZFS, and Data Integrity
ZFS BOF, Supercomputing 2009
Andreas Dilger, Sr. Staff Engineer, Sun Microsystems
Lustre ZFS Team, Sun Microsystems
-
Topics
Overview
ZFS/DMU features
ZFS Data Integrity
Lustre Data Integrity
ZFS-Lustre Integration
Current Status
-
2012 Lustre Requirements
Filesystem Limits
  240 PB file system size
  1 trillion (10^12) files per file system
  1.5 TB/s aggregate bandwidth
  100k clients
Single File/Directory Limits
  10 billion files in a single directory
  File sizes from 0 to 1 PB
Reliability
  100h filesystem integrity check
  End-to-end data integrity
-
ZFS Meets Future Requirements
Capacity
  Single filesystems of 512TB+ (theoretical: 2^64 devices * 2^64 bytes)
  Trillions of files in a single file system (theoretical: 2^48 files per fileset)
  Dynamic addition of capacity/performance
Reliability and Integrity
  Transaction based, copy-on-write
  Internal data redundancy (up to triple parity/mirror and/or 3 copies)
  End-to-end checksum of all data/metadata
  Online integrity verification and reconstruction
Functionality
  Snapshots, filesets, compression, encryption
  Online incremental backup/restore
  Hybrid storage pools (HDD + SSD)
  Portable code base (BSD, Solaris, ... Linux, OS/X?)
-
Lustre/ZFS Project Background
ZFS-FUSE (2006)
  The first port of ZFS to Linux, running in user space via the FUSE framework
Kernel DMU (2008)
  An LLNL engineer ported the core parts of ZFS to Red Hat Linux as a kernel module
Lustre-kDMU (2008-present)
  Modify Lustre to work on a ZFS/DMU storage backend in a production environment
-
ZFS Overview
ZFS POSIX Layer (ZPL)
  Interface to user applications
  Directories, namespace, ACLs, xattr
Data Management Unit (DMU)
  Objects, datasets, snapshots
  Copy-on-write atomic transactions (see the sketch below)
  name->value lookup tables (ZAP)
Adaptive Replacement Cache (ARC)
  Manages RAM and flash cache space
Storage Pool Allocator (SPA)
  Allocates blocks at IO time
  Manages multiple disks directly
  Mirroring, parity, compression
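The DMU's copy-on-write transactions are the interface Lustre builds on. As a rough sketch of how a DMU consumer updates an object, using the in-kernel DMU calls dmu_tx_create(), dmu_tx_hold_write(), dmu_tx_assign(), dmu_write(), and dmu_tx_commit() (osd_write_blob() itself is a hypothetical name, and this compiles only inside the ZFS source tree):

    /*
     * Illustrative sketch only: how a DMU consumer such as Lustre's
     * OSD wraps an object write in a copy-on-write transaction.
     */
    static int
    osd_write_blob(objset_t *os, uint64_t object, uint64_t offset,
                   uint64_t size, const void *buf)
    {
        dmu_tx_t *tx = dmu_tx_create(os);
        int error;

        /* Declare in advance which byte range the transaction dirties. */
        dmu_tx_hold_write(tx, object, offset, size);

        /* Join an open transaction group, waiting for space if needed. */
        error = dmu_tx_assign(tx, TXG_WAIT);
        if (error != 0) {
            dmu_tx_abort(tx);
            return (error);
        }

        /*
         * The data lands in newly allocated blocks; existing blocks are
         * never overwritten, so a crash leaves the old state intact.
         */
        dmu_write(os, object, offset, size, buf, tx);
        dmu_tx_commit(tx);
        return (0);
    }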
-
Lustre/Kernel DMU Modules
-
Lustre Software Stack Updates
[Diagram: Lustre server software stacks across versions]
  1.8 MDS: MDS -> fsfilt -> ldiskfs (MDT)
  2.0 MDS: MDD -> ldiskfs OSD -> ldiskfs (MDT)
  2.1 MDS: MDD -> DMU OSD -> DMU (MDT)
  1.8/2.0 OSS: OST -> obdfilter -> fsfilt -> ldiskfs
  2.1 OSS: OFD -> DMU OSD -> DMU (OST)
-
How Safe Is Your Data?
There are many sources of data corruption:
  Software/firmware: client, NIC, router, server, HBA, disk
  RAM, CPU, disk cache, media
  Client network, storage network, cables
[Diagram: data path from the client application, across the client
network and router, to the object storage server and its storage, with
the network stack layers (Application, Transport, Network, Link) shown
at each hop where corruption can be introduced]
-
ZFS: Industry-Leading Data Integrity
-
End-to-End Data Integrity
Lustre Network Checksum
  Detects data corruption over the network
  ext3/4 does not checksum data on disk
ZFS stores data/metadata checksums
  Fast (Fletcher-4 by default, or none; sketched below)
  Strong (SHA-256)
Combine for End-to-End Integrity
  Integrate the Lustre and ZFS checksums
  Avoid recomputing the full checksum over the data
  Always overlap checksum coverage
  Use a scalable tree hash method
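For reference, ZFS's default fletcher4 checksum is just four cascaded 64-bit accumulators over the data viewed as 32-bit words, which is why it is cheap enough to be on by default. A minimal, self-contained C sketch of the native (non-byteswapped, non-vectorized) recurrence:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /*
     * Fletcher-4 as ZFS computes it natively: four cascaded 64-bit sums
     * over the buffer viewed as 32-bit words.  Size is assumed to be a
     * multiple of 4 bytes, which always holds for ZFS blocks.
     */
    static void
    fletcher_4(const void *buf, size_t size, uint64_t cksum[4])
    {
        const uint32_t *ip = buf;
        const uint32_t *end = ip + size / sizeof (uint32_t);
        uint64_t a = 0, b = 0, c = 0, d = 0;

        for (; ip < end; ip++) {
            a += *ip;   /* first-order sum of the data words */
            b += a;     /* second-order (running) sum        */
            c += b;     /* third-order sum                   */
            d += c;     /* fourth-order sum                  */
        }
        cksum[0] = a; cksum[1] = b; cksum[2] = c; cksum[3] = d;
    }

    int
    main(void)
    {
        uint32_t data[32] = { 1, 2, 3, 4 };  /* rest zero-filled */
        uint64_t ck[4];

        fletcher_4(data, sizeof (data), ck);
        printf("fletcher4: %llx %llx %llx %llx\n",
               (unsigned long long)ck[0], (unsigned long long)ck[1],
               (unsigned long long)ck[2], (unsigned long long)ck[3]);
        return 0;
    }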
-
Hash Tree and Multiple Block Sizes
[Diagram: hash tree spanning multiple block sizes.  4kB data segments
(MIN PAGE) are hashed into leaf hashes; interior hashes combine pairwise
through the 8kB, 16kB (MAX PAGE), 32kB, 64kB, 128kB (ZFS BLOCK), 256kB,
512kB, and 1024kB (LUSTRE RPC SIZE) levels, up to the root hash.]
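A minimal sketch of the tree in that figure: 256 leaf hashes over 4kB segments of a 1MB RPC buffer are combined pairwise until a single root remains, and the nodes at the 128kB level line up with ZFS block checksums. The 64-bit FNV-1a stand-in hash and all names here are illustrative assumptions, not the deck's actual hash choice:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    #define RPC_SIZE  (1024 * 1024)          /* Lustre RPC size   */
    #define LEAF_SIZE (4 * 1024)             /* minimum page size */
    #define NLEAVES   (RPC_SIZE / LEAF_SIZE) /* 256 leaf segments */

    /* Toy 64-bit FNV-1a, standing in for the real hash function. */
    static uint64_t
    fnv1a(const void *data, size_t len, uint64_t seed)
    {
        const uint8_t *p = data;
        uint64_t h = seed ^ 0xcbf29ce484222325ULL;

        while (len--) {
            h ^= *p++;
            h *= 0x100000001b3ULL;
        }
        return h;
    }

    /* LH(): leaf hash of one raw 4kB data segment. */
    static uint64_t LH(const void *seg) { return fnv1a(seg, LEAF_SIZE, 'L'); }

    /* IH(): interior hash of two concatenated child hashes. */
    static uint64_t
    IH(uint64_t left, uint64_t right)
    {
        uint64_t pair[2] = { left, right };
        return fnv1a(pair, sizeof (pair), 'I');
    }

    int
    main(void)
    {
        static uint8_t rpc[RPC_SIZE];  /* stand-in for client data */
        uint64_t node[NLEAVES];
        size_t n, i;

        for (i = 0; i < NLEAVES; i++)
            node[i] = LH(rpc + i * LEAF_SIZE);

        /* Collapse pairwise: 4kB -> 8kB -> ... -> 1024kB levels. */
        for (n = NLEAVES; n > 1; n /= 2) {
            size_t span = RPC_SIZE / (n / 2);  /* bytes under each node */

            for (i = 0; i < n / 2; i++)
                node[i] = IH(node[2 * i], node[2 * i + 1]);
            if (span == 128 * 1024)  /* these nodes match ZFS blocks */
                printf("%zu hashes at the 128kB (ZFS block) level\n", n / 2);
        }
        printf("root hash K = %016llx\n", (unsigned long long)node[0]);
        return 0;
    }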
-
End-to-End Integrity: Client Write
[Diagram: client write path for write(fd, buffer, size=1MB).  The user
buffer is copied or page-referenced into 64 x 16kB client pages (P0..P63);
the kernel computes 4kB leaf hashes (C0..C255), 16kB page hashes, and the
1MB RPC root hash K, then sends the write RPC (1024kB of bulk RDMA data
plus K) to the server.]
-
End-to-End Integrity: Server Write
[Diagram: server write path with 4kB page size.  The OST receives the
1MB write RPC and its bulk RDMA data as 256 x 4kB segments (S0..S255),
recomputes the 4kB leaf hashes and the 1MB RPC root hash K', and checks
K = K' against the client's root hash.  The DMU then stores the data as
8 x 128kB blocks (B0..B7, block pointers covering 0-128kB ... 896-1024kB),
each with its own 128kB block hash taken from the tree.]
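The acceptance test on the server side reduces to a single comparison. A hypothetical C sketch (ost_bulk_verify() and recompute_root() are made-up names, not Lustre symbols; recompute_root() would be the same tree walk shown earlier):

    #include <stddef.h>
    #include <stdint.h>

    /* Assumed: rebuilds the hash tree over the bulk buffer and
     * returns its root; the same walk as in the previous sketch. */
    uint64_t recompute_root(const void *bulk, size_t len);

    int
    ost_bulk_verify(const void *bulk, size_t len, uint64_t client_root_k)
    {
        uint64_t k_prime = recompute_root(bulk, len);

        /*
         * A mismatch means corruption somewhere between the client's
         * buffer and server RAM: reject the write so the client can
         * resend, instead of persisting bad data.  On success the
         * 128kB-level hashes can be handed to the DMU as block
         * checksums without rehashing the data.
         */
        return (k_prime == client_root_k) ? 0 : -1;
    }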
-
Integrity Checking Targets
Scale of the proposed filesystem is enormous
  1 trillion files checked in 100h
  2-4PB of MDT filesystem metadata
  ~3 million files/sec, 3GB/s+ sustained for one pass (see the arithmetic below)
Need to handle CMD coherency as well
  Distributed link counts on files and directories
  Directory parent/child relationships
  Filename to FID to inode mapping
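The ~3 million files/sec figure follows directly from the stated targets:

    \frac{10^{12}\ \text{files}}{100\ \text{h} \times 3600\ \text{s/h}}
        = \frac{10^{12}\ \text{files}}{3.6 \times 10^{5}\ \text{s}}
        \approx 2.8 \times 10^{6}\ \text{files/s}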
-
Distributed Coherency Checking
Traverse MDT filesystem(s) only once
  Piggyback on ZFS resilver/scrub to avoid extra IO
  Lustre processes blocks after checksum validation
Validate OST/remote MDT data opportunistically
  Track which objects have not been validated (1 bit/object; see the sketch below)
  Each object carries a back-reference used to validate its user
  The back-reference avoids keeping a global reference table
Lazily scan OST/MDT in the background
  Load-sensitive scanning
  Continuous online verification/fix of distributed state
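A toy C sketch of the per-object tracking described above; all names here are hypothetical, not Lustre symbols, and object ids are assumed to be < NOBJ:

    #include <stdbool.h>
    #include <stdint.h>

    #define NOBJ 1024

    struct ost_object {
        uint64_t oo_id;          /* object number on this OST      */
        uint64_t oo_backref_fid; /* back-reference: owning MDT FID */
    };

    static uint8_t validated[NOBJ / 8];  /* 1 bit per object */

    /* Stub: a real check would look up the FID on the MDT and confirm
     * that the file's layout really includes this object. */
    static bool
    mdt_fid_references(uint64_t fid, uint64_t obj_id)
    {
        (void)fid; (void)obj_id;
        return true;
    }

    /* Called as the background scan visits each object after its
     * checksums have been validated by the ZFS scrub. */
    static void
    scrub_check_object(const struct ost_object *o)
    {
        /*
         * The back-reference lets every object be checked in isolation,
         * so no global MDT->OST reference table has to be held in RAM.
         */
        if (mdt_fid_references(o->oo_backref_fid, o->oo_id))
            validated[o->oo_id / 8] |= (uint8_t)(1 << (o->oo_id % 8));
    }

    int
    main(void)
    {
        struct ost_object o = { .oo_id = 7, .oo_backref_fid = 0x1234 };

        scrub_check_object(&o);
        return !(validated[0] & (1 << 7));  /* 0 if object 7 validated */
    }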
-
Lustre DMU Development Status
Data Alpha
  OST basic functionality
MD Alpha
  MDS basic functionality, beginning recovery support
Beta
  Layouts, performance improvements, full recovery
GA
  Full performance in production release
-
Questions?
Thank You
-
Hash Tree For Non-contiguous Data
LH(x) = hash of data x (leaf)
IH(x) = hash of data x (interior)
x + y = concatenation of x and y

C0 = LH(data segment 0)
C1 = LH(data segment 1)
C4 = LH(data segment 4)
C5 = LH(data segment 5)
C7 = LH(data segment 7)

C01  = IH(C0 + C1)
C45  = IH(C4 + C5)
C457 = IH(C45 + C7)
K    = IH(C01 + C457) = ROOT hash
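Those combining rules translate directly into code. A self-contained C sketch in which toyhash() is a stand-in for the real leaf/interior hash functions:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    #define SEG 4096  /* one data segment */

    /* Toy stand-in hash; seeded differently for leaf vs. interior. */
    static uint64_t
    toyhash(const void *data, size_t len, uint64_t seed)
    {
        const uint8_t *p = data;
        uint64_t h = seed ^ 0xcbf29ce484222325ULL;

        while (len--) {
            h ^= *p++;
            h *= 0x100000001b3ULL;
        }
        return h;
    }

    static uint64_t LH(const uint8_t *seg) { return toyhash(seg, SEG, 'L'); }

    static uint64_t
    IH(uint64_t x, uint64_t y)
    {
        uint64_t xy[2] = { x, y };  /* "x + y": concatenation */
        return toyhash(xy, sizeof (xy), 'I');
    }

    int
    main(void)
    {
        /* Only segments 0, 1, 4, 5 and 7 are present in this write. */
        static uint8_t s0[SEG], s1[SEG], s4[SEG], s5[SEG], s7[SEG];

        uint64_t C0 = LH(s0), C1 = LH(s1);
        uint64_t C4 = LH(s4), C5 = LH(s5), C7 = LH(s7);

        uint64_t C01  = IH(C0, C1);    /* C01  = IH(C0 + C1)       */
        uint64_t C45  = IH(C4, C5);    /* C45  = IH(C4 + C5)       */
        uint64_t C457 = IH(C45, C7);   /* C457 = IH(C45 + C7)      */
        uint64_t K    = IH(C01, C457); /* root hash over the holes */

        printf("K = %016llx\n", (unsigned long long)K);
        return 0;
    }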
-
T10-DIF vs. Hash Tree
T10-DIF:
  CRC-16 Guard Word
    All 1-bit errors
    All adjacent 2-bit errors
    Single 16-bit burst error
    10^-5 bit error rate
  32-bit Reference Tag
    Misplaced writes (unless offset by an exact multiple of 2^n TB)
    Misplaced reads (unless offset by an exact multiple of 2^n TB)

Fletcher-4 + Hash Tree:
  Fletcher-4 Checksum
    All 1-, 2-, 3-, and 4-bit errors
    All errors affecting 4 or fewer 32-bit words
    Single 128-bit burst error
    10^-13 bit error rate
  Hash Tree
    Misplaced reads
    Misplaced writes
    Phantom writes
    Bad RAID reconstruction
-
ZFS Hybrid Pool Example
Common platform: 4 Xeon 7350 processors (16 cores), 32GB FB DDR2 ECC DRAM, OpenSolaris(TM) with ZFS
Configuration A (hybrid): five 400GB 4200 RPM SATA drives, 32GB SSD ZIL device, 80GB SSD cache device
Configuration B (traditional): seven 146GB 10,000 RPM SAS drives
-
ZFS Hybrid Pool Example (Results)
Hybrid storage (DRAM + SSD + 5x 4200 RPM SATA) vs. traditional storage (DRAM + 7x 10,000 RPM SAS):
  Read IOPs:    3.2x
  Write IOPs:   1.1x
  Raw capacity: 2x (2TB)
  Power:        1/5