2013 Storage Developer Conference. © Insert Your Company Name. All Rights Reserved.
Huawei SmartDisk based Object Storage: Universal Distributed Storage
Qingchao Luo
Huawei Technologies Co.
Agenda
Object Storage Understanding
UDS System Design Philosophy
UDS Hardware Design
UDS Software Design
Future Works
Why Object Storage
[Figure: block system, file system, and object system side by side. Protocol layer: iSCSI/FC, NFS/CIFS, S3. Storage layer: objects, each labeled with Key, Metadata, and Customized Metadata.]
Block Storage: Logical Unit Number (LUN), Logical Block Address (LBA), SCSI commands. High cost, low latency (<10 ms), not easy to manage, hard to scale.
File Storage: Tree structure, dir/file operations, Access Control List (ACL), quota. Low cost, high latency (<100 ms), easy to manage, can be scaled out.
Object Storage: Flat structure. Each object has metadata and data and supports CRUD (Create, Read, Update, Delete) operations through HTTP-based access. Very cheap, higher latency (>100 ms), easy to manage and maintain, native scale-out architecture.
UDS (Universal Distributed Storage) for Object Storage
UDS Key Features
Unlimited Scalability
Low TCO
Extreme Reliability
System Architecture
Highlights:
Addressing by DHT
Fully decentralized system
Small-unit storage node
[Figure: clients connect to the UDS access layer, which addresses a distributed hash table (DHT) ring of partitions (P5, P12, P19, P26, ...). The storage layer is built from Smart Disks, labeled SoD (Self Organization Disk).]
Hardware Architecture
Access Nodes: external interface, data flow control, hash calculation, centralized management.
Switches: exchange channel for internal and external data.
Storage Nodes: store data and metadata.
[Figure: one basic cabinet and three expansion cabinets, each 42 U and populated with 4 U, 2 U, and 1 U units. Cabinets connect through aggregation and access-layer networking over 4*10GE and 2*10GE links to an IP bearer network that carries S3 traffic. The basic cabinet carries the label "256G".]
Hardware Components
One cabinet: 2.1 PB. Up to 84 cabinets. GE / 10GE.
High-density Intelligent Enclosure: 4U, 75 slots, GE / 10GE.
Smart Disk: 3.9 W/TB, GE.
Easy expansion, low TCO, high reliability. Good at storing massive data for a long time.
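As a rough worked example, the figures on this slide can be multiplied out. The sketch below is only arithmetic on the slide's numbers. It assumes decimal units (1 PB = 1000 TB) and that the 3.9 W/TB Smart Disk rating applies to the full capacity of a cabinet. Power drawn by enclosures, switches, and access nodes is not on the slide and is ignored.

```python
# Figures taken from the slide.
PB_PER_CABINET = 2.1
MAX_CABINETS = 84
SMART_DISK_WATTS_PER_TB = 3.9

# Assumption: decimal units, 1 PB = 1000 TB.
TB_PER_PB = 1000


def max_capacity_pb(cabinets=MAX_CABINETS):
    """Raw capacity of a system with the given number of cabinets, in PB."""
    return cabinets * PB_PER_CABINET


def smart_disk_power_kw(capacity_pb):
    """Smart Disk power for a given capacity, in kW (disks only)."""
    return capacity_pb * TB_PER_PB * SMART_DISK_WATTS_PER_TB / 1000


print(f"Maximum capacity: {max_capacity_pb():.1f} PB")
print(f"Smart Disk power per cabinet: {smart_disk_power_kw(PB_PER_CABINET):.2f} kW")
```

Under these assumptions a full 84-cabinet system comes to about 176.4 PB, and the Smart Disks in one cabinet draw roughly 8.19 kW.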
Smart Disk
Good reliability. One ARM chip per HDD: a chip or HDD failure causes at most a few TB of data to be rebuilt. This failure mode suits scale-out system recovery.
Good scalability. Each Smart Disk has one IP address on one Ethernet port, so every access node can read and write it easily.
Easy maintenance. If the chip, memory, or HDD fails, only the Smart Disk needs to be replaced.
Future Full ARM Design
[Figure: 4U enclosure layout with numbered slots (0 to 14), GE switches, switch and interface modules, power supplies, and fans. SoP nodes (ARM, 8 cores) and SoD Smart Disk nodes (ARM) occupy the slots.]
SoP: System over Processor.
Easy deployment and maintenance. Compute nodes and storage nodes are the same size, and both are pluggable in one enclosure.
DHT (Distributed Hash Table)
The hash key space runs from 0 to 2^32 - 1 and is split into N partitions (this figure has 20 partitions).
Virtual nodes manage the partitions (this figure has virtual nodes A to T; each virtual node has one partition).
One physical node (one Smart Disk) may have more than one virtual node.
A key (K1) maps to a partition after hashing, so its value can then be stored on the Smart Disk that holds that partition.
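The mapping described above can be sketched as follows. This is a minimal illustration, not the UDS implementation: the slide does not name the hash function, so CRC32 stands in as a 32-bit hash. The equal-width partitions, the round-robin assignment of virtual nodes to disks, and the names `Ring`, `partition_of`, and `disk_of` are all assumptions made for the example.

```python
import zlib
from bisect import bisect_right

KEY_SPACE = 2**32  # hash values fall in 0 .. 2^32 - 1


def key_hash(key):
    """Hash a key into the 32-bit key space (CRC32 is an assumed stand-in)."""
    return zlib.crc32(key.encode("utf-8"))


class Ring:
    def __init__(self, num_partitions, disks):
        # Equal-width partitions: starts[i] is the first hash value of partition i.
        self.starts = [i * KEY_SPACE // num_partitions for i in range(num_partitions)]
        # One virtual node per partition, assigned round-robin to physical disks,
        # so one Smart Disk can hold several virtual nodes.
        self.owner = [disks[i % len(disks)] for i in range(num_partitions)]

    def partition_of(self, key):
        """Index of the partition whose hash range contains the key's hash."""
        return bisect_right(self.starts, key_hash(key)) - 1

    def disk_of(self, key):
        """Physical Smart Disk that holds the key's partition."""
        return self.owner[self.partition_of(key)]


# 20 partitions as in the figure, spread over 5 hypothetical Smart Disks.
ring = Ring(20, ["disk-0", "disk-1", "disk-2", "disk-3", "disk-4"])
print(ring.partition_of("K1"), ring.disk_of("K1"))
```

Because the key only selects a partition, moving a virtual node to another Smart Disk changes `owner` but not the key-to-partition mapping.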
Smart Disk layout for K-V
[Figure: disk layout. Top level: Reserve, OS, Configure, System log, SoD Redolog, SoD K-V DB, Reserve. Inside the SoD K-V DB: Meta Data, Index-Primary, Index-Secondary, Meta Data, Free Block Bitmap, Data, with reserved areas between regions.]
The Smart Disk provides a K-V (key-value) interface: put a value by key and retrieve a value by key.
The disk layout holds the key index and the value store.
Redundant metadata and key index improve reliability.
KV Data Base for Key Index
[Figure: the index region of the SoD K-V DB. A head is followed by 8 KB pages, divided into Hash Static Pages (20 GB) and Hash Collision Pages (1 GB).]
Index-Primary is written with direct I/O to guarantee consistency.
Index-Secondary is for redundancy and is written with asynchronous page I/O for performance.
Each page is 8 KB, which stays aligned with the OS 4 KB page size.
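A short sketch of the page arithmetic implied by these sizes. It assumes binary units (1 GB = 2^30 bytes) and a simple modulo mapping from a key hash to a static page. The slide gives only the region sizes, so the mapping and the function names are hypothetical, as is the idea that overflow entries go to the collision pages.

```python
PAGE_SIZE = 8 * 1024            # 8 KB index page
OS_PAGE_SIZE = 4 * 1024         # 4 KB OS page
STATIC_BYTES = 20 * 1024**3     # Hash Static Pages region (assumes 1 GB = 2^30 bytes)
COLLISION_BYTES = 1 * 1024**3   # Hash Collision Pages region

STATIC_PAGES = STATIC_BYTES // PAGE_SIZE
COLLISION_PAGES = COLLISION_BYTES // PAGE_SIZE


def static_page_for(key_hash):
    """Pick the static page for a key hash (modulo mapping is an assumption)."""
    return key_hash % STATIC_PAGES


def static_page_offset(key_hash):
    """Byte offset of that page from the start of the static region."""
    return static_page_for(key_hash) * PAGE_SIZE


print(STATIC_PAGES, COLLISION_PAGES)
# Every page offset is a multiple of the OS page size, so pages stay aligned.
print(static_page_offset(0xDEADBEEF) % OS_PAGE_SIZE == 0)
```

With these assumptions the static region holds 2,621,440 pages and the collision region 131,072 pages.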
KV Data Base for Data
[Figure: the Data region of the SoD K-V DB. A head of 4 KB blocks, labeled Chunk Descriptor, is followed by 64 MB chunks. Each chunk is divided into slices; the figure shows 4 KB and 1 MB slices.]
The Free Block Bitmap uses a 4 KB bitmap to manage each 64 MB data chunk.
Each chunk can be split into slices of 4 KB to 4 MB.
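One way such a bitmap could work is sketched below. It assumes one bit per 4 KB block, which needs 16,384 bits (2 KB) for a 64 MB chunk and therefore fits in the 4 KB bitmap, and it assumes slices are contiguous and allocated first-fit. The slide does not state the bit granularity or the allocation policy, so `ChunkBitmap` and its methods are illustrative only.

```python
BLOCK = 4 * 1024                 # smallest slice and bitmap granularity (assumed)
CHUNK = 64 * 1024 * 1024         # one data chunk
MAX_SLICE = 4 * 1024 * 1024      # largest slice
BLOCKS_PER_CHUNK = CHUNK // BLOCK  # 16384 blocks, one bit each


class ChunkBitmap:
    def __init__(self):
        # 4 KB bitmap; with one bit per block only the first 2 KB are used.
        self.bits = bytearray(BLOCK)

    def _get(self, i):
        return (self.bits[i >> 3] >> (i & 7)) & 1

    def _set(self, i, used):
        if used:
            self.bits[i >> 3] |= 1 << (i & 7)
        else:
            self.bits[i >> 3] &= ~(1 << (i & 7)) & 0xFF

    def alloc(self, size):
        """First-fit allocation of a contiguous slice; returns its byte offset or None."""
        if not BLOCK <= size <= MAX_SLICE:
            raise ValueError("slice size must be between 4 KB and 4 MB")
        need = -(-size // BLOCK)  # round up to whole blocks
        run = 0
        for i in range(BLOCKS_PER_CHUNK):
            run = run + 1 if not self._get(i) else 0
            if run == need:
                start = i - need + 1
                for j in range(start, i + 1):
                    self._set(j, True)
                return start * BLOCK
        return None

    def free(self, offset, size):
        """Mark the blocks of a previously allocated slice as free."""
        first = offset // BLOCK
        for j in range(first, first + -(-size // BLOCK)):
            self._set(j, False)


bitmap = ChunkBitmap()
small = bitmap.alloc(4 * 1024)       # one block
large = bitmap.alloc(1024 * 1024)    # 256 blocks
bitmap.free(small, 4 * 1024)
print(small, large, bitmap.alloc(4 * 1024))
```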
DRM (Disk Reliable Module)
[Figure: I/O stack with the layers HBA, SCSI, BLOCK, DRM, and K-V Data Base, split between kernel space and user land.]
DRM functions:
HDD error processing. Based on SCSI sense data and status codes, retry the I/O, reset the device, etc.
HDD online diagnosis. Analyze S.M.A.R.T., error, and log information.
Bad sector recovery.
Disk health analysis.
Slow disk detection.
EC (Erasure Code) in UDS
Placement algorithm that reduces the chance of putting chunks of the same EC stripe on one SmartDisk. For example, a 12 MB object stored with 12+5 (N+M) erasure coding (12 data chunks + 5 parity chunks, 1 MB chunk size) should not have any 2 of its 17 chunks on one SmartDisk.
Partial bad data recovery. When only part of a chunk is bad, such as 64 KB of a 1 MB chunk, N+M EC may still recover the data even if more than M chunks are partially bad, provided the bad regions are not aligned at the same offsets.
Intel ISA-L (Intelligent Storage Acceleration Library). Leverages hardware acceleration instructions to improve performance.
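The two ideas above, distinct placement and partial recovery, can be illustrated with a small sketch. It does not perform any erasure coding and is not the UDS algorithm. `place_stripe` simply samples distinct disks at random, and `recoverable` checks the condition under which an N+M code can rebuild every byte offset: at most M chunks are bad at that offset. Both function names, and the assumption that bad ranges within one chunk do not overlap each other, are made up for the example.

```python
import random


def place_stripe(disks, n=12, m=5, rng=random):
    """Pick n + m distinct disks so no two chunks of a stripe share a disk."""
    if len(disks) < n + m:
        raise ValueError("need at least n + m disks for distinct placement")
    return rng.sample(disks, n + m)


def recoverable(bad_ranges, m):
    """Return True if at every byte offset at most m chunks are bad.

    bad_ranges maps a chunk index to a list of (start, end) byte ranges,
    end exclusive. Ranges within one chunk are assumed not to overlap.
    """
    events = []
    for ranges in bad_ranges.values():
        for start, end in ranges:
            events.append((start, 1))
            events.append((end, -1))
    # At equal offsets the -1 event sorts first, so ranges that only touch
    # are not counted as overlapping.
    events.sort()
    bad_now = 0
    for _, delta in events:
        bad_now += delta
        if bad_now > m:
            return False
    return True


KB = 1024
# Six chunks (more than M = 5) each have 64 KB of bad data at different offsets.
misaligned = {i: [(i * 64 * KB, (i + 1) * 64 * KB)] for i in range(6)}
# Six chunks have bad data at the same offset.
aligned = {i: [(0, 64 * KB)] for i in range(6)}
print(recoverable(misaligned, m=5), recoverable(aligned, m=5))
```

In the misaligned case each offset has only one bad chunk, so the stripe can be repaired range by range. In the aligned case six chunks are missing at the same offset, which exceeds what five parity chunks can rebuild.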
DeDup & Compress
File type                     .txt    .wma     .pdf     .mp3    .doc
File size before compression  6M      2.6M     1.6M     5.1M    2.1M
Compressed file size          251K    2.5M     1.4M     5.1M    1.2M
Compression ratio             24:1    1.04:1   1.15:1   1:1     1.75:1
Compression time (ms)         205     240      117      399     127
Decompression time (ms)       60      71       32       117     33

Concurrent requests   50    100    200      600
Object size           1M    1M     1M       1M
CPU usage (%)         4     7.1    12.75    34.62
Post-process de-duplication. Single-instance technology: each object has a hash value, and objects with the same hash value are deduplicated.
Compression. Compress the object data.
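A minimal sketch of post-process single-instance storage with compression. The slide does not name the hash or the compression algorithm, so SHA-256 and zlib are assumptions, and `ObjectStore` is a hypothetical in-memory stand-in rather than the UDS design. Objects are written as-is, and a later pass hashes them, keeps one compressed copy per distinct hash, and points every key at that copy.

```python
import hashlib
import zlib


class ObjectStore:
    def __init__(self):
        self.objects = {}  # key -> raw bytes, as written (not yet processed)
        self.index = {}    # key -> content hash, filled in by post_process
        self.blobs = {}    # content hash -> single compressed copy

    def put(self, key, data):
        """Write path: store the object untouched so writes stay fast."""
        self.objects[key] = data

    def post_process(self):
        """Background pass: hash, deduplicate, and compress pending objects."""
        for key in list(self.objects):
            data = self.objects.pop(key)
            digest = hashlib.sha256(data).hexdigest()
            if digest not in self.blobs:
                self.blobs[digest] = zlib.compress(data)
            self.index[key] = digest

    def get(self, key):
        if key in self.objects:
            return self.objects[key]
        return zlib.decompress(self.blobs[self.index[key]])


store = ObjectStore()
store.put("a", b"x" * 1000)
store.put("b", b"x" * 1000)  # same content as "a"
store.put("c", b"y")
store.post_process()
print(len(store.blobs))  # two distinct contents remain
```

A production design would also need reference counting so that a stored copy is removed once no key points at it; that is left out here.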
S3 FS
[Figure: a NAS gateway exports NFS and SAMBA (CIFS) on top of FUSE with a cache, alongside a GUI and a ManagerServer. The gateway reaches UDS over S3.]
FUSE-based S3FS. Provides a cache to improve performance.
Exported by NFS/CIFS. Clients can migrate massive numbers of files into object storage.
Dir/file maps to object. Internal translation between dir/file format and object format.
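The dir/file to object translation can be sketched as a mapping between paths and flat object keys. The slide does not describe the actual format, so this example assumes a convention that is common among S3 file-system layers: a file becomes a key equal to its path, and a directory becomes a key ending in "/". The function names are hypothetical.

```python
def path_to_key(path):
    """Map a POSIX-style file path to a flat object key."""
    return "/".join(part for part in path.split("/") if part)


def dir_to_key(path):
    """Map a directory path to a key ending in '/' (assumed convention)."""
    return path_to_key(path) + "/"


def list_dir(keys, dir_path):
    """Emulate a directory listing by scanning flat keys for a shared prefix."""
    prefix = dir_to_key(dir_path) if path_to_key(dir_path) else ""
    children = set()
    for key in keys:
        if key.startswith(prefix) and key != prefix:
            name, sep, _ = key[len(prefix):].partition("/")
            children.add(name + sep)  # a trailing '/' marks a subdirectory
    return sorted(children)


keys = ["docs/", "docs/a.txt", "docs/img/", "docs/img/b.png", "top.txt"]
print(list_dir(keys, "/"))
print(list_dir(keys, "/docs"))
```

The object store itself stays flat; the tree exists only in how the gateway interprets key prefixes.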
WORM for Archive
[Figure: an S3 API client reaches UDS over the Internet. Inside UDS, SoD nodes with a SHA-256 engine and a TSA client connect through a TSA gateway, over the Internet, to the TSA service.]
Leverages GuardTime KSI (Keyless Signature Infrastructure) technology. No key management is needed, so it is easy to deploy.
Trusted Time Stamping Authority (TSA). Guarantees that the time is always correct, and provides signatures to detect data tampering.
Future works
Slow HDD
Slow HDDs increase latency.
How to detect a slow HDD?
Isolate slow HDDs quickly to reduce the penalty.
QoS (Quality of Service)
Different tenants have different SLAs.
Reserved resources per tenant.
Priority scheduling per tenant.
Guarantee QoS when the load on one SmartDisk is very heavy.
Performance (Latency)
Today, the access node and the SmartDisk both use write-through cache.
Add a BBU (Battery Backup Unit) in the enclosure to let SmartDisks use write-back cache.
Add a capacitor in the SmartDisk to support write-back cache.
Hybrid hard disk for SmartDisk. Put metadata (key index and bitmap) in flash.
Internal data repair tasks affect performance because they may run for a long time.
Thank You Q & A