kenny kim pspace isw5 · 2012-08-16 · 1 oasisfs (object-based storage architecture for scalable,...
TRANSCRIPT
1
OASISfs (Object-based storage Architecture for Scalable, Intelligent, and Secure file system)
An Implementation of an OSD-based Cluster File System (and experiences of initial deployment)
Kenny Kim, C.E.O., PSPACE, [email protected]
2
▣ Contents
◈ PSPACE, inc.
◈ OASISfs Development Team
◈ OASISfs Overview
◈ OASISfs Runtime Characteristics
◈ ezCon: Web-based OASISfs Management Software
◈ OASISfs Verification Process
◈ Deployment Experience
◈ Future Work (in progress)
◈ Remarks
3
PSPACE, inc.
▣ Founded in 2004
◈ Headquartered in Gyeonggi, Korea
◈ A very small company of 7 engineers
▣ High Performance Computing specialists
▣ Specialties in
◈ High Performance Computing: GPU Computing
◈ High Performance Network: InfiniBand, 10G, Myrinet
◈ High Performance Storage: OASIS, Lustre, PVFS
◈ Cluster Management Software: ezConTM
◈ Resource Management Software: PBSPro, Torque
▣ Storage System Development Team
◈ Began development in 2005
◈ Co-developed with ETRI (Electronics and Telecommunications Research Institute)
◈ Product debut in Q3 2007
4
OASISfs Development Team
▣ 7 engineers from PSPACE, inc.
◈ Headed by Mr. Kenny Kim
◈ Mainly developing InfiniStorTM (a storage solution box using OASISfs)
– End-user support
– Management software development and HA support development
– Hardware development (suitable for OASISfs)
▣ 10 engineers from ETRI
◈ Headed by Dr. Jun Kim
◈ OASIS core development
▣ Other participants
◈ MS-Windows port by SoftOnNet, inc.
◈ Backup support by Prof. Yoo, JS at ChungBuk National University
◈ Security support by S.S.K. University
◈ Software testing by SureSoft, inc.
5
OASISfs Requirements
▣ OSD standard compliance
▣ Scalable file system
◈ Scalable in capacity & performance
▣ General-purpose file system
◈ POSIX compliance
◈ Support Linux and Windows
▣ Provide a file management system
◈ Provide a backup system
◈ Provide GUI and CLI management methods
▣ Complete in 18 (+4) months
6
OASISfs: Value Proposition
Provide a "Stable, Scalable in Performance and Capacity, and Manageable" Storage File System for Normal People
▣ Stable, Scalable, Manageable
◈ Users: easy to use, reliable, resilient to failure
◈ CIO (and managers): maximum performance at a reasonable and predictable price
◈ Administrator: intuitive and complete management software
▣ By enabling
◈ Scalability in performance and capacity
◈ Efficient resource sharing and management
◈ High availability, RAID support
◈ Intuitive and easy-to-use management software
7
OASISfs Specification
◈ Performance: 400MB/s per OSD (bonded quad Gigabit)
◈ Scalability: 100's of storage server devices (OSDs), 100's of clients
◈ Interoperability: Linux and Windows in native mode
◈ Simplicity: a single, shared, coherent filesystem
◈ Standards: POSIX, OSD 1.0, iSCSI
◈ RAID support: Linear, Linear+1, RAID0, RAID5 (RAID0+1 in future)
– flexible price/performance resiliency choices
◈ Supported platforms
– Linux 32-bit, Linux 64-bit, kernel 2.6.10, kernel 2.6.18
– Windows XP 32-bit, single CPU
8
▣ Functionalities (OASISfs 2.0; O = supported, △ = partial, X = not supported)
◈ Performance
– Out-of-band I/O: O
– Object stripe / parallel I/O: O
– Read balance / I/O load balance: O
◈ HA
– MDS HA: Active/Standby
– RAID: Linear, Linear+1, RAID0, RAID5
– OSD HA: O
– File check utility: ofs_fsck
◈ Scalability
– Max clients: 99 (tested)
– Max OSDs: 99 (tested)
– Max MDSs: 2
– On-line OSD expansion: O
◈ Supported platform / interoperability
– OS: Linux (2.6.10, 2.6.18), Windows (XP, 32-bit)
– Network I/F: TCP, InfiniBand
– Multi-NIC support: O
– CPUs: i386, x86_64, EM64T
◈ Function
– 100% POSIX: O
– lockf, flock: O
– mmap: △
– Quota support: O
– File set support: 1 target-multi file set: X; 1 OSD-multi target: X
– mount, fstab: O
– Backup: O
◈ MGMT
– CLI: O
– Web: O
– Monitoring: O
◈ Cache management
– MDS cache: write-thru
– OSD cache: write-back
– Cache coherency: Unix type (updater invalidation, revoke)
– Cache granularity: file
9
▣ Block I/O & Object I/O
[Figure: Side-by-side comparison. A block-device file system keeps heavy-weight per-file metadata (inode with block lists) on the host, plus filesystem metadata such as the free-block bitmap. An object-device file system keeps only light-weight per-file metadata (inode); block allocation moves down into the object device.]
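The contrast above can be sketched in code. This toy model is an illustration only (an assumption, not OASISfs source): in the block case the host tracks the free-block bitmap itself, while in the object case allocation is the device's problem and the host keeps only a name-to-object mapping.

```python
class BlockFS:
    """Block FS: host manages per-file metadata AND the free-block bitmap."""
    def __init__(self, nblocks):
        self.free = [True] * nblocks   # filesystem metadata lives on the host
        self.inodes = {}               # heavy-weight: inode carries block list

    def write(self, name, nblocks_needed):
        blocks = [i for i, f in enumerate(self.free) if f][:nblocks_needed]
        for b in blocks:
            self.free[b] = False
        self.inodes[name] = blocks     # host must persist the block list
        return blocks


class ObjectDevice:
    """Object device (OSD): allocates its own space; host never sees blocks."""
    def __init__(self):
        self.objects = {}

    def create(self, oid, data):
        self.objects[oid] = data       # internal allocation hidden from host


class ObjectFS:
    """Object FS: host keeps only light-weight per-file metadata."""
    def __init__(self, osd):
        self.osd = osd
        self.inodes = {}               # light-weight: name -> object id only

    def write(self, name, data):
        oid = len(self.inodes)
        self.osd.create(oid, data)
        self.inodes[name] = oid
        return oid
```

The point of the sketch: the block FS must synchronize its bitmap on every allocation, while the object FS pushes that bookkeeping into each OSD, which is what makes the design scale with the number of devices.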
10
▣ I-Node Management & Metadata File Management
[Figure: In SAN/NAS, the servers must sync the namespace, inodes, and data, and additionally sync the storage metadata (super block, free-block bitmap, inode bitmap). In the object-based design, only the namespace, inodes, and data are synced; no sync is needed for storage metadata. More scalable!]
11
▣ InfiniStorTM HW Configuration
[Figure: Linux and Windows application servers running the client module (FM) connect over 10/100/1000 Ethernet to a metadata server (HA pair) and object storage servers (OSD servers with disks), presented as a single-volume storage.]
12
▣ MDS
◈ Serves metadata
◈ Holds directory & file attributes
▣ OSTs
◈ Serve file data
▣ OASIS MGMT
◈ Contains configuration data
▣ OASIS FM
◈ Reads/writes directly to OST storage devices
▣ High-speed network interconnect
◈ GigE
◈ Myrinet, InfiniBand
Focus on scalable performance and capacity: separation of metadata & file data, scalable metadata, scalable file data, efficient locking, object architecture
▣ OASISfs Architecture
[Figure: OASIS FM clients contact the Meta-data Server (MDS) for directory operations, metadata & concurrency, file status, and file creation; the Object Storage Devices (OSD) for file I/O & file locking; and OASIS MGMT for configuration information, network connection details, & security management.]
Not encumbered by existing architecture
13
▣ InfiniStorTM Protocol Architecture
[Figure: Three software stacks. File Manager (Client): VFS → OASIS File System (FM) → Multiple Object Devices Driver (MOD) → Linux SCSI stack (SO Driver, SCSI mid layer, OSD/iSCSI driver) → TCP/IP stack. Metadata Server: Metadata Manager (MM) with VFS iSCSI target over EXT3/XFS (namespace) → Linux SCSI stack → TCP/IP stack. Object Storage Target: T10 Object Manager and Object I/O Manager (OM) over EXT3/XFS (objects) → TCP/IP stack. Client↔MDS traffic: RPC (directory, metadata, concurrency) and OSD/iSCSI (file status & creation). Client↔OST traffic: OSD/iSCSI (system & parallel file I/O, file locking). Proprietary components are marked separately from open-source components.]
14
▣ Characteristics
◈ OSD-based cluster file system
– Network-based data sharing
– Volume management
– Linux- and Windows-based client file systems
– Client-of-clients support via NFS/CIFS
– Near-linear performance/capacity scalability as storage servers scale
– Configuration possible based on user need (performance, capacity, budget, availability, …)
◈ Out-of-band architecture
◈ Standards compliance
◈ High availability for metadata servers and object storage servers
15
InfiniStorTM Runtime Characteristics
▣ Execution Simulation
[Figure: A client, the MDS, and OSDs connected through a Gigabit Ethernet switch. The fileset namespace (/ → home, share → big.avi) and metadata are served by the MDS; the file data itself is stored on and read from the OSDs.]
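The simulation above is the out-of-band path in action: the client asks the MDS only for the file's layout, then moves data directly to and from the OSDs. A hedged sketch of that flow, with all class and field names invented for illustration:

```python
class MDS:
    """Metadata server: knows where objects live, never serves file data."""
    def __init__(self):
        self.layouts = {}              # path -> list of (osd_id, object_id)

    def lookup(self, path):
        return self.layouts[path]      # metadata-only RPC


class OSD:
    """Object storage device: serves object data directly to clients."""
    def __init__(self):
        self.objects = {}

    def read(self, object_id):
        return self.objects[object_id]


def client_read(mds, osds, path):
    """Out-of-band read: one metadata lookup, then direct data reads."""
    layout = mds.lookup(path)          # 1) layout from the MDS
    return b"".join(osds[osd_id].read(obj_id)   # 2) data straight from OSDs
                    for osd_id, obj_id in layout)
```

Because step 2 never touches the MDS, data bandwidth scales with the number of OSDs rather than being funneled through one server.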
16
▣ Storage Virtualization using Object RAID
1) Install 12 HDDs in an OSD (12 × 500GB HDDs = one 6TB OSD)
2) From multiple OSDs, create a VFS using the provided configuration tool
3) Provide the VFS configuration info to clients
[Figure: OASIS/FM exposes the POSIX API and performs the virtualization, issuing multiple I/Os to/from the OASIS/OM instances on the OSDs. OASIS/MDS handles virtualized configuration MGMT; the OASIS client maps the physical storage system onto the virtual file system.]
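The client-side mapping from a file offset to a physical OSD can be illustrated with the usual round-robin striping arithmetic. This is a sketch under assumptions (a RAID0-style layout and a 128KB stripe unit), not the actual OASIS/FM code:

```python
STRIPE_SIZE = 128 * 1024               # assumed stripe unit for illustration

def map_offset(offset, n_osds, stripe_size=STRIPE_SIZE):
    """Map a file offset to (osd_index, offset within that OSD's object)."""
    stripe_no = offset // stripe_size          # which stripe unit, file-wide
    osd_index = stripe_no % n_osds             # round-robin across OSDs
    local_stripe = stripe_no // n_osds         # units this OSD already holds
    return osd_index, local_stripe * stripe_size + offset % stripe_size
```

For example, with 4 OSDs, offset 0 lands on OSD 0, the second 128KB unit on OSD 1, and the fifth unit wraps back to OSD 0 at its second local stripe.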
17
▣ Online OSD Add: No Service Interruption
1) Check whether space is available (low space detected)
2) Add HDDs to a new OSD
3) Connect the OSD online
4) Register the OSD to the MDS with the configuration tool
5) The new configuration info is applied to all clients automatically
6) The new configuration is acknowledged and used with no interruption
[Figure: OASIS/MDS, OASIS clients, and OASIS OSDs during an online expansion.]
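One simple way to realize steps 5-6 is an epoch-numbered OSD map that clients revalidate before I/O, so a newly registered OSD becomes visible without remounting. The mechanism and all names below are assumptions for illustration, not the actual OASISfs protocol:

```python
class ConfigServer:
    """Holds the cluster's OSD map, versioned by a monotonically rising epoch."""
    def __init__(self, osds):
        self.epoch = 1
        self.osds = list(osds)

    def add_osd(self, osd):
        # Step 4: register a new OSD online; bump the epoch, no downtime.
        self.osds.append(osd)
        self.epoch += 1


class ConfClient:
    """Client caches the map and refreshes it whenever the epoch has moved."""
    def __init__(self, server):
        self.server = server
        self.epoch, self.osds = server.epoch, list(server.osds)

    def osd_map(self):
        if self.epoch != self.server.epoch:     # steps 5-6: pick up new conf
            self.epoch, self.osds = self.server.epoch, list(self.server.osds)
        return self.osds
```

In-flight I/O keeps using the old map until the next revalidation, which is what makes the expansion interruption-free from the application's point of view.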
18
▣ RAID Support
◈ Linear OSD: store a file on 1 OSD; balanced I/O for read operations
◈ RAID0 OSD: stripe a file and store each stripe on an OSD; fastest I/O for both read and write operations
◈ RAID1 OSD: store a file on 2 OSDs (mirror); resilience to OSD failure; improved concurrent reads of a file
◈ RAID5 OSD: stripe a file, add parity for the stripes, and store each on an OSD; resilience to OSD failure; improved OSD usage
[Figure: Each RAID mode shown with OASIS/MDS, OASIS clients, and OASIS OSDs; P marks parity stripes.]
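The RAID5 mode above can be made concrete with the textbook XOR-parity construction: stripe the data, store one extra parity stripe, and rebuild any single lost stripe from the survivors. This is a minimal sketch of the general technique, not OASISfs source code:

```python
def xor_blocks(blocks):
    """Byte-wise XOR of equal-length byte strings."""
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

def make_stripes(data, n_data):
    """Split data into n_data equal stripes plus one XOR parity stripe."""
    size = len(data) // n_data
    stripes = [data[i * size:(i + 1) * size] for i in range(n_data)]
    return stripes + [xor_blocks(stripes)]

def recover(stripes, lost):
    """Rebuild the stripe at index `lost` by XOR-ing all the others."""
    return xor_blocks([s for i, s in enumerate(stripes) if i != lost])
```

Losing any one OSD (data or parity) is survivable, at the cost of one stripe's worth of extra capacity per stripe group, which is the "improved OSD usage" compared with full mirroring.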
19
▣ Cache-Coherency Support Method
• Near-Unix semantics support
• Eliminates performance degradation when clients do not access a file at the same time
[Figure: Comparison of cache-coherency policies. NFS (policy: time interval, file open): after one client writes A', another client may still read the stale cached copy A. Lustre (policy: check on access): clients validate their cache on every access. OASIS (policy: updater invalidation): when one client writes A', the server invalidates the cached copies of A on the other clients.]
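The updater-invalidation policy above can be sketched as follows: the server tracks which clients cache each file, and a write immediately revokes every other client's copy, so no stale A survives. Structure and names are illustrative assumptions, not the OASISfs implementation:

```python
class CacheClient:
    def __init__(self):
        self.cache = {}                # file -> cached contents


class CoherencyServer:
    def __init__(self):
        self.data = {}                 # file -> authoritative contents
        self.cachers = {}              # file -> set of clients caching it

    def read(self, client, fname):
        """Serve a read and remember that this client now caches the file."""
        self.cachers.setdefault(fname, set()).add(client)
        client.cache[fname] = self.data[fname]
        return client.cache[fname]

    def write(self, updater, fname, value):
        """Apply a write, then revoke every OTHER client's cached copy."""
        self.data[fname] = value
        for c in self.cachers.get(fname, set()) - {updater}:
            c.cache.pop(fname, None)   # updater invalidation / revoke
        self.cachers[fname] = {updater}
        updater.cache[fname] = value
```

Unlike the check-on-access policy, readers pay no validation cost while nobody is writing; the cost is shifted onto the (rarer) updater.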
20
▣ InfiniStorTM Configuration & Monitoring Software
ezCon: Web based OASIS Management Software
21
▣ Management SW: InfiniStorTM Backup Software
22
▣ Management SW: OSD Server Monitoring Software
23
▣ Management SW: Application Server Monitoring Software
24
▣ Open Documents
◈ Windows Client File Management SW Spec
◈ Detailed documents
– Linux Client File Management Block
– Metadata Management Block
– Object Storage Management Block
▣ Test suites
◈ User-level (file access API) test suites
– POSIX Test Suite
– Linux Test Plan Suite
– Self-made test suite (200K cases)
◈ Blackbox-based concurrent-use test suite
– 100 clients => each client creates 20 threads => each thread creates 1,000 files/s
– 100 clients => each client creates 50 threads => each thread reads all files
– 100 clients => each client creates empty files in an infinite loop (until MDS metadata space runs out)
OASIS Verification Process
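The black-box test pattern above (many clients, each spawning threads that hammer file creation) can be sketched at a small scale with stdlib threading. This stand-in runs against a local temp directory rather than a mounted OASISfs volume, and the thread/file counts are scaled down for illustration:

```python
import os
import tempfile
import threading

def create_files(root, tid, n):
    """One 'client thread': create n files with names unique to this thread."""
    for i in range(n):
        with open(os.path.join(root, f"t{tid}_f{i}"), "w") as f:
            f.write("x")

def run_stress(n_threads=4, files_each=50):
    """Spawn threads creating files concurrently, then count what survived."""
    root = tempfile.mkdtemp()
    threads = [threading.Thread(target=create_files, args=(root, t, files_each))
               for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(os.listdir(root))
```

The real suite's check is the same in spirit: after the concurrent create storm, the file count (and each file's readability) must match exactly what was issued.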
25
Deployment Experience 1
▣ Korea Supercomputing Center, Bio-Informatics Division
◈ Hardware configuration
– 4 OSDs, 1 MDS
– 40 application servers
– Gigabit network
◈ Application
– Parallel BLAST: bio-informatics gene-sequencing program
– Software pattern
– Each application server runs 8 processes of Parallel BLAST
– Each Parallel BLAST reads a 1MB to 4GB target file at once
– Runs indefinitely
– Each CPU creates 50 1KB files every second
◈ Result
– Very slow (as expected)
– No crash
– Found a bottleneck in OASISfs and IMPROVED parallel I/O performance by 4 times
26
Deployment Experience 2
▣ PANDORA TV: Korea's largest UCC service
◈ Hardware configuration
– 2-3 OSDs, 1 MDS
– 1-4 application servers
– Gigabit network
◈ Application
– Streaming service
– Each application server (Apache web server) delivers 600-900 VOD streams (FTP downloads)
– Each stream is of size from 00KB to 800MB
– Very random access (mostly READ ops)
– Operation type: file open -> seek by 128KB*n -> read 128KB -> close
◈ Result
– Each application server gets a sustained 300MB/s read
– One outage, caused by a bug in the Linux kernel 2.6.10 EXT3 hash table; fixed by changing the OSD FS from EXT3 to EXT2
– No crashes for 3 months
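The per-request access pattern described above (open -> seek by 128KB*n -> read 128KB -> close) is easy to reproduce. The sketch below exercises it against a local scratch file standing in for an OASISfs-backed VOD file:

```python
import os
import tempfile

CHUNK = 128 * 1024                     # the 128KB unit from the slide

def read_chunk(path, n):
    """Open, seek to the n-th 128KB chunk, read exactly one chunk, close."""
    with open(path, "rb") as f:
        f.seek(CHUNK * n)
        return f.read(CHUNK)

def make_scratch(chunks):
    """Build a test file where chunk i is filled with byte value i."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        for i in range(chunks):
            f.write(bytes([i]) * CHUNK)
    return path
```

Since every request opens and closes the file, there is no long-lived client state: a workload that stresses metadata-path latency and random-read throughput rather than streaming bandwidth.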
27
Future Work (in progress)
▣ OASISfs Version 4
◈ 10K client support: each client does 100 random I/Os
◈ Autonomous & dynamic data redundancy support
◈ Hot-data duplication support (for frequent read access)
◈ Deduplication support
◈ Improved MDS
– Active-Active MDS support
– Probably 8-16 concurrent MDSs (all active)
– Shared-all or shared-nothing (not decided)
◈ OSD
– 1 MDS : multiple OST support
– OST-network-mapped OST support
◈ Client
– Kernel-patch-less support (probably ~10% slower than the current version?)
– RPC or socket based (no more iSCSI?)
– Probably FUSE based (?)
28
▣ InfiniStorTM Combines the Best of SAN and NAS
◈ Shared data (as with NAS)
◈ High bandwidth, low overhead, secure access (as with SAN)
◈ High scalability (much higher than NAS)
◈ High availability
◈ Easy management regardless of client (as with NAS)
◈ Various communication media support
– Can use existing network interconnects: Gb Ethernet, 10 Gb Ethernet, InfiniBand, …
– Lower cost than connecting Fibre Channel to hundreds of application clients
[Figure: Application servers on a system-area network, administered over network I/O; InfiniStorTM provides dedicated resources for metadata service and lock management.]
29
Thank you