Ceph Day Amsterdam 2015 - Deploying Flash Storage for Ceph Without Compromising Performance
TRANSCRIPT
Ceph Day Amsterdam – March 31st, 2015
Unleash Ceph over Flash Storage Potential with Mellanox High-Performance Interconnect
Chloe Jian Ma, Senior Director, Cloud Marketing
Leading Supplier of End-to-End Interconnect Solutions
[Diagram: Mellanox end-to-end portfolio with Virtual Protocol Interconnect - ICs, adapter cards, switches/gateways, cables/modules, and host/fabric software - connecting server/compute and storage front/back-end over 56Gb InfiniBand & FCoIB and 10/40/56GbE & FCoE, plus Metro/WAN]
Comprehensive End-to-End InfiniBand and Ethernet Portfolio
Solid State Storage Technology Evolution – Lower Latency
Advanced Networking and Protocol Offloads Required to Match Storage Media Performance
[Chart: access time in microseconds (0.1 to 1,000) across storage media technology (hard drives, NAND flash, next-generation NVM), alongside the share of networked-storage access time (from ~50% toward 100%) spent in the storage media versus the network and storage protocol software/hardware]
Scale-Out Architecture Requires A Fast Network
Scale-out grows capacity and performance in parallel
Requires fast network for replication, sharing, and metadata (file)
• Throughput requires bandwidth
• IOPS requires low latency
• Efficiency requires CPU offload
Proven in HPC, storage appliances, cloud, and now… Ceph
Interconnect Capabilities Determine Scale Out Performance
Ceph and Networks
High performance networks enable maximum cluster availability
• Clients, OSDs, monitors, and metadata servers communicate over multiple network layers
• Real-time requirements for heartbeat, replication, recovery and re-balancing
Cluster (“backend”) network performance dictates cluster’s performance and scalability
• “Network load between Ceph OSD Daemons easily dwarfs the network load between Ceph Clients and the Ceph Storage Cluster” (Ceph Documentation)
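In ceph.conf these two layers map to the "public network" and "cluster network" settings; a minimal sketch (the subnets are illustrative, not from the talk):

    # ceph.conf - minimal sketch; subnets are illustrative
    [global]
        # front-side network: client, monitor, and MDS traffic
        public network  = 192.168.10.0/24
        # back-side network: OSD replication, recovery, and OSD-to-OSD heartbeats
        cluster network = 192.168.20.0/24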
How Customers Deploy Ceph with Mellanox Interconnect
Building Scalable, Performing Storage Solutions
• Cluster network @ 40Gb Ethernet
• Clients @ 10G/40Gb Ethernet
High Performance at Low Cost
• Allows more capacity per OSD
• Lower cost/TB
Flash Deployment Options
• All HDD (no flash)
• Flash for OSD Journals (see the sketch after this list)
• 100% Flash in OSDs
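For the journal-on-flash option, the FileStore-era workflow was to point each OSD's journal at a flash device when the OSD is prepared; a minimal sketch using ceph-deploy (host name, device names, and journal size are illustrative, not from the talk):

    # HDD for OSD data, NVMe flash device for the OSD journal
    ceph-deploy osd create osd-node1:/dev/sdb:/dev/nvme0n1

    # ceph.conf - size the flash journal (value in MB, illustrative)
    [osd]
        osd journal size = 10240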
Faster Cluster Network Improves Price/Capacity and Price/Performance
Ceph Deployment Using 10GbE and 40GbE
Cluster (Private) Network @ 40/56GbE
• Smooth HA, unblocked heartbeats, efficient data balancing
Throughput Clients @ 40/56GbE
• Guarantees line rate for high ingress/egress clients
IOPS Clients @ 10GbE or 40/56GbE
• 100K+ IOPS/client @ 4KB blocks
20x Higher Throughput, 4x Higher IOPS with 40Gb Ethernet Clients! (http://www.mellanox.com/related-docs/whitepapers/WP_Deploying_Ceph_over_High_Performance_Networks.pdf)
Throughput testing results based on fio benchmark, 8MB blocks, 20GB file, 128 parallel jobs, RBD kernel driver with Linux kernel 3.13.3, RHEL 6.3, Ceph 0.72.2
IOPS testing results based on fio benchmark, 4KB blocks, 20GB file, 128 parallel jobs, RBD kernel driver with Linux kernel 3.13.3, RHEL 6.3, Ceph 0.72.2
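The slides do not show the exact fio job definitions; the following is a minimal sketch matching the stated parameters, assuming direct I/O against a kernel-mapped RBD device (the image/device names and the sequential-vs-random read patterns are assumptions):

    # Map an RBD image with the kernel driver (image name illustrative)
    rbd map rbd/bench-image            # exposes e.g. /dev/rbd0

    # Throughput test: 8MB blocks, 20GB per job, 128 parallel jobs
    fio --name=throughput --filename=/dev/rbd0 --ioengine=libaio --direct=1 \
        --rw=read --bs=8m --size=20g --numjobs=128 --group_reporting

    # IOPS test: 4KB blocks, same file size and job count
    fio --name=iops --filename=/dev/rbd0 --ioengine=libaio --direct=1 \
        --rw=randread --bs=4k --size=20g --numjobs=128 --group_reporting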
[Diagram: client nodes on a 10GbE/40GbE public network connect to Ceph nodes (monitors, OSDs, MDS); the Ceph nodes and an admin node share a separate 40GbE cluster network]
OK, But How Do We Further Improve IOPS? We Use RDMA!
RDMA over InfiniBand or Ethernet
[Diagram: kernel-bypass data path between two racks. With TCP/IP, data is copied from the application buffer through OS kernel buffers and the NIC in rack 1, then back up through the NIC and kernel in rack 2. With RDMA, the HCAs transfer data directly between application buffers in user space, bypassing the OS on both sides]
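Before relying on an RDMA transport it helps to confirm that RDMA-capable devices and connectivity are in place; a quick check, assuming the standard libibverbs/librdmacm utilities are installed (the address is illustrative):

    # List RDMA-capable devices, ports, and link state
    ibv_devinfo

    # Basic RDMA connectivity test between two hosts (librdmacm rping)
    rping -s -v                      # on the server node
    rping -c -a 192.168.20.11 -v     # on the client node; address illustrative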
Ceph Throughput using 40Gb and 56Gb Ethernet
[Chart: throughput in MB/s (0 to 6,000) for 64KB and 256KB random reads with one OSD, one client, and 8 threads, comparing 40Gb TCP (MTU=1500), 56Gb TCP (MTU=4500), and 56Gb RDMA (MTU=4500)]
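The MTU values above refer to the Ethernet interface setting; on Linux a jumbo MTU can be applied per interface as in the sketch below (the interface name is illustrative, and switch ports must allow the same MTU):

    # Raise the MTU on the cluster-network interface (name illustrative)
    ip link set dev eth2 mtu 4500
    ip link show dev eth2            # verify the new MTU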
Optimizing Ceph for Flash
By SanDisk & Mellanox
Ceph Flash Optimization
Highlights Compared to Stock Ceph
• Read performance up to 8x better
• Write performance up to 2x better with tuning
Optimizations
• All-flash storage for OSDs
• Enhanced parallelism and lock optimization
• Optimization for reads from flash
• Improvements to Ceph messenger
Test Configuration
• InfiniFlash Storage with IFOS 1.0 EAP3
• Up to 4 RBDs
• 2 Ceph OSD nodes, connected to InfiniFlash
• 40GbE NICs from Mellanox
SanDisk InfiniFlash
8K Random - 2 RBD/Client with File System
[Charts: IOPS (0 to 300,000) and latency in ms (0 to 120) for 2 LUNs per client, 4 clients total, at queue depths 1-32 and read percentages 0/25/50/75/100, comparing IFOS 1.0 with stock Ceph]
Performance: 64K Random - 2 RBD/Client with File System
[Charts: IOPS (0 to 160,000) and latency in ms (0 to 180) for 2 LUNs per client, 4 clients total, at queue depths 1-32 and read percentages 0/25/50/75/100, comparing IFOS 1.0 with stock Ceph]
XioMessenger
Adding RDMA To Ceph
I/O Offload Frees Up CPU for Application Processing
[Chart: CPU utilization split into user space and system space. Without RDMA: ~53% CPU efficiency, ~47% CPU overhead/idle. With RDMA and offload: ~88% CPU efficiency, ~12% CPU overhead/idle]
Adding RDMA to Ceph
RDMA Beta to be Included in Hammer
• Mellanox, Red Hat, CohortFS, and Community collaboration
• Full RDMA expected in Infernalis
Refactoring of Ceph messaging layer
• New RDMA messenger layer called XioMessenger
• New class hierarchy allowing multiple transports (the existing SimpleMessenger remains the TCP transport)
• Async design that leverages Accelio
• Reduced locks; Reduced number of threads
XioMessenger built on top of Accelio (RDMA abstraction layer)
• Integrated into all Ceph user-space components: daemons and clients
• “public network” and “cluster network”
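As a rough illustration, the new transport was expected to be selectable in ceph.conf; a minimal sketch, assuming a Ceph build with XioMessenger/Accelio support (the ms_type option name follows the community RDMA work and may differ between releases):

    # ceph.conf sketch - enable the experimental XioMessenger transport
    # (requires a build with Accelio/XioMessenger support; option name may vary)
    [global]
        ms_type = xio                        # otherwise SimpleMessenger (TCP) is used
        public network  = 192.168.10.0/24    # illustrative subnets
        cluster network = 192.168.20.0/24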
Accelio, High-Performance Reliable Messaging and RPC Library
Open source!
• https://github.com/accelio/accelio/ && www.accelio.org
Faster RDMA integration into applications
Asynchronous design
Maximizes message and CPU parallelism
• Enables >10GB/s from a single node
• Enables <10usec latency under load
In Giant and Hammer
• http://wiki.ceph.com/Planning/Blueprints/Giant/Accelio_RDMA_Messenger
Ceph 4KB Read IOPS: 40Gb TCP vs. 40Gb RDMA
[Chart: thousands of 4KB read IOPS (0 to 450) for configurations of 2, 4, and 8 OSDs with 4 clients, comparing 40Gb TCP and 40Gb RDMA; each bar is annotated with the CPU cores busy in the OSDs (34 to 38) and in the clients (24 to 32)]
Ceph-Powered Solutions
Deployment Examples
Ceph For Large Scale Storage – Fujitsu Eternus CD10000
Hyperscale Storage
• 4 to 224 nodes
• Up to 56 PB raw capacity
Runs Ceph with Enhancements
• 3 different storage nodes
• Object, block, and file storage
Mellanox InfiniBand Cluster Network
• 40Gb InfiniBand cluster network
• 10Gb Ethernet front end network
Media & Entertainment Storage – StorageFoundry Nautilus
Turnkey Object Storage
• Built on Ceph
• Pre-configured for rapid deployment
• Mellanox 10/40GbE networking
High-Capacity Configuration
• 6-8TB Helium-filled drives
• Up to 2PB in 18U
High-Performance Configuration
• Single client read 2.2 GB/s
• SSD caching + Hard Drives
• Supports Ethernet, IB, FC, FCoE front-end ports
More information: www.storagefoundry.net
SanDisk InfiniFlash
Flash Storage System
• Announced March 3, 2015
• InfiniFlash OS uses Ceph
• 512 TB (raw) in one 3U enclosure
• Tested with 40GbE networking
High Throughput
• Up to 7GB/s
• Up to 1M IOPS with two nodes
More information:
• http://bigdataflash.sandisk.com/infiniflash
More Ceph Solutions
Cloud – OnyxCCS ElectraStack
• Turnkey IaaS
• Multi-tenant computing system
• 5x faster Node/Data restoration
• https://www.onyxccs.com/products/8-series
Flextronics CloudLabs
• OpenStack on CloudX design
• 2SSD + 20HDD per node
• Mix of 1Gb/40GbE network
• http://www.flextronics.com/
ISS Storage Supercore
• Healthcare solution
• 82,000 IOPS on 512B reads
• 74,000 IOPS on 4KB reads
• 1.1GB/s on 256KB reads
• http://www.iss-integration.com/supercore.html
Scalable Informatics Unison
• High availability cluster
• 60 HDD in 4U
• Tier 1 performance at archive cost
• https://scalableinformatics.com/unison.html
Summary
Ceph scalability and performance benefit from high performance networks
Ceph being optimized for flash storage
End-to-end 40/56 Gb/s transport accelerates Ceph today
• 100Gb/s testing has begun!
• Available in various Ceph solutions and appliances
RDMA is next to optimize flash performance—beta in Hammer
Thank You