TRANSCRIPT
WINDOWS SERVER 2016
Microsoft Storage Spaces Direct – the future of Hyper-V and Azure Stack (EN)
Carsten Rachfahl
Microsoft Cloud & Datacenter MVP
Microsoft Regional Director Germany
WINDOWS SERVER 2016
Carsten Rachfahl, Microsoft CDM MVP
Microsoft Regional Director
Organizer of the Cloud & Datacenter Conference Germany, http://cdc-gemany.de
@hypervserver
One of the Hyper-V Amigos
I blog, do screencasts and interviews at https://www.hyper-v-server.de
WINDOWS SERVER 2016
Agenda
• S2D overview
• S2D in depth
• Deployment options
• Performance
• S2D in the {X} Stack
• Q&A
WINDOWS SERVER 2016
Traditional Storage Array
[Diagram: Compute nodes connect over Fibre Channel / iSCSI / FCoE / SAS to a storage array: two controllers running the storage software, a backplane, and disks]
WINDOWS SERVER 2016
Scale-out File Server
[Diagram: Compute nodes access a Scale-out File Server over SMB3; the file server nodes run the storage software (Shared Storage Spaces) and attach over SAS to a shared JBOD enclosure]
WINDOWS SERVER 2016
Scale-out File Server
Converged with Storage Spaces Direct
[Diagram: Compute nodes access the Scale-out File Server over SMB3; the storage software now uses disks local to each storage node instead of shared JBODs]
WINDOWS SERVER 2016
Microsoft Storage Spaces Direct
What is Storage Spaces Direct?
■ Software-defined storage
■ Highly available and scalable
■ Storage for Hyper-V and Private Cloud
Why Storage Spaces Direct?
■ Servers with local storage
■ Industry standard hardware
■ Lower cost flash with SATA SSDs
■ Better flash performance with NVMe SSDs
■ Ethernet/RDMA network as storage fabric
[Diagram: Hyper-V cluster with locally attached storage]
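As a rough PowerShell sketch of how such a cluster is stood up (node and cluster names here are placeholders; Test-Cluster, New-Cluster and Enable-ClusterStorageSpacesDirect are the cmdlets Windows Server 2016 ships for this):

  # Validate the candidate nodes, including the S2D-specific tests
  Test-Cluster -Node Node1,Node2,Node3,Node4 `
               -Include "Storage Spaces Direct",Inventory,Network,"System Configuration"

  # Create the cluster without claiming any shared storage
  New-Cluster -Name S2D-Cluster -Node Node1,Node2,Node3,Node4 -NoStorage

  # Claim all eligible local drives, create the pool and configure the cache
  Enable-ClusterStorageSpacesDirect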
WINDOWS SERVER 2016
Storage Stack
File System (CSVFS with ReFS)
■ Fast VHDX creation, expansion and checkpoints
■ Cluster-wide data access
Storage Spaces
■ Scalable pool with all disk devices
■ Resilient virtual disk
Software Storage Bus
■ Storage Bus Cache
■ Leverages SMB3 and SMB Direct
Servers with local disks
■ SATA, SAS and NVMe
[Diagram: Converged and hyper-converged stacks. Both layer the Software Storage Bus, Storage Spaces storage pool, Storage Spaces virtual disk, and the CSVFS cluster file system with the ReFS on-disk file system; hyper-converged runs the virtual machines directly on top, while converged inserts a Scale-Out File Server and reaches the compute layer over SMB 3]
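Once S2D is enabled, these layers show up as ordinary storage objects that can be inspected from any node; a small sketch (pool and subsystem names vary per cluster):

  # The clustered storage subsystem and the pool S2D created from all disks
  Get-StorageSubSystem Cluster* | Get-StoragePool

  # Local devices by bus and media type (SATA, SAS, NVMe / SSD, HDD)
  Get-PhysicalDisk | Group-Object BusType,MediaType | Format-Table Count,Name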
WINDOWS SERVER 2016
Software Storage Bus
Virtual storage bus spanning all servers
Virtualizes physical disks and enclosures
Consists of:
■ ClusPort: initiator (virtual HBA)
■ ClusBflt: target (virtual disks / enclosures)
SMB3/SMB Direct transport
■ RDMA-enabled networks for low latency and low CPU usage
Bandwidth management
■ Fair device access from any server
■ IO prioritization (App vs System)
■ De-randomization of random IO
■ Drives sequential IO pattern on rotational media
[Diagram: Two nodes; on each, an application sits on CSVFS over the file system, virtual disks, SpacePort, and ClusPort; ClusPort reaches the ClusBflt target and physical devices on the other node via Block over SMB]
WINDOWS SERVER 2016
Built-In Cache
Integral part of Software Storage Bus
Cache scoped to local machine
Agnostic to storage pools and virtual disks
Automatic configuration when enabling S2D
■ Special partition on each caching device
■ Leaves 32 GB for pool and virtual disk metadata
■ Round-robin binding of SSDs to HDDs
■ Rebinding on topology change
Cache behavior
■ All writes up to 256 KB are cached
■ Reads of 64 KB or less are cached on first miss
■ Reads larger than 64 KB are cached on a second miss within 10 minutes
■ Sequential reads of 32 KB or more are not cached
■ On all-flash systems only writes are cached
[Diagram: Caching devices (SSD/NVMe) bound round-robin to capacity devices]
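The automatic cache configuration can be inspected, and in special cases overridden, with the cluster S2D cmdlets; a sketch (the override shown merely restates the all-flash default and is for illustration only):

  # Show cache state and how SSD/HDD capacity devices are cached
  Get-ClusterStorageSpacesDirect | Format-List CacheState,CacheModeSSD,CacheModeHDD

  # Example override: cache writes only for SSD capacity devices
  Set-ClusterStorageSpacesDirect -CacheModeSSD WriteOnly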
WINDOWS SERVER 2016
Storage Pool
Metadata on select devices
■ Improves pool scalability
■ Improves pool update performance
Device selection
■ Faster media is preferred
■ Metadata on up to 10 devices
■ Evenly spread across fault domains
■ Dynamic update on node or device failure
[Diagram: Cluster nodes with the potential metadata devices highlighted]
WINDOWS SERVER 2016
Volume Types
■ Performance volumes (mirror)
■ Usually 3-way or 2-way mirror
■ Capacity volumes (parity)
■ Should be dual parity
■ Hybrid volumes
■ A mix of 3-way mirror and dual parity (creation sketched below)
[Diagram: mirror and parity extent layouts]
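Creating these volume types comes down to New-Volume; a sketch with hypothetical friendly names and sizes:

  # Performance volume: 3-way mirror (PhysicalDiskRedundancy 2)
  New-Volume -StoragePoolFriendlyName S2D* -FriendlyName Perf01 `
             -FileSystem CSVFS_ReFS -ResiliencySettingName Mirror `
             -PhysicalDiskRedundancy 2 -Size 1TB

  # Capacity volume: dual parity
  New-Volume -StoragePoolFriendlyName S2D* -FriendlyName Cap01 `
             -FileSystem CSVFS_ReFS -ResiliencySettingName Parity `
             -PhysicalDiskRedundancy 2 -Size 2TB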
WINDOWS SERVER 2016
Hybrid Volumes
[Diagram: 8 servers holding an LRC layout: data X1 and X2 with local parity PX on servers 1-3, data Y1 and Y2 with local parity PY on servers 4-6, global parity Q on server 7]
Volume with mirror and parity
Requires at least 4 nodes
Requires ReFS
Mirror for hot data
Optimized for write performance
Little CPU or storage churn
Parity for cold data
Erasure coding storage efficiency
CPU or storage churn only on cold data
Local Reconstruction Codes (LRC) algorithm
Nodes  Mirror Efficiency  Parity Efficiency (SSD + HDD)  Parity Efficiency (All-Flash)  Resiliency
4      33%                50%                            50%                            2 nodes
8      33%                66%                            66%                            2 nodes
12     33%                72%                            75%                            2 nodes
16     33%                72%                            80%                            2 nodes
[Diagram: 3-way mirror copies (A, A’, A’’ and B, B’, B’’) versus parity layout]
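A hybrid (multi-resilient) volume is requested by naming both tiers; a sketch, assuming the Performance (mirror) and Capacity (parity) tier definitions that enabling S2D creates by default:

  # 1 TB of 3-way mirror for hot data plus 3 TB of dual parity for cold data
  New-Volume -StoragePoolFriendlyName S2D* -FriendlyName Hybrid01 `
             -FileSystem CSVFS_ReFS `
             -StorageTierFriendlyNames Performance,Capacity `
             -StorageTierSizes 1TB,3TB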
WINDOWS SERVER 2016
LRC Data Reconstruction
Most common failure is 1 fault domain
1 disk failure (X2)
■ Read X1 and PX
■ Recalculate X2
■ Write X2 to different disk
■ Total of 2 reads and 1 write
Traditional Reed-Solomon
■ 4 data, 2 parity
■ Total of 4 reads and 2 writes
LRC requires 50% less disk IO (see the XOR sketch after this slide)
Tolerant to failure of 2 fault domains
2 disk failure (X1 and X2)
■ Read PX, Y1, Y2 and Q
■ Recalculate and write X1 to a different disk
■ Recalculate and write X2 to a different disk
■ Total of 4 reads and 2 writes
Traditional Reed-Solomon
■ 4 data and 2 parity
■ Total of 4 reads and 2 writes
LRC and RS require the same disk IO
[Diagram: the same 8-server LRC layout as above: X1, X2, PX, Y1, Y2, PY, Q]
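The single-failure arithmetic is plain XOR: a local parity is the XOR of its group's data symbols, so one surviving symbol plus the parity recovers the lost one. A toy sketch with single bytes standing in for whole disks:

  # Toy LRC group: PX is the XOR (local parity) of X1 and X2
  $X1 = 0x3C; $X2 = 0xA5
  $PX = $X1 -bxor $X2

  # Disk holding X2 fails: 2 reads (X1, PX) and 1 write rebuild it
  $X2rebuilt = $X1 -bxor $PX
  "Original {0:X2}, rebuilt {1:X2}" -f $X2, $X2rebuilt   # values match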
WINDOWS SERVER 2016
ReFS Real-Time Tiering
Writes go to mirror tier (hot data)
Rotate data into parity tier as needed (cold data)
Erasure Code calculation only on rotation
Updates to data stored in parity tier
■ Updated data is written to mirror tier
■ Old data in parity tier is invalidated (metadata operation)
[Diagram: writes (W) land in the ReFS mirror tier and rotate into the parity tier]
WINDOWS SERVER 2016
ReFS VM Optimizations
Basics
■ Metadata checksums with optional user data checksums
■ Data corruption detection and repair
■ On-volume backup of critical metadata with online repair
Efficient VM Checkpoints and Backup
■ VHD(X) checkpoints cleaned up without physical data copies
■ Data migrated between parent and child VHD(X) files as a ReFS metadata operation
■ Reduces I/O to disk and increases speed
■ Reduces impact of checkpoint clean-up on foreground workloads
Accelerated Fixed VHD(X) Creation
■ Fixed VHD(X) files zeroed with metadata operations
■ Minimal impact on workloads
■ Decreases VM deployment time
Quick Dynamic VHD(X) Expansion
■ Dynamic VHD(X) files zeroed with metadata operations
■ Minimal impact on workloads
■ Reduces latency spikes for foreground workloads
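The effect of metadata-based zeroing is easy to observe; a sketch timing fixed VHDX creation on a CSV path (path and size are hypothetical; on NTFS the same call would physically write ~100 GB of zeros):

  # Completes in seconds on CSVFS/ReFS because zeroing is a metadata operation
  Measure-Command {
      New-VHD -Path C:\ClusterStorage\Volume1\big.vhdx -Fixed -SizeBytes 100GB
  }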
WINDOWS SERVER 2016
Scale
2 nodes (minimum)
■ Only 2-way mirror
3 nodes
■ 2-way and 3-way mirror
4 to 16 nodes (maximum)
■ 2-way and 3-way mirror
■ Parity possible
■ Hybrid volumes
Maximum of 416 devices with 16 nodes
Minimum of 6 devices (2 cache + 4 capacity drives)
WINDOWS SERVER 2016
Deployment Options
SQL Server 2016 and storage resources together
Easy deployment and management (I hope)
[Diagram: four nodes each running SQL Server on local S2D storage]
WINDOWS SERVER 2016
Deployment Options
Hyper-Converged: compute and storage resources together
Easy deployment and management
WINDOWS SERVER 2016
Deployment Options
Hyper-Converged: compute and storage resources together
Easy deployment and management
Converged: compute and storage resources separate, scaling for larger deployments
[Diagram: Hyper-V compute cluster connected to an S2D storage cluster over an SMB 3 fabric]
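In the converged layout the storage cluster additionally publishes its CSV volumes to the compute layer over the SMB 3 fabric; a sketch with hypothetical role, share and account names:

  # Add the Scale-Out File Server role to the storage cluster
  Add-ClusterScaleOutFileServerRole -Name SOFS01

  # Publish a CSV folder for the Hyper-V hosts
  New-SmbShare -Name VMStore -Path C:\ClusterStorage\Volume1\VMs `
               -FullAccess CONTOSO\HyperVHosts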
WINDOWS SERVER 2016
DELL PowerEdge R730XD
HPE ProLiant DL380 Gen9
Cisco UCS C240 M4
Intel MCB2224TAF3
Quanta D51B-2U (MSW6000)
DataON S2D-3110
Fujitsu Primergy RX2540 M2
Inspur NF5280M4
Lenovo X3650 M5
NEC Express5800 R120f-2M
RAID Inc. Ability™ HCI Series S2D200
SuperMicro SYS-2028U-TRT+
WINDOWS SERVER 2016
2-Node PoC: Project Kepler-47
Mini-ITX motherboard
Intel Xeon E3v5 1235L 4C 2.00 GHz
2 x 16 GB ECC DDR4
6 x 4 TB SATA HDD
2 x 200 GB SATA SSD
USB3 DOM
U-NAS NSC-800 chassis
WINDOWS SERVER 2016
2-Node PoC: Project Kepler-47
Server and drive fault tolerance
20+ TB of mirrored storage capacity
50+ GB of memory for 5-10 mid-sized VMs
Great for remote/branch offices!
WINDOWS SERVER 2016
Microsoft and Intel showcase at IDF'15

Load Profile              Total IOPS   IOPS/Server
100% 4K Read              4.2M IOPS    268K IOPS
90%/10% 4K Read/Write     3.5M IOPS    218K IOPS
70%/30% 4K Read/Write     2.3M IOPS    143K IOPS

Showcase hardware: 16 Intel® Server System S2600WT (2U) nodes
• Dual Intel® Xeon® processor E5-2699 v3 Processors
• 128GB Memory (16GB DDR4-2133 1.2V DR x4 RDIMM)
Storage per Server
• 4x Intel® SSD DC P3700 Series (800 GB, 2.5” SFF)
• Boot Drive: 1 Intel® SSD DC S3710 Series (200 GB, 2.5” SFF)
Network per server
• 1 Chelsio® 10GbE iWARP RDMA Card (CHELT520CRG1P10)
• Intel® Ethernet Server Adapter X540-AT2 for management
Load Generator (8 VMs per Compute Node => 128 VMs)
• 8 virtual cores and 7.5 GB memory
• DISKSPD with 8 threads and Queue Depth of 20 per thread
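The quoted load generator settings translate roughly to a DISKSPD command line like the following (target file and duration are assumptions; -t is threads, -o outstanding IOs per thread, -w write percentage):

  # 70/30 4K random read/write, 8 threads, queue depth 20 per thread,
  # caches disabled, latency statistics collected
  .\diskspd.exe -b4K -r -w30 -t8 -o20 -d60 -Sh -L -c64G D:\io.dat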
WINDOWS SERVER 2016
Performance video on Channel 9
Configuration:
■ 4x Dell R730XD
■ 2x Xeon E5-2660v3 2.6 GHz (10c/20t)
■ 256 GB DRAM (16x 16 GB DDR4 2133 MHz DIMM)
■ 4x Samsung PM1725 3.2 TB NVMe SSD (PCIe 3.0 x8 AIC)
■ Dell HBA330
■ 4x Intel S3710 800 GB SATA SSD
■ 12x Seagate 4 TB Enterprise Capacity 3.5” SATA HDD
■ 2x Mellanox ConnectX-4 100Gb (dual port, PCIe 3.0 x16)
■ Mellanox FW v. 12.14.2036
■ Mellanox ConnectX-4 driver v. 1.35.14894
■ Device PSID MT_2150110033
■ Single port connected per adapter
WINDOWS SERVER 2016
My Own Benchmarks
Benchmark:
■ Microsoft VMFleet with 60 VMs on 4 nodes
■ Diskspd testing 64 KB blocks with 70% read / 30% write (see the sketch below)
Top: Fujitsu, 2x E5-2680 CPUs with 2x 800 GB NVMe + 4x 1.9 TB SSD
Mid: Dell, 2x E5-2640 with 18x 800 GB SSDs
Bottom: HPE, 2x E5-2660 with 2x 800 GB SSDs + 4x 4 TB HDD
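Inside each VMFleet guest the load is again generated by DISKSPD; the 64 KB, 70/30 profile corresponds roughly to the following invocation (thread and queue-depth values are assumptions, VMFleet's own sweep scripts set them per run):

  .\diskspd.exe -b64K -r -w30 -t2 -o8 -d120 -Sh -L C:\run\io.dat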
WINDOWS SERVER 2016
Azure Stack Integrated System
[Diagram: integrated rack with a BMC switch, two ToR switches, and the server nodes]
Architecture, hardware, and topology
Security and privacy
Deployment, configuration, provisioning
Validation
Monitoring, diagnostics
Business continuity
Patching and updating
Field replacement of parts