VMworld 2013: Extreme Performance Series: Storage in a Flash
DESCRIPTION
VMworld 2013. Sankaran Sivathanu, VMware; Mark Achtemichuk, VMware. Learn more about VMworld and register at http://www.vmworld.com/index.jspa?src=socmed-vmworld-slideshare
TRANSCRIPT
Extreme Performance Series:
Storage in a Flash
Sankaran Sivathanu, VMware
Mark Achtemichuk, VMware
VSVC5603
#VSVC5603
2
Flash Storage
Flash is everywhere
Used extensively in smartphones, tablets, laptop computers, storage arrays, etc.
Adopting flash in enterprise servers?
• Presents an economical alternative to having a storage array
How does VMware embrace flash technology in vSphere 5.5?
• Native support for provisioning of flash resources
• Flash caching support in ESXi storage stack
• vSAN leverages flash storage for high performance
Focus:
Application performance on vSphere 5.5 when leveraging flash
3
Agenda
vSphere Flash Read Cache (vFRC)
Virtual SAN (vSAN)
Configurations and Troubleshooting
4
vFRC Overview
5
vFRC – Overview
vSphere 5.5 introduces vSphere Flash Infrastructure layer
• Aggregates flash storage devices into a unified flash resource
• Supports locally connected flash devices (PCIe, SAS/SATA drives, etc.)
Flash resource can be used for read caching of VM I/Os
• vSphere Flash Read Cache (vFRC)
Write Policy
• Write-through: write I/Os are written to persistent storage and vFRC simultaneously
• Large writes are filtered – avoids cache pollution with log/streaming data
Caches are configured on a per-VMDK basis
• Can be custom configured based on workload
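The write policy above can be illustrated with a toy model. This is a minimal sketch, not vFRC's implementation: the class name, the LRU eviction policy, and the `large_write_blocks` cutoff are all assumptions made for illustration, since the slides do not state the eviction policy or the size at which a write counts as "large".

```python
from collections import OrderedDict


class WriteThroughReadCache:
    """Toy block cache: reads fill the cache, writes always reach backing storage."""

    def __init__(self, capacity_blocks, large_write_blocks):
        self.capacity_blocks = capacity_blocks        # cache size, in blocks
        self.large_write_blocks = large_write_blocks  # hypothetical "large write" cutoff
        self.cache = OrderedDict()                    # block number -> data, in LRU order
        self.backing = {}                             # stands in for persistent storage
        self.hits = 0
        self.misses = 0

    def _insert(self, block, data):
        self.cache[block] = data
        self.cache.move_to_end(block)
        while len(self.cache) > self.capacity_blocks:
            self.cache.popitem(last=False)            # evict least recently used (assumed policy)

    def read(self, block):
        if block in self.cache:
            self.hits += 1
            self.cache.move_to_end(block)
            return self.cache[block]
        self.misses += 1
        data = self.backing.get(block)
        self._insert(block, data)                     # cache fill on a read miss
        return data

    def write(self, first_block, blocks):
        """Write a run of consecutive blocks; `blocks` is a list of data items."""
        is_large = len(blocks) >= self.large_write_blocks
        for offset, data in enumerate(blocks):
            block = first_block + offset
            self.backing[block] = data                # always written to persistent storage
            if is_large:
                self.cache.pop(block, None)           # filtered: drop stale copy, do not cache
            else:
                self._insert(block, data)             # write-through: cache updated as well
```

Because every write reaches the backing store, dropping or losing the cache never loses data in this model; only read latency is affected.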
6
vFRC – Overview
[Diagram: Flash Read Cache and VMDKs across the VM layer, ESX layer, and storage layer]
7
vFRC Tunables
8
Performance Tunables in vFRC
What workloads can benefit from vFRC?
• Read-dominated I/O pattern
• High repeated access of data (e.g., 20% of the working set accessed 80% of the time)
• Sufficient flash capacity to hold data that is accessed repeatedly
What impacts vFRC performance?
• Cache Size – should be big enough to hold active working set of workload
• Cache Block Size – should match the dominant I/O size of workload
• Flash Device Types – PCIe flash cards vs. SSD drives
9
a. Cache Size
Cache sizes are specified manually when enabling vFRC for a VMDK
• Depends on working set of the application
• Should be sized to hold active working set
Inadequate cache sizes lead to an increased cache miss rate
An over-sized cache wastes flash resources and leads to sub-optimal performance during vMotion
• By default, the cache is migrated during vMotion
• An over-sized cache increases vMotion time
How to determine the right working set size?
• vscsiStats workload tracing
Cache size can be modified at run-time if necessary
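One way to turn an I/O trace into a working set estimate is to count the distinct blocks touched by reads. The sketch below assumes a hypothetical trace format of `(op, byte_offset, length_bytes)` tuples; it is not the actual vscsiStats trace format, which would need to be parsed into this shape first.

```python
def working_set_bytes(trace, block_size=8 * 1024):
    """Estimate the read working set as distinct blocks touched, in bytes.

    trace: iterable of (op, byte_offset, length_bytes); op is "R" or "W".
    This tuple layout is an assumption for illustration only.
    """
    touched = set()
    for op, offset, length in trace:
        if op != "R" or length <= 0:
            continue                       # only reads matter for a read cache
        first = offset // block_size
        last = (offset + length - 1) // block_size
        touched.update(range(first, last + 1))
    return len(touched) * block_size
```

Running this over a trace window that covers a representative period gives a starting point for the cache size; repeated reads of the same block are counted once.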
10
b. Cache Block Size
Basic unit of cache fill and cache eviction operation
Affects effective utilization of cache capacity
Bigger cache blocks lead to internal fragmentation, but consume less memory
Smaller cache blocks consume more memory (up to 2% memory space overhead)
Default cache block size is 8KB
[Chart: Memory overhead with respect to vFRC block size. X-axis: cache block size (KB), 4 to 512. Y-axis: memory consumed / cache size (%), 0 to 2.5.]
[Chart: Performance impact of cache block size. X-axis: cache block size, 4KB to 512KB, plus baseline (no vFRC). Y-axis: latency in microseconds, 0 to 1500.]
11
b. Cache Block Size
Larger Cache Block Size (Example: 512KB cache block size for workload I/O size of 8KB) – Internal Fragmentation
[Diagram: vFRC cache blocks, each only partly filled with valid cached data]
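The internal fragmentation in the example above can be put into numbers. This is a back-of-the-envelope sketch that assumes the worst case, where each cached I/O lands in its own cache block (fully random access); the function names are illustrative.

```python
def cache_block_utilization(io_size_kb, cache_block_kb):
    """Fraction of a cache block holding valid data for a single I/O of io_size_kb."""
    if io_size_kb >= cache_block_kb:
        return 1.0
    return io_size_kb / cache_block_kb


def effective_cache_gb(cache_gb, io_size_kb, cache_block_kb):
    """Cache capacity that actually holds valid data under the worst-case assumption."""
    return cache_gb * cache_block_utilization(io_size_kb, cache_block_kb)
```

For an 8KB workload I/O size and a 512KB cache block, `cache_block_utilization(8, 512)` is 1/64, so under this assumption an 8GB cache would hold only about 0.125GB of valid data.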
12
c. Flash Device Type
VENDOR SPECIFICATIONS
Low Cost
• 30k – 40k Random Read IOPS
• 200 – 270 MB/s Read Bandwidth
• Read Latency – 75 microseconds
• Write Latency – 90 microseconds
High Performance
• Up to 750k Random Read IOPS
• Up to 3 GB/s Read Bandwidth
• Read Latency – 75 microseconds
• Write Latency – 15 microseconds
13
vFRC Performance
14
vFRC Performance – Applications
What workloads can benefit from vFRC?
• Read-dominated I/O pattern
• High repeated access of data (e.g., 20% of the working set accessed 80% of the time)
• Sufficient flash capacity to hold data that is accessed repeatedly
Applications Considered:
• Data Warehousing (Swingbench DSS)
• Database Transactions (DVDstore)
• Real-World Enterprise Server Workloads (Publicly available I/O Traces)
15
1. Data Warehousing Application
Decision Support System [TPC-H]
Benchmark: Swingbench 2.4 using ‘Sales History’ schema on Oracle 11g R2 database
[Diagram: test setup]
• Swingbench DSS benchmark on RHEL 6.4 VM (sends queries, receives results)
• Oracle 11g R2 on Windows 2008 Server VM, with vFRC
• EMC VNX 5700: 1TB LUN, RAID5 over 5 FC 15k RPM HDDs
16
1. Data Warehousing Application
Workload: Read dominated, High re-access rate
vFRC Configuration: 8GB Cache Size and 8KB Cache block size
[Chart: Transaction count by transaction type (SRMC, SCMC, PSCR, SMA, PPSC, TSQ, SQC), Baseline vs. vFRC. Y-axis: number of transactions, 0 to 12000.]
17
1. Data Warehousing Application
Workload: Read dominated, High re-access rate
vFRC Configuration: 8GB Cache Size and 8KB Cache block size
Up to 84% improvement in average throughput
Up to 2X reduction in latency
[Chart: Transactions per minute (TPM). Baseline: 61.7; vFRC: 112.9.]
[Chart: Average response time (s). Baseline: 20.389; vFRC: 10.859.]
18
2. Database Transaction Application
Benchmark : DVDStore
• Simulates online e-commerce site operations
• Database : MS SQL Server 2008
• Database Size : 15 GB
Workload Characteristics
• 60% reads
• Mostly random I/Os
• Predominant I/O size : 8KB
VM Configuration
• 8 vCPUs, 8GB Memory
• 25GB Database disk, 10GB Log disk
Storage Array
• VNX 5700, 1TB LUN – RAID5 over 5 FC 15k RPM disk drives
19
2. Database Transaction Application
[Chart: Orders per minute. Baseline: 8802; vFRC-10GB: 8937; vFRC-15GB: 12319.]
Up to 39% improvement in application throughput
20
3. Enterprise Server I/O Traces
a. Hardware Monitoring Server Workload
• Trace from servers that log data from multiple hardware monitoring programs across a datacenter
• Collected at Microsoft Research, Cambridge*
• Trace replayed using IOAnalyzer
95% reads
vFRC size – 4GB
vFRC block size – 4KB
vFRC hit percentage – 85%
[Chart: Average latency (ms). Baseline: 1.23; vFRC: 0.321.]
* Dushyanth Narayanan, Austin Donnelly, and Antony Rowstron. 2008. Write off-loading: Practical power management for enterprise storage.
Trans. Storage 4, 3, Article 10 (November 2008).
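The link between hit percentage and average latency can be approximated with a weighted average. This is a simplified model under stated assumptions, not the method used to produce the results above: it treats every I/O as a read, ignores queueing, and takes the cache and backend latencies as inputs that would have to be measured separately.

```python
def expected_read_latency_ms(hit_fraction, cache_latency_ms, backend_latency_ms):
    """Weighted-average read latency for a given cache hit fraction (0.0 to 1.0)."""
    if not 0.0 <= hit_fraction <= 1.0:
        raise ValueError("hit_fraction must be between 0 and 1")
    miss_fraction = 1.0 - hit_fraction
    return hit_fraction * cache_latency_ms + miss_fraction * backend_latency_ms
```

For instance, with an 85% hit rate, a backend latency equal to the 1.23 ms baseline, and a hypothetical 0.1 ms flash latency, the model predicts roughly 0.27 ms. The measured value depends on the real device latency and on the write fraction, so the model is only a rough guide.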
21
3. Enterprise Server I/O Traces
b. Proxy Server Workload
• Trace from a web proxy server
• Collected at Microsoft Research, Cambridge*
• Trace replayed using IOAnalyzer
67% reads
vFRC size: 16GB
vFRC block size: 4KB
vFRC hit percentage: 83%
* Dushyanth Narayanan, Austin Donnelly, and Antony Rowstron. 2008. Write off-loading: Practical power management for enterprise storage.
Trans. Storage 4, 3, Article 10 (November 2008).
[Chart: Average latency (ms). Baseline: 1.357; vFRC: 0.612.]
22
vMotion Performance with vFRC
vFRC is fully supported by vMotion and other vSphere features
vMotion behavior of vFRC-enabled VM
• VM caches are migrated by default
• Option to drop cache during vMotion
Migrating cache preserves application performance gains
• Consumes more network bandwidth
• Increased vMotion time
Dropping cache during vMotion leads to a temporary dip in application performance gains
• No extra overhead in vMotion
• Cache is re-warmed at the destination
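The extra vMotion time from migrating the cache can be bounded from below by cache size divided by network bandwidth. This is a rough sketch: the `efficiency` parameter is a hypothetical fudge factor for protocol overhead and link sharing, and only the occupied part of the cache may actually need to move.

```python
def cache_transfer_seconds(cache_gb, link_gbps, efficiency=1.0):
    """Lower-bound time to copy cache_gb (decimal gigabytes) over a link_gbps link."""
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_gbps must be positive and efficiency in (0, 1]")
    bits_to_move = cache_gb * 8 * 1e9
    return bits_to_move / (link_gbps * 1e9 * efficiency)
```

An 8GB cache needs at least about 6.4 seconds on a 10 Gb/s link and about 64 seconds on a 1 Gb/s link, which is one input to the migrate-or-drop decision.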
23
vFRC Performance Best Practices
24
vFRC Configuration Guidelines
Cache size may be configured based on working set of workload
• Start with about 20% of VMDK size, and monitor vFRC stats to re-configure it
Cache block size must match dominant I/O size of workload
• Workload I/O size is not equal to VM I/O Size!
vFRC performs better with PCIe flash devices
Decide on cache migration behavior during vMotion based on:
• Criticality of application performance
• Time taken for vMotion
• Network bandwidth availability
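The first two guidelines can be expressed as a starting configuration. This is an illustrative sketch: the histogram is assumed to be a plain mapping from I/O size in KB to I/O count (for example, summarized from the vscsiStats I/O length histogram), and both function names are hypothetical.

```python
def initial_cache_size_gb(vmdk_size_gb, fraction=0.20):
    """Starting cache size: about 20% of the VMDK size, to be tuned from vFRC stats."""
    return vmdk_size_gb * fraction


def dominant_io_size_kb(io_length_histogram):
    """Return the I/O size (KB) with the highest count in the histogram."""
    if not io_length_histogram:
        raise ValueError("histogram is empty")
    return max(io_length_histogram, key=io_length_histogram.get)
```

For a 25GB database disk this suggests starting near 5GB, and a histogram such as `{4: 1200, 8: 9500, 64: 300}` points to an 8KB cache block size. Both values are starting points to be revisited once vFRC stats are available.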
25
Making sense of vscsiStats and vFRC Stats
vscsiStats can be used to know more about the workload
• IO Length Histogram
• Read Write Ratio
• I/O trace to compute working set size
vFRC stats provide information about cache effectiveness
• numBlocks – total number of cache blocks for a VMDK
• numBlocksCurrentlyCached – number of cache blocks that actually contain data
• Evict:avgNumBlocksPerOp – average number of evictions
• avgCacheLatency – average device latency of flash resource
• maxCacheLatency – maximum device latency of flash resource
• cacheHitPercentage – percentage of read cache hits
26
vFRC Sizing Decisions based on vFRC Stats
Using vFRC stats to make sizing decisions
• numBlocksCurrentlyCached < numBlocks : cache size may be reduced
• numBlocksCurrentlyCached = numBlocks and Evict:avgNumBlocksPerOp is high : cache size may be inadequate
• maxCacheLatency is very high : may be because of spike in device latency, which may mean device has worn out
• cacheHitPercentage is high and Evict:avgNumBlocksPerOp is low : means cache is correctly configured
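The sizing rules above can be written as a small checker. This is a sketch under assumptions: the stats are taken as a plain dictionary keyed by the counter names listed earlier, latency is assumed to be in microseconds, and the cutoffs for "high" (`high_evictions`, `high_hit_pct`, `high_latency_us`) are hypothetical values that each environment would need to choose.

```python
def vfrc_sizing_hints(stats, high_evictions=1.0, high_hit_pct=80.0,
                      high_latency_us=1000.0):
    """Apply the slide's sizing rules to one VMDK's vFRC stats; returns a list of hints."""
    total = stats["numBlocks"]
    cached = stats["numBlocksCurrentlyCached"]
    evictions = stats["Evict:avgNumBlocksPerOp"]
    hints = []
    if cached < total:
        hints.append("cache size may be reduced")
    elif cached == total and evictions >= high_evictions:
        hints.append("cache size may be inadequate")
    if stats["maxCacheLatency"] >= high_latency_us:
        hints.append("device latency spike: check flash device wear")
    if stats["cacheHitPercentage"] >= high_hit_pct and evictions < high_evictions:
        hints.append("cache appears correctly configured")
    return hints
```

The hints are not mutually exclusive: a cache can be both effective and larger than it needs to be.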
For more detailed information, please refer to our performance whitepaper http://www.vmware.com/files/pdf/techpaper/vfrc-perf-vsphere55.pdf
27
Agenda
vSphere Flash Read Cache (vFRC)
Virtual SAN (vSAN)
Configurations and Troubleshooting
28
Virtual SAN - Architecture
Each ESXi host contributes:
• Flash storage to absorb IOPS
• Hard disk drives to provide capacity
Virtual SAN aggregates these resources from multiple servers in a vSphere cluster
• Provides a global datastore for VMs in the cluster
HA/DRS ensures that a VM is restarted after a host crash
Virtual SAN objects can be split into multiple components for performance and data protection
• Governed by storage policies
[Diagram: VSAN cluster of ESX hosts; a VM's virtual disk is a VSAN object with components replica-1, replica-2, and a witness]
29
Experiment Setup
Hardware
• 16 core 2.9 GHz Dell R720 machines
• 2 x Intel PCIe R910 SSD – 200GB (1 PCIe slot)
• 12 x 10K RPM Seagate SAS disks
• 10G vSAN dedicated network, 1G for VM network
VSAN Configuration
• 2 x Disk groups per machine, 6 x Disks per disk group
• hostFailuresToTolerate 1, stripeWidth 1
Workload Characteristics
• ViewPlanner 3.0 Standard benchmark with 2 sec think-time (heavy user)
• ViewPlanner Group A : CPU intensive & Group B: I/O intensive apps.
• 1900 x 1200 resolution, PCoIP
• Windows 7 desktops and Windows XP clients
• VDI workload is known to be CPU intensive but sensitive to I/O latency.
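The hostFailuresToTolerate setting in the VSAN configuration above determines how many replicas an object has. The sketch below encodes the usual relationship for mirrored objects (n + 1 replicas and at least 2n + 1 hosts to tolerate n failures); this is general Virtual SAN behavior rather than something stated on the slides, and the calculation ignores witness and metadata overhead as well as stripe width.

```python
def vsan_mirror_layout(host_failures_to_tolerate, vmdk_gb):
    """Replica count, minimum host count, and raw capacity for a mirrored object."""
    if host_failures_to_tolerate < 0:
        raise ValueError("host_failures_to_tolerate must be non-negative")
    replicas = host_failures_to_tolerate + 1
    return {
        "replicas": replicas,
        "min_hosts": 2 * host_failures_to_tolerate + 1,
        "raw_capacity_gb": replicas * vmdk_gb,   # excludes witness and metadata overhead
    }
```

With hostFailuresToTolerate 1, a 40GB virtual disk would have 2 replicas, need at least 3 hosts, and consume about 80GB of raw capacity, which matches the smallest cluster size used in these experiments.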
30
Virtual SAN Delivers IOPS Required by VDI
• Virtual SAN can meet the IOPS required by VDI workload
31
Virtual SAN Scale
[Chart: Virtual SAN scale. Number of heavy VDI users by cluster size: 3 node: 275; 5 node: 460; 7 node: 667. Series: VSAN, with a linear trend line.]
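The scaling results above can be checked for linearity with a per-node average and a least-squares fit. This is a small sketch using only the three data points from the chart; the function names are illustrative.

```python
def users_per_node(results):
    """Average number of heavy VDI users supported per node, by cluster size."""
    return {nodes: users / nodes for nodes, users in results.items()}


def linear_fit(results):
    """Least-squares slope and intercept for users as a function of node count."""
    xs = list(results.keys())
    ys = list(results.values())
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Data points from the Virtual SAN scale chart: nodes -> heavy VDI users.
vsan_results = {3: 275, 5: 460, 7: 667}
```

The per-node figure stays in the low-to-mid 90s across all three cluster sizes, and the fitted slope is about 98 additional users per added node, consistent with near-linear scaling over this range.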
32
Group A Score Comparison
[Chart: Group A average application latency, VSAN vs. SAN. Y-axis: 0 to 1.2.]
• Impact to Group A application latencies is marginal
• Virtual SAN uses very few host CPU cycles
33
Group B Score Comparison
[Chart: Group B average application latency, VSAN vs. All-Flash-SAN. Y-axis: 0 to 7.]
• Group B application latencies are close to All-Flash-SAN
• Virtual SAN can meet the IOPS required by VDI workload
34
VSAN VDI Consolidation compared to physical SAN array
VSAN performs better than a typical mid-range FC storage array
• vSAN benefits from local flash storage that provides high performance
Impact of VSAN CPU consumption on application performance is low
Physical SAN array is not required to run VDI workload
[Chart: VDI consolidation ratio. X-axis: number of nodes (servers): 3, 5, 7. Y-axis: number of VMs, 0 to 800. Series: mid-range FC array, vSAN, all-flash FC array.]
35
Agenda
vSphere Flash Read Cache (vFRC)
Virtual SAN (vSAN)
Configurations and Troubleshooting
36
Define the Performance Issue
Understand Application Function & Architecture
• At a minimum know what your application does and what it’s dependent on
Select Application KPIs
• Application performance must be measured using an application counter (tps, response time, etc.) and not virtual resource consumption
Define Success Criteria
• With your app owner, define at what level the application KPIs must be to consider it performant
Comparisons must be Apples-to-Apples
• Any changes to infrastructure (physical or virtual) create comparison challenges
Now That the Gap Is Identified, Begin Troubleshooting
• With an understanding of the requirements and current deficiency, you can now begin to investigate and/or tune
37
Disk I/O Latencies
[Diagram: I/O path through Application, Guest OS, VMM, vSCSI, ESX Storage Stack, Driver, HBA, Fabric, and Array SP, annotated with GAVG, KAVG, QAVG, and DAVG]
* KAVG = GAVG – DAVG
Time spent in the ESX storage stack is minimal; for all practical purposes KAVG ~= QAVG
In a well-configured system QAVG should be zero
38
Key Indicators – Investigative Thresholds
Kernel Latency Average (KAVG)
• This counter tracks the latencies of I/O passing through the kernel
• Investigation Threshold: 1ms
Device Latency Average (DAVG)
• This is the latency seen at the device driver level. It includes the round-trip time between the HBA and the storage
• Investigation Threshold: 15-20ms, lower is better, some spikes okay
Aborts (ABRT/s)
• The number of commands aborted per second
• Investigation Threshold: 1
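The thresholds above, together with the relationship KAVG = GAVG – DAVG, can be turned into a simple screening check. This is a minimal sketch: the inputs are assumed to be values already read from esxtop, the DAVG cutoff uses the lower end of the 15-20ms range, and treating a value equal to the threshold as worth investigating is an assumption.

```python
KAVG_THRESHOLD_MS = 1.0
DAVG_THRESHOLD_MS = 15.0       # slides give 15-20ms; the lower bound is used here
ABORTS_THRESHOLD_PER_S = 1.0


def investigate(gavg_ms, davg_ms, aborts_per_s):
    """Return (kavg_ms, list of counters that meet their investigation threshold)."""
    kavg_ms = gavg_ms - davg_ms            # KAVG = GAVG - DAVG
    findings = []
    if kavg_ms >= KAVG_THRESHOLD_MS:
        findings.append("KAVG")
    if davg_ms >= DAVG_THRESHOLD_MS:
        findings.append("DAVG")
    if aborts_per_s >= ABORTS_THRESHOLD_PER_S:
        findings.append("ABRT/s")
    return kavg_ms, findings
```

A result such as `investigate(22.5, 20.0, 0.0)` flags both KAVG and DAVG. A flag is a prompt to look closer rather than a diagnosis, since the slides note that some DAVG spikes are acceptable.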
39
Disk I/O Queues
GQLEN – Guest Queue
AQLEN – Adapter Queue
WQLEN – World Queue
DQLEN – Device / LUN Queue
SQLEN – Array SP Queue
DQLEN can change dynamically when SIOC is enabled
[Diagram: queues along the I/O path (Application, Guest OS, VMM, vSCSI, ESX Storage Stack, Driver, HBA, Fabric, Array SP): GQLEN, WQLEN, AQLEN, DQLEN, and SQLEN, with a legend marking the queues reported in esxtop]
40
Performance Technical Resources
Performance Technical Papers
• http://www.vmware.com/files/pdf/techpaper/vfrc-perf-vsphere55.pdf
• http://www.vmware.com/resources/techresources/cat/91,96
Performance Best Practices
• http://www.vmware.com/pdf/Perf_Best_Practices_vSphere4.0.pdf
• http://www.vmware.com/pdf/Perf_Best_Practices_vSphere4.1.pdf
• http://www.vmware.com/pdf/Perf_Best_Practices_vSphere5.0.pdf
• http://www.vmware.com/pdf/Perf_Best_Practices_vSphere5.1.pdf
Troubleshooting Performance Related Problems in vSphere Environments
• http://communities.vmware.com/docs/DOC-14905 (vSphere 4.1)
• http://communities.vmware.com/docs/DOC-19166 (vSphere 5)
• http://communities.vmware.com/docs/DOC-23094 (vSphere 5.x with vCOps)
41
Performance Community Resources
Performance Technology Pages
• http://www.vmware.com/technical-resources/performance/resources.html
Technical Marketing Blog
• http://blogs.vmware.com/vsphere/performance/
Performance Engineering Blog VROOM!
• http://blogs.vmware.com/performance
Performance Community Forum
• http://communities.vmware.com/community/vmtn/general/performance
Virtualizing Business Critical Applications
• http://www.vmware.com/solutions/business-critical-apps/
42
Extreme Performance Series Sessions
Extreme Performance Series:
vCenter of the Universe – Session #VSVC5234
Monster Virtual Machines – Session # VSVC4811
Network Speed Ahead – Session # VSVC5596
Storage in a Flash – Session # VSVC5603
Big Data:
Virtualized SAP HANA Performance, Scalability and Practices – Session # VAPP5591
Hands on Labs:
HOL-SDC-1304 – Optimize vSphere Performance includes vFRC
43
Other VMware Activities Related to This Session
HOL:
HOL-SDC-1308
Virtual Storage Solutions
Group Discussions:
VSVC1001-GD
Performance with Mark Achtemichuk
VSVC5603
THANK YOU