ebay: db capacity planning at ebay
Post on 15-Apr-2017
Feng Qu, Sr MTS
Bass Chorng, Principal Capacity Engineer
DB Capacity Planning at eBay
#CassandraSummit2015
Who Am I?
Bass Chorng
Principal Capacity Engineer @ eBay
Specializes in database performance, availability & scalability in a large website.
Established DB capacity team at eBay in 2003.
Loves mountain biking.
eBay Site DB Traffic At A Glance
NoSQL Total 52 B/Day
o Cassandra 15 B
o Mongo 15 B
o CouchBase 12 B
o PushVM 10 B
RDBMS Total 350 B
o MySQL 10 B
o Oracle 340 B
Peak Traffic 8M/sec
Site Total DB Calls 400B/Day across 2,000 NoSQL Nodes + 450 Oracle Nodes
Hosting 800M Active items & 120M Active Users
Y-o-Y Growth 30% ~ 35%
[Chart: Billion SQL Calls per Day. Cassandra 15, Mongo 15, CouchBase 12, PushVM 10, MySQL 10, Oracle 340]
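As a rough cross-check of the figures on this slide, the daily total can be converted into an average per-second rate and compared with the quoted peak. This is a back-of-the-envelope sketch: only the 400B/day and 8M/sec figures come from the slide, and the variable names are illustrative.

```python
# Figures quoted on the slide.
calls_per_day = 400e9   # site total DB calls per day
peak_per_sec = 8e6      # peak traffic, calls per second

seconds_per_day = 24 * 60 * 60
avg_per_sec = calls_per_day / seconds_per_day
peak_to_avg = peak_per_sec / avg_per_sec

print(f"average: {avg_per_sec / 1e6:.2f}M calls/sec")
print(f"peak-to-average ratio: {peak_to_avg:.2f}")
```

The average works out to roughly 4.6M calls/sec, so the quoted peak is about 1.7 times the daily average.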
Capacity Planning - Simply Put
Analyze Traffic
o Data
Analyze Utilization
o Data
Analyze The Relationship Of The Above Two
o Same Data
Forecast Growth
o Simple Models, Then Impress Your Boss.
Convert Resource Need into $
o A Calculator, Then Impress Your CIOs
BTW, You Also Need To Know
Platform Domain Knowledge
o Server, DB Engine, IO Subsystem, Networks
Relationship Between System Overhead & Utilization
Seasonality & Workload Characteristics
Bottlenecks
o Components, Systems, Platforms, Architecture, Site & Apps
New Technologies
Domain Knowledge Stack (aka Whom To Blame Stack)
APPS
DB
UNIX
STORAGE
Bottom of food chain => CAPACITY
Data
What To Collect?
o Apps, Database, Sessions, CPU, Memory, Connections, IOPS, IO Time, NIC, HBA, Array
How To Collect?
o Time Resolution, Aggregation Level, Retention
How To Use It?
o Average, Max, 95th percentile, Dashboard, Reporting, Trending
[Chart: daily series, 5/1/2015 to 5/27/2015, y-axis 0.0 to 4.0; title and labels not recoverable]
[Chart: daily series, 1/26/2015 to 3/1/2015, y-axis 0 to 40,000,000; title and labels not recoverable]
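The three summary statistics named under "How To Use It?" (average, max, 95th percentile) can be sketched as follows. The sample values and the `summarize` helper are hypothetical, and the nearest-rank percentile used here is one common definition, not necessarily the one used at eBay.

```python
def summarize(samples):
    """Return average, max and nearest-rank 95th percentile of samples."""
    ordered = sorted(samples)
    # Nearest-rank method: rank = ceil(0.95 * n), computed in integers.
    rank = (95 * len(ordered) + 99) // 100
    return {
        "avg": sum(ordered) / len(ordered),
        "max": ordered[-1],
        "p95": ordered[rank - 1],
    }

# Hypothetical CPU utilization samples (%), one per minute, with one spike.
cpu = [22, 25, 24, 23, 26, 25, 24, 27, 23, 25,
       24, 26, 25, 23, 24, 25, 26, 24, 25, 91]
stats = summarize(cpu)
print(stats)
```

With this data the single spike sets the max but leaves the 95th percentile close to the typical values, which is why the three statistics are usually read together.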
Forecast
Model Traffic, Not Resources
Need One Year Trend
Forecast At Daily Level
Eliminate Outliers
No Data Is Better Than Wrong Data
Convert Traffic To Resource Usage
Linear Extrapolation Only (CPU Utilization, not IO Time)
Simple Excel Formula Works Well
For Long Term Resource Planning Only
Use Average, Not Max
Not All Workloads Are Predictable
[Chart: CATY Traffic Forecast. Y-axis: Billion Calls (0 to 70). X-axis: 01/01/2012 to 01/01/2015. Series: Forecast, Actual, Capacity]
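The forecasting steps on this slide (drop outliers, fit a linear trend to daily traffic, extrapolate, then convert traffic to CPU utilization) can be sketched in a few lines. Everything below is an assumption for illustration: the data is synthetic, the median-based outlier rule and its `tolerance` are arbitrary choices, and `cpu_pct_per_billion` stands in for whatever traffic-to-utilization ratio is actually measured.

```python
from statistics import median

def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def drop_outliers(points, tolerance=0.5):
    """Drop days whose traffic is more than tolerance * median away from the median."""
    mid = median(y for _, y in points)
    return [(x, y) for x, y in points if abs(y - mid) <= tolerance * mid]

# Synthetic daily traffic in billions of calls; day 5 is a bad sample.
history = [(day, 10.0 + 0.1 * day) for day in range(10)]
history[5] = (5, 0.5)

clean = drop_outliers(history)
slope, intercept = fit_line([d for d, _ in clean], [y for _, y in clean])

forecast_day = 365
forecast_traffic = intercept + slope * forecast_day

# Assumed linear relationship: average CPU % per billion calls per day.
cpu_pct_per_billion = 2.0
forecast_cpu = forecast_traffic * cpu_pct_per_billion

print(f"traffic on day {forecast_day}: {forecast_traffic:.1f}B calls")
print(f"projected average CPU: {forecast_cpu:.0f}%")
```

Dropping the bad day rather than trying to repair it follows the slide's "No Data Is Better Than Wrong Data"; the same fit is what a spreadsheet trend formula would produce.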
Things To Watch For
Myths
o More CPU Makes Apps Run Faster
o More Data Makes Apps Run Slower
o Apps Run Twice As Fast On CPU Twice The Speed
o High Session = High Load
Pitfalls
o Cause VS. Symptom
o Time Resolution Masks Issues
o Look At The Whole Picture
o Slow Down In Order To Go Faster < Throttle >
Challenges
o Data Quality: Data Missing, Data Source Changes, F/O Data Residency, Data Errors
o Varieties of Data Formats & Resolutions
o Data Collection In Secured Zones
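The pitfall "Time Resolution Masks Issues" is easy to see with a small, made-up example: a short burst that saturates a resource nearly vanishes once samples are averaged over a longer window. The numbers below are hypothetical.

```python
# Hypothetical per-second CPU utilization (%) over five minutes:
# a steady 20% with a ten-second burst at 100%.
per_second = [20.0] * 300
per_second[120:130] = [100.0] * 10

five_minute_avg = sum(per_second) / len(per_second)
one_second_max = max(per_second)

print(f"5-minute average: {five_minute_avg:.1f}%")
print(f"1-second max: {one_second_max:.0f}%")
```

At five-minute resolution the window averages under 23%, while the one-second data shows the host was fully busy for ten seconds.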
Me: Everything NoSQL
CassandraSummit2015 | #CassandraSummit
Prior to 2011: Worked on Oracle at DoubleClick/Yahoo/Intuit
Worked on NoSQL at eBay Database Infrastructure team:
o Cassandra since 2011
o MongoDB since 2012
o Couchbase since 2014
Cassandra Summit speaker for 2013, 2014, 2015
DataStax Cassandra MVP for 2014, 2015
For Cassandra
Capacity Measurements
o Throughput
o Latency
o E.g. 30,000 reads/sec with SLA of P99 at 5ms
Hardware SKU Example
o CPU: 20 cores
o Memory: 128GB RAM
o Storage: 1.5TB local SSD
o Network: 10g NIC
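A capacity measurement of the form "30,000 reads/sec with SLA of P99 at 5ms" can be checked against benchmark output along these lines. This is a minimal sketch: `p99` and `meets_sla` are hypothetical helpers, the latency samples are invented, and the nearest-rank percentile is an assumed definition.

```python
def p99(latencies_ms):
    """Nearest-rank 99th percentile of a list of latencies in milliseconds."""
    ordered = sorted(latencies_ms)
    rank = (99 * len(ordered) + 99) // 100   # ceil(0.99 * n) in integers
    return ordered[rank - 1]

def meets_sla(reads_per_sec, latencies_ms,
              min_reads_per_sec=30_000, max_p99_ms=5.0):
    """True when both the throughput target and the P99 latency target hold."""
    return reads_per_sec >= min_reads_per_sec and p99(latencies_ms) <= max_p99_ms

# Invented benchmark samples: mostly 2 ms reads with a slow tail at 12 ms.
good_run = [2.0] * 990 + [12.0] * 10
bad_run = [2.0] * 985 + [12.0] * 15

print(meets_sla(32_000, good_run))
print(meets_sla(32_000, bad_run))
```

Both runs have the same throughput and nearly the same average latency; only the size of the slow tail decides whether the P99 target holds.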
Benchmarking
Benchmarking for different hardware
o High I/O SKU
o High memory SKU
o High storage SKU
o Bare metal or cloud
Benchmarking for different software releases
Benchmarking for different workloads
o 100% Writes
o 50% Writes, 50% Reads
o 5% Writes, 95% Reads
o 100% Reads
Benchmarking Tools
o YCSB
o cassandra-stress
Proactive and repeated process using near real-time traffic in a prod-like environment
Capacity Planning
Key to avoiding surprises in production
The concept behind capacity planning is simple, but the mechanics are harder.
Business requirements may increase, so you need to forecast how much resource must be added to the system to ensure that user experience continues uninterrupted.
Input: a clearly defined capacity goal from business requirements and a performance baseline from benchmark tests
Output: the resources to be added, such as memory, CPU, storage, I/O, network
Always prepare for peak + headroom
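"Peak + headroom" can be turned into a node count once a per-node throughput baseline is available from benchmarking. The sketch below is illustrative only: `nodes_needed` is a hypothetical helper, the 30% headroom and three-node minimum are assumed defaults, and the per-node figure is taken as an example rather than a measured eBay number.

```python
import math

def nodes_needed(peak_ops_per_sec, per_node_ops_per_sec,
                 headroom=0.3, min_nodes=3):
    """Nodes required to serve peak traffic plus a fractional headroom."""
    required = peak_ops_per_sec * (1 + headroom)
    return max(min_nodes, math.ceil(required / per_node_ops_per_sec))

# Example: 400,000 reads/sec at peak, 30,000 reads/sec per node within SLA.
print(nodes_needed(400_000, 30_000))
# A small cluster still gets the assumed minimum node count.
print(nodes_needed(10_000, 30_000))
```

Sizing against the daily average instead of the peak would give a noticeably smaller cluster, which is the surprise in production this slide warns about.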
Capacity Planning Process
Initial Sizing
o Storage size vs. data size
o Compaction overhead, compression ratio, RF, indexes
o Cost-effective configuration to meet capacity/latency SLA
Routine Review
o System utilization on I/O, storage, network, CPU, memory etc.
o Cassandra metrics on GC, compaction, latency, throughput etc.
o compactionstats, cfhistograms, tpstats etc.
Forecasting
o Historical comparison
o Traffic projection
Flex up or Flex down
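For the initial sizing step, the gap between data size and storage size can be estimated from the factors the slide lists: replication factor, compression ratio, index overhead and compaction overhead. The formula and every parameter value below are assumptions for illustration (for example, reserving half the disk for compaction is a conservative rule of thumb, not a figure from the talk); only the 1.5TB-per-node disk size echoes the SKU example.

```python
import math

def required_storage_gb(raw_data_gb, replication_factor, compression_ratio,
                        index_overhead, compaction_free_fraction):
    """Estimated cluster-wide disk needed for a given raw data size.

    compression_ratio is compressed size / uncompressed size;
    index_overhead is extra space as a fraction of the data;
    compaction_free_fraction is the share of disk kept free for compaction.
    """
    on_disk = raw_data_gb * replication_factor * compression_ratio
    on_disk *= 1 + index_overhead
    return on_disk / (1 - compaction_free_fraction)

# Hypothetical workload: 12 TB raw data, RF 3, 2:1 compression,
# 10% index overhead, half of each disk kept free for compaction.
total_gb = required_storage_gb(12_000, 3, 0.5, 0.1, 0.5)
nodes_for_storage = math.ceil(total_gb / 1_500)   # 1.5 TB usable per node

print(f"cluster storage: {total_gb:.0f} GB")
print(f"nodes needed for storage alone: {nodes_for_storage}")
```

The storage-driven node count is only one constraint; it would be compared with the throughput-driven count and the larger of the two used.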
Scale Up vs. Scale Out
Scale Up (vertical)
Pros
o Smaller data center footprint, such as space, power, cooling
o Less license cost
Cons
o Likely cost more using proprietary hardware
o Less fault tolerant
o Limited upgradability in future
Scale Out (horizontal)
Pros
o Cheaper using commodity hardware
o More fault tolerant
o (unlimited) upgradability
Cons
o Bigger data center footprint
o More license cost
o Likely need more network equipment
Questions ?
eBay is hiring experienced NoSQL professionals, please send resume to fengqu@ebay.com