eBay: DB Capacity Planning at eBay

Post on 15-Apr-2017


  • Feng Qu, Sr MTS; Bass Chorng, Principal Capacity Engineer

    DB Capacity Planning at eBay

    #CassandraSummit2015

  • Who Am I?

    Bass Chorng
    Principal Capacity Engineer @ eBay
    Specializes in database performance, availability & scalability for a large website.
    Established the DB capacity team at eBay in 2003.
    Loves mountain biking.

  • eBay Site DB Traffic At A Glance

    NoSQL Total: 52 B/Day
    o Cassandra 15 B
    o Mongo 15 B
    o CouchBase 12 B
    o PushVM 10 B

    RDBMS Total: 350 B/Day
    o MySQL 10 B
    o Oracle 340 B

    Peak Traffic: 8M/sec
    Site Total DB Calls: 400B/Day across 2,000 NoSQL Nodes + 450 Oracle Nodes
    Hosting 800M Active Items & 120M Active Users
    Y-o-Y Growth: 30% ~ 35%

    [Chart: Billion SQL Calls per Day by engine: Cassandra 15, Mongo 15, CouchBase 12, PushVM 10, MySQL 10, Oracle 340]
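
    The daily total and the peak rate on this slide can be related with a little arithmetic. The sketch below converts 400B calls/day into an average per-second rate and compares it with the 8M/sec peak; the function names are illustrative only and not part of the talk.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_calls_per_second(calls_per_day: float) -> float:
    """Spread a daily call count evenly over 86,400 seconds."""
    return calls_per_day / SECONDS_PER_DAY


def peak_to_average_ratio(peak_per_second: float, calls_per_day: float) -> float:
    """How many times larger the peak rate is than the daily average rate."""
    return peak_per_second / average_calls_per_second(calls_per_day)


# Figures taken from the slide: 400B calls/day and a peak of 8M calls/sec.
average_rate = average_calls_per_second(400e9)
ratio = peak_to_average_ratio(8e6, 400e9)
print(f"average: {average_rate / 1e6:.2f}M/sec, peak-to-average: {ratio:.2f}x")
```

    With these figures the average comes out near 4.63M calls/sec, so the peak is roughly 1.7 times the average.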

  • Capacity Planning - Simply Put

    Analyze Traffic
    o Data
    Analyze Utilization
    o Data
    Analyze The Relationship Of The Above Two
    o Same Data
    Forecast Growth
    o Simple Models, Then Impress Your Boss.
    Convert Resource Need into $
    o A Calculator, Then Impress Your CIOs

    BTW, You Also Need To Know

    Platform Domain Knowledge
    o Server, DB Engine, IO Subsystem, Networks
    Relationship Between System Overhead & Utilization
    Seasonality & Workload Characteristics
    Bottlenecks
    o Components, Systems, Platforms, Architecture, Site & Apps
    New Technologies
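
    One simple way to "analyze the relationship" between traffic and utilization is an ordinary least-squares line through paired daily samples. This is a minimal sketch under that assumption; the sample numbers and function names are hypothetical, and the talk does not say which fitting method was used.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_utilization(traffic, slope, intercept):
    """Estimate utilization for a traffic level using the fitted line."""
    return slope * traffic + intercept


# Hypothetical samples: traffic in billions of calls/day, CPU utilization in %.
traffic = [10.0, 12.0, 14.0, 16.0, 18.0]
cpu_pct = [25.0, 29.0, 33.0, 37.0, 41.0]

slope, intercept = fit_line(traffic, cpu_pct)
print(f"CPU% ~= {slope:.2f} * traffic + {intercept:.2f}")
print(f"at 20B calls/day: {predict_utilization(20.0, slope, intercept):.1f}% CPU")
```

    The intercept approximates utilization at zero traffic (system overhead), and the slope approximates the cost of each extra unit of traffic.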

  • Domain Knowledge Stack

    aka Whom To Blame Stack

    APPS
    DB
    UNIX
    STORAGE
    CAPACITY (bottom of the food chain)

  • Data

    What To Collect?
    o Apps, Database, Sessions, CPU, Memory, Connections, IOPS, IO Time, NIC, HBA, Array

    How To Collect?
    o Time Resolution, Aggregation Level, Retention

    How To Use It?
    o Average, Max, 95th percentile, Dashboard, Reporting, Trending

    [Two time-series charts: one with a 0.0 to 4.0 y-axis over 5/1/2015 to 5/27/2015, one with a 0 to 40,000,000 y-axis over 1/26/2015 to 3/1/2015; series names not recoverable]
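
    The three summary statistics named on this slide (average, max, 95th percentile) can be computed as follows. This is a sketch that assumes the nearest-rank definition of a percentile; monitoring tools may interpolate differently, and the sample values are hypothetical.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples):
    """Average, max and 95th percentile of one metric."""
    return {
        "average": sum(samples) / len(samples),
        "max": max(samples),
        "p95": percentile(samples, 95),
    }


# Hypothetical CPU utilization samples (%): mostly flat, with two busy points.
cpu_samples = [30] * 18 + [60, 95]
summary = summarize(cpu_samples)
print(summary)
```

    On this sample the average sits close to the flat baseline, the max reflects a single point, and the 95th percentile falls between the two.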

  • Forecast

    Model Traffic, Not Resources
    Need One Year Trend
    Forecast At Daily Level
    Eliminate Outliers
    No Data Is Better Than Wrong Data
    Convert Traffic To Resource Usage
    Linear Extrapolation Only (CPU Utilization, not IO Time)
    Simple Excel Formula Works Well
    For Long Term Resource Planning Only
    Use Average, Not Max
    Not All Workloads Are Predictable

    [Chart: CATY Traffic Forecast; y-axis: Billion Calls (0 to 70); x-axis: 01/01/2012 to 01/01/2015; series: Forecast, Actual, Capacity]
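
    The forecasting steps on this slide (eliminate outliers, fit a trend on daily traffic, extrapolate linearly, then convert traffic to CPU) can be sketched as below. Everything specific here is an assumption for illustration: the rolling-median outlier rule, the 10-day series (the slide calls for a full year), and the `cpu_pct_per_unit` conversion factor.

```python
import statistics


def drop_outliers(values, window=5, tolerance=0.5):
    """Keep (day_index, value) pairs that stay within `tolerance` (a fraction)
    of the median of their surrounding window; discard the rest."""
    half = window // 2
    kept = []
    for i, value in enumerate(values):
        neighbours = values[max(0, i - half): i + half + 1]
        median = statistics.median(neighbours)
        if abs(value - median) <= tolerance * abs(median):
            kept.append((i, value))
    return kept


def fit_line(points):
    """Ordinary least-squares line through (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def forecast_traffic(values, day):
    """Linear extrapolation of daily traffic to a future day index."""
    slope, intercept = fit_line(drop_outliers(values))
    return slope * day + intercept


def traffic_to_cpu(traffic, cpu_pct_per_unit):
    """Convert forecast traffic to CPU utilization with a linear factor."""
    return traffic * cpu_pct_per_unit


# Hypothetical daily traffic with one bad reading (a collection failure on day 5).
daily_calls = [100.0 + 2.0 * day for day in range(10)]
daily_calls[5] = 0.0

projected = forecast_traffic(daily_calls, day=365)
cpu = traffic_to_cpu(projected, cpu_pct_per_unit=0.05)
print(f"projected traffic: {projected:.1f}, projected CPU: {cpu:.1f}%")
```

    Dropping the bad reading rather than repairing it follows the slide's point that no data is better than wrong data.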

  • Things To Watch For

    Myths
    o More CPU Makes Apps Run Faster
    o More Data Makes Apps Run Slower
    o Apps Run Twice As Fast On CPU Twice The Speed
    o High Session = High Load

    Pitfalls
    o Cause VS. Symptom
    o Time Resolution Masks Issues
    o Look At The Whole Picture
    o Slow Down In Order To Go Faster < Throttle >

    Challenges
    o Data Quality: Data Missing, Data Source Changes, F/O Data Residency, Data Errors
    o Varieties of Data Formats & Resolutions
    o Data Collection In Secured Zones
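
    The "Time Resolution Masks Issues" pitfall is easy to demonstrate: averaging fine-grained samples into coarser buckets hides short spikes. The sketch below uses hypothetical one-minute CPU samples and an illustrative `downsample_mean` helper.

```python
def downsample_mean(samples, bucket_size):
    """Average consecutive samples in buckets of `bucket_size` points."""
    buckets = []
    for start in range(0, len(samples), bucket_size):
        bucket = samples[start:start + bucket_size]
        buckets.append(sum(bucket) / len(bucket))
    return buckets


# Hypothetical one-minute CPU samples (%) for one hour: a two-minute spike to 100%.
per_minute = [20.0] * 30 + [100.0] * 2 + [20.0] * 28

five_minute = downsample_mean(per_minute, 5)
hourly = downsample_mean(per_minute, 60)

print(f"max at 1-minute resolution: {max(per_minute):.1f}%")
print(f"max at 5-minute resolution: {max(five_minute):.1f}%")
print(f"value at 1-hour resolution: {hourly[0]:.1f}%")
```

    The same hour reads as saturated at one-minute resolution, half loaded at five-minute resolution, and nearly idle as an hourly average.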

  • Me: Everything NoSQL

    CassandraSummit2015 | #CassandraSummit

    Prior to 2011: Worked on Oracle at DoubleClick/Yahoo/Intuit

    Worked on NoSQL at eBay Database Infrastructure team:
    o Cassandra since 2011
    o MongoDB since 2012
    o Couchbase since 2014

    Cassandra Summit speaker for 2013, 2014, 2015

    DataStax Cassandra MVP for 2014, 2015

  • For Cassandra

    Capacity Measurements
    o Throughput
    o Latency
    o E.g. 30,000 reads/sec with SLA of P99 at 5ms

    Hardware SKU Example
    o CPU: 20 cores
    o Memory: 128GB RAM
    o Storage: 1.5TB local SSD
    o Network: 10G NIC

  • Benchmarking

    Benchmarking for different hardware
    o High I/O SKU
    o High memory SKU
    o High storage SKU
    o Bare metal or cloud

    Benchmarking for different software releases

    Benchmarking for different workloads
    o 100% Writes
    o 50% Writes, 50% Reads
    o 5% Writes, 95% Reads
    o 100% Reads

    Benchmarking Tools
    o YCSB
    o Cassandra-stress

    Proactive and repeated process using near real-time traffic in a prod-like environment
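
    The four workload mixes listed on this slide map naturally onto cassandra-stress invocations. The sketch below only builds and prints the command lines; it does not run them. The operation count, thread count and node address are placeholders, and the helper name is hypothetical. Read-only and mixed runs assume the data was populated by an earlier write run.

```python
import shlex

# (label, write parts, read parts) for each mix named on the slide.
WORKLOAD_MIXES = [
    ("100% writes", 1, 0),
    ("50% writes, 50% reads", 1, 1),
    ("5% writes, 95% reads", 1, 19),
    ("100% reads", 0, 1),
]


def stress_command(write_parts, read_parts, operations, threads, node):
    """Build a cassandra-stress argument list for one write/read mix."""
    if read_parts == 0:
        workload = ["write"]
    elif write_parts == 0:
        workload = ["read"]
    else:
        workload = ["mixed", f"ratio(write={write_parts},read={read_parts})"]
    return [
        "cassandra-stress", *workload, f"n={operations}",
        "-rate", f"threads={threads}",
        "-node", node,
    ]


for label, write_parts, read_parts in WORKLOAD_MIXES:
    argv = stress_command(write_parts, read_parts,
                          operations=1_000_000, threads=50, node="127.0.0.1")
    # shlex.join quotes the ratio(...) argument so a shell does not interpret it.
    print(f"{label}: {shlex.join(argv)}")
```

    Keeping the mixes in one table makes it straightforward to rerun the same matrix against each hardware SKU or software release.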

  • Capacity Planning

    Key to avoiding surprises in production
    The concept behind capacity planning is simple, but the mechanics are harder.
    Business requirements may increase, so you need to forecast how much resource must be added to the system to ensure that the user experience continues uninterrupted.
    Input: a clearly defined capacity goal coming from business requirements, and a performance baseline from benchmark tests
    Output: the resources to be added, such as memory, CPU, storage, I/O, network

    Always prepare for peak + headroom
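
    "Peak + headroom" turns into a node count once a per-node baseline is known from benchmarking. A minimal sketch, assuming throughput scales linearly with node count: the 30,000 reads/sec per node echoes the earlier measurement example, while the peak load and the 30% headroom are hypothetical.

```python
import math


def nodes_needed(peak_ops_per_sec, headroom_fraction, ops_per_node_per_sec):
    """Nodes required to serve peak load plus headroom, assuming throughput
    scales linearly with the number of nodes."""
    target = peak_ops_per_sec * (1 + headroom_fraction)
    return max(1, math.ceil(target / ops_per_node_per_sec))


def nodes_to_add(current_nodes, required_nodes):
    """Positive means flex up; zero means the cluster is already large enough."""
    return max(0, required_nodes - current_nodes)


# Hypothetical goal: 500,000 reads/sec at peak with 30% headroom,
# on nodes benchmarked at 30,000 reads/sec within the latency SLA.
required = nodes_needed(500_000, 0.30, 30_000)
print(f"required nodes: {required}, to add: {nodes_to_add(18, required)}")
```

    Real clusters rarely scale perfectly linearly, so the benchmark baseline should be refreshed as the cluster grows.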

  • Capacity Planning Process

    Initial Sizing
    o Storage size vs. data size
    o Compaction overhead, compression ratio, RF, indexes
    o Cost-effective configuration to meet capacity/latency SLA

    Routine Review
    o System utilization on I/O, storage, network, CPU, memory etc.
    o Cassandra metrics on GC, compaction, latency, throughput etc.
    o compactionstats, cfhistograms, tpstats etc.

    Forecasting
    o Historical comparison
    o Traffic projection

    Flex up or Flex down
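
    The initial sizing factors listed above (compaction overhead, compression ratio, RF, indexes) can be combined into a rough storage estimate. The formula and every input below are assumptions for illustration, not figures from the talk, apart from the 1.5TB of local SSD per node in the hardware SKU example.

```python
import math


def required_storage_tb(raw_data_tb, replication_factor, compression_ratio,
                        index_overhead_fraction, compaction_free_fraction):
    """Rough cluster-wide disk requirement.

    compression_ratio is compressed size / uncompressed size (0.5 halves the data).
    compaction_free_fraction is the share of each disk kept free for compaction.
    """
    on_disk = raw_data_tb * replication_factor * compression_ratio
    on_disk *= 1 + index_overhead_fraction
    return on_disk / (1 - compaction_free_fraction)


def storage_nodes(required_tb, usable_tb_per_node):
    """Nodes needed to hold the data, rounded up to whole nodes."""
    return math.ceil(required_tb / usable_tb_per_node)


# Hypothetical inputs: 10TB raw data, RF 3, 2:1 compression, no secondary
# indexes, half of each disk reserved for compaction, 1.5TB SSD per node.
total_tb = required_storage_tb(10.0, 3, 0.5, 0.0, 0.5)
print(f"required: {total_tb:.1f}TB on {storage_nodes(total_tb, 1.5)} nodes")
```

    The storage-driven node count should be compared with the throughput-driven one, and the larger of the two used for the initial cluster size.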

  • Scale Up vs. Scale Out

    Scale Up (vertical)
    Pros
    o Smaller data center footprint, such as space, power, cooling
    o Less license cost
    Cons
    o Likely costs more using proprietary hardware
    o Less fault tolerant
    o Limited upgradability in future

    Scale Out (horizontal)
    Pros
    o Cheaper using commodity hardware
    o More fault tolerant
    o (Unlimited) upgradability
    Cons
    o Bigger data center footprint
    o More license cost
    o Likely needs more network equipment

  • Questions ?

    eBay is hiring experienced NoSQL professionals, please send resume to fengqu@ebay.com