Agility and Scalability with MongoDB
DESCRIPTION
MongoDB has taken a clear lead in adoption among the new generation of databases, including the enormous variety of NoSQL offerings. A key reason for this lead has been a unique combination of agility and scalability. Agility provides business units with a quick start and the flexibility to maintain development velocity despite changing data and requirements. Scalability maintains that flexibility while providing fast, interactive performance as data volume and usage increase. We'll address the key organizational, operational, and engineering considerations to ensure that agility and scalability stay aligned at increasing scale, from small development instances to web-scale applications. We will also survey some key examples of highly scaled customer applications of MongoDB.

TRANSCRIPT
MongoDB Scalability and Agility
2
• Now
• Secure
• All varieties
• Fast and interactive
• Scalable to “Big”
• Agile to develop and deploy operationally
• Cloud and edge
Data Challenge: “I want my data...”
3
Scalability with MongoDB
Metric | Meaning | Examples
Operations per Second | Concurrent reads and writes per second | > 1 million per second
Nodes per Cluster | Horizontal scale-out, distributed to multiple data centers worldwide, with high availability, using inexpensive cloud resources | > 1,000 nodes
Records / Documents | Data objects in any number of schemas or structures | > 10 billion
Data Volume | Total amount of data: documents × size | > 1 petabyte = 10^15 = 1,000,000,000,000,000 ≈ 2^50 bytes
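The Data Volume row is just documents multiplied by size; a quick sanity check of the petabyte figure, using an assumed ~100 KB average document (the table gives only the totals, so the per-document size here is illustrative):

```python
# Sanity-check the "Data Volume = documents x size" arithmetic from the table.
# The 100 KB average document size is an illustrative assumption, not a
# measurement from the talk.
documents = 10_000_000_000      # > 10 billion records / documents
avg_doc_size_bytes = 100_000    # assumed ~100 KB per document

data_volume = documents * avg_doc_size_bytes
print(data_volume)  # 1000000000000000 bytes = 1 petabyte = 10^15
```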
Key Differentiation
5
Operational Database Landscape
6
Document Data Model
Relational MongoDB
{
  first_name: ‘Paul’,
  surname: ‘Miller’,
  city: ‘London’,
  location: [45.123, 47.232],
  cars: [
    { model: ‘Bentley’, year: 1973, value: 100000, … },
    { model: ‘Rolls Royce’, year: 1965, value: 330000, … }
  ]
}
7
Documents are Rich Data Structures
{
  first_name: ‘Paul’,
  surname: ‘Miller’,
  cell: ‘+447557505611’,
  city: ‘London’,
  location: [45.123, 47.232],
  profession: [‘banking’, ‘finance’, ‘trader’],
  cars: [
    { model: ‘Bentley’, year: 1973, value: 100000, … },
    { model: ‘Rolls Royce’, year: 1965, value: 330000, … }
  ]
}
• Fields with typed values: string, number, geo-coordinates
• Fields can contain arrays
• Fields can contain an array of sub-documents
8
Document Model Benefits
• Agility and flexibility
  – Data model supports business change
  – Rapidly iterate to meet new requirements
• Intuitive, natural data representation
  – Eliminates the ORM layer
  – Developers are more productive
• Reduces the need for joins and disk seeks
  – Programming is simpler
  – Performance delivered at scale
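A minimal sketch of the "reduces the need for joins" point: with embedded sub-documents, a person's cars travel inside the person record, so retrieving them is a plain field lookup rather than a join (field names follow the slide's earlier example):

```python
# One document holds the person and their cars. In a relational model the
# cars would live in a separate table and require a join to retrieve.
person = {
    "first_name": "Paul",
    "surname": "Miller",
    "city": "London",
    "location": [45.123, 47.232],
    "cars": [
        {"model": "Bentley", "year": 1973, "value": 100000},
        {"model": "Rolls Royce", "year": 1965, "value": 330000},
    ],
}

# Accessing the embedded array is a field lookup, not a join.
total_value = sum(car["value"] for car in person["cars"])
print(total_value)  # 430000
```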
11
Big Data Tech Interest Comparison
j.mp/Ssvpev
Architecture for Availability & Scalability
14
Replica Sets
• Replica set: two or more copies of the data
• Availability solution
  – High availability
  – Disaster recovery
  – Maintenance
• Deployment flexibility
  – Data locality to users
  – Workload isolation: operational & analytics
• Self-healing shard
[Diagram: application and driver write to the primary; replication flows to two secondaries]
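The "self-healing" behavior rests on majority elections: a member can become primary only with votes from a strict majority of the set, so an N-member set tolerates N - (N//2 + 1) failures. A small sketch of that arithmetic (the helper name is mine, not a driver API):

```python
def failures_tolerated(members: int) -> int:
    """Members a replica set can lose and still elect a primary.

    Electing a primary needs a strict majority of the configured
    members, i.e. members // 2 + 1 voters must survive.
    """
    majority = members // 2 + 1
    return members - majority

print(failures_tolerated(3))  # 1: the 1-primary / 2-secondary set shown here
print(failures_tolerated(9))  # 4: e.g. a 9-member replica set
```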
16
Global Data Distribution
[Diagram: one primary with secondaries replicating in real time across data centers worldwide]
17
Automatic Sharding
• Sharding types
  – Range
  – Hash
  – Tag-aware
• Elastic increase or decrease in capacity
• Automatic balancing
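A toy illustration of the range-vs-hash trade-off (not MongoDB's actual chunking code; md5 stands in for MongoDB's internal hash): range sharding keeps nearby keys on the same shard, while hashing scatters them.

```python
import hashlib

SHARDS = 4

def range_shard(key: int, max_key: int = 1000) -> int:
    """Range sharding: contiguous key ranges map to the same shard."""
    return min(key * SHARDS // max_key, SHARDS - 1)

def hash_shard(key: int) -> int:
    """Hashed sharding: md5 here stands in for MongoDB's internal hash."""
    return int(hashlib.md5(str(key).encode()).hexdigest(), 16) % SHARDS

adjacent = [100, 101, 102, 103]
print([range_shard(k) for k in adjacent])  # [0, 0, 0, 0]: neighbors stay together
print([hash_shard(k) for k in adjacent])   # spread depends on the hash
```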
18
Query Routing
• Multiple query optimization models
• Each sharding option is appropriate for different applications
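A sketch of the routing idea, with mongos internals simplified away: when a query includes the shard key, the router can target a single shard; otherwise it must scatter-gather across all of them.

```python
SHARD_COUNT = 3

def shards_to_query(query: dict, shard_key: str = "user_id") -> list:
    """Toy mongos: target one shard when the shard key is in the query,
    otherwise scatter-gather across every shard."""
    if shard_key in query:
        return [hash(query[shard_key]) % SHARD_COUNT]
    return list(range(SHARD_COUNT))

print(shards_to_query({"user_id": 42, "status": "open"}))  # one targeted shard
print(shards_to_query({"status": "open"}))                 # [0, 1, 2]: all shards
```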
Performance
20
Drag Strip: straight ahead, quarter-mile, stop
21
Road Race: stay fast, stay agile, continuous
Nürburgring, Germany
MongoDB at Scale
24
• Large data set
CarFax
25
Baseline
• Vehicle History Database
• 11 billion records (growing at 1 billion per year)
• 30-year-old VMS-based RDBMS
• Cumbersome
• Costly

MongoDB Comparison
• In-depth NoSQL evaluation
• Performance: 4x faster than baseline, 10x key-value
• Scale out using inexpensive commodity servers
• Built-in redundancy
• Flexible dynamic schema data model
• Strong consistency
• Analytics/aggregation

Initial Production
• MongoDB is primary data store
• 50 servers
• 10 shards
• 5-node replica sets per shard
26
• 13 billion+ documents
  – 1.5 billion documents added every year
• 1 vehicle history report is > 200 documents
• 12 Shards
• 9-node replica sets
• Replicas distributed across 3 data centers
CARFAX Sharding and Replication
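Putting the CARFAX numbers together (simple arithmetic; the even split across data centers is an assumption, since the slide only says the replicas are distributed across three):

```python
shards = 12
members_per_replica_set = 9
data_centers = 3

total_mongod = shards * members_per_replica_set
print(total_mongod)                  # 108 data-bearing mongod processes
print(total_mongod // data_centers)  # 36 per data center, assuming an even split
```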
27
CARFAX Replication
28
29
• 50M users
• 6B check-ins to date (6M per day growth)
• 55M points of interest / venues
• 1.7M merchants using the platform for marketing
• Operations per second: 300,000
• Documents: 5.5B (~16.5B with replication)*
Foursquare
30
• 11 MongoDB clusters
  – 8 are sharded
• Largest cluster is for check-ins
  – 15 shards
  – Shard key: user_id
Foursquare clusters
31
Facebook / parse.com mobile apps
• Persistent database for 270,000 mobile applications
• 200M end-user mobile devices
• 250% annual growth in client apps
• 500% growth in requests
• 1.5M collections
• Key differentiators:
– Document data model
– High perf. & avail.
– Geospatial query and index
• Charity Majors on operations: j.mp/X3jVRC
– Understand your database and your data, and build for them.
Scalability Exercises in the Cloud with Amazon Web Services
35
• 27x hs1.8xlarge instances
  – 16x vCPU
  – 24x 2TB SATA drives, RAID 0
  – 8x mongod microshards
• Modified Yahoo Cloud Serving Benchmark (YCSB)
  – Long integer IDs (> 2B)
  – Zipfian-distributed integer fields
  – Aggregation queries
• Load direct to 216 shards, 10 days, $4K
  "objects" : 7,170,648,489,
  "avgObjSize" : 147,438.99952658816,
  "dataSize" : NumberLong("1,057,240,224,818,640")
  (commas added)
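A quick sanity check of the stats above (commas stripped): dividing dataSize by objects closely reproduces the reported avgObjSize, and the total confirms the petascale claim.

```python
# Figures copied from the collection stats above, commas stripped.
objects = 7_170_648_489
avg_obj_size = 147_438.99952658816   # bytes per document
data_size = 1_057_240_224_818_640    # bytes in total

# dataSize / objects should track avgObjSize.
print(data_size / objects)

# Total volume in petabytes (10^15 bytes): just over 1 PB.
print(data_size / 10**15)
```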
Petascale Database
CGroup Memory Segregation
for DB in `seq 0 3`; do
  sudo cgcreate \
    -a mongodb:mongodb \
    -t mongodb:mongodb \
    -g memory:mongodb$DB
  # "sudo echo 48G > file" would redirect as the unprivileged user; tee runs under sudo
  echo 48G | sudo tee \
    /sys/fs/cgroup/memory/mongodb$DB/memory.limit_in_bytes
  cgexec \
    -g memory:mongodb$DB \
    numactl --interleave=all \
    mongod --config ~/mongod$DB.conf
done
37
• Ingest 250-byte stock quotes at 2M/s
• Concurrently run 5 QPS with subsecond, indexed response on timeStamp, accountId, instrumentId, systemKey
• 5x r3.4xlarge
  – 16x vCPU, 1x 320GB SSD, 122GB RAM, 16x mongod
  – 2.1M inserts/second direct to shards
• 16x c3.8xlarge
  – 32x vCPU, 2x 320GB SSD, 60GB RAM, 16x mongod, 4x mongos
  – 2.1M inserts/second via mongos
Megawrite Ingest
38
• 2 threads on c3.8xl
• 264-byte object (bsonsize), _id index only
• coll.insert(): 15,600 inserts/sec
• coll.insert(List<DBObject>), list size = 64: 118,000 inserts/sec
• Bulk ops API, batch size = 64: 120,000 inserts/sec
Java API comparison
BulkWriteOperation bo = null;
for (int a = 0; a < this.items && stayAlive; a++) {
    if (bo == null) {
        bo = collection.initializeUnorderedBulkOperation();
    }
    fillMap(this.m);
    BasicDBObject dbObject = new BasicDBObject(this.m);
    bo.insert(dbObject);
    if (0 == a % listsize) {
        BulkWriteResult rc = bo.execute();
        bo = null;
    }
}
// Flush any inserts still buffered when the loop ends
if (bo != null) {
    bo.execute();
}
7x Load with BulkOp
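The ~7x speedup comes from amortizing client-server round trips over many documents. A back-of-envelope sketch (the one-round-trip-per-batch model is my simplification of the driver's behavior):

```python
import math

def round_trips(items: int, batch_size: int) -> int:
    """Client-server round trips: one execute() per (possibly partial) batch."""
    return math.ceil(items / batch_size)

items = 1_000_000
print(round_trips(items, 1))   # 1000000: one round trip per document
print(round_trips(items, 64))  # 15625: batching cuts round trips ~64x
```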
How do I Pick A Shard Key?
41
Shard Key characteristics
• A good shard key has:
  – sufficient cardinality
  – distributed writes
  – targeted reads ("query isolation")
• The shard key should be in every query if possible
  – scatter-gather otherwise
• Choosing a good shard key is important!
  – affects performance and scalability
  – changing it later is expensive
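A cheap way to vet the "sufficient cardinality" rule before sharding is to count distinct values of each candidate key over a sample (the sample documents below are invented for illustration):

```python
# Invented sample documents: a user id, a boolean flag, and a day-granular date.
docs = [
    {"user_id": i, "premium": i % 2 == 0, "created": f"2014-01-{i % 28 + 1:02d}"}
    for i in range(10_000)
]

def cardinality(field: str) -> int:
    """Distinct values of a candidate shard key in the sample."""
    return len({d[field] for d in docs})

print(cardinality("user_id"))  # 10000: plenty of cardinality
print(cardinality("premium"))  # 2: a boolean -> jumbo chunks
print(cardinality("created"))  # 28: low cardinality
```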
42
Hashed shard key
• Pros:
  – Evenly distributed writes
• Cons:
  – Random data (and index) updates can be I/O intensive
  – Range-based queries turn into scatter-gather
[Diagram: mongos scatter-gathers a range query across Shard 1 … Shard N]
43
Low cardinality shard key
• Induces "jumbo chunks"
• Examples: boolean field
Shard 1
mongos
Shard 2 Shard 3 Shard N
[ a, b )
44
Ascending shard key
• Monotonically increasing shard key values cause "hot spots" on inserts
• Examples: timestamps, _id
[Diagram: all inserts target the shard holding the open-ended chunk [ ISODate(…), $maxKey )]
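The hot-spot effect is easy to simulate: with range sharding on a monotonically increasing key, every new insert falls into the open-ended top chunk (a simplification of MongoDB's chunk mechanics):

```python
from collections import Counter

# Chunk boundaries on an ascending key; keys >= 3000 fall into the
# open-ended top chunk, i.e. [3000, $maxKey) in MongoDB terms.
boundaries = [1000, 2000, 3000]

def chunk_for(key: int) -> int:
    for i, bound in enumerate(boundaries):
        if key < bound:
            return i
    return len(boundaries)  # the open-ended "hot" chunk

# Monotonically increasing keys (e.g. timestamps) after the last split:
inserts = Counter(chunk_for(k) for k in range(3000, 4000))
print(inserts)  # Counter({3: 1000}): every insert hits one chunk, one shard
```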
Ensuring Success with High Scalability
46
Success Factors
• Storage: random seeks (IOPS)
• RAM: working set based on query patterns
• Query: indexing
• Delete: most expensive operation
• Real-time vs. bulk operations
• Continuity: HA, DR, backup, restore
• Agile process: iterate by powers of 4
• Sharding: shard key and strategy
• Resources: don’t go it alone!