Agility and Scalability with MongoDB
DESCRIPTION
MongoDB has taken a clear lead in adoption among the new generation of databases, including the enormous variety of NoSQL offerings. A key reason for this lead has been a unique combination of agility and scalability. Agility provides business units with a quick start and the flexibility to maintain development velocity despite changing data and requirements. Scalability maintains that flexibility while providing fast, interactive performance as data volume and usage increase. We'll address the key organizational, operational, and engineering considerations to ensure that agility and scalability stay aligned at increasing scale, from small development instances to web-scale applications. We will also survey some key examples of highly scaled customer applications of MongoDB.

TRANSCRIPT
MongoDB Scalability and Agility
2
• Now
• Secure
• All varieties
• Fast and interactive
• Scalable to “Big”
• Agile to develop and deploy operationally
• Cloud and edge
Data Challenge: “I want my data...”
3
Scalability with MongoDB
Metric | Meaning | Examples
Operations per Second | Concurrent reads and writes per second | > 1 million per second
Nodes per Cluster | Horizontal scale-out, distributed to multiple data centers worldwide, with high availability, using inexpensive cloud resources | > 1,000 nodes
Records / Documents | Data objects in any number of schemas or structures | > 10 billion
Data Volume | Total amount of data: documents × size | > 1 petabyte = 10^15 = 1,000,000,000,000,000 ≈ 2^50 bytes
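The Data Volume row is just documents multiplied by size; a quick sanity check of the petabyte figure, using an assumed ~100 KB average document (the table gives only the totals, so the per-document size here is illustrative):

```python
# Sanity-check the "Data Volume = documents x size" arithmetic from the table.
# The 100 KB average document size is an illustrative assumption, not a
# measurement from the talk.
documents = 10_000_000_000      # > 10 billion records / documents
avg_doc_size_bytes = 100_000    # assumed ~100 KB per document

data_volume = documents * avg_doc_size_bytes
print(data_volume)  # 1000000000000000 bytes = 1 petabyte = 10^15
```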
Key Differentiation
5
Operational Database Landscape
6
Document Data Model
Relational MongoDB
{
  first_name: ‘Paul’,
  surname: ‘Miller’,
  city: ‘London’,
  location: [45.123, 47.232],
  cars: [
    { model: ‘Bentley’, year: 1973, value: 100000, … },
    { model: ‘Rolls Royce’, year: 1965, value: 330000, … }
  ]
}
7
Documents are Rich Data Structures
{
  first_name: ‘Paul’,
  surname: ‘Miller’,
  cell: ‘+447557505611’,
  city: ‘London’,
  location: [45.123, 47.232],
  profession: [‘banking’, ‘finance’, ‘trader’],
  cars: [
    { model: ‘Bentley’, year: 1973, value: 100000, … },
    { model: ‘Rolls Royce’, year: 1965, value: 330000, … }
  ]
}
• Fields with typed values: string, number, geo-coordinates
• Fields can contain arrays
• Fields can contain an array of sub-documents
8
Document Model Benefits
• Agility and flexibility
  – Data model supports business change
  – Rapidly iterate to meet new requirements
• Intuitive, natural data representation
  – Eliminates the ORM layer
  – Developers are more productive
• Reduces the need for joins and disk seeks
  – Programming is simpler
  – Performance delivered at scale
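A minimal sketch of the "reduces the need for joins" point: with embedded sub-documents, a person's cars travel inside the person record, so retrieving them is a plain field lookup rather than a join (field names follow the slide's earlier example):

```python
# One document holds the person and their cars. In a relational model the
# cars would live in a separate table and require a join to retrieve.
person = {
    "first_name": "Paul",
    "surname": "Miller",
    "city": "London",
    "location": [45.123, 47.232],
    "cars": [
        {"model": "Bentley", "year": 1973, "value": 100000},
        {"model": "Rolls Royce", "year": 1965, "value": 330000},
    ],
}

# Accessing the embedded array is a field lookup, not a join.
total_value = sum(car["value"] for car in person["cars"])
print(total_value)  # 430000
```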
11
Big Data Tech Interest Comparison
j.mp/Ssvpev
Architecture for Availability & Scalability
14
Replica Sets
• Replica set: two or more copies of the data
• Availability solution
  – High availability
  – Disaster recovery
  – Maintenance
• Deployment flexibility
  – Data locality to users
  – Workload isolation: operational & analytics
• Self-healing shard
[Diagram: application and driver write to the primary; replication flows to two secondaries]
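The "self-healing" behavior rests on majority elections: a member can become primary only with votes from a strict majority of the set, so an N-member set tolerates N - (N//2 + 1) failures. A small sketch of that arithmetic (the helper name is mine, not a driver API):

```python
def failures_tolerated(members: int) -> int:
    """Members a replica set can lose and still elect a primary.

    Electing a primary needs a strict majority of the configured
    members, i.e. members // 2 + 1 voters must survive.
    """
    majority = members // 2 + 1
    return members - majority

print(failures_tolerated(3))  # 1: the 1-primary / 2-secondary set shown here
print(failures_tolerated(9))  # 4: e.g. a 9-member replica set
```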
16
Global Data Distribution
[Diagram: one primary with secondaries replicating in real time across data centers worldwide]
17
Automatic Sharding
• Sharding types
  – Range
  – Hash
  – Tag-aware
• Elastic increase or decrease in capacity
• Automatic balancing
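A toy illustration of the range-vs-hash trade-off (not MongoDB's actual chunking code; md5 stands in for MongoDB's internal hash): range sharding keeps nearby keys on the same shard, while hashing scatters them.

```python
import hashlib

SHARDS = 4

def range_shard(key: int, max_key: int = 1000) -> int:
    """Range sharding: contiguous key ranges map to the same shard."""
    return min(key * SHARDS // max_key, SHARDS - 1)

def hash_shard(key: int) -> int:
    """Hashed sharding: md5 here stands in for MongoDB's internal hash."""
    return int(hashlib.md5(str(key).encode()).hexdigest(), 16) % SHARDS

adjacent = [100, 101, 102, 103]
print([range_shard(k) for k in adjacent])  # [0, 0, 0, 0]: neighbors stay together
print([hash_shard(k) for k in adjacent])   # spread depends on the hash
```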
18
Query Routing
• Multiple query optimization models
• Each sharding option is appropriate for different applications
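A sketch of the routing idea, with mongos internals simplified away: when a query includes the shard key, the router can target a single shard; otherwise it must scatter-gather across all of them.

```python
SHARD_COUNT = 3

def shards_to_query(query: dict, shard_key: str = "user_id") -> list:
    """Toy mongos: target one shard when the shard key is in the query,
    otherwise scatter-gather across every shard."""
    if shard_key in query:
        return [hash(query[shard_key]) % SHARD_COUNT]
    return list(range(SHARD_COUNT))

print(shards_to_query({"user_id": 42, "status": "open"}))  # one targeted shard
print(shards_to_query({"status": "open"}))                 # [0, 1, 2]: all shards
```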
Performance
20
Drag Strip: straight ahead, quarter-mile, stop
21
Road Race: stay fast, stay agile, continuous
Nürburgring, Germany
MongoDB at Scale
24
• Large data set
CarFax
25
Baseline
• Vehicle History Database
• 11 billion records (growing at 1 billion per year)
• 30-year-old VMS-based RDBMS
• Cumbersome
• Costly

MongoDB Comparison
• In-depth NoSQL evaluation
• Performance: 4x faster than baseline, 10x key-value
• Scale out using inexpensive commodity servers
• Built-in redundancy
• Flexible dynamic schema data model
• Strong consistency
• Analytics/aggregation

Initial Production
• MongoDB is primary data store
• 50 servers
• 10 shards
• 5-node replica sets per shard
26
• 13 billion+ documents
  – 1.5 billion documents added every year
• 1 vehicle history report is > 200 documents
• 12 Shards
• 9-node replica sets
• Replicas distributed across 3 data centers
CARFAX Sharding and Replication
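Putting the CARFAX numbers together (simple arithmetic; the even split across data centers is an assumption, since the slide only says the replicas are distributed across three):

```python
shards = 12
members_per_replica_set = 9
data_centers = 3

total_mongod = shards * members_per_replica_set
print(total_mongod)                  # 108 data-bearing mongod processes
print(total_mongod // data_centers)  # 36 per data center, assuming an even split
```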
27
CARFAX Replication
28
29
• 50M users
• 6B check-ins to date (6M per day growth)
• 55M points of interest / venues
• 1.7M merchants using the platform for marketing
• Operations per second: 300,000
• Documents: 5.5B (~16.5B with replication)*
Foursquare
30
• 11 MongoDB clusters
  – 8 are sharded
• Largest cluster is for check-ins
  – 15 shards
  – Shard key: user_id
Foursquare clusters
31
Facebook / parse.com mobile apps
• Persistent database for 270,000 mobile applications
• 200M end-user mobile devices
• 250% annual growth in client apps
• 500% growth in requests
• 1.5M collections
• Key differentiators:
– Document data model
– High perf. & avail.
– Geospatial query and index
• Charity Majors on operations: j.mp/X3jVRC
– Understand your database and your data, and build for them.
Scalability Exercises in the Cloud with Amazon Web Services
35
• 27x hs1.8xlarge instances
  – 16x vCPU
  – 24x 2TB SATA drives, RAID 0
  – 8x mongod microshards
• Modified Yahoo Cloud Serving Benchmark (YCSB)
  – Long integer IDs (> 2B)
  – Zipfian-distributed integer fields
  – Aggregation queries
• Load direct to 216 shards, 10 days, $4K
  "objects" : 7,170,648,489,
  "avgObjSize" : 147,438.99952658816,
  "dataSize" : NumberLong("1,057,240,224,818,640")
  (commas added)
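A quick sanity check of the stats above (commas stripped): dividing dataSize by objects closely reproduces the reported avgObjSize, and the total confirms the petascale claim.

```python
# Figures copied from the collection stats above, commas stripped.
objects = 7_170_648_489
avg_obj_size = 147_438.99952658816   # bytes per document
data_size = 1_057_240_224_818_640    # bytes in total

# dataSize / objects should track avgObjSize.
print(data_size / objects)

# Total volume in petabytes (10^15 bytes): just over 1 PB.
print(data_size / 10**15)
```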
Petascale Database
CGroup Memory Segregation
for DB in `seq 0 3`; do
  sudo cgcreate \
    -a mongodb:mongodb \
    -t mongodb:mongodb \
    -g memory:mongodb$DB
  # "sudo echo 48G > file" would redirect as the unprivileged user; tee runs under sudo
  echo 48G | sudo tee \
    /sys/fs/cgroup/memory/mongodb$DB/memory.limit_in_bytes
  cgexec \
    -g memory:mongodb$DB \
    numactl --interleave=all \
    mongod --config ~/mongod$DB.conf
done
37
• Ingest 250-byte stock quotes at 2M/s
• Concurrently run 5 QPS with subsecond, indexed response on timeStamp, accountId, instrumentId, systemKey
• 5x r3.4xlarge
  – 16x vCPU, 1x 320GB SSD, 122GB RAM, 16x mongod
  – 2.1M inserts/second direct to shards
• 16x c3.8xlarge
  – 32x vCPU, 2x 320GB SSD, 60GB RAM, 16x mongod, 4x mongos
  – 2.1M inserts/second via mongos
Megawrite Ingest
38
• 2 threads on c3.8xl
• 264-byte object (bsonsize), _id index only
• coll.insert(): 15,600 inserts/sec
• coll.insert(List<DBObject>), list size = 64: 118,000 inserts/sec
• Bulk ops API, batch size = 64: 120,000 inserts/sec
Java API comparison
BulkWriteOperation bo = null;
for (int a = 0; a < this.items && stayAlive; a++) {
    if (bo == null) {
        bo = collection.initializeUnorderedBulkOperation();
    }
    fillMap(this.m);
    BasicDBObject dbObject = new BasicDBObject(this.m);
    bo.insert(dbObject);
    if (0 == a % listsize) {
        BulkWriteResult rc = bo.execute();
        bo = null;
    }
}
// Flush any inserts still buffered when the loop ends
if (bo != null) {
    bo.execute();
}
7x Load with BulkOp
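The ~7x speedup comes from amortizing client-server round trips over many documents. A back-of-envelope sketch (the one-round-trip-per-batch model is my simplification of the driver's behavior):

```python
import math

def round_trips(items: int, batch_size: int) -> int:
    """Client-server round trips: one execute() per (possibly partial) batch."""
    return math.ceil(items / batch_size)

items = 1_000_000
print(round_trips(items, 1))   # 1000000: one round trip per document
print(round_trips(items, 64))  # 15625: batching cuts round trips ~64x
```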
How do I Pick A Shard Key?
41
Shard Key characteristics
• A good shard key has:
  – sufficient cardinality
  – distributed writes
  – targeted reads ("query isolation")
• The shard key should be in every query if possible
  – scatter-gather otherwise
• Choosing a good shard key is important!
  – affects performance and scalability
  – changing it later is expensive
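A cheap way to vet the "sufficient cardinality" rule before sharding is to count distinct values of each candidate key over a sample (the sample documents below are invented for illustration):

```python
# Invented sample documents: a user id, a boolean flag, and a day-granular date.
docs = [
    {"user_id": i, "premium": i % 2 == 0, "created": f"2014-01-{i % 28 + 1:02d}"}
    for i in range(10_000)
]

def cardinality(field: str) -> int:
    """Distinct values of a candidate shard key in the sample."""
    return len({d[field] for d in docs})

print(cardinality("user_id"))  # 10000: plenty of cardinality
print(cardinality("premium"))  # 2: a boolean -> jumbo chunks
print(cardinality("created"))  # 28: low cardinality
```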
42
Hashed shard key
• Pros:
  – Evenly distributed writes
• Cons:
  – Random data (and index) updates can be I/O intensive
  – Range-based queries turn into scatter-gather
[Diagram: mongos scatter-gathers a range query across Shard 1 … Shard N]
43
Low cardinality shard key
• Induces "jumbo chunks"
• Examples: boolean field
Shard 1
mongos
Shard 2 Shard 3 Shard N
[ a, b )
44
Ascending shard key
• Monotonically increasing shard key values cause "hot spots" on inserts
• Examples: timestamps, _id
[Diagram: all inserts target the shard holding the open-ended chunk [ ISODate(…), $maxKey )]
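The hot-spot effect is easy to simulate: with range sharding on a monotonically increasing key, every new insert falls into the open-ended top chunk (a simplification of MongoDB's chunk mechanics):

```python
from collections import Counter

# Chunk boundaries on an ascending key; keys >= 3000 fall into the
# open-ended top chunk, i.e. [3000, $maxKey) in MongoDB terms.
boundaries = [1000, 2000, 3000]

def chunk_for(key: int) -> int:
    for i, bound in enumerate(boundaries):
        if key < bound:
            return i
    return len(boundaries)  # the open-ended "hot" chunk

# Monotonically increasing keys (e.g. timestamps) after the last split:
inserts = Counter(chunk_for(k) for k in range(3000, 4000))
print(inserts)  # Counter({3: 1000}): every insert hits one chunk, one shard
```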
Ensuring Success with High Scalability
46
Success Factors
• Storage: random seeks (IOPS)
• RAM: working set based on query patterns
• Query: indexing
• Delete: most expensive operation
• Real-time vs. bulk operations
• Continuity: HA, DR, backup, restore
• Agile process: iterate by powers of 4
• Sharding: shard key and strategy
• Resources: don’t go it alone!