MongoDB Best Practices
TRANSCRIPT
About Me
• Solution Architect
• Part of Sales Organization
• Work with many organizations new to MongoDB
Everyone Loves MongoDB's Flexibility
• Document Model
• Dynamic Schema
• Powerful Query Language
• Secondary Indexes
Sometimes Organizations Struggle with Performance
Good News!
• Poor performance is usually due to common (and often simple) mistakes
Agenda
• Quick MongoDB Introduction
• Best Practices
1. Hardware/OS
2. Schema/Queries
3. Loading Data
MongoDB Introduction
Document Data Model: Relational vs. MongoDB
{
  first_name: "Paul",
  surname: "Miller",
  city: "London",
  location: [45.123, 47.232],
  cars: [
    { model: "Bentley", year: 1973, value: 100000, … },
    { model: "Rolls Royce", year: 1965, value: 330000, … }
  ]
}
Documents Are Rich Data Structures
{
  first_name: "Paul",
  surname: "Miller",
  cell: 447557505611,
  city: "London",
  location: [45.123, 47.232],
  profession: ["banking", "finance", "trader"],
  cars: [
    { model: "Bentley", year: 1973, value: 100000, … },
    { model: "Rolls Royce", year: 1965, value: 330000, … }
  ]
}
Fields are typed (string, number, geo-coordinates, …) and can contain arrays and arrays of sub-documents.
Do More With Your Data
Rich Queries: Find everybody in London with a car built between 1970 and 1980
Geospatial: Find all of the car owners within 5 km of Trafalgar Sq.
Text Search: Find all the cars described as having leather seats
Aggregation: Calculate the average value of Paul's car collection
Map-Reduce: What is the ownership pattern of colors by geography over time? (Is purple trending up in China?)
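The queries above could be written in the mongo shell roughly as follows, run against a live deployment. This is a sketch: the collection name `customers` and the Trafalgar Square coordinates are my assumptions (the slides don't name them), and the geo query assumes a 2dsphere index on `location` with GeoJSON coordinates.

```javascript
// Rich query: everybody in London with a car built between 1970 and 1980
db.customers.find({
  city: 'London',
  cars: { $elemMatch: { year: { $gte: 1970, $lte: 1980 } } }
})

// Geospatial: car owners within 5 km of Trafalgar Square
db.customers.find({ location: { $near: {
  $geometry: { type: 'Point', coordinates: [-0.1281, 51.5080] },
  $maxDistance: 5000 } } })

// Aggregation: average value of Paul's car collection
db.customers.aggregate([
  { $match: { first_name: 'Paul' } },
  { $unwind: '$cars' },
  { $group: { _id: '$first_name', avgValue: { $avg: '$cars.value' } } }
])
```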
Automatic Sharding
• Three types: hash-based, range-based, location-aware
• Increase or decrease capacity as you go
• Automatic balancing
• Query routing through mongos
• Multiple query optimization models
• Each sharding option appropriate for different apps
Replica Sets
• Replica set: 2 to 50 copies
• Self-healing shard
• Data center aware
• Addresses availability considerations:
  – High availability
  – Disaster recovery
  – Maintenance
• Workload isolation: operational & analytics
Assumptions
• MongoDB 3.0 or 3.2
Storage Engine Architecture in 3.2
• One document data model and query language (MQL + native drivers) across workloads: content repo, IoT sensor backend, ad service, customer analytics, archive
• Pluggable storage engines: WiredTiger (WT), MMAPv1, in-memory (beta), encrypted, 3rd party
• Common management and security layers across engines
• Supported in MongoDB 3.2
Best Practices
Hardware/Operating System
Servers
• Are the specifications a good fit for MongoDB?
• Correct number of servers?
• Properly configured?
What Type of Servers?
• RAM: 64–256 GB+
• Fast I/O systems: RAID-10/SSDs
• Many cores, for:
  – Compression/decompression
  – Encryption/decryption
  – Aggregation queries
What About a SAN?
• Mostly random disk access
• IOPS are what matter
• Need dedicated IOPS or performance will vary
• Configure your SAN properly
• Suitability of any I/O system depends on IOPS
How Many Servers Do I Need?
• How many shards do I need?
MongoDB Cluster Sizing at 30,000 ft
• Disk space
• RAM
• Query throughput
Disk Space: How Many Shards Do I Need?
• Sum of disk space across shards must be greater than the required storage size
Example
• Data size = 9 TB
• WiredTiger compression ratio: 0.33
• Storage size = 3 TB
• Server disk capacity = 2 TB
⇒ 2 shards required
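The disk-space math above can be sketched as a small helper (the function name is mine; the numbers are the slide's):

```javascript
// Shards needed so that total disk across shards exceeds the storage size.
function shardsForDisk(rawDataGB, compressionRatio, serverDiskGB) {
  const storageGB = rawDataGB * compressionRatio; // on-disk size after compression
  return Math.ceil(storageGB / serverDiskGB);
}

// Slide example: 9 TB raw data, WiredTiger compression ratio 0.33,
// 2 TB of disk per server -> ~3 TB storage size -> 2 shards.
console.log(shardsForDisk(9000, 0.33, 2000)); // 2
```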
RAM: How Many Shards Do I Need?
• Working set should fit in RAM
  – Sum of RAM across shards > working set
• Working set = indexes plus the set of documents accessed frequently
• Working set in RAM gives:
  – Shorter latency
  – Higher throughput
• Measuring index size: db.coll.stats() reports the index size of the collection
• Estimate frequently accessed documents
  – e.g., total size of documents accessed per day
Example
• Working set = 428 GB
• Server RAM = 128 GB
• 428 / 128 = 3.34
⇒ 4 shards required
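The RAM sizing step is the same ceiling division (helper name is mine):

```javascript
// Shards needed so that total RAM across shards holds the working set.
function shardsForRAM(workingSetGB, serverRamGB) {
  return Math.ceil(workingSetGB / serverRamGB);
}

// Slide example: 428 GB working set, 128 GB of RAM per server.
console.log(shardsForRAM(428, 128)); // 428/128 = 3.34 -> 4 shards
```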
Query Rate: How Many Shards Do I Need?
• Measure the max sustained query rate of a single server (with replication)
  – Build a prototype and measure
• Assume a sharding overhead of 20–30%
Example
• Required: 50K ops/sec
• Prototype performance: 20K ops/sec (1 replica set)
⇒ 4 shards required: 80K ops/sec × 0.7 = 56K ops/sec
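The throughput math, with the slide's 20–30% sharding overhead applied as a discount (worst case 30%; the helper is mine):

```javascript
// Shards needed to sustain a target query rate, discounting sharding overhead.
function shardsForOps(requiredOps, perServerOps, overhead = 0.3) {
  return Math.ceil(requiredOps / (perServerOps * (1 - overhead)));
}

// Slide example: need 50K ops/sec, prototype replica set sustains 20K ops/sec.
// 50000 / (20000 * 0.7) = 3.58 -> 4 shards; check: 4 * 20000 * 0.7 = 56K ops/sec.
console.log(shardsForOps(50000, 20000)); // 4
```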
Configure Them Properly
• Default OS settings often don't provide optimal performance
• See the MongoDB Production Notes: https://docs.mongodb.org/manual/administration/production-notes
• Also review:
  – Amazon EC2: https://docs.mongodb.org/ecosystem/platforms/amazon-ec2/
  – Azure: https://docs.mongodb.org/ecosystem/platforms/windows-azure/
Server/OS Configuration
• Server configuration recommendations:
  – XFS
  – Turn off atime and diratime
  – NOOP scheduler
  – File descriptor limits
  – Disable transparent huge pages and NUMA
  – Read-ahead of 32
  – Separate data volumes for data files, the journal, and the log
  – Change the default TCP keepalive time to 300 seconds
These Are Important
• Ignore them and your performance may suffer
• The first 100 lines of the MongoDB log identify suboptimal OS settings
Best Practices
Schema Design
Don’t Use a Relational Schema
Tailor the MongoDB Schema to the Application Workload
• Design the schema to provide good query performance
• Schema design will impact the required number of shards!
Application Query Workload
{ name: "john", height: 12, address: {…} }
db.cust.find({…})
db.cust.aggregate({…})
Compare Alternative Schemas
• Build a spreadsheet
• Calculate the # of shards for each schema
• Estimate query performance:
  – # of documents
  – # of inserts
  – # of deletes
  – Required indexes
  – Number of documents inspected
  – Number of documents sent across the network
Modeling Decisions
• Referencing vs. embedding
• Aggregating data by device, customer, product, etc.
Referencing
Procedure:
{
  "_id": 333,
  "date": "2003-02-09T05:00:00",
  "hospital": "County Hills",
  "patient": "John Doe",
  "physician": "Stephen Smith",
  "type": "Chest X-ray",
  "result": 134
}
Result:
{
  "_id": 134,
  "type": "txt",
  "size": NumberInt(12),
  "content": { value1: 343, value2: "abc", … }
}
Embedding
Procedure:
{
  "_id": 333,
  "date": "2003-02-09T05:00:00",
  "hospital": "County Hills",
  "patient": "John Doe",
  "physician": "Stephen Smith",
  "type": "Chest X-ray",
  "result": {
    "type": "txt",
    "size": NumberInt(12),
    "content": { value1: 343, value2: "abc", … }
  }
}
Embedding
• Advantages
  – Retrieve all relevant information in a single query/document
  – Avoid implementing joins in application code
  – Update related information as a single atomic operation
    • MongoDB doesn't offer multi-document transactions
• Limitations
  – Large documents mean more overhead if most fields are not relevant
  – Might mean replicating data
  – 16 MB document size limit
Referencing
• Advantages
  – Smaller documents
  – Less likely to reach the 16 MB document limit
  – Infrequently accessed information is not retrieved on every query
  – No duplication of data
• Limitations
  – Two queries required to retrieve related information
  – Cannot update related information atomically
One-to-Many & Many-to-Many Relationships
Patients – Embed:
{ _id: 2, first: "Joe", last: "Patient", addr: {…},
  procedures: [
    { id: 12345, date: "2015-02-15", type: "Cat scan", …},
    { id: 12346, date: "2015-02-15", type: "blood test", …}
  ]
}
Patients – Reference:
{ _id: 2, first: "Joe", last: "Patient", addr: {…}, procedures: [12345, 12346] }
Procedures:
{ _id: 12345, date: "2015-02-15", type: "Cat scan", …}
{ _id: 12346, date: "2015-02-15", type: "blood test", …}
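With referencing, the join happens in application code: one query fetches the patient, a second fetches the referenced procedures. A driver-free sketch with the two collections mocked as in-memory arrays (the function name is mine; with a real driver the two lookups would be two find() calls):

```javascript
// Stand-ins for the two collections (document shapes from the slide).
const patients = [
  { _id: 2, first: 'Joe', last: 'Patient', procedures: [12345, 12346] }
];
const procedures = [
  { _id: 12345, date: '2015-02-15', type: 'Cat scan' },
  { _id: 12346, date: '2015-02-15', type: 'blood test' }
];

// "Query" 1: fetch the patient. "Query" 2: fetch its referenced procedures.
function patientWithProcedures(patientId) {
  const patient = patients.find(p => p._id === patientId);
  const procs = procedures.filter(pr => patient.procedures.includes(pr._id));
  return { ...patient, procedures: procs };
}

console.log(patientWithProcedures(2).procedures.length); // 2
```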
Schema Alternatives – Do the Math
• How complex are the queries?
• How much hardware (how many shards) will I need?
Vital Sign Monitoring Device
Vital signs measured:
• Blood pressure
• Pulse
• Blood oxygen levels
Produces data at regular intervals:
• Once per minute
We have a hospital (or several) full of devices
Data From Vital Signs Monitoring Device
{
  deviceId: 123456,
  spO2: 88,
  pulse: 74,
  bp: [128, 80],
  ts: ISODate("2013-10-16T22:07:00.000-0500")
}
• One document per minute per device
• Relational approach
Document Per Hour (By Minute)
{
  deviceId: 123456,
  spO2: { 0: 88, 1: 90, …, 59: 92 },
  pulse: { 0: 74, 1: 76, …, 59: 72 },
  bp: { 0: [122, 80], 1: [126, 84], …, 59: [124, 78] },
  ts: ISODate("2013-10-16T22:00:00.000-0500")
}
• Store per-minute data at the hourly level
• Update-driven workload
• 1 document per device per hour
Characterizing Write Differences
• Example: data generated every minute
• Recording the data for 1 patient for 1 hour:
  – Document per event: 60 inserts
  – Document per hour: 1 insert, 59 updates
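In the document-per-hour schema, minute 0 creates the hour document and minutes 1–59 update it in place using dot notation. A sketch of building that update (the helper name is mine; in the shell the resulting pair would feed a db.readings.updateOne(filter, update, options) against a live collection):

```javascript
// Build the filter/update pair for one minute's reading against
// an hourly document keyed by device and hour timestamp.
function minuteUpdate(deviceId, hourTs, minute, reading) {
  return {
    filter: { deviceId: deviceId, ts: hourTs },
    update: { $set: {
      ['spO2.' + minute]: reading.spO2,   // e.g. {"spO2.15": 88}
      ['pulse.' + minute]: reading.pulse,
      ['bp.' + minute]: reading.bp
    } },
    options: { upsert: true } // minute 0 creates the hour document
  };
}

const u = minuteUpdate(123456, '2013-10-16T22:00:00', 15,
                       { spO2: 88, pulse: 74, bp: [128, 80] });
console.log(u.update.$set['spO2.15']); // 88
```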
Characterizing Read Differences
• Want to graph 24 hours of vital signs for a patient:
  – Document per event: 1440 reads
  – Document per hour: 24 reads
• Read performance is greatly improved
Characterizing Memory and Storage Differences
(100K devices, 1 year's worth of data)

                        Document Per Minute   Document Per Hour
Number of documents     52.6 B                876 M
Total index size        6364 GB               106 GB
  _id index             1468 GB               24.5 GB
  {ts: 1, deviceId: 1}  4895 GB               81.6 GB
Document size           92 bytes              758 bytes
Database size           4503 GB               618 GB

Number of documents: 100000 × 365 × 24 × 60 vs. 100000 × 365 × 24
Index size:          100000 × 365 × 24 × 60 × 130 vs. 100000 × 365 × 24 × 130
Database size:       100000 × 365 × 24 × 60 × 92 vs. 100000 × 365 × 24 × 758
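The annotated formulas can be checked directly (the ~130 bytes of index entries per document is implied by the slide's index-size formula):

```javascript
const GiB = 1024 ** 3;

// Document counts for 100K devices over 1 year.
const perMinuteDocs = 100000 * 365 * 24 * 60; // ~52.6 B
const perHourDocs   = 100000 * 365 * 24;      // 876 M

// Database size = document count x average document size.
const perMinuteDB = perMinuteDocs * 92  / GiB; // ~4503 GB
const perHourDB   = perHourDocs   * 758 / GiB; // ~618 GB

// Index size = document count x ~130 bytes of index entries per document.
const perMinuteIdx = perMinuteDocs * 130 / GiB; // ~6364 GB
const perHourIdx   = perHourDocs   * 130 / GiB; // ~106 GB

console.log(Math.round(perMinuteIdx), Math.round(perHourIdx));
```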
Best Practices
Loading Data
Rule of Thumb
• To saturate a MongoDB cluster:
  – Loader hardware ~= MongoDB hardware
• Many threads
• Many mongos
Loader Architecture
loader → mongos → 3 shards (each a replica set: primary + 2 secondaries)
Where are the bottlenecks?
Loader Architecture
Multiple loaders (8 threads each) → multiple mongos (4) → 3 shards (each a replica set: primary + 2 secondaries)
• Use many threads
• Use multiple loader servers
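The "many threads" advice amounts to partitioning the input and issuing batched inserts concurrently. A driver-free sketch (the insertMany argument is a stand-in; with the real driver it would be collection.insertMany(batch, { ordered: false })):

```javascript
// Split docs into batches and insert them with `parallelism` concurrent workers.
async function loadInBatches(docs, batchSize, parallelism, insertMany) {
  const batches = [];
  for (let i = 0; i < docs.length; i += batchSize) {
    batches.push(docs.slice(i, i + batchSize));
  }
  let next = 0;
  async function worker() {
    while (next < batches.length) {
      const batch = batches[next++]; // single-threaded event loop: no race here
      await insertMany(batch);
    }
  }
  await Promise.all(Array.from({ length: parallelism }, worker));
}

// Usage with a mock insert function that just counts documents.
let inserted = 0;
loadInBatches(Array.from({ length: 1000 }, (_, i) => ({ _id: i })),
              100, 4, async batch => { inserted += batch.length; })
  .then(() => console.log(inserted)); // 1000
```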
When Sharding
• If you care about initial performance, you must pre-split
• Otherwise, initial performance will be slow
• (Hash sharding automatically pre-splits the collection)
Without Pre-splitting
• sh.shardCollection("records.patients", {zipcode: 1})
• Initially there is a single chunk (-∞ … ∞) on Shard 1
• With 64 MB chunks, splitting occurs quickly (-∞ … 11305, 11306 … 44506, 44507 … ∞), but balancing occurs much more slowly
• Until the balancer catches up, the entire insert and query workload hits Shard 1
Split Collection (Pre-split)
• Split and distribute empty chunks before loading any data
• Evenly distribute the query load across the cluster
Shard 1: -∞ … 08333, 08334 … 16667, 16668 … 25000
Shard 2: 25001 … 33334, 33335 … 41668, 41669 … 50000
Shard 3: 50001 … 58334, 58335 … 66668, 66669 … 75000
Shard 4: 75001 … 83334, 83335 … 96668, 96669 … 99999
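Pre-splitting boils down to computing evenly spaced split points over the shard-key range and issuing sh.splitAt() for each before loading. A sketch under the slide's assumption of integer zipcodes 0–99999 and 12 chunks (the helper is mine; exact boundaries may differ from the slide's):

```javascript
// Evenly spaced split points over [min, max) for nChunks chunks.
function splitPoints(min, max, nChunks) {
  const points = [];
  for (let i = 1; i < nChunks; i++) {
    points.push(Math.round(min + (i * (max - min)) / nChunks));
  }
  return points;
}

// 12 chunks over the zipcode range -> 11 split points, 3 chunks per shard.
const points = splitPoints(0, 100000, 12);
console.log(points.length); // 11

// In the mongo shell you would then run, for each point p:
//   sh.splitAt('records.patients', { zipcode: p })
// and distribute the empty chunks with sh.moveChunk() (or let the balancer).
```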
Summary
Best Practices
1. Use servers with specifications that will provide good MongoDB performance
   – 64+ GB RAM, many cores, many IOPS (RAID-10/SSDs)
2. Calculate how many shards you need
   1. Calculate required RAM and disk space
   2. Build a prototype to determine the ops/sec capacity of a server
   3. Do the math
3. Configure the OS for optimal MongoDB performance
   – See the MongoDB Production Notes
   – Review the logs for warnings (don't ignore them)
Best Practices (cont.)
4. Create a document schema
   – Denormalized
5. Tailor the schema to the application workload
   – Use application queries to guide schema design decisions
   – Consider alternative schemas
   – Compare cluster size (# of shards) and performance
   – Build a spreadsheet
Best Practices (cont.)
6. Loading data
   – Loader hardware ~= MongoDB hardware
   – Many threads
   – Many mongos
7. Pre-split
   – Ensure the query workload is evenly distributed across the cluster from the start