MongoDB and Big Data - University of Stirling
TRANSCRIPT
MongoDB and Big Data
Presenter: John Page
2
DINGBATS ("CUT CUT CUT", "CUT")
3
DINGBATS ("JOB AN")
4
DINGBATS ("DATA")
What is ‘Big Data’?
“Big Data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it.”
6
What is ‘Big Data’?
“Big Data problems are where the Volume, Velocity or Variety of data mean traditional data processing techniques are no longer successful.”
7
DINGBATS
8
DINGBATS
C++, PHP, JavaScript, Python, Haskell, Algol68, Perl, Go, Scheme, Erlang, COBOL
9
Options for building an Operational Database
• Key/Value Store
• Column Family
• Document Store
• Relational
• Graph Database
10
The World Has Changed
• Volume, Velocity, Variety
• Iterative, Agile, Short Cycles
• Always On, Global Scale
• Open-Source, Cloud, Commodity
• Data, Time, Risk, Cost
11
What is MongoDB?
Modern Document-model operational database.
Designed to back today’s business applications.
• Developer and Operations oriented.
• Easy to scale horizontally.
• Business Critical is the norm.
• Lessons learned from 40 years of RDBMS.
12
Relational Model (fully normalised)

Product:
  SKU      | Name             | Category | Brand | Stock Count | Price
  11574646 | Pentax 50mm Lens | 500      | 1500  | 6531        | 179.99

SKUAttribute:
  SKUAttributeID | ItemSKU  | AttrValID
  1              | 11574646 | 100
  2              | 11574646 | 200

AttrVal:
  AttrValID | AttrNameID | AttrVal
  100       | 1          | 50mm
  200       | 2          | 130g

AttrName:
  AttrNameID | AttrName
  1          | Focal Length
  2          | Weight

Category:
  CategoryID | Department
  500        | Photography

Brand:
  BrandID | Brand
  1500    | Pentax
13
Relational Model

Product:
  SKU      | Name             | Category    | Brand  | Stock Count | Price
  11574646 | Pentax 50mm Lens | Photography | Pentax | 6531        | 179.99

SKUAttribute:
  SKUAttributeID | SKU      | AttrValID
  1              | 11574646 | 100
  2              | 11574646 | 200

AttrVal:
  AttrValID | AttrName    | AttrVal
  100       | FocalLength | 50mm
  200       | Weight      | 130g
14
Document Model

  SKU      | Name             | Category    | Brand  | Stock Count | Price  | Attributes
  11574646 | Pentax 50mm Lens | Photography | Pentax | 6531        | 179.99 | Focal: 50mm; Weight: 130g
15
Document Model

{
  _id: 1157464,
  Name: "Pentax 50mm Lens",
  Category: "Photography",
  Brand: "Pentax",
  StockCount: 6531,
  Price: 179.99,
  Attributes: [
    { name: "FocalLen", value: "50mm" },
    { name: "Weight",   value: "130g" }
  ]
}
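As a standalone sketch (plain Node.js, no MongoDB server needed), the document above maps directly to a native object, so reading a nested attribute is a lookup rather than a multi-table join:

```javascript
// The slide's product document as a plain JavaScript object.
const product = {
  _id: 1157464,
  Name: "Pentax 50mm Lens",
  Category: "Photography",
  Brand: "Pentax",
  StockCount: 6531,
  Price: 179.99,
  Attributes: [
    { name: "FocalLen", value: "50mm" },
    { name: "Weight", value: "130g" }
  ]
};

// Reading a nested attribute is a direct lookup, not a join
// across product, attribute-name, and attribute-value tables.
const focal = product.Attributes.find(a => a.name === "FocalLen").value;
console.log(focal); // "50mm"
```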
16
MongoDB - Agility

Three differently-shaped documents in the same collection:

{ SKU: 1157464, Name: "Pentax 50mm Lens", Category: "Photography",
  Brand: "Pentax", StockCount: 6531, Price: 179.99, Attributes: [ ... ] }

{ SKU: 11574646, Name: "Penknife", Category: "Camping",
  Brand: "Victorinox", StockCount: 156, Price: 18.99, Restricted: true }

{ SKU: 228918, Name: "3 yr Warranty", Category: "Service", Price: 179.99 }
17
MongoDB - Usability

Shell: command-line shell for interacting directly with the database

> db.collection.insert({product: "MongoDB", type: "Document Database"})
> db.collection.findOne()
{
  "_id" : ObjectId("5106c1c2fc629bfe52792e86"),
  "product" : "MongoDB",
  "type" : "Document Database"
}

Drivers: drivers for most popular programming languages and frameworks:
Java, Python, Perl, Ruby, Haskell, JavaScript
18
MongoDB - Utility
• Complex indexed queries, e.g. Age > 65 AND Male living near "LEEDS"
• Aggregation, e.g. profit margin by age band:

  Age   | Profit Margin
  1-17  | 0
  18-35 | 20
  36-50 | 80
  51-65 | 50
  66+   | 5
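As a standalone sketch (plain JavaScript with hypothetical sample data), the "Age > 65 AND Male living near LEEDS" query combines several predicates in one pass; in MongoDB itself this would be a single `db.people.find(...)` served by suitable indexes:

```javascript
// Hypothetical sample data standing in for a collection.
const people = [
  { name: "Arthur", age: 70, gender: "M", city: "Leeds" },
  { name: "Brian",  age: 40, gender: "M", city: "Leeds" },
  { name: "Carol",  age: 72, gender: "F", city: "Leeds" },
  { name: "Derek",  age: 68, gender: "M", city: "York"  }
];

// The equivalent MongoDB query (given a geospatial index) would be e.g.:
//   db.people.find({ age: { $gt: 65 }, gender: "M",
//                    location: { $near: /* Leeds coordinates */ } })
const matches = people
  .filter(p => p.age > 65 && p.gender === "M" && p.city === "Leeds")
  .map(p => p.name);
console.log(matches); // [ "Arthur" ]
```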
19
MongoDB - Scalability
• High Availability.
• Auto Sharding.
• Data Compression.
• Fine-grained concurrency.
• Enterprise Management.
20
High Availability
• Automated replication and failover
• Multi-data center support
• Improved operational simplicity (e.g., HW swaps)
• Data durability and consistency
21
MongoDB - Scalability
• High Availability.
• Auto Sharding.
• Data Compression.
• Fine-grained concurrency.
• Enterprise Management.
22
Working Set Exceeds Physical Memory
23
Scalability
Auto-Sharding
• Increase capacity as you go
• Commodity and cloud architectures
• Improved operational simplicity and cost visibility
24
Sharding and Replication
25
Routing and Balancing

[Diagram: a mongos router in front of three shards, routing operations 1-4]
Operations run on specific shards, or in parallel on many.
26
Partitioning
• User defines shard key
• Shard key defines range of data
• Key space is like points on a line
• Range is a segment of that line

  -∞ ............ +∞   (key space)
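A minimal sketch of the key-space idea (plain JavaScript, hypothetical chunk ranges): each chunk owns a half-open segment of the shard-key line, and routing a key means finding the chunk whose segment contains it, which is roughly what mongos does using the cluster metadata:

```javascript
// Hypothetical chunk table: each chunk owns [min, max) of the key line.
const chunks = [
  { min: -Infinity, max: 100,      shard: "shard1" },
  { min: 100,       max: 500,      shard: "shard2" },
  { min: 500,       max: Infinity, shard: "shard3" }
];

// Route a shard-key value to the shard holding its chunk.
function route(key) {
  return chunks.find(c => key >= c.min && key < c.max).shard;
}

console.log(route(42));   // "shard1"
console.log(route(250));  // "shard2"
console.log(route(9000)); // "shard3"
```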
27
Data Distribution
• Initially 1 chunk
• Default max chunk size: 64 MB
• MongoDB automatically splits & migrates chunks when max reached

[Diagram: mongos routers in front of Shard 1 and Shard 2 (mongod), with secondaries and config servers]
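The split rule above can be sketched in plain JavaScript (a toy model, not MongoDB's actual split-point selection, which uses key distribution rather than the range midpoint):

```javascript
// Toy model: when a chunk grows past the max size (64 MB on the
// slide), split its key range and share the data between the halves.
function maybeSplit(chunk, maxSize) {
  if (chunk.size <= maxSize) return [chunk];
  const mid = (chunk.min + chunk.max) / 2;
  return [
    { min: chunk.min, max: mid,       size: chunk.size / 2 },
    { min: mid,       max: chunk.max, size: chunk.size / 2 }
  ];
}

const parts = maybeSplit({ min: 0, max: 1000, size: 80 }, 64);
console.log(parts.length); // 2
```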
28
Aggregation
// Building one $project stage that applies a single Mandelbrot
// iteration (z -> z^2 + c) to every document:
zrsquare = { '$multiply' : [ '$zr', '$zr' ] };                           // zr^2
zisquare = { '$subtract' : [ 0, { '$multiply' : [ '$zi', '$zi' ] } ] };  // -zi^2
zixzrx   = { '$multiply' : [ { '$multiply' : [ '$zr', '$zi' ] }, 2 ] };  // 2*zr*zi
inciflow = { '$cond' : [ { '$lte' : [ '$zr', 4 ] }, 1, 0 ] };            // still bounded?
itterate = { '$project' : {
  'cr' : 1,
  'ci' : 1,
  'zr' : { '$add' : [ zrsquare, zisquare, '$cr' ] },
  'zi' : { '$add' : [ zixzrx, '$ci' ] },
  'it' : { '$add' : [ '$it', inciflow ] }
} };
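The pipeline above computes one Mandelbrot step per document per pass. The same arithmetic in plain JavaScript (keeping the pipeline's `zr <= 4` bound check as-is):

```javascript
// One Mandelbrot iteration z -> z^2 + c, mirroring the $project stage.
function iterate(doc) {
  const zr2 = doc.zr * doc.zr;                 // zrsquare
  const zi2 = doc.zi * doc.zi;                 // -zisquare is 0 - zi2
  return {
    cr: doc.cr,
    ci: doc.ci,
    zr: zr2 - zi2 + doc.cr,                    // $add: [zrsquare, zisquare, $cr]
    zi: 2 * doc.zr * doc.zi + doc.ci,          // $add: [zixzrx, $ci]
    it: doc.it + (doc.zr <= 4 ? 1 : 0)         // $add: [$it, inciflow]
  };
}

const next = iterate({ cr: -0.5, ci: 0, zr: 0, zi: 0, it: 0 });
console.log(next.zr, next.it); // zr becomes -0.5, it becomes 1
```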
29
Aggregation
30
Sharding Aggregation
shard3shard1$match$project$group1
shard2$match$project$group1
shard2$match$project$group1
result
31
When MongoDB should be used.
• When you have high-speed access to complex objects:
  • A complex object can be updated in a fast atomic operation.
  • A complex object can be retrieved in a single quick operation.
  • A complex object can be queried.
  • Search capabilities don’t need joins.
• When you want to store larger data structures:
  • Arrays of 10,000 values or objects
  • Text up to 16 MB
  • Transparent, not opaque, BLOBs
  • BLOBs can be stored in with the data.
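The "fast atomic operation" point can be sketched as follows (plain JavaScript; the "Mount" attribute is purely illustrative). In MongoDB the entire nested change is one atomic operation:

```javascript
// In MongoDB, pushing into a nested array is a single atomic update:
//   db.products.updateOne(
//     { _id: 1157464 },
//     { $push: { Attributes: { name: "Mount", value: "K-mount" } } }
//   )
// The same shape change, sketched on a plain object:
const product = {
  _id: 1157464,
  Attributes: [{ name: "Weight", value: "130g" }]
};
product.Attributes.push({ name: "Mount", value: "K-mount" }); // one in-place change
console.log(product.Attributes.length); // 2
```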
32
When MongoDB should be used.
• When you value rapid development and evolution:
  • Direct object models (no mapping layer)
  • Application-defined schemas
  • Rich feature sets and search
• Where you need to store structures of any shape:
  • Direct object models
  • Application-defined schemas
  • Heterogeneous schemas
33
When MongoDB should be used.
• When you have large data volumes:
  • When data volumes are growing
  • Where growth is potentially unlimited
  • Where you don’t want to pay for future growth just now
• When you want distributed data access or high uptime:
  • Worldwide sites want low access times
  • Data must stay at its point of origin legally
  • Data mirroring should be as close to live as possible
34
MongoDB - Global Community
9,000,000+ MongoDB Downloads
180,000+ Online Education Registrants
30,000+ MongoDB User Group Members
20,000+ MongoDB Days Attendees
35,000+ MongoDB Management Service (MMS) Users
35
Single View Case Study: Tier 1 Global Insurance Provider
Global 360-degree view of customers’ policy portfolio and interactions

Problem:
• 70 systems and 20 screens to view customer policies
• Many CSR calls taken just to reroute the customer
• Poor customer experience
• Source systems are hard to change

Why MongoDB:
• Dynamic schema: can combine 70 systems easily
• Performance: can handle all data in one DB
• Replication: local reads and high availability
• Sharding: can add data easily by scaling out

Results:
• Delivered in 3 months for $3M; previous attempts failed at $25M
• Unified customer view available to all channels
• Shorter calls and fewer re-routed calls
• Increased customer satisfaction
36
Single View Case Study: Large Retail Bank
Mainframe offloading / mirroring for the next generation of applications

Problem:
• Mainframe costly to maintain and won’t handle additional load
• No way to meet customer needs for mobile and similar apps

Why MongoDB:
• High degree of scalability
• Ability to define data formats from the mainframe views as needed
• Broad range of functional capability

Results:
• New applications online with mainframe data cached in MongoDB
• Increase in customer satisfaction across Personal Banking
37
Case Study
Stores billions of posts in myriad formats with MongoDB

Problem:
• 1.5M posts per day, different structures
• Inflexible MySQL, lengthy delays for making changes
• Data piling up in production database
• Poor performance

Why MongoDB:
• Flexible document-based model
• Horizontal scalability built in
• Easy to use
• Interface in familiar language

Results:
• Initial deployment held over 5B documents and 10TB of data
• Automated failover provides high availability
• Schema changes are quick and easy
38
Case Study
Stores one of the world’s largest record repositories and searchable catalogues in MongoDB

Problem:
• One of the world’s largest record repositories
• Move to SOA required a new approach to the data store
• RDBMS could not support centralized data management and federation of information services

Why MongoDB:
• Fast, easy scalability
• Full query language
• Complex metadata storage

Results:
• Will scale to 100s of TB by 2013, PB by 2020
• Searchable catalogue of varied data types
• Decreased SW and support costs