MongoDB San Francisco 2013: Storing eBay's Media Metadata on MongoDB, presented by Yuri Finkelstein
DESCRIPTION
This session will be a case study of eBay's experience running MongoDB for Project Zoom, in which eBay stores all media metadata for the site. This includes references to the pictures of every item for sale on eBay. This cluster is eBay's first MongoDB installation on the platform and is a mission-critical application. Yuri Finkelstein, an Enterprise Architect on the team, will provide a technical overview of the project and its underlying architecture.

TRANSCRIPT
Storing eBay's Media Metadata on MongoDB
Yuri Finkelstein – Lead, Platform Services – [email protected]
John Feibusch – Lead DBA – [email protected]
May 2013
About eBay Platform Services
Platform Services is an org within the larger eBay Platform org; it is responsible for developing and operating common services that are used by the Web Applications running on the eBay Platform:
• Media Storage platform services: image blob and metadata
• Unified Monitoring platform: logs and metrics
• User Behavior Tracking
• Ad Content management and analytics
• Messaging and other middleware services
Platform Services and Media Metadata Service Requirements
Platform Services is a DevOps organization
• We develop, we test, we deploy, we operate, we monitor
• Whatever we are responsible for, we own and understand at the depth of the entire stack
• Therefore, we require transparency of the components we build on
• Transparency at the level of source code visibility is ideal
Key Requirements of the Media Metadata Service
• 99.999% availability
• Strictly defined invocation latency at the 95th percentile
• Simultaneous operation in multiple data centers with short replication latency
• Reliable writes: synchronous writes to at least 2 nodes
• Read-write workload with a reads-to-writes ratio of ~10:1
• Agility: fluid metadata content and constantly changing business requirements
• Terabyte scale: billions of small entities to store and query
• Scalability at the extreme: the number of pictures on eBay is constantly growing
Enter MongoDB
We have been operating MongoDB in this project for over a year now
• Sharded cluster in 2 data centers
• Service nodes are built in Java and use Morphia and the Mongo driver
• MongoS runs on the service nodes
• For the first year we were maturing the cluster with writes only; this year we are taking reads
• Reads come from user-facing web applications with strong SLA requirements
• For reads, the client first sets slaveOk=true and, if the required document is not found, flips to slaveOk=false to read from the primary
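A minimal Java sketch of that fallback, assuming the legacy Java driver where the slaveOk flag is expressed as a ReadPreference (the class here is illustrative, not eBay's actual code):

    import com.mongodb.*;

    public class FallbackReader {
        private final DBCollection coll;

        public FallbackReader(DBCollection coll) { this.coll = coll; }

        // Try a secondary first (slaveOk=true); if the document has not
        // replicated there yet, retry on the primary (slaveOk=false).
        public DBObject findById(Object id) {
            DBObject query = new BasicDBObject("_id", id);
            DBObject doc = coll.findOne(query, null, ReadPreference.secondaryPreferred());
            if (doc == null) {
                doc = coll.findOne(query, null, ReadPreference.primary());
            }
            return doc;
        }
    }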
[Architecture diagram: Metadata Service Nodes (Service Layer → Morphia → Mongo Driver → co-located MongoS) in front of a sharded cluster replicated across two data centers (DC1, DC2); each shard is a replica set with a primary and a hidden member. S – service instance; P – primary mongod; H – hidden member]
Centralized MongoDB configuration store
Our MongoDB deployment package is based on a custom-built RPM and contains heavy customization scripts
One of them is responsible for fetching the configuration for the node it is running on from a remote configuration repository at start-up time
Benefits:
• Can change the MongoDB configuration instantly on an arbitrarily large number of nodes
• Can change local system settings affecting MongoDB: read-ahead settings on block devices and the IO scheduler
• Can relocate replica set members across machines (subject to data migration)
• Consistent inventory tracking and visibility into config settings on any Mongo machine
[Diagram: at startup, each replica set member fetches its configuration from the Central MongoDB Config Repository]
Upstart
Upstart is a replacement for init.d; it was developed for Ubuntu and is also used in RHEL 6
Can automatically start our monitoring agent whenever mongod starts.
Handles multiple mongod instances well
Example:
sudo start mongod interface=0
Future: Upstart can be controlled by Puppet.
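A sketch of what such an Upstart job might look like for the example above (the file name, paths, and per-instance config layout are assumptions, not eBay's actual script):

    # /etc/init/mongod.conf -- hypothetical Upstart job definition
    # One job, many instances: started as "sudo start mongod interface=0"
    description "mongod"
    instance $interface
    respawn
    start on runlevel [2345]
    stop on runlevel [016]
    # Each instance reads its own config file, e.g. /etc/mongod-0.conf
    exec /usr/bin/mongod --config /etc/mongod-$interface.conf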
Run multiple MongoD instances on the same machine
We are starting to run multiple mongod processes on one node
Instead of using multiple ports, we create multiple virtual interfaces on a single host and register them in DNS as if they were real IP addresses
MongoD supports bind_ip, which makes it possible to bind to a specific virtual interface
Why virtual interfaces? So that DB hosts can be moved with just a DNS change, as sketched below
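A sketch of the mechanics (the interface label, addresses, and DNS name are made up):

    # Hypothetical: one virtual interface per mongod instance
    sudo ip addr add 10.0.0.11/24 dev eth0 label eth0:0
    # DNS: shard1.db.example.com -> 10.0.0.11
    mongod --bind_ip 10.0.0.11 --port 27017 --dbpath /data/shard1
    # Moving this mongod to another host later only requires re-pointing
    # shard1.db.example.com in DNS; connection strings stay the same.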
Why do we want to run multiple MongoD on a single host?
• On large machines with lots of disk IO and storage capacity, a single mongod cannot utilize all IO resources
• Running multiple shards on the same machine reduces data granularity and reduces the scope of each write lock
• This works well only when the multiple MongoD instances on one machine have similar workloads
Home-grown MongoDB monitoring system
A home-grown agent runs on each MongoDB host and collects very specific metrics that are not available in MMS:
• Per-block-device disk write latency and disk IOPS
• Details of per-collection MongoDB metrics
Can overlay multiple graphs from RS members on the same chart
GLE latency is very important to us, since we are doing getLastError({w:2})
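For context, a minimal Java sketch of such a write (host, database, and collection names are assumptions): a write concern of w=2 is the driver-level counterpart of getLastError({w:2}) and acknowledges only after two replica set members have the write.

    import com.mongodb.*;

    public class ReliableWrite {
        public static void main(String[] args) throws Exception {
            MongoClient mongo = new MongoClient("mongos.example.com");
            DBCollection items = mongo.getDB("media").getCollection("item");
            // The GLE latency we monitor is the time spent waiting for this ack
            items.insert(new BasicDBObject("_id", 123456789L), new WriteConcern(2));
            mongo.close();
        }
    }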
Media Metadata Service: Data Model
2 main collections: Item and Image
• Item references multiple Images
Item represents an eBay item:
• _id in Item is the external ID of the item in the eBay site DB
• These IDs are already sharded and balanced across N logical DB hosts using ID ranges
• We use MongoDB pre-split points for the initial mapping of our N site DB shards to M MongoDB shards (see the sketch below)
• This ensures good balance between the shards
Image represents a picture attached to an Item:
• _id in Image is based on a modified Mongo ObjectID
• This ensures good distribution across any number of shards
Our choice of document IDs in both collections ensures good balance across the Mongo shards
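For illustration, a mongo-shell sketch of pre-splitting (the namespace, split points, and shard name are made up; the real split points come from the site DB's ID ranges):

    // Run against the empty sharded collection before it takes writes
    sh.splitAt("media.item", { _id: NumberLong("1000000000") })
    sh.splitAt("media.item", { _id: NumberLong("2000000000") })
    // Place each resulting chunk on the MongoDB shard it should map to
    sh.moveChunk("media.item", { _id: NumberLong("1000000000") }, "shard0001")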
Problem #1: What should be the ID for the documents?
ObjectId is not a good shard key for a sharded collection, since the timestamp occupies its first 4 bytes.
Problem: how should the app generate the ID when one is required?
Requirements:
• Even distribution across shards, both long term and short term
• Locality of placement of the indexed _id values in the B-tree: minimize the chance of a page fault on an index page, and increase the chance of collation of dirty pages in the page cache, to reduce the amount of random IO when flushing pages to disk
• Compactness in size is always good, to preserve space
One possible solution: a 6-byte ID with the following layout:
• 1 byte – rotating sequence ID, incremented by each writer on every document
• 1 byte – writer ID (assuming the number of writers is < 256)
• 4 bytes – timestamp in seconds
Works with the limitation that each writer cannot insert more than 256 documents per second
MongoDB ObjectId() layout: Timestamp (4 bytes) | MachineID (4 bytes) | SequenceNo (4 bytes)
Shard-Friendly ID layout: SequenceNo (1 byte) | WriterID (1 byte) | Timestamp (4 bytes)
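A minimal Java sketch of such a generator (the class and its details are illustrative; the talk shows only the layout):

    import java.nio.ByteBuffer;

    // Hypothetical 6-byte shard-friendly ID:
    // [1 byte rotating sequence][1 byte writer ID][4-byte timestamp, seconds]
    public class ShardFriendlyId {
        private final byte writerId;   // unique per writer; assumes < 256 writers
        private int seq = 0;           // rotates through 0..255

        public ShardFriendlyId(byte writerId) { this.writerId = writerId; }

        // Stated limitation: more than 256 inserts per writer per second
        // would reuse a (sequence, writer, timestamp) triple and collide.
        public synchronized byte[] next() {
            ByteBuffer buf = ByteBuffer.allocate(6);
            buf.put((byte) (seq++ & 0xFF)); // leading byte spreads writes across shards
            buf.put(writerId);
            buf.putInt((int) (System.currentTimeMillis() / 1000L));
            return buf.array();
        }
    }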
Shard-Friendly ID details
[Diagram: 6-byte ID values plotted over adjacent minutes (N-th min, N-th+1 min); each sequence value (00…, 0f…, 55…, aa…, ff…) occupies its own band of the key space, giving 20 contiguous ranges for each sequence value with 20 writers]
Let's say we have 20 writers and 3 shards
Number of contiguous intervals in each shard: 256/3 * 20 = 1100
Worst-case scenario: each contiguous range requires a separate IO; at 200 IOPS that is ~5 sec to flush. In reality it is much better, because of 4k pages
Rate of writes: 256 docs/sec per writer
Number of dirty locations over 1 minute: 256 * 60 * 20 ≈ 307,000. So if _id were md5 or some other random value with a near-perfect distribution, it would require ~300 times more IOPS
Problem #2: md5 lookup problem
MD5 is a digest of the image content; it is used for de-duping
Requirement: find image documents with a given md5 value
Option 1: a secondary index on the image documents; does not work because:
• Large DB; random reads cause disk IO
• The Image collection is sharded by image ID, so we are forced to query all shards
Option 2: a stand-alone replica set acting as a cache
• Works, since the data is compact and fits in RAM; no disk IO
• How do we store the md5 → image ID mapping in Mongo?
• Option 2.1: as an array
  { _id: Binary(md5), ref: [ref1, ref2, ref3, …] }
  Does not work well: as refs are added, documents grow and relocate
• Option 2.2: a single binary value packed into the _id
  { _id: Binary(md5|ref) }
  Works; the lookup is a prefix search on a covering index
Query: db.coll.find({ _id: { $gt: Binary(md5|0x0000) } }, { _id: 1 })
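A Java sketch of the Option 2.2 lookup (the collection name and the 6-byte ref width are assumptions): because _id is the md5 with the image ref appended, all refs for one md5 form one contiguous _id range.

    import com.mongodb.*;
    import java.util.Arrays;

    public class Md5Lookup {
        // Returns the packed md5|ref _id values for a given 16-byte md5
        public static DBCursor findRefs(DBCollection coll, byte[] md5) {
            byte[] lower = Arrays.copyOf(md5, md5.length + 6); // ref bytes = 0x00
            byte[] upper = Arrays.copyOf(md5, md5.length + 6);
            Arrays.fill(upper, md5.length, upper.length, (byte) 0xFF);
            DBObject range = new BasicDBObject("_id",
                    new BasicDBObject("$gte", lower).append("$lte", upper));
            // Project only _id: the query is covered by the _id index
            return coll.find(range, new BasicDBObject("_id", 1));
        }
    }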
Problem #3: Item’s main picture size lookup
The Image document has the image dimensions: width and height
An Item document references N pictures; one of them is the main picture
Problem: look up the image dimensions of the main pictures for 50 item documents at once, with a latency SLA of < 20 msec
It is a variation of Problem #2, except it is worse: the item ID and the image dimensions live in different documents, and 50 lookups at once are required
Again we need a dedicated replica set
Option 1: prefix search with $or and $and
Option 2: just query by _id
Option 3: query by _id, but on another compound index: {_id: 1, wh: 1}
The winner is option #3! Hint: a covering index
Option 1:
{ _id: Binary(item|WxH) }
Query: db.coll.find({ $or: [
  { _id: { $gt: Binary(id1|0x0000), $lt: Binary(id1|0xffff) } },
  { _id: { $gt: Binary(id2|0x0000), $lt: Binary(id2|0xffff) } },
  …
] })
Option 2:
{ _id: item, wh: WxH }
Query: db.coll.find({ _id: { $in: [item1, item2, …] } })
Option 3:
{ _id: item, wh: WxH }
Query: db.coll.find({ _id: { $in: [item1, item2, …] } }).hint({ _id: 1, wh: 1 })
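A Java sketch of the winning option (the collection name is an assumption): the hint forces the compound index, and projecting only _id and wh keeps the query covered, so the 50 lookups are answered from the index alone.

    import com.mongodb.*;
    import java.util.List;

    public class MainPictureSizes {
        public static DBCursor lookup(DBCollection coll, List<Object> itemIds) {
            DBObject idx = new BasicDBObject("_id", 1).append("wh", 1);
            DBObject query = new BasicDBObject("_id", new BasicDBObject("$in", itemIds));
            DBObject proj = new BasicDBObject("_id", 1).append("wh", 1);
            return coll.find(query, proj).hint(idx); // covered by {_id:1, wh:1}
        }
    }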
Problem #4: Periodic export to Hadoop
Problem: a daily copy of the new or updated documents to Hadoop
Option 1: the service does 2 writes: one to Mongo and one to Hadoop
• Does not work, since Hadoop is not an online system
Option 2: a secondary index on lastUpdated (date); then query on lastUpdated > T
• Does not work well, since updating the indexed lastUpdated is costly; also, consuming a large number of docs from a live cluster is disruptive to latency SLAs
Option 3: OpLog replication
• Winner: decouples the export from site activity and makes the lastUpdated index unnecessary (see the listener sketch below)
[Diagram: the shards' replica sets feeding an OpLog Listener]
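A Java sketch of such an oplog listener (the namespace filter and the export sink are assumptions): a tailable, await-data cursor on local.oplog.rs streams inserts and updates without touching the live collections.

    import com.mongodb.*;

    public class OplogListener {
        public static void tail(MongoClient mongo) {
            DBCollection oplog = mongo.getDB("local").getCollection("oplog.rs");
            DBObject filter = new BasicDBObject("ns", "media.item"); // assumed namespace
            DBCursor cursor = oplog.find(filter)
                    .addOption(Bytes.QUERYOPTION_TAILABLE)
                    .addOption(Bytes.QUERYOPTION_AWAITDATA);
            while (cursor.hasNext()) {
                DBObject entry = cursor.next();
                // "op" is i/u/d; "o" holds the inserted doc or the update spec
                exportToHadoop(entry);
            }
        }

        private static void exportToHadoop(DBObject entry) {
            System.out.println(entry); // placeholder for the real Hadoop sink
        }
    }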
Problem #5: What’s the fastest way to perform a full scan?
Problem: you have a huge database/collection, with terabytes of data and billions of documents
You need to perform a form of batch processing on all the documents and you want the fastest pipe out of mongo
Option 1: do it on a live node as it is serving traffic
• Does not work well when the node is busy
• Also, data consistency may be an issue
OK, we need to take the node off-line
Option 2: execute a natural-order scan with a single natural-order cursor
• Works, but slow; lots of synchronization between the two sides
Option 3: N cursors using range queries on _id or any other indexed field
• Slow in the general case, when the order of indexed values in the B-tree does not match the order on disk
Option 4: N natural-order cursors (see the sketch below)
One cursor:
db.collection.find().sort({ $natural: 1 })
N cursors (the i-th of N):
db.collection.find().sort({ $natural: 1 }).skip(i*N).limit(N)
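A Java sketch of Option 4 (the worker count and the processing hook are assumptions): split the off-line node's collection into natural-order segments and scan them from independent threads.

    import com.mongodb.*;
    import java.util.concurrent.*;

    public class ParallelScan {
        public static void scan(final DBCollection coll, int workers) throws Exception {
            final long chunk = (coll.count() + workers - 1) / workers;
            ExecutorService pool = Executors.newFixedThreadPool(workers);
            for (int i = 0; i < workers; i++) {
                final int offset = (int) (i * chunk);
                pool.submit(new Runnable() {
                    public void run() {
                        DBCursor c = coll.find()
                                .sort(new BasicDBObject("$natural", 1))
                                .skip(offset).limit((int) chunk);
                        while (c.hasNext()) {
                            process(c.next()); // batch-processing hook (assumed)
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.DAYS);
        }

        private static void process(DBObject doc) { /* placeholder */ }
    }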
Summary
We are running MongoDB in a demanding environment where it is exposed to business-sensitive online applications
It seems to be reliable – this is what matters
It has lots of features and gives the user lots of options to choose from
It is the user's depth of understanding of the product, and their desire to have visibility into every aspect of its performance, that will determine whether a particular use case will be a success or not
Questions?
Thank you!
Btw, if any of this sounds interesting, we have lots of similar challenges to work on. So, you know the drill: yfinkelstein at ebay dot com