realtime analytics with mongodb - mongodb meetup nyc


Upload: jared-rosoff

Post on 14-May-2015


DESCRIPTION

How Yottaa used MongoDB & Ruby on Rails to build a scalable realtime analytics platform. This was my presentation for the NYC MongoDB Meetup on 11-16-2010.

TRANSCRIPT

Page 1: Realtime Analytics with MongoDB - MongoDB Meetup NYC

Yottaa Inc., 2 Canal Park, 5th Floor, Cambridge MA 02141
http://www.yottaa.com

Realtime Analytics with MongoDB & Rails

Jared Rosoff, @forjared

[email protected]

Page 2: Realtime Analytics with MongoDB - MongoDB Meetup NYC

Overview

• About Yottaa
• Engineering challenges
• Approaches we considered
• How we did it
• How it works

Page 3: Realtime Analytics with MongoDB - MongoDB Meetup NYC

©Yottaa Confidential. Do Not Distribute.

Who’s driving your website?

Is your website slow?
http://stop-the-damage.com/2010/08/276/

Page 4: Realtime Analytics with MongoDB - MongoDB Meetup NYC

We can help you make it faster

OMG!! 15 seconds?

WTF?

Page 5: Realtime Analytics with MongoDB - MongoDB Meetup NYC

Knowing is half the battle

San Francisco

Washington DC

London

RFC2616

Page 6: Realtime Analytics with MongoDB - MongoDB Meetup NYC

Data data everywhere

• We collect lots of data
  – 14,000+ URLs being tracked
  – Up to 300 samples per URL per day
  – Some samples are >1MB (Firebug)
  – Missing a sample isn’t a big deal
• We try to make everything real time
  – No batch jobs, everything displayed as it happens
  – “Check Now” button runs tests on demand

Page 7: Realtime Analytics with MongoDB - MongoDB Meetup NYC


Demo!

Page 8: Realtime Analytics with MongoDB - MongoDB Meetup NYC

Engineering Challenges

• High write volume from day 1
  – Sample collection is like having millions of users on the first day
  – After 60 days, we have > 150GB of data
  – Adding about 5GB / day today
• Small engineering team
  – 1 built data warehouse & portal, 1 built monitoring agents
  – Bigger team now, but this was how we started
• Must be Agile
  – We didn’t know exactly what features we’d need
  – Requirements change daily
• Limited operations budget
  – No full-time operations staff
  – 100% in the cloud

Page 9: Realtime Analytics with MongoDB - MongoDB Meetup NYC


Rails default architecture

MySQL

Data Source Collection Server

User Reporting Server

“Just” a Rails App

Performance Bottleneck: Too much load

Page 10: Realtime Analytics with MongoDB - MongoDB Meetup NYC

Let’s add replication!

[Diagram labels: MySQL Master, MySQL Slave, Replication]

Data Source Collection Server

User Reporting Server

Off the shelf!

Scalable Reads!

Performance Bottleneck: Still can’t scale writes

Page 11: Realtime Analytics with MongoDB - MongoDB Meetup NYC

What about sharding?

[Diagram labels: MySQL Master (three instances), Sharding]

Data Source Collection Server

User Reporting Server

Scalable Writes!

Development Bottleneck: Need to write custom code

Page 12: Realtime Analytics with MongoDB - MongoDB Meetup NYC

Key Value stores to the rescue?

[Diagram labels: MySQL Master, Cassandra or Voldemort]

Data Source Collection Server

User Reporting Server

Scalable Writes!

Development Bottleneck: Reporting is limited / hard

Page 13: Realtime Analytics with MongoDB - MongoDB Meetup NYC

Can I Hadoop my way out of this?

[Diagram labels: MySQL Master, Cassandra or Voldemort, Hadoop, MySQL Master, MySQL Slave]

Data Source Collection Server

User Reporting Server

Scalable Writes!

Flexible Reports!

“Just” a Rails App

Development Bottleneck: Too many systems!

Page 14: Realtime Analytics with MongoDB - MongoDB Meetup NYC

MongoDB!

[Diagram labels: MySQL Master, MongoDB]

Data Source Collection Server

User Reporting Server

Scalable Writes!

“Just” a rails app

Flexible Reporting!

Page 15: Realtime Analytics with MongoDB - MongoDB Meetup NYC

[Diagram labels: Data Source, User, Load Balancer, App Server, Nginx, Passenger, Collection, Reporting, Mongos, MongoD (three instances)]

Sharding!

High Concurrency

Scale-Out

Easy as Rails!

Page 16: Realtime Analytics with MongoDB - MongoDB Meetup NYC

3 Steps to Real Time Analytics

1. Collect data
2. Store Data
3. Display Reports

Page 17: Realtime Analytics with MongoDB - MongoDB Meetup NYC

3 Steps to Real Time Analytics

1. Collect data
2. Store Data
3. Display Reports

Page 18: Realtime Analytics with MongoDB - MongoDB Meetup NYC

Collecting Data

[Diagram labels: Data Source (three instances), Load Balancer, Collection Server (four instances)]

POST http://collector.com/samples

We use Amazon ELB

We use Amazon EC2

Page 19: Realtime Analytics with MongoDB - MongoDB Meetup NYC

Collecting Data

- Sample data is passed in the body of a POST request
- Rails makes it really easy to parse JSON, XML, YAML (we use JSON)
- We have a bunch of other stuff that happens when data arrives, but all you really need to do is write the data

Page 20: Realtime Analytics with MongoDB - MongoDB Meetup NYC

A Sample Sample!

{
  url: 'www.google.com',
  location: 'SFO',
  connect: 23,
  first_byte: 123,
  last_byte: 245,
  timestamp: 1234
}
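A minimal sketch of what "parse the body and write the data" could look like for a sample such as the one above. It is plain JavaScript, not the Rails code from the talk (which the slides do not show); the helper name `parseSample` and its validation rules are assumptions made for illustration.

```javascript
// Hypothetical helper: parse the JSON body of a POST /samples request.
// Field names follow the sample on the slide; the checks are assumptions.
function parseSample(body) {
  const sample = JSON.parse(body); // throws SyntaxError on malformed JSON
  const numericFields = ["connect", "first_byte", "last_byte", "timestamp"];

  if (typeof sample.url !== "string" || typeof sample.location !== "string") {
    throw new Error("sample needs string 'url' and 'location' fields");
  }
  for (const field of numericFields) {
    if (typeof sample[field] !== "number") {
      throw new Error("sample field '" + field + "' must be a number");
    }
  }
  return sample;
}

// Simulate the body a data source would send.
const body = JSON.stringify({
  url: "www.google.com", location: "SFO",
  connect: 23, first_byte: 123, last_byte: 245, timestamp: 1234,
});
const parsed = parseSample(body);
console.log(parsed.url, parsed.connect);
```

Since missing a sample "isn't a big deal" for this workload, a collector built this way could simply drop bodies that fail validation.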

Page 21: Realtime Analytics with MongoDB - MongoDB Meetup NYC

A more complicated example


Page 22: Realtime Analytics with MongoDB - MongoDB Meetup NYC


"{\"location\":\"aws-us-east\",\"timestamp\":\"08/05/2010 07:11:54\",\"http_archive\":{\"log\":{\"creator\":{\"name\":\"Firebug\",\"version\":\"1.4.3\"},\"version\":\"1.1\",\"pages\":[{\"title\":\"\\u4e2d\\u56fd\\u7f51\\u7edc\\u7535\\u89c6\\u53f0-CNTV\",\"id\":\"page_0\",\"startedDateTime\":\"2010-08-05T08:11:51.897 01:00\",\"pageTimings\":{\"onContentLoad\":1883,\"onLoad\":2828}}],\"entries\":[{\"timings\":{\"connect\":null,\"wait\":561,\"blocked\":null,\"receive\":19,\"send\":0,\"dns\":0},\"response\":{\"statusText\":\"OK\",\"headersSize\":-1,\"httpVersion\":\"HTTP/1.1\",\"bodySize\":2067,\"content\":{\"size\":4467,\"mimeType\":\"text/html\"},\"status\":200,\"redirectURL\":\"\"},\"cache\":{},\"pageref\":\"page_0\",\"time\":580,\"startedDateTime\":\"2010-08-05T08:11:51.897 01:00\",\"request\":{\"headersSize\":-1,\"method\":\"GET\",\"url\":\"http://www.cntv.cn/\",\"httpVersion\":\"HTTP/1.1\",\"bodySize\":-1}},{\"timings\":{\"connect\":null,\"wait\":188,\"blocked\":null,\"receive\":1,\"send\":0,\"dns\":0},\"response\":{\"statusText\":\"OK\",\"headersSize\":-1,\"httpVersion\":\"HTTP/1.1\",\"bodySize\":740,\"content\":{\"size\":740,\"mimeType\":\"image/jpeg\"},\"status\":200,\"redirectURL\":\"\"},\"cache\":{},\"pageref\":\"page_0\",\"time\":370,\"startedDateTime\":\"2010-08-05T08:11:52.481 01:00\",\"request\":{\"headersSize\":-1,\"method\":\"GET\",\"url\":\"http://www.cntv.cn/nettv/homepage2010/globalhomepage_image/r_bg.jpg\",\"httpVersion\":\"HTTP/1.1\",\"bodySize\":-1}},{\"timings\":{\"connect\":null,\"wait\":3,\"blocked\":null,\"receive\":1,\"send\":0,\"dns\":1280},\"response\":{\"statusText\":\"OK\",\"headersSize\":-1,\"httpVersion\":\"HTTP/1.1\",\"bodySize\":2933,\"content\":{\"size\":7377,\"mimeType\":\"application/x-javascript\"},\"status\":200,\"redirectURL\":\"\"},\"cache\":{},\"pageref\":\"page_0\",\"time\":1285,\"startedDateTime\":\"2010-08-05T08:11:52.483 
01:00\",\"request\":{\"headersSize\":-1,\"method\":\"GET\",\"url\":\"http://www.cctv.com/Library/a2.js\",\"httpVersion\":\"HTTP/1.1\",\"bodySize\":-1}},{\"timings\":{\"connect\":null,\"wait\":171,\"blocked\":null,\"receive\":83,\"send\":0,\"dns\":363},\"response\":{\"statusText\":\"OK\",\"headersSize\":-1,\"httpVersion\":\"HTTP/1.1\",\"bodySize\":76508,\"content\":{\"size\":76508,\"mimeType\":\"image/png\"},\"status\":200,\"redirectURL\":\"\"},\"cache\":{},\"pageref\":\"page_0\",\"time\":716,\"startedDateTime\":\"2010-08-05T08:11:52.489 01:00\",\"request\":{\"headersSize\":-1,\"method\":\"GET\",\"url\":\"http://www.cntv.cn/nettv/homepage2010/globalhomepage_image/r_top.png\",\"httpVersion\":\"HTTP/1.1\",\"bodySize\":-1}},{\"timings\":{\"connect\":null,\"wait\":156,\"blocked\":null,\"receive\":1,\"send\":0,\"dns\":472},\"response\":{\"statusText\":\"OK\",\"headersSize\":-1,\"httpVersion\":\"HTTP/1.1\",\"bodySize\":5351,\"content\":{\"size\":5351,\"mimeType\":\"image/png\"},\"status\":200,\"redirectURL\":\"\"},\"cache\":{},\"pageref\":\"page_0\",\"time\":629,\"startedDateTime\":\"2010-08-05T08:11:52.490 01:00\",\"request\":{\"headersSize\":-1,\"method\":\"GET\",\"url\":\"http://www.cntv.cn/nettv/homepage2010/globalhomepage_image/r_link.png\",\"httpVersion\":\"HTTP/1.1\",\"bodySize\":-1}},{\"timings\":{\"connect\":null,\"wait\":147,\"blocked\":null,\"receive\":0,\"send\":0,\"dns\":470},\"response\":{\"statusText\":\"OK\",\"headersSize\":-1,\"httpVersion\":\"HTTP/1.1\",\"bodySize\":2068,\"content\":{\"size\":2068,\"mimeType\":\"image/png\"},\"status\":200,\"redirectURL\":\"\"},\"cache\":{},\"pageref\":\"page_0\",\"time\":617,\"startedDateTime\":\"2010-08-05T08:11:52.492 
01:00\",\"request\":{\"headersSize\":-1,\"method\":\"GET\",\"url\":\"http://www.cntv.cn/nettv/homepage2010/globalhomepage_image/r_bottom.png\",\"httpVersion\":\"HTTP/1.1\",\"bodySize\":-1}},{\"timings\":{\"connect\":null,\"wait\":278,\"blocked\":null,\"receive\":1,\"send\":0,\"dns\":667},\"response\":{\"statusText\":\"OK\",\"headersSize\":-1,\"httpVersion\":\"HTTP/1.1\",\"bodySize\":43,\"content\":{\"size\":43,\"mimeType\":\"image/gif\"},\"status\":200,\"redirectURL\":\"\"},\"cache\":{},\"pageref\":\"page_0\",\"time\":947,\"startedDateTime\":\"2010-08-05T08:11:53.777 01:00\",\"request\":{\"headersSize\":-1,\"method\":\"GET\",\"url\":\"http://cntv.wrating.com/a.gif?a=12a411781af&t=&i=-8a7b8e17f.12a411781b0.0.1a46b8aed32bf8&b=http://www.cntv.cn/&c=860010-1101020100&s=1364x768x16&l=en-us&z=1&j=1&f=-&r=http://cntv.cn/&kw=&ut=30&n=&js=0,1.292&ck=1\",\"httpVersion\":\"HTTP/1.1\",\"bodySize\":-1}}],\"browser\":{\"name\":\"Firefox\",\"version\":\"3.5.8\"}}},\"url\":\"http://cntv.cn\"}

Page 23: Realtime Analytics with MongoDB - MongoDB Meetup NYC

3 Steps to Real Time Analytics

1. Collect data
2. Store Data
3. Display Reports

Page 24: Realtime Analytics with MongoDB - MongoDB Meetup NYC

Thinking in rows

URL   Location   Connect   First Byte   Last Byte   Timestamp

{ url: 'www.google.com', location: 'SFO', connect: 23, first_byte: 123, last_byte: 245, timestamp: 1234 }

{ url: 'www.google.com', location: 'NYC', connect: 23, first_byte: 123, last_byte: 245, timestamp: 2345 }

Page 25: Realtime Analytics with MongoDB - MongoDB Meetup NYC

Thinking in rows

URL   Location   Connect   First Byte   Last Byte   Timestamp

What was the average connect time for google on Friday?

From SFO?
From NYC?
Between 1AM-2AM?

Page 26: Realtime Analytics with MongoDB - MongoDB Meetup NYC

Thinking in rows

URL   Location   Connect   First Byte   Last Byte   Timestamp

[Diagram labels: AVG (three times), Day 1, Day 2, Day 3, Result]

Up to 100’s of samples per URL per day!!

30 days average query range

An “average” chart had to hit 600 rows
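To make the cost of the row model concrete, here is a small sketch in plain JavaScript (not the SQL a MySQL-backed version would run). It answers "average connect time for one URL from one location in a time window" by reading every matching raw sample. The helper `averageConnect` and the sample rows are made up for illustration.

```javascript
// Each raw sample is one row, as in the table on the slide.
const rows = [
  { url: "www.google.com", location: "SFO", connect: 23, timestamp: 1234 },
  { url: "www.google.com", location: "NYC", connect: 27, timestamp: 2345 },
  { url: "www.google.com", location: "SFO", connect: 31, timestamp: 3456 },
  { url: "www.example.com", location: "SFO", connect: 90, timestamp: 1500 },
];

// Hypothetical helper: every matching row must be read to produce one average.
function averageConnect(allRows, url, location, from, to) {
  const matching = allRows.filter(
    (r) => r.url === url && r.location === location &&
           r.timestamp >= from && r.timestamp <= to
  );
  if (matching.length === 0) return null;
  const total = matching.reduce((sum, r) => sum + r.connect, 0);
  return { average: total / matching.length, rowsUsed: matching.length };
}

const sfoAverage = averageConnect(rows, "www.google.com", "SFO", 0, 5000);
console.log(sfoAverage);
```

The work grows with the number of samples in the window, which is why a 30-day chart ends up touching hundreds of rows.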

Page 27: Realtime Analytics with MongoDB - MongoDB Meetup NYC

Thinking in Documents

URL: www.google.com
Day: 9/20/2010
Last Byte
  Sum: 2312, Count: 12
  SFO: Sum: 1200, Count: 5
  NYC: Sum: 1112, Count: 7

This document contains all data for www.google.com collected during 9/20/2010

The top-level sum and count tell us the average value for this metric for this url / time period

The SFO sum and count give the average value from SFO

The NYC sum and count give the average value from NYC
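A short sketch of how the averages fall out of such a document, using the numbers from the slide. The object layout (metric, then location, then sum and count) and the helper name `average` are assumptions; the slide only shows the values.

```javascript
// The daily document from the slide, written as a plain object.
// The nesting (metric -> location -> sum/count) is an assumed layout.
const daily = {
  url: "www.google.com",
  day: "9/20/2010",
  last_byte: {
    sum: 2312, count: 12,
    SFO: { sum: 1200, count: 5 },
    NYC: { sum: 1112, count: 7 },
  },
};

// Hypothetical helper: an average is just sum / count, no raw samples needed.
function average(bucket) {
  return bucket.count === 0 ? null : bucket.sum / bucket.count;
}

console.log(average(daily.last_byte));      // overall average for the day
console.log(average(daily.last_byte.SFO));  // 240
console.log(average(daily.last_byte.NYC));  // average from NYC
```

Note that the per-location sums and counts add up to the top-level totals (1200 + 1112 = 2312, 5 + 7 = 12), so the overall figure never has to be recomputed from the locations.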

Page 28: Realtime Analytics with MongoDB - MongoDB Meetup NYC

Storing a sample

db.metrics.dailies.update(
  // Which document we’re updating
  // (JavaScript months are 0-based, so 9 is October)
  { url: 'www.google.com', day: new Date(2010, 9, 2) },
  // Atomically update the document
  { '$inc': {
      'connect.sum': 1234,       // update the aggregate value
      'connect.count': 1,
      'connect.sfo.sum': 1234,   // update the location specific value
      'connect.sfo.count': 1
  } },
  true // upsert: create the document if it doesn’t already exist
);
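The effect of an upsert with `$inc` on dotted paths can be imitated in memory to see how the document evolves. This is only a single-threaded stand-in written for illustration: the `upsertInc` helper and the Map-based store are hypothetical, and the atomicity that MongoDB provides on the server is not modelled.

```javascript
// In-memory stand-in for a collection: a Map keyed by "url|day".
const dailies = new Map();

// Hypothetical helper imitating update(query, { $inc: ... }, true):
// create the document if missing, then add each increment at its dotted path.
function upsertInc(store, url, day, increments) {
  const key = url + "|" + day;
  if (!store.has(key)) store.set(key, { url: url, day: day }); // the "upsert"
  const doc = store.get(key);
  for (const [path, amount] of Object.entries(increments)) {
    const parts = path.split(".");
    let node = doc;
    for (const part of parts.slice(0, -1)) {
      if (node[part] === undefined) node[part] = {}; // create nested objects on demand
      node = node[part];
    }
    const leaf = parts[parts.length - 1];
    node[leaf] = (node[leaf] || 0) + amount; // missing fields start from 0
  }
  return doc;
}

// Two samples for the same URL and day: one from SFO, one from NYC.
upsertInc(dailies, "www.google.com", "2010-09-20", {
  "connect.sum": 23, "connect.count": 1,
  "connect.sfo.sum": 23, "connect.sfo.count": 1,
});
const updated = upsertInc(dailies, "www.google.com", "2010-09-20", {
  "connect.sum": 27, "connect.count": 1,
  "connect.nyc.sum": 27, "connect.nyc.count": 1,
});
console.log(JSON.stringify(updated));
```

Both samples land in one document: the first call creates it, the second only increments counters.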

Page 29: Realtime Analytics with MongoDB - MongoDB Meetup NYC

An example document

{
  "_id": ObjectId("4bb55c59c3666e02fc000001"),
  "url": "http://www.google.com/",
  "date": "Mon Jun 07 2010 00:00:00 GMT",
  "connect": {
    "sum": 999,                // sum of all the locations
    "sum_of_squares": 99999,
    "count": 99,
    "san_francisco": {
      "sum": 555,              // sum of this location
      "sum_of_squares": 55555,
      "count": 55,
      "values": [
        ["Mon Jun 07 2010 20:00:00 GMT", 12],
        ["Mon Jun 07 2010 20:10:00 GMT", 13],
        .........
      ]
    },
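The slides do not say what `sum_of_squares` is used for. One plausible use, assumed here, is deriving the variance and standard deviation from running totals alone, since all three fields can be maintained with `$inc`. The helper name `stats` is hypothetical, and the third value (17) is invented to extend the two values shown in the document.

```javascript
// Hypothetical helper: mean and population standard deviation from running totals.
// variance = E[x^2] - (E[x])^2, computed from sum, sum_of_squares and count.
function stats(bucket) {
  const mean = bucket.sum / bucket.count;
  const variance = bucket.sum_of_squares / bucket.count - mean * mean;
  return { mean: mean, stddev: Math.sqrt(Math.max(variance, 0)) }; // clamp rounding noise
}

// Build the totals one value at a time, the way incremental updates would.
const values = [12, 13, 17];
const bucket = { sum: 0, sum_of_squares: 0, count: 0 };
for (const v of values) {
  bucket.sum += v;
  bucket.sum_of_squares += v * v;
  bucket.count += 1;
}

const result = stats(bucket);
console.log(result);
```

The totals are enough: the individual values never need to be read back to chart a mean with an error band.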

Page 30: Realtime Analytics with MongoDB - MongoDB Meetup NYC

Putting it together

{ url: 'www.google.com', location: 'SFO', connect: 23, first_byte: 123, last_byte: 245, timestamp: 1234 }

1. Atomically update the daily data
2. Atomically update the weekly data
3. Atomically update the monthly data
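Each incoming sample therefore triggers three upserts, one per granularity. The sketch below shows only how the three document keys might be derived from a sample's timestamp. The helper `bucketKeys` is hypothetical, and it assumes timestamps in seconds since the Unix epoch, UTC boundaries, and weeks that start on Monday; the slides do not specify any of these.

```javascript
// Hypothetical helper: derive the daily, weekly and monthly bucket keys for a sample.
// Assumes seconds since the Unix epoch, UTC, and weeks starting on Monday.
function bucketKeys(timestampSeconds) {
  const d = new Date(timestampSeconds * 1000);
  const day = new Date(Date.UTC(d.getUTCFullYear(), d.getUTCMonth(), d.getUTCDate()));
  const daysSinceMonday = (day.getUTCDay() + 6) % 7; // getUTCDay(): 0 = Sunday
  const week = new Date(day.getTime() - daysSinceMonday * 86400000);
  const month = new Date(Date.UTC(d.getUTCFullYear(), d.getUTCMonth(), 1));
  const iso = (date) => date.toISOString().slice(0, 10); // YYYY-MM-DD
  return { daily: iso(day), weekly: iso(week), monthly: iso(month) };
}

// 2010-09-22 15:30:00 UTC was a Wednesday (month index 8 is September).
const keys = bucketKeys(Date.UTC(2010, 8, 22, 15, 30, 0) / 1000);
console.log(keys);
```

Each key would then be used as the `day`, week or month field in the query part of the corresponding upsert.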

Page 31: Realtime Analytics with MongoDB - MongoDB Meetup NYC

Sharding our Data

[Diagram labels: Collection Server, Reporting Server, Shard 1 to Shard 4, URL 1 to URL 8]

Shard by URL

Write load evenly distributed

Most reads hit a single shard
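The property that matters here is that everything for one URL lives on one shard. The sketch below illustrates that with a simple string hash. This is only a stand-in: MongoDB's range-based sharding assigns ranges of the shard key to shards rather than hashing like this, and the `shardFor` helper and the example hostnames are invented for illustration.

```javascript
// Illustrative only: route each URL to one of N shards with a simple string hash.
// MongoDB's range-based sharding works on key ranges; this hash is a stand-in.
function shardFor(url, shardCount) {
  let hash = 0;
  for (let i = 0; i < url.length; i++) {
    hash = (hash * 31 + url.charCodeAt(i)) >>> 0; // keep as unsigned 32-bit
  }
  return hash % shardCount;
}

// Group some made-up URLs by the shard they would land on.
const urls = ["url1.example", "url2.example", "url3.example", "url4.example"];
const placement = new Map();
for (const url of urls) {
  const shard = shardFor(url, 4);
  if (!placement.has(shard)) placement.set(shard, []);
  placement.get(shard).push(url);
}
console.log(placement);
```

Because the shard depends only on the URL, a report for one URL reads from one shard, while writes for different URLs spread across the cluster.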

Page 32: Realtime Analytics with MongoDB - MongoDB Meetup NYC

3 Steps to Real Time Analytics

1. Collect data
2. Store Data
3. Display Reports

Page 33: Realtime Analytics with MongoDB - MongoDB Meetup NYC

Drawing connect time graph

// Compound index to make this query fast
db.metrics.dailies.ensureIndex({ url: 1, day: -1 })

db.metrics.dailies.find(
  // Data for google
  { url: 'www.google.com',
    // The range of dates for the chart
    // (JavaScript months are 0-based, so 9 is October)
    day: { '$gte': new Date(2010, 9, 1), '$lte': new Date(2010, 9, 30) } },
  // We just want connect time data. But we can include as many metrics as we want
  { 'connect': true }
);
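Once the daily documents come back, turning them into chart points is a single pass. The sketch below uses a hard-coded array as a stand-in for the query result; the helper `connectSeries`, the `day` strings and the numbers are made up for illustration.

```javascript
// Stand-in for the query result: one document per day, 'connect' only.
const dailyDocs = [
  { day: "2010-09-01", connect: { sum: 460, count: 20 } },
  { day: "2010-09-02", connect: { sum: 500, count: 20 } },
  { day: "2010-09-03", connect: { sum: 0, count: 0 } }, // a day with no samples
];

// Hypothetical helper: one chart point per daily document.
function connectSeries(docs) {
  return docs.map((doc) => ({
    day: doc.day,
    average: doc.connect.count > 0 ? doc.connect.sum / doc.connect.count : null,
  }));
}

const series = connectSeries(dailyDocs);
console.log(series);
```

A 30-day chart built this way performs 30 divisions, regardless of how many samples were collected each day.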

Page 34: Realtime Analytics with MongoDB - MongoDB Meetup NYC

More efficient charts

URL   Day   <data>

[Diagram labels: AVG (three times), Day 1, Day 2, Day 3, Result]

1 Document per URL per Day

30 days == 30 documents

Average chart hits 30 documents.

20x fewer

Page 35: Realtime Analytics with MongoDB - MongoDB Meetup NYC

Real Time Updates

URL   Most Recent Data

Single query to fetch all metric data for a URL

Fast enough that browser can poll constantly for updated data without impacting server

Page 36: Realtime Analytics with MongoDB - MongoDB Meetup NYC

Evaluation

• High write volume
  – Currently handling 1000’s of db writes per second on a single MongoDB server
  – Adding ~5GB per day
• Small Engineering Team
  – Core system built by 2 engineers in <1 month
• Agile
  – BDD using Rails
• Limited operations budget
  – Runs on a handful of EC2 instances
  – No major issues

Page 37: Realtime Analytics with MongoDB - MongoDB Meetup NYC

Final thoughts

• Love MongoDB. (It’s now my default when starting a new project)

• Using MongoMapper as ORM, but think there must be a better way, more in tune with the document model rather than a port of AR

• There’s magic in documents, but it requires thinking about your data in new ways.

Page 38: Realtime Analytics with MongoDB - MongoDB Meetup NYC

Q & A