open analytics | cameron sim



Page 1: Open analytics | Cameron Sim

Building a scalable analytics platform for personal financial planning

May 23, 2013 - Open Analytics

Cameron Sim - RoundArchIsobar (www.isobar.com)

Wednesday, May 22, 13

Page 2: Open analytics | Cameron Sim

Agenda

About LearnVest

Architecture

Data Capture

Packaging

Data Warehousing

Metrics

Finishing up


Page 3: Open analytics | Cameron Sim

LearnVest Inc. (www.learnvest.com)

Company
• Founded in 2008 by Alexa Von Tobel, CEO
• 50+ people and growing rapidly
• Based in NYC

Platforms
• Web & iPhone

Mission Statement
• “Aiming to make financial planning as accessible as having a gym membership”

Key Products
• Account Aggregation and Management (Bank, Credit, Loan, Investment, Mortgage)
• Original and Syndicated Newsletter Content
• Financial Planning (tiered product offering)

Stack

Operational
• WordPress, Backbone.js, Node.js
• Java Spring 3, Redis, Memcached, MongoDB, ActiveMQ, Nginx, MySQL 5.x

Analytics
• MongoDB 2.2.0, Hadoop, Pig, Java 6, Spring 3
• pyMongo, Django 1.4

Page 4: Open analytics | Cameron Sim

LearnVest.com

Web

Page 5: Open analytics | Cameron Sim

LearnVest.com

iPhone

Page 6: Open analytics | Cameron Sim

Conversion Funnels

Web, iOS, Tele-Sale (scheduled call)

Account Creation

Free Assessment

Paid Product

Page 7: Open analytics | Cameron Sim

Component Architecture

Analytics

Production

Page 8: Open analytics | Cameron Sim

High Level Architecture

Analytics: Services & Event Capture, Aggregation & Indexed Search, Tools & Dashboards

Production: Production Services

Diagram labels: Event Capture, Update User, Run Aggregations, Reports, Stats & Data Science


Page 13: Open analytics | Cameron Sim

Philosophy For Data Collection

Capture Everything
• User-driven events over web and mobile
• System-level exceptions
• Everything else

Temporary Data
• Be ‘ok’ with approximate data
• Operational databases are the system of record

Aggregate events as they come in
• Remove the overhead of basic metrics (counts, sums) on core events
• Group by user unique id and increment counts per event, over time dimensions (day, week-ending, month, year)
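The "aggregate events as they come in" idea can be illustrated with a minimal in-memory sketch. It is not the platform's code: the class and function names are hypothetical, counts live in a Python Counter rather than in MongoDB, and the choice of Sunday as the week-ending day is an assumption, since the slides do not say which day closes a week.

```python
from collections import Counter
from datetime import date, timedelta


def time_buckets(day):
    """Return one bucket key per time dimension for a calendar day."""
    # Assumption: a week ends on Sunday (weekday() == 6).
    week_ending = day + timedelta(days=6 - day.weekday())
    return {
        "day": day.isoformat(),
        "week_ending": week_ending.isoformat(),
        "month": day.strftime("%Y-%m"),
        "year": str(day.year),
    }


class EventCounters:
    """Running counts keyed by (user id, event code, dimension, bucket)."""

    def __init__(self):
        self.counts = Counter()

    def record(self, user_id, event_code, day):
        # One increment per time dimension, so basic metrics never need
        # a scan over the raw events.
        for dimension, bucket in time_buckets(day).items():
            self.counts[(user_id, event_code, dimension, bucket)] += 1
```

Recording two logins for the same user on consecutive days leaves a count of 1 in each day bucket and a count of 2 in the shared week-ending, month and year buckets.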


Page 14: Open analytics | Cameron Sim

Philosophy For Data Collection

Logical Separation

Events
• Core use cases (forms, conversion paths)
• UI Actions (button clicks, swipes, views, forms)
• HttpRequest level analysis (user-agent, iOS version upgrades etc.)

User
• Has a status/rating (Account Creation, Linked Bank Account, Paid Products)
• Source and Conversion Path (how was the user acquired)
• Quantified Actions (User completed x, y, z conversion actions when & how?)
• Social Interactions (Facebook, Twitter)
• Email Interactions (stats & emails for [email protected])

Page 15: Open analytics | Cameron Sim

Data Capture

iOS

- (void)sendAnalyticEventType:(NSString *)eventType
                       object:(NSString *)object
                         name:(NSString *)name
                         page:(NSString *)page
                       source:(NSString *)source
{
    // Top-level request parameters; eventData is nested inside them
    NSMutableDictionary *params = [NSMutableDictionary dictionary];
    NSMutableDictionary *eventData = [NSMutableDictionary dictionary];

    // Only set keys that have a value: NSMutableDictionary rejects nil objects
    if (eventType != nil) [params setObject:eventType forKey:@"eventType"];
    if (object != nil) [eventData setObject:object forKey:@"object"];
    if (name != nil) [eventData setObject:name forKey:@"name"];
    if (page != nil) [eventData setObject:page forKey:@"page"];
    if (source != nil) [eventData setObject:source forKey:@"source"];
    [params setObject:eventData forKey:@"eventData"];

    [[LVNetworkEngine sharedManager] analytics_send:params];
}

Page 16: Open analytics | Cameron Sim

Data Capture

Web (JavaScript)

function internalTrackPageView() {
    var cookie = {
        userContext: jQuery.cookie('UserContextCookie')
    };

    var trackEvent = {
        eventType: "pageView",
        eventData: {
            page: window.location.pathname + window.location.search
        }
    };

    // AJAX
    jQuery.ajax({
        url: "/api/track",
        type: "POST",
        dataType: "json",
        data: JSON.stringify(trackEvent),
        // Set request headers
        beforeSend: function (xhr, settings) {
            xhr.setRequestHeader('Accept', 'application/json');
            xhr.setRequestHeader('User-Context', cookie.userContext);
            if (settings.type === 'PUT' || settings.type === 'POST') {
                xhr.setRequestHeader('Content-Type', 'application/json');
            }
        }
    });
}

Page 17: Open analytics | Cameron Sim

Bus Event Packaging

1. Spring 3 RESTful service layer; controller methods define the eventCode via the @Tracking annotation

2. Custom interceptor class extends HandlerInterceptorAdapter and implements postHandle() (for each event) to invoke calls via Spring @Async to an EventPublisher

3. EventPublisher publishes to a common event bus queue with multiple subscribers, one of which packages the eventPayload Map<String, Object> object and forwards it to the Analytics REST Service
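The three steps above can be sketched in a few lines of Python. This is an analogy, not the Spring implementation: the decorator stands in for the @Tracking annotation plus the interceptor, the EventBus class stands in for the event bus queue, and publishing is synchronous here whereas the deck uses Spring @Async. All names (EventBus, tracking, user_login) are hypothetical.

```python
import uuid
from datetime import datetime, timezone


class EventBus:
    """A toy bus: every subscriber receives every packaged payload."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, handler):
        self.subscribers.append(handler)

    def publish(self, event_code, event_data, request_headers):
        # Package the event into one normalized payload
        payload = {
            "eventCode": event_code,
            "eventId": str(uuid.uuid4()),
            "eventTime": datetime.now(timezone.utc),
            "eventData": event_data,
            "request": dict(request_headers),
        }
        for handler in self.subscribers:
            handler(payload)


def tracking(bus, event_code):
    """Tag a controller method with an event code and publish after it runs."""

    def decorate(controller_method):
        def wrapper(event, request_headers):
            result = controller_method(event, request_headers)
            try:
                bus.publish(event_code, result, request_headers)
            except Exception as exc:
                # A tracking failure must never break the user-facing call
                print("Error tracking event %r: %s" % (event_code, exc))
            return result

        return wrapper

    return decorate


# Usage: one subscriber that simply collects payloads
bus = EventBus()
received = []
bus.subscribe(received.append)


@tracking(bus, "user.login")
def user_login(event, request_headers):
    return event


login_result = user_login({"eventType": "login"}, {"call-source": "WEB"})
```

In a real deployment the collecting subscriber would be replaced by one that forwards the payload to the analytics service over HTTP.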


Page 18: Open analytics | Cameron Sim

Bus Event Packaging

1) Spring RestController Methods

Interface

@RequestMapping(value = "/user/login", method = RequestMethod.POST, headers = "Accept=application/json")
public Map<String, Object> userLogin(@RequestBody Map<String, Object> event, HttpServletRequest request);

Concrete/Impl Class

@Override
@Tracking("user.login")
public Map<String, Object> userLogin(@RequestBody Map<String, Object> event, HttpServletRequest request) {
    // Implementation
    return event;
}

Page 19: Open analytics | Cameron Sim

Bus Event Packaging

2) Custom Interceptor class extends HandlerInterceptorAdapter

protected void handleTracking(String trackingCode, Map<String, Object> modelMap, HttpServletRequest request) {
    Map<String, Object> responseModel = new HashMap<String, Object>();

    // remove non-serializables & copy over data from modelMap

    try {
        this.eventPublisher.publish(trackingCode, responseModel, request);
    } catch (Exception e) {
        log.error("Error tracking event '" + trackingCode + "' : " + ExceptionUtils.getStackTrace(e));
    }
}

Page 20: Open analytics | Cameron Sim

Bus Event Packaging

3) EventPublisher

public void publish(String eventCode, Map<String, Object> eventData, HttpServletRequest request) {
    Map<String, Object> payload = new HashMap<String, Object>();
    String eventId = UUID.randomUUID().toString();
    Map<String, String> requestMap = HttpRequestUtils.getRequestHeaders(request);

    // Normalize message: copy each field from its own key
    payload.put("eventType", eventData.get("eventType"));
    payload.put("eventData", eventData.get("eventData"));
    payload.put("version", eventData.get("version"));
    payload.put("eventId", eventId);
    payload.put("eventTime", new Date());
    payload.put("request", requestMap);
    . . .
    // Send to the Analytics Service for MongoDB persistence
}

public void sendPost(EventPayload payload) {
    HttpEntity request = new HttpEntity(payload.getEventPayload(), headers);
    Map m = restTemplate.postForObject(endpoint, request, java.util.Map.class);
}

Page 21: Open analytics | Cameron Sim

Bus Event Packaging

The Serialized JSON (User Action)

{
    "eventCode" : "user.login",
    "eventType" : "login",
    "version" : "1.0",
    "eventTime" : "1358603157746",
    "eventData" : { "" : "", "" : "", "" : "" },
    "request" : {
        "call-source" : "WEB",
        "user-context" : "00002b4f1150249206ac2b692e48ddb3",
        "user.agent" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.101 Safari/537.11",
        "cookie" : "size=4; CP.mode=B; PHPSESSID=c087908516ee2fae50cef6500101dc89; resolution=1920; JSESSIONID=56EB165266A2C4AFF946F139669D746F; csrftoken=73bdcdddf151dc56b8020855b2cb10c8",
        "content-length" : "204",
        "accept-encoding" : "gzip,deflate,sdch"
    }
}

Page 22: Open analytics | Cameron Sim

Bus Event Packaging

The Serialized JSON (Generic Event)

{
    "eventCode" : "generic.ui",
    "eventType" : "pageView",
    "version" : "1.0",
    "eventTime" : "1358603157746",
    "eventData" : {
        "page" : "/learnvest/moneycenter/inbox",
        "section" : "transactions",
        "name" : "view transactions",
        "object" : "page"
    },
    "request" : {
        "call-source" : "WEB",
        "user-context" : "00002b4f1150249206ac2b692e48ddb3",
        "user.agent" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.101 Safari/537.11",
        "cookie" : "size=4; CP.mode=B; PHPSESSID=c087908516ee2fae50cef6500101dc89; resolution=1920; JSESSIONID=56EB165266A2C4AFF946F139669D746F; csrftoken=73bdcdddf151dc56b8020855b2cb10c8",
        "content-length" : "204",
        "accept-encoding" : "gzip,deflate,sdch"
    }
}


Page 24: Open analytics | Cameron Sim

Event Data Warehousing

MongoDB Information
• v2.2.0
• 3-node replica set
• 1 Large (primary), 2x Medium (secondary) AWS Amazon Linux machines
• Each with a single 500GB EBS volume mounted to /opt/data

MongoDB Config File
dbpath = /opt/data/mongodb/data
rest = true
replSet = voyager

Volumes
• ~1M events daily on web, ~600K on mobile
• 2-3 GB per day at start, slowed to ~1GB per day
• Currently at 78GB (collecting since August 2012)

Future Scaling Strategy
• Set up a 2nd replica set in a new AWS region
• Not intending to shard; data is archived after 12 months instead

Page 25: Open analytics | Cameron Sim

Event Data Warehousing

Approach

1. Persist all events, bucketed by source: WEB, MOBILE

2. Persist all events, bucketed by source, event code and time: WEB/MOBILE, user.login, time (day, week-ending, month, year)

3. Insert into collection e_web / e_mobile

4. Also insert into daily, weekly and monthly collections for the main payload and the HTTP request payload

• e_web_05232013
• e_web_request_05232013

5. Predictable model for scaling and measuring business growth
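As a rough illustration of the bucketing scheme, the helper below derives the collection names for one event. The MMDDYYYY suffix is inferred from the e_web_05232013 example; the function name and the dictionary keys are hypothetical, and the weekly and monthly names are left out because the slides do not show their format.

```python
from datetime import datetime


def target_collections(call_source, event_time):
    """Return the collections an event from this source would be written to."""
    prefix = "e_" + call_source.lower()        # e_web or e_mobile
    suffix = event_time.strftime("%m%d%Y")     # e.g. 05232013
    return {
        "all": prefix,
        "daily": "%s_%s" % (prefix, suffix),
        "daily_request": "%s_request_%s" % (prefix, suffix),
    }
```

For a WEB event on 23 May 2013 this yields e_web, e_web_05232013 and e_web_request_05232013, matching the names on the slide.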


Page 26: Open analytics | Cameron Sim

Event Data Warehousing

Persist all events

> db.e_web.findOne()
{
    "_id" : ObjectId("50e4a1ab0364f55ed07c2662"),
    "created_datetime" : ISODate("2013-01-02T21:07:55.656Z"),
    "created_date" : ISODate("2013-01-02T00:00:00.000Z"),
    "request" : {
        "content-type" : "application/json",
        "connection" : "keep-alive",
        "accept-language" : "en-US,en;q=0.8",
        "host" : "localhost:8080",
        "call-source" : "WEB",
        "accept" : "*/*",
        "user-context" : "c4ca4238a0b923820dcc509a6f75849b",
        "origin" : "chrome-extension://fdmmgilgnpjigdojojpjoooidkmcomcm",
        "user-agent" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.101 Safari/537.11",
        "accept-charset" : "ISO-8859-1,utf-8;q=0.7,*;q=0.3",
        "cookie" : "size=4; CP.mode=B; PHPSESSID=c087908516ee2fae50cef6500101dc89; resolution=1920; JSESSIONID=56EB165266A2C4AFF946F139669D746F; csrftoken=73bdcdddf151dc56b8020855b2cb10c8",
        "content-length" : "255",
        "accept-encoding" : "gzip,deflate,sdch"
    },
    "eventType" : "flick",
    "eventData" : {
        "object" : "button",
        "name" : "split transaction button",
        "page" : "#inbox/79876/",
        "section" :

Page 27: Open analytics | Cameron Sim

Event Data Warehousing

Access Pattern

• No reads off the primary node; it is insert only

• Indexes on core collections (e_web and e_mobile) come in under 3GB, which fits in memory on the 7.5GB Large instance and the 3.75GB Medium instances

• Split datetime into two fields, and build compound indexes on the date with other fields such as eventType and user unique id (user-context)
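A minimal sketch of the split-datetime idea follows. The field names created_datetime and created_date and the two index key lists are taken from the sample document and index listing in this deck; the helper name is hypothetical. The key lists are in the (field, direction) form that pyMongo's create_index accepts, but no database call is made here.

```python
from datetime import datetime


def with_split_dates(event, now):
    """Copy an event and stamp it with a full timestamp and a date-only field."""
    doc = dict(event)
    doc["created_datetime"] = now
    # Midnight of the same day: every event on that day shares this value,
    # so equality matches and range scans on the date stay cheap.
    doc["created_date"] = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return doc


# Compound index keys: a selective field first, then the date
COMPOUND_INDEXES = [
    [("request.user-context", 1), ("created_date", 1)],
    [("eventData.name", 1), ("created_date", 1)],
]
```

With the date truncated to midnight, "all events for this user on this day" becomes an exact match on both index fields rather than a range over timestamps.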


Page 28: Open analytics | Cameron Sim

Event Data Warehousing

Indexing Strategy

> db.e_web.getIndexes()
[
    {
        "v" : 1,
        "key" : { "request.user-context" : 1, "created_date" : 1 },
        "ns" : "moneycenter.e_web",
        "name" : "request.user-context_1_created_date_1"
    },
    {
        "v" : 1,
        "key" : { "eventData.name" : 1, "created_date" : 1 },
        "ns" : "moneycenter.e_web",
        "name" : "eventData.name_1_created_date_1"
    }
]

Page 29: Open analytics | Cameron Sim

User Data Warehousing

Elasticsearch (http://www.elasticsearch.org/)

• Open-source Lucene cluster
• Mature query language, accessed via REST API
• Unstructured schema and feature rich
• Strong API support

Configuration

• Single instance for user data
• Deployed over 3 EC2 Medium AML instances
• Updated by a Java process checking a Redis cache for uuids
• Accessed by multiple applications for canonical user objects

Page 30: Open analytics | Cameron Sim

User Data Warehousing

Building the User Object

For each user id in the Redis cache, retrieve information from the following sources:
• ODS Slave (LearnVest data)
• Jotform.com (eform submissions)
• FullSlate.com (calendar appointments)
• Stripe.com (payments)
• Desk.com (emails)

Build a canonical JSON object and save it in the Elasticsearch cluster

Map<String, String> source = new HashMap<String, String>();
source.put(...);

client.execute(new Index.Builder(source).index("Users").build());
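The merge step can be sketched as follows, under the assumption that each external system is wrapped in a function taking a user id and returning a dictionary. The function name, the fetcher keys and the sample data are all hypothetical; the real process runs in Java and writes to Elasticsearch.

```python
import json


def build_canonical_user(user_id, fetchers):
    """Merge one record per source system into a single user document."""
    user = {"user_id": user_id}
    for source_name, fetch in fetchers.items():
        # Each source keeps its own namespace so fields cannot collide
        user[source_name] = fetch(user_id)
    return user


# Stand-in fetchers; real ones would call the ODS, Stripe, Desk and so on
example_fetchers = {
    "ods": lambda uid: {"status": "Linked Bank Account"},
    "stripe": lambda uid: {"payments": 2},
}

canonical_user = build_canonical_user("c4ca4238", example_fetchers)
canonical_json = json.dumps(canonical_user, sort_keys=True)
```

Keeping each source under its own key makes it possible to refresh one source without rebuilding the whole document.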


Page 31: Open analytics | Cameron Sim

Metrics

Objective
• Show historic and intraday stats on core use cases (logins, conversions)
• Show user funnel rates on conversion pages
• Show general usability - how do users really use the Web and iOS platforms?

Non-Functionals
• Intraday doesn’t need to be “real-time”; polling is good enough for now
• Overnight batch job for historic data must scale horizontally

General Implementation Strategy
• Do all heavy lifting & object manipulation in the service; the UI should just display a graph or table
• Modularize the service to be able to regenerate any graph or table without a full load

Page 32: Open analytics | Cameron Sim

Metrics

Java Batch Service

Java Mongo library to query key collections and return user counts and sum of events

DBCursor webUserLogins = c.find(new BasicDBObject("date", sdf.format(new Date())));

private HashMap<String, Object> getSumAndCount(DBCursor cursor) {
    HashMap<String, Object> m = new HashMap<String, Object>();
    int sum = 0;
    int count = 0;
    while (cursor.hasNext()) {
        DBObject obj = cursor.next();
        count++;
        sum += (Integer) obj.get("count");
    }
    m.put("sum", sum);
    m.put("count", count);
    // Guard against an empty cursor, and format with a number formatter
    // (sdf above is a date formatter); the "#.##" pattern is an assumption
    double average = count == 0 ? 0.0 : (double) sum / count;
    m.put("average", new DecimalFormat("#.##").format(average));
    return m;
}

Page 33: Open analytics | Cameron Sim

Metrics

Java Batch Service

Use Aggregation Framework where required on core collections (e_web) and external data

// create aggregation objects
DBObject project = new BasicDBObject("$project", new BasicDBObject("day_value", fields));
DBObject day_value = new BasicDBObject("day_value", "$day_value");
DBObject groupFields = new BasicDBObject("_id", day_value);

// count the documents in each group under the key "number"
groupFields.put("number", new BasicDBObject("$sum", 1));

// create the group
DBObject group = new BasicDBObject("$group", groupFields);

// execute
AggregationOutput output = mycollection.aggregate(project, group);
for (DBObject obj : output.results()) {
    ..
}

Page 34: Open analytics | Cameron Sim

Metrics

Java Batch Service

MongoDB command line example of aggregation over a time period, e.g. month

> db.e_web.aggregate([
    { $match : { created_date : { $gt : ISODate("2012-10-25T00:00:00") } } },
    { $project : { day_value : { "day" : { $dayOfMonth : "$created_date" }, "month" : { $month : "$created_date" } } } },
    { $group : { _id : { day_value : "$day_value" }, number : { $sum : 1 } } },
    { $sort : { "_id.day_value" : -1 } }
])
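To make the four pipeline stages concrete, here is a plain-Python equivalent that runs over a list of dictionaries instead of a MongoDB collection. It is only an illustration of what the match, project, group and sort stages compute; the function name is hypothetical, and the group key is a (month, day) tuple so that the descending sort is chronological within a year.

```python
from collections import Counter
from datetime import datetime


def daily_event_counts(events, since):
    """Count events per (month, day) after `since`, newest first."""
    counts = Counter(
        # $project + $group: reduce each event to its month and day, then count
        (e["created_date"].month, e["created_date"].day)
        for e in events
        if e["created_date"] > since  # $match
    )
    # $sort: descending on the grouped key
    return sorted(counts.items(), reverse=True)
```

The result is a list of ((month, day), number) pairs, one per day that has at least one event.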


Page 35: Open analytics | Cameron Sim

Metrics

Java Batch Service

Persisting events into graph and table collections

> db.homeGraphs.find()

{
    "_id" : ObjectId("50f57b5c1d4e714b581674e2"),
    "accounts_natural" : 54,
    "accounts_total" : 54,
    "date" : ISODate("2011-02-06T05:00:00Z"),
    "linked_rate" : "12.96",
    "premium_rate" : "0",
    "str_date" : "2011,01,06",
    "upgrade_rate" : "0",
    "users_avg_linked" : "3.43",
    "users_linked" : 7
}

{
    "_id" : ObjectId("50f57b5c1d4e714b581674e3"),
    "accounts_natural" : 144,
    "accounts_total" : 144,
    "date" : ISODate("2011-02-07T05:00:00Z"),
    "linked_rate" : "11.11",
    "premium_rate" : "0",
    "str_date" : "2011,01,07",
    "upgrade_rate" : "0",
    "users_avg_linked" : "4",
    "users_linked" : 16
}

{
    "_id" : ObjectId("50f57b5c1d4e714b581674e4"),
    "accounts_natural" : 119,
    "accounts_total" : 119,
    "date" : ISODate("2011-02-08T05:00:00Z"),
    "linked_rate" :
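The sample records are consistent with linked_rate being users_linked as a percentage of accounts_total (7 of 54 is 12.96%, 16 of 144 is 11.11%). That relationship is inferred from the numbers, not stated in the deck, so the sketch below is a guess at the calculation; the function name is hypothetical.

```python
def linked_rate(users_linked, accounts_total):
    """Percentage of accounts with a linked user, as a short string."""
    if accounts_total == 0:
        return "0"
    rate = 100.0 * users_linked / accounts_total
    # Two decimals at most, with trailing zeros dropped ("4.00" becomes "4")
    return ("%.2f" % rate).rstrip("0").rstrip(".")
```

Storing the rate as a preformatted string matches the sample output, and means the dashboard can print it without further processing.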


Page 36: Open analytics | Cameron Sim

Metrics

Django and HighCharts

Extract data (pyMongo)

def getHomeChart(dt_from, dt_to):
    """Called by home method to get latest 30 day numbers"""
    try:
        conn = pymongo.MongoClient('localhost', 27017)
        db = conn['lvanalytics']

        cursor = db.accountmetrics.find(
            {"date": {"$gte": dt_from, "$lte": dt_to}}).sort("date")
        return buildMetricsDict(cursor)

    except Exception as e:
        logger.error(e)

Return the graph object (as a list or a dict of lists) to the view that called the method

# request the latest 30 days
dt_to = datetime.utcnow()
dt_from = dt_to - timedelta(days=30)

pagedata = {}
pagedata['accountsGraph'] = mongodb_home.getHomeChart(dt_from, dt_to)

return render_to_response('home.html', {'pagedata': pagedata}, context_instance=RequestContext(request))

Page 37: Open analytics | Cameron Sim

Metrics

Django and HighCharts

Populate the series (JavaScript with Django templating)

seriesOptions[0] = {
    id: 'naturalAccounts',
    name: "Natural Accounts",
    data: [
        {% for a in pagedata.metrics.accounts_natural %}
            {% if not forloop.first %}, {% endif %}
            [Date.UTC({{a.0}}),{{a.1}}]
        {% endfor %}
    ],
    tooltip: {
        valueDecimals: 2
    }
};
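The template expects each item to carry the Date.UTC arguments in position 0 and the value in position 1. A sketch of how such pairs might be prepared on the Python side is shown below; the function name is hypothetical, and the "year,month,day" string mirrors the str_date field stored by the batch service. JavaScript's Date.UTC counts months from 0, which is why February is written as 01.

```python
from datetime import datetime


def to_highcharts_point(day, value):
    """Return (Date.UTC argument string, value) for one data point."""
    # Date.UTC takes a zero-based month, so subtract 1 from Python's month
    utc_args = "%d,%02d,%02d" % (day.year, day.month - 1, day.day)
    return (utc_args, value)
```

For 6 February 2011 with 54 natural accounts the template would then render [Date.UTC(2011,01,06),54].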


Page 38: Open analytics | Cameron Sim

Metrics

Django and HighCharts

And Create the Charts and Tables...


Page 40: Open analytics | Cameron Sim

Data Science Tools

IPython Notebook
• Deployed on an EC2 Large AML Medium Instance
• Configured for Python 2.7.3
• Loaded with Matplotlib, PyLab, SciPy, NumPy, pyMongo, MySQL-python

Insights
• Write wrapper methods to access user data
• Accessible to anyone through a browser
• Very effective way to scale quickly with little overhead

Applications
• Decision tree analysis over website and iOS - showed common paths
• Session level analysis on iOS devices
• Multi-page form conversion retention rates
• Quickly conduct segment analysis via a programming API

Page 41: Open analytics | Cameron Sim

Data Science Tools

Pig
• Executed using Ruby scripts
• Pulled data from MongoDB
• Forwarded to AWS EMR cluster for analysis
• MR functions written in Python and occasionally Java

Insights
• Used for ad hoc analysis involving large datasets

Applications
• Daily, weekly, monthly conversion metrics on page views and forms
• Identified trends in spending over 1M rows
• Used lightly at LearnVest, growing in capability

Page 42: Open analytics | Cameron Sim

Things that didn’t work

MongoDB Upserts
• Quickly becomes read-heavy and slows down the db

MongoDB Aggregation Framework
• Fine for ad hoc analysis, but you might be better off establishing a repeatable framework to run MapReduce algorithms

Django-nonrel
• Unstable; use Django and configure MongoDB as a datastore only

Page 43: Open analytics | Cameron Sim

Lessons Learned• Date Time managed as two fields, Datetime and Date

• Real-time Map-Reduce in pyMongo - too slow, don’t do this.

• Memcached on Django is good enough (at the moment) - use django-celery with rabbitmq to pre-cache all data after data loading

• HighCharts is buggy - considering D3 & other libraries

• Django doesn’t need to read directly from MongoDB; consider providing all data via a service layer (at the expense of the ever-growing feature set in pyMongo)

• Make better use of EMR upfront if resources are limited and data is vast.


Page 44: Open analytics | Cameron Sim

Thanks!...Questions?
