open analytics | cameron sim
Building a scalable analytics platform for personal financial planning
May 23, 2013 - Open Analytics
Cameron Sim - RoundArchIsobar (www.isobar.com)
Wednesday, May 22, 13
Agenda
About LearnVest
Architecture
Data Capture
Packaging
Data Warehousing
Metrics
Finishing up
LearnVest Inc. (www.learnvest.com)

Company
• Founded in 2008 by Alexa Von Tobel, CEO
• 50+ people and growing rapidly
• Based in NYC

Platforms
• Web & iPhone

Mission Statement
“Aiming to make financial planning as accessible as having a gym membership”

Key Products
• Account Aggregation and Management (Bank, Credit, Loan, Investment, Mortgage)
• Original and Syndicated Newsletter Content
• Financial Planning (tiered product offering)

Stack
• Operational: Wordpress, Backbone.js, Node.js, Java Spring 3, Redis, Memcached, MongoDB, ActiveMQ, Nginx, MySQL 5.x
• Analytics: MongoDB 2.2.0, Hadoop, Pig, Java 6, Spring 3, pyMongo, Django 1.4

LearnVest.com: Web
(screenshot)

LearnVest.com: iPhone
(screenshot)

Conversion Funnels
Web, iOS, Tele-Sale (scheduled call)
• Account Creation
• Free Assessment
• Paid Product

Component Architecture
(diagram: Analytics and Production components)

High Level Architecture
(diagram)
• Analytics: Services & Event Capture; Aggregation & Indexed Search; Tools & Dashboards
• Production: Production Services
• Stage labels: Event Capture; Update User; Run Aggregations; Reports, Stats & Data Science

Philosophy For Data Collection

Capture Everything
• User-driven events over web and mobile
• System-level exceptions
• Everything else

Temporary Data
• Be ‘ok’ with approximate data
• Operational databases are the system of record

Aggregate events as they come in
• Remove the overhead of basic metrics (counts, sums) on core events
• Group by user unique id and increment counts per event, over time dimensions (day, week-ending, month, year)

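The "aggregate events as they come in" idea can be sketched in plain Python. This is only an illustration, not LearnVest's implementation: the function names, the in-memory Counter (standing in for database counters) and the choice of Sunday as the week-ending day are all assumptions.

```python
from collections import Counter
from datetime import date, timedelta


def time_buckets(day: date) -> dict:
    """Return one bucket key per time dimension for a calendar day."""
    # Assumption: a week ends on Sunday (weekday() == 6).
    week_ending = day + timedelta(days=6 - day.weekday())
    return {
        "day": day.isoformat(),
        "week_ending": week_ending.isoformat(),
        "month": day.strftime("%Y-%m"),
        "year": str(day.year),
    }


def record_event(counters: Counter, user_id: str, event_code: str, day: date) -> None:
    """Increment one counter per time dimension for this user and event."""
    for dimension, bucket in time_buckets(day).items():
        counters[(user_id, event_code, dimension, bucket)] += 1


counters = Counter()
record_event(counters, "u1", "user.login", date(2013, 5, 22))
record_event(counters, "u1", "user.login", date(2013, 5, 23))
```

With counters kept this way, a basic metric such as "logins this week for user u1" is a single lookup rather than a scan over raw events.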
Philosophy For Data Collection: Logical Separation

Events
• Core use cases (forms, conversion paths)
• UI actions (button clicks, swipes, views, forms)
• HttpRequest-level analysis (user-agent, iOS version upgrades, etc.)

User
• Has a status/rating (Account Creation, Linked Bank Account, Paid Products)
• Source and conversion path (how was the user acquired?)
• Quantified actions (user completed x, y, z conversion actions; when and how?)
• Social interactions (Facebook, Twitter)
• Email interactions (stats & emails for [email protected])

Data Capture: iOS

- (void)sendAnalyticEventType:(NSString *)eventType
                       object:(NSString *)object
                         name:(NSString *)name
                         page:(NSString *)page
                       source:(NSString *)source
{
    // params is the request body; eventData holds the event-specific fields
    NSMutableDictionary *params = [NSMutableDictionary dictionary];
    NSMutableDictionary *eventData = [NSMutableDictionary dictionary];

    // setObject:forKey: raises on nil, so guard each optional value
    if (eventType != nil) [params setObject:eventType forKey:@"eventType"];
    if (object != nil) [eventData setObject:object forKey:@"object"];
    if (name != nil) [eventData setObject:name forKey:@"name"];
    if (page != nil) [eventData setObject:page forKey:@"page"];
    if (source != nil) [eventData setObject:source forKey:@"source"];
    [params setObject:eventData forKey:@"eventData"];

    [[LVNetworkEngine sharedManager] analytics_send:params];
}

Data Capture: Web (JavaScript)

function internalTrackPageView() {
    var cookie = {
        userContext: jQuery.cookie('UserContextCookie')
    };

    var trackEvent = {
        eventType: "pageView",
        eventData: {
            page: window.location.pathname + window.location.search
        }
    };

    // AJAX
    jQuery.ajax({
        url: "/api/track",
        type: "POST",
        dataType: "json",
        data: JSON.stringify(trackEvent),
        // Set request headers
        beforeSend: function (xhr, settings) {
            xhr.setRequestHeader('Accept', 'application/json');
            xhr.setRequestHeader('User-Context', cookie.userContext);
            if (settings.type === 'PUT' || settings.type === 'POST') {
                xhr.setRequestHeader('Content-Type', 'application/json');
            }
        }
    });
}

Bus Event Packaging

1. Spring 3 RESTful service layer; controller methods define the eventCode via the @Tracking annotation
2. Custom interceptor class extends HandlerInterceptorAdapter and implements postHandle() (for each event) to invoke calls via Spring @Async to an EventPublisher
3. EventPublisher publishes to a common event bus queue with multiple subscribers, one of which packages the eventPayload Map<String, Object> object and forwards it to the Analytics REST service

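The three steps above amount to a publish/subscribe flow. The sketch below models that flow in Python with an in-process bus, purely to show the shape of the hand-off. The class and function names are hypothetical, the list `forwarded` stands in for the POST to the analytics service, and the real system used Spring with a message queue rather than anything like this.

```python
import uuid
from datetime import datetime, timezone


class EventBus:
    """Minimal in-process stand-in for the shared event bus queue."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, handler):
        self._subscribers.append(handler)

    def publish(self, event_code, event_data, request_headers):
        # Every subscriber receives every published event.
        for handler in self._subscribers:
            handler(event_code, event_data, request_headers)


forwarded = []  # stands in for the HTTP call to the analytics service


def analytics_subscriber(event_code, event_data, request_headers):
    """Package the event into one payload and forward it."""
    payload = {
        "eventCode": event_code,
        "eventId": str(uuid.uuid4()),
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "eventData": dict(event_data),
        "request": dict(request_headers),
    }
    forwarded.append(payload)


bus = EventBus()
bus.subscribe(analytics_subscriber)
bus.publish("user.login", {"page": "/login"}, {"call-source": "WEB"})
```

Because the controller only publishes, other subscribers can be added to the bus without touching the request-handling code.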
Bus Event Packaging: 1) Spring RestController Methods

Interface

@RequestMapping(value = "/user/login", method = RequestMethod.POST, headers = "Accept=application/json")
public Map<String, Object> userLogin(@RequestBody Map<String, Object> event, HttpServletRequest request);

Concrete/Impl Class

@Override
@Tracking("user.login")
public Map<String, Object> userLogin(@RequestBody Map<String, Object> event, HttpServletRequest request) {
    // Implementation
    return event;
}

Bus Event Packaging: 2) Custom Interceptor class extends HandlerInterceptorAdapter

protected void handleTracking(String trackingCode, Map<String, Object> modelMap, HttpServletRequest request) {
    Map<String, Object> responseModel = new HashMap<String, Object>();
    // remove non-serializables & copy over data from modelMap
    try {
        this.eventPublisher.publish(trackingCode, responseModel, request);
    } catch (Exception e) {
        log.error("Error tracking event '" + trackingCode + "' : " + ExceptionUtils.getStackTrace(e));
    }
}

Bus Event Packaging: 3) EventPublisher

public void publish(String eventCode, Map<String, Object> eventData, HttpServletRequest request) {
    Map<String, Object> payload = new HashMap<String, Object>();
    String eventId = UUID.randomUUID().toString();
    Map<String, String> requestMap = HttpRequestUtils.getRequestHeaders(request);

    // Normalize message
    payload.put("eventType", eventData.get("eventType"));
    payload.put("eventData", eventData.get("eventData"));
    payload.put("version", eventData.get("version"));
    payload.put("eventId", eventId);
    payload.put("eventTime", new Date());
    payload.put("request", requestMap);
    . . .
    // Send to the Analytics Service for MongoDB persistence
}

public void sendPost(EventPayload payload) {
    HttpEntity request = new HttpEntity(payload.getEventPayload(), headers);
    Map m = restTemplate.postForObject(endpoint, request, java.util.Map.class);
}

Bus Event Packaging: The Serialized JSON (User Action)

{
  "eventCode" : "user.login",
  "eventType" : "login",
  "version" : "1.0",
  "eventTime" : "1358603157746",
  "eventData" : { "" : "", "" : "", "" : "" },
  "request" : {
    "call-source" : "WEB",
    "user-context" : "00002b4f1150249206ac2b692e48ddb3",
    "user.agent" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.101 Safari/537.11",
    "cookie" : "size=4; CP.mode=B; PHPSESSID=c087908516ee2fae50cef6500101dc89; resolution=1920; JSESSIONID=56EB165266A2C4AFF946F139669D746F; csrftoken=73bdcdddf151dc56b8020855b2cb10c8",
    "content-length" : "204",
    "accept-encoding" : "gzip,deflate,sdch"
  }
}

Bus Event Packaging: The Serialized JSON (Generic Event)

{
  "eventCode" : "generic.ui",
  "eventType" : "pageView",
  "version" : "1.0",
  "eventTime" : "1358603157746",
  "eventData" : {
    "page" : "/learnvest/moneycenter/inbox",
    "section" : "transactions",
    "name" : "view transactions",
    "object" : "page"
  },
  "request" : {
    "call-source" : "WEB",
    "user-context" : "00002b4f1150249206ac2b692e48ddb3",
    "user.agent" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.101 Safari/537.11",
    "cookie" : "size=4; CP.mode=B; PHPSESSID=c087908516ee2fae50cef6500101dc89; resolution=1920; JSESSIONID=56EB165266A2C4AFF946F139669D746F; csrftoken=73bdcdddf151dc56b8020855b2cb10c8",
    "content-length" : "204",
    "accept-encoding" : "gzip,deflate,sdch"
  }
}

Event Data Warehousing

MongoDB Information
• v2.2.0
• 3-node replica set
• 1 Large (primary), 2x Medium (secondary) AWS Amazon Linux machines
• Each with a single 500GB EBS volume mounted to /opt/data

MongoDB Config File
dbpath = /opt/data/mongodb/data
rest = true
replSet = voyager

Volumes
• ~1M events daily on web, ~600K on mobile
• 2-3 GB per day at start, slowed to ~1GB per day
• Currently at 78GB (collecting since August 2012)

Future Scaling Strategy
• Set up a 2nd replica set in a new AWS region
• Not intending to shard; data older than 12 months is archived instead

Event Data Warehousing: Approach

1. Persist all events, bucketed by source: WEB, MOBILE
2. Persist all events, bucketed by source, event code and time: WEB/MOBILE, user.login, time (day, week-ending, month, year)
3. Insert into collection e_web / e_mobile
4. Also insert into daily, weekly and monthly collections for the main payload and the HTTP request payload
   • e_web_05232013
   • e_web_request_05232013
5. Predictable model for scaling and measuring business growth

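The daily collection names in step 4 appear to follow a source-plus-date pattern. A minimal Python sketch of deriving them follows, assuming the suffix is the date as MMDDYYYY (inferred from e_web_05232013). The helper name is hypothetical, and the naming of the weekly and monthly collections is not shown in the slides, so it is not guessed here.

```python
from datetime import date


def daily_collection_names(source: str, day: date) -> tuple:
    """Return (payload collection, request collection) names for one day."""
    # Assumption: suffix is MMDDYYYY, as in e_web_05232013.
    suffix = day.strftime("%m%d%Y")
    base = "e_" + source.lower()
    return (base + "_" + suffix, base + "_request_" + suffix)


names = daily_collection_names("WEB", date(2013, 5, 23))
```

Time-bucketed collections like these can be dropped or archived whole, which fits the stated plan to archive rather than shard.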
Event Data Warehousing: Persist all events

> db.e_web.findOne()
{
  "_id" : ObjectId("50e4a1ab0364f55ed07c2662"),
  "created_datetime" : ISODate("2013-01-02T21:07:55.656Z"),
  "created_date" : ISODate("2013-01-02T00:00:00.000Z"),
  "request" : {
    "content-type" : "application/json",
    "connection" : "keep-alive",
    "accept-language" : "en-US,en;q=0.8",
    "host" : "localhost:8080",
    "call-source" : "WEB",
    "accept" : "*/*",
    "user-context" : "c4ca4238a0b923820dcc509a6f75849b",
    "origin" : "chrome-extension://fdmmgilgnpjigdojojpjoooidkmcomcm",
    "user-agent" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.101 Safari/537.11",
    "accept-charset" : "ISO-8859-1,utf-8;q=0.7,*;q=0.3",
    "cookie" : "size=4; CP.mode=B; PHPSESSID=c087908516ee2fae50cef6500101dc89; resolution=1920; JSESSIONID=56EB165266A2C4AFF946F139669D746F; csrftoken=73bdcdddf151dc56b8020855b2cb10c8",
    "content-length" : "255",
    "accept-encoding" : "gzip,deflate,sdch"
  },
  "eventType" : "flick",
  "eventData" : {
    "object" : "button",
    "name" : "split transaction button",
    "page" : "#inbox/79876/",
    "section" :

Event Data Warehousing: Access Pattern

• No reads off the primary node; it is insert-only
• Indexes on core collections (e_web and e_mobile) come in under 3GB, against 7.5GB of memory on the Large instance and 3.75GB on the Medium instances
• Split datetime into two fields, and build compound indexes on the date with other fields such as eventType and user unique id (user-context)

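One way to read the two-field datetime approach: store the full timestamp alongside a copy truncated to midnight, so that a query for a whole day becomes an equality match on the date field. A minimal Python sketch, assuming the field names created_datetime and created_date seen in the stored e_web document; the helper name is hypothetical.

```python
from datetime import datetime


def with_date_fields(event: dict, now: datetime) -> dict:
    """Return a copy of the event with both timestamp fields set."""
    doc = dict(event)
    doc["created_datetime"] = now
    # Truncate to midnight so all events from one day share one value.
    doc["created_date"] = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return doc


doc = with_date_fields({"eventType": "flick"}, datetime(2013, 1, 2, 21, 7, 55, 656000))
```

A compound index on a field such as user-context plus created_date can then answer "this user's events on this day" without a range scan over timestamps.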
Event Data Warehousing: Indexing Strategy

> db.e_web.getIndexes()
[
  {
    "v" : 1,
    "key" : { "request.user-context" : 1, "created_date" : 1 },
    "ns" : "moneycenter.e_web",
    "name" : "request.user-context_1_created_date_1"
  },
  {
    "v" : 1,
    "key" : { "eventData.name" : 1, "created_date" : 1 },
    "ns" : "moneycenter.e_web",
    "name" : "eventData.name_1_created_date_1"
  }
]

User Data Warehousing: Elasticsearch (http://www.elasticsearch.org/)

• Open-source search cluster built on Lucene
• Mature query language, accessed via a REST API
• Schema-free and feature rich
• Strong API support

Configuration
• Single instance for user
• Deployed over 3 EC2 Medium AML instances
• Updated by a Java process checking a Redis cache for UUIDs
• Accessed by multiple applications for canonical user objects

User Data Warehousing: Building the User Object

For each user id in the Redis cache, retrieve information from the following sources:
• ODS Slave (LearnVest data)
• Jotform.com (eform submissions)
• FullSlate.com (calendar appointments)
• Stripe.com (payments)
• Desk.com (emails)

Build a canonical JSON object and save it in the Elasticsearch cluster

Map<String, String> source = new HashMap<String, String>();
source.put(...);
client.execute(new Index.Builder(source).index("Users").build());

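The merge step can be pictured as collecting one record per source under a single document. The Python sketch below is an assumption about the shape of that step, not the actual Java process: the function name, the source keys and the sample records are all hypothetical, and the lambdas stand in for calls to the real systems.

```python
def build_user_object(user_id, sources):
    """Merge per-source records into one document keyed by source name."""
    user = {"user_id": user_id}
    for name, fetch in sources.items():
        # Each fetcher returns whatever that system holds for the user.
        user[name] = fetch(user_id)
    return user


# Hypothetical fetchers with made-up sample records.
sources = {
    "learnvest": lambda uid: {"status": "Account Creation"},
    "payments": lambda uid: [{"amount_cents": 1900}],
    "appointments": lambda uid: [],
}
user = build_user_object("00002b4f", sources)
```

Keeping each source under its own key means one source can be refreshed without rebuilding the rest of the object.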
Metrics

Objective
• Show historic and intraday stats on core use cases (logins, conversions)
• Show user funnel rates on conversion pages
• Show general usability: how do users really use the Web and iOS platforms?

Non-Functionals
• Intraday doesn’t need to be “real-time”; polling is good enough for now
• Overnight batch job for historic data must scale horizontally

General Implementation Strategy
• Do all heavy lifting and object manipulation in the service; the UI should just display a graph or table
• Modularize the service so that any graph or table can be regenerated without a full load

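As an illustration of the "funnel rates" objective, the sketch below computes step-to-step conversion between ordered funnel stages. The stage names come from the conversion funnel slide; the user counts are invented for the example and the function name is hypothetical.

```python
def funnel_rates(stage_counts):
    """Given ordered (stage, users) pairs, return each stage's rate
    relative to the stage before it."""
    rates = []
    for (_, prev_users), (stage, users) in zip(stage_counts, stage_counts[1:]):
        rate = users / prev_users if prev_users else 0.0
        rates.append((stage, round(rate, 4)))
    return rates


# Illustrative counts only.
rates = funnel_rates([
    ("Account Creation", 200),
    ("Free Assessment", 50),
    ("Paid Product", 10),
])
```

Computing the rates in the batch service and storing them keeps the UI to a plain lookup, in line with the implementation strategy above.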
Metrics: Java Batch Service

Java Mongo library to query key collections and return user counts and sum of events

DBCursor webUserLogins = c.find(new BasicDBObject("date", sdf.format(new Date())));

private HashMap<String, Object> getSumAndCount(DBCursor cursor) {
    HashMap<String, Object> m = new HashMap<String, Object>();
    int sum = 0;
    int count = 0;
    DBObject obj;
    while (cursor.hasNext()) {
        obj = (DBObject) cursor.next();
        count++;
        sum = sum + (Integer) obj.get("count");
    }
    m.put("sum", sum);
    m.put("count", count);
    m.put("average", sdf.format(new Float(sum) / count));
    return m;
}

Metrics: Java Batch Service

Use the Aggregation Framework where required on core collections (e_web) and external data

// create aggregation objects
DBObject project = new BasicDBObject("$project", new BasicDBObject("day_value", fields));
DBObject day_value = new BasicDBObject("day_value", "$day_value");
DBObject groupFields = new BasicDBObject("_id", day_value);

// create the fields to group by, in this case "number"
groupFields.put("number", new BasicDBObject("$sum", 1));

// create the group
DBObject group = new BasicDBObject("$group", groupFields);

// execute
AggregationOutput output = mycollection.aggregate(project, group);
for (DBObject obj : output.results()) {
    ..
}

Metrics: Java Batch Service

MongoDB command line example of aggregation over a time period, e.g. month

> db.e_web.aggregate([
    { $match : { created_date : { $gt : ISODate("2012-10-25T00:00:00") } } },
    { $project : { day_value : { "day" : { $dayOfMonth : "$created_date" }, "month" : { $month : "$created_date" } } } },
    { $group : { _id : { day_value : "$day_value" }, number : { $sum : 1 } } },
    { $sort : { "_id.day_value.month" : -1, "_id.day_value.day" : -1 } }
])

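For readers less familiar with the aggregation pipeline, the same match, group and sort logic can be written in plain Python over a list of event dicts. This is only a reading aid under assumed sample data, and the function name is hypothetical. As in the shell example, the year is not part of the group key, so the same day in different years would be counted together.

```python
from collections import Counter
from datetime import datetime


def count_by_day(events, since):
    """Count events per (month, day) created after `since`, latest first."""
    counts = Counter()
    for event in events:
        created = event["created_date"]
        if created > since:                            # $match with $gt
            counts[(created.month, created.day)] += 1  # $project + $group
    return sorted(counts.items(), reverse=True)        # $sort descending


# Made-up sample events.
events = [
    {"created_date": datetime(2012, 10, 25)},
    {"created_date": datetime(2012, 10, 26)},
    {"created_date": datetime(2012, 10, 26)},
    {"created_date": datetime(2012, 11, 1)},
]
daily = count_by_day(events, datetime(2012, 10, 25))
```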
Metrics: Java Batch Service

Persisting events into graph and table collections

> db.homeGraphs.find()
{ "_id" : ObjectId("50f57b5c1d4e714b581674e2"), "accounts_natural" : 54, "accounts_total" : 54, "date" : ISODate("2011-02-06T05:00:00Z"), "linked_rate" : "12.96", "premium_rate" : "0", "str_date" : "2011,01,06", "upgrade_rate" : "0", "users_avg_linked" : "3.43", "users_linked" : 7 }
{ "_id" : ObjectId("50f57b5c1d4e714b581674e3"), "accounts_natural" : 144, "accounts_total" : 144, "date" : ISODate("2011-02-07T05:00:00Z"), "linked_rate" : "11.11", "premium_rate" : "0", "str_date" : "2011,01,07", "upgrade_rate" : "0", "users_avg_linked" : "4", "users_linked" : 16 }
{ "_id" : ObjectId("50f57b5c1d4e714b581674e4"), "accounts_natural" : 119, "accounts_total" : 119, "date" : ISODate("2011-02-08T05:00:00Z"), "linked_rate" :

Metrics: Django and HighCharts

Extract data (pyMongo)

def getHomeChart(dt_from, dt_to):
    """Called by home method to get latest 30 day numbers"""
    try:
        conn = pymongo.Connection('localhost', 27017)
        db = conn['lvanalytics']
        cursor = db.accountmetrics.find(
            {"date": {"$gte": dt_from, "$lte": dt_to}}).sort("date")
        return buildMetricsDict(cursor)
    except Exception as e:
        logger.error(e.message)

Return the graph object (as a list or a dict of lists) to the view that called the method

# the view passes the 30-day window that getHomeChart expects
dt_to = datetime.datetime.utcnow()
dt_from = dt_to - datetime.timedelta(days=30)
pagedata = {}
pagedata['accountsGraph'] = mongodb_home.getHomeChart(dt_from, dt_to)
return render_to_response('home.html', {'pagedata': pagedata}, context_instance=RequestContext(request))

Metrics: Django and HighCharts

Populate the series (JavaScript with Django templating)

seriesOptions[0] = {
    id: 'naturalAccounts',
    name: "Natural Accounts",
    data: [
        {% for a in pagedata.metrics.accounts_natural %}
            {% if not forloop.first %}, {% endif %}
            [Date.UTC({{a.0}}),{{a.1}}]
        {% endfor %}
    ],
    tooltip: { valueDecimals: 2 }
};

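JavaScript's Date.UTC takes a zero-based month, which would explain why a stored document dated 2011-02-06 carries a str_date of "2011,01,06". A small Python sketch of producing that argument string on the server side follows. The helper name is hypothetical, and it is an assumption that the template value a.0 is built this way.

```python
from datetime import date


def js_utc_args(day: date) -> str:
    """Format a date as 'YYYY,MM,DD' arguments for JavaScript's Date.UTC."""
    # Date.UTC counts months from 0 (January), so subtract one.
    return "%d,%02d,%02d" % (day.year, day.month - 1, day.day)


args = js_utc_args(date(2011, 2, 6))
```

Here `args` is "2011,01,06", matching the str_date stored alongside ISODate("2011-02-06T05:00:00Z").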
Metrics: Django and HighCharts

And create the charts and tables...

Data Science Tools: IPython Notebook
• Deployed on an EC2 Large AML Medium Instance
• Configured for Python 2.7.3
• Loaded with Matplotlib, PyLab, SciPy, NumPy, pyMongo, MySQL-python

Insights
• Write wrapper methods to access user data
• Accessible to anyone through a browser
• Very effective way to scale quickly with little overhead

Applications
• Decision tree analysis over website and iOS, which showed common paths
• Session-level analysis on iOS devices
• Multi-page form conversion retention rates
• Quickly conduct segment analysis via a programming API

Data Science Tools: Pig
• Executed using Ruby scripts
• Pulled data from MongoDB
• Forwarded to AWS EMR cluster for analysis
• MR functions written in Python and occasionally Java

Insights
• Used for ad hoc analysis involving large datasets

Applications
• Daily, weekly, monthly conversion metrics on page views and forms
• Identified trends in spending over 1M rows
• Used lightly at LearnVest, growing in capability

Things that didn’t work

MongoDB Upserts
• Quickly become read-heavy and slow down the db

MongoDB Aggregation Framework
• Fine for ad hoc analysis, but you might be better off establishing a repeatable framework to run MR algos

Django-nonrel
• Unstable; use Django and configure MongoDB as a datastore only

Lessons Learned• Date Time managed as two fields, Datetime and Date
• Real-time Map-Reduce in pyMongo - too slow, don’t do this.
• Memcached on Django is good enough (at the moment) - use django-celery with rabbitmq to pre-cache all data after data loading
• HighCharts is buggy - considering D3 & other libraries
• Don’t need to retrieve data directly from MongoDB to Django, perhaps provide all data via a service layer (at the expense of ever-additional features in pyMongo)
• Make better use of EMR upfront if resources are limited and data is vast.
Thanks!...Questions?