appboy analytics - nyc mug 11/19/13

Appboy Analytics Jon Hyman NY MongoDB User Group, November 19, 2013 eBay NYC

@appboy @jon_hyman

A LITTLE BIT ABOUT US & APPBOY

Jon Hyman CIO :: @jon_hyman

Appboy is a mobile relationship management platform for apps

(who we are and what we do)

Harvard Bridgewater

Appboy improves engagement by helping you understand your app users

• IDENTIFY - Understand demographics, social and behavioral data

• SEGMENT - Organize customers into groups based on behaviors, events, user attributes, and location

• ENGAGE - Message users through push notifications, emails, and multiple forms of in-app messages

Use Case: Customer engagement begins with onboarding

Urban Outfitters textPlus Shape Magazine

Agenda

• How to quickly store time series data in MongoDB using flexible schemas

• Learn how flexible schemas can easily provide breakdowns across dimensions

• Counting quickly: statistical analysis on top of MongoDB queries

What kinds of analytics does Appboy track?

• Lots of time series data

  • App opens over time

  • Events over time

  • Revenue over time

  • Marketing campaign stats and efficacy over time

What kinds of analytics does Appboy track?

• Breakdowns*

  • Device types

  • Device OS versions

  • Screen resolutions

  • Revenue by product

* We also care about these over time!

What kinds of analytics does Appboy track?

• User segment membership

  • How many users are in each segment?

  • How many can be emailed or reached via push notifications?

  • What is the average revenue per user in the segment?

  • Per paying user?

Pre-aggregated Analytics:

APP OPENS OVER TIME

Typical time series collection

Log a new row for each open received:

    {
      timestamp: 2013-11-14 00:00:00 UTC,
      app_id: App identifier
    }

    db.app_opens.find({app_id: A, timestamp: {$gte: date}})

Con: You need to aggregate the data before drawing the chart; lots of documents read into memory, lots of dirty pages

Pro: Really, really simple. Easy to add attribution to users.
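To make the "aggregate before drawing the chart" cost concrete, here is a minimal sketch of the client-side grouping a row-per-open collection requires. The function name countOpensByHour and the in-memory rows array are assumptions for illustration; they stand in for whatever the find() above would return.

```javascript
// Hypothetical stand-in for documents returned by
// db.app_opens.find({app_id: A, timestamp: {$gte: date}}).
function countOpensByHour(rows) {
  const counts = {};
  for (const row of rows) {
    // Truncate the timestamp to the hour (UTC) to form the histogram key.
    const d = new Date(row.timestamp);
    d.setUTCMinutes(0, 0, 0);
    const key = d.toISOString();
    counts[key] = (counts[key] || 0) + 1;
  }
  return counts;
}

// Illustrative rows, not data from the talk.
const rows = [
  {app_id: "A", timestamp: "2013-11-14T00:05:00Z"},
  {app_id: "A", timestamp: "2013-11-14T00:40:00Z"},
  {app_id: "A", timestamp: "2013-11-14T01:10:00Z"},
];
const hourly = countOpensByHour(rows);
```

Every open in the time span has to be read before a single point can be drawn, which is the cost the pre-aggregated layouts avoid.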

Fewer documents with pre-aggregation iteration 1

Create a document that groups by the time period:

    {
      app_id: App identifier,
      date: Date of the document,
      hour: 0-23 based hour this document represents,
      opens: Number of opens this hour
    }

    db.app_opens.update({date: D, app_id: A, hour: 0}, {$inc: {opens: 1}})

Con: We never care about an hour by itself. We lose attribution.

Pro: Really easy to draw histograms

Fewer documents with pre-aggregation iteration 2

Create a document by day and have each hour be a field:

    {
      app_id: App identifier,
      date: Date of the document,
      total_opens: Total number of opens this day,
      0: Number of opens at midnight,
      1: Number of opens at 1am,
      ...
      23: Number of opens at 11pm
    }

    db.app_opens.update(
      {date: D, app_id: A},
      {$inc: {"0": 1, total_opens: 1}}
    )

Pro: Document count is low, easy to use aggregation framework for longer spans, fast: document should be in working set
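A minimal sketch of how the day-level update could be derived from the time of an open. The helper name buildOpenUpdate, the use of UTC for day and hour boundaries, and the upsert option in the comment are assumptions, not details from the talk.

```javascript
// Builds the query and $inc spec for one app open.
// Assumes day and hour boundaries are in UTC.
function buildOpenUpdate(appId, openedAt) {
  const t = new Date(openedAt);
  // Midnight UTC of the day identifies the per-day document.
  const day = new Date(Date.UTC(t.getUTCFullYear(), t.getUTCMonth(), t.getUTCDate()));
  const hourField = String(t.getUTCHours()); // "0" .. "23"
  return {
    query: {date: day, app_id: appId},
    update: {$inc: {[hourField]: 1, total_opens: 1}},
  };
}

const spec = buildOpenUpdate("A", "2013-11-14T13:27:00Z");
// With a live connection (assumed usage):
// db.app_opens.update(spec.query, spec.update, {upsert: true})
```

An upsert lets the first open of the day create the document, so nothing has to pre-allocate it.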


• What about looking at different dimensions?

• App opens by device type (e.g., how do iPads compare to iPhones?)

• Demographics (gender, age group)

Solution!

FLEXIBLE SCHEMAS!


    {
      app_id: App identifier,
      date: Date of the document,
      totals: {
        app_opens: Total number of opens this day,
        devices: {
          "iPad Air": Total number of opens on the iPad Air,
          "iPhone 4": Total number of opens on the iPhone 4
        },
        genders: {
          male: Total number of opens from male users,
          female: Total number of opens from female users
        },
        ...
      },
      0: {
        app_opens: Number of opens at midnight,
        devices: {
          "iPad Air": Number of opens on the iPad Air at midnight,
          "iPhone 4": Number of opens on the iPhone 4 at midnight
        },
        ...
      },
      ...
    }

    db.app_opens.update(
      {date: D, app_id: A},
      {$inc: {
        "totals.app_opens": 1,
        "totals.devices.iPad Air": 1,
        "0.app_opens": 1,
        "0.devices.iPad Air": 1
      }}
    )

Dynamically add dimensions in the document
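One way to add dimensions dynamically is to generate the $inc keys with dot notation, as in this sketch. The helper name buildDimensionedInc and the shape of its dimensions argument are hypothetical; the nesting follows the document layout above.

```javascript
// Builds a $inc spec that bumps the daily totals and one hour's counters
// for every supplied dimension, e.g. {devices: "iPad Air"}.
// Caveat: dimension values containing "." or starting with "$" would
// need escaping before being used as field names.
function buildDimensionedInc(hour, dimensions) {
  const inc = {};
  for (const scope of ["totals", String(hour)]) {
    inc[`${scope}.app_opens`] = 1;
    for (const [dimension, value] of Object.entries(dimensions)) {
      inc[`${scope}.${dimension}.${value}`] = 1;
    }
  }
  return {$inc: inc};
}

const update = buildDimensionedInc(0, {devices: "iPad Air", genders: "female"});
// With a live connection (assumed usage):
// db.app_opens.update({date: D, app_id: A}, update, {upsert: true})
```

Because $inc creates missing fields, a device model that has never been seen before needs no schema change; its counter appears the first time it is incremented.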

Pre-aggregated analytics

• Pros

  • Easily extensible to add other dimensions

  • Still only using one document, therefore you can create charts very quickly

  • You get breakdowns over a time period for free

• Cons

  • Pre-aggregated data has no attribution

  • Have to know questions ahead of time

Follow up: What if we wanted to look at a graph by age group?

Pre-aggregated analytics summary

• Get started tracking time series data quickly

• You get breakdowns for free

• Adding dimensions is super simple

• No attribution, need to know questions ahead of time

• Don’t just rely on pre-aggregated analytics

Counting quickly:

USER SEGMENTATION & STATISTICAL ANALYSIS

User Segmentation

• A group of users who match some set of filters

Counting quickly

Appboy shows you segment membership in real-time as you add/edit/remove filters.

How do we do it quickly?

We estimate the population sizes of segments when using our web UI.

Counting quickly

Goal: Quickly get the count() of an arbitrary query

Problem: MongoDB counts are slow, especially unindexed ones

Counting quickly

10 million documents that represent people:

    {
      favorite_color: "blue",
      age: 27,
      gender: "M",
      favorite_food: "pizza",
      city: "NYC",
      shoe_size: 11,
      attractiveness: 10,
      ...
    }


• How many people like blue?

• How many live in NYC and love pizza?

• How many men have a shoe size less than 10?

Big Question: How do you estimate counts?

Answer: The same way news networks do it.

With confidence.

Counting quickly

Add a random number in a known range to each document. Say, between 0 and 9999.

    {
      random: 4583,
      favorite_color: "blue",
      age: 27,
      gender: "M",
      favorite_food: "pizza",
      city: "NYC",
      shoe_size: 11,
      attractiveness: 10,
      ...
    }

Add an index on the random number:

    db.users.ensureIndex({random: 1})

Counting quickly

Step 1: Get a random sample

I have 10 million documents. Of my 10,000 random "buckets", I should expect each "bucket" to hold about 1,000 users.

E.g.,

    db.users.find({random: 123}).count()   // ~1000
    db.users.find({random: 9043}).count()  // ~1000
    db.users.find({random: 4982}).count()  // ~1000
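A small sketch of how the random field might be assigned when a user document is written, and of the bucket-size arithmetic. The names withRandomBucket and expectedPerBucket are hypothetical, and drawing the value uniformly with Math.random is an assumption.

```javascript
const BUCKETS = 10000; // random values fall in 0..9999

// Returns a copy of the document with a uniformly drawn bucket number.
// The rng parameter exists so the draw can be made deterministic in tests.
function withRandomBucket(doc, rng = Math.random) {
  return {...doc, random: Math.floor(rng() * BUCKETS)};
}

// With uniform assignment, each bucket holds population / BUCKETS users on average.
function expectedPerBucket(population) {
  return population / BUCKETS;
}

const user = withRandomBucket({favorite_color: "blue", age: 27});
const perBucket = expectedPerBucket(10000000);
```

The value is assigned once and stored, so the same user always falls in the same bucket and the index on random can serve the sampling queries.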

Counting quickly

Step 1: Get a random sample

Let's take a random 100,000 users. Grab a random range of 100 buckets that "holds" those users. These all work:

    db.users.find({random: {$gt: 0, $lt: 101}})
    db.users.find({random: {$gt: 503, $lt: 604}})
    db.users.find({random: {$gt: 8938, $lt: 9039}})
    db.users.find({$or: [
      {random: {$gt: 9955}},
      {random: {$lt: 56}}
    ]})

Tip: Limit $maxScan to 100,000 just to be safe
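The range selection, including the wrap-around case, can be sketched as a query builder. The function names are hypothetical, and this version uses an inclusive lower bound ($gte) rather than the $gt form on the slide; both select the same number of buckets.

```javascript
const BUCKETS = 10000; // random values fall in 0..9999

// Builds a filter selecting `width` consecutive buckets beginning at `start`,
// wrapping past 9999 back to 0 when the range runs off the end.
function sampleRangeQuery(start, width) {
  const end = start + width; // exclusive upper bound
  if (end <= BUCKETS) {
    return {random: {$gte: start, $lt: end}};
  }
  return {$or: [{random: {$gte: start}}, {random: {$lt: end - BUCKETS}}]};
}

// Picks a random starting bucket; 100 buckets is about 100,000 of 10 million users.
function randomSampleQuery(width, rng = Math.random) {
  return sampleRangeQuery(Math.floor(rng() * BUCKETS), width);
}

const wrapped = sampleRangeQuery(9956, 100); // same buckets as the $or example
const query = randomSampleQuery(100);
```

Merging the query with segment filters (gender, favorite_color, and so on) is then a matter of adding those fields alongside the random condition.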

Counting quickly

Step 2: Learn about that random sample

    db.users.find(
      {
        random: {$gt: 0, $lt: 101},
        gender: "M",
        favorite_color: "blue",
        shoe_size: {$gt: 10}
      }
    )
    ._addSpecial("$maxScan", 100000)
    .explain()

Explain Result:

    {
      nscannedObjects: 100000,
      n: 11302,
      ...
    }

Counting quickly

Step 3: Do the math

Population: 10,000,000

Sample size: 100,000

Num matches: 11,302

Percentage of users who matched: 11.3%

Estimated total count: about 1,130,000, +/- 0.2 percentage points (roughly +/- 20,000 users) with 95% confidence
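The arithmetic can be sketched with the normal approximation for a sample proportion. The talk does not say which interval method Appboy uses, so the formula choice and the name estimateCount are assumptions.

```javascript
// Scales a sample match count to the population and attaches a margin of
// error from the normal approximation: z * sqrt(p * (1 - p) / n).
// z = 1.96 corresponds to 95% confidence.
function estimateCount(population, sampleSize, matches, z = 1.96) {
  const p = matches / sampleSize;
  const standardError = Math.sqrt((p * (1 - p)) / sampleSize);
  const margin = z * standardError;
  return {
    proportion: p,
    marginOfError: margin, // as a fraction, e.g. 0.002 = 0.2 percentage points
    estimate: Math.round(p * population),
    low: Math.round((p - margin) * population),
    high: Math.round((p + margin) * population),
  };
}

// Numbers from the explain() result: n = 11302, nscannedObjects = 100000.
const result = estimateCount(10000000, 100000, 11302);
```

With these inputs the margin comes out just under 0.002, which is where the 0.2 figure comes from.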

Counting quickly

Step 4: Optimize

• Limit $maxScan to (100,000/numShards) to be even faster

• Cache the random range for a few hours

• Add more RAM (or shards)

• Cache results to not hit the database for the same query
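Two of these optimizations are easy to sketch: the per-shard scan limit and a time-limited cache keyed by the query. The names maxScanPerShard and makeTtlCache, the in-process Map, and the three-hour TTL are assumptions for illustration; a production setup would more likely use a shared cache.

```javascript
// Each shard scans its share of the sample, so the limit shrinks per shard.
function maxScanPerShard(sampleSize, numShards) {
  return Math.floor(sampleSize / numShards);
}

// Minimal time-to-live cache. `now` is injectable so expiry can be tested.
function makeTtlCache(ttlMs, now = Date.now) {
  const entries = new Map();
  return {
    get(key, compute) {
      const hit = entries.get(key);
      if (hit && now() - hit.at < ttlMs) return hit.value;
      const value = compute(); // cache miss or expired entry: recompute
      entries.set(key, {value, at: now()});
      return value;
    },
  };
}

// Example: reuse an estimate for identical filters for three hours.
const cache = makeTtlCache(3 * 60 * 60 * 1000);
const filters = {gender: "M", favorite_color: "blue"};
const cached = cache.get(JSON.stringify(filters), () => 1130200);
```

JSON.stringify is only a stable key if filters are always built with the same field order; a canonical serialization would be needed otherwise.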

Counting quickly

Step 5: Improve

• Get more than one count: use the aggregation framework on top of the population's sample size

• Work around all sorts of Mongo bugs :-(
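Getting several counts from one pass over the sample might look like the sketch below: a $match on the random range followed by a $group, with each group's count scaled up to the population. The helper names are hypothetical and the group counts are illustrative, not results from the talk.

```javascript
// Aggregation pipeline: restrict to the sampled buckets, then count per value.
function breakdownPipeline(sampleQuery, field) {
  return [
    {$match: sampleQuery},
    {$group: {_id: "$" + field, count: {$sum: 1}}},
  ];
}

// Scales each sampled group count by population / sampleSize.
function scaleToPopulation(groups, sampleSize, population) {
  const factor = population / sampleSize;
  return groups.map((g) => ({value: g._id, estimate: Math.round(g.count * factor)}));
}

const pipeline = breakdownPipeline({random: {$gt: 0, $lt: 101}}, "favorite_color");
// With a live connection (assumed usage): db.users.aggregate(pipeline)

// Made-up sample output for illustration only.
const sampleGroups = [{_id: "blue", count: 11302}, {_id: "red", count: 25000}];
const estimates = scaleToPopulation(sampleGroups, 100000, 10000000);
```

Each group's estimate carries its own margin of error, which widens as a group's share of the sample shrinks.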

Summarize

• Pre-aggregated analytics

  • Create a document that represents event occurrences in some time period

  • Takes full advantage of MongoDB's flexible schemas

  • Not a catch-all for analytics, you should still store event data

Summarize

• Counting quickly

  • Estimate results of arbitrary queries using population sample sizes

  • Depending on your app, this could be a great way to keep response time predictable as you scale

Thanks! [email protected]

@appboy @jon_hyman
