Validating big data at scale

Validating Data at Scale, by Spenser Skates, CEO at Amplitude

Upload: amplitude-mobile-analytics

Post on 22-Jun-2015

Category: Data & Analytics

DESCRIPTION

When you're collecting data from hundreds of millions of devices simultaneously, things get noisy. We go over key problems and solutions for collecting and validating data at scale.

TRANSCRIPT

Page 1: Validating big data at scale

Validating Data at Scale

Spenser Skates, CEO at Amplitude

Page 2: Validating big data at scale

Doing things at scale is noisy

• Code is supposed to run the same way every time. But if you run the same loop a million times on a million different machines, how confident are you that it will always behave the same?

Page 3: Validating big data at scale

Data from phones is noisier

• Running on tens of thousands of different platforms, with hundreds of thousands of different software configurations, on hundreds of millions of phones

• Platforms have the craziest settings

Page 4: Validating big data at scale

How data can get messed up

• HTTP requests get mangled in transit

• The phone might not get the acknowledgement from the server

• People’s clocks are off

• People are running weird versions of Android

• Memory/disk corruption

• Gamma ray events

Page 5: Validating big data at scale

You can’t trust data from the client

Page 6: Validating big data at scale

Problem: Data gets mangled in transit

• Parameters from POST requests get dropped

• Within a parameter, a chunk of the data may never reach the server

Page 7: Validating big data at scale

Solution: Checksumming

• Send a checksum that’s a function of all the fields

• If the checksum is wrong or missing, you know that you haven’t received all the data. Tell the phone the upload wasn’t successful

• The phone will then attempt to re-upload the data
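
The checksum round trip above can be sketched roughly as follows. This is a minimal illustration, not Amplitude's implementation: the function names, the JSON serialization, and the choice of SHA-256 are all assumptions made for the example, since the slides do not say how the checksum is computed.

```python
import hashlib
import json


def compute_checksum(fields: dict) -> str:
    """Hash a canonical serialization of every field in the payload."""
    # sort_keys makes the byte string independent of dict ordering, so the
    # client and the server hash exactly the same bytes.
    canonical = json.dumps(fields, sort_keys=True, separators=(",", ":"))
    # SHA-256 is an assumption chosen for illustration only.
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_upload(fields: dict) -> dict:
    """Client side (hypothetical): attach the checksum to the outgoing request."""
    return {"fields": fields, "checksum": compute_checksum(fields)}


def accept_upload(request: dict) -> bool:
    """Server side (hypothetical): True means acknowledge, False means ask for a re-upload."""
    fields = request.get("fields")
    checksum = request.get("checksum")
    if fields is None or checksum is None:
        return False  # a parameter was dropped in transit
    return compute_checksum(fields) == checksum
```

A request whose fields were altered or truncated after the checksum was computed fails the comparison, so the server reports failure and the phone retries.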

Page 8: Validating big data at scale

Problem: Client sends the same data twice

• How does the phone know that the server has received the data, so that it doesn’t upload the same data twice? It gets an acknowledgement back

• How does the server know that the phone has received the acknowledgement? It doesn’t!

• Equivalent to the Two Generals’ Problem

• 5% of the requests that the server receives successfully fail to deliver an acknowledgement back to the phone

• That means all counts are inflated by about 5%!
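
To see where the "about 5%" comes from, here is a small back-of-the-envelope calculation. It assumes (this is not stated in the slides) that the phone keeps retrying until it gets an acknowledgement, that every retry reaches the server, and that each acknowledgement is lost independently with the same probability. The function name is hypothetical.

```python
def expected_copies(ack_loss_rate: float) -> float:
    """Expected number of times a server without deduplication stores one event.

    The first upload always lands; with probability p its acknowledgement is
    lost and the phone uploads again, and so on:
    1 + p + p^2 + ... = 1 / (1 - p).
    """
    return 1.0 / (1.0 - ack_loss_rate)


inflation = expected_copies(0.05) - 1.0
print(f"{inflation:.1%}")  # prints 5.3%
```

Under these assumptions, a 5% acknowledgement loss rate inflates raw counts by slightly more than 5%, which matches the slide's rough figure.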

Page 9: Validating big data at scale

Solution: Deduplication

• Your system must be idempotent at the event level: it must be able to receive an event it has already received without changing its state

• Create a unique key for every event that is sent

• When you see an event, check your list of keys; if the key is already present, discard the event
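
A minimal sketch of event-level idempotency is shown below. The `EventStore` class and the `device_id` and `event_id` fields are hypothetical names chosen for the example, and the in-memory set stands in for whatever durable key store a real pipeline would use.

```python
class EventStore:
    """Toy ingest path that ignores events it has already seen."""

    def __init__(self):
        self._seen_keys = set()  # assumption: in-memory stand-in for a durable key store
        self.events = []

    def ingest(self, event: dict) -> bool:
        """Store the event and return True, or return False if it is a duplicate."""
        # Hypothetical unique key: the device plus a per-device event id.
        key = (event["device_id"], event["event_id"])
        if key in self._seen_keys:
            return False  # already received: leave state unchanged
        self._seen_keys.add(key)
        self.events.append(event)
        return True


store = EventStore()
event = {"device_id": "phone-1", "event_id": 42, "type": "app_open"}
store.ingest(event)  # stored
store.ingest(event)  # retry after a lost acknowledgement: discarded
print(len(store.events))  # prints 1
```

Because a repeated upload leaves the store unchanged, the server can safely acknowledge duplicates, and the phone's retries no longer inflate counts.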

Page 10: Validating big data at scale

Problem: Clocks are off

• Phones are often offline, so an analytics SDK needs to cache data locally before uploading, including the time each event occurred

• But people’s clocks are often off, occasionally by years!

• We can’t simply timestamp events with the upload time: 5% of data is uploaded more than 24 hours after the event happened

Page 11: Validating big data at scale

Solution: Get an estimate of the actual time an event was logged

• Timestamp the upload from the phone

• For each event, let’s compare:

    • The difference between the phone event timestamp and the server upload time

    • The difference between the phone upload timestamp and the server upload time

Page 14: Validating big data at scale

Solution: Get an estimate of the actual time an event was logged

• For each event timestamp, subtract the difference between the phone’s upload time and the server’s upload time
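
The correction above amounts to a single subtraction, sketched here. The function and parameter names are hypothetical, and the sketch assumes that the phone's clock error stayed constant between the event and the upload and that network latency is negligible.

```python
from datetime import datetime, timedelta, timezone


def estimate_event_time(phone_event_time: datetime,
                        phone_upload_time: datetime,
                        server_upload_time: datetime) -> datetime:
    """Shift a phone-reported event time onto the server's clock."""
    # How far ahead of (or behind) the server the phone's clock is running.
    clock_offset = phone_upload_time - server_upload_time
    return phone_event_time - clock_offset


# Example: the phone's clock runs 3 hours fast and the event was cached
# for 30 hours before it was uploaded.
server_upload = datetime(2015, 6, 22, 12, 0, tzinfo=timezone.utc)
phone_upload = server_upload + timedelta(hours=3)
phone_event = phone_upload - timedelta(hours=30)

estimated = estimate_event_time(phone_event, phone_upload, server_upload)
print(estimated.isoformat())  # prints 2015-06-21T06:00:00+00:00
```

The estimate lands 30 hours before the server's upload time regardless of how wrong the phone's clock is, as long as the error did not change while the event sat in the cache.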

Page 15: Validating big data at scale

Other Problems

• People are running weird versions of Android

    • MD5 library

• Memory/disk corruption

• Gamma ray events

Page 16: Validating big data at scale

Clean Data

Page 17: Validating big data at scale

Questions?

Always happy to talk about analytics problems!

[email protected]

blog.amplitude.com

twitter: @amplitudemobile

MOBILE ANALYTICS FOR DECISION MAKERS