big data meetup budapest adding data schemas to snowplow
![Page 1: Big data meetup budapest adding data schemas to snowplow](https://reader034.vdocuments.site/reader034/viewer/2022052522/5549dfafb4c9051e488b4780/html5/thumbnails/1.jpg)
Adding Data Schemas to Snowplow
Big Data Budapest Meetup - 5 June 2014
![Page 2: Big data meetup budapest adding data schemas to snowplow](https://reader034.vdocuments.site/reader034/viewer/2022052522/5549dfafb4c9051e488b4780/html5/thumbnails/2.jpg)
Agenda today
1. Introduction to Snowplow
2. Evolution of Snowplow
3. The answer: schema all the things!
4. Snowplow roadmap
5. Questions
![Page 3: Big data meetup budapest adding data schemas to snowplow](https://reader034.vdocuments.site/reader034/viewer/2022052522/5549dfafb4c9051e488b4780/html5/thumbnails/3.jpg)
Introduction to Snowplow
![Page 4: Big data meetup budapest adding data schemas to snowplow](https://reader034.vdocuments.site/reader034/viewer/2022052522/5549dfafb4c9051e488b4780/html5/thumbnails/4.jpg)
Snowplow is an open-source web and event analytics platform, first version released in early 2012
• Co-founders Alex Dean and Yali Sassoon met at OpenX, the open-source ad technology business, in 2008
• After leaving OpenX, Alex and Yali set up Keplar, a niche digital product and analytics consultancy
• We released Snowplow as a skunkworks prototype at the start of 2012: github.com/snowplow/snowplow
• We started working full time on Snowplow in summer 2013
![Page 5: Big data meetup budapest adding data schemas to snowplow](https://reader034.vdocuments.site/reader034/viewer/2022052522/5549dfafb4c9051e488b4780/html5/thumbnails/5.jpg)
We wanted to take a fresh approach to web analytics
• Your own web event data -> in your own data warehouse
• Your own event data model
• Slice / dice and mine the data in highly bespoke ways to answer your specific business questions
• Plug in the broadest possible set of analysis tools to drive value from your data
[Diagram: Data pipeline -> Data warehouse -> Analyse your data in any analysis tool]
![Page 6: Big data meetup budapest adding data schemas to snowplow](https://reader034.vdocuments.site/reader034/viewer/2022052522/5549dfafb4c9051e488b4780/html5/thumbnails/6.jpg)
By spring 2013 we had arrived at a relatively stable batch-based processing architecture
[Diagram: Snowplow Hadoop data pipeline]
• Website / webapp with JavaScript event tracker
• CloudFront-based event collector or Clojure-based event collector
• Amazon S3
• Scalding-based enrichment on Hadoop
• Amazon Redshift / PostgreSQL
![Page 7: Big data meetup budapest adding data schemas to snowplow](https://reader034.vdocuments.site/reader034/viewer/2022052522/5549dfafb4c9051e488b4780/html5/thumbnails/7.jpg)
Evolution of Snowplow
![Page 8: Big data meetup budapest adding data schemas to snowplow](https://reader034.vdocuments.site/reader034/viewer/2022052522/5549dfafb4c9051e488b4780/html5/thumbnails/8.jpg)
Snowplow is evolving from a web analytics platform into a general event analytics platform
[Diagram: Collect event data from any connected device -> Data warehouse]
![Page 9: Big data meetup budapest adding data schemas to snowplow](https://reader034.vdocuments.site/reader034/viewer/2022052522/5549dfafb4c9051e488b4780/html5/thumbnails/9.jpg)
Web analysts work with a small number of event types – outside of web, the number of possible event types is… infinite
Web events:
• Page view • Order • Add to basket • Page activity
All events:
• Game saved • Machine broke • Car started
• Spellcheck run • Screenshot taken • Fridge empty
• App crashed • Disk full • SMS sent
• Screen viewed • Tweet drafted • Player died
• Taxi arrived • Phonecall ended • Cluster started
• Till opened • Product returned • ∞
![Page 10: Big data meetup budapest adding data schemas to snowplow](https://reader034.vdocuments.site/reader034/viewer/2022052522/5549dfafb4c9051e488b4780/html5/thumbnails/10.jpg)
There are two historic approaches to dealing with the explosion of possible event types
• Web analytics vendors: Custom Variables
• Mobile and app analytics vendors: Schema-less JSONs
![Page 11: Big data meetup budapest adding data schemas to snowplow](https://reader034.vdocuments.site/reader034/viewer/2022052522/5549dfafb4c9051e488b4780/html5/thumbnails/11.jpg)
Custom variables are very restrictive
1. Take a standard web event, like a page view:
Page View
2. and add custom variables until it becomes something totally different:
Page View + vehicle=taxi23 + status=arrived
= a “taxi arrived” event, kind of!
![Page 12: Big data meetup budapest adding data schemas to snowplow](https://reader034.vdocuments.site/reader034/viewer/2022052522/5549dfafb4c9051e488b4780/html5/thumbnails/12.jpg)
Schema-less JSONs are better, but they have a different set of problems
Issues with the event name:
• Separate from the event properties
• Not versioned
• Not unique – HBO video played versus Brightcove video played
Lots of unanswered questions about the properties:
• Is length required, and is it always a number?
• Is id required, and is it always a string?
• What other optional properties are allowed for a video play?
Other issues:
• What if the developer accidentally starts sending “len” instead of “length”? The data will end up split across two separate fields
• Why does the analyst need to keep an implicit schema in their head to analyze video played events?
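To make the “len” versus “length” problem concrete, here is a minimal Python sketch. The event layout (an `event` name plus a `properties` object) and all values are assumptions made up for illustration; only the property names `id`, `length` and `len` come from the slide.

```python
from collections import Counter

# Hypothetical schema-less "video played" events. The values and the
# event/properties layout are invented for this example.
events = [
    {"event": "video played", "properties": {"id": "v-1", "length": 213}},
    {"event": "video played", "properties": {"id": "v-2", "len": 95}},
    {"event": "video played", "properties": {"id": "v-3", "length": "180"}},
]


def field_usage(events):
    """Count how often each property name appears across all events."""
    counts = Counter()
    for event in events:
        counts.update(event["properties"].keys())
    return counts


def total_length(events):
    """Naive analyst query: sum numeric 'length' values, skipping the rest."""
    return sum(
        e["properties"]["length"]
        for e in events
        if isinstance(e["properties"].get("length"), (int, float))
    )


usage = field_usage(events)
print(dict(usage))           # the same measurement is split across two names
print(total_length(events))  # only one of the three events contributes
```

Nothing fails here: the misspelt field and the string-typed length are silently dropped from the total, which is exactly why the analyst ends up carrying the implicit schema in their head.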
![Page 13: Big data meetup budapest adding data schemas to snowplow](https://reader034.vdocuments.site/reader034/viewer/2022052522/5549dfafb4c9051e488b4780/html5/thumbnails/13.jpg)
The answer: schema all the things!
![Page 14: Big data meetup budapest adding data schemas to snowplow](https://reader034.vdocuments.site/reader034/viewer/2022052522/5549dfafb4c9051e488b4780/html5/thumbnails/14.jpg)
When a developer or analyst defines a new event in JSON, let’s ask them to create a JSON Schema for that event
Additional optional field we might not know about otherwise
No other fields allowed
Yes length should always be a number
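The schema shown on this slide is not part of the transcript, so the following is only a sketch of what such a JSON Schema could look like, written as a Python dict. The optional `quality` field and the choice to make `id` and `length` required are assumptions. The tiny `validate` function is a hand-rolled stand-in that understands just the keywords used here (`required`, `type`, `additionalProperties`); a real pipeline would use a full JSON Schema validator.

```python
# Hypothetical JSON Schema for a "video played" event, as a Python dict.
VIDEO_PLAYED_SCHEMA = {
    "type": "object",
    "properties": {
        "id": {"type": "string"},
        "length": {"type": "number"},    # "length should always be a number"
        "quality": {"type": "string"},   # assumed optional extra field
    },
    "required": ["id", "length"],        # assumption, not stated on the slide
    "additionalProperties": False,       # "no other fields allowed"
}

# Mapping from the JSON Schema type names used above to Python types.
_TYPES = {"string": str, "number": (int, float)}


def validate(instance, schema):
    """Return a list of error strings (empty if valid).

    Supports only flat objects and the keywords used in the schema above.
    """
    if not isinstance(instance, dict):
        return ["instance is not an object"]
    errors = []
    props = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in instance:
            errors.append(f"missing required property: {name}")
    for name, value in instance.items():
        if name not in props:
            if schema.get("additionalProperties", True) is False:
                errors.append(f"unexpected property: {name}")
            continue
        type_name = props[name]["type"]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, _TYPES[type_name]):
            errors.append(f"{name} should be of type {type_name}")
    return errors


print(validate({"id": "v-1", "length": 213}, VIDEO_PLAYED_SCHEMA))
print(validate({"id": "v-2", "len": 95}, VIDEO_PLAYED_SCHEMA))
print(validate({"id": "v-3", "length": "180"}, VIDEO_PLAYED_SCHEMA))
```

With the schema in place, the misspelt `len` and the string-typed `length` are rejected at validation time instead of quietly splitting the data.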
![Page 15: Big data meetup budapest adding data schemas to snowplow](https://reader034.vdocuments.site/reader034/viewer/2022052522/5549dfafb4c9051e488b4780/html5/thumbnails/15.jpg)
But we need to let our event definitions evolve, so let’s add versioning – we’re calling this SchemaVer
MODEL-REVISION-ADDITION
• Start versioning at 1-0-0 – so 1-0-0, 1-0-1, 1-0-2, 1-1-0 etc
• Try to stick to backwards-compatible ADDITION upgrades as much as possible
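A small Python sketch of how MODEL-REVISION-ADDITION versions could be parsed, compared and bumped. The `SchemaVer` class and its `bump` method are hypothetical names for illustration. The reset-to-zero behaviour is an assumption inferred from the 1-0-2 to 1-1-0 sequence on the slide.

```python
from typing import NamedTuple


class SchemaVer(NamedTuple):
    """A MODEL-REVISION-ADDITION version; tuple ordering gives comparison."""

    model: int
    revision: int
    addition: int

    @classmethod
    def parse(cls, text: str) -> "SchemaVer":
        parts = text.split("-")
        if len(parts) != 3 or not all(p.isdecimal() for p in parts):
            raise ValueError(f"not a SchemaVer string: {text!r}")
        return cls(*(int(p) for p in parts))

    def bump(self, kind: str) -> "SchemaVer":
        """Increment one component and reset the components to its right."""
        if kind == "addition":
            return SchemaVer(self.model, self.revision, self.addition + 1)
        if kind == "revision":
            return SchemaVer(self.model, self.revision + 1, 0)
        if kind == "model":
            return SchemaVer(self.model + 1, 0, 0)
        raise ValueError(f"unknown bump kind: {kind!r}")

    def __str__(self) -> str:
        return f"{self.model}-{self.revision}-{self.addition}"


v = SchemaVer.parse("1-0-0")
print(v.bump("addition"))                    # backwards-compatible upgrade
print(SchemaVer.parse("1-0-2").bump("revision"))
```

Because the version is a plain tuple, picking the latest schema in a list is just `max(versions)`.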
![Page 16: Big data meetup budapest adding data schemas to snowplow](https://reader034.vdocuments.site/reader034/viewer/2022052522/5549dfafb4c9051e488b4780/html5/thumbnails/16.jpg)
Where are our schemas going to live? We need a schema repository/registry
[Diagram: Raw events in JSON format -> Enrichment Manager -> Enriched events in Thrift or Avro format -> Shredder -> Enriched events in TSV ready for loading into db]
Roles of the schema repo {} in the diagram:
1. Test instrumentation
2. Validate events
3. Define structure
4. Drive shredding
5. Define structure
![Page 17: Big data meetup budapest adding data schemas to snowplow](https://reader034.vdocuments.site/reader034/viewer/2022052522/5549dfafb4c9051e488b4780/html5/thumbnails/17.jpg)
We need to namespace our schemas properly to prevent clashes and confusion in our schema repository
We are calling our schema methodology “Iglu”
iglu:com.channel2.vod/video_played/jsonschema/1-0-0
• com.channel2.vod: the vendor of this event
• video_played: event name
• jsonschema: schema format
• 1-0-0: schema version
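The four-part namespace can be split mechanically. The sketch below parses an Iglu URI of the shape shown on the slide; `SchemaKey`, `parse_iglu_uri` and the regular expression are hypothetical helpers written for this example, not part of any Snowplow library.

```python
import re
from typing import NamedTuple


class SchemaKey(NamedTuple):
    vendor: str   # the vendor of this event
    name: str     # event name
    format: str   # schema format
    version: str  # schema version, MODEL-REVISION-ADDITION

# Assumed URI shape: iglu:<vendor>/<name>/<format>/<version>
_IGLU_URI = re.compile(
    r"^iglu:(?P<vendor>[^/]+)/(?P<name>[^/]+)/"
    r"(?P<format>[^/]+)/(?P<version>\d+-\d+-\d+)$"
)


def parse_iglu_uri(uri: str) -> SchemaKey:
    """Split an Iglu URI into vendor, name, format and version."""
    match = _IGLU_URI.match(uri)
    if match is None:
        raise ValueError(f"not an Iglu URI: {uri!r}")
    return SchemaKey(**match.groupdict())


def to_uri(key: SchemaKey) -> str:
    """Rebuild the URI from its four parts."""
    return f"iglu:{key.vendor}/{key.name}/{key.format}/{key.version}"


key = parse_iglu_uri("iglu:com.channel2.vod/video_played/jsonschema/1-0-0")
print(key)
```

Keeping the vendor in the key is what lets an HBO “video played” and a Brightcove “video played” coexist in one repository.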
![Page 18: Big data meetup budapest adding data schemas to snowplow](https://reader034.vdocuments.site/reader034/viewer/2022052522/5549dfafb4c9051e488b4780/html5/thumbnails/18.jpg)
Bringing it all together, let’s now make the event JSONs self-describing, with a schema header and data body
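The example JSON from this slide is not in the transcript, so here is a hedged Python sketch of the idea. The key names `schema` and `data`, the in-memory `SCHEMA_REPO` dict and the helper functions are assumptions used for illustration; the real schema repository is a separate service.

```python
import json

URI = "iglu:com.channel2.vod/video_played/jsonschema/1-0-0"

# Stand-in for the schema repository: Iglu URI -> JSON Schema.
SCHEMA_REPO = {
    URI: {
        "type": "object",
        "properties": {"id": {"type": "string"}, "length": {"type": "number"}},
        "required": ["id", "length"],
        "additionalProperties": False,
    }
}


def describe(schema_uri, data):
    """Wrap event data in a self-describing envelope: header plus body."""
    return {"schema": schema_uri, "data": data}


def resolve(event, repo=SCHEMA_REPO):
    """Return (schema, data) for a self-describing JSON.

    The schema is looked up by the URI carried in the event itself.
    """
    if set(event) != {"schema", "data"}:
        raise ValueError("expected exactly the keys 'schema' and 'data'")
    try:
        schema = repo[event["schema"]]
    except KeyError:
        raise LookupError(f"schema not found: {event['schema']}") from None
    return schema, event["data"]


event = describe(URI, {"id": "v-1", "length": 213})
wire = json.dumps(event)                   # what would travel over the wire
schema, data = resolve(json.loads(wire))   # receiver needs no outside context
print(wire)
```

The point of the envelope is that the receiver never has to guess which schema applies: the event names its own schema, and validation can proceed from there.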
![Page 19: Big data meetup budapest adding data schemas to snowplow](https://reader034.vdocuments.site/reader034/viewer/2022052522/5549dfafb4c9051e488b4780/html5/thumbnails/19.jpg)
And for good measure, let’s add in our schema information into the JSON Schema itself
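A sketch of what this could look like, again as a Python dict since the slide image is not in the transcript. The `self` key and its four fields are assumptions chosen to mirror the four parts of the Iglu URI; `schema_uri` is a hypothetical helper.

```python
# Hypothetical self-describing JSON Schema: the schema carries its own
# vendor, name, format and version alongside the usual validation keywords.
SELF_DESCRIBING_SCHEMA = {
    "self": {
        "vendor": "com.channel2.vod",
        "name": "video_played",
        "format": "jsonschema",
        "version": "1-0-0",
    },
    "type": "object",
    "properties": {
        "id": {"type": "string"},
        "length": {"type": "number"},
    },
    "required": ["id", "length"],       # assumption, as in earlier sketches
    "additionalProperties": False,
}


def schema_uri(schema):
    """Derive the Iglu URI that events should reference from the schema."""
    return "iglu:{vendor}/{name}/{format}/{version}".format(**schema["self"])


print(schema_uri(SELF_DESCRIBING_SCHEMA))
```

With this in place a schema file can be moved or copied without losing track of which event and version it describes.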
![Page 20: Big data meetup budapest adding data schemas to snowplow](https://reader034.vdocuments.site/reader034/viewer/2022052522/5549dfafb4c9051e488b4780/html5/thumbnails/20.jpg)
Snowplow roadmap
![Page 21: Big data meetup budapest adding data schemas to snowplow](https://reader034.vdocuments.site/reader034/viewer/2022052522/5549dfafb4c9051e488b4780/html5/thumbnails/21.jpg)
Self-describing JSON Schemas are coming in the next release of Snowplow
![Page 22: Big data meetup budapest adding data schemas to snowplow](https://reader034.vdocuments.site/reader034/viewer/2022052522/5549dfafb4c9051e488b4780/html5/thumbnails/22.jpg)
We are also starting to define third-party events for Snowplow integration, starting with Zendesk customer support events
![Page 23: Big data meetup budapest adding data schemas to snowplow](https://reader034.vdocuments.site/reader034/viewer/2022052522/5549dfafb4c9051e488b4780/html5/thumbnails/23.jpg)
Questions?
http://snowplowanalytics.com
https://github.com/snowplow/snowplow
@snowplowdata
To chat – @alexcrdean on Twitter or alex@snowplowanalytics.com