Data Science at Scale: Using Apache Spark for Data Science at Bitly
TRANSCRIPT
Overview
• About me/Bitly
• Spark overview
• Using Spark for data science
• When it works, it’s great! When it doesn’t…
About me
• Data scientist at Bitly
• NYC Python/PyGotham co-organizer
• O’Reilly Media author
• @sarah_guido
About this talk
• This talk is:
– Description of my workflow
– Exploration of within-Spark tools
• This talk is not:
– In-depth exploration of algorithms
– Building new tools on top of Spark
– Any sort of ground truth for how you should be using Spark
A bit of background
• Need for big data analysis tools
• MapReduce for exploratory data analysis == painful
• Iterate/prototype quickly
• Overall goal: understand how people use not only our app, but the Internet!
What is Spark?
• Large-scale distributed data processing tool
• SQL and streaming tools
• Faster than Hadoop MapReduce
• Python API
How does Spark work?
• Partitions your data to operate over in parallel
– A partition is 64 MB by default (the HDFS block size)
• Capability to add map/reduce features
• Lazy – only executes when an action is called
– Ex. collect() or writing to a file
Why Spark?
• Fast. Really fast.
• SQL layer – kind of like Hive
• Distributed scientific tools
• Python! Sometimes.
• Cutting-edge technology
Let’s set the stage…
• Understanding user behavior
• How do I extract, explore, and model a subset of our data using Spark?
Data
{"a": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/600.4.10 (KHTML, like Gecko) Version/8.0.4 Safari/600.4.10", "c": "US", "nk": 0, "tz": "America/Los_Angeles", "g": "1HfTjh8", "h": "1HfTjh7", "u": "http://www.nytimes.com/2015/03/22/opinion/sunday/why-health-care-tech-is-still-so-bad.html?smid=tw-share", "t": 1427288425, "cy": "Seattle"}
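Each decode record is one line of JSON like the sample above. A quick sketch of pulling fields out with the standard library; the key meanings (a = user agent, c = country, tz = timezone, g/h = short-link hashes, u = long URL, t = click timestamp, cy = city) are inferred from the sample, not stated in the talk.

```python
# Parse one decode record (abbreviated from the sample above) with the
# standard library and extract a few fields.
import json
from datetime import datetime, timezone

line = (
    '{"a": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2)", "c": "US", '
    '"nk": 0, "tz": "America/Los_Angeles", "g": "1HfTjh8", "h": "1HfTjh7", '
    '"u": "http://www.nytimes.com/2015/03/22/opinion/sunday/'
    'why-health-care-tech-is-still-so-bad.html", "t": 1427288425, '
    '"cy": "Seattle"}'
)

decode = json.loads(line)
# "t" is a Unix timestamp; interpret it as UTC.
clicked_at = datetime.fromtimestamp(decode["t"], tz=timezone.utc)
print(decode["c"], decode["cy"], clicked_at.date())  # US Seattle 2015-03-25
```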
Exploratory data analysis
• Problem: what’s going on with my decodes?
• Solution: DataFrames!
– Similar to Pandas: describe, drop, fill, aggregate functions
– You can actually convert to a Pandas DataFrame!
Exploratory data analysis
• Get a sense of what’s going on in the data
• Look at distributions, frequencies
• Mostly categorical data here
Topic modeling
• Problem: we have so many links but no way to classify them into certain kinds of content
• Solution: LDA (latent Dirichlet allocation)
– Sort of – compare to other solutions
Topic modeling
• LDA in Spark
– Generative model
– Several different methods
– Term-frequency vector as input
• “Note: LDA is a new feature with some missing functionality...”
Topic modeling
• Term-frequency vectors (documents × terms):

          python  data  hot dogs  baseball  zoo
  doc_1        1     3         0         0    0
  doc_2        0     0         4         1    0
  doc_3        4     0         0         0    5
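The table above can be reproduced in plain Python; term-frequency vectors of exactly this shape are what LDA consumes. A small sketch, with the vocabulary and documents taken from the table:

```python
# Build the term-frequency vectors from the table above.
from collections import Counter

vocab = ["python", "data", "hot dogs", "baseball", "zoo"]

docs = {
    "doc_1": ["python"] + ["data"] * 3,
    "doc_2": ["hot dogs"] * 4 + ["baseball"],
    "doc_3": ["python"] * 4 + ["zoo"] * 5,
}

def tf_vector(tokens):
    """Count how often each vocabulary term occurs in one document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]

matrix = {name: tf_vector(tokens) for name, tokens in docs.items()}
print(matrix["doc_1"])  # [1, 3, 0, 0, 0]
```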
Architecture
• Right now: not in production
– Buy-in
• Streaming applications for parts of the app
• Python or Scala?
– Scala by force (LDA, GraphX have no Python API)
Some issues
• Hadoop servers
• JVM
• gzip
• Spark 1.4
• Resource allocation
• Really only got it to this stage very recently
Where to go next?
• Spark in production!
• Use for various parts of our app
• Use for R&D and prototyping purposes, with the potential to expand into the product
Current/future projects
• Trend detection
• Device prediction
• User affinities
– GraphX!
• A/B testing