Data Science at Scale by Sarah Guido

Post on 16-Apr-2017


Data Science at Scale: Using Apache Spark for Data Science at Bitly

Sarah Guido
Spark Summit Europe 2015

Overview

• About me/Bitly
• Spark overview
• Using Spark for data science
• When it works, it’s great! When it works…

About me

• Data Scientist at Bitly
• NYC Python/PyGotham co-organizer
• O’Reilly Media author
• @sarah_guido

About this talk

• This talk is:
  – Description of my workflow
  – Exploration of within-Spark tools
• This talk is not:
  – In-depth exploration of algorithms
  – Building new tools on top of Spark
  – Any sort of ground truth for how you should be using Spark

A bit of background

• Need for big data analysis tools
• MapReduce for exploratory data analysis ==
• Iterate/prototype quickly
• Overall goal: understand how people use not only our app, but the Internet!

Bitly data!

• Legit big data
• 1 hour of decodes is 10 GB
• 1 day is 240 GB
• 1 month is ~7 TB
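The three figures above follow from one another. A quick sanity check, assuming a 30-day month and decimal units (1 TB = 1000 GB), neither of which is stated on the slide:

```python
# Back-of-the-envelope check of the data volumes quoted on the slide.
GB_PER_HOUR = 10  # from the slide: 1 hour of decodes is 10 GB

gb_per_day = GB_PER_HOUR * 24        # 24 hours in a day
gb_per_month = gb_per_day * 30       # assumption: 30-day month
tb_per_month = gb_per_month / 1000   # assumption: 1 TB = 1000 GB

print(f"{gb_per_day} GB/day, {tb_per_month} TB/month")
```

The result lands close to the "~7 TB" on the slide.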

Why Spark?

• Fast. Really fast.
• Distributed scientific tools
• Python! (Sometimes.)
• Cutting edge technology
• AWS/EMR/S3

Setting up the workflow

• Spark journey
  – Hadoop server: Spark 1.2, Python
  – EMR: Spark 1.3, Python
  – EMR: Spark 1.4, Python/Scala
  – EMR: Spark 1.5, Scala

Let’s set the stage…

• Understanding user behavior
• How do I extract, explore, and model a subset of our data using Spark?

Data{"a": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/600.4.10 (KHTML, like Gecko) Version/8.0.4 Safari/600.4.10", "c": "US", "nk": 0, "tz": "America/Los_Angeles", "g": "1HfTjh8", "h": "1HfTjh7", "u": "http://www.nytimes.com/2015/03/22/opinion/sunday/why-health-care-tech-is-still-so-bad.html?smid=tw-share", "t": 1427288425, "cy": "Seattle"}

Data processing

• Problem: I want to retrieve NYT decodes
• Solution: well, there are two…
• Spark 1.3
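Neither solution's code appears in this transcript. As a rough illustration only (not the code from the talk), the filter itself can be sketched in plain Python. It assumes that the "u" field of a decode record holds the destination URL, as the sample record suggests; the name is_nyt_decode and the second record are made up for the example.

```python
import json
from urllib.parse import urlparse

def is_nyt_decode(line):
    """Return True if a JSON decode record points at nytimes.com."""
    try:
        record = json.loads(line)
    except ValueError:
        return False  # skip malformed lines
    if not isinstance(record, dict):
        return False
    url = record.get("u")  # assumption: "u" is the destination URL
    if not isinstance(url, str):
        return False
    host = urlparse(url).hostname or ""
    return host == "nytimes.com" or host.endswith(".nytimes.com")

# A trimmed copy of the sample record, plus a made-up non-NYT record.
sample = ('{"c": "US", "g": "1HfTjh8", "u": "http://www.nytimes.com/2015/03/22/'
          'opinion/sunday/why-health-care-tech-is-still-so-bad.html?smid=tw-share", '
          '"t": 1427288425, "cy": "Seattle"}')
other = '{"c": "US", "u": "http://example.com/page", "t": 1427288425}'

nyt_decodes = [line for line in (sample, other) if is_nyt_decode(line)]
print(len(nyt_decodes))
```

In PySpark, a predicate like this could be handed to RDD.filter, for example sc.textFile(path).filter(is_nyt_decode), where path is a placeholder for wherever the decode logs live.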

Data processing

• SparkSQL: 8 minutes
• Pure Spark: 4 minutes!!!

Topic modeling

• Problem: we have so many links but no way to classify them into certain kinds of content

• Solution: LDA (latent Dirichlet allocation)
  – Sort of – compare to other solutions

• Spark 1.4

Topic modeling

• LDA in Spark
  – Generative model
  – Several different methods
  – Term frequency vector as input

• “Note: LDA is still an experimental feature under active development...”
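To make "generative model" concrete: LDA assumes each document is produced by first drawing a mixture over topics, then drawing every word from one of those topics. The sketch below illustrates that generative story in plain Python; it is not Spark's implementation and performs no inference. The two topics, their word probabilities, and the alpha values are invented for the example.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word_probs, vocab, alpha, n_words, rng):
    """Generate one document following the LDA generative story."""
    theta = sample_dirichlet(alpha, rng)  # per-document topic mixture
    topic_ids = list(range(len(topic_word_probs)))
    words = []
    for _ in range(n_words):
        z = rng.choices(topic_ids, weights=theta)[0]            # pick a topic
        w = rng.choices(vocab, weights=topic_word_probs[z])[0]  # pick a word from it
        words.append(w)
    return theta, words

vocab = ["python", "data", "hot dogs", "baseball", "zoo"]
# Hypothetical topics: one leaning technical, one leaning toward a day out.
topic_word_probs = [
    [0.5, 0.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.4, 0.3, 0.3],
]
rng = random.Random(0)
theta, words = generate_document(topic_word_probs, vocab, [0.5, 0.5], 8, rng)
print(theta, words)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topics and each document's mixture.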

Topic modeling

• Term frequency vector (rows are documents, columns are terms)

DOCUMENT  python  data  hot dogs  baseball  zoo
doc_1     1       3     0         0         0
doc_2     0       0     4         1         0
doc_3     4       0     0         0         5
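The table can be reproduced with a few lines of plain Python. This is a minimal sketch, assuming the documents are already tokenized (with "hot dogs" kept as a single term); the token lists are invented so that the counts match the slide.

```python
from collections import Counter

# Fixed vocabulary order: position i in every vector counts vocabulary[i].
vocabulary = ["python", "data", "hot dogs", "baseball", "zoo"]

# Hypothetical pre-tokenized documents chosen to match the table.
documents = {
    "doc_1": ["python"] + ["data"] * 3,
    "doc_2": ["hot dogs"] * 4 + ["baseball"],
    "doc_3": ["python"] * 4 + ["zoo"] * 5,
}

def term_frequency_vector(tokens, vocabulary):
    """Count how often each vocabulary term occurs in one document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

tf_vectors = {
    name: term_frequency_vector(tokens, vocabulary)
    for name, tokens in documents.items()
}

for name, vector in tf_vectors.items():
    print(name, vector)
```

Each row is one document's input to LDA. A real corpus has a far larger vocabulary, so most entries are zero and a sparse representation is usually preferable.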

Trend Detection

• Tell our clients when a particular piece of content is trending

• Transition to Scala
• Workflow improvement
• EMR + Spark 1.5 + Jupyter + Scala!
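The transcript does not say how trending content was detected. Purely as an illustration of what "trending" could mean, and not the method used at Bitly, the sketch below flags a link when its latest hourly click count sits far above its own recent history. The function name, the z-score rule, the threshold, and the counts are all assumptions.

```python
from statistics import mean, pstdev

def is_trending(hourly_counts, threshold=3.0):
    """Flag the latest hour if it is far above the preceding hours.

    hourly_counts: click counts per hour, oldest first; the last entry
    is the hour being tested. threshold is a z-score cutoff (assumed).
    """
    *history, latest = hourly_counts
    if len(history) < 2:
        return False  # not enough history to judge
    baseline = mean(history)
    spread = pstdev(history)
    if spread == 0:
        return latest > baseline  # flat history: any increase stands out
    return (latest - baseline) / spread > threshold

# Made-up counts: a steady link versus one that suddenly spikes.
print(is_trending([10, 12, 11, 9, 10, 11]))
print(is_trending([10, 12, 11, 9, 10, 60]))
```

At scale, the same per-link test would run across many links at once, which is where a distributed engine comes in.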

Architecture

• Right now: not in production
  – Buy-in
• Streaming applications for parts of the app
• Python or Scala?
  – Scala by force

Some issues

• Hadoop servers
• JVM
• gzip
• 1.4/resource allocation/EMR
• Lack of documentation

Where to go next?

• Spark in production!
• Use for various parts of our app
• Use for R&D and prototyping purposes, with the potential to expand into the product

Resources/Source Material

• spark.apache.org - documentation
• Databricks blog
• Cloudera blog
• Other Spark users!

Thanks!!

@sarah_guido
