How LinkedIn Uses Scalding for Data-Driven Product Development
Slides from the Cascading meetup, May 29, 2014: http://www.meetup.com/cascading/events/177491292/
Using Scalding for Data-Driven Product Development
Sasha Ovsankin, LinkedIn
http://linkedin.com/in/sashao
• Studied Mathematical Physics at Moscow University
• Software Engineering background
• Work at LinkedIn on Email Experience
• Publish open source at https://github.com/SashaOv
• Publish music at SoundCloud
/home
Scalding is a must-have tool in your arsenal of Hadoop development.
– Hadoop ecosystem at LinkedIn
– Hadoop development tools
– Scalding: why and how
– What we do with Scalding, code examples.
/linkedin/hadoop/overview
[Diagram components: Online Apps, Databases, NoSQL Data Stores, ETL, Hadoop, HDFS, Hadoop Flows, Tracking/logging, Analytics, Data Products, Messaging, Message delivery]
/linkedin/hadoop/practices
• All online data end up in HDFS
– Mostly encoded in Avro
• Production Process
– CI/Automatic Build
• More info forthcoming
– Production Review
– Operations and Monitoring
• More info at http://lnkd.in/gridops2013
• Result: Thousands of jobs running in production
• More info at http://lnkd.in/big-data-ecosystem
/linkedin/hadoop/dev-tools
• PIG
• Java MR
• Scalding
• + many others, will not talk about them today
/hadoop/dev-tools/PIG
• Relatively mature tool
– first official release 2008
• Easy to learn
• Availability of experienced people
• Extendable via UDF
/hadoop/dev-tools/Java
• Java MR
– Maximum flexibility with Hadoop API
– Verbose
• Cascading
– Retain (some) Java flexibility
– Less verbose
/hadoop/dev-tools/Scalding
http://github.com/twitter/scalding
• Scala-based DSL
• Built on Cascading, a stable and mature framework
• Uses API similar to Scala collections:
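    import com.twitter.scalding._

    // Counts how often each whitespace-separated word occurs in the input.
    class WordCountJob(args : Args) extends Job(args) {
      TextLine( args("input") )
        .flatMap('line -> 'word) { line : String => line.split("""\s+""") }
        .groupBy('word) { _.size }
        .write( Tsv( args("output") ) )
    }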
• Succinct and powerful
• High level of abstraction
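The claim that Scalding's API resembles Scala collections can be illustrated with the same word count written against ordinary in-memory collections. This is only a sketch for comparison, not code from the talk: the object name `WordCountSketch` and the sample lines are assumptions, and nothing here touches Hadoop or Scalding.

```scala
// Plain Scala collections analogue of the Scalding word count (no Hadoop involved).
object WordCountSketch {
  // Split each line on whitespace, group equal words, and count each group.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("""\s+"""))  // plays the role of flatMap('line -> 'word)
      .filter(_.nonEmpty)           // drop empty tokens produced by leading whitespace
      .groupBy(identity)            // plays the role of groupBy('word)
      .map { case (word, occurrences) => word -> occurrences.size } // plays the role of _.size

  def main(args: Array[String]): Unit = {
    val counts = wordCount(Seq("to be or not to be", "to do"))
    // Print one "word<TAB>count" pair per line, sorted by word.
    counts.toSeq.sortBy(_._1).foreach { case (w, n) => println(s"$w\t$n") }
  }
}
```

The shape of the pipeline (flatMap, then groupBy, then a per-group size) is the same in both versions; the Scalding job differs mainly in naming fields with symbols and in reading from and writing to Hadoop sources.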
…/tools/comparison
                                  | PIG               | Java/Scala
Debugging: stack traces           | No*               | Yes
Code reuse                        | Macros, jobs      | Classes, packages, modules, frameworks…
Custom data structures/algorithms | UDF               | Native
Packaging                         | Fat jars          | Thin jars
Avro support                      | Partial           | Native
Unit testing                      | PigUnit (in Java) | Standard unit testing frameworks: JUnit/TestNG/MRUnit, Scalding tests

          | PIG    | Java MR | Scalding
LOC count | Small* | Large   | Small
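One reading of the unit-testing row is that logic written as ordinary Scala functions can be checked without a cluster. The sketch below assumes a hypothetical helper, `tokenize`, that is not from the talk, and uses plain `assert` calls as a stand-in for a real test framework such as JUnit or Scalding's own test support.

```scala
// Hypothetical example: a pure helper of the kind a job might call, tested locally.
object TokenizerSketch {
  // Lower-case a line and split it into words on runs of non-letter characters.
  def tokenize(line: String): Seq[String] =
    line.toLowerCase.split("""[^a-z]+""").filter(_.nonEmpty).toSeq

  def main(args: Array[String]): Unit = {
    // Punctuation and case are normalized away.
    assert(tokenize("Hello, Hadoop world!") == Seq("hello", "hadoop", "world"))
    // A line with no letters yields no tokens.
    assert(tokenize("   ").isEmpty)
    println("all checks passed")
  }
}
```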
…/tools/buyers-guide
If you need…                                                                      | Then use…
Quick-and-dirty simple scripts, existing UDFs                                     | PIG, Hive
Complex flows, full access to Avro, debugging, unit testing, productization       | Scalding
Full flexibility of Hadoop API but not too complex processing                     | Java MR
/linkedin/email-experience
• Goal
– Improve messaging users’ experience
• Plan
– Track
– Experiment
– Optimize
– Personalize
• Implementation
– Generate messages offline
– Apply sophisticated relevance algorithms
– Shorten the release cycle to facilitate fast iteration
/linkedin/email-experience/overview
[Diagram components: Content sources (PIG), Content sources (Scalding), Content sources (Crunch), HDFS, Targeting, Relevance (Scalding, Java), Email/Message production (Java MR), Framework (Java), Online Delivery System]
…/email-experience/why-scalding
• Scala + Map Reduce = match made in heaven
    scala> (1 to 1000) map { pow(_,2) } reduce { _ + _ }
    res20: Int = 333833500
• Stack traces (yeah!)
• Native Avro support
• Integrates well with CI/build system
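The map/reduce one-liner on this slide can be reproduced as a small standalone program. The sketch below is an assumption-laden variant rather than the slide's exact code: it squares with integer multiplication instead of `pow` (if `pow` on the slide is `math.pow`, a stock REPL would report a `Double`), and it adds the closed form n(n+1)(2n+1)/6 purely as a cross-check. The names `SumOfSquaresSketch`, `sumOfSquares`, and `closedForm` are illustrative.

```scala
// Sum of squares 1^2 + 2^2 + ... + n^2 as a map step followed by a reduce step.
object SumOfSquaresSketch {
  // Map: square each number. Reduce: add the squares. Requires n >= 1,
  // because reduce is undefined on an empty range.
  def sumOfSquares(n: Int): Long =
    (1 to n).map(i => i.toLong * i).reduce(_ + _)

  // Closed form n(n+1)(2n+1)/6, used only to cross-check the result.
  def closedForm(n: Int): Long = {
    val m = n.toLong
    m * (m + 1) * (2 * m + 1) / 6
  }

  def main(args: Array[String]): Unit = {
    println(sumOfSquares(1000))                      // prints 333833500
    println(sumOfSquares(1000) == closedForm(1000))  // prints true
  }
}
```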
…/email-experience/code
…/email-experience/code/2
/linkedin/…/scalding/status
• Started >1 year ago
• Thousands of production LOC written in Scalding by our team
– Pretty happy with readability and maintainability
• ~10 flows are currently in production, and counting
• Currently ~12 people are coding in Scalding
• Created Scalding user group
• Growing interest
• Learning:
– Scala[Scalding] < Scala[ _ ]
/linkedin/…/scalding/users
• Data science
• Enterprise services
• Email experience
• Content
/linkedin/…/scalding/what-to-improve
• Better Scala language IDE tools
• One-click development (-> demo)
• Monitoring and troubleshooting
– Counters – implemented in 0.9
– Better troubleshooting of the ser/de process
• Better tools for tuning of jobs
– setting # of mappers and reducers
• Best practices
/linkedin/join-us
• Work on unique and interesting problems
• Be part of a great engineering community
• Use the latest tools and technologies
• Help connect the world’s professionals to help them become more productive and successful
• We are looking for amazing people interested in Data Science and Software Engineering
Questions?