Hadoop, Pig, and Python (PyData NYC 2012)

Uploaded by: mortardata

Posted on 06-May-2015

DESCRIPTION

Mortar CEO K Young's talk on using Python with Hadoop and Pig (vs. Jython and MapReduce), including NumPy, SciPy, and NLTK.

TRANSCRIPT

Page 1: Hadoop, Pig, and Python (PyData NYC 2012)

PyData NYC 2012: Hadoop, Pig, and Python

Page 2: Hadoop, Pig, and Python (PyData NYC 2012)

Overview of this session

Why Python on Hadoop?
Fast Hadoop overview
Jython
Python
MrJob
Pig
(How they work, challenges, efficiency, how to start)

Page 3: Hadoop, Pig, and Python (PyData NYC 2012)

Too much data for one machine

Data doubles every 18 months

Page 4: Hadoop, Pig, and Python (PyData NYC 2012)

ETL / Munging

Cleanse
Format
Simple calculations

Page 5: Hadoop, Pig, and Python (PyData NYC 2012)

Social Graph

Page 6: Hadoop, Pig, and Python (PyData NYC 2012)

Predict

Page 7: Hadoop, Pig, and Python (PyData NYC 2012)

Detect

Page 8: Hadoop, Pig, and Python (PyData NYC 2012)

Genetics

Page 10: Hadoop, Pig, and Python (PyData NYC 2012)

Hadoop: rapid overview

Page 11: Hadoop, Pig, and Python (PyData NYC 2012)

Hadoop: rapid overview

Hadoop implements MapReduce (Java) (Doug Cutting)
Incubated at Yahoo
Indexing, spam detection, more
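
The MapReduce model that Hadoop implements can be illustrated without a cluster. The sketch below is plain Python, not Hadoop code: it runs the map, shuffle, and reduce phases in one process for a word count. The function names (map_phase, shuffle, reduce_phase, word_count) are made up for this illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence, in a single process."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["pig and python", "hadoop and pig"]))
```

On a real cluster the map and reduce calls run on many machines and the shuffle moves data over the network; only the shape of the computation is the same.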

Page 12: Hadoop, Pig, and Python (PyData NYC 2012)

Hadoop: problems

Difficult
Not much Python
Batch only (...or it was)

Page 13: Hadoop, Pig, and Python (PyData NYC 2012)

Hadoop: future

YARN
MapReduce optional
Generic management + distributed apps
Impala

Page 14: Hadoop, Pig, and Python (PyData NYC 2012)

Hadoop and Python

Page 15: Hadoop, Pig, and Python (PyData NYC 2012)

Jython on Hadoop (map)

Page 16: Hadoop, Pig, and Python (PyData NYC 2012)

Jython on Hadoop (reduce; 1st half)

Page 17: Hadoop, Pig, and Python (PyData NYC 2012)

Jython on Hadoop (reduce; 2nd half)

Page 18: Hadoop, Pig, and Python (PyData NYC 2012)

Jython on Hadoop

Page 19: Hadoop, Pig, and Python (PyData NYC 2012)

Python on Hadoop

Streaming

(Works with any language, not just
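
As an illustration of the streaming approach, here is a minimal word-count sketch, assuming the usual Hadoop Streaming contract: the mapper and reducer exchange tab-separated text lines over stdin and stdout, and the reducer receives its input sorted by key. The function names are hypothetical, and the functions take any iterable of lines so they can be tried without a cluster.

```python
from itertools import groupby


def mapper(lines):
    """Emit one tab-separated 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(lines):
    """Sum the counts for each word; the input must already be sorted by key."""
    records = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(records, key=lambda record: record[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# In a real streaming script, each step would loop over sys.stdin and print
# every record it yields, for example:
#     for record in mapper(sys.stdin):
#         print(record)
```

Locally, the sort that Hadoop performs between the two steps can be imitated with sorted(), e.g. list(reducer(sorted(mapper(lines)))).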

Page 20: Hadoop, Pig, and Python (PyData NYC 2012)

MrJob (Python) on Hadoop

Streaming + local / EMR / your Hadoop

Page 21: Hadoop, Pig, and Python (PyData NYC 2012)

MrJob (Python) on Hadoop

Multi-step jobs

Page 22: Hadoop, Pig, and Python (PyData NYC 2012)

Pig on Hadoop

Less code
Expressive code

Page 23: Hadoop, Pig, and Python (PyData NYC 2012)

Pig: brief, expressive

(thanks: twitter hadoop world presentation)

Page 24: Hadoop, Pig, and Python (PyData NYC 2012)

FOR SERIOUS: The Same Script, In

Page 25: Hadoop, Pig, and Python (PyData NYC 2012)

Pig on Hadoop

Less code
Expressive code
Compiles to MR
Insulates from API
Popular (LinkedIn, Twitter, Salesforce, Yahoo, Stanford

Page 26: Hadoop, Pig, and Python (PyData NYC 2012)

Pig on Hadoop

Works with Jython
Not Python
Stream, no types
UDF read stdin
UDF deserialize, no types
Serialize for Pig
Write to stdout
Exceptions

Page 27: Hadoop, Pig, and Python (PyData NYC 2012)

Pig + Python on Hadoop

Page 28: Hadoop, Pig, and Python (PyData NYC 2012)
Page 29: Hadoop, Pig, and Python (PyData NYC 2012)
Page 30: Hadoop, Pig, and Python (PyData NYC 2012)

Hadoop + Python: not actually magic

Hadoop won’t magically parallelize your algorithm

Page 31: Hadoop, Pig, and Python (PyData NYC 2012)

Hadoop + Python: efficiency

Don’t stream Java-based languages
• Jython
• Pig + Jython

Streaming has ~30% overhead
• Python
• MrJob
• Pig + Python

Page 32: Hadoop, Pig, and Python (PyData NYC 2012)

Hadoop + Python: excited?

Well... 90-95% of time isn’t spent on algos

Page 33: Hadoop, Pig, and Python (PyData NYC 2012)

Hadoop + Python, hard stuff: setup

Get Hadoop running
Software where it needs to be
Processes communicating
Data available

Page 34: Hadoop, Pig, and Python (PyData NYC 2012)

Hadoop + Python, hard stuff: develop

Learn
Project structure, modularity
Dev environment like Production

Page 35: Hadoop, Pig, and Python (PyData NYC 2012)

Hadoop + Python, hard stuff: validate

Syntax check
Packages available
Data readable
Data writable
Without long waits for failure
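
The validation checks on this slide can be approximated before a job is submitted. The sketch below uses only the standard library and assumes local file paths; data on HDFS or S3 would need a different check. All function names are hypothetical.

```python
import ast
import importlib.util
import os


def check_syntax(source):
    """Return None if the source parses as Python, else a short message."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return f"line {exc.lineno}: {exc.msg}"
    return None


def missing_packages(names):
    """Return the top-level package names that cannot be imported here."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def check_paths(readable=(), writable=()):
    """Return a list of problems with local input and output paths."""
    problems = []
    for path in readable:
        if not os.access(path, os.R_OK):
            problems.append(f"cannot read {path}")
    for path in writable:
        if not os.access(path, os.W_OK):
            problems.append(f"cannot write {path}")
    return problems
```

Running checks like these on the submitting machine takes well under a second, which is the point of the last item on the slide: the failure shows up before the cluster is involved.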

Page 36: Hadoop, Pig, and Python (PyData NYC 2012)

Hadoop + Python, hard stuff: debug

Distributed execution is hard to debug
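
One common way to get visibility into a streaming task is to write to stderr, since stdout carries the job's data. The sketch below assumes Hadoop Streaming, where a stderr line of the form reporter:counter:group,counter,amount increments a job counter. The helper names, the counter names, and the two-field record layout are made up for the example.

```python
import sys


def debug(message, stream=sys.stderr):
    """printf-style debugging: stderr ends up in the task logs."""
    stream.write(f"DEBUG {message}\n")


def increment_counter(group, counter, amount=1, stream=sys.stderr):
    """Emit a Hadoop Streaming counter update on stderr."""
    stream.write(f"reporter:counter:{group},{counter},{amount}\n")


def parse_lines(lines, stream=sys.stderr):
    """Yield (key, value) pairs, logging and counting malformed records."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            debug(f"unexpected field count in {line!r}", stream)
            increment_counter("parse", "bad_records", stream=stream)
            continue
        yield tuple(fields)
```

Counters are aggregated across all tasks by the framework, so they give a job-wide total without having to read each task's log.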

Page 37: Hadoop, Pig, and Python (PyData NYC 2012)

Hadoop + Python, hard stuff: test

Data processing is hard to test
But critical
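
One way to make data processing testable is to keep each transformation in a plain function and unit test it apart from Hadoop. The sketch below uses the standard unittest module; clean_record is a hypothetical cleansing step invented for the example.

```python
import unittest


def clean_record(line):
    """Hypothetical cleansing step: trim and lowercase the key, cast the value."""
    key, value = line.strip().split("\t")
    return key.strip().lower(), int(value)


class CleanRecordTest(unittest.TestCase):
    def test_normalizes_key_and_casts_value(self):
        self.assertEqual(clean_record("  Pig \t3\n"), ("pig", 3))

    def test_rejects_non_numeric_value(self):
        with self.assertRaises(ValueError):
            clean_record("pig\tmany")
```

Tests like these run with "python -m unittest" on a laptop, with no cluster needed.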

Page 38: Hadoop, Pig, and Python (PyData NYC 2012)

Hadoop + Python, hard stuff: deploy

Environments identical
Code correctly deployed
Configuration changes
Non-disruptive

Page 39: Hadoop, Pig, and Python (PyData NYC 2012)

Hadoop + Python, hard stuff: history

Stats about prior runs
What code was run?
What’s changed?
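
A lightweight way to answer these questions is to store a small record for every run. The sketch below is an assumption about how that could look, not a description of any particular tool: it fingerprints the script source with SHA-256 and appends one JSON line per run to a local file. All names and fields are hypothetical.

```python
import hashlib
import json
import time


def run_record(script_source, params, stats):
    """Build a record describing one run: code fingerprint, inputs, stats."""
    return {
        "code_sha256": hashlib.sha256(script_source.encode("utf-8")).hexdigest(),
        "params": params,
        "stats": stats,
        "recorded_at": time.time(),
    }


def code_changed(previous, current):
    """True if the two runs used different script source."""
    return previous["code_sha256"] != current["code_sha256"]


def append_history(path, record):
    """Append the record as one JSON line to a local history file."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, sort_keys=True) + "\n")
```

Comparing the fingerprints of two records shows whether the code changed between runs; comparing their params and stats shows what else did.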

Page 40: Hadoop, Pig, and Python (PyData NYC 2012)

Hadoop + Python, hard stuff: logs

Distributed logs hard to make sense of
Hadoop logs hard to understand
Ephemeral clusters lose logs

Page 41: Hadoop, Pig, and Python (PyData NYC 2012)

Hadoop + Python, hard stuff: Mortar’s approach

Setup: PaaS, pip installation, connectors
Develop: learning, structure, instant dev env
Validate: fast validate
Debug: printf, more coming
Test: Rails-like test suites
Deploy: one-button deploy

Page 42: Hadoop, Pig, and Python (PyData NYC 2012)

K Young

@kky