
SURVIVING HADOOP ON AWS IN PRODUCTION

Upload: soren-macbeth

Post on 23-Jun-2015


TRANSCRIPT

Page 1: Surviving Hadoop on AWS

SURVIVING HADOOP ON AWS IN PRODUCTION

Page 2: Surviving Hadoop on AWS

DISCLAIMER: I AM A BAD PERSON.

Page 3: Surviving Hadoop on AWS

ABOUT ME

Chief Data Scientist at Yieldbot, Co-Founder at StockTwits. @sorenmacbeth

Page 4: Surviving Hadoop on AWS

YIELDBOT

“Yieldbot's technology creates a marketplace where search advertisers buy real-time consumer intent on premium publishers.”

Page 5: Surviving Hadoop on AWS

WHERE WE ARE TODAY

MapR M3 on EMR

All data read from and written to S3

Page 6: Surviving Hadoop on AWS

CLOJURE FOR DATA PROCESSING

All of our MapReduce jobs are written in Cascalog.

This gives us speed, flexibility, and testability.

More importantly, Clojure and Cascalog are fun to write.

Page 7: Surviving Hadoop on AWS

CASCALOG EXAMPLE

(ns lucene-cascalog.core
  (:gen-class)
  (:use cascalog.api)
  (:import org.apache.lucene.analysis.standard.StandardAnalyzer
           org.apache.lucene.analysis.TokenStream
           org.apache.lucene.util.Version
           org.apache.lucene.analysis.tokenattributes.TermAttribute))

(defn tokenizer-seq
  "Build a lazy-seq out of a tokenizer with TermAttribute"
  [^TokenStream tokenizer ^TermAttribute term-att]
  (lazy-seq
    (when (.incrementToken tokenizer)
      (cons (.term term-att)
            (tokenizer-seq tokenizer term-att)))))
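A hedged sketch, not from the original slides, of how `tokenizer-seq` could be wired into an actual Cascalog query. It assumes the ns form shown above plus `(:require [cascalog.ops :as c])`; the S3 bucket paths are hypothetical placeholders.

```clojure
;; Custom operation: emit one output tuple per token in a line of text.
;; (Lucene 3.x API, matching the imports in the ns form above.)
(defmapcatop tokenize
  [^String text]
  (let [analyzer  (StandardAnalyzer. Version/LUCENE_30)
        tokenizer (.tokenStream analyzer "contents"
                                (java.io.StringReader. text))
        term-att  (.addAttribute tokenizer TermAttribute)]
    (tokenizer-seq tokenizer term-att)))

;; Token counts over text lines read from S3, written back to S3
;; (bucket and paths are made up for the example):
(?- (hfs-seqfile "s3n://my-bucket/token-counts/")
    (<- [?token ?count]
        ((hfs-textline "s3n://my-bucket/raw-text/") ?text)
        (tokenize ?text :> ?token)
        (c/count ?count)))
```

The `?-` form executes the query as a MapReduce job; swapping the taps for local files is what makes these queries testable from a REPL.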

Page 8: Surviving Hadoop on AWS

HADOOP IS COMPLEX

Page 9: Surviving Hadoop on AWS

“Fact: There are more Hadoop configuration options than

there are stars in our galaxy.”

Page 10: Surviving Hadoop on AWS

EVEN IN THE BEST-CASE SCENARIO, IT TAKES A LOT OF TUNING TO GET A HADOOP CLUSTER RUNNING WELL.

There are large companies that make money solely by configuring and supporting Hadoop clusters for enterprise customers.

Page 11: Surviving Hadoop on AWS

RUNNING HADOOP ON AWS

Page 12: Surviving Hadoop on AWS

SO WHY RUN ON AWS?

$$$

Page 13: Surviving Hadoop on AWS

HADOOP ON AWS: A PERSONAL HISTORY

Page 14: Surviving Hadoop on AWS

PIG AND ELASTIC MAPREDUCE

Slow development cycle; writing Java sucks.

Page 15: Surviving Hadoop on AWS

CASCALOG AND ELASTIC MAPREDUCE

Learning Emacs, Clojure, and Cascalog was hard, but was worth it.

The way our jobs were designed sucked and didn't work well with Elastic MapReduce.

Page 16: Surviving Hadoop on AWS

CASCALOG AND A SELF-MANAGED HADOOP CLUSTER

We used a hacked-up version of a Cloudera Python script to launch and bootstrap a cluster.

We ran on spot instances.

Cluster boot-up time SUCKED and often failed. We paid for instances during bootstrap and configuration.

Our jobs weren't designed to tolerate things like spot instances going away in the middle of a job.

Drinking heavily dulled the pain a little.

Page 17: Surviving Hadoop on AWS

CASCALOG AND ELASTIC MAPREDUCE, AGAIN

Rebuilt the data processing pipeline from scratch (only took nine months!)

Data pipelines were broken out into a handful of fault-tolerant jobflow steps; each step writes its output to S3.

EMR supported spot instances at this point.

Page 18: Surviving Hadoop on AWS

WEIRD BUGS THAT WE'VE HIT

Bootstrap script errors

Random cluster fuckedupedness

AMI version changes

Vendor issues

My personal favourite: invisible S3 write failures.

Page 19: Surviving Hadoop on AWS

IF YOU MUST RUN ON AWS

Break your processing pipelines into stages; write out to S3 after each stage.

Bake a lot of variability into your expected jobflow run times.

Compress the data you are reading from and writing to S3 as much as possible.

Drinking helps.
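The staging advice can be sketched in Cascalog terms. This is a hedged illustration only: the bucket, paths, and query functions (`parse-logs-query`, `aggregate-query`) are hypothetical names, not from the deck.

```clojure
;; Each stage reads its input from S3 and writes its output back to S3,
;; so a failed or killed jobflow can be resumed from the last stage
;; that completed instead of re-running the whole pipeline.
(defn stage-1 []
  (?- (hfs-seqfile "s3n://my-bucket/pipeline/parsed/")
      (parse-logs-query (hfs-textline "s3n://my-bucket/pipeline/raw/"))))

(defn stage-2 []
  (?- (hfs-seqfile "s3n://my-bucket/pipeline/aggregated/")
      (aggregate-query (hfs-seqfile "s3n://my-bucket/pipeline/parsed/"))))
```

Running each stage as its own EMR jobflow step keeps the failure domain small, and sequence-file output compresses well, which also helps with the S3 transfer advice above.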

Page 20: Surviving Hadoop on AWS

QUESTIONS?

Page 21: Surviving Hadoop on AWS

YIELDBOT IS HIRING!

http://yieldbot.com/jobs