
SURVIVING HADOOP ON AWS IN PRODUCTION

Upload: soren-macbeth

Post on 23-Jun-2015


TRANSCRIPT

Page 1: Surviving Hadoop on AWS

SURVIVING HADOOP ON AWS IN PRODUCTION

Page 2: Surviving Hadoop on AWS

DISCLAIMER: I AM A BAD PERSON.

Page 3: Surviving Hadoop on AWS

ABOUT ME

Chief Data Scientist at Yieldbot, Co-Founder at StockTwits. @sorenmacbeth

Page 4: Surviving Hadoop on AWS

YIELDBOT

“Yieldbot's technology creates a marketplace where search advertisers buy real-time consumer intent on premium publishers.”

Page 5: Surviving Hadoop on AWS

WHERE WE ARE TODAY

MapR M3 on EMR

All data read from and written to S3

Page 6: Surviving Hadoop on AWS

CLOJURE FOR DATA PROCESSING

All of our MapReduce jobs are written in Cascalog.

This gives us speed, flexibility, and testability.

More importantly, Clojure and Cascalog are fun to write.

Page 7: Surviving Hadoop on AWS

CASCALOG EXAMPLE

(ns lucene-cascalog.core
  (:gen-class)
  (:use cascalog.api)
  (:import org.apache.lucene.analysis.standard.StandardAnalyzer
           org.apache.lucene.analysis.TokenStream
           org.apache.lucene.util.Version
           org.apache.lucene.analysis.tokenattributes.TermAttribute))

(defn tokenizer-seq
  "Build a lazy-seq out of a tokenizer with TermAttribute"
  [^TokenStream tokenizer ^TermAttribute term-att]
  (lazy-seq
    (when (.incrementToken tokenizer)
      (cons (.term term-att)
            (tokenizer-seq tokenizer term-att)))))
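A hedged sketch, not from the original slides, of how `tokenizer-seq` could be wired into an actual Cascalog query. It assumes the ns form shown above plus `(:require [cascalog.ops :as c])`; the S3 bucket paths are hypothetical placeholders.

```clojure
;; Custom operation: emit one output tuple per token in a line of text.
;; (Lucene 3.x API, matching the imports in the ns form above.)
(defmapcatop tokenize
  [^String text]
  (let [analyzer  (StandardAnalyzer. Version/LUCENE_30)
        tokenizer (.tokenStream analyzer "contents"
                                (java.io.StringReader. text))
        term-att  (.addAttribute tokenizer TermAttribute)]
    (tokenizer-seq tokenizer term-att)))

;; Token counts over text lines read from S3, written back to S3
;; (bucket and paths are made up for the example):
(?- (hfs-seqfile "s3n://my-bucket/token-counts/")
    (<- [?token ?count]
        ((hfs-textline "s3n://my-bucket/raw-text/") ?text)
        (tokenize ?text :> ?token)
        (c/count ?count)))
```

The `?-` form executes the query as a MapReduce job; swapping the taps for local files is what makes these queries testable from a REPL.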

Page 8: Surviving Hadoop on AWS

HADOOP IS COMPLEX

Page 9: Surviving Hadoop on AWS

“Fact: There are more Hadoop configuration options than

there are stars in our galaxy.”

Page 10: Surviving Hadoop on AWS

EVEN IN THE BEST-CASE SCENARIO, IT TAKES A LOT OF TUNING TO GET A HADOOP CLUSTER RUNNING WELL.

There are large companies that make money solely by configuring and supporting Hadoop clusters for enterprise customers.

Page 11: Surviving Hadoop on AWS

RUNNING HADOOP ON AWS

Page 12: Surviving Hadoop on AWS

SO WHY RUN ON AWS?

$$$

Page 13: Surviving Hadoop on AWS

HADOOP ON AWS: A PERSONAL HISTORY

Page 14: Surviving Hadoop on AWS

PIG AND ELASTIC MAPREDUCE

Slow development cycle; writing Java sucks.

Page 15: Surviving Hadoop on AWS

CASCALOG AND ELASTIC MAPREDUCE

Learning Emacs, Clojure, and Cascalog was hard, but was worth it.

The way our jobs were designed sucked and didn't work well with Elastic MapReduce.

Page 16: Surviving Hadoop on AWS

CASCALOG AND A SELF-MANAGED HADOOP CLUSTER

We used a hacked-up version of a Cloudera Python script to launch and bootstrap a cluster.

We ran on spot instances.

Cluster boot-up time SUCKED and often failed. We paid for instances during bootstrap and configuration.

Our jobs weren't designed to tolerate things like spot instances going away in the middle of a job.

Drinking heavily dulled the pain a little.

Page 17: Surviving Hadoop on AWS

CASCALOG AND ELASTIC MAPREDUCE, AGAIN

Rebuilt the data processing pipeline from scratch (only took nine months!)

Data pipelines were broken out into a handful of fault-tolerant jobflow steps; each step writes its output to S3.

EMR supported spot instances at this point.

Page 18: Surviving Hadoop on AWS

WEIRD BUGS THAT WE'VE HIT

Bootstrap script errors

Random cluster fuckedupedness

AMI version changes

Vendor issues

My personal favourite: invisible S3 write failures.

Page 19: Surviving Hadoop on AWS

IF YOU MUST RUN ON AWS

Break your processing pipelines into stages; write out to S3 after each stage.

Bake a lot of variability into your expected jobflow run times.

Compress the data you are reading from and writing to S3 as much as possible.

Drinking helps.
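The staging advice can be sketched in Cascalog terms. This is a hedged illustration only: the bucket, paths, and query functions (`parse-logs-query`, `aggregate-query`) are hypothetical names, not from the deck.

```clojure
;; Each stage reads its input from S3 and writes its output back to S3,
;; so a failed or killed jobflow can be resumed from the last stage
;; that completed instead of re-running the whole pipeline.
(defn stage-1 []
  (?- (hfs-seqfile "s3n://my-bucket/pipeline/parsed/")
      (parse-logs-query (hfs-textline "s3n://my-bucket/pipeline/raw/"))))

(defn stage-2 []
  (?- (hfs-seqfile "s3n://my-bucket/pipeline/aggregated/")
      (aggregate-query (hfs-seqfile "s3n://my-bucket/pipeline/parsed/"))))
```

Running each stage as its own EMR jobflow step keeps the failure domain small, and sequence-file output compresses well, which also helps with the S3 transfer advice above.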

Page 20: Surviving Hadoop on AWS

QUESTIONS?

Page 21: Surviving Hadoop on AWS

YIELDBOT IS HIRING!

http://yieldbot.com/jobs