hadoop pycon2011uk

Respect for the elephant – Hadoop

Aditya Sakhuja

Whoami

•  Software Engineer @ Yahoo Inc.

•  Web Search -> Cloud Platforms -> Display Ads Serving
•  http://linkedin.com/in/adityasakhuja

PyCon UK 2011 9/24/11

Agenda
•  Motivation
•  History
•  Ecosystem
•  Daemon processes / High Level View
•  Map Reduce Data Flow
•  HDFS Architecture / Replication
•  Can / Cannot
•  Getting started yourself
•  Demo
•  Companies Involved
•  Q&A

Motivation

•  'Traditional' large-scale computing systems - problems

•  Desired features in an improved system
•  How Hadoop addresses them

'Traditional' large-scale computing systems - problems

•  Built for CPU-intensive rather than data-intensive work
•  MPI, PVM, RPCs – Parallel Computation Frameworks

•  Programming for traditional distributed systems is complex
   –  Data exchange requires synchronization
   –  Temporal dependencies are complicated
   –  It is difficult to deal with partial failures of the system

•  Data typically stored on a SAN
•  Data brought to the compute nodes at runtime

Desired Features in a Large-Scale Data System

•  Data Driven
   –  A new, improved system should avoid data bottlenecks

•  Scalable
•  Consistent
•  Recoverable ( Data / Processor )
•  Partial Failure Support

What Hadoop offers

•  Provides a high-level programming model
   –  No need to worry about locking, temporal dependencies, sockets ...

•  ... and the desired features listed on the previous slide

History

•  Hadoop is based on work done by Google in the late 1990s/early 2000s

•  Specifically, on papers describing the Google File System (GFS), published in 2003, and Map/Reduce, published in 2004

•  Hadoop MapReduce NextGeneration – 2011
   –  http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen/

Apache Hadoop Ecosystem
•  Hadoop Common: The common utilities that support the other Hadoop subprojects.
•  Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
•  Hadoop MapReduce: A software framework for distributed processing of large data sets on compute clusters.

Other Hadoop-related projects at Apache include:
•  Cassandra™: A scalable multi-master database with no single points of failure.
•  HBase™: A scalable, distributed database that supports structured data storage for large tables.
•  Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
•  Mahout™: A scalable machine learning and data mining library.
•  Pig™: A high-level data-flow language and execution framework for parallel computation.

Source : http://hadoop.apache.org/

Hadoop Key Daemon Processes

•  NameNode
•  Secondary NameNode
•  DataNode
•  JobTracker
•  TaskTracker

High level Hadoop cluster view


MapReduce Data Flow


HDFS Architecture

HDFS Replication

Map Reduce Program Components

•  MapReduce programs generally consist of three portions
   –  The Mapper
   –  The Reducer
   –  The driver code

•  Additional components :
   –  Combiner (often the same code as the Reducer)
   –  Custom Partitioner
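How these components fit together can be sketched in plain Python, without Hadoop. This is only an in-process illustration, assuming a word-count job: the function names (`mapper`, `reducer`, `combiner`, `partitioner`, `run_map_task`, `driver`) and the sample input are made up for this sketch and are not Hadoop APIs.

```python
from collections import defaultdict
from zlib import crc32


def mapper(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reducer(key, values):
    """Reduce step: sum all counts seen for one key."""
    return key, sum(values)


# The combiner pre-aggregates map output on the map side. Summing is
# associative and commutative, so the reducer code can be reused unchanged.
combiner = reducer


def partitioner(key, num_reducers):
    """Custom partitioner: choose the reducer for a key using a stable hash."""
    return crc32(key.encode("utf-8")) % num_reducers


def run_map_task(lines):
    """Run the mapper over one input split, then combine its output locally."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    return [combiner(key, values) for key, values in grouped.items()]


def driver(splits, num_reducers=2):
    """Driver: wire mapper, combiner, partitioner and reducer together."""
    # Shuffle: route each combined pair to the partition picked by the partitioner.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:
        for key, value in run_map_task(split):
            partitions[partitioner(key, num_reducers)][key].append(value)
    # Reduce: each partition is handled independently, keys in sorted order.
    result = {}
    for partition in partitions:
        for key in sorted(partition):
            word, total = reducer(key, partition[key])
            result[word] = total
    return result


if __name__ == "__main__":
    # Hypothetical input: two splits, each a list of text lines.
    sample = [["the elephant", "the yellow elephant"], ["respect the elephant"]]
    print(driver(sample))
```

In a real job the map tasks and reduce tasks run on different nodes and the framework performs the shuffle; the sketch only shows which piece of user code is responsible for which step.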


Hadoop Is / Is Not

•  High-bandwidth, high-latency system
•  Not a substitute for a DBMS, at least not on its own
•  HDFS is not yet a highly available FS; the NameNode is a SPOF

•  Is a "shared-nothing" architecture
   –  Mappers do not talk to each other, and neither do Reducers

Getting started yourself

Requirements :
•  Java SE SDK (download JDK 6 or higher)
•  Download and install
   –  Hadoop Common : 0.20.203.X - current stable version
   –  Hadoop HDFS : 0.21 - stable version
   –  Hadoop MapReduce : 0.21 - stable version

•  Subscribe to mailing lists for Hadoop subprojects, depending on your role

•  Additionally/alternatively, one can set up VMs from Cloudera / Yahoo
•  Details :
   –  http://wiki.apache.org/hadoop/GettingStartedWithHadoop
   –  http://developer.yahoo.com/hadoop/tutorial/module7.html#basic

Simple Demo

•  Using
   –  Pig
   –  Map/Reduce

Streaming Jobs
•  Mapper and reducer can be written in any language that can read from stdin and write to stdout
•  hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
       -input myInputDirs \
       -output myOutputDir \
       -mapper myMapScript.py \
       -reducer myReduceScript.py \
       -file myMapScript.py \
       -file myReduceScript.py
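What the mapper and reducer scripts might contain can be sketched as below, assuming a word-count job. For brevity both roles live in one file and are selected by a command-line argument; that convention, the function names `map_lines` and `reduce_lines`, and the tab-separated "word<TAB>count" line format chosen here are assumptions of this sketch. In the command above they would be two separate scripts (myMapScript.py and myReduceScript.py).

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper body: emit one "word<TAB>1" line for every word read."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word


def reduce_lines(lines):
    """Reducer body: input is sorted by key, so equal keys are adjacent."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical convention: the first argument ("map" or "reduce") picks the role.
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for out in step(sys.stdin):
        print(out)
```

Because the scripts only use stdin and stdout, the logic can be tried locally before submitting a job, with `sort` standing in for the shuffle (the file names here are hypothetical): `cat input.txt | python wordcount.py map | sort | python wordcount.py reduce`.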


Companies involved
•  Yahoo - 4500-node cluster ( 2*4 cores, 4 * 1 TB disks, 16 GB RAM ) – ( AdServer, Search )
•  HortonWorks, Cloudera
•  Facebook
•  A9 ( Amazon Product Search )
•  eBay - 532-node cluster – ( 8 * 532 cores, 5.3 PB )
•  Last.fm, Twitter ...
•  ... a lot more can be found at the link below :
   http://wiki.apache.org/hadoop/PoweredBy

Useful Links
•  http://wiki.apache.org/hadoop/GettingStartedWithHadoop - Getting Started
•  http://hadoop.apache.org/common/docs/current/cluster_setup.html - Cluster Setup
•  http://developer.yahoo.com/hadoop/tutorial/module4.html - MapReduce
•  http://developer.yahoo.com/hadoop/tutorial/pigtutorial.html - Pig
•  http://hadoop.apache.org/common/docs/current/api/index.html - APIs
•  http://developer.yahoo.com/hadoop/tutorial/ - YDN resource on Hadoop

Q&A

Contact Information :
Aditya Sakhuja
http://twitter.com/sakhuja
http://linkedin.com/in/adityasakhuja
