hadoop pycon2011uk
Whoami
• Software Engineer @ Yahoo Inc.
• Web Search -> Cloud Platforms -> Display Ads Serving
• http://linkedin.com/in/adityasakhuja
PyCon UK 2011 9/24/11
Agenda
• Motivation
• History
• Ecosystem
• Daemon processes / High Level View
• Map Reduce Data Flow
• HDFS Architecture / Replication
• Can / Cannot
• Getting started yourself
• Demo
• Companies Involved
• Q&A
Motivation
• ‘Traditional’ large-scale computing systems - problems
• Desired features in an improved system
• How Hadoop addresses them
‘Traditional’ large-scale computing systems - problems
• CPU intensive over Data intensive
• MPI, PVM, RPCs – Parallel Computation Frameworks
• Programming for traditional distributed systems is complex
  – Data exchange requires synchronization
  – Temporal dependencies are complicated
  – It is difficult to deal with partial failures of the system
• Data typically stored on SAN
• Data brought to compute nodes @ runtime
Desired Features in a Large-Scale Data System
• Data Driven
  – A new improved system should avoid data bottlenecks
• Scalable
• Consistent
• Recoverable (Data / Processor)
• Partial Failure Support
What Hadoop offers
• Provides a high level programming model
  – No worries about Locking/Temporal Dependencies, Sockets ...
• and the features on the desired list :) (previous slide)
History
• Hadoop is based on work done by Google in the late 1990s/early 2000s
• Specifically, on papers describing the Google File System (GFS), published in 2003, and Map/Reduce, published in 2004
• Hadoop MapReduce NextGeneration – 2011
  – http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen/
Apache Hadoop Ecosystem
• Hadoop Common: The common utilities that support the other Hadoop subprojects.
• Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
• Hadoop MapReduce: A software framework for distributed processing of large data sets on compute clusters.

Other Hadoop-related projects at Apache include:
• Cassandra™: A scalable multi-master database with no single points of failure.
• HBase™: A scalable, distributed database that supports structured data storage for large tables.
• Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
• Mahout™: A scalable machine learning and data mining library.
• Pig™: A high-level data-flow language and execution framework for parallel computation.

Source : http://hadoop.apache.org/
Hadoop Key Daemon Processes
• NameNode
• Secondary NameNode
• DataNode
• JobTracker
• TaskTracker
High level Hadoop cluster view
MapReduce Data Flow
HDFS Architecture
HDFS Replication
Map Reduce Program Components
• MapReduce programs generally consist of three portions
  – The Mapper
  – The Reducer
  – The driver code
• Additional components :
  – Combiner (often the same code as the Reducer)
  – Custom Partitioner
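
To make these components concrete, here is a minimal single-process sketch in plain Python of how they might fit together for a word count. It does not use Hadoop at all; the function names (`mapper`, `reducer`, `combiner`, `partitioner`, `group_by_key`, `run_job`) and the sample input are hypothetical and chosen only for illustration. The combiner reuses the reducer because summing counts is associative and commutative.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter
import zlib


def mapper(line):
    """Map step: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reducer(word, counts):
    """Reduce step: sum all counts seen for one key."""
    yield word, sum(counts)


# Summing is associative and commutative, so the reducer can double as the combiner.
combiner = reducer


def partitioner(key, num_reducers):
    """Custom partitioner: choose a reducer index for a key using a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def group_by_key(pairs):
    """Sort (key, value) pairs by key and yield (key, [values]) groups."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]


def run_job(splits, num_reducers=2):
    """Driver: wire map -> combine -> partition -> sort -> reduce together."""
    partitions = defaultdict(list)
    for split in splits:  # one simulated map task per input split
        mapped = [pair for line in split for pair in mapper(line)]
        # Combine locally so fewer pairs would have to cross the network.
        for key, values in group_by_key(mapped):
            for out_key, out_value in combiner(key, values):
                index = partitioner(out_key, num_reducers)
                partitions[index].append((out_key, out_value))
    result = {}
    for index in sorted(partitions):  # one simulated reduce task per partition
        for key, values in group_by_key(partitions[index]):
            for out_key, out_value in reducer(key, values):
                result[out_key] = out_value
    return result


if __name__ == "__main__":
    sample_splits = [["the quick brown fox", "the lazy dog"], ["the fox"]]
    print(run_job(sample_splits))
```

In a real job the framework, not the driver, performs the sort and shuffle between the map and reduce phases; the driver code only configures and submits the job.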
Hadoop Is / Is Not
• High Bandwidth, High Latency System
• Not a substitute for a DBMS, at least not on its own
• HDFS is not yet a Highly Available FS. NameNode is a SPOF
• Is a “Share nothing” Architecture
  – Mappers do not talk to each other, neither do Reducers
Getting started yourself
Requirements :
• Java SE SDK (download JDK 6 or higher)
• Download and Install
  Hadoop Common : 0.20.203.X - current stable version
  Hadoop HDFS : 0.21 – stable version
  Hadoop MapReduce : 0.21 – stable version
• Subscribe to mailing lists for Hadoop subprojects, depending on your role
• Additionally/Alternatively one can set up VMs from Cloudera / Yahoo
• Details :
  • http://wiki.apache.org/hadoop/GettingStartedWithHadoop
  • http://developer.yahoo.com/hadoop/tutorial/module7.html#basic
Simple Demo
• Using
  – Pig
  – Map/Reduce
Streaming Jobs
• Any language that can read from stdin and write to stdout
• hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper myMapScript.py \
    -reducer myReduceScript.py \
    -file myMapScript.py \
    -file myReduceScript.py
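
As an illustration of what the two scripts might contain, here is a minimal word-count sketch. It assumes the usual streaming convention of one tab-separated key/value record per line, and it assumes the reducer receives its input sorted by key. The function names `map_lines` and `reduce_lines`, and the `map` / `reduce` command-line argument, are hypothetical; in practice each function would live in its own script (`myMapScript.py`, `myReduceScript.py`).

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: turn each input line into tab-separated 'word<TAB>1' records."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Reducer: sum the counts of consecutive records that share a key.

    Assumes the records arrive sorted by key, so equal keys are adjacent.
    """
    records = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for word, group in groupby(records, key=lambda record: record[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Hypothetical convention: pass "map" or "reduce" as the first argument to
# choose which role this process plays when reading from stdin.
if __name__ == "__main__" and len(sys.argv) > 1 and sys.argv[1] in ("map", "reduce"):
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for record in step(sys.stdin):
        print(record)
```

Because both steps only read stdin and write stdout, a sketch like this can be tried without a cluster by piping a text file through the map step, a plain `sort`, and then the reduce step.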
Companies involved
• Yahoo - 4500 nodes cluster (2*4 cores, 4*1 TB disks, 16 GB RAM)
  – (AdServer, Search)
• Hortonworks, Cloudera
• Facebook
• A9 (Amazon Product Search)
• eBay - 532 node cluster
  – (8 * 532 cores, 5.3 PB)
• Last.fm, Twitter …
• … a lot more can be found at : http://wiki.apache.org/hadoop/PoweredBy
Useful Links
• http://wiki.apache.org/hadoop/GettingStartedWithHadoop - Getting Started
• http://hadoop.apache.org/common/docs/current/cluster_setup.html - Cluster Setup
• http://developer.yahoo.com/hadoop/tutorial/module4.html - MapReduce
• http://developer.yahoo.com/hadoop/tutorial/pigtutorial.html - PIG
• http://hadoop.apache.org/common/docs/current/api/index.html - APIs
• http://developer.yahoo.com/hadoop/tutorial/ - YDN resource on Hadoop
Q&C
Contact Information :
Aditya Sakhuja
[email protected]
http://twitter.com/sakhuja
http://linkedin.com/in/adityasakhuja