Hadoop Introduction

Background && Installation && Hello World && Related

Outline

• Background
• Hello World
• Installation
• Related

23/4/12 2

Background

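• Why Hadoop?
  • Accessible: runs on cloud services such as AWS
  • Robust: handles most hardware failures gracefully
  • Scalable: scales linearly as nodes are added
  • Simple: 1 == 1w (the same code for 1 machine or 10,000)

• Key Points:
  • Scale-out
  • Moving code to data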


Background: History

• Apache top-level project, created by Doug Cutting
• Lucene -> Nutch -> Hadoop (2004)
• Yahoo (1w = 10,000)
• Facebook (Hive, HBase, …)
• Hulu (HBase)
• Baidu (3000 TB, one week)
• Twitter (tweet data)


Background

• Comparing SQL databases and Hadoop
  • Structure:
    • SQL: structured data with a fixed schema
    • Hadoop: key-value pairs, which suits less structured data such as text and pictures
  • Scale-out <- Scale-up
  • Key-value pairs <- Relational tables
  • Functional programming <- Declarative queries
  • Offline batch processing (write once, read many times) <- Online transactions

Background – Understanding

• Word Count
  • As the file size grows, a single in-memory table runs out of memory
  • Disk-based hash table (more complex)
  • Distributed:
    • Phase 1: partial processing
    • Phase 2: merge results
    • Shuffle the partitions to the appropriate machines (e.g., alphabetically)
• Now we have already built a minimal Hadoop.
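The two-phase idea on this slide can be sketched in a few lines of Python. This is only a single-process illustration, not how Hadoop is implemented: the function names are hypothetical, each "machine" is just a list entry, and routing words by first letter is an assumed reading of the alphabetical shuffle.

```python
from collections import Counter


def count_chunk(text):
    """Phase 1: count the words in one chunk (one machine's share of the input)."""
    return Counter(text.lower().split())


def partition_of(word, num_partitions):
    """Pick the partition for a word from its first character.

    The alphabetical rule is an assumption made for illustration.
    """
    return (ord(word[0]) - ord("a")) % num_partitions


def shuffle(partial_counts, num_partitions):
    """Send every (word, count) pair to the partition that owns the word."""
    partitions = [[] for _ in range(num_partitions)]
    for counts in partial_counts:
        for word, n in counts.items():
            partitions[partition_of(word, num_partitions)].append((word, n))
    return partitions


def merge(partition):
    """Phase 2: add up the partial counts that arrived at one partition."""
    total = Counter()
    for word, n in partition:
        total[word] += n
    return total


def distributed_word_count(chunks, num_partitions=2):
    """Run phase 1 on every chunk, shuffle, then run phase 2 on every partition."""
    partial = [count_chunk(chunk) for chunk in chunks]
    result = {}
    for partition in shuffle(partial, num_partitions):
        result.update(merge(partition))
    return result


print(distributed_word_count(["a b a", "b c"]))
```

Because every occurrence of a word lands in the same partition, the merge step never needs to see the whole data set, which is what lets phase 2 run in parallel as well.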


Hello World: Word Count

• Two Phases:
  • Mapping: take the input data and load it into the mappers
  • Reducing: process all output from the mappers and produce the final result

• 1.1 list(filename, file content)
• 1.2 list(word, 1)
• 2.1 list(word, list(1))
• 2.2 list(word, count)
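To make the shape of the data at steps 1.1 to 2.2 concrete, here is a small in-memory sketch. The function names and sample files are made up for illustration, and step 2.1 is taken to mean each word paired with the list of 1s emitted for it.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(files):
    """1.1 -> 1.2: turn (filename, content) pairs into (word, 1) pairs."""
    return [(word, 1) for _name, content in files for word in content.split()]


def group_phase(pairs):
    """1.2 -> 2.1: sort by word, then collect the 1s emitted for each word."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [
        (word, [n for _word, n in group])
        for word, group in groupby(ordered, key=itemgetter(0))
    ]


def reduce_phase(grouped):
    """2.1 -> 2.2: sum each list to get (word, count)."""
    return [(word, sum(ones)) for word, ones in grouped]


files = [("a.txt", "hello hadoop"), ("b.txt", "hello world")]  # step 1.1
pairs = map_phase(files)       # step 1.2
grouped = group_phase(pairs)   # step 2.1
counts = reduce_phase(grouped) # step 2.2
print(counts)
```

In a real job the grouping in the middle is done by the framework between the map and reduce phases; only the first and last functions correspond to code the user writes.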


Hello World

• mapper.py
• reducer.py
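The contents of the two scripts are not preserved in this text, so the following is a minimal sketch of what a word-count mapper and reducer for Hadoop Streaming might look like, not the author's original code. It assumes the usual Streaming convention of tab-separated key/value lines and reducer input sorted by key; `run_locally` is a hypothetical helper that stands in for the sort Hadoop performs between the phases.

```python
def mapper(lines):
    """mapper.py logic: emit 'word<TAB>1' for every word on every input line."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(lines):
    """reducer.py logic: sum counts for consecutive lines that share a word.

    Relies on the input being sorted by word, so equal words are adjacent.
    """
    current, total = None, 0
    for line in lines:
        word, count = line.rstrip("\n").split("\t", 1)
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"


def run_locally(lines):
    """Hypothetical stand-in for the framework: map, sort by key, reduce."""
    return list(reducer(sorted(mapper(lines))))


print(run_locally(["hello hadoop\n", "hello world\n"]))
```

As separate scripts, each function would be fed `sys.stdin` and its results printed to stdout, and the pair could be checked without a cluster using a shell pipeline of the form `cat input.txt | python mapper.py | sort | python reducer.py`.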


Installation

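• Modes:
  • Standalone (local) mode (default)
  • Pseudo-distributed mode: recommended for development and debugging
  • Fully distributed mode

• Configuration:
  • Basic configuration
  • SSH configuration
  • Ubuntu configuration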


Hadoop Framework

• HDFS:
  • NameNode: tracks, directs, and records
  • DataNode: low-level I/O operations
  • Secondary NameNode

• MapReduce:
  • JobTracker
  • TaskTracker


Related

• Programming:
  • Java
  • Python
    • Jython (translates Python to Java bytecode)
    • Hadoop Streaming (stdin, stdout)
    • Dumbo
    • Happy


Related

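• Pig: high-level dataflow language
• Hive: SQL data warehouse
• HBase: column-oriented database modeled on Google BigTable
• ZooKeeper: coordination system for shared state
• Chukwa: data collection system
• Mahout: data mining and machine learning
• Hama: matrix computation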


Resource

• Books:
  • Hadoop in Action
  • Hadoop 实战 (2nd edition)

• Video && Google course
• URL:
  • Resource bookmarks


Thanks
