hadoop introduction

Post on 25-May-2015


Hadoop Introduction

Background, Installation, Hello World, Related

Outline

• Background
• Hello World
• Installation
• Related

23/4/12 2

Background

• Why Hadoop?
  • Accessible: runs on cloud services such as AWS
  • Robust: handles most hardware failures gracefully
  • Scalable: scales linearly as nodes are added
  • Simple: 1 == 1 w

• Key points:
  • Scale-out
  • Moving code to data


Background: History

• Apache top-level project: Doug Cutting
• Lucene -> Nutch -> Hadoop (2004)
• Yahoo (1w)
• Facebook (Hive, HBase, …)
• Hulu (HBase)
• Baidu (3000 TB, one week)
• Twitter (tweet data)


Background

• Comparing SQL databases and Hadoop
• Structure:
  • SQL: structured data with a specific pattern (schema)
  • Hadoop: key-value pairs, which also fit less structured data such as text and pictures

• Scale-out instead of scale-up
• Key-value pairs instead of relational tables
• Functional programming instead of declarative queries
• Offline batch processing (write once, read many times) instead of online transactions

Background – Understanding

• Word count
• As the file size grows, a single machine runs out of memory
• Disk-based hash table (more complex)
• Distributed:
  • Phase 1: each machine processes its own part
  • Phase 2: merge the results
• Shuffle the partitions to the appropriate machines (e.g., by alphabet)
• Now we have already finished a minimal Hadoop.
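The two phases and the shuffle step above can be sketched in a single process. This is only an illustration under stated assumptions: the function names (`count_part`, `shuffle`, `merge`, `word_count`) are hypothetical, the "machines" are plain Python dictionaries, and words are routed by their first character, which is one possible partitioning rule rather than what Hadoop does by default.

```python
from collections import Counter, defaultdict


def count_part(text):
    """Phase 1: count the words in one part of the input."""
    return Counter(text.lower().split())


def shuffle(partial_counts, num_machines):
    """Route every word to exactly one machine, chosen by its first character."""
    buckets = [defaultdict(list) for _ in range(num_machines)]
    for counts in partial_counts:
        for word, n in counts.items():
            target = ord(word[0]) % num_machines  # hypothetical partition rule
            buckets[target][word].append(n)
    return buckets


def merge(bucket):
    """Phase 2: sum the partial counts that arrived at one machine."""
    return {word: sum(ns) for word, ns in bucket.items()}


def word_count(parts, num_machines=2):
    """Run phase 1 on every part, shuffle, then run phase 2 per machine."""
    partial = [count_part(part) for part in parts]
    result = {}
    for bucket in shuffle(partial, num_machines):
        result.update(merge(bucket))
    return result
```

Because every occurrence of a word lands on the same machine after the shuffle, each machine can finish its merge without talking to the others.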


Hello World: Word Count

• Two phases:
  • Mapping: take the input data and load it into the mappers
  • Reducing: process all the output from the mappers and produce the final result

• 1.1 list(filename, file content)
• 1.2 list(word, 1)
• 2.1 list(word, list(1))
• 2.2 list(word, count)


Hello World

• mapper.py
• reducer.py
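The code on this slide did not survive in the transcript, so the following is a minimal sketch of what a word-count mapper and reducer might look like, not the original scripts. It assumes the Hadoop Streaming convention of tab-separated `key<TAB>value` text lines; `run_local` is a hypothetical helper that stands in for the framework's sort between the two phases.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts of each word; the input must be sorted by word."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_local(lines):
    """Hypothetical local stand-in for the framework: map, sort, reduce."""
    return list(reducer(sorted(mapper(lines))))
```

On a cluster the two functions would normally live in separate scripts that read `sys.stdin` and print to stdout, with the framework doing the sort in between.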


Installation

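• Modes:
  • Standalone mode (default)
  • Pseudo-distributed mode (recommended for development and debugging)
  • Fully distributed mode

• Configuration:
  • Basic configuration
  • SSH configuration
  • Ubuntu configuration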


Hadoop Framework

• HDFS:
  • NameNode: tracks, directs, and records
  • DataNode: low-level I/O operations
  • Secondary NameNode

• MapReduce:
  • JobTracker
  • TaskTracker
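The split of roles in HDFS can be illustrated with a toy in-memory model. Everything here is an assumption made for illustration: the class and function names, the round-robin placement, and the tiny block size are hypothetical, and the sketch ignores replication, failures, and the Secondary NameNode entirely.

```python
class DataNode:
    """Stores raw block bytes; stands in for the low-level I/O layer."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write(self, block_id, data):
        self.blocks[block_id] = data

    def read(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Holds metadata only: which blocks make up a file and where they live."""

    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.files = {}  # filename -> list of (block_id, DataNode)

    def allocate(self, filename, num_blocks):
        """Decide (round-robin, a toy policy) which DataNode gets each block."""
        placement = [
            (f"{filename}#{i}", self.datanodes[i % len(self.datanodes)])
            for i in range(num_blocks)
        ]
        self.files[filename] = placement
        return placement

    def locate(self, filename):
        return self.files[filename]


def put_file(namenode, filename, data, block_size=4):
    """Client side: ask the NameNode where to write, then write to DataNodes."""
    chunks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    for (block_id, node), chunk in zip(namenode.allocate(filename, len(chunks)), chunks):
        node.write(block_id, chunk)


def get_file(namenode, filename):
    """Client side: look up block locations, then read from the DataNodes."""
    return b"".join(node.read(block_id) for block_id, node in namenode.locate(filename))
```

The point of the sketch is that file contents never pass through the NameNode; it only records and hands out block locations.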


Related

• Programming:
  • Java
  • Python
  • Jython (translates Python)
  • Hadoop Streaming (stdin, stdout)
  • Dumbo
  • Happy


Related

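• Pig: high-level dataflow language
• Hive: SQL data warehouse
• HBase: column-oriented database modeled on Google BigTable
• ZooKeeper: coordination system for shared state
• Chukwa: data collection system
• Mahout: data mining and machine learning
• Hama: matrix computation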


Resource

• Books:
  • Hadoop in Action
  • Hadoop 实战 (2nd edition)

• Video && Google Course
• URL:
  • Resource bookmarks


Thanks
