Hadoop Introduction: Background && Installation && Hello World && Related

Upload: tianwei-liu

Post on 25-May-2015


TRANSCRIPT

Page 1: Hadoop introduction

Hadoop Introduction
Background && Installation && Hello World && Related

Page 2: Hadoop introduction

Outline

• Background
• Hello World
• Installation
• Related

23/4/12 2

Page 3: Hadoop introduction

Background

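• Why Hadoop?
  • Accessible: runs on commodity clusters or cloud services such as AWS
  • Robust: handles most hardware failures gracefully
  • Scalable: scales linearly as nodes are added
  • Simple: 1 == 1w (code written for 1 machine also runs on 10,000)

• Key points:
  • Scale-out
  • Moving code to data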


Page 4: Hadoop introduction

Background: History

• Apache top-level project, created by Doug Cutting
• Lucene -> Nutch -> Hadoop (2004)
• Yahoo (1w = 10,000)
• Facebook (Hive, HBase, …)
• Hulu (HBase)
• Baidu (3,000 TB, one week)
• Twitter (tweet data)


Page 5: Hadoop introduction

Background

• Comparing SQL databases and Hadoop
  • Structure:
    • SQL: structured data with a fixed schema
    • Hadoop: key-value pairs, suited to data such as text and pictures
  • Scale-out instead of scale-up
  • Key-value pairs instead of relational tables
  • Functional programming instead of declarative queries
  • Offline batch processing (write once, read many times) instead of online transactions

Page 6: Hadoop introduction

Background – Understanding

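• Word count
  • On one machine: as the file size grows, the word table no longer fits in memory
  • A disk-based hash table works, but is more complex
• Distributed:
  • Phase 1: each machine processes its part of the input
  • Phase 2: merge the partial results
  • Shuffle the partitions to the appropriate machines (for example, by letter of the alphabet)
• With that, we have already built a minimal Hadoop.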
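
The two-phase idea on this slide can be sketched in plain Python. This is a toy, single-process simulation and not Hadoop code: the function names (`count_part`, `partition_of`, `distributed_word_count`) and the rule of routing a word by its first character are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def count_part(lines):
    """Phase 1: count words in one part of the input (one 'machine')."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def partition_of(word, num_partitions):
    """Shuffle rule (assumed): route a word by its first character."""
    return ord(word[0]) % num_partitions


def distributed_word_count(parts, num_partitions=3):
    # Phase 1: every part is processed independently.
    partial_counts = [count_part(part) for part in parts]

    # Shuffle: send each (word, count) pair to the partition that owns the word.
    partitions = defaultdict(list)
    for counts in partial_counts:
        for word, n in counts.items():
            partitions[partition_of(word, num_partitions)].append((word, n))

    # Phase 2: each partition merges only the words it owns.
    result = {}
    for pairs in partitions.values():
        merged = Counter()
        for word, n in pairs:
            merged[word] += n
        result.update(merged)
    return result


if __name__ == "__main__":
    parts = [["hello world", "hello hadoop"], ["world of data"]]
    print(distributed_word_count(parts))
```

Because every occurrence of a word lands in the same partition, each partition can finish its merge without talking to the others, which is what lets phase 2 run in parallel.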


Page 7: Hadoop introduction

Hello World: Word Count

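• Two phases:
  • Mapping: takes the input data and feeds it to the mapper
  • Reducing: processes all output from the mappers and produces the final result

• 1.1 Map input: list(filename, file content)
• 1.2 Map output: list(word, 1)
• 2.1 Reduce input: list(word, list(1))
• 2.2 Reduce output: list(word, count)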
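
To make the four data shapes (1.1 to 2.2) concrete, here is a minimal in-memory sketch. `map_fn`, `group_by_key`, and `reduce_fn` are hypothetical names, and grouping is done with a dictionary rather than Hadoop's sort-based shuffle.

```python
from collections import defaultdict


def map_fn(filename, content):
    """1.1 -> 1.2: one (filename, content) pair becomes a list of (word, 1)."""
    return [(word, 1) for word in content.split()]


def group_by_key(pairs):
    """1.2 -> 2.1: collect all values emitted for the same word."""
    grouped = defaultdict(list)
    for word, one in pairs:
        grouped[word].append(one)
    return grouped


def reduce_fn(word, ones):
    """2.1 -> 2.2: (word, [1, 1, ...]) becomes (word, count)."""
    return (word, sum(ones))


files = [("a.txt", "to be or not to be"), ("b.txt", "to do")]
mapped = [pair for name, text in files for pair in map_fn(name, text)]
grouped = group_by_key(mapped)
counts = sorted(reduce_fn(word, ones) for word, ones in grouped.items())
print(counts)
```

Note that the mapper ignores the filename here; word count only needs the content, but the key is still part of the map input.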


Page 8: Hadoop introduction

Hello World

• mapper.py
• Reducer.py
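
The slide names two scripts, but their contents are not reproduced in this transcript. What follows is a minimal sketch of what a Hadoop Streaming word count typically looks like, not the author's code. Both steps are kept in one file for brevity, and the function names and the `map`/`reduce` command-line argument are assumptions. Streaming passes records through stdin and stdout as lines, with key and value separated by a tab by default, and the reducer receives the mapper output sorted by key.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word read from the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum counts per word; assumes lines arrive sorted by word."""
    pairs = (
        line.rstrip("\n").split("\t", 1)
        for line in sorted_lines
        if line.strip()
    )
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: `python wordcount.py map` or `python wordcount.py reduce`.
    step = mapper if sys.argv[1] == "map" else reducer
    for out in step(sys.stdin):
        print(out)
```

The same logic can be checked locally without a cluster by piping the map step through `sort` into the reduce step, which stands in for Hadoop's shuffle.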


Page 9: Hadoop introduction

Installation

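• Modes:
  • Standalone mode (default)
  • Pseudo-distributed mode (recommended for development and debugging)
  • Fully distributed mode

• Configuration:
  • Basic configuration
  • SSH configuration
  • Ubuntu configuration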


Page 10: Hadoop introduction

Hadoop Framework

• HDFS:
  • NameNode: tracks, directs, and records
  • DataNode: low-level I/O operations
  • Secondary NameNode

• MapReduce:
  • JobTracker
  • TaskTracker
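
As a rough mental model of the HDFS roles on this slide, the toy sketch below keeps file-to-block metadata in a `NameNode` object and the block bytes in `DataNode` objects. It is not the Hadoop API: the class and method names, the tiny block size, and the round-robin placement are all assumptions for illustration, and replication is left out. In real HDFS the client transfers block data directly to the DataNodes and the NameNode handles only metadata; the sketch routes writes through `NameNode.put` purely to stay short.

```python
class DataNode:
    """Stores raw block bytes (the low-level I/O role)."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write(self, block_id, data):
        self.blocks[block_id] = data

    def read(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Tracks which blocks make up a file and where each block lives."""

    def __init__(self, datanodes, block_size=4):
        self.datanodes = datanodes
        self.block_size = block_size
        self.block_map = {}  # filename -> list of (block_id, DataNode)

    def put(self, filename, data):
        placements = []
        for i in range(0, len(data), self.block_size):
            index = i // self.block_size
            block_id = f"{filename}#{index}"
            node = self.datanodes[index % len(self.datanodes)]  # round-robin
            node.write(block_id, data[i:i + self.block_size])
            placements.append((block_id, node))
        self.block_map[filename] = placements

    def get(self, filename):
        return b"".join(node.read(bid) for bid, node in self.block_map[filename])


nodes = [DataNode("dn1"), DataNode("dn2")]
namenode = NameNode(nodes)
namenode.put("greeting.txt", b"hello hadoop")
print(namenode.get("greeting.txt"))
```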


Page 11: Hadoop introduction

Related

• Programming:
  • Java
  • Python
  • Jython (translates Python to Java bytecode)
  • Hadoop Streaming (stdin, stdout)
  • Dumbo
  • Happy


Page 12: Hadoop introduction

Related

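• Pig: high-level data-flow language
• Hive: SQL data warehouse
• HBase: column-oriented database modeled on Google Bigtable
• ZooKeeper: coordination system for shared state
• Chukwa: data collection system
• Mahout: data mining and machine learning
• Hama: matrix computation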


Page 13: Hadoop introduction

Resource

• Books:
  • Hadoop in Action
  • Hadoop 实战 (2nd edition)

• Video && Google course
• URL:
• Resource bookmarks


Page 14: Hadoop introduction

Thanks
