Cloud Computing Virtualization Technology -- Cloud Computing Data Processing Technology -- Hadoop -- MapReduce / 賴智錦 / 詹奇峰, Department of Electrical Engineering, National University of Kaohsiung, 2009/08/05


Page 1: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Cloud Computing Virtualization Technology -- Cloud Computing Data Processing Technology -- Hadoop -- MapReduce

賴智錦 / 詹奇峰
Department of Electrical Engineering, National University of Kaohsiung

2009/08/05

Page 2: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Cloud Computing Data Processing: What is large data?

From the point of view of the infrastructure required to do analytics, data comes in three sizes: small data, medium data, and large data.

Source: http://blog.rgrossman.com/

Page 3: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Cloud Computing Data Processing: Small data

Small data fits into the memory of a single machine.

Example: the dataset for the Netflix Prize. (The Netflix Prize seeks to substantially improve the accuracy of predictions about how much someone is going to love a movie based on their movie preferences.)

The Netflix Prize dataset consists of over 100 million movie ratings from 480 thousand randomly chosen, anonymous Netflix customers who rated over 17 thousand movie titles.

This dataset is just 2 GB and fits into the memory of a laptop.

Source: http://blog.rgrossman.com/

Page 4: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Cloud Computing Data Processing: Medium data

Medium data fits onto a single disk or disk array and can be managed by a database.

It is becoming common today for companies to create data warehouses of 1 to 10 TB or larger.

Source: http://blog.rgrossman.com/

Page 5: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Cloud Computing Data Processing: Large data

Large data is so large that it is challenging to manage in a database; instead, specialized systems are used.

Scientific experiments, such as the Large Hadron Collider (LHC, the world's largest and highest-energy particle accelerator), produce large datasets.

Log files produced by Google, Yahoo and Microsoft and similar companies are also examples of large datasets.

Source: http://blog.rgrossman.com/

Page 6: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Cloud Computing Data Processing: Large data sources

Most large datasets were produced by the scientific and defense communities.

Two things have changed. First, large datasets are now being produced by a third community: companies that provide internet services, such as search, on-line advertising and social media.

The ability to analyze these datasets is critical for the advertising systems that produce the bulk of the revenue for these companies.

Source: http://blog.rgrossman.com/

Page 7: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Cloud Computing Data Processing: Large data sources

Second, this advertising revenue provides a metric by which to measure the effectiveness of analytic infrastructure and analytic models.

Using this metric, Google settled on an analytic infrastructure that is quite different from the grid-based infrastructure generally used by the scientific community.

Source: http://blog.rgrossman.com/

Page 8: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Cloud Computing Data Processing: What is a large data cloud?

A good working definition is that a large data cloud provides storage services, and compute services layered over those storage services, that scale to a data center and have the reliability associated with a data center.

Source: http://blog.rgrossman.com/

Page 9: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Cloud Computing Data Processing: What are some of the options for working with large data?

The most mature large data cloud application is the open source Hadoop system, which consists of the Hadoop Distributed File System (HDFS) and Hadoop's implementation of MapReduce.

An important advantage of Hadoop is that it has a very robust community supporting it, and there are a large number of Hadoop projects, including Pig, which provides simple database-like operations over data managed by HDFS.

Source: http://blog.rgrossman.com/

Page 10: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Cloud Computing Data Processing: "The cloud originates from parallel computing, but is better at data computation than the grid"
-- Dr. 林誠謙, director of Academia Sinica Grid Computing (ASGC)

Cloud computing originated from parallel computing techniques and does not depart from the philosophy of grid computing, but cloud computing is much more focused on processing data.

The small amount of data handled in each pass has led cloud computing to develop implementation approaches different from those of grid computing.
-- Dr. 黃維誠, project director, Enterprise and Project Management Division, National Center for High-performance Computing (NCHC)

Page 11: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Cloud Computing Data Processing

The tasks suited to cloud computing are mostly those in which data is processed very frequently, but the amount of data handled in each pass is small.
-- Dr. 黃維誠, project director, Enterprise and Project Management Division, National Center for High-performance Computing (NCHC)

Source: http://www.ithome.com.tw/itadm/article.php?c=49410&s=2

Page 12: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Cloud Computing Data Processing: Cloud computing vs. grid computing

Main proponents -- Cloud: IT providers such as Google, Yahoo, IBM and Amazon. Grid: academic institutions such as CERN (the European particle physics laboratory), Academia Sinica and the National Center for High-performance Computing.

Degree of standardization -- Cloud: no standardization; each vendor adopts a different technology stack. Grid: standardized protocols and trust mechanisms.

Openness of source -- Cloud: partially open source; the Hadoop framework is open source, but Google's GFS and the BigTable database system are not. Grid: fully open source.

Domain scope -- Cloud: confined to an enterprise's internal domain. Grid: can span enterprises and administrative domains.

Hardware supported in a single computing cluster -- Cloud: personal computers of the same standard specification (e.g. x86 processors, hard disks, 4 GB of memory, Linux). Grid: can mix heterogeneous servers (different processors, operating systems, compiler versions, etc.).

Data characteristics each handles best -- Cloud: applications with a small amount of data per run (small enough to execute on a single PC) that must be repeated a very large number of times. Grid: applications with a large amount of data per run, e.g. analysis of satellite signals of several GB each.

Source: compiled by iThome, June 2008

Page 13: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Cloud Computing Data Processing: Web search

Each web page to be matched is actually a small file and does not require much processor power, so web-search computation can be run on a large number of personal computers. It is harder to build grid computing out of personal computers, because grid computing needs larger processing resources per task.

The practical difference is that cloud computing can combine a large number of personal computers to provide a service, whereas grid computing has to rely on high-performance computers that can supply large amounts of computing resources.
-- Dr. 黃維誠, project director, Enterprise and Project Management Division, National Center for High-performance Computing (NCHC)

Source: http://www.ithome.com.tw/itadm/article.php?c=49410&s=2

Page 14: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Cloud Computing Data Processing

Cloud computing: distributed computing technology proposed by Google that makes it easy for developers to build globally available application services. Cloud computing technology automatically manages communication, task assignment and distributed storage across a large number of standardized (non-heterogeneous) computers.

Grid computing: integrating heterogeneous servers across administrative domains over the network, through standardized protocols and trust mechanisms, into computing clusters that share compute resources, storage resources, and so on.

In-the-cloud, or cloud service: the provider delivers a service over the internet; users need only a browser to use it and do not need to understand how the provider's servers operate.

Page 15: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Cloud Computing Data Processing

MapReduce model: the key technique Google applies in cloud computing, letting developers write programs that process very large amounts of data. A Map program first splits the data into independent chunks and distributes them to a large number of machines; a Reduce program then merges the results and outputs what the developer needs.

Hadoop: an open source cloud computing framework written in Java, implemented following Google's cloud computing techniques but using a distributed file system different from Google's. In 2006 Yahoo became the project's main contributor and user.

Source: http://www.ithome.com.tw/itadm/article.php?c=49410&s=2

Page 16: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Cloud Computing Data Processing

Data processing: when there is a large amount of data, how to split, compute and merge it in parallel so that analysts can directly obtain a summary of that data.

Parallel data-analysis languages: Google's Sawzall project and Yahoo's Pig project are both high-level languages for processing large amounts of data in parallel. Google's Sawzall is built on top of MapReduce, and Yahoo's Pig is built on top of Hadoop (Hadoop being a clone of MapReduce), so the two share essentially the same lineage.

Page 17: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Hadoop: Why?

Need to process 100 TB datasets with multi-day jobs.

On 1 node: scanning at 50 MB/s takes 23 days.
On a 1000-node cluster: scanning at 50 MB/s takes 33 minutes.

Need a framework for distribution that is efficient, reliable and usable.
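As a quick sanity check on those figures: 100 TB is roughly 10^14 bytes, and 10^14 bytes / (50 x 10^6 bytes/s) = 2 x 10^6 s, about 23 days on one node; spread evenly over 1000 nodes that is about 2 x 10^3 s, or roughly 33 minutes.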

Page 18: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Hadoop: Where?

Batch data processing, not real-time / user-facing: log processing, document analysis and indexing, web graphs and crawling.

Highly parallel, data-intensive, distributed applications, where bandwidth to data is a constraint and the number of CPUs is a constraint.

Very large production deployments (GRID): several clusters of 1000s of nodes, and LOTS of data (trillions of records, 100 TB+ data sets).

Page 19: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

What is Hadoop?

The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.

The project includes:

Core: provides the Hadoop Distributed Filesystem (HDFS) and support for the MapReduce distributed computing framework.

MapReduce: a distributed data processing model and execution environment that runs on large clusters of commodity machines.

Chukwa: a data collection system for managing large distributed systems. Chukwa is built on top of HDFS and the MapReduce framework and inherits Hadoop's scalability and robustness.

Page 20: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

What is Hadoop?

HBase: builds on Hadoop Core to provide a scalable, distributed database.

Hive: a data warehouse infrastructure built on Hadoop Core that provides data summarization, ad-hoc querying and analysis of datasets.

Pig: a high-level data-flow language and execution framework for parallel computation. It is built on top of Hadoop Core.

ZooKeeper: a highly available and reliable coordination service. Distributed applications use ZooKeeper to store and mediate updates for critical shared state.

Page 21: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Hadoop History

2004 - Initial versions of what is now the Hadoop Distributed File System and MapReduce implemented by Doug Cutting and Mike Cafarella.

December 2005 - Nutch ported to the new framework. Hadoop runs reliably on 20 nodes.

January 2006 - Doug Cutting joins Yahoo!

February 2006 - Apache Hadoop project officially started to support the standalone development of MapReduce and HDFS.

March 2006 - Formation of the Yahoo! Hadoop team.

April 2006 - Sort benchmark run on 188 nodes in 47.9 hours.

Page 22: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Hadoop History

May 2006 - Yahoo sets up a Hadoop research cluster - 300 nodes.

May 2006 - Sort benchmark run on 500 nodes in 42 hours (better hardware than the April benchmark).

October 2006 - Research cluster reaches 600 nodes.

December 2006 - Sort times: 20 nodes in 1.8 hrs, 100 nodes in 3.3 hrs, 500 nodes in 5.2 hrs, 900 nodes in 7.8 hrs.

April 2007 - Research clusters - 2 clusters of 1000 nodes.

Source: http://hadoop.openfoundry.org/slides/Hadoop_OSDC_08.pdf

Page 23: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Hadoop Components Hadoop Distributed Filesystem (HDFS)

is a distributed file system designed to run on commodity hardware.

is highly fault-tolerant and is designed to be deployed on low-cost hardware.

provides high throughput access to application data and is suitable for applications that have large data sets.

relaxes a few POSIX requirements to enable streaming access to file system data. (POSIX: Portable Operating System Interface [for Unix])

was originally built as infrastructure for the Apache Nutch web search engine project.

is part of the Apache Hadoop Core project.

Page 24: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Hadoop Components Hadoop Distributed Filesystem (HDFS)

Source: http://hadoop.apache.org/core/

Page 25: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Hadoop Components HDFS Assumptions and Goals

Hardware failure Hardware failure is the norm rather than the exception.

An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system’s data.

The huge number of components, each with a non-trivial probability of failure, means that some component of HDFS is always non-functional.

Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.

Page 26: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Hadoop Components HDFS Assumptions and Goals

Streaming Data Access

Applications that run on HDFS need streaming access to their data sets.

They are not general purpose applications that typically run on general purpose file systems.

HDFS is designed more for batch processing rather than interactive use by users.

The emphasis is on high throughput of data access rather than low latency of data access. POSIX imposes many hard requirements that are not needed for applications that are targeted for HDFS.

Page 27: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Hadoop Components HDFS Assumptions and Goals

Large Data Sets Applications that run on HDFS have large data sets.

A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files.

It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance.

Page 28: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Hadoop Components HDFS Assumptions and Goals

Simple Coherency Model

HDFS applications need a write-once-read-many access model for files.

A file once created, written, and closed need not be changed. This assumption simplifies data coherency issues and enables high throughput data access.

A Map/Reduce application or a web crawler application fits perfectly with this model. There is a plan to support appending-writes to files in the future.


Page 30: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Hadoop Components HDFS Assumptions and Goals

"Moving Computation is Cheaper than Moving Data" A computation requested by an application is much more

efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge.

This minimizes network congestion and increases the overall throughput of the system.

It is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.

Page 31: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Hadoop Components HDFS Assumptions and Goals

Portability Across Heterogeneous Hardware and Software Platforms

HDFS has been designed to be easily portable from one platform to another.

This facilitates widespread adoption of HDFS as a platform of choice for a large set of applications.

Page 32: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Hadoop Components HDFS: Namenode and Datanode

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients.

In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on.

HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes.

Page 33: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Hadoop Components HDFS: Namenode and Datanode

HDFS has a master/slave architecture. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes.

The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.

The existence of a single NameNode in a cluster greatly simplifies the architecture of the system. The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the NameNode.
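One way to see this master/slave layout on a running cluster is to ask the NameNode to report on its DataNodes. A minimal sketch, assuming the single-node /opt/hadoop 0.20 installation configured later in these slides:

$ cd /opt/hadoop
$ bin/hadoop dfsadmin -report                      # NameNode prints cluster capacity and the list of live/dead DataNodes
$ bin/hadoop fsck / -files -blocks -locations      # shows how each file's blocks are mapped onto DataNodes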

Page 34: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Hadoop Components Hadoop Distributed Filesystem (HDFS)

Source: http://hadoop.apache.org/common/docs/r0.20.0/hdfs_design.html

Page 35: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Hadoop Components HDFS: The File System Namespace

HDFS supports a traditional hierarchical file organization. A user or an application can create directories and store files inside these directories.

The file system namespace hierarchy is similar to most other existing file systems: one can create and remove files, move a file from one directory to another, or rename a file.

The NameNode maintains the file system namespace. Any change to the file system namespace or its properties is recorded by the NameNode.

An application can specify the number of replicas of a file that should be maintained by HDFS. The number of copies of a file is called the replication factor of that file. This information is stored by the NameNode.
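The namespace operations and the per-file replication factor described above map directly onto the hadoop fs shell. A small sketch; the file and directory names are made-up examples, and note that on the single-node setup built later in these slides only one replica can actually be stored:

$ bin/hadoop fs -mkdir /user/cfong/books                                          # create a directory in the namespace
$ bin/hadoop fs -put mobydick.txt /user/cfong/books                               # copy a local file into HDFS
$ bin/hadoop fs -mv /user/cfong/books/mobydick.txt /user/cfong/books/moby.txt     # rename / move
$ bin/hadoop fs -setrep 2 /user/cfong/books/moby.txt                              # change this file's replication factor (useful on a multi-node cluster)
$ bin/hadoop fs -ls /user/cfong/books                                             # the second column of the listing is the replication factor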

Page 36: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Hadoop Components Hadoop Distributed Processing Framework Using the MapReduce Metaphor

Map/Reduce is a software framework for easily writing applications which process vast amounts of data in parallel on large clusters of commodity hardware.

A simple programming model that applies to many large-scale computing problems.

Hide messy details in the MapReduce runtime library: automatic parallelization, load balancing, network and disk transfer optimization, handling of machine failures, robustness.

Page 37: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Hadoop Components

A Map/Reduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner.

The framework sorts the outputs of the maps, which are then input to the reduce tasks. The framework takes care of scheduling tasks, monitoring them and re-executing the failed tasks.

The Map/Reduce framework consists of a single master JobTracker and one slave TaskTracker per cluster-node. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them and re-executing the failed tasks.

The slaves execute the tasks as directed by the master.
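On a running cluster the JobTracker described above can be queried from the command line; a small illustrative sketch, assuming the /opt/hadoop layout used later in these slides (the job id is whatever was printed when the job was submitted):

$ bin/hadoop job -list               # jobs currently known to the JobTracker
$ bin/hadoop job -status <job-id>    # progress and counters for a single job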

Page 38: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Hadoop Components

Although the Hadoop framework is implemented in Java, Map/Reduce applications need not be written in Java.

Hadoop Streaming is a utility which allows users to create and run jobs with any executables (e.g. shell utilities) as the mapper and/or the reducer.

Hadoop Pipes is a SWIG-compatible C++ API to implement Map/Reduce applications (not based on JNI, the Java Native Interface).
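To make Streaming concrete, the sketch below runs a job whose mapper and reducer are ordinary shell utilities. The jar location is an assumption based on a 0.20.0 tarball layout as used later in these slides, and the HDFS paths are made-up examples:

$ bin/hadoop jar contrib/streaming/hadoop-0.20.0-streaming.jar \
    -input /user/cfong/input \
    -output /user/cfong/streaming-out \
    -mapper /bin/cat \
    -reducer /usr/bin/wc                            # each reducer reports line/word/byte counts for its partition
$ bin/hadoop fs -cat /user/cfong/streaming-out/part-00000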

Page 39: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Hadoop Components MapReduce concepts

Definitions:

Map function: take a set of (key, value) pairs and generate a set of intermediate (key, value) pairs by applying some function to all these pairs. E.g., (k1, v1) -> list(k2, v2)

Reduce function: merge all pairs with the same key, applying a reduction function to the values. E.g., (k2, list(v2)) -> list(k3, v3)

Input and output types of a Map/Reduce job: read a lot of data; Map: extract something meaningful from each record; shuffle and sort; Reduce: aggregate, summarize, filter, or transform; write the results.

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)

Page 40: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Hadoop Components MapReduce concepts: word count data flow

Input -> Map -> Shuffle & Sort -> Reduce -> Output

Input records: "the fox ate the mouse", "the small mouse", "the quick brown fox"

Map outputs (one map task per input record):
(the, 1) (fox, 1) (the, 1) (ate, 1) (mouse, 1)
(the, 1) (mouse, 1) (small, 1)
(the, 1) (brown, 1) (fox, 1) (quick, 1)

Reduce outputs (two reduce tasks):
(the, 3) (fox, 2) (brown, 1) (small, 1)
(the, 1) (ate, 1) (mouse, 2) (quick, 1)

Page 41: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Hadoop Components

Consider the problem of counting the number of occurrences of each word in a large collection of documents:

map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));

The map function emits each word plus an associated count of occurrences ("1" in this example).

The reduce function sums together all the counts emitted for a particular word.
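On a running Hadoop 0.20 installation (as set up later in these slides), the bundled examples jar includes a word-count job corresponding to this pseudocode. A minimal sketch; the input file and HDFS paths are made-up examples, and the exact name of the output part file can vary:

$ cd /opt/hadoop
$ bin/hadoop fs -put README.txt wc-in                              # any text file will do
$ bin/hadoop jar hadoop-0.20.0-examples.jar wordcount wc-in wc-out
$ bin/hadoop fs -cat wc-out/part-r-00000                           # tab-separated (word, count) pairs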

Page 42: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Hadoop Components MapReduce Execution Overview

1. The MapReduce library in the user program first shards the input files into M pieces of typically 16-64 megabytes (MB) per piece. It then starts up many copies of the program on a cluster of machines.

2. One of the copies of the program is special: the master. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.

3. A worker who is assigned a map task reads the contents of the corresponding input split. It parses key/value pairs out of the input data and passes each pair to the user-defined map function. The intermediate key/value pairs produced by the map function are buffered in memory.

Source: J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51(1):107-113, 2008.

Page 43: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Hadoop Components MapReduce Execution Overview

4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.

5. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. If the amount of intermediate data is too large to fit in memory, an external sort is used.

Source: J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51(1):107-113, 2008.

Page 44: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Hadoop Components MapReduce Execution Overview

6. The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's reduce function. The output of the reduce function is appended to a final output file for this reduce partition.

7. When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.

Source: J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51(1):107-113, 2008.

Page 45: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Hadoop Components MapReduce Examples

Distributed Grep (global search for a regular expression and print out the matching lines): the map function emits a line if it matches a given pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output.

Count of URL Access Frequency: the map function processes logs of web page requests and outputs <URL, 1>. The reduce function adds together all values for the same URL and emits a <URL, total count> pair.

Page 46: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Hadoop Components MapReduce Examples

Reverse Web-Link Graph: the map function outputs <target, source> pairs for each link to a target URL found in a page named "source". The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair <target, list(source)>.

Inverted Index: the map function parses each document and emits a sequence of <word, document ID> pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a <word, list(document ID)> pair.

The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.

Page 47: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Hadoop Components MapReduce Examples

Term-Vector per Host: a term vector summarizes the most important words that occur in a document or a set of documents as a list of <word, frequency> pairs. The map function emits a <hostname, term vector> pair for each input document (where the hostname is extracted from the URL of the document).

The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final <hostname, term vector> pair.

Page 48: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

[Charts: MapReduce Programs in Google's Source Tree; New MapReduce Programs per Month]

Source: http://www.cs.virginia.edu/~pact2006/program/mapreduce-pact06-keynote.pdf

Page 49: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Who Uses Hadoop: Amazon/A9, Facebook, Google, IBM, Joost, Last.fm, New York Times, PowerSet (now Microsoft), Quantcast, Veoh, Yahoo! More at http://wiki.apache.org/hadoop/PoweredBy

Page 50: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Hadoop Resources

http://hadoop.apache.org
http://developer.yahoo.net/blogs/hadoop/
http://code.google.com/intl/zh-TW/edu/submissions/uwspr2007_clustercourse/listing.html
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=873

J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Communications of the ACM, 51(1):107-113, 2008.

T. White, Hadoop: The Definitive Guide (MapReduce for the Cloud), O'Reilly, 2009.

Page 51: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Hadoop Download

NCHC (國網中心):
http://ftp.twaren.net/Unix/Web/apache/hadoop/core/

HTTP mirrors:
http://ftp.stut.edu.tw/var/ftp/pub/OpenSource/apache/hadoop/core/
http://ftp.twaren.net/Unix/Web/apache/hadoop/core/
http://ftp.mirror.tw/pub/apache/hadoop/core/
http://apache.cdpa.nsysu.edu.tw/hadoop/core/
http://ftp.tcc.edu.tw/pub/Apache/hadoop/core/
http://apache.ntu.edu.tw/hadoop/core/

FTP mirrors:
ftp://ftp.stut.edu.tw/pub/OpenSource/apache/hadoop/core/
ftp://ftp.stu.edu.tw/Unix/Web/apache/hadoop/core/
ftp://ftp.twaren.net/Unix/Web/apache/hadoop/core/
ftp://apache.cdpa.nsysu.edu.tw/Unix/Web/apache/hadoop/core/

Page 52: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Hadoop Virtual Image: http://code.google.com/intl/zh-TW/edu/parallel/tools/hadoopvm/

Setting up a Hadoop cluster can be an all-day job. A virtual machine image has been created with a preconfigured single-node instance of Hadoop.

A virtual machine encapsulates one operating system within another. (http://developer.yahoo.com/hadoop/tutorial/module3.html)

Page 53: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Hadoop Virtual Image: http://code.google.com/intl/zh-TW/edu/parallel/tools/hadoopvm/

While this doesn't have the power of a full cluster, it does allow you to use the resources on your local machine to explore the Hadoop platform.

The virtual machine image is designed to be used with the free VMware Player.

Hadoop can be run on a single node in a pseudo-distributed mode where each Hadoop daemon runs in a separate Java process.

Page 54: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Setting Up the Image

The image is packaged as a directory archive. To begin setup, extract the image into the directory of your choice (you need at least 10 GB; the disk image can grow to 20 GB).

The VMware image package contains:

image.vmx -- the VMware guest OS profile, a configuration file that describes the virtual machine characteristics (virtual CPU(s), amount of memory, etc.).

20GB.vmdk -- a VMware virtual disk used to store the content of the virtual machine hard disk; this file grows as you store data on the virtual image. It is configured to store up to 20 GB.

The archive contains two other files, image.vmsd and nvram; these are not critical for running the image but are created by the VMware Player on startup.

As you run the virtual machine, log files (vmware-x.log) will be created.

Page 55: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Setting Up the Image

The system image is based on Ubuntu (version 7.04) and contains a Java virtual machine (Sun JRE 6 - DLJ License v1.1) and the latest Hadoop distribution (0.13.0).

A new window will appear which will print a message indicating the IP address allocated to the guest OS. This is the IP address you will use to submit jobs from the command line or the Eclipse environment.

The guest OS contains a running Hadoop infrastructure which is configured with: A GFS (HDFS) infrastructure using a single data node (no replication) A single MapReduce worker

Page 56: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Setting Up the Image

The guest OS can be reached from the provided console or via SSH using the IP address indicated above. Log into the guest OS with: guest login: guest, guest password: guest; administrator login: root, administrator password: root.

Once the image is loaded, you can log in with the guest account. Hadoop is installed in the guest home directory (/home/guest/hadoop). Three scripts are provided for Hadoop maintenance purposes:

start-hadoop -- starts the file-system and MapReduce daemons.
stop-hadoop -- stops all Hadoop daemons.
reset-hadoop -- restarts a new Hadoop environment with an entirely empty file system.

Page 57: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Hadoop 0.20 Install

Introduction
Environment
Install Hadoop
  Download and update packages
  Install Java
  Install and configure SSH
  Install Hadoop
Hadoop example tests

Page 58: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Introduction

Ubuntu is a free operating system, a GNU/Linux distribution derived from Debian. It is maintained by Canonical Ltd., founded by Mark Shuttleworth, the African entrepreneur who was the first to pay his own way into space. The first release (Ubuntu 4.10, Warty Warthog) came out in 2004 and was an immediate success; from 2005 to the present it has been the most popular GNU/Linux distribution. The latest version is Ubuntu 9.04.

Developing for Hadoop requires a good deal of object-oriented syntax, including inheritance and interface classes, and the correct classpath must be imported.

Page 59: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Ubuntu Operating System

Minimum requirements: 300 MHz x86 processor; 64 MB of RAM (the LiveCD needs 256 MB of RAM to run); 4 GB of disk space (including space allocated for swap); a graphics chip capable of 640x480 VGA; a CD-ROM drive or network card.

Recommended requirements: 700 MHz x86 processor; 384 MB of RAM; 8 GB of disk space (including space allocated for swap); a graphics chip capable of 1024x768 VGA; a sound card; an internet connection.

Page 60: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Environment: Ubuntu 9.04 Live CD, sun-java-6, hadoop 0.20.0

Page 61: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Directory layout

User: cfong

User's home directory: /home/cfong

Project directory: /home/cfong/workspace

Hadoop directory: /opt/hadoop

Page 62: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Install Hadoop

There are several ways to install: Hadoop can be run in a virtual machine under VMware Player, or installed on an Ubuntu or CentOS operating system. Here we install it on the Ubuntu 9.04 Live CD operating system.

During installation, be clear about which directory each package is installed into.

Page 63: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Update Packages

$ sudo -i              # switch to the super user
$ sudo apt-get update  # update the package lists
$ sudo apt-get upgrade # upgrade all installed packages

Page 64: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Download Packages

Download hadoop-0.20.0.tar.gz and place it under /opt/:
http://apache.cdpa.nsysu.edu.tw/hadoop/core/hadoop-0.20.0/hadoop-0.20.0.tar.gz

Download the Java SE Development Kit (JDK) documentation, JDK 6 Update 14 (jdk-6u10-docs.zip), and place it under /tmp/:
https://cds.sun.com/is-bin/INTERSHOP.enfinity/WFS/CDS-CDS_Developer-Site/en_US/-/USD/ViewProductDetail-Start?ProductRef=jdk-6u10-docs-oth-JPR@CDS-CDS_Developer

Page 65: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Install Java

Install the basic Java packages:
$ sudo apt-get install java-common sun-java6-bin sun-java6-jdk

Install sun-java6-doc:
$ sudo apt-get install sun-java6-doc

Page 66: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Install and configure SSH

$ apt-get install ssh

$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

$ ssh localhost

Page 67: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Install Hadoop

$ cd /opt

$ sudo tar -zxvf hadoop-0.20.0.tar.gz

$ sudo chown -R cfong:cfong /opt/hadoop-0.20.0

$ sudo ln -sf /opt/hadoop-0.20.0 /opt/hadoop

Page 68: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Environment Variables Setup

nano /opt/hadoop/conf/hadoop-env.sh

export JAVA_HOME=/usr/lib/jvm/java-6-sun

export HADOOP_HOME=/opt/hadoop

export PATH=$PATH:/opt/hadoop/bin
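Note that hadoop-env.sh is only read by the Hadoop scripts themselves; to have the hadoop command on the PATH of your interactive shell you would also add the last two exports to your own shell profile. A small sketch of that (an assumption about intent, not something shown on the original slide):

$ echo 'export HADOOP_HOME=/opt/hadoop' >> ~/.bashrc
$ echo 'export PATH=$PATH:/opt/hadoop/bin' >> ~/.bashrc
$ source ~/.bashrc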

Page 69: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Environment Variables Setup

nano /opt/hadoop/conf/core-site.xml

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop/hadoop-${user.name}</value>
  </property>
</configuration>

Page 70: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Environment Variables Setup

nano /opt/hadoop/conf/hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

Page 71: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Environment Variables Setup

nano /opt/hadoop/conf/mapred-site.xml

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>


Page 73: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Start Hadoop

$ cd /opt/hadoop

$ source /opt/hadoop/conf/hadoop-env.sh

$ hadoop namenode -format

$ start-all.sh

$ hadoop fs -put conf input

$ hadoop fs -ls
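With the daemons running and the conf directory copied into HDFS as "input" above, one simple way to exercise the cluster is the grep example bundled in the release, following the standard 0.20 quick-start (the regular expression is just an example):

$ bin/hadoop jar hadoop-0.20.0-examples.jar grep input output 'dfs[a-z.]+'
$ bin/hadoop fs -cat output/*          # matched terms and their counts

The NameNode and JobTracker web interfaces should then be reachable at http://localhost:50070/ and http://localhost:50030/ (these default ports also appear in the netstat output on Page 79).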

Page 74: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Hadoop Examples

Example 1:
$ cd /opt/hadoop
$ bin/hadoop version

Hadoop 0.20.0
Subversion https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.20 -r 763504
Compiled by ndaley on Thu Apr 9 05:18:40 UTC 2009

Page 75: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Hadoop Examples

Example 2:
$ /opt/hadoop/bin/hadoop jar hadoop-0.20.0-examples.jar pi 4 10000

Number of Maps = 4
Samples per Map = 10000
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Starting Job
09/08/01 06:56:41 INFO mapred.FileInputFormat: Total input paths to process : 4
09/08/01 06:56:42 INFO mapred.JobClient: Running job: job_200908010505_0002
09/08/01 06:56:43 INFO mapred.JobClient: map 0% reduce 0%
09/08/01 06:56:53 INFO mapred.JobClient: map 50% reduce 0%
09/08/01 06:56:56 INFO mapred.JobClient: map 100% reduce 0%
09/08/01 06:57:05 INFO mapred.JobClient: map 100% reduce 100%
09/08/01 06:57:07 INFO mapred.JobClient: Job complete: job_200908010505_0002
09/08/01 06:57:07 INFO mapred.JobClient: Counters: 18
09/08/01 06:57:07 INFO mapred.JobClient: Job Counters
09/08/01 06:57:07 INFO mapred.JobClient: Launched reduce tasks=1
09/08/01 06:57:07 INFO mapred.JobClient: Launched map tasks=4
09/08/01 06:57:07 INFO mapred.JobClient: Data-local map tasks=4
09/08/01 06:57:07 INFO mapred.JobClient: FileSystemCounters
09/08/01 06:57:07 INFO mapred.JobClient: FILE_BYTES_READ=94
09/08/01 06:57:07 INFO mapred.JobClient: HDFS_BYTES_READ=472
09/08/01 06:57:07 INFO mapred.JobClient: FILE_BYTES_WRITTEN=334
09/08/01 06:57:07 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=215
09/08/01 06:57:07 INFO mapred.JobClient: Map-Reduce Framework
09/08/01 06:57:07 INFO mapred.JobClient: Reduce input groups=8
09/08/01 06:57:07 INFO mapred.JobClient: Combine output records=0
09/08/01 06:57:07 INFO mapred.JobClient: Map input records=4
09/08/01 06:57:07 INFO mapred.JobClient: Reduce shuffle bytes=112
09/08/01 06:57:07 INFO mapred.JobClient: Reduce output records=0
09/08/01 06:57:07 INFO mapred.JobClient: Spilled Records=16
09/08/01 06:57:07 INFO mapred.JobClient: Map output bytes=72
09/08/01 06:57:07 INFO mapred.JobClient: Map input bytes=96
09/08/01 06:57:07 INFO mapred.JobClient: Combine input records=0
09/08/01 06:57:07 INFO mapred.JobClient: Map output records=8
09/08/01 06:57:07 INFO mapred.JobClient: Reduce input records=8
Job Finished in 25.84 seconds
Estimated value of Pi is 3.14140000000000000000

Page 76: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Hadoop Examples

Example 3:
$ /opt/hadoop/bin/start-all.sh

localhost: starting datanode, logging to /opt/hadoop/logs/hadoop-root-datanode-cfong-desktop.out
localhost: starting secondarynamenode, logging to /opt/hadoop/logs/hadoop-root-secondarynamenode-cfong-desktop.out
starting jobtracker, logging to /opt/hadoop/logs/hadoop-root-jobtracker-cfong-desktop.out
localhost: starting tasktracker, logging to /opt/hadoop/logs/hadoop-root-tasktracker-cfong-desktop.out

Page 77: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Hadoop Examples

Example 4:
$ /opt/hadoop/bin/stop-all.sh

stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode

Page 78: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Hadoop Examples

Example 5:
$ cd /opt/hadoop
$ jps

20911 JobTracker
20582 DataNode
27281 Jps
20792 SecondaryNameNode
21054 TaskTracker
20474 NameNode

Page 79: 雲端運算虛擬技術 -- 雲端計算資料處理技術 -- Hadoop -- MapReduce 賴智錦 / 詹奇峰 國立高雄大學電機工程學系 2009/08/05

Hadoop Examples

Example 6:
$ sudo netstat -plten | grep java

tcp6 0 0 :::50145 :::* LISTEN 0 141655 20792/java
tcp6 0 0 :::45538 :::* LISTEN 0 142200 20911/java
tcp6 0 0 :::50020 :::* LISTEN 0 143573 20582/java
tcp6 0 0 127.0.0.1:9000 :::* LISTEN 0 139970 20474/java
tcp6 0 0 127.0.0.1:9001 :::* LISTEN 0 142203 20911/java
tcp6 0 0 :::50090 :::* LISTEN 0 143534 20792/java
tcp6 0 0 :::53866 :::* LISTEN 0 140629 20582/java
tcp6 0 0 :::50060 :::* LISTEN 0 143527 21054/java
tcp6 0 0 127.0.0.1:37870 :::* LISTEN 0 143559 21054/java
tcp6 0 0 :::50030 :::* LISTEN 0 143441 20911/java
tcp6 0 0 :::50070 :::* LISTEN 0 143141 20474/java
tcp6 0 0 :::50010 :::* LISTEN 0 143336 20582/java
tcp6 0 0 :::50075 :::* LISTEN 0 143536 20582/java
tcp6 0 0 :::50397 :::* LISTEN 0 139967 20474/java