Download - Hadoop tech share

Dylan Valerio

Hadoop Tech Share

Agenda

• Overview

• Demo

• Applications

• Configuration

Data!

• NYSE = 1 TB/day (10^12)

• FB = 10B photos = 1 PB (10^15)

• Ancestry.com = 2.5 PB

• Large Hadron Collider = 15 PB/year

Hadoop: The Definitive Guide

Storage and Analysis

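• Storage capacity has increased, but disk I/O speed has not increased proportionally, so reading a full drive takes longer than it used to.

• Disk failure becomes routine once many disks are used together, so data must be replicated.

• Analysis needs to combine large chunks of data that may sit on different disks.

• Network bandwidth is limited, especially for big data, so it is cheaper to move the computation to the data.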
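The gap between capacity and I/O speed can be made concrete with a little arithmetic. The sketch below is illustrative only: the 1 TB drive size and the 100 MB/s transfer rate are assumed round numbers, and the function name is made up for this example.

```python
# Back-of-the-envelope read times. The drive size and transfer rate are
# illustrative assumptions, not measurements.
DRIVE_BYTES = 10**12                 # 1 TB of data (assumed)
TRANSFER_BYTES_PER_S = 100 * 10**6   # 100 MB/s sustained read (assumed)


def read_seconds(total_bytes, bytes_per_second, drives=1):
    """Time to read total_bytes when it is split evenly over `drives`
    disks that are all read in parallel."""
    return total_bytes / (bytes_per_second * drives)


single = read_seconds(DRIVE_BYTES, TRANSFER_BYTES_PER_S)
parallel = read_seconds(DRIVE_BYTES, TRANSFER_BYTES_PER_S, drives=100)

print(f"one drive:  {single / 3600:.1f} hours")
print(f"100 drives: {parallel / 60:.1f} minutes")
```

Under these assumptions a single drive needs close to three hours, while a hundred drives reading in parallel need under two minutes, which is the motivation for spreading data over many disks.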


Hadoop

• Distributed computing framework to process large amounts of data.

– Accessible – runs on large clusters of commodity machines

– Robust – assume disk failures

– Scalable – add nodes

– Simple – simple MapReduce code

Hadoop in Action

HDFS

• Name Node – master of the HDFS. Directs I/O. “The Index”

– Secondary Name Node – takes periodic checkpoints of the Name Node's metadata; it is not a hot standby

• Data Node – where actual data is stored and replicated

Hadoop In Action

MapReduce

• MR is a programming framework that processes data by keys and values.

• The mapper turns each input record into intermediate key/value pairs, while the reducer aggregates the values collected for each key.

• Mappers and reducers do not directly communicate with each other.
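The key/value flow can be sketched in plain Python without a cluster. This is a single-process simulation, not Hadoop's API: the function names are hypothetical, and `shuffle` stands in for the grouping step that the framework performs between mappers and reducers.

```python
from collections import defaultdict


def map_words(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Stand-in for the framework's shuffle: group values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reducer: collapse all values for one key into a single total."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_words(line))
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())


counts = run_job(["the quick brown fox", "the lazy dog", "The end"])
print(counts["the"])  # 3
```

The mapper and reducer never call each other; they only share the intermediate pairs, which is what lets a real cluster run them on different machines.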


Huy Vo, NYU

Jobs

• Map tasks and reduce tasks are assigned throughout the cluster.

• Job Tracker manages the status of the job(s).

• Each Task Tracker manages the tasks assigned to it.


Architecture

Hadoop in Action

Demo

• Word Count – the “Hello World” of MapReduce.

• Distributed GREP – Sampling Pattern

• Top Child Star – Summarization Pattern
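Distributed grep is the simplest of these demos to sketch, because it needs only a map phase. The code below is an assumption about how such a demo might look, not the code shown in the talk; the sample lines and the `grep_mapper` name are invented for illustration.

```python
import re


def grep_mapper(lines, pattern):
    """Map-only job: emit each input line that matches the pattern.
    No reducer is needed because matching lines are the final output."""
    regex = re.compile(pattern)
    for line in lines:
        if regex.search(line):
            yield line


# Two lists stand in for two input splits handled by separate map tasks.
split_a = ["INFO start", "ERROR disk full", "INFO done"]
split_b = ["ERROR timeout", "WARN slow"]

matches = [hit for split in (split_a, split_b)
           for hit in grep_mapper(split, r"^ERROR")]
print(matches)  # ['ERROR disk full', 'ERROR timeout']
```

Each split is filtered independently, so adding more map tasks scales the search without any coordination between them.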

Bit of History

• Doug Cutting and the Apache Lucene Team

• Google File System (2003) and MapReduce (2004) papers

• Cutting joined Yahoo! (2006).

• Yahoo! announced that its search index was being generated by a 10,000-core Hadoop cluster (2008).


Hadoop Stack

• Core – I/O, serialization, Java RPC

• Avro – data serialization

• MapReduce

• HDFS

• Pig – high-level data flow language for exploring large datasets on HDFS & MR clusters

• HBase – distributed column-oriented DB

• ZooKeeper – distributed coordination service

• Hive – distributed data warehouse + SQL-like query

• Chukwa – distributed data collection and reporting

• Mahout – collection of scalable ML algorithms that run on Hadoop clusters


Configuration Checklist

• Rack management

• Java Installation

• Hadoop download and shell environment tweaking

• SSH + VIM

• Default configuration files:

– core-site.xml, hdfs-site.xml, mapred-site.xml

• Formatting the HDFS

• start-all.sh
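As a rough illustration of what goes into the three site files, the sketch below renders a minimal single-node configuration. It assumes a Hadoop 1.x style setup (the generation that uses a Job Tracker); the property names come from that generation's single-node setup, while the host names, ports, and the `render_site_xml` helper are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Property names follow the Hadoop 1.x single-node setup; the host and
# port values are assumptions for a local test machine.
SITE_FILES = {
    "core-site.xml": {"fs.default.name": "hdfs://localhost:9000"},
    "hdfs-site.xml": {"dfs.replication": "1"},
    "mapred-site.xml": {"mapred.job.tracker": "localhost:9001"},
}


def render_site_xml(properties):
    """Build the <configuration><property>...</property></configuration>
    layout used by Hadoop site files."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


for filename, properties in SITE_FILES.items():
    print(filename)
    print(render_site_xml(properties))
```

Later Hadoop releases renamed several of these properties, so the names should be checked against the documentation for the installed version.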

Hadoop Shell Commands

• hadoop fs -ls

• hadoop jar <jar file> <main class> <input params>

Web-Based Cluster UI

• localhost:50070 – DFS administration (Name Node)

• localhost:50030 – Job administration (Job Tracker)

Hadoop for other languages

• Hadoop streaming uses Unix standard streams.

– So you can use bash scripts, Ruby, Python, etc.

• Hadoop Pipes is a C++ interface to MR.
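A streaming job exchanges plain text lines, with the key and value separated by a tab. The following is a minimal sketch of a word-count mapper and reducer written in that style; the function names are hypothetical, and the local dry run uses `sorted()` as a stand-in for the sort that the framework performs between the two stages.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, the default key/value layout
    that Hadoop Streaming reads from a mapper's standard output."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts for each word. Reducer input arrives sorted by key,
    so equal keys sit on consecutive lines and groupby can collect them."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local dry run: sorted() stands in for the shuffle/sort between stages.
sample = ["to be or not", "to be"]
result = list(reducer(sorted(mapper(sample))))
print(result)  # ['be\t2', 'not\t1', 'or\t1', 'to\t2']
```

In an actual streaming job, each function would be placed in its own script that loops over `sys.stdin` and prints its results, and the two scripts would be passed to the streaming jar through its `-mapper` and `-reducer` options.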

Benefits to AC Technologies Discussion

Report Generation

• Suppose we have HBase for:

– High Availability: Distributed DB

– Partition Tolerance: Auto-Sharding

– Scalability: Horizontal Scaling

• Then, common scenarios will be:

– Service Management & Monitoring:

• Partitioning by month

• Binning by functional category

• Sampling by file status

– Harvest DB

• Top harvested files per day, per site
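The "top harvested files per day, per site" report is a summarization job: group by a composite key, then rank within each group. The sketch below shows the idea in plain Python. The record layout, the sample data, and the function names are all hypothetical, since the actual Harvest DB schema is not described here.

```python
from collections import Counter, defaultdict

# Hypothetical harvest records: (day, site, file name).
records = [
    ("2013-05-01", "site-a", "a.pdf"),
    ("2013-05-01", "site-a", "a.pdf"),
    ("2013-05-01", "site-a", "b.pdf"),
    ("2013-05-01", "site-b", "c.pdf"),
    ("2013-05-02", "site-a", "b.pdf"),
]


def map_record(record):
    """Mapper: key each harvest by (day, site) and emit the file name."""
    day, site, filename = record
    return (day, site), filename


def reduce_top(filenames, n=1):
    """Reducer: count files within one (day, site) group, keep the top n."""
    return Counter(filenames).most_common(n)


# Group mapper output by key, as the shuffle phase would.
groups = defaultdict(list)
for key, filename in map(map_record, records):
    groups[key].append(filename)

top = {key: reduce_top(names) for key, names in groups.items()}
print(top[("2013-05-01", "site-a")])  # [('a.pdf', 2)]
```

Partitioning by month or binning by functional category follows the same shape; only the key emitted by the mapper changes.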

Log Mining

• Suppose we have a common repository for all log files (Zenoss)

– Exception counting (WARN – FATAL level)

– Info-level reporting
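Exception counting could look roughly like the sketch below, which tallies log lines at WARN level and above. The log layout, the level names, and the sample lines are assumptions for illustration; real log files from the repository would need their own pattern.

```python
import re
from collections import Counter

# Assumed log layout: "<date> <time> <LEVEL> <message>". Real logs will differ.
LEVEL_RE = re.compile(r"\b(TRACE|DEBUG|INFO|WARN|ERROR|FATAL)\b")
ALERT_LEVELS = ("WARN", "ERROR", "FATAL")


def map_level(line):
    """Mapper: emit (level, 1) for lines at WARN or above, else nothing."""
    match = LEVEL_RE.search(line)
    if match and match.group(1) in ALERT_LEVELS:
        yield match.group(1), 1


def count_levels(lines):
    """Reducer stand-in: total the emitted pairs per level."""
    totals = Counter()
    for line in lines:
        for level, one in map_level(line):
            totals[level] += one
    return totals


log = [
    "2013-05-01 10:00:01 INFO service started",
    "2013-05-01 10:00:05 WARN retrying connection",
    "2013-05-01 10:00:09 ERROR connection refused",
    "2013-05-01 10:00:12 WARN retrying connection",
]
totals = count_levels(log)
print(totals["WARN"], totals["ERROR"], totals["FATAL"])  # 2 1 0
```

Info-level reporting would reuse the same mapper with a different set of levels to keep.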

Analytics-Driven Decision Making

• Application Influence Mining through Akamai Logs

– Prevalent and isolated applications over the whole client base

– Prevalent and isolated applications over a single organization

– Projections of Application Influence over time
