Hadoop & Ecosystem - An Overview
US: +1 408 556 9645
India: +91 20 661 43 400
Web: http://www.clogeny.com
Email: [email protected]
Innovation → Execution → Solution → Delivered
Hadoop & Ecosystem
http://www.clogeny.com © 2012 Clogeny Technologies
Need for Hadoop?
Store terabytes and petabytes of data
• Reliable, Inexpensive, Accessible, Fault-tolerant
• Distributed File System (HDFS)
Distributed applications are hard to develop and manage
• MapReduce to make distributed computing tractable
Datasets too large for RDBMS
Scale, Scale, Scale – with cheap(er) hardware
SoCoMo
Hadoop = MapReduce + HDFS
What types of problems does Hadoop Solve?
Risk Modeling
• Discover fraud patterns
• Collateral optimization, risk modeling
E-Commerce
• Mine web click logs to detect patterns
• Build and test various customer behavioral models
• Reporting and post-analysis
Recommendation Engines
• Collaborative filtering: user, item, content
• Product, show, movie, job recommendations
Ad Targeting
• Clickstream analysis, forecasting, optimization
• Aggregate various data sources – e.g. the social graph
• Report generation and post-analysis applications
Text Analysis
• Improve search quality using pattern recognition
• Spam filtering
MapReduce
Basic functional operations:
• Map
• Reduce
Map and Reduce do not modify their input; they generate new data, leaving the original dataset unmodified
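The two primitives are the same functional operations found outside Hadoop as well. A minimal plain-Java sketch of the idea (no Hadoop APIs, just java.util.stream) showing that map and reduce produce new data while the input stays untouched:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class FunctionalSketch {
    // map: transform each element into a new value; the input list is untouched
    static List<Integer> lengths(List<String> words) {
        return words.stream().map(String::length).collect(Collectors.toList());
    }

    // reduce: fold the mapped values into a single result
    static int totalLength(List<String> words) {
        return words.stream().map(String::length).reduce(0, Integer::sum);
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("hadoop", "hdfs", "mapreduce");
        System.out.println(lengths(words));      // a new list; words is unmodified
        System.out.println(totalLength(words));  // 6 + 4 + 9 = 19
    }
}
```

Hadoop applies exactly this pattern, but with the map calls spread across the cluster and the reduce calls grouped per key.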
HDFS
Based on Google File System
Write Once Read Many Access Model
Splits input data into large blocks – typically 64 MB or 128 MB (small-file I/O is inefficient)
Blocks are replicated across nodes to handle node failures
NameNode: holds filesystem metadata, manages block replication and read/write access
DataNode: stores the actual data (clients read from the nearest DataNode)
Data pipelining: the client writes to the first DataNode, which replicates to the next DataNode in the pipeline
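A back-of-envelope sketch of the block arithmetic (plain Java; the 1000 MB file size is a hypothetical example, and 3x is HDFS's default replication factor):

```java
public class HdfsBlockMath {
    static final long MB = 1024L * 1024L;

    // number of blocks a file occupies (the last block may be partial)
    static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    // raw bytes consumed across the cluster with N replicas of every block
    static long rawBytes(long fileBytes, int replicas) {
        return fileBytes * replicas;
    }

    public static void main(String[] args) {
        long file = 1000 * MB;                           // hypothetical 1000 MB file
        System.out.println(blockCount(file, 64 * MB));   // 16 blocks at 64 MB
        System.out.println(blockCount(file, 128 * MB));  // 8 blocks at 128 MB
        System.out.println(rawBytes(file, 3) / MB);      // 3000 MB raw with 3x replication
    }
}
```

The larger block size halves the number of blocks the NameNode must track, which is why HDFS favors big blocks over many small files.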
HDFS Architecture
Hadoop Architecture
JobTracker: farms out MapReduce tasks to nodes in the cluster – ideally nodes holding the data, or at least nodes in the same rack. Talks to the NameNode to locate the data.
TaskTracker: a node in the cluster that accepts tasks (Map, Reduce, Shuffle) from a JobTracker
Example
WORDCOUNT
Read text files and count how often words occur.
• The input is text files
• The output is a text file
each line: word, tab, count
Map: Produce pairs of (word, count)
Reduce: For each word, sum up the counts.
GREP
Search input files for a given pattern
Map: emits a line if pattern is matched
Reduce: Copies results to output
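The grep job's semantics can be simulated in a few lines of plain Java (no Hadoop APIs; the input lines and the pattern are hypothetical): the "map" step emits a line only when the pattern matches, and the "reduce" step is just an identity copy.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class GrepSketch {
    // "map": emit a line only if the pattern matches; "reduce" is the identity
    static List<String> grep(List<String> lines, String regex) {
        Pattern p = Pattern.compile(regex);
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            if (p.matcher(line).find()) {
                out.add(line);  // matched lines are copied to the output
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("error: disk full", "ok", "error: timeout");
        System.out.println(grep(input, "error"));  // keeps only the two error lines
    }
}
```

In the real Hadoop job the input lines arrive one at a time in map() calls spread over the cluster, but the per-line logic is the same.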
Wordcount Mapper
public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}
Wordcount Reducer
public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
Apache Hive
Data Warehousing System for Hadoop
Facilitates data summarization, ad-hoc queries and analysis of large datasets
Query using an SQL-like language called HiveQL; custom mappers and reducers can also be plugged in
Hive tables can be defined directly on HDFS files via SerDe, customized formats
Tables can be partitioned and data loaded separately in each partition for scale.
Tables can be clustered based on certain columns for query performance.
The schema is stored in an RDBMS. Has complex column types like map, array, struct in addition to atomic types.
Apache Hive
Storage Format Example
CREATE TABLE mylog (
uhash BIGINT,
page_url STRING,
unix_time INT)
STORED AS TEXTFILE;
LOAD DATA INPATH '/user/log.txt' INTO TABLE mylog;
Supports text file, sequence files, extensible for other formats.
Serialized Formats: Delimited, Thrift protocol
Deserialized: Java Integer/String/Array/HashMap, Hadoop Writable Classes, Thrift – user defined classes
CREATE TABLE mylog (
uhash BIGINT, page_url STRING, unix_time INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
STORED AS RCFILE;
Apache Hive
HiveQL – JOIN
INSERT OVERWRITE TABLE pv_users
SELECT pv.pageid, u.age_bkt
FROM pageview pv
JOIN user u
ON (pv.uhash = u.uhash)
HiveQL – GROUP BY
SELECT pageid, age_bkt, count(1)
FROM pv_users
GROUP BY pageid, age_bkt
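Under the covers the GROUP BY runs as a MapReduce job: mappers emit (pageid, age_bkt) as the key, the shuffle brings identical keys together, and reducers sum the counts. A minimal plain-Java sketch of that semantics (hypothetical rows, no Hadoop or Hive APIs; map, shuffle, and reduce are collapsed into one pass):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GroupBySketch {
    // count rows per (pageid, age_bkt) pair, as the GROUP BY would
    static Map<String, Integer> groupCount(List<String[]> rows) {
        Map<String, Integer> counts = new HashMap<>();
        for (String[] row : rows) {
            String key = row[0] + "/" + row[1];  // composite grouping key, like the shuffle key
            counts.merge(key, 1, Integer::sum);  // the "reducer" sums the emitted 1s
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String[]> pvUsers = Arrays.asList(   // hypothetical pv_users rows
            new String[]{"1", "25-34"},
            new String[]{"1", "25-34"},
            new String[]{"2", "18-24"});
        System.out.println(groupCount(pvUsers));  // pageid 1 / bucket 25-34 seen twice
    }
}
```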
Apache Hive
Group By in Map Reduce
Apache Pig
High-level interface for Hadoop – Pig Latin is an SQL-like data-flow language for analyzing large data sets
Pig compiler produces sequences of Map-Reduce programs
Makes it easy to develop parallel processing programs – much easier than developing MR code directly
Interactive Shell - Grunt
User Defined Functions (UDFs) – specify custom processing in Java, Python, Ruby
Implement EVAL, AGGREGATE, FILTER functions
Apache Pig – Word Count
input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);
-- Extract words from each line and put them into a pig bag
-- datatype, then flatten the bag to get one word on each row
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- filter out any words that are just white spaces
filtered_words = FILTER words BY word MATCHES '\\w+';
-- create a group for each word
word_groups = GROUP filtered_words BY word;
-- count the entries in each group
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
-- order the records by count
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';
Sqoop
Tool to transfer data between Hadoop and relational databases like MySQL or Oracle.
Sqoop automates the process and takes the schema from the RDBMS.
Uses MapReduce to transfer the data, making the process parallel and fault-tolerant.
$ sqoop import --connect jdbc:mysql://database.example.com/employees
$ sqoop import --connect <connect-str> --table foo --warehouse-dir /shared
$ sqoop import-all-tables --connect jdbc:mysql://db.foo.com/corp
Supports incremental import of data – you can schedule imports
Can also import data into Hive, HBase
Flume
Collecting data from various sources with ugly, unmaintainable scripts?
Flume - Distributed Data Collection Service
Scalable, Configurable, Robust
Collects data in various formats and delivers it (generally) to HDFS
Flows: each flow corresponds to a type of data source
Different flows have different compression, reliability config
Flows are comprised of nodes chained together
Each node receives data at its source and sends it to a sink
Flume
Example flows within a Flume service
Agents & Collectors
HCatalog
So much data that managing the metadata becomes difficult
Multiple tools & formats – MR, Pig, Hive. How to share data easily?
Ops needs to manage data storage, the cluster, data expiry, data replication, import, export
Register data through HCatalog
Uses Hive’s metastore
A = load 'raw' using HCatLoader();
B = filter A by ds='20110225' and region='us' and property='news';
Pig no longer cares where "raw" is stored, so its location can be changed; no hard-coded data model is needed.
Oozie – Workflow Scheduler
Workflow scheduler system to manage Hadoop jobs
Create a Directed Acyclic Graph (DAG) of actions
Supports various types of Hadoop jobs like Java MR, Streaming MR, Pig, Hive, Sqoop as well as shell scripts and Java programs.
Create a flow-chart of your data analysis process and let Oozie coordinate it
Time & data dependencies
XML based config
Simple GUI to track jobs and workflows
Oozie – Workflow Scheduler
$ oozie job -oozie http://localhost:8080/oozie -config examples/apps/map-reduce/job.properties -run
job: 14-20090525161321-oozie-tucu
Check the workflow job status:
$ oozie job -oozie http://localhost:8080/oozie -info 14-20090525161321-oozie-tucu
Workflow Name : map-reduce-wf
App Path : hdfs://localhost:9000/user/tucu/examples/apps/map-reduce
Status : SUCCEEDED
Run : 0
User : tucu
Group : users
Created : 2009-05-26 05:01 +0000
Started : 2009-05-26 05:01 +0000
Ended : 2009-05-26 05:01 +0000
Schedulers
All teams want access to the Hadoop cluster: the analytics team starts a query and takes down the prod cluster
How to allocate resources within a cluster across teams and users?
How to assign priority to jobs?
Fair Scheduler:
Jobs are grouped into pools
Each pool has a guaranteed minimum share
Excess capacity is split between jobs
Capacity Scheduler:
Queues, each allocated a share of the available capacity (map & reduce slots)
Priorities for jobs; available capacity is shared across queues
e.g. one queue per team
No preemption
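The fair-share idea can be sketched in a few lines: each pool first gets its guaranteed minimum, and the leftover slots are split among the pools. This is a deliberately simplified sketch with hypothetical pool names and slot counts; the real Fair Scheduler also honors pool weights and each pool's actual demand.

```java
import java.util.HashMap;
import java.util.Map;

public class FairShareSketch {
    // Simplified: every pool receives its guaranteed minimum, then the
    // surplus is divided equally. (The real scheduler also weights pools
    // and caps each pool at its demand.)
    static Map<String, Integer> allocate(Map<String, Integer> minShares, int totalSlots) {
        int reserved = minShares.values().stream().mapToInt(Integer::intValue).sum();
        int surplus = Math.max(0, totalSlots - reserved);
        int perPool = surplus / minShares.size();
        Map<String, Integer> alloc = new HashMap<>();
        for (Map.Entry<String, Integer> e : minShares.entrySet()) {
            alloc.put(e.getKey(), e.getValue() + perPool);
        }
        return alloc;
    }

    public static void main(String[] args) {
        Map<String, Integer> mins = new HashMap<>();
        mins.put("analytics", 10);  // hypothetical guaranteed minimums
        mins.put("prod", 30);
        System.out.println(allocate(mins, 60));  // 20 surplus slots split between the pools
    }
}
```

With 60 slots, the pools' minimums (10 + 30) are honored first and the remaining 20 slots are shared, so neither team can starve the other.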
Ambari
Deployment, monitoring and management of Hadoop clusters is hard!
Ambari provides configuration management and deployment using Puppet
Cluster configuration is a data object
Reliable repeatable deployment
Central point to manage services
Uses Ganglia and Nagios for monitoring and alerting
AD & LDAP Integration
Visualization of the cluster state and Hadoop jobs over time
Apache Mahout
Machine Learning Library
Leverages MapReduce for clustering, classification, and collaborative-filtering algorithms
Recommendation: takes user behavior as input and suggests items the user might like
• Netflix
Clustering: Group objects into clusters
• Create groups of users on social media
Classification: learns from categorized documents and assigns unlabeled documents to the correct categories
• Recommend ads to users, classify text into categories
Serengeti
I just implemented virtualization in my datacenter. Can I use Hadoop on virtualized resources?
Yes – Serengeti allows you to do that!
Get much more agility – at some cost in performance
Improve availability of your Hadoop Cluster
Share capacity across Hadoop clusters and other applications
Hadoop Virtual Extensions to make Hadoop virtualization-aware
Hadoop as a Service: integrate with VMware Chargeback and bill accordingly
Platforms - Hortonworks
Dell – Cloudera Stack
Amazon Elastic MapReduce
Cloud platform (EC2 & S3) providing Hadoop as a Service
Provision as much capacity as you like for as much time as you like
Integration with S3 (Cloud Storage) – store your persistent data in S3
Run your app or service in the cloud with the Hadoop cluster in the same place
Multiple locations – you have an on-demand Hadoop cluster across 5 continents
MapR, KarmaSphere, etc. tools are available on the AWS platform itself.
Use expensive BI tools in the cloud on a pay-as-you-go basis