Hadoop & Ecosystem - An Overview
US: +1 408 556 9645
India: +91 20 661 43 400
Web: http://www.clogeny.com
Email: [email protected]
Innovation → Execution → Solution → Delivered
Hadoop & Ecosystem
http://www.clogeny.com © 2012 Clogeny Technologies
Need for Hadoop?
Store terabytes and petabytes of data
• Reliable, Inexpensive, Accessible, Fault-tolerant
• Distributed File System (HDFS)
Distributed applications are hard to develop and manage
• MapReduce to make distributed computing tractable
Datasets too large for RDBMS
Scale, Scale, Scale – with cheap(er) hardware
SoCoMo
Hadoop = MapReduce + HDFS
What types of problems does Hadoop Solve?
Risk Modeling
• Discover fraud patterns
• Collateral optimization, risk modeling
E-Commerce
• Mine web click logs to detect patterns
• Build and test various customer behavioral models
• Reporting and post-analysis
Recommendation Engines
• Collaborative filtering: user, item, content
• Product, show, movie, job recommendations
Ad Targeting
• Clickstream analysis, forecasting, optimization
• Aggregate various data sources – e.g. the social graph
• Report generation and post-analysis applications
Text Analysis
• Improve search quality using pattern recognition
• Spam filtering
MapReduce
Basic functional operations:
• Map
• Reduce
Map and Reduce do not modify their input; they generate new data, leaving the original dataset unmodified
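The two primitives are the same functional operations found outside Hadoop as well. A minimal plain-Java sketch of the idea (no Hadoop APIs, just java.util.stream) showing that map and reduce produce new data while the input stays untouched:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class FunctionalSketch {
    // map: transform each element into a new value; the input list is untouched
    static List<Integer> lengths(List<String> words) {
        return words.stream().map(String::length).collect(Collectors.toList());
    }

    // reduce: fold the mapped values into a single result
    static int totalLength(List<String> words) {
        return words.stream().map(String::length).reduce(0, Integer::sum);
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("hadoop", "hdfs", "mapreduce");
        System.out.println(lengths(words));      // a new list; words is unmodified
        System.out.println(totalLength(words));  // 6 + 4 + 9 = 19
    }
}
```

Hadoop applies exactly this pattern, but with the map calls spread across the cluster and the reduce calls grouped per key.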
HDFS
Based on Google File System
Write Once Read Many Access Model
Splits input data into large blocks – typically 64 MB or 128 MB (small-file I/O is inefficient)
Blocks are replicated across nodes to handle node failures
NameNode: holds filesystem metadata, manages block replication and read/write access
DataNode: stores the actual data (clients read from the nearest DataNode)
Data pipelining: the client writes to the first DataNode, which replicates to the next DataNode in the pipeline
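A back-of-envelope sketch of the block arithmetic (plain Java; the 1000 MB file size is a hypothetical example, and 3x is HDFS's default replication factor):

```java
public class HdfsBlockMath {
    static final long MB = 1024L * 1024L;

    // number of blocks a file occupies (the last block may be partial)
    static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    // raw bytes consumed across the cluster with N replicas of every block
    static long rawBytes(long fileBytes, int replicas) {
        return fileBytes * replicas;
    }

    public static void main(String[] args) {
        long file = 1000 * MB;                           // hypothetical 1000 MB file
        System.out.println(blockCount(file, 64 * MB));   // 16 blocks at 64 MB
        System.out.println(blockCount(file, 128 * MB));  // 8 blocks at 128 MB
        System.out.println(rawBytes(file, 3) / MB);      // 3000 MB raw with 3x replication
    }
}
```

The larger block size halves the number of blocks the NameNode must track, which is why HDFS favors big blocks over many small files.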
HDFS Architecture
Hadoop Architecture
JobTracker: farms out MapReduce tasks to nodes in the cluster – ideally nodes holding the data, or at least nodes in the same rack. Talks to the NameNode to locate the data.
TaskTracker: a node in the cluster that accepts tasks (Map, Reduce, Shuffle) from a JobTracker
Example
WORDCOUNT
Read text files and count how often words occur.
• The input is text files
• The output is a text file
each line: word, tab, count
Map: Produce pairs of (word, count)
Reduce: For each word, sum up the counts.
GREP
Search input files for a given pattern
Map: emits a line if pattern is matched
Reduce: Copies results to output
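The grep job's semantics can be simulated in a few lines of plain Java (no Hadoop APIs; the input lines and the pattern are hypothetical): the "map" step emits a line only when the pattern matches, and the "reduce" step is just an identity copy.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class GrepSketch {
    // "map": emit a line only if the pattern matches; "reduce" is the identity
    static List<String> grep(List<String> lines, String regex) {
        Pattern p = Pattern.compile(regex);
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            if (p.matcher(line).find()) {
                out.add(line);  // matched lines are copied to the output
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("error: disk full", "ok", "error: timeout");
        System.out.println(grep(input, "error"));  // keeps only the two error lines
    }
}
```

In the real Hadoop job the input lines arrive one at a time in map() calls spread over the cluster, but the per-line logic is the same.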
Wordcount Mapper
public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}
Wordcount Reducer
public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
Apache Hive
Data Warehousing System for Hadoop
Facilitates data summarization, ad-hoc queries and analysis of large datasets
Query using an SQL-like language called HiveQL; custom mappers and reducers can also be plugged in
Hive tables can be defined directly on HDFS files via SerDe, customized formats
Tables can be partitioned and data loaded separately in each partition for scale.
Tables can be clustered based on certain columns for query performance.
The schema is stored in an RDBMS. Has complex column types like map, array, struct in addition to atomic types.
Apache Hive
Storage Format Example
CREATE TABLE mylog (
uhash BIGINT,
page_url STRING,
unix_time INT)
STORED AS TEXTFILE;
LOAD DATA INPATH '/user/log.txt' INTO TABLE mylog;
Supports text file, sequence files, extensible for other formats.
Serialized Formats: Delimited, Thrift protocol
Deserialized: Java Integer/String/Array/HashMap, Hadoop Writable Classes, Thrift – user defined classes
CREATE TABLE mylog (
uhash BIGINT, page_url STRING, unix_time INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
STORED AS RCFILE;
Apache Hive
HiveQL – JOIN
INSERT OVERWRITE TABLE pv_users
SELECT pv.pageid, u.age_bkt
FROM pageview pv
JOIN user u
ON (pv.uhash = u.uhash)
HiveQL – GROUP BY
SELECT pageid, age_bkt, count(1)
FROM pv_users
GROUP BY pageid, age_bkt
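Under the covers the GROUP BY runs as a MapReduce job: mappers emit (pageid, age_bkt) as the key, the shuffle brings identical keys together, and reducers sum the counts. A minimal plain-Java sketch of that semantics (hypothetical rows, no Hadoop or Hive APIs; map, shuffle, and reduce are collapsed into one pass):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GroupBySketch {
    // count rows per (pageid, age_bkt) pair, as the GROUP BY would
    static Map<String, Integer> groupCount(List<String[]> rows) {
        Map<String, Integer> counts = new HashMap<>();
        for (String[] row : rows) {
            String key = row[0] + "/" + row[1];  // composite grouping key, like the shuffle key
            counts.merge(key, 1, Integer::sum);  // the "reducer" sums the emitted 1s
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String[]> pvUsers = Arrays.asList(   // hypothetical pv_users rows
            new String[]{"1", "25-34"},
            new String[]{"1", "25-34"},
            new String[]{"2", "18-24"});
        System.out.println(groupCount(pvUsers));  // pageid 1 / bucket 25-34 seen twice
    }
}
```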
Apache Hive
Group By in Map Reduce
Apache Pig
High-level interface for Hadoop – Pig Latin is an SQL-like data-flow language for analyzing large data sets
Pig compiler produces sequences of Map-Reduce programs
Makes it easy to develop parallel processing programs – much easier than developing MR code directly
Interactive Shell - Grunt
User Defined Functions (UDFs) – specify custom processing in Java, Python, Ruby
Implement EVAL, AGGREGATE, FILTER functions
Apache Pig – Word Count
input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);
-- Extract words from each line and put them into a pig bag
-- datatype, then flatten the bag to get one word on each row
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- filter out any words that are just white spaces
filtered_words = FILTER words BY word MATCHES '\\w+';
-- create a group for each word
word_groups = GROUP filtered_words BY word;
-- count the entries in each group
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
-- order the records by count
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';
Sqoop
Tool to transfer data between Hadoop and relational databases like MySQL or Oracle.
Sqoop automates the process and takes the schema from the RDBMS.
Uses MapReduce to transfer the data, making the process parallel and fault-tolerant.
$ sqoop import --connect jdbc:mysql://database.example.com/employees
$ sqoop import --connect <connect-str> --table foo --warehouse-dir /shared
$ sqoop import-all-tables --connect jdbc:mysql://db.foo.com/corp
Supports incremental import of data – you can schedule imports
Can also import data into Hive, HBase
Flume
Collecting data from various sources with ugly, unmaintainable scripts?
Flume - Distributed Data Collection Service
Scalable, Configurable, Robust
Collects data in various formats and delivers it (generally) to HDFS
Flows: each flow corresponds to a type of data source
Different flows have different compression, reliability config
Flows are comprised of nodes chained together
Each node receives data at its source and sends it to a sink
Flume
Example flows within a Flume service
Agents & Collectors
HCatalog
So much data that managing the metadata becomes difficult
Multiple tools & formats – MR, Pig, Hive. How to share data easily?
Ops needs to manage data storage, the cluster, data expiry, data replication, import, export
Register data through HCatalog
Uses Hive’s metastore
A = load 'raw' using HCatLoader();
B = filter A by ds='20110225' and region='us' and property='news';
Pig no longer cares where "raw" is stored, so its location can be changed; no hard-coded data model is needed.
Oozie – Workflow Scheduler
Workflow scheduler system to manage Hadoop jobs
Create a Directed Acyclic Graph (DAG) of actions
Supports various types of Hadoop jobs like Java MR, Streaming MR, Pig, Hive, Sqoop as well as shell scripts and Java programs.
Create a flow-chart of your data analysis process and let Oozie coordinate it
Time & data dependencies
XML based config
Simple GUI to track jobs and workflows
Oozie – Workflow Scheduler
$ oozie job -oozie http://localhost:8080/oozie -config examples/apps/map-reduce/job.properties -run
job: 14-20090525161321-oozie-tucu
Check the workflow job status:
$ oozie job -oozie http://localhost:8080/oozie -info 14-20090525161321-oozie-tucu
Workflow Name : map-reduce-wf
App Path : hdfs://localhost:9000/user/tucu/examples/apps/map-reduce
Status : SUCCEEDED
Run : 0
User : tucu
Group : users
Created : 2009-05-26 05:01 +0000
Started : 2009-05-26 05:01 +0000
Ended : 2009-05-26 05:01 +0000
Schedulers
All teams want access to the Hadoop cluster: the analytics team starts a query and takes down the prod cluster
How to allocate resources within a cluster across teams and users?
How to assign priority to jobs?
Fair Scheduler:
Jobs are grouped into pools
Each pool has a guaranteed minimum share
Excess capacity is split between jobs
Capacity Scheduler:
Queues, each allocated a share of the available capacity (map & reduce slots)
Priorities for jobs; available capacity is shared across queues
e.g. one queue per team
No preemption
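The fair-share idea can be sketched in a few lines: each pool first gets its guaranteed minimum, and the leftover slots are split among the pools. This is a deliberately simplified sketch with hypothetical pool names and slot counts; the real Fair Scheduler also honors pool weights and each pool's actual demand.

```java
import java.util.HashMap;
import java.util.Map;

public class FairShareSketch {
    // Simplified: every pool receives its guaranteed minimum, then the
    // surplus is divided equally. (The real scheduler also weights pools
    // and caps each pool at its demand.)
    static Map<String, Integer> allocate(Map<String, Integer> minShares, int totalSlots) {
        int reserved = minShares.values().stream().mapToInt(Integer::intValue).sum();
        int surplus = Math.max(0, totalSlots - reserved);
        int perPool = surplus / minShares.size();
        Map<String, Integer> alloc = new HashMap<>();
        for (Map.Entry<String, Integer> e : minShares.entrySet()) {
            alloc.put(e.getKey(), e.getValue() + perPool);
        }
        return alloc;
    }

    public static void main(String[] args) {
        Map<String, Integer> mins = new HashMap<>();
        mins.put("analytics", 10);  // hypothetical guaranteed minimums
        mins.put("prod", 30);
        System.out.println(allocate(mins, 60));  // 20 surplus slots split between the pools
    }
}
```

With 60 slots, the pools' minimums (10 + 30) are honored first and the remaining 20 slots are shared, so neither team can starve the other.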
Ambari
Deployment, monitoring and management of Hadoop clusters is hard!
Ambari provides configuration management and deployment using Puppet
Cluster configuration is a data object
Reliable repeatable deployment
Central point to manage services
Uses Ganglia and Nagios for monitoring and alerting
AD & LDAP Integration
Visualization of the cluster state and Hadoop jobs over time
Apache Mahout
Machine Learning Library
Leverages MapReduce for clustering, classification, and collaborative-filtering algorithms
Recommendation: takes user behavior as input and suggests items the user might like
• Netflix
Clustering: Group objects into clusters
• Create groups of users on social media
Classification: learns from categorized documents and assigns unlabeled documents to the correct categories
• Recommend ads to users, classify text into categories
Serengeti
I just implemented virtualization in my datacenter. Can I use Hadoop on virtualized resources?
Yes – Serengeti allows you to do that!
Get much more agility – at some cost in performance
Improve availability of your Hadoop Cluster
Share capacity across Hadoop clusters and other applications
Hadoop Virtual Extensions to make Hadoop virtualization-aware
Hadoop as a Service: integrate with VMware Chargeback and bill accordingly
Platforms - Hortonworks
Dell – Cloudera Stack
Amazon Elastic MapReduce
Cloud platform (EC2 & S3) providing Hadoop as a Service
Provision as much capacity as you like for as much time as you like
Integration with S3 (Cloud Storage) – store your persistent data in S3
Run your app or service in the cloud with the Hadoop cluster in the same place
Multiple locations – you have an on-demand Hadoop cluster across 5 continents
MapR, KarmaSphere, etc. tools are available on the AWS platform itself.
Use expensive BI tools in the cloud on a pay-as-you-go basis