SQLBits XI – ETL with Hadoop
DESCRIPTION
My SQLBits XI presentation about Hadoop, MapReduce and Hive
TRANSCRIPT
Jan Pieter Posthuma – Inter Access
ETL with Hadoop and MapReduce
2
Introduction
Jan Pieter Posthuma – Technical Lead Microsoft BI and Big Data consultant
Inter Access, a local consultancy firm in the Netherlands
Architect role at multiple projects: Analysis Services, Reporting Services, PerformancePoint Services, Big Data, HDInsight, Cloud BI
http://twitter.com/jppp
http://linkedin.com/jpposthuma
3
Expectations
What to cover:
– Simple ETL, so simple sources
– A different way to achieve the result
What not to cover:
– Big Data best practices
– Deep Hadoop internals
4
Agenda
– Hadoop
– HDFS
– Map/Reduce (demo)
– Hive and Pig (demo)
– Polybase
5
Hadoop
Hadoop is a collection of software to create a data-intensive distributed cluster running on commodity hardware.
Widely accepted by database vendors as a solution for unstructured data.
Microsoft partners with Hortonworks and delivers their Hortonworks Data Platform (HDP) as Microsoft HDInsight, available on-premises and as an Azure service.
The Hortonworks Data Platform (HDP) is 100% open source!
6
Hadoop
[Architecture diagram: big data sources (crawlers, bots, devices, sensors – raw, unstructured) feed HDInsight on Windows Azure and HDInsight on Windows Server; source systems (ERP, CRM, LOB apps) flow through enterprise ETL with SSIS, DQS and MDS (integrate/enrich); SQL Server StreamInsight raises alerts and notifications for data- and compute-intensive applications; SQL Server Parallel Data Warehouse holds historical data beyond the active window via summarize & load and fast load; SQL Server FTDW data marts, SQL Server Reporting Services and SQL Server Analysis Services deliver business insights such as interactive reports and performance scorecards; the Azure Marketplace supplies external data. Data → Insights → Value.]
1. Data warehousing: storing and analysis of structured data
2. Map/Reduce: storing and processing of unstructured data
3. Streaming: real-time data processing, a.k.a. predictive maintenance
4. Business analytics: interaction with the data

CREATE EXTERNAL TABLE Customer
WITH (LOCATION='hdfs://10.13.12.14:5000/user/Hadoop/Customer',
      FORMAT_OPTIONS (FIELDS_TERMINATOR = ','))
AS SELECT * FROM DimCustomer
7
Hadoop
HDFS – distributed, fault-tolerant file system
MapReduce – framework for writing/executing distributed, fault-tolerant algorithms
Hive & Pig – SQL-like declarative languages
Sqoop/PolyBase – packages for moving data between HDFS and relational DB systems
+ Others…
[Ecosystem diagram: HDFS at the base, Map/Reduce on top of it, Hive & Pig and Sqoop/PolyBase above that; Avro (serialization), HBase and ZooKeeper alongside; ETL tools, BI reporting and RDBMSs connect from the outside.]
8
HDFS
[Diagram: a large file (6,440MB of raw bits) is split into blocks; with a block size of 64MB this yields Block 1 through Block 101 – 100 blocks of 64MB and a final block of 40MB – shown color-coded for the replication example.]
Files are composed of a set of blocks
• Typically 64MB in size
• Each block is stored as a separate file in the local file system (e.g. NTFS)
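A quick worked check of the block arithmetic on this slide:

\[
\left\lceil \frac{6440\,\mathrm{MB}}{64\,\mathrm{MB}} \right\rceil = 101 \text{ blocks}, \qquad 100 \times 64\,\mathrm{MB} + 1 \times 40\,\mathrm{MB} = 6440\,\mathrm{MB}
\]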
9
HDFS
NameNode BackupNode
DataNode DataNode DataNode DataNode DataNode
(heartbeat, balancing, replication, etc.)
nodes write to local disk
namespace backups
HDFS was designed with the expectation that failures (both hardware and software) would occur frequently
10
Map/Reduce
Programming framework (library and runtime) for analyzing data sets stored in HDFS.
The MR framework provides all the “glue” and coordinates the execution of the Map and Reduce jobs on the cluster.
– Fault tolerant
– Scalable
Map function:
var map = function(key, value, context) {}
Reduce function:
var reduce = function(key, values, context) {}
11
Map/Reduce
[Data-flow diagram: on each DataNode, Mappers read <key_i, value_i> pairs and emit intermediate <key, value> pairs; the framework sorts and groups them by key; each Reducer then receives one <key, list(value_a, value_b, value_c, …)> pair per key (keyA, keyB, keyC, …) and writes the output.]
12
Demo
Weather info: Need daily max and min temperature per station
var map = function (key, value, context) {
    // Skip comment/header lines, which start with '#'
    if (value[0] != '#') {
        var allValues = value.split(',');
        // Field 7 is the temperature (T); skip rows where it is empty
        if (allValues[7].trim() != '') {
            // Key: station-date; value: station, date, temperature
            context.write(allValues[0] + '-' + allValues[1],
                allValues[0] + ',' + allValues[1] + ',' + allValues[7]);
        }
    }
};
Sample input (KNMI weather data):
# STN,YYYYMMDD,HH, DD,FH, FF,FX,  T,T10,TD,SQ, Q,DR,RH,   P,VV, N, U,WW,IX, M, R, S, O, Y
210,19510101, 1,200,  , 93,  , -4,   ,  ,  ,  ,  ,  ,9947,  , 8,  , 5,  ,  ,  ,  ,  ,
210,19510101, 2,190,  ,108,  ,  1,   ,  ,  ,  ,  ,  ,9937,  , 8,  , 5,  , 0, 0, 0, 0, 0
Output <key, value>:
<“210-19510101”, “210,19510101,-4”>
<“210-19510101”, “210,19510101,1”>
13
Demo (cont.)
var reduce = function (key, values, context) {
    var mMax = -9999;
    var mMin = 9999;
    var mKey = key.split('-');
    while (values.hasNext()) {
        var mValues = values.next().split(',');
        // Parse the temperature as a number; comparing the raw strings
        // would be lexicographic and give wrong results
        var mTemp = parseFloat(mValues[2]);
        mMax = mTemp > mMax ? mTemp : mMax;
        mMin = mTemp < mMin ? mTemp : mMin;
    }
    // Emit: station, date, max, min (tab separated)
    context.write(key.trim(),
        mKey[0].toString() + '\t' +
        mKey[1].toString() + '\t' +
        mMax.toString() + '\t' +
        mMin.toString());
};
Map Output <key, value>:
<“210-19510101”, “210,19510101,-4”>
<“210-19510101”, “210,19510101,1”>
Reduce Input <key, values := list(value1, …, valuen)>:
<“210-19510101”, {“210,19510101,-4”, “210,19510101,1”}>
Demo
15
Hive and Pig
Query: Find the sourceIP address that generated the most adRevenue along with its average pageRank
Rankings (
    pageURL STRING,
    pageRank INT,
    avgDuration INT
);
UserVisits (
    sourceIP STRING,
    destURL STRING,
    visitDate DATE,
    adRevenue FLOAT,
    … // fields omitted
);
Join required
package edu.brown.cs.mapreduce.benchmarks;

import java.util.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
import org.apache.hadoop.mapred.lib.*;
import org.apache.hadoop.fs.*;
import edu.brown.cs.mapreduce.BenchmarkBase;

public class Benchmark3 extends Configured implements Tool {
    public static String getTypeString(int type) {
        if (type == 1) {
            return ("UserVisits");
        } else if (type == 2) {
            return ("Rankings");
        }
        return ("INVALID");
    }

    /* (non-Javadoc)
     * @see org.apache.hadoop.util.Tool#run(java.lang.String[])
     */
    public int run(String[] args) throws Exception {
        BenchmarkBase base = new BenchmarkBase(this.getConf(), this.getClass(), args);
        Date startTime = new Date();
        System.out.println("Job started: " + startTime);

        // Phase #1
        // -------------------------------------------
        JobConf p1_job = base.getJobConf();
        p1_job.setJobName(p1_job.getJobName() + ".Phase1");
        Path p1_output = new Path(base.getOutputPath().toString() + "/phase1");
        FileOutputFormat.setOutputPath(p1_job, p1_output);

        //
        // Make sure we have our properties
        //
        String required[] = { BenchmarkBase.PROPERTY_START_DATE, BenchmarkBase.PROPERTY_STOP_DATE };
        for (String req : required) {
            if (!base.getOptions().containsKey(req)) {
                System.err.println("ERROR: The property '" + req + "' is not set");
                System.exit(1);
            }
        } // FOR

        p1_job.setInputFormat(base.getSequenceFile() ? SequenceFileInputFormat.class : KeyValueTextInputFormat.class);
        if (base.getSequenceFile()) p1_job.setOutputFormat(SequenceFileOutputFormat.class);
        p1_job.setOutputKeyClass(Text.class);
        p1_job.setOutputValueClass(Text.class);
        p1_job.setMapperClass(base.getTupleData() ? edu.brown.cs.mapreduce.benchmarks.benchmark3.phase1.TupleWritableMap.class : edu.brown.cs.mapreduce.benchmarks.benchmark3.phase1.TextMap.class);
        p1_job.setReducerClass(base.getTupleData() ? edu.brown.cs.mapreduce.benchmarks.benchmark3.phase1.TupleWritableReduce.class : edu.brown.cs.mapreduce.benchmarks.benchmark3.phase1.TextReduce.class);
        p1_job.setCompressMapOutput(base.getCompress());

        // Phase #2
        // -------------------------------------------
        JobConf p2_job = base.getJobConf();
        p2_job.setJobName(p2_job.getJobName() + ".Phase2");
        p2_job.setInputFormat(base.getSequenceFile() ? SequenceFileInputFormat.class : KeyValueTextInputFormat.class);
        if (base.getSequenceFile()) p2_job.setOutputFormat(SequenceFileOutputFormat.class);
        p2_job.setOutputKeyClass(Text.class);
        p2_job.setOutputValueClass(Text.class);
        p2_job.setMapperClass(IdentityMapper.class);
        p2_job.setReducerClass(base.getTupleData() ? edu.brown.cs.mapreduce.benchmarks.benchmark3.phase2.TupleWritableReduce.class : edu.brown.cs.mapreduce.benchmarks.benchmark3.phase2.TextReduce.class);
        p2_job.setCompressMapOutput(base.getCompress());

        // Phase #3
        // -------------------------------------------
        JobConf p3_job = base.getJobConf();
        p3_job.setJobName(p3_job.getJobName() + ".Phase3");
        p3_job.setNumReduceTasks(1);
        p3_job.setInputFormat(base.getSequenceFile() ? SequenceFileInputFormat.class : KeyValueTextInputFormat.class);
        p3_job.setOutputKeyClass(Text.class);
        p3_job.setOutputValueClass(Text.class);
        //p3_job.setMapperClass(Phase3Map.class);
        p3_job.setMapperClass(IdentityMapper.class);
        p3_job.setReducerClass(base.getTupleData() ? edu.brown.cs.mapreduce.benchmarks.benchmark3.phase3.TupleWritableReduce.class : edu.brown.cs.mapreduce.benchmarks.benchmark3.phase3.TextReduce.class);

        //
        // Execute #1
        //
        base.runJob(p1_job);

        //
        // Execute #2
        //
        Path p2_output = new Path(base.getOutputPath().toString() + "/phase2");
        FileOutputFormat.setOutputPath(p2_job, p2_output);
        FileInputFormat.setInputPaths(p2_job, p1_output);
        base.runJob(p2_job);

        //
        // Execute #3
        //
        Path p3_output = new Path(base.getOutputPath().toString() + "/phase3");
        FileOutputFormat.setOutputPath(p3_job, p3_output);
        FileInputFormat.setInputPaths(p3_job, p2_output);
        base.runJob(p3_job);

        // There does need to be a combine
        if (base.getCombine()) base.runCombine();
        return 0;
    }
}
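For contrast, the same query in HiveQL takes only a few lines. This is a sketch, not code from the deck: it assumes the Rankings and UserVisits definitions above exist as Hive tables and omits the benchmark's date-range filter:

SELECT uv.sourceIP,
       SUM(uv.adRevenue) AS totalRevenue,
       AVG(r.pageRank)   AS avgPageRank
FROM UserVisits uv
JOIN Rankings r ON (r.pageURL = uv.destURL)   -- the join the slide calls out
GROUP BY uv.sourceIP
ORDER BY totalRevenue DESC
LIMIT 1;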
16
Hive and Pig
The principle is the same: easy data retrieval. Both use MapReduce.
– Different founders: Facebook (Hive) and Yahoo (Pig)
– Different languages: SQL-like (Hive) and more procedural (Pig)
Both can store data in tables, which are stored as HDFS file(s).
Extra language options to use the benefits of Hadoop:
– PARTITION BY statement
– Map/Reduce statement
‘Of the 150k jobs Facebook runs daily, only 500 are MapReduce jobs. The rest is HiveQL.’
17
Hive
Query 1: SELECT count_big(*) FROM lineitem
Query 2: SELECT max(l_quantity) FROM lineitem WHERE l_orderkey > 1000 AND l_orderkey < 100000 GROUP BY l_linestatus
[Benchmark results in seconds:]
          Hive   PDW
Query 1   1318    252
Query 2   1397    279
18
Demo
Use the same data file as in the previous demo, but now we directly ‘query’ the file (see the sketch below).
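A minimal sketch of what such a Hive query could look like; the table name, file location and abbreviated column list are assumptions for illustration, not the exact demo script:

-- Expose the raw weather file as a Hive table (columns abbreviated;
-- assumes the '#' comment lines have been stripped from the file)
CREATE EXTERNAL TABLE weather (stn STRING, yyyymmdd STRING, hh INT, dd INT,
                               fh INT, ff INT, fx INT, t INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hadoop/weather';

-- Daily max and min temperature per station, as in the MapReduce demo
SELECT stn, yyyymmdd, MAX(t), MIN(t)
FROM weather
GROUP BY stn, yyyymmdd;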
Demo
20
Polybase
PDW v2 introduces external tables to represent HDFS data.
PDW queries can now span HDFS and PDW data.
The Hadoop cluster is not part of the appliance.
[Diagram: unstructured data from social apps, sensor & RFID, mobile apps and web apps lands in HDFS; structured data lives in relational databases (RDBMS); the enhanced PDW query engine spans both through T-SQL.]
Polybase
[Diagram: the PDW cluster consists of a control node (SQL Server) and a series of compute nodes (SQL Server); alongside it, the Hadoop cluster consists of a NameNode (HDFS) and many nodes, each running a DataNode (DN).]
21
This is PDW!
22
PDW Hadoop
1. Retrieve data from HDFS with a PDW query
– Seamlessly join structured and semi-structured data
2. Import data from HDFS to PDW
– Parallelized CREATE TABLE AS SELECT (CTAS)
– External tables as the source
– PDW table, either replicated or distributed, as destination
3. Export data from PDW to HDFS
– Parallelized CREATE EXTERNAL TABLE AS SELECT (CETAS)
– External table as the destination; creates a set of HDFS files
-- 1. Query spanning HDFS and PDW
SELECT Username FROM ClickStream c, User u
WHERE c.UserID = u.ID AND c.URL = 'www.bing.com';

-- 2. Import (CTAS)
CREATE TABLE ClickStreamInPDW WITH (DISTRIBUTION = HASH(URL))
AS SELECT URL, EventDate, UserID FROM ClickStream;

-- 3. Export (CETAS)
CREATE EXTERNAL TABLE ClickStream2 (URL, EventDate, UserID)
WITH (LOCATION = 'hdfs://MyHadoop:5000/joe', FORMAT_OPTIONS (...))
AS SELECT URL, EventDate, UserID FROM ClickStreamInPDW;
23
Recap
Hadoop is the next big thing for DWH/BI – not a replacement, but a new dimension. There are many ways to integrate its data.
What’s next?
– Polybase combined with (custom) Map/Reduce?
– HDInsight appliance?
– Polybase for SQL Server vNext?
24
References
Microsoft Big Data (HDInsight): http://www.microsoft.com/bigdata
Microsoft HDInsight on Azure (3 months free trial): http://www.windowsazure.com
Hortonworks Data Platform sandbox (VMware): http://hortonworks.com/download/
Q&A
Coming up…
Speaker            Title                                                       Room
Alberto Ferrari    DAX Query Engine Internals                                  Theatre
Wesley Backelant   An introduction to the wonderful world of OData             Exhibition B
Bob Duffy          Windows Azure For SQL folk                                  Suite 3
Dejan Sarka        Excel 2013 Analytics                                        Suite 1
Mladen Prajdić     From SQL Traces to Extended Events. The next big switch.    Suite 2
Sandip Pani        New Analytic Functions in SQL Server 2012                   Suite 4
#SQLBITS