qubole hadoop-summit-2013-europe
TRANSCRIPT
Cloud Friendly Hadoop & Hive
Joydeep Sen Sarma
Qubole
Agenda
• What is Qubole Data Service
• Hadoop as a Service in Cloud
• Hive as a Service in Cloud
Qubole Data Service
[Architecture diagram, built up over several slides: the Qubole Data Service (Sqoop, Oozie, Pig, Hive) exposed through an API and an ODBC SDK, running Hadoop on AWS EC2 against AWS S3. Data sources include s3://adco/logs, MySQL, and Vertica. Workflow: Explore – Integrate – Analyze – Schedule.]
Agenda
• What is Qubole Data Service
• Hadoop as a Service in Cloud
• Hive as a Service in Cloud
Step 1 (Optional): Set Up Hadoop
Step 2: Fire Away
AdCo Hadoop

select t.county, count(1) from (select
  transform(a.zip) using 'geo.py' as county
  from SMALL_TABLE a) t
group by t.county;

insert overwrite table dest
select a.id, a.zip, count(distinct b.uid)
from ads a join LARGE_TABLE b on (a.id = b.ad_id)
group by a.id, a.zip;

hadoop jar -Dmapred.min.split.size=32000000
  myapp.jar -partitioner org.apache…
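The `geo.py` in the first query is a Hive TRANSFORM script: Hive pipes each input row to it on stdin as tab-separated text and reads output rows back from stdout. A minimal sketch of what such a script could look like, assuming a zip-to-county lookup (the sample mapping and the UNKNOWN fallback are hypothetical, not from the talk):

```python
import sys

# Hypothetical sample mapping; a real script would load a full lookup table.
ZIP_TO_COUNTY = {
    "94103": "San Francisco",
    "10001": "New York",
}

def to_county(zip_code):
    """Map a single zip code to a county name, with a fallback for unknowns."""
    return ZIP_TO_COUNTY.get(zip_code.strip(), "UNKNOWN")

def run(stdin, stdout):
    # TRANSFORM(a.zip) streams one zip code per line; emit one county per line.
    for line in stdin:
        stdout.write(to_county(line) + "\n")
```

When deployed, Hive itself invokes the script, so the entry point is simply `run(sys.stdin, sys.stdout)`.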
Come back anytime
Hadoop as a Service
1. Detect when a cluster is required
   – Not all Hive statements require a cluster (EXPLAIN/SHOW/…)
2. Atomically create the cluster
   – Long-running process; concurrency control using MySQL
3. Shut down when not in use
   – Do it on the hour boundary (whose?)
   – Not if user sessions are active!
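Step 1 can be sketched as a keyword check on the statement. Only EXPLAIN and SHOW come from the slide; the other metadata-only keywords here are illustrative guesses:

```python
# Statements answerable from the metastore alone, with no cluster running.
# EXPLAIN/SHOW are from the slide; DESCRIBE/USE/SET are illustrative.
METADATA_ONLY = {"EXPLAIN", "SHOW", "DESCRIBE", "USE", "SET"}

def needs_cluster(statement):
    """Return True if a Hive statement needs a running Hadoop cluster."""
    words = statement.strip().split(None, 1)
    return bool(words) and words[0].upper() not in METADATA_ONLY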
Hadoop as a Service
• Archive Job History/Logs to S3
  – Transparent access to old jobs
• Auto-config different node types
  – Use ALL ephemeral drives for HDFS/MR
  – Use the right number of slots per machine
• Scrub, Scrub, Scrub
  – Bad nodes, bad clusters, AWS timeouts
Scaling Up
[Diagram, built up over several slides: a StarCluster-launched Hadoop cluster on AWS – a Master running the Job Tracker, and Slaves running Map Tasks and Reduce Tasks. As the query below makes Progress, the Job Tracker compares task Demand against slot Supply and adds Slaves when demand exceeds supply.]

insert overwrite table dest
select … from ads join
campaigns on … group by …;
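The Demand/Supply comparison in the diagram can be sketched as follows: demand is the number of pending map/reduce tasks, supply is the slots the current slaves provide. The parameter names and the ceiling policy are illustrative, not Qubole's actual algorithm:

```python
import math

def nodes_to_add(pending_tasks, running_nodes, slots_per_node, max_nodes):
    """Slaves to request when task demand outstrips slot supply."""
    supply = running_nodes * slots_per_node
    shortfall = pending_tasks - supply          # demand minus supply
    if shortfall <= 0:
        return 0                                # supply already covers demand
    wanted = math.ceil(shortfall / slots_per_node)
    return min(wanted, max_nodes - running_nodes)  # respect the cluster cap
```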
Scaling Down
1. On the hour boundary, check whether the node is required:
   – Can't remove nodes with map outputs (today)
   – Don't go below the minimum cluster size
2. Remove the node from the Map-Reduce cluster
3. Request HDFS decommissioning – fast!
   – Delete affected cache files instead of re-replicating
   – One surviving replica and we are done
4. Delete the instance
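The "whose hour boundary?" question has a natural answer: each node's own, since EC2 at the time billed whole instance-hours from launch. A sketch of the per-node check; the window size and parameter names are illustrative:

```python
def removable(uptime_minutes, is_idle, holds_map_outputs,
              cluster_size, min_size, window=5):
    """May this node be removed now?

    Only worth it in the last few minutes of the node's own billing hour,
    and only if the checks in step 1 above allow it.
    """
    near_boundary = uptime_minutes % 60 >= 60 - window
    return (near_boundary
            and is_idle
            and not holds_map_outputs     # can't remove nodes with map outputs
            and cluster_size > min_size)  # don't go below minimum cluster size
```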
Spot Instances
On average 50–60% cheaper
Spot Instance: Challenges
• Spot nodes can be lost at any time
  – Disastrous for HDFS
  – Hybrid mode: use a mix of On-Demand and Spot nodes
  – Hybrid mode: keep one replica on On-Demand nodes
• Spot instances may not be available
  – Time out and fall back to On-Demand nodes
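The timeout-and-fallback policy is just try-Spot-first with On-Demand as the safety net. A sketch with injected callables standing in for the EC2 API calls (TimeoutError models an unfulfilled Spot request; all names are illustrative):

```python
def acquire_node(request_spot, request_on_demand, timeout_s=300):
    """Prefer a Spot node; fall back to On-Demand if Spot is unavailable."""
    try:
        # The real call would place a Spot request and wait up to timeout_s.
        return request_spot(timeout_s)
    except TimeoutError:
        return request_on_demand()
```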
Agenda
• What is Qubole Data Service
• Hadoop as a Service in Cloud
• Hive as a Service in Cloud
Query History/Results
Cheap to Test
• Evaluate expressions on sample data
• Run the query on a sample
Fastest Hive SaaS
• Works with small files!
  – Faster split computation (8x)
  – Prefetching S3 files (30%)
• Direct writes to S3
  – HIVE-1620
• Stable JVM reuse!
  – Fix re-entrancy issues
  – 1.2–2x speedup
• Columnar cache
  – Use HDFS as a cache for S3
  – Up to 5x faster for JSON data
• NEW: Multi-tenant Hive Server
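Of the items above, S3 prefetching is the easiest to sketch: issue the GETs for a split's files concurrently so their latencies overlap instead of adding up. Here `fetch` is an injected callable standing in for an S3 GET; the names and worker count are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def prefetch(keys, fetch, workers=8):
    """Fetch several S3 objects concurrently, returning {key: data}."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip pairs each key with its result.
        return dict(zip(keys, pool.map(fetch, keys)))
```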
Questions?
@Qubole
Free Trial: www.qubole.com