qubole hadoop-summit-2013-europe
TRANSCRIPT
Cloud Friendly Hadoop & Hive
Joydeep Sen Sarma
Qubole
Agenda
• What is Qubole Data Service
• Hadoop as a Service in Cloud
• Hive as a Service in Cloud
Qubole Data Service
[Architecture diagram, built up over several slides: the Qubole Data Service (Sqoop, Oozie, Pig, Hive) exposed through an API and an ODBC SDK, running Hadoop on AWS EC2 against AWS S3. Data sources include s3://adco/logs, MySQL, and Vertica. Workflow: Explore – Integrate – Analyze – Schedule.]
Agenda
• What is Qubole Data Service
• Hadoop as a Service in Cloud
• Hive as a Service in Cloud
Step 1 (Optional): Set Up Hadoop
Step 2: Fire Away
AdCo Hadoop

select t.county, count(1) from (select
  transform(a.zip) using 'geo.py' as county
  from SMALL_TABLE a) t
group by t.county;

insert overwrite table dest
select a.id, a.zip, count(distinct b.uid)
from ads a join LARGE_TABLE b on (a.id = b.ad_id)
group by a.id, a.zip;

hadoop jar -Dmapred.min.split.size=32000000
  myapp.jar -partitioner org.apache…
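The `geo.py` in the first query is a Hive TRANSFORM script: Hive pipes each input row to it on stdin as tab-separated text and reads output rows back from stdout. A minimal sketch of what such a script could look like, assuming a zip-to-county lookup (the sample mapping and the UNKNOWN fallback are hypothetical, not from the talk):

```python
import sys

# Hypothetical sample mapping; a real script would load a full lookup table.
ZIP_TO_COUNTY = {
    "94103": "San Francisco",
    "10001": "New York",
}

def to_county(zip_code):
    """Map a single zip code to a county name, with a fallback for unknowns."""
    return ZIP_TO_COUNTY.get(zip_code.strip(), "UNKNOWN")

def run(stdin, stdout):
    # TRANSFORM(a.zip) streams one zip code per line; emit one county per line.
    for line in stdin:
        stdout.write(to_county(line) + "\n")
```

When deployed, Hive itself invokes the script, so the entry point is simply `run(sys.stdin, sys.stdout)`.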
Come back anytime
Hadoop as a Service
1. Detect when a cluster is required
   – Not all Hive statements require a cluster (EXPLAIN/SHOW/…)
2. Atomically create the cluster
   – Long-running process; concurrency control using MySQL
3. Shut down when not in use
   – Do it on the hour boundary (whose?)
   – Not if user sessions are active!
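Step 1 can be sketched as a keyword check on the statement. Only EXPLAIN and SHOW come from the slide; the other metadata-only keywords here are illustrative guesses:

```python
# Statements answerable from the metastore alone, with no cluster running.
# EXPLAIN/SHOW are from the slide; DESCRIBE/USE/SET are illustrative.
METADATA_ONLY = {"EXPLAIN", "SHOW", "DESCRIBE", "USE", "SET"}

def needs_cluster(statement):
    """Return True if a Hive statement needs a running Hadoop cluster."""
    words = statement.strip().split(None, 1)
    return bool(words) and words[0].upper() not in METADATA_ONLY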
Hadoop as a Service
• Archive Job History/Logs to S3
  – Transparent access to old jobs
• Auto-config different node types
  – Use ALL ephemeral drives for HDFS/MR
  – Use the right number of slots per machine
• Scrub, Scrub, Scrub
  – Bad nodes, bad clusters, AWS timeouts
Scaling Up
[Diagram, built up over several slides: a StarCluster-launched Hadoop cluster on AWS – a Master running the Job Tracker, and Slaves running Map Tasks and Reduce Tasks. As the query below makes Progress, the Job Tracker compares task Demand against slot Supply and adds Slaves when demand exceeds supply.]

insert overwrite table dest
select … from ads join
campaigns on … group by …;
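The Demand/Supply comparison in the diagram can be sketched as follows: demand is the number of pending map/reduce tasks, supply is the slots the current slaves provide. The parameter names and the ceiling policy are illustrative, not Qubole's actual algorithm:

```python
import math

def nodes_to_add(pending_tasks, running_nodes, slots_per_node, max_nodes):
    """Slaves to request when task demand outstrips slot supply."""
    supply = running_nodes * slots_per_node
    shortfall = pending_tasks - supply          # demand minus supply
    if shortfall <= 0:
        return 0                                # supply already covers demand
    wanted = math.ceil(shortfall / slots_per_node)
    return min(wanted, max_nodes - running_nodes)  # respect the cluster cap
```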
Scaling Down
1. On the hour boundary, check whether the node is required:
   – Can't remove nodes with map outputs (today)
   – Don't go below the minimum cluster size
2. Remove the node from the Map-Reduce cluster
3. Request HDFS decommissioning – fast!
   – Delete affected cache files instead of re-replicating
   – One surviving replica and we are done
4. Delete the instance
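The "whose hour boundary?" question has a natural answer: each node's own, since EC2 at the time billed whole instance-hours from launch. A sketch of the per-node check; the window size and parameter names are illustrative:

```python
def removable(uptime_minutes, is_idle, holds_map_outputs,
              cluster_size, min_size, window=5):
    """May this node be removed now?

    Only worth it in the last few minutes of the node's own billing hour,
    and only if the checks in step 1 above allow it.
    """
    near_boundary = uptime_minutes % 60 >= 60 - window
    return (near_boundary
            and is_idle
            and not holds_map_outputs     # can't remove nodes with map outputs
            and cluster_size > min_size)  # don't go below minimum cluster size
```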
Spot Instances
On average 50–60% cheaper
Spot Instance: Challenges
• Spot nodes can be lost at any time
  – Disastrous for HDFS
  – Hybrid mode: use a mix of On-Demand and Spot nodes
  – Hybrid mode: keep one replica on On-Demand nodes
• Spot instances may not be available
  – Time out and fall back to On-Demand nodes
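The timeout-and-fallback policy is just try-Spot-first with On-Demand as the safety net. A sketch with injected callables standing in for the EC2 API calls (TimeoutError models an unfulfilled Spot request; all names are illustrative):

```python
def acquire_node(request_spot, request_on_demand, timeout_s=300):
    """Prefer a Spot node; fall back to On-Demand if Spot is unavailable."""
    try:
        # The real call would place a Spot request and wait up to timeout_s.
        return request_spot(timeout_s)
    except TimeoutError:
        return request_on_demand()
```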
Agenda
• What is Qubole Data Service
• Hadoop as a Service in Cloud
• Hive as a Service in Cloud
Query History/Results
Cheap to Test
• Evaluate expressions on sample data
• Run the query on a sample
Fastest Hive SaaS
• Works with small files!
  – Faster split computation (8x)
  – Prefetching S3 files (30%)
• Direct writes to S3
  – HIVE-1620
• Stable JVM reuse!
  – Fix re-entrancy issues
  – 1.2–2x speedup
• Columnar cache
  – Use HDFS as a cache for S3
  – Up to 5x faster for JSON data
• NEW: Multi-tenant Hive Server
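Of the items above, S3 prefetching is the easiest to sketch: issue the GETs for a split's files concurrently so their latencies overlap instead of adding up. Here `fetch` is an injected callable standing in for an S3 GET; the names and worker count are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def prefetch(keys, fetch, workers=8):
    """Fetch several S3 objects concurrently, returning {key: data}."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip pairs each key with its result.
        return dict(zip(keys, pool.map(fetch, keys)))
```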
Questions?
@Qubole
Free Trial: www.qubole.com