Hadoop databases for ORACLE DBAs

Hadoop databases: Hive, Impala, Spark, Presto. For ORACLE DBAs. Session ID: 557. Prepared by: Maxym Kharchenko, Gluent (@maxymkh)

Upload: maxym-kharchenko

Post on 13-Apr-2017

Page 1: Hadoop databases for oracle DBAs

Hadoop databases: Hive, Impala, Spark, Presto

For ORACLE DBAs

Session ID: 557

Prepared by: Maxym Kharchenko, Gluent

@maxymkh

Page 2: Hadoop databases for oracle DBAs

April 2-6, 2017 in Las Vegas, NV USA #C17LV

Whoami

• Database Kernel developer -> ORACLE DBA -> Database Hadoop/Cloud developer

• Worked with ORACLE for the last 15 years

• OCM, ORACLE ACE alumni, Amazon alumni

• Last year: OLTP -> Hadoop

Page 3: Hadoop databases for oracle DBAs

Shameless plug about my company

[Diagram labels: Gluent; Oracle, Teradata, NoSQL, MSSQL, Big Data Sources; App X, App Y, App Z]

Page 4: Hadoop databases for oracle DBAs

Agenda

• What are Hadoop databases?

• Hive/Impala/Spark vs. ORACLE (hopefully, demo)

• Best ways to start

Page 5: Hadoop databases for oracle DBAs

What is Hadoop:

• For “Big data”

• Can deal with “Unstructured” data

• Distributed

• Consists of: HDFS + MapReduce

• Requires you to write MapReduce jobs, NoSQL

Page 6: Hadoop databases for oracle DBAs

Yes, but what does it all mean?

Page 7: Hadoop databases for oracle DBAs

Imagine that you are Google in the early 2000s

Page 8: Hadoop databases for oracle DBAs

Target Ads

• You need to query web crawler data

• Which is unbelievably huge

• These queries need to be:

  • (reasonably) Fast
  • (reasonably) Cheap
  • (reasonably) Easy to use

Page 9: Hadoop databases for oracle DBAs

Let’s build a Data Warehouse

Page 10: Hadoop databases for oracle DBAs

(traditional) Data warehouse

• Been there for years

• Mature and (relatively) advanced

• SQL !!!

Page 11: Hadoop databases for oracle DBAs

Data Warehouse scorecard

Requirements             | RDBMS
(reasonably) Fast        |
(reasonably) Cheap       |
(reasonably) Easy to use |
Able to process data     | ¯\_(ツ)_/¯

Page 12: Hadoop databases for oracle DBAs

Scaling up “Big data” ain’t cheap

• Can’t fit all of the data on a single box

• Cost is quickly getting out of hand

Page 13: Hadoop databases for oracle DBAs

(cheap) Commodity systems make “big data” feasible

Page 14: Hadoop databases for oracle DBAs

Solution = commodity systems

=

$$$$$ $$

Page 15: Hadoop databases for oracle DBAs

Commodity systems scorecard

Requirements             | Commodity
(reasonably) Fast        |
(reasonably) Cheap       |
(reasonably) Easy to use |
Able to process data     |

Page 16: Hadoop databases for oracle DBAs

All your queries are Java Classes
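
To give a feel for what "queries as classes" means, here is a minimal sketch of a GROUP BY count written as separate map and reduce steps. It is in Python rather than Java for brevity, and it runs in a single process. The names `map_sale`, `reduce_counts` and `run_job`, as well as the sample rows, are made up for illustration; a real Hadoop job would implement mapper and reducer classes and run across the cluster.

```python
from collections import defaultdict

def map_sale(row):
    """Map step: emit a (key, 1) pair for each input record."""
    yield row["prod_id"], 1

def reduce_counts(key, values):
    """Reduce step: collapse all values for one key into a single count."""
    return key, sum(values)

def run_job(rows):
    """Single-process stand-in for the framework: map, shuffle, reduce."""
    shuffled = defaultdict(list)
    for row in rows:
        for key, value in map_sale(row):
            shuffled[key].append(value)  # "shuffle": group values by key
    return dict(reduce_counts(k, v) for k, v in shuffled.items())

# Hypothetical sample data; roughly "select prod_id, count(1) ... group by prod_id"
sales = [{"prod_id": 43}, {"prod_id": 46}, {"prod_id": 43}]
result = run_job(sales)
```

Even this toy version needs three functions for what SQL expresses in one line, which is the "easy to use" gap that the scorecard points at.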

Page 17: Hadoop databases for oracle DBAs

Google

• 2003: Google File System (GFS) paper

• 2004: Google MapReduce (MR) paper

Page 18: Hadoop databases for oracle DBAs

Hadoop

• 2006: Hadoop

Page 19: Hadoop databases for oracle DBAs

“Traditional Data Warehouse” vs. Hadoop

Requirements             | Hadoop | Data Warehouse
(reasonably) Fast        |        |
(reasonably) Cheap       |        |
(reasonably) Easy to use |        |
Able to process data     |        | ¯\_(ツ)_/¯

Page 20: Hadoop databases for oracle DBAs

SQL on Hadoop - Hive

• 2010: Facebook releases Apache Hive

• SQL on Hadoop!

Page 21: Hadoop databases for oracle DBAs

Another SQL on Hadoop - Impala

• 2012: Cloudera announces Impala

• Faster SQL on Hadoop!

Page 22: Hadoop databases for oracle DBAs

And then, it exploded …

Page 23: Hadoop databases for oracle DBAs

“Hadoop” vs “Relational” databases

Demo … hopefully

Page 24: Hadoop databases for oracle DBAs

This is not about NoSQL :-)

Page 25: Hadoop databases for oracle DBAs

Same: Tables

sql> describe sh.products;

+--------------------+----------------+---------+
| name               | type           | comment |
+--------------------+----------------+---------+
| prod_id            | bigint         |         |
| prod_name          | string         |         |
| prod_desc          | string         |         |
| prod_category_id   | bigint         |         |
| prod_category_desc | string         |         |
| supplier_id        | bigint         |         |
| prod_total_id      | decimal(38,18) |         |
| prod_src_id        | decimal(38,18) |         |
| prod_eff_from      | timestamp      |         |
| prod_eff_to        | timestamp      |         |
| prod_valid         | string         |         |
+--------------------+----------------+---------+

Page 26: Hadoop databases for oracle DBAs

Same: Running SQL queries

sql> select prod_id, count(1)
from sh.sales s, sh.channels c
where c.channel_id = s.channel_id and c.channel_desc='Catalog'
group by prod_id
order by 2 desc
limit 5;

+------------------------+----------+
| prod_id                | count(1) |
+------------------------+----------+
| 43.000000000000000000  | 5182     |
| 46.000000000000000000  | 5165     |
| 22.000000000000000000  | 5162     |
| 123.000000000000000000 | 5152     |
| 32.000000000000000000  | 5145     |
+------------------------+----------+
Fetched 5 row(s) in 3.26s

Page 27: Hadoop databases for oracle DBAs

Same: Queries are optimized

sql> explain select count(1) from sh.times;
+----------------------------------------------------------+
| Explain String                                           |
+----------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=10.00MB VCores=1 |
|                                                          |
| 03:AGGREGATE [FINALIZE]                                  |
| |  output: count:merge(1)                                |
| |                                                        |
| 02:EXCHANGE [UNPARTITIONED]                              |
| |                                                        |
| 01:AGGREGATE                                             |
| |  output: count(1)                                      |
| |                                                        |
| 00:SCAN HDFS [sh.times]                                  |
|    partitions=16/16 files=32 size=500.45KB               |
+----------------------------------------------------------+
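
The plan reads bottom-up: scan, aggregate locally, exchange, then merge. A rough single-process sketch of that two-phase count is shown here, with one function per plan step. The function names and the three "node slices" are assumptions made for illustration; this is not how Impala is implemented internally.

```python
def partial_count(rows):
    """01:AGGREGATE - each node counts only the rows it scanned locally."""
    return sum(1 for _ in rows)

def exchange(partials):
    """02:EXCHANGE - ship the per-node partial results to one place."""
    return list(partials)

def merge_counts(partials):
    """03:AGGREGATE [FINALIZE] - merge partial counts into the final answer."""
    return sum(partials)

# Hypothetical: three nodes, each holding a slice of sh.times
node_slices = [range(0, 600), range(600, 1200), range(1200, 1826)]
total = merge_counts(exchange(partial_count(s) for s in node_slices))
```

Only one small number per node crosses the network, which is why a distributed count(1) stays cheap even when the table is large.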

Page 28: Hadoop databases for oracle DBAs

Different: What gets optimized

• No “regular” indexes

• But many operations are distributed

[Diagram: data split into pieces: SALES 1 + TIMES 1, SALES 2 + TIMES 2, SALES 3 + TIMES 3]

Page 29: Hadoop databases for oracle DBAs

Different: Native cloud filesystem support

sql> show partitions sh.sales;

s3a://bucket1/sh/sales/time_id=2011-01 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-02 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-03 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-04 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-05 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-06 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-07 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-08 | PARQUET
hdfs://clust1/sh/sales/time_id=2011-09 | PARQUET
hdfs://clust1/sh/sales/time_id=2011-10 | PARQUET
hdfs://clust1/sh/sales/time_id=2011-11 | PARQUET
hdfs://clust1/sh/sales/time_id=2011-12 | PARQUET
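
Because each partition is just a directory with a `key=value` name, a query that filters on `time_id` only has to touch the matching directories, wherever they live. The sketch below illustrates that idea with a few of the locations from the listing. The helper names (`partition_value`, `prune`, `filesystem`) are hypothetical, and a real engine would take partition locations from its metastore rather than parse path strings.

```python
from urllib.parse import urlparse

# A few partition locations copied from the listing above
PARTITIONS = [
    "s3a://bucket1/sh/sales/time_id=2011-07",
    "s3a://bucket1/sh/sales/time_id=2011-08",
    "hdfs://clust1/sh/sales/time_id=2011-09",
    "hdfs://clust1/sh/sales/time_id=2011-10",
]

def partition_value(location):
    """Extract the partition value from a 'key=value' directory name."""
    last_dir = urlparse(location).path.rsplit("/", 1)[-1]
    _key, _, value = last_dir.partition("=")
    return value

def prune(locations, low, high):
    """Keep only partitions whose time_id falls in [low, high]."""
    return [loc for loc in locations if low <= partition_value(loc) <= high]

def filesystem(location):
    """Return the storage scheme ('s3a', 'hdfs', ...) of a partition."""
    return urlparse(location).scheme

selected = prune(PARTITIONS, "2011-08", "2011-09")
schemes = sorted({filesystem(loc) for loc in selected})
```

Here a two-month range ends up reading one partition from S3 and one from HDFS within the same table.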

Page 30: Hadoop databases for oracle DBAs

Database engine does NOT “own” data

Page 31: Hadoop databases for oracle DBAs

Different: Different engines can work with the same data files (even at the same time)

example01.dbf, sysaux01.dbf, system01.dbf, temp01.dbf, undotbs01.dbf, users01.dbf

a01_data.parq, a01_data.parq, a03_data.parq, a04_data.parq, a05_data.parq, a06_data.parq

Page 32: Hadoop databases for oracle DBAs

Different: … or copies of the data files

hdfs://adhoc/a.parq, hdfs://adhoc/b.parq, hdfs://adhoc/c.parq, hdfs://adhoc/d.parq, hdfs://adhoc/e.parq, hdfs://adhoc/f.parq

hdfs://prod/a.parq, hdfs://prod/b.parq, hdfs://prod/c.parq, hdfs://prod/d.parq, hdfs://prod/e.parq, hdfs://prod/f.parq

s3://backup/a.parq, s3://backup/b.parq, s3://backup/c.parq, s3://backup/d.parq, s3://backup/e.parq, s3://backup/f.parq

Page 33: Hadoop databases for oracle DBAs

Different: Open data formats

• Not proprietary – many tools can read/write

• No additional $$ for “advanced features”:

  • Columnar storage
  • Storage indexes
  • Compression
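
As a rough illustration of the "storage index" idea, the sketch below keeps a (min, max) pair per row group of one column and uses it to skip row groups that cannot contain the value being searched for. The data, the function names and the in-memory layout are assumptions for illustration only; actual formats such as Parquet and ORC store comparable statistics in their file metadata.

```python
# Hypothetical row groups of a single column (e.g. prod_id), stored column-wise
row_groups = [
    [1, 5, 9],
    [10, 14, 18],
    [20, 22, 48],
]

def build_storage_index(groups):
    """Record (min, max) per row group; done once, when the data is written."""
    return [(min(g), max(g)) for g in groups]

def scan_equals(groups, index, wanted):
    """Read only the row groups whose min/max range can contain `wanted`."""
    groups_read = 0
    matches = 0
    for group, (low, high) in zip(groups, index):
        if low <= wanted <= high:
            groups_read += 1
            matches += sum(1 for v in group if v == wanted)
    return matches, groups_read

index = build_storage_index(row_groups)
matches, groups_read = scan_equals(row_groups, index, 48)
```

In this example two of the three row groups are never read, which is the kind of saving that stands in for "regular" indexes.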

Page 34: Hadoop databases for oracle DBAs

Same: “sqlplus-like” clients

> impala-shell -i 10.0.0.1

[10.0.0.1:21000] > select prod_id, count(1)
from sh.sales group by prod_id order by 2 desc limit 1;

+-----------------------+----------+
| prod_id               | count(1) |
+-----------------------+----------+
| 48.000000000000000000 | 74026    |
+-----------------------+----------+

> beeline -u 'jdbc:hive2://10.0.0.1:10000'
0: jdbc:hive2://10.0.0.1:1> select prod_id, count(1)
from sh.sales group by prod_id order by 2 desc limit 1;

Page 35: Hadoop databases for oracle DBAs

Different: External dictionary

[Diagram labels: User data + Dictionary (SYS); User data + Dictionary (SYS); Hive Metastore]

Page 36: Hadoop databases for oracle DBAs

Different: Append only, “ETL-like” DML

• Hadoop DML is more like ETL

• Data is presumed static

• ACID: some interpretation required

• Schema on read

UPDATE t SET a=12 WHERE b=1;

Table T (base): a_data.orc

Table T (base): a_data.orc + Table T (delta): b_data.orc

Compactor runs …

Table T (base): c_data.orc
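
The base/delta/compactor sequence can be sketched in a few lines. This is a deliberately simplified model, assuming rows are keyed by a hypothetical row id and files are plain dictionaries; the function names are invented for illustration, and the on-disk layout used by Hive transactional tables is more involved.

```python
def read(base, deltas):
    """Readers merge base with deltas; later deltas win for the same row id."""
    merged = dict(base)
    for delta in deltas:
        merged.update(delta)
    return merged

def update(base, deltas, predicate, changes):
    """Append-only UPDATE: write changed rows to a new delta, leave base alone."""
    current = read(base, deltas)
    delta = {rid: {**row, **changes} for rid, row in current.items() if predicate(row)}
    return deltas + [delta]

def compact(base, deltas):
    """Compactor: fold all deltas into a new base and drop the delta files."""
    return read(base, deltas), []

# Table T (base): a_data.orc, keyed by a hypothetical row id
base = {1: {"a": 5, "b": 1}, 2: {"a": 7, "b": 2}}
# UPDATE t SET a=12 WHERE b=1;  -> produces the delta (b_data.orc)
deltas = update(base, [], lambda row: row["b"] == 1, {"a": 12})
before_compaction = read(base, deltas)
# Compactor runs -> new base (c_data.orc), no deltas left
new_base, remaining = compact(base, deltas)
```

Note that the original base is never modified in place: queries see the update by merging base and delta at read time until the compactor rewrites them into one file.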

Page 37: Hadoop databases for oracle DBAs

Databases

Page 38: Hadoop databases for oracle DBAs

Apache Hive

• “Designed” for “batch” queries (*)

• Runs on top of standard Hadoop RM: YARN

• Supports multiple “engines”: MR, TEZ, Spark

• SerDes

[Diagram: Master (Hiveserver2, namenode, YARN RM); slave nodes, each with YARN NM and datanode]

Page 39: Hadoop databases for oracle DBAs

Apache Impala

• Designed for “quick interactive” queries

• “Data-local” execution

• In-memory processing

[Diagram: Master (statestored, namenode, catalogd); Slave A, Slave B, Slave C, each with impalad and datanode]

Page 40: Hadoop databases for oracle DBAs

Apache Spark

• “Better Hadoop” with “native”: SQL, MLlib, GraphX

• In-memory processing, based on RDDs

• Supports many cluster managers: “native”, YARN, Mesos

• Flexible programming model

[Diagram: Master (Driver); Slave A, Slave B, Slave C, each with an Executor]

Page 41: Hadoop databases for oracle DBAs

Presto

• Designed for “interactive” queries

• In-memory processing

• Custom storage “plugins”: Hive, Kafka, MySQL, Postgres, …

[Diagram: Master (coordinator); Slave A, Slave B, Slave C, each with a worker]

Page 42: Hadoop databases for oracle DBAs

How to start

Page 43: Hadoop databases for oracle DBAs

Step 1: Google “Hadoop ecosystem”

Page 44: Hadoop databases for oracle DBAs

Step 2: Try to install the simplest thing

Page 45: Hadoop databases for oracle DBAs

Step 3

Page 46: Hadoop databases for oracle DBAs

Step 4

Page 47: Hadoop databases for oracle DBAs

Hint: Nobody builds their own Linux anymore

Page 48: Hadoop databases for oracle DBAs

Choose a Hadoop distribution that suits you

Page 49: Hadoop databases for oracle DBAs

Hadoop distributions

• Pre-built and pre-integrated (aka: all things work out of the box)

• Each has their own “philosophy” …

• … As well as preferred Hadoop database

Page 50: Hadoop databases for oracle DBAs

So what’s in it for me?

• It’s interesting (cool technology that hits many recent buzzwords)

• If you know ORACLE, it’s close to your skill set

• It’s promising and future oriented

Page 51: Hadoop databases for oracle DBAs

Q&A

Page 52: Hadoop databases for oracle DBAs

Please Complete Your Session Evaluation Evaluate this session in your COLLABORATE app. Pull up this session and tap "Session Evaluation" to complete the survey.

Session ID: 557