Bridging Oracle Database and Hadoop by Alex Gorbachev, Pythian, from Oracle OpenWorld IOUG Forum

Bridging Oracle Database and Hadoop Alex Gorbachev October 2015

Upload: alex-gorbachev

Post on 16-Apr-2017

TRANSCRIPT

Page 1: Bridging Oracle Database and Hadoop by Alex Gorbachev, Pythian from Oracle OpenWorld IOUG Forum

Bridging Oracle Database and Hadoop

Alex Gorbachev

October 2015

Page 2:

Alex Gorbachev
• Chief Technology Officer at Pythian
• Blogger
• Cloudera Champion of Big Data
• OakTable Network member
• Oracle ACE Director
• Founder of BattleAgainstAnyGuess.com
• Founder of Sydney Oracle Meetup
• EVP, IOUG

Page 3:

What is Big Data?

and why Big Data today?

Page 4:

Why the Big Data boom now?
• Advances in communication – it’s now feasible for anyone to transfer large amounts of data economically from virtually anywhere
• Commodity hardware – high performance and high capacity are available at a low price
• Commodity software – the open-source phenomenon has made advanced software products affordable to anyone
• New data sources – mobile, sensors, social media
• What was only possible at very high cost in the past can now be done by any business, small or large

Page 5:

Big Data = Affordable at Scale

Page 7:

Not everyone is Facebook, Google, Yahoo, etc.

These guys had to push the envelope because traditional technology didn’t scale.

Mere mortals’ challenge is cost and agility.

Page 8:

System capability per $

Big Data technology may be expensive at low scale due to high engineering effort.

Traditional technology becomes too complex and expensive to scale.

[Chart: capabilities vs. investments ($), with two curves: traditional and Big Data]

Page 9:

What is Hadoop?

Page 10:

Hadoop Design Principle #1: Scalable, Affordable, Reliable Data Store

HDFS – Hadoop Distributed Filesystem
• Cheap & Scalable
• Simple
• Specialized
• Shared nothing
• Single-threaded writes
• Fault tolerance

Page 11:

Hadoop Design Principle #2: Bring Code to Data

[Diagram: Code, Data]
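The principle can be illustrated with a toy sketch. This is not Hadoop code: the node names, the log lines, and the helper functions are all hypothetical, and the "cluster" is just a Python dictionary. The point it tries to show is that the function travels to each partition and only a small summary travels back.

```python
from collections import Counter

# Hypothetical cluster: each "node" holds its own partition of log lines.
partitions = {
    "node1": ["GET /a", "GET /b", "POST /a"],
    "node2": ["GET /a", "GET /a"],
    "node3": ["POST /b"],
}

def count_methods(lines):
    """The 'code' shipped to a node: runs next to the data, returns a small summary."""
    return Counter(line.split()[0] for line in lines)

def run_on_cluster(func, parts):
    """Apply func to every partition in place, then merge the partial results."""
    total = Counter()
    for _node, lines in parts.items():
        total += func(lines)  # only the summary leaves the node, not the raw lines
    return total

result = run_on_cluster(count_methods, partitions)
print(dict(result))
```

In a real cluster the per-partition calls would run in parallel on the machines that store the blocks; the sequential loop here only stands in for that.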

Page 12:

Why is Hadoop so affordable?
• Cheap hardware
• Resiliency through software
• Horizontal scalability
• Open-source software

Page 13:

How much does it cost?

Oracle Big Data Appliance X5-2 rack - $525K list price
• 18 data nodes
• 648 CPU cores
• 2.3 TB RAM
• 216 x 4TB disks
• 864TB of raw disk capacity
• 288TB usable (triple mirror)
• 40G InfiniBand + 10GbE networking
• Cloudera Enterprise
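The capacity figures on this slide follow from simple arithmetic, which the sketch below reproduces. The input numbers come from the slide; the per-terabyte prices are derived here for illustration only and ignore that the list price also pays for CPU, memory, networking and software.

```python
# Figures taken from the slide.
list_price_usd = 525_000
data_nodes = 18
disks = 216
disk_size_tb = 4
replication_factor = 3  # "triple mirror"

# Derived values (plain arithmetic, not vendor figures).
raw_tb = disks * disk_size_tb                 # total raw capacity
usable_tb = raw_tb // replication_factor      # capacity left after three copies
disks_per_node = disks // data_nodes
price_per_raw_tb = list_price_usd / raw_tb
price_per_usable_tb = list_price_usd / usable_tb

print(f"raw capacity:     {raw_tb} TB")
print(f"usable capacity:  {usable_tb} TB")
print(f"disks per node:   {disks_per_node}")
print(f"list price per raw TB:    ${price_per_raw_tb:,.0f}")
print(f"list price per usable TB: ${price_per_usable_tb:,.0f}")
```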

Page 14:

Hadoop is very flexible
• Rich ecosystem of tools
• Can handle any data format
  – Relational
  – Text
  – Audio, video
  – Streaming data
  – Logs
  – Non-relational structured data (JSON, XML, binary formats)
  – Graph data
• Not limited to relational data processing

Page 15:

Challenges with Hadoop for those of us used to Oracle
• New data access tools
  – Relational and non-relational data
• Non-Oracle (and non-ANSI) Hive SQL
  – Java-based UDFs and UDAFs
• Security features are not there out-of-the-box
• May be slow for “small data”

Page 16:

Tables in Hadoop

using Hadoop with relational data abstractions

Page 17:

Apache Hive
• Apache Hive provides a SQL layer over Hadoop
  – data in HDFS (structured or unstructured via SerDe)
  – using one of the distributed processing frameworks – MapReduce, Spark, Tez
• Presents data from HDFS as tables and columns
  – Hive metastore (aka data dictionary)
• SQL language access (HiveQL)
  – Parses SQL and creates execution plans in MR, Spark or Tez
• JDBC and ODBC drivers
  – Access from ETL and BI tools
  – Custom apps
  – Development tools
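To make "presents data from HDFS as tables and columns" more concrete, here is a minimal sketch that builds a HiveQL `CREATE EXTERNAL TABLE` statement over delimited text files. The table name, columns and HDFS path are hypothetical, and the helper only produces the statement text; it does not connect to Hive.

```python
def external_table_ddl(table, columns, hdfs_dir, delimiter=","):
    """Build HiveQL that maps delimited files in an HDFS directory to a table."""
    cols = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{hdfs_dir}'"
    )

ddl = external_table_ddl(
    "web_logs",  # hypothetical table name
    [("ts", "STRING"), ("user_id", "BIGINT"), ("url", "STRING")],
    "/data/raw/web_logs",  # hypothetical HDFS directory
)

# Once the table is registered in the metastore, the files can be queried with HiveQL.
query = "SELECT url, COUNT(*) AS hits FROM web_logs GROUP BY url"

print(ddl)
print(query)
```

Because the table is external, dropping it removes only the metastore entry; the files stay in HDFS.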

Page 18:

Native Hadoop tools
• Demo
• HUE
  – HDFS files
  – Hive
  – Impala

Page 19:

Access Hive using SQL Developer
• Demo
• Use Cloudera JDBC drivers
• Query data & browse metadata
• Run DDL from SQL tab
• Create Hive table definitions inside Oracle DB

Page 20:

Hadoop and OBIEE 11g
• OBIEE 11.1.1.7 can query Hive/Hadoop as a data source
  – Hive ODBC drivers
  – Apache Hive Physical Layer database type
• Limited features
  – OBIEE 11.1.1.7 has HiveServer1 ODBC drivers
  – HiveQL is only a subset of ANSI SQL
• Hive query response time is too slow for a speed-of-thought experience

Page 21:

ODI 12c
• ODI – data transformation tool
  – ELT approach pushes transformations down to Hadoop, leveraging the power of the cluster
  – Hive, HBase, Sqoop and OLH/ODCH KMs provide native Hadoop loading / transformation
• Upcoming support for Pig and Spark
• Workflow orchestration
• Metadata and model-driven
• GUI workflow design
• Transformation audit & data quality

Page 22:

Moving Data to Hadoop using ODI
• Interface with Apache Sqoop using the IKM SQL to Hive-HBase-File knowledge module
  – Hadoop ecosystem tool
  – Able to run in parallel
  – Optimized Sqoop JDBC driver integration for Oracle
  – Bi-directional: in and out of Hadoop to RDBMS
  – Data is moved directly between the Hadoop cluster and the database
• Export RDBMS data to file and load using IKM File to Hive
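For readers who have not used Sqoop directly, the sketch below assembles a standalone `sqoop import` command line. It is not what the ODI knowledge module generates; the host, service name, user, table and paths are hypothetical. The flags shown (`--connect`, `--username`, `--password-file`, `--table`, `--target-dir`, `--num-mappers`) are standard Sqoop 1 import arguments, and `--num-mappers` is where the "able to run in parallel" point shows up.

```python
import shlex

def sqoop_import_args(jdbc_url, username, password_file, table, target_dir, mappers=4):
    """Return the argument list for a parallel Sqoop import from an RDBMS table into HDFS."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,  # keeps the password off the command line
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", str(mappers),     # number of parallel map tasks
    ]

args = sqoop_import_args(
    "jdbc:oracle:thin:@//dbhost.example.com:1521/ORCL",  # hypothetical database
    "scott",
    "/user/etl/.oracle_pw",                              # hypothetical HDFS path
    "SALES",
    "/data/raw/sales",
    mappers=8,
)

# Print a shell-safe command line; run it yourself on a node with Sqoop installed.
print(shlex.join(args))
```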

Page 23:

Integrating Hadoop with Oracle Database

Page 24:

Oracle Big Data Connectors
• Oracle Loader for Hadoop
  – Offloads some pre-processing to Hadoop MR jobs (data type conversion, partitioning, sorting)
  – Direct load into the database (online method)
  – Data Pump binary files in HDFS (offline method)
    • These can then be accessed as external tables on HDFS
• Oracle Direct Connector for Hadoop
  – Create external table on files in HDFS
  – Text files or Data Pump binary files
  – WARNING: lots of data movement! Great for archiving infrequently accessed data to HDFS

Page 25:

Oracle Big Data SQL

Source: http://www.slideshare.net/gwenshap/data-wrangling-and-oracle-connectors-for-hadoop

Page 26:

Oracle Big Data SQL
• Transparent access from Oracle DB to Hadoop
  – Oracle SQL dialect
  – Oracle DB security model
  – Join data from Hadoop and Oracle
• SmartScan - pushing code to data
  – Same software base as on Exadata Storage Cells
  – Minimize data transfer from Hadoop to Oracle
• Requires BDA and Exadata
• Licensed per Hadoop disk spindle
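The "minimize data transfer" idea can be sketched conceptually. The code below is not how SmartScan is implemented; it is a toy comparison, with hypothetical rows and function names, between filtering after shipping every row and filtering next to the data before anything is shipped.

```python
# Hypothetical rows held on the Hadoop side; "payload" stands in for wide columns.
hadoop_rows = [
    {"id": 1, "country": "CA", "amount": 120.0, "payload": "x" * 1000},
    {"id": 2, "country": "US", "amount": 75.5, "payload": "x" * 1000},
    {"id": 3, "country": "CA", "amount": 10.0, "payload": "x" * 1000},
    {"id": 4, "country": "AU", "amount": 300.0, "payload": "x" * 1000},
]

def filter_in_database(rows, country):
    """Ship every full row to the database, then filter and project there."""
    shipped = list(rows)
    result = [(r["id"], r["amount"]) for r in shipped if r["country"] == country]
    return result, len(shipped)

def filter_at_storage(rows, country):
    """Filter and project next to the data; ship only the matching columns."""
    shipped = [(r["id"], r["amount"]) for r in rows if r["country"] == country]
    return shipped, len(shipped)

naive_result, naive_shipped = filter_in_database(hadoop_rows, "CA")
pushed_result, pushed_shipped = filter_at_storage(hadoop_rows, "CA")

# Both approaches give the same answer; they differ in how much crosses the network.
print("rows shipped without pushdown:", naive_shipped)
print("rows shipped with pushdown:   ", pushed_shipped)
```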

Page 27:

Big Data SQL Demo

Page 28:

Big Data SQL in Oracle tools
• Transparent to any app
• SQL Developer
• ODI
• OBIEE

Page 29:

Hadoop as Data Warehouse

Page 30:

Traditional Needs of Data Warehouses
• Speed of thought end user analytics experience
  – BI tools coupled with DW databases
• Scalable data platform
  – DW database
• Versatile and scalable data transformation engine
  – ETL tools sometimes coupled with DW databases
• Data quality control and audit
  – ETL tools

Page 33:

What drives Hadoop adoption for Data Warehousing?

1. Cost efficiency
2. Agility needs

Page 34:

Why is Hadoop Cost Efficient?

Hadoop leverages two main trends in the IT industry
• Commodity hardware – high performance and high capacity are available at a low price
• Commodity software – the open-source phenomenon has made advanced software products affordable to anyone

Page 35:

How Does Hadoop Enable Agility?
• Load first, structure later
  – No need to spend months changing the DW to add new types of data without knowing for sure they will be valuable to end users
  – Quick and easy to verify a hypothesis – perfect data exploration platform
• All data in one place is very powerful
  – Much easier to test new theories
• Natural fit for “unstructured” data
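"Load first, structure later" is often described as schema-on-read. The sketch below illustrates the idea with plain Python and JSON lines; the sample events and the `read_with_schema` helper are hypothetical. Raw records are stored untouched, and each question applies its own set of fields when reading.

```python
import json

# Raw events are stored as they arrive, without agreeing on a schema first.
raw_events = [
    '{"user": "ann", "action": "login", "ts": "2015-10-01T10:00:00"}',
    '{"user": "bob", "action": "purchase", "amount": 42.5}',
    "not valid json at all",
]

def read_with_schema(lines, fields):
    """Apply a schema at read time: keep the requested fields, fill gaps with None,
    and skip lines that cannot be parsed as JSON objects."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        yield {field: record.get(field) for field in fields}

# Two different "schemas" over the same raw data, chosen when the question is asked.
actions = list(read_with_schema(raw_events, ["user", "action"]))
amounts = list(read_with_schema(raw_events, ["user", "amount"]))

print(actions)
print(amounts)
```

The trade-off is that data quality checks move from load time to read time, which ties in with the data quality and audit challenge raised elsewhere in the deck.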

Page 36:

Traditional needs of DW & Hadoop
• Speed of thought end user analytics experience?
  – Very recent features – Impala, Presto, Drill, Hadapt, etc.
  – BI tools embracing Hadoop as DW
  – Totally new products become available
• Scalable data platform?
  – Yes
• Versatile and scalable data transformation engine?
  – Yes, but needs a lot of DIY
  – ETL vendors embraced Hadoop
• Data quality control and audit?
  – Hadoop makes it more difficult because of the flexibility it brings
  – A lot of DIY, but ETL vendors are getting better at supporting Hadoop and new products appear

[Slide callouts: "Getting there", "Challenge"]

Page 37:

Unique Hadoop Challenges
• Still “young” technology
  – requires a lot of high quality engineering talent
• Security doesn’t come out of the box
  – Capabilities are there but very tedious to implement and somewhat fragile
• Challenge of selecting the right tool for the job
  – Hadoop ecosystem is huge
• Hadoop breaks IT silos
• Requires commoditization of IT operations
  – Large footprint with agile deployments

Page 38:

Typical Hadoop adoption in modern Enterprise IT

[Diagram: Data Warehouse, Hadoop, BI tools]

Page 39:

Bring the world in your data center

Page 40:

Rare historical report

Page 41:

Find a needle in a haystack

Page 42:

Will Hadoop displace traditional DW platforms?

[Diagram: Hadoop, BI tools]

Page 43:

Example pure Hadoop DW stack

[Diagram components: HDFS; Hive/Pig; Impala; Flume; Sqoop; DIY; Kerberos; Oozie + DIY; data sources]

Page 44:

Do you have a Big Data problem?

Page 45:

Your Data is NOT as BIG as you think

Page 46:

Using 8-year-old hardware… is NOT a Big Data problem

Page 47:

Misconfigured infrastructure… is NOT a Big Data problem

Page 48:

Lack of purging policy… is NOT a Big Data problem

Page 49:

Bad data model design… is NOT a Big Data problem

Page 50:

Bad SQL… is NOT a Big Data problem

Page 51:

Your Data is NOT as BIG as you think

Controversy…

Page 52:

Thanks and Q&A

Contact info
[email protected]
+1-877-PYTHIAN

To follow us
pythian.com/blog
@alexgorbachev @pythian
linkedin.com/company/pythian