integrated data warehouse with hadoop and oracle database

42
With Oracle Database and Hadoop Building the Integrated Data Warehouse Gwen Shapira, Senior Consultant

Upload: chen-gwen-shapira

Post on 26-Jan-2015

113 views

Category:

Technology


3 download

DESCRIPTION

Use cases and advice on integrating Hadoop with the enterprise data warehouse

TRANSCRIPT

Page 1: Integrated Data Warehouse with Hadoop and Oracle Database

With Oracle Database and Hadoop

Building the Integrated Data Warehouse

Gwen Shapira, Senior Consultant

Page 2: Integrated Data Warehouse with Hadoop and Oracle Database

© 2012 – Pythian

Why Pythian •  Recognized Leader:

•  Global industry leader in data infrastructure managed services and consulting with expertise in Oracle, Oracle Applications, Microsoft SQL Server, MySQL, big data and systems administration

•  Work with over 200 multinational companies such as Forbes.com, Fox Sports, Nordion and Western Union to help manage their complex IT deployments

•  Expertise: •  One of the world’s largest concentrations of dedicated, full-time DBA expertise. Employ 8

Oracle ACEs/ACE Directors •  Hold 7 Specializations under Oracle Platinum Partner program, including Oracle Exadata,

Oracle GoldenGate & Oracle RAC

•  Global Reach & Scalability: •  24/7/365 global remote support for DBA and consulting, systems administration, special

projects or emergency response

Page 3: Integrated Data Warehouse with Hadoop and Oracle Database

© 2012 – Pythian

About Gwen Shapira • Oracle ACE Director • 13 Years with pager

•  7 as Oracle DBA

• Senior Consultant:

• Has MacBook, will travel.

•  @gwenshap

•  http://www.pythian.com/news/author/shapira/

Page 4: Integrated Data Warehouse with Hadoop and Oracle Database

© 2012 – Pythian

Agenda • What is Big Data?

• Why do we care about Big Data?

• Why your DWH needs Hadoop?

•  Examples of Hadoop in the DWH

•  How to integrate Hadoop into your DWH

•  Avoiding major pitfalls

Page 5: Integrated Data Warehouse with Hadoop and Oracle Database

What is Big Data?

Page 6: Integrated Data Warehouse with Hadoop and Oracle Database

© 2012 – Pythian

MORE DATA THAN YOU CAN HANDLE

Page 7: Integrated Data Warehouse with Hadoop and Oracle Database

© 2012 – Pythian

MORE DATA THAN RELATIONAL DATABASES CAN HANDLE

Page 8: Integrated Data Warehouse with Hadoop and Oracle Database

© 2012 – Pythian

MORE DATA THAN RELATIONAL DATABASES CAN HANDLE CHEAPLY

Page 9: Integrated Data Warehouse with Hadoop and Oracle Database

© 2012 – Pythian

Data Arriving at fast Rates Typically unstructured Stored without aggregation

Analyzed in Real Time For Reasonable Cost

Page 10: Integrated Data Warehouse with Hadoop and Oracle Database

© 2012 – Pythian

Where does Big Data come from?

• Social media • Enterprise transactional data • Consumer behaviour • Multimedia • Sensors and embedded devices • Network devices

Page 11: Integrated Data Warehouse with Hadoop and Oracle Database

© 2012 – Pythian

Why all the Excitement?

Page 12: Integrated Data Warehouse with Hadoop and Oracle Database

© 2012 – Pythian

Complex Data Architecture

Page 13: Integrated Data Warehouse with Hadoop and Oracle Database

Your DWH needs Hadoop

Page 14: Integrated Data Warehouse with Hadoop and Oracle Database

© 2012 – Pythian

Big Problems with Big Data •  It is:

•  Unstructured

•  Unprocessed

•  Un-aggregated

•  Un-filtered

•  Repetitive

•  And generally messy.

Oh, and there is a lot of it.

Page 15: Integrated Data Warehouse with Hadoop and Oracle Database

© 2012 – Pythian

Technical Challenges •  Storage capacity

•  Storage throughput

•  Pipeline throughput

•  Processing power

•  Parallel processing

•  System Integration

• Data Analysis

Scalable storage

Massive Parallel Processing

Ready to use tools

Page 16: Integrated Data Warehouse with Hadoop and Oracle Database

© 2012 – Pythian

Hadoop Principles Bring Code to Data Share Nothing

Page 17: Integrated Data Warehouse with Hadoop and Oracle Database

© 2012 – Pythian

Hadoop in a Nutshell

Map-Reduce: Framework for writing massively parallel jobs

HDFS: ���Replicated Distributed Big-Data File System

Page 18: Integrated Data Warehouse with Hadoop and Oracle Database

© 2012 – Pythian

Hadoop Benefits •  Reliable solution based on unreliable hardware

• Designed for large files

•  Load data first, structure later

• Designed to maximize throughput of large scans

• Designed to maximize parallelism

• Designed to scale

•  Flexible development platform

•  Solution Ecosystem

Page 19: Integrated Data Warehouse with Hadoop and Oracle Database

© 2012 – Pythian

Hadoop Limitations • Hadoop is scalable but not fast • Batteries not included •  Instrumentation not included either • Well-known reliability limitations

Page 20: Integrated Data Warehouse with Hadoop and Oracle Database

Use Cases and Customer Stories

Hadoop in the Data Warehouse

Page 21: Integrated Data Warehouse with Hadoop and Oracle Database

© 2012 – Pythian

ETL for Unstructured Data

Logs Web servers, app server, clickstreams

Flume Hadoop Cleanup,

aggregation Longterm storage

DWH BI,

batch reports

Page 22: Integrated Data Warehouse with Hadoop and Oracle Database

© 2012 – Pythian

ETL for Structured Data

OLTP Oracle, MySQL,

Informix…

Sqoop, Perl

Hadoop Transformation

aggregation Longterm storage

DWH BI,

batch reports

Page 23: Integrated Data Warehouse with Hadoop and Oracle Database

© 2012 – Pythian

Bring the World into Your Datacenter

Page 24: Integrated Data Warehouse with Hadoop and Oracle Database

© 2012 – Pythian

Rare Historical Report

Page 25: Integrated Data Warehouse with Hadoop and Oracle Database

© 2012 – Pythian

Find Needle in Haystack

Page 26: Integrated Data Warehouse with Hadoop and Oracle Database

© 2012 – Pythian

We are not doing SQL anymore

Page 27: Integrated Data Warehouse with Hadoop and Oracle Database

Connecting the (big) Dots

Page 28: Integrated Data Warehouse with Hadoop and Oracle Database

© 2012 – Pythian

Sqoop Queries

Page 29: Integrated Data Warehouse with Hadoop and Oracle Database

© 2012 – Pythian

Sqoop is Flexible (for import)

• Select <columns> from <table> where <condition> • Or <write your own query>

• Split column

• Parallel

• Incremental

• File formats

Page 30: Integrated Data Warehouse with Hadoop and Oracle Database

© 2012 – Pythian

Sqoop Import Examples

•  Sqoop  import  -­‐-­‐connect  jdbc:oracle:thin:@//dbserver:1521/masterdb    -­‐-­‐username  hr  -­‐-­‐table  emp    -­‐-­‐where  “start_date  >  ’01-­‐01-­‐2012’”  

 

•  Sqoop  import  jdbc:oracle:thin:@//dbserver:1521/masterdb    -­‐-­‐username  myuser    -­‐-­‐table  shops  -­‐-­‐split-­‐by  shop_id    -­‐-­‐num-­‐mappers  16  

Must be indexed or partitioned to avoid 16 full table scans

Page 31: Integrated Data Warehouse with Hadoop and Oracle Database

© 2012 – Pythian

Less Flexible Export

• 100 row batch inserts • Commit every 100 batches

• Parallel export

• Update mode Example:

sqoop  export    -­‐-­‐connect  jdbc:oracle:thin:@//dbserver:1521/masterdb    -­‐-­‐table  bar    -­‐-­‐export-­‐dir  /results/bar_data  

Page 32: Integrated Data Warehouse with Hadoop and Oracle Database

© 2012 – Pythian

Fuse-DFS • Mount HDFS on Oracle server:

• sudo yum install hadoop-0.20-fuse

• hadoop-fuse-dfs dfs://<name_node_hostname>:<namenode_port> <mount_point>

• Use external tables to load data into Oracle • File Formats may vary • All ETL best practices apply

Page 33: Integrated Data Warehouse with Hadoop and Oracle Database

© 2012 – Pythian

Oracle Loader for Hadoop • Load data from Hadoop into Oracle • Map-Reduce job inside Hadoop • Converts data types. • Partitions and sorts • Direct path loads • Reduces CPU utilization on database

Page 34: Integrated Data Warehouse with Hadoop and Oracle Database

© 2012 – Pythian

Oracle Direct Connector to HDFS • Create external tables of files in HDFS • PREPROCESSOR  HDFS_BIN_PATH:hdfs_stream  • All the features of External Tables • Tested (by Oracle) as 5 times faster (GB/s) than FUSE-DFS

Page 35: Integrated Data Warehouse with Hadoop and Oracle Database

© 2012 – Pythian

Big Data Appliance and Exadata

Page 36: Integrated Data Warehouse with Hadoop and Oracle Database

How not to Fail

Page 37: Integrated Data Warehouse with Hadoop and Oracle Database

© 2012 – Pythian

Data that belongs in RDBMS

Page 38: Integrated Data Warehouse with Hadoop and Oracle Database

© 2012 – Pythian

Prepare for Migration

Page 39: Integrated Data Warehouse with Hadoop and Oracle Database

© 2012 – Pythian

Use Hadoop Efficiently •  Understand your bottlenecks:

•  CPU, storage or network?

•  Reduce use of temporary data:

•  All data is over the network

•  Written to disk in triplicate.

•  Eliminate unbalanced workloads

• Offload work to RDBMS

•  Fine-tune optimization with Map-Reduce

Page 40: Integrated Data Warehouse with Hadoop and Oracle Database

© 2012 – Pythian

Your Data is NOT as BIG

as you think

Page 41: Integrated Data Warehouse with Hadoop and Oracle Database

© 2012 – Pythian

Getting Started

• Pick a Business Problem •  Acquire Data

• Use right tool for the job •  Hadoop can start on the cheap

•  Integrate the systems

• Analyze data •  Get operational

Page 42: Integrated Data Warehouse with Hadoop and Oracle Database

© 2012 – Pythian

Thank you and Q&A

http://www.pythian.com/news/

http://www.facebook.com/pages/The-Pythian-Group/163902527671

@pythian

http://www.linkedin.com/company/pythian

1-877-PYTHIAN

[email protected]

To contact us…

To follow us…

@pythianjobs