delivering the data factory, data reservoir and a scalable oracle big data architecture

T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : [email protected] W : www.rittmanmead.com Rittman Mead BI Forum 2015 Masterclass Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Upload: mark-rittman

Post on 25-Jul-2015

Category:

Data & Analytics


TRANSCRIPT

Page 1: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Rittman Mead BI Forum 2015 Masterclass: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Page 2: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Part 1 Designing the Data Reservoir & Data Factory

Page 3: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

The Oracle IM + Big Data Reference Architecture

[Diagram labels. Components: Event Engine, Data Reservoir, Data Factory, Enterprise Information Store, Reporting, Discovery Lab. Inputs: Input Events, Events & Data, Structured Enterprise Data, Other Data. Outputs: Actionable Events, Actionable Information, Actionable Insights, Discovery Output. Groupings: Execution, Innovation.]

Page 4: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Page 5: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

The Next-Gen BI Environment from this Architecture

•Traditional RDBMS DW now complemented by a Hadoop/NoSQL-based data reservoir
•“Data Factory” term used for the ETL and loading processes that provide the conduit between them
•Some data may be loaded into the data reservoir and only exist there
•Some will be further processed and loaded into the DW (“Enterprise Information Store”)
•Some may get loaded directly into the RDBMS
•Use the best option to support business needs

Page 6: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Introducing … The “Data Reservoir”?

•A reservoir is a lake that can also process and refine (your data)
•Wide-ranging source of low-density, lower-value data to complement the DW

Page 7: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Today’s Layered Data Warehouse Architecture

[Diagram labels: Virtualization & Query Federation; Enterprise Performance Management; Pre-built & Ad-hoc BI Assets; Information Services; Data Ingestion; Information Interpretation; Access & Performance Layer; Foundation Data Layer; Raw Data Reservoir; Data Science; Discovery Lab Sandboxes; Rapid Development Sandboxes]

Data Sources
•Data Engines & Poly-structured sources: Content, Docs, Web & Social Media, SMS
•Structured Data Sources: Operational Data, COTS Data, Master & Ref. Data, Streaming & BAM

Annotations
•Immutable raw data reservoir. Raw data at rest is not interpreted
•Immutable modelled data. Business-process-neutral form. Abstracted from business process changes
•Past, current and future interpretation of enterprise data. Structured to support agile access & navigation
•Project-based data stores to support specific discovery objectives
•Project-based data stores to facilitate rapid content / presentation delivery

Page 8: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Combining Oracle RDBMS with Hadoop + NoSQL

•High-value, high-density data goes into Oracle RDBMS
‣Better support for fast queries, summaries, referential integrity etc

•Lower-value, lower-density data goes into Hadoop + NoSQL
‣Also provides flexible schema, more agile development

•Successful next-generation BI+DW projects combine both - neither on their own is sufficient

Page 9: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Options for Implementing a Data Reservoir

•Can add a Hadoop cluster, on commodity/existing server hardware, and link to Oracle DB ‣Use ODI etc for data transfer between Hadoop + Oracle

•Can implement using VMs etc for prototyping exercise ‣But beware of shared/virtualized storage for real production usage

•Approach taken by most of our “starter” customers, and by us in development

Page 10: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Oracle’s Engineered System Data Reservoir Platform

Page 11: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Hadoop Distribution Options

•Cloudera CDH
‣Used in Oracle Big Data Appliance, typically first to be supported with ODI etc
•Hortonworks HDP
‣Usually second to be supported; supports Tez, but late with Spark etc
•MapR
‣Some prefer this, but rarely certified with Oracle products
•Pivotal / ODP
‣Sometimes found in use with banks etc, but also rarely certified
•..etc

Page 12: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Oracle’s Big Data Products

•Oracle Big Data Appliance
‣Optimized hardware for Hadoop processing
‣Cloudera Distribution incl. Hadoop
‣Oracle Big Data Connectors, ODI etc
•Oracle Big Data Connectors
•Oracle Big Data SQL
•Oracle NoSQL Database
•Oracle Data Integrator
•Oracle R Distribution
•OBIEE, BI Publisher and Endeca Info Discovery

Page 13: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Oracle Big Data Appliance

•Engineered system for big data processing and analysis
•Optimized for enterprise Hadoop workloads
•288 Intel® Xeon® E5 Processors
•1152 GB total memory
•648TB total raw storage capacity
‣Cloudera Distribution of Hadoop
‣Cloudera Manager
‣Open-source R
‣Oracle NoSQL Database Community Edition
‣Oracle Enterprise Linux + Oracle JVM
‣New - Oracle Big Data SQL

Page 14: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Working with Oracle Big Data Appliance

•Don’t underestimate the value of “pre-integrated” - massive time-saver for client
‣No need to integrate Big Data Connectors, ODI Agent etc with HDFS, Hive etc
•Single support route - raise SR with Oracle, they will route to Cloudera if needed
•Single patch process for whole cluster - OS, CDH etc
•Full access to Cloudera Enterprise features
•Otherwise … just another CDH cluster in terms of SSH access etc
•We like it ;-)

Page 15: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Working with Cloudera Hadoop (CDH) - Observations

•Very good product stack, enterprise-friendly, big community, can do lots with free edition
•Cloudera have their favoured Hadoop technologies - Spark, Kafka
•Also makes use of Cloudera-specific tools - Impala, Cloudera Manager etc
•But ignores some tools that have value - Apache Tez for example
•Easy for an Oracle developer to get productive with the CDH stack
•But beware of some immature technologies / products
‣Hive != Oracle SQL
‣Spark is very much an “alpha” product
‣Limitations in things like LDAP integration, end-to-end security
‣Lots of products in stack = lots of places to go to diagnose issues

Page 16: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

CDH : Things That Work Well

•HDFS as a low-cost, flexible data store / reservoir; Hive for SQL access to structured + semi-structured HDFS data
•Pig, Spark, Python, R for data analysis and munging
•Cloudera Manager and Hue for web-based admin + dev access

[Diagram labels: Real-Time Logs / Events; RDBMS Imports; File / Unstructured Imports; Hive Metastore / HCatalog; HDFS Cluster Filesystem]

Page 17: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Oracle Big Data Connectors

•Oracle-licensed utilities to connect Hadoop to Oracle RDBMS
‣Bulk-extract data from Hadoop to Oracle, or expose HDFS / Hive data as external tables
‣Run R analysis and processing on Hadoop
‣Leverage Hadoop compute resources to offload ETL and other work from Oracle RDBMS
‣Enable Oracle SQL to access and load Hadoop data

Page 18: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Working with the Oracle Big Data Connectors

•Oracle Loader for Hadoop, Oracle SQL Connector for HDFS - rarely used
‣Sqoop works both ways (Oracle>Hadoop, Hadoop>Oracle) and is “good enough”
‣OSCH replaced by Oracle Big Data SQL for direct Oracle>Hive access
•Oracle R Advanced Analytics for Hadoop has been very useful though
‣Run MapReduce jobs from R
‣Run R functions across Hive tables

Page 19: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Oracle R Advanced Analytics for Hadoop Key Features

•Run R functions on Hive Dataframes •Write MapReduce functions in R

Page 20: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Oracle Big Data SQL

•Part of Oracle Big Data 4.0 (BDA-only) ‣Also requires Oracle Database 12c, Oracle Exadata Database Machine

•Extends Oracle Data Dictionary to cover Hive
•Extends Oracle SQL and SmartScan to Hadoop
•Extends Oracle Security Model over Hadoop
‣Fine-grained access control
‣Data redaction, data masking
‣Uses fast C-based readers where possible (vs. Hive MapReduce generation)
‣Map Hadoop parallelism to Oracle PQ
‣Big Data SQL engine works on top of YARN
‣Like Spark, Tez, MR2

[Diagram labels: SQL Queries; Exadata Database Server; Oracle Big Data SQL; Exadata Storage Servers; Hadoop Cluster; SmartScan (shown twice)]

Page 21: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Still a Key Role for Data Integration, and BI Tools

•Fast, scalable, low-cost / flexible-schema data capture using Hadoop + NoSQL (BDA)
•Long-term storage of the most important downstream data - Oracle RDBMS (Exadata)
•Fast analysis + business-friendly interface : OBIEE, Endeca (Exalytics), RTD etc

Page 22: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Productising the Next-Generation IM Architecture

Page 23: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

OBIEE for Enterprise Analysis Across all Data Sources

•Dashboards, analyses, OLAP analytics, scorecards, published reporting, mobile

•Presented as an integrated business semantic model •Optional mid-tier query acceleration using Oracle Exalytics In-Memory Machine

•Access data from RDBMS, applications, Hadoop, OLAP, ADF BCs etc

[Diagram labels: Business Presentation Layer (Reports, Dashboards); Enterprise Semantic Business Model; In-Memory Caching Layer; Application Sources; Hadoop / NoSQL Sources; DW / OLAP Sources]

Page 24: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Adding Search / Discovery Tools

•For searching and cataloging data in the data reservoir •Typically use concepts of faceted search, and reading from Hive metastore •Options include Elasticsearch, Cloudera Search / Hue, Oracle Big Data Discovery

Page 25: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Bringing it All Together : Oracle Data Integrator 12c

•ODI provides an excellent framework for running Hadoop ETL jobs ‣ELT approach pushes transformations down to Hadoop - leveraging power of cluster

•Hive, HBase, Sqoop and OLH/ODCH KMs provide native Hadoop loading / transformation ‣Whilst still preserving RDBMS push-down ‣Extensible to cover Pig, Spark etc

•Process orchestration
•Data quality / error handling
•Metadata and model-driven
•New in 12.1.3.0.1 - ability to generate Pig and Spark jobs too

Page 26: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

How This Differs from the Discovery Lab

•We’re still loading and storing into Hadoop and NoSQL, but… ‣There’s governance and change control ‣Data is secured ‣Data loading and pipelines are resilient and “industrialized” ‣We use ETL tools, BI tools and search tools to enable access by end-users ‣We think about design standards, file and directory layouts, metadata etc

•Build on insights and models created in the Discovery Lab •Put them into production so the business can rely on them

Page 27: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Part 2 Building the Data Reservoir & Data Factory

Page 28: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Typical RM Project BDA Topology

•Starter BDA rack, or full rack
•Kerberos-secured using included KDC server
•Integration with corporate LDAP for Cloudera Manager, Hue etc
•Developer access through Hue, Beeline, R Studio
•End-user access through OBIEE, Endeca and other tools
‣With final datasets usually exported to Exadata or Exalytics

Page 29: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Typical RM Hadoop + BDD Development Environment

•Development takes place on workstations, not directly on Hadoop / BDA nodes
•ODI agent needs to be installed on a Hadoop node, or just use Oozie scheduler
•BDD typically runs on dedicated servers, can also be clustered
•CDH5.3 is a good place to start in terms of compatibility, being supported etc
•Can usually use CDH Express, but full version can be trialled for 60 days
‣Useful for Cloudera Navigator, testing LDAP integration with CM

Page 30: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Components Required for Typical Production Environment

•Hadoop cluster - typically 6-20 nodes, CDH or Hortonworks HDP with YARN / Hadoop 2.0 ‣Can deploy on-premise, or in cloud (AWS etc) using Cloudera Director

•Oracle Database, ideally Exadata for Big Data SQL capabilities •ODI12c 12.1.3.0.1 with Big Data Options (additional license required over ODI EE) •Oracle Big Data Discovery ‣Currently only certified on CDH5.3, no Kerberos support yet

•Oracle Business Intelligence 11g ‣Limited Hive compatibility with 11.1.1.7; 11.1.1.9 promises HiveServer2 + Impala support

Page 31: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Complete Oracle Big Data Product Stack

Page 32: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Typical Configuration Tasks Post-Install

•Configure BDA directory structure, user access, LDAP integration etc
•Connect ODI12c 12.1.3.0.1 to Hive, HDFS, Pig and Spark on Hadoop cluster
•Connect OBIEE11g to Hive (and Impala)
•Set up a developer workstation with client libraries, ODI Studio, OBIEE BI Administrator etc

/user/mrittman/scratchpad
/user/ryeardley/scratchpad
/user/mpatel/scratchpad
/data/rm_website_analysis/logfiles/incoming
/data/rm_website_analysis/logfiles/archive
/data/rm_website_analysis/tweets/incoming
/data/rm_website_analysis/tweets/archive

Page 33: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Configuring Hadoop (BDA) for LDAP Integration

•Both Cloudera Manager (with CDH Enterprise) and Hue can be linked to corporate LDAP •Hive, Impala etc also need to be configured if you want to use Apache Sentry

Page 34: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Configure HDFS Directory Structure, Permissions

•Best practice is to create application-specific HDFS directories for shared data
•Separate ETL out from archiving, store data in subdirectory partitions
•Use POSIX security model to grant RO access to groups of users
•Consider using new HDFS ACLs where appropriate (beware memory implications though)

/user/mrittman/scratchpad
/user/ryeardley/scratchpad
/user/mpatel/scratchpad
/data/rm_website_analysis/logfiles/incoming
/data/rm_website_analysis/logfiles/archive
/data/rm_website_analysis/tweets/incoming
/data/rm_website_analysis/tweets/archive
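
A layout like this can be scripted so that every application gets the same incoming / archive structure. The Python sketch below only builds `hdfs dfs` command lines as strings and does not run them. The group name `rm_analysts` and the mode `750` are assumptions made for illustration, not settings taken from the masterclass.

```python
from typing import List


def hdfs_layout_commands(app: str, feeds: List[str], group: str = "rm_analysts") -> List[str]:
    """Build hdfs dfs commands for an application-specific data directory tree."""
    commands = []
    for feed in feeds:
        # Keep landing (incoming) data separate from archived data.
        for stage in ("incoming", "archive"):
            commands.append(f"hdfs dfs -mkdir -p /data/{app}/{feed}/{stage}")
    root = f"/data/{app}"
    # Hypothetical group name; mode 750 gives the group read and traverse
    # rights only, and no access to other users.
    commands.append(f"hdfs dfs -chgrp -R {group} {root}")
    commands.append(f"hdfs dfs -chmod -R 750 {root}")
    return commands


if __name__ == "__main__":
    for cmd in hdfs_layout_commands("rm_website_analysis", ["logfiles", "tweets"]):
        print(cmd)
```

Generating the commands rather than typing them keeps directory names consistent across feeds, which matters once Hive tables and ETL jobs start depending on the paths.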

Page 35: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Consider Access Control to Hive, Impala Tables

•Usual access control strategy is to limit users to accessing data through Hive tables

•Consider using Apache Sentry to provide RBAC over Hive and Impala tables ‣Column-based restrictions possible through SQL views ‣Requires Kerberos authentication and Hive/Impala LDAP integration as prerequisites

•Oracle Big Data SQL potentially a more complete solution, if available

Page 36: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Configuring ODI12c 12.1.3.0.1 for Hadoop Data Integration

•New Hadoop DS technology used for registering base cluster details
•New WebLogic Hive drivers used for Hive table access
•Pig and Spark datasources configured for Pig Latin / Spark execution
•Either client workstation needs to be configured as Hadoop client, or ODI agent installed on a Hadoop node
‣To execute Pig, Hive etc mappings
•Option now to use Oozie scheduler rather than ODI agent
‣Avoids need to install ODI agent on cluster
‣Integrates ODI workflows with other Hadoop scheduling

Page 37: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Configuring OBIEE for Cloudera Impala Access

•Not officially supported with OBIEE 11.1.1.7, but does work
•Only possible using Windows version of OBIEE (looser rules around unsupported drivers)
•OBIEE 11.1.1.9 will come with Impala support
•Use Cloudera ODBC drivers
•Configure Database Type as Apache Hadoop
•For earlier versions of Impala, may need to disable ORDER BY in Database Features and have the BI Server do the sorting
•Issue is that earlier versions of Impala require LIMIT with all ORDER BY clauses
‣OBIEE could use LIMIT, but doesn’t for Impala at the moment (because not supported)

Page 38: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Configuring OBIEE to Access a Kerberos-Secured Cluster

•Most production Hadoop clusters are Kerberos-secured •OBIEE can access secured clusters with appropriate ODBC drivers •Typically install Kerberos client on Windows workstation, and on server side

• If OBIEE runs using a system service account, ensure it can request a ticket too

Page 39: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Configuring Oracle Big Data Discovery

•Configuration done during BDD installation, tied to a particular Hadoop cluster
•Specify Cloudera Manager + Hadoop service URLs
•May need to adjust RAM allocated to Spark Workers in Cloudera Manager
‣Currently only Spark Standalone (not YARN) supported

Page 40: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

End-to-End Oracle Big Data Example

•Rittman Mead want to understand drivers and audience for their website ‣What is our most popular content? Who are the most in-demand blog authors? ‣Who are the influencers? What do they read?

•Three data sources in scope:

‣RM Website Logs
‣Twitter Stream
‣Website Posts, Comments etc

Page 41: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Two Analysis Scenarios : Reporting, and Data Discovery

•Initial task will be to ingest data from webserver logs, Twitter firehose, site content + ref data
•Land in Hadoop cluster, basic transform, format, store; then, analyse the data:

Reporting: combine with Oracle Big Data SQL for structured OBIEE dashboard analysis
‣What pages are people visiting?
‣Who is referring to us on Twitter?
‣What content has the most reach?

Data Discovery: combine with site content, semantics, text enrichment; catalog and explore using Oracle Big Data Discovery
‣Why is some content more popular?
‣Does sentiment affect viewership?
‣What content is popular, where?

Page 42: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Data Sources used for ETL Ingestion & Reporting Exercise

[Diagram. Stages: Ingest, Process, Publish. Labels: Flume (two agents); Cloudera CDH5.3 BDA Hadoop Cluster (Spark, Hive and HDFS on each of three nodes); Big Data SQL; SQL for BDA Exec; Filtered & Projected Rows / Columns; Dim Attributes; Exadata; 12c In-Mem; Exalytics; TimesTen; OBIEE]

Page 43: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Apache Flume : Distributed Transport for Log Activity

•Apache Flume is the standard way to transport log files from source through to target
•Initial use-case was webserver log files, but can transport any file from A>B
•Does not do data transformation, but can send to multiple targets / target types
•Mechanisms and checks to ensure successful transport of entries
•Has a concept of “agents”, “sinks” and “channels”
‣Agents collect and forward log data
‣Sinks store it in final destination
‣Channels store log data en-route
•Simple configuration through INI files
•Handled outside of ODI12c

Page 44: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Flume Source / Target Configuration

•Conf file for source system agent •TCP port, channel size+type, source type

•Conf settings for target agent, through CM •TCP port, channel size+type, sink type
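
As a rough illustration of what the source-side conf file contains, the Python sketch below renders one as text. The property keys (`type`, `command`, `capacity`, `hostname`, `port`, `channel`) are standard Flume settings for an exec source, memory channel and avro sink. The agent name, log path, host and port in the usage example are hypothetical, and choosing an exec source with an avro sink is an assumption, since the slide does not show which types were used.

```python
def flume_source_agent_conf(agent: str, log_path: str, target_host: str,
                            port: int, capacity: int = 1000) -> str:
    """Render a Flume properties file: exec source -> memory channel -> avro sink."""
    lines = [
        # Name the components that make up this agent.
        f"{agent}.sources = src",
        f"{agent}.channels = ch",
        f"{agent}.sinks = snk",
        # Source type: follow the webserver log as new entries are appended.
        f"{agent}.sources.src.type = exec",
        f"{agent}.sources.src.command = tail -F {log_path}",
        f"{agent}.sources.src.channels = ch",
        # Channel type and size: events buffered in memory between source and sink.
        f"{agent}.channels.ch.type = memory",
        f"{agent}.channels.ch.capacity = {capacity}",
        # Sink: forward events over TCP to the agent on the target server.
        f"{agent}.sinks.snk.type = avro",
        f"{agent}.sinks.snk.hostname = {target_host}",
        f"{agent}.sinks.snk.port = {port}",
        f"{agent}.sinks.snk.channel = ch",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical agent name, log path, host and port.
    print(flume_source_agent_conf("webagent", "/var/log/httpd/access_log",
                                  "bda-node1.example.com", 4545))
```

The target-side agent would mirror this with an avro source listening on the same port and an HDFS sink. On a Cloudera cluster those settings are entered through Cloudera Manager rather than a file.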

Page 45: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Also - Apache Kafka : Reliable, Message-Based

•Developed by LinkedIn, designed to address Flume issues around reliability, throughput ‣(though many of those issues have been addressed since)

•Designed for persistent messages as the common use case ‣Website messages, events etc vs. log file entries

•Consumer (pull) rather than Producer (push) model
•Supports multiple consumers per message queue
•More complex to set up than Flume, and can use Flume as a consumer of messages
‣But gaining popularity, especially alongside Spark Streaming
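
The pull model and the support for multiple consumers can be pictured with a toy model in plain Python. This is not the Kafka client API: `MessageLog`, `publish` and `poll` are hypothetical names invented for the sketch. It only shows the idea that messages persist in a log and each consumer reads at its own pace from its own offset.

```python
from typing import Dict, List


class MessageLog:
    """Toy append-only message log; each consumer pulls from its own offset."""

    def __init__(self) -> None:
        self._messages: List[str] = []
        self._offsets: Dict[str, int] = {}

    def publish(self, message: str) -> None:
        # Producers only append; they do not push to any consumer.
        self._messages.append(message)

    def poll(self, consumer: str, max_messages: int = 10) -> List[str]:
        # Each consumer asks for the next batch after its own last-read position.
        start = self._offsets.get(consumer, 0)
        batch = self._messages[start:start + max_messages]
        self._offsets[consumer] = start + len(batch)
        return batch


if __name__ == "__main__":
    log = MessageLog()
    for event in ("page_view", "comment", "tweet"):
        log.publish(event)
    print(log.poll("spark_streaming", max_messages=2))  # first two messages
    print(log.poll("flume"))                            # all three messages
    print(log.poll("spark_streaming"))                  # the remaining message
```

Because reading a message does not remove it, a slow consumer does not hold up a fast one, which is one reason this model suits several downstream systems sharing a single feed.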

Page 46: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Starting Flume Agents, Check Files Landing in HDFS Directory

•Start the Flume agents on source and target (BDA) servers
•Check that incoming file data starts appearing in HDFS
‣Note - files will be continuously written to as entries are added to source log files
‣Channel size for source, target agents determines max no. of events buffered
‣If buffer exceeded, new events dropped until buffer < channel size
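
The buffering behaviour described here can be pictured with a small Python model. It is a toy illustration of a fixed-size buffer that rejects new events while full, not Flume's implementation. `BoundedChannel` and its methods are hypothetical names.

```python
from collections import deque


class BoundedChannel:
    """Toy fixed-capacity buffer that drops new events while it is full."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.events = deque()
        self.dropped = 0

    def put(self, event: str) -> bool:
        # A full buffer rejects the event; it is counted as dropped.
        if len(self.events) >= self.capacity:
            self.dropped += 1
            return False
        self.events.append(event)
        return True

    def take(self):
        # The sink side drains events in arrival order.
        return self.events.popleft() if self.events else None


if __name__ == "__main__":
    channel = BoundedChannel(capacity=2)
    for line in ("entry 1", "entry 2", "entry 3"):
        channel.put(line)
    print(channel.dropped)           # one event was rejected
    channel.take()                   # sink drains one event
    print(channel.put("entry 4"))    # space is available again
```

Sizing the channel is therefore a trade-off between memory use on the agent and tolerance for bursts when the sink is slower than the source.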

Page 47: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Adding Social Media Datasources to the Hadoop Dataset

•The log activity from the Rittman Mead website tells us what happened, but not “why”
•Common customer requirement now is to get a “360 degree view” of their activity
‣Understand what’s being said about them
‣External drivers for interest, activity
‣Understand more about customer intent, opinions
•One example is to add details of social media mentions, likes, tweets and retweets etc to the transactional dataset
‣Correlate Twitter activity with sales increases, drops
‣Measure impact of social media strategy
‣Gather and include textual, sentiment, contextual data from surveys, media etc

Page 48: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Accessing the Twitter “Firehose”

•Twitter provides an API for developers to use to consume the Twitter “firehose”

•Can specify keywords to limit the tweets consumed

•Free service, but some limitations on actions (number of requests etc)

• Install additional Flume source JAR (pre-built available, but best to compile from source) ‣https://github.com/cloudera/cdh-twitter-example

•Specify Twitter developer API key and keyword filters in the Flume conf settings

Page 49: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Making the Webserver Log Data Available to ODI

•Flume log data from webserver arrives as files in HDFS •Can either be accessed in that form by ODI, or presented as a Hive table to ODI using SerDe ‣Both are fine, but creating the Hive table in advance makes ODI developer job simpler

Page 50: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Creating a Hive Table over the Log Data, using SerDe

•Hive works by defining a table structure over data in HDFS, typically plain text with a delimiter
•But it can make use of SerDes (serializer-deserializers) to parse other formats
•Takes semi-structured data (Apache Combined Log Format) and turns it into structured (Hive) data
‣Can also use IKM File to Hive with the same SerDe definition, to do this within ODI

CREATE EXTERNAL TABLE apachelog_parsed (
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?"
)
STORED AS TEXTFILE
LOCATION '/user/flume/rm_website_logs';
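To see what the regular expression in the table definition extracts, the sketch below applies the same pattern to one line in Python. It is only an illustration of the parsing step, not how Hive runs the SerDe; the sample log line, the `parse_line` helper and the choice to return `None` for a non-matching line are assumptions made for this example.

```python
import re

# Same pattern as the "input.regex" property, with the HiveQL string escaping removed.
APACHE_COMBINED = re.compile(
    r'([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) ([^ "]*|"[^"]*") '
    r'(-|[0-9]*) (-|[0-9]*)(?: ([^ "]*|".*") ([^ "]*|".*"))?'
)

COLUMNS = ["host", "identity", "user", "time", "request",
           "status", "size", "referer", "agent"]


def parse_line(line):
    """Return a column -> value dict, or None if the whole line does not match."""
    match = APACHE_COMBINED.fullmatch(line)
    return dict(zip(COLUMNS, match.groups())) if match else None


# Made-up sample line in Apache Combined Log Format
sample = ('192.0.2.10 - - [25/Jul/2015:10:15:32 +0000] '
          '"GET /2015/07/some-post/ HTTP/1.1" 200 5120 '
          '"http://www.example.com/" "Mozilla/5.0"')
row = parse_line(sample)
```

Note that the quoted fields (request, referer, agent) keep their surrounding double quotes, and every column comes back as a string, which is why the table declares all columns as STRING.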

Page 51: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Copying SerDe JAR Files to Hadoop Lib Directory

•Make sure any SerDe files for parsing Hive table data are copied to Hadoop lib directory •Do this for all Hadoop nodes in the cluster

sudo cp /usr/lib/hive/lib/hive-contrib-0.13.1-cdh5.3.0.jar /usr/lib/hadoop/lib

Page 52: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Making Twitter Data Available to ODI

•Simplest approach again is to define a Hive table over the Twitter data •Arrives in files via Flume agent, but in JSON format •Potentially contains more fields than we are interested in - and in JSON format •Can address in ODI data load, but simpler to parse and select elements of interest beforehand

Page 53: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Two-Stage Hive Table Creation using JSON SerDe

•Initial table uses JSON SerDe to parse all Twitter JSON documents in HDFS directory
•Clone + build from https://github.com/cloudera/cdh-twitter-example/tree/master/hive-serdes

CREATE EXTERNAL TABLE `tweets`(
  `id` bigint COMMENT 'from deserializer',
  `created_at` string COMMENT 'from deserializer',
  `source` string COMMENT 'from deserializer',
  `favorited` boolean COMMENT 'from deserializer',
  `retweeted_status` struct<text:string,user:struct<screen_name:string,name:string>,
    retweet_count:int> COMMENT 'from deserializer',
  `entities` struct<urls:array<struct<expanded_url:string>>,
    user_mentions:array<struct<screen_name:string,name:string>>,
    hashtags:array<struct<text:string>>> COMMENT 'from deserializer',
  `text` string COMMENT 'from deserializer',
  `user` struct<screen_name:string,name:string,friends_count:int,followers_count:int,
    statuses_count:int,verified:boolean,utc_offset:int,time_zone:string> COMMENT 'from deserializer',
  `in_reply_to_screen_name` string COMMENT 'from deserializer')
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'hdfs://bigdatalite.rittmandev.com:8020/user/oracle/data/tweets';

Page 54: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Two-Stage Hive Table Creation using JSON SerDe

•Second table extracts the individual fields from STRUCT datatypes in first table
‣Could be done through a view, but Big Data Discovery doesn’t support them yet

CREATE TABLE `tweets_expanded` AS
select
  `tweets`.`id`,
  `tweets`.`created_at`,
  `tweets`.`user`.screen_name as `user_screen_name`,
  `tweets`.`user`.friends_count as `user_friends_count`,
  `tweets`.`user`.followers_count as `user_followers_count`,
  `tweets`.`user`.statuses_count as `user_tweets_count`,
  `tweets`.`text`,
  `tweets`.`in_reply_to_screen_name`,
  `tweets`.`favorited`,
  `tweets`.`retweeted_status`.user.screen_name as `retweet_user_screen_name`,
  `tweets`.`retweeted_status`.retweet_count as `retweet_count`,
  `tweets`.`entities`.urls[0].expanded_url as `url1`,
  `tweets`.`entities`.urls[1].expanded_url as `url2`,
  `tweets`.`entities`.hashtags[0].text as `hashtag1`,
  `tweets`.`entities`.hashtags[1].text as `hashtag2`,
  `tweets`.`entities`.hashtags[2].text as `hashtag3`,
  `tweets`.`entities`.hashtags[3].text as `hashtag4`,
  `tweets`.`entities`.user_mentions[0].screen_name as `user_mentions_screen_name1`,
  `tweets`.`entities`.user_mentions[1].screen_name as `user_mentions_screen_name2`,
  `tweets`.`entities`.user_mentions[2].screen_name as `user_mentions_screen_name3`,
  `tweets`.`entities`.user_mentions[3].screen_name as `user_mentions_screen_name4`,
  `tweets`.`entities`.user_mentions[4].screen_name as `user_mentions_screen_name5`
from `tweets`;
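The flattening done by the second table can be sketched in Python for a single tweet document. This is an illustration only: the helper names `element` and `expand_tweet` and the sample tweet are made up, only a subset of the columns is shown, and returning `None` for a missing array position mirrors the NULL one would expect from Hive for an index past the end of the array.

```python
import json


def element(items, index, key):
    """Return items[index][key], or None when the array is shorter than index + 1."""
    if items is None or index >= len(items):
        return None
    return items[index].get(key)


def expand_tweet(tweet):
    """Flatten a parsed tweet into scalar columns like those in tweets_expanded (subset)."""
    user = tweet.get("user") or {}
    entities = tweet.get("entities") or {}
    row = {
        "id": tweet.get("id"),
        "user_screen_name": user.get("screen_name"),
        "user_followers_count": user.get("followers_count"),
        "text": tweet.get("text"),
    }
    for i in range(2):   # fixed number of URL columns
        row[f"url{i + 1}"] = element(entities.get("urls"), i, "expanded_url")
    for i in range(4):   # fixed number of hashtag columns
        row[f"hashtag{i + 1}"] = element(entities.get("hashtags"), i, "text")
    return row


# Made-up tweet document with one URL and two hashtags
doc = json.loads("""
{"id": 1, "text": "New post on ODI",
 "user": {"screen_name": "example_user", "followers_count": 10},
 "entities": {"urls": [{"expanded_url": "http://www.example.com/2015/07/post/"}],
              "hashtags": [{"text": "odi"}, {"text": "hadoop"}]}}
""")
row = expand_tweet(doc)
```

The fixed column counts are the price of a flat table: a tweet with more URLs or hashtags than there are columns loses the extras.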

Page 55: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Configuring the ODI12c 12.1.3.0.1 Hadoop Datasource

•New feature in ODI 12.1.3.0.1 with Big Data Extensions
•Defines the physical server and Java library locations for other tools (Pig etc) to use
‣Namenode location
‣Working area in HDFS for ODI
‣Location on HDFS to store basic details of ODI installation / repo

Page 56: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Configuring the ODI12c 12.1.3.0.1 Hive Datasource

•Used for reverse-engineering Hive table structures from Hadoop •Uses JDBC connection, new WLS-derived driver •Need to also either install Hadoop/Hive client on ODI Studio workstation, or install ODI Agent on target Hadoop cluster to actually execute mappings ‣New option to use Oozie removes need for ODI Agent though

Page 57: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Import Hive Table Metadata into ODI Repository

•Connections to Hive, Hadoop (and Pig) set up earlier •Define physical and logical schemas, reverse-engineer the table definitions into repository ‣Can be temperamental with tables using non-standard SerDes; make sure JARs registered

Page 58: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Data Flow through the Hadoop + Exadata Data Reservoir

[Diagram labels: Flume; GG; Cloudera CDH5.3 BDA Hadoop Cluster (nodes each running Spark, Hive, HDFS); Big Data SQL; SQL for BDA Exec; Filtered & Projected Rows / Columns; Dim Attributes; Exadata; Exalytics; OBIEE; TimesTen; 12c In-Mem; stages: Ingest, Process, Publish]

Page 59: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Major ETL Steps

1. Join initial log data extract to additional reference data (already in Hive)
2. Supplement with additional Oracle RDBMS data (brought in via Sqoop)
3. Filter log data to leave just requests for blog pages
4. Take the Twitter data, and filter to just tweets referencing RM web pages
5. Join Twitter activity to page hits, to create aggregate for the two
6. Geocode page hits to determine country + city of visitor
7. Sessionize the log data for use with an R classification routine

Page 60: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

ETL Step 1 : Join Incoming Log Hive Table to Hive Ref Data

• IKM Hive Append can be used to perform Hive table joins, filtering, agg. etc. • INSERT only, no DELETE, UPDATE etc •Join to other Hive tables, or combine with Sqoop KMs etc to bring in Oracle data •Supports most ODI operators ‣Filter ‣Aggregate ‣Join (ANSI-style) ‣etc

Page 61: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

ETL Step 1 : Join Incoming Log Hive Table to Hive Ref Data

•ODI 12.1.3.0.1 replaces the previous template-style KMs (IKM Hive-to-Hive Control Append) with new component-style KMs ‣Makes it possible to mix-and-match sources ‣Enables logical mapping to generate Hive, Pig and Spark code

Page 62: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

ETL Step 1 : Join Incoming Log Hive Table to Hive Ref Data

•Executing mapping generates HiveQL code, executed through an ODI Agent (or Oozie) •Code runs on Hadoop cluster, compiling down to Java MapReduce code

Page 63: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

ETL Step 2 : Supplement with Oracle Reference Data

•In this step, the log data will be supplemented with additional reference data in Oracle
•Uses Sqoop (LKM SQL to Hive Sqoop) to extract Oracle data into Hive staging table
•Join temporary Hive table to the main log Hive table
‣Logical mapping just references the Oracle source table, no need for mapping designer to consider Sqoop

Page 64: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

ETL Step 2 : Supplement with Oracle Reference Data

•Mapping physical details specify Sqoop KM for extract (LKM SQL to Hive Sqoop) • IKM Hive Append used for join and load into Hive target

Page 65: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

ETL Step 2 : Supplement with Oracle Reference Data

•Mapping execution then runs in three stages:
‣Create temporary Hive table for staging data
‣Generate and run Sqoop job to export reference data out of Oracle RDBMS
‣Join incoming reference Hive table to log data Hive table

Page 66: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Alternative to Batch Replication using Sqoop : GoldenGate

•Oracle GoldenGate 12c for Big Data can replicate database transactions into Hadoop
•Load directly into Hive / HDFS, or feed transactions into Apache Flume as Flume events
•Provides a way to replicate Oracle + other RDBMS data into the data reservoir
‣Works with Flume to provide a single streaming route into the data reservoir

Page 67: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Enabling Oracle Database 12c for GoldenGate Replication

•Oracle GoldenGate 11gR2 for Oracle Database introduced Integrated Capture Mode ‣Integrated with database, just enable with alter system set enable_goldengate_replication=true ‣Required for Oracle Database 12c container databases (as found on Big Data Lite 4.1 VM)

Page 68: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Oracle RDBMS to Hive via Flume Configuration Steps

1. Configure the source database for ARCHIVELOG mode, integrated capture and supplemental logging
2. Create data source definition file to specify the database schema / tables to replicate
3. Set up the database capture (extract) process to write transactions to the trail file
4. Configure the GoldenGate Flume adapter to send transactions written to the trail file to a Flume Adapter, via Avro RPC messages
5. Set up and configure a Flume Adapter to receive those messages, and write them in Hive data storage format to HDFS for the target Hive table

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING
EXTRACT     RUNNING     FLUME       00:00:00      00:00:02
EXTRACT     RUNNING     ORAEXT      00:00:10      00:00:02

select CONCAT('Rows loaded from gg_Test.logs into HDFS via Flume: ', count(*)) from gg_test.logs;
…
Rows loaded from gg_Test.logs into HDFS via Flume: 100

sqlplus gg_test/welcome1@orcl
begin P_GENERATE_LOGS(100); end;
/

Page 69: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

ETL Step 3 : Filter Log Data to Retain Just Blog Page Views

•Same approach as with first mapping, Hive source to Hive target •Uses Filter operator to add WHERE clause to HiveQL SELECT statement

Page 70: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

ETL Step 4 : Filter Tweets to Just Leave RM Blog References

•Same process as previous step; extract from Hive source, filter, load into Hive target •Filter on two URL columns as tweet can contain multiple URL references ‣Two picked as arbitrary limit to URL extraction
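The two-column filter can be sketched as below. The slide does not show the actual filter expression, so the substring match on the domain, the `references_site` helper and the sample rows are assumptions made for illustration.

```python
def references_site(row, domain):
    """True when either extracted URL column mentions the given domain."""
    return any(domain in (row.get(col) or "") for col in ("url1", "url2"))


# Made-up rows shaped like the flattened tweets table
tweets = [
    {"id": 1, "url1": "http://www.rittmanmead.com/2015/05/some-post/", "url2": None},
    {"id": 2, "url1": "http://www.example.com/", "url2": None},
    {"id": 3, "url1": "http://www.example.com/", "url2": "http://www.rittmanmead.com/blog/"},
    {"id": 4, "url1": None, "url2": None},
]
matching_ids = [t["id"] for t in tweets if references_site(t, "rittmanmead.com")]
```

Tweet 3 shows why both columns are checked: the site reference is in the second URL only.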

Page 71: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Mapping Variant : Generate as Pig Latin vs. HiveQL

•ODI 12.1.3.0.1 comes with the ability to generate Pig Latin as well as HiveQL
•Alternative to Hive, defines data manipulation as dataflow steps (like an execution plan)
•Start with one or more data sources, add steps to apply filters, group, project columns
•Generates MapReduce to execute data flow, similar to Hive; extensible through UDFs

a = load '/user/oracle/pig_demo/marriott_wifi.txt';
b = foreach a generate flatten(TOKENIZE((chararray)$0)) as word;
c = group b by word;
d = foreach c generate COUNT(b), group;
store d into '/user/oracle/pig_demo/pig_wordcount';

[oracle@bigdatalite ~]$ hadoop fs -ls /user/oracle/pig_demo/pig_wordcount
Found 2 items
-rw-r--r--   1 oracle oracle          0 2014-10-11 11:48 /user/oracle/pig_demo/pig_wordcount/_SUCCESS
-rw-r--r--   1 oracle oracle       1965 2014-10-11 11:48 /user/oracle/pig_demo/pig_wordcount/part-r-00000
[oracle@bigdatalite ~]$ hadoop fs -cat /user/oracle/pig_demo/pig_wordcount/part-r-00000
2 .
1 I
6 a
...
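For readers less familiar with Pig Latin, the same load, tokenize, group and count dataflow can be sketched in plain Python. This is a rough single-machine equivalent, not generated code: the `word_count` helper and the input lines are made up, and splitting on whitespace is a simplification of what Pig's TOKENIZE does.

```python
from collections import Counter


def word_count(lines):
    """Load -> tokenize -> group -> count, mirroring aliases a, b, c and d."""
    words = (word for line in lines for word in line.split())   # b: flatten tokens
    return Counter(words)                                       # c + d: group and count


counts = word_count(["a quick test", "a second test ."])
```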

Page 72: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Configuring the ODI12c 12.1.3.0.1 Pig Datasource

•A way of linking a Pig execution environment to a previously-defined Hadoop DS •Also gives ability to define additional JARs to use with Pig - DataFu, Piggybank etc •Can be defined as either Local (running Pig code on workstation) or MapReduce

Page 73: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Configuring a Mapping for Pig Latin Code Generation

•On the logical mapping, set the Staging Location Hint to the Pig logical schema •For the mapping operators, set the Execute on Hint to Staging

Set as property for whole mapping

Page 74: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Creating a Physical Mapping Configured for Pig Latin

•Create additional deployment specification for Pig physical mapping •Mapping operators will use Pig component KMs •Set KM for target table or file to <Default> (from original IKM Hive Append)

Page 75: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Executing a Pig Latin Mapping

•Can either run in Local, or MapReduce mode ‣Local usually faster for unit testing, MapReduce runs on full Hadoop cluster

Page 76: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

ETL Step 5 : Join Tweets to Log Entries, Aggregate

•Simple join between two Hive tables, after aggregating their contents ‣Previous transformations in earlier mappings standardised the URL format

•Add page view and tweet totals to list of blog pages accessed
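The aggregate-then-join pattern in this step can be sketched as follows. The sample URLs are made up, and keeping every viewed page even when it has no tweets (an outer-style join) is an assumption, since the slide does not say which join type the mapping uses.

```python
from collections import Counter

# Made-up rows standing in for the filtered Hive tables, URLs already standardised
page_views = ["/2015/05/post-a/", "/2015/05/post-a/", "/2015/06/post-b/"]
tweet_urls = ["/2015/05/post-a/", "/2015/06/post-b/", "/2015/06/post-b/", "/2015/07/post-c/"]

views_by_page = Counter(page_views)     # aggregate page views per URL
tweets_by_page = Counter(tweet_urls)    # aggregate tweets per URL

# Join the two aggregates on the URL, keeping every page that was viewed
summary = [
    {"page": page, "page_views": views, "tweets": tweets_by_page.get(page, 0)}
    for page, views in sorted(views_by_page.items())
]
```

Aggregating before joining keeps the join small: one row per page on each side rather than one row per hit or tweet.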

Page 77: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

ETL Step 6 : Geocode Log Entries using IP Address

•Another requirement we have is to “geocode” the webserver log entries •Based on the fact that IP ranges can usually be attributed to specific countries •Not functionality normally found in Hive etc, but can be done with add-on APIs •Approach used by Google Analytics etc to show where visitors are located

Page 78: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

How GeoIP Geocoding Works

•Uses free Geocoding API and database from Maxmind •Convert IP address to an integer •Find which integer range our IP address sits within •But Hive can’t use BETWEEN in a join…

•Solution : Expose PAGEVIEWS Hive table using Big Data SQL, then join to lookup table in Oracle database
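The integer-range lookup at the heart of GeoIP geocoding can be sketched in Python. The ranges and country names below are made up for illustration (real ranges would come from the MaxMind database), and `ip_to_int` and `lookup_country` are hypothetical helper names.

```python
import bisect
import ipaddress

# Made-up lookup rows: (range_start, range_end, country), sorted by range_start.
RANGES = [
    (int(ipaddress.IPv4Address("192.0.2.0")),
     int(ipaddress.IPv4Address("192.0.2.255")), "Country A"),
    (int(ipaddress.IPv4Address("198.51.100.0")),
     int(ipaddress.IPv4Address("198.51.100.255")), "Country B"),
]
STARTS = [start for start, _, _ in RANGES]


def ip_to_int(ip):
    """Convert a dotted-quad IPv4 address to its 32-bit integer value."""
    return int(ipaddress.IPv4Address(ip))


def lookup_country(ip):
    """Find the range containing the address (the BETWEEN test), or None."""
    value = ip_to_int(ip)
    pos = bisect.bisect_right(STARTS, value) - 1   # last range starting at or before value
    if pos >= 0 and RANGES[pos][0] <= value <= RANGES[pos][1]:
        return RANGES[pos][2]
    return None
```

The `start <= value <= end` comparison is the non-equality join condition that motivates pushing this step into Oracle SQL, where a BETWEEN join is straightforward.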

Page 79: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Oracle Big Data SQL and Data Integration

•Gives us the ability to easily bring in Hadoop (Hive) data into Oracle-based mappings •Allows us to create Hive-based mappings that use Oracle SQL for transforms, joins •Faster access to Hive data for real-time ETL scenarios •Through Hive, bring NoSQL and semi-structured data access to Oracle ETL projects •For our scenario - join weblog + customer data in Oracle RDBMS, no need to stage in Hive

Page 80: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Using Big Data SQL in an ODI12c Mapping

•By default, Hive table has to be exposed as an ORACLE_HIVE external table in Oracle first •Then register that Oracle external table in ODI repository + model

[Screenshot callouts: External table creation in Oracle; Logical Mapping using just Oracle tables; Register in ODI Model]

Page 81: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

New KM : LKM Hive to Oracle (Big Data SQL)

•New KM works in a similar way to the Sqoop KM : creates a temporary ORACLE_HIVE table to expose Hive data in the Oracle environment
‣Allows Hive+Oracle joins by auto-creating ORACLE_HIVE exttab definition to enable Big Data SQL Hive table access

Page 82: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

ODI12c Mapping Creates Temp Exttab, Joins to Oracle

[Screenshot callouts]
•Register in ODI Model
•Hive table AP uses LKM Hive to Oracle (Big Data SQL)
•IKM Oracle Insert
•Big Data SQL Hive External Table created as temp object
•Main integration SQL routine uses regular Oracle SQL join (including use of advanced SQL functions, e.g. REGEXP_SUBSTR)

Page 83: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

ETL Step 7 : Sessionize Log Data, for R Classification Model

•Discovery Lab part of the masterclass created a classification model using R

•Used as input a sessionized version of the log activity, grouping page views within 60s

•Sessionization routine was written as Pig script, using DataFu and Piggybank UDFs ‣DataFu is a library of Pig functions initially developed by LinkedIn, now an Apache project ‣Piggybank is a community-created library of Pig UDFs and store/load routines

•So why was Pig used for this sessionization task?

Page 84: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Apache Pig Characteristics vs. Hive

•Ability to load data into a defined schema, or use schema-less (access fields by position)
•Fields can contain nested fields (tuples)
•Grouping records on a key doesn’t aggregate them, it creates a nested set of rows in a column
•Uses “lazy execution” - only evaluates data flow once final output has been requested
•Makes Pig an excellent language for interactive data exploration
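Two of these characteristics, grouping without aggregating and lazy execution, can be illustrated with a loose Python analogy. This is not how Pig is implemented; the sample rows are made up, and a Python generator only stands in for the idea that defining a step does no work until output is requested.

```python
from itertools import groupby
from operator import itemgetter

rows = [
    ("/2015/05/post-a/", "192.0.2.1"),
    ("/2015/05/post-a/", "192.0.2.2"),
    ("/2015/06/post-b/", "192.0.2.1"),
]

# Pig-style GROUP: each key keeps a nested collection of its original rows
grouped = {
    key: list(members)
    for key, members in groupby(sorted(rows, key=itemgetter(0)), key=itemgetter(0))
}

# Aggregation is a separate, later step applied to each nested collection
hits = {key: len(members) for key, members in grouped.items()}

# Lazy evaluation: defining the generator does no work until output is requested
evaluated = []
pipeline = (evaluated.append(r) or r for r in rows)
before = len(evaluated)        # nothing has run yet
first = next(pipeline)         # requesting output triggers evaluation
after = len(evaluated)
```

Because the grouped rows stay available, later steps can order or sessionize them per key, which a SQL GROUP BY would have collapsed away.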

Page 85: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Pig Data Processing Example : Count Page Request Totals

raw_logs = LOAD '/user/oracle/rm_logs/' USING TextLoader AS (line:chararray);
logs_base = FOREACH raw_logs GENERATE FLATTEN (
    REGEX_EXTRACT_ALL (
      line,
      '^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"'
    )
  ) AS (
    remoteAddr: chararray, remoteLogname: chararray, user: chararray,
    time: chararray, request: chararray, status: chararray,
    bytes_string: chararray, referrer: chararray, browser: chararray
  );
page_requests = FOREACH logs_base GENERATE
    SUBSTRING(time,3,6) as month,
    FLATTEN(STRSPLIT(request,' ',5)) AS (method:chararray, request_page:chararray, protocol:chararray);
page_requests_short = FOREACH page_requests GENERATE $0,$2;
page_requests_short_filtered = FILTER page_requests_short BY
    (request_page is not null AND SUBSTRING(request_page,0,3) == '/20');
page_request_group = GROUP page_requests_short_filtered BY request_page;
page_request_group_count = FOREACH page_request_group GENERATE
    $0, COUNT(page_requests_short_filtered) as total_hits;
page_request_group_count_sorted = ORDER page_request_group_count BY $1 DESC;
page_request_group_count_limited = LIMIT page_request_group_count_sorted 10;

Page 86: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Pig Data Processing Example : Join to Post Titles, Authors

•Pig allows aliases (datasets) to be joined to each other
•Example below adds details of post names, authors; outputs top pages dataset to file

raw_posts = LOAD '/user/oracle/pig_demo/posts_for_pig.csv' USING TextLoader AS (line:chararray);
posts_line = FOREACH raw_posts GENERATE FLATTEN (
    STRSPLIT(line,';',10)
  ) AS (
    post_id: chararray, title: chararray, post_date: chararray, type: chararray,
    author: chararray, post_name: chararray, url_generated: chararray
  );
posts_and_authors = FOREACH posts_line GENERATE
    title, author, post_name,
    CONCAT(REPLACE(url_generated,'"',''),'/') AS (url_generated:chararray);
pages_and_authors_join = JOIN posts_and_authors BY url_generated,
    page_request_group_count_limited BY group;
pages_and_authors = FOREACH pages_and_authors_join GENERATE
    url_generated, post_name, author, total_hits;
top_pages_and_authors = ORDER pages_and_authors BY total_hits DESC;
STORE top_pages_and_authors into '/user/oracle/pig_demo/top-pages-and-authors.csv' USING PigStorage(',');

Page 87: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Pig Extensibility through UDFs and Streaming

•Similar to Apache Hive, Pig can be programmatically extended through UDFs
•Example below uses a function defined in a Python script to geocode IP addresses

#!/usr/bin/python
import sys
sys.path.append('/usr/lib/python2.6/site-packages/')
import pygeoip

@outputSchema("country:chararray")
def getCountry(ip):
    gi = pygeoip.GeoIP('/home/nelio/GeoIP.dat')
    country = gi.country_name_by_addr(ip)
    return country

register 'python_geoip.py' using jython as pythonGeoIP;
raw_logs = LOAD '/user/root/logs/' USING TextLoader AS (line:chararray);
logs_base = FOREACH raw_logs GENERATE FLATTEN (
    REGEX_EXTRACT_ALL (
      line,
      '^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"'
    )
  ) AS (
    remoteAddr: chararray, remoteLogname: chararray, user: chararray,
    time: chararray, request: chararray, status: int,
    bytes_string: chararray, referrer: chararray, browser: chararray
  );
ipaddress = FOREACH logs_base GENERATE remoteAddr;
clean_ip = FILTER ipaddress BY (remoteAddr matches '^.*?((?:\\d{1,3}\\.){3}\\d{1,3}).*?$');
country_by_ip = FOREACH clean_ip GENERATE pythonGeoIP.getCountry(remoteAddr);

Page 88: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Pig Sessionization Script used in Discovery Lab

register /opt/cloudera/parcels/CDH/lib/pig/datafu.jar;
register /opt/cloudera/parcels/CDH/lib/pig/piggybank.jar;

DEFINE Sessionize datafu.pig.sessions.Sessionize('60m');
DEFINE Median datafu.pig.stats.StreamingMedian();
DEFINE Quantile datafu.pig.stats.StreamingQuantile('0.9','0.95');
DEFINE VAR datafu.pig.VAR();
DEFINE CustomFormatToISO org.apache.pig.piggybank.evaluation.datetime.convert.CustomFormatToISO();
DEFINE ISOToUnix org.apache.pig.piggybank.evaluation.datetime.convert.ISOToUnix();

--------------------------------------------------------------------------------
-- Import and clean logs
raw_logs = LOAD '/user/flume/rm_logs/apache_access_combined' USING TextLoader AS (line:chararray);

-- Extract individual fields
logs_base = FOREACH raw_logs GENERATE
    FLATTEN (REGEX_EXTRACT_ALL(line,'^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"'))
    AS (remoteAddr: chararray, remoteLogName: chararray, user: chararray, time: chararray,
        request: chararray, status: chararray, bytes_string: chararray, referrer: chararray,
        browser: chararray);

-- Remove bots
logs_base_nobots = FILTER logs_base BY NOT (browser matches '.*(spider|robot|bot|slurp|Bot|monitis|Baiduspider|AhrefsBot|EasouSpider|HTTrack|Uptime|FeedFetcher|dummy).*');

-- Remove useless columns and convert timestamp
clean_logs = FOREACH logs_base_nobots GENERATE
    CustomFormatToISO(time,'dd/MMM/yyyy:HH:mm:ss Z') as time,
    remoteAddr, request, status, bytes_string, referrer, browser;

--------------------------------------------------------------------------------
-- Sessionize the data
clean_logs_sessionized = FOREACH (GROUP clean_logs BY remoteAddr) {
    ordered = ORDER clean_logs BY time;
    GENERATE FLATTEN(Sessionize(ordered))
        AS (time, remoteAddr, request, status, bytes_string, referrer, browser, sessionId);
};

-- The following steps will generate a tsv file in your home directory to download and work with in R
store clean_logs_sessionized into '/user/jmeyer/clean_logs' using PigStorage('\t','-schema');
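The sessionization logic in the script above (group by remote address, order by time, start a new session after a period of inactivity) can be sketched in Python. This is an illustration of the idea rather than DataFu's implementation: the `sessionize` function, the session id format and the sample events are made up, the 60-minute default mirrors the '60m' argument in the script, and the exact boundary rule DataFu applies at the threshold is not confirmed here.

```python
from datetime import datetime, timedelta
from itertools import groupby
from operator import itemgetter


def sessionize(events, gap=timedelta(minutes=60)):
    """Append a session id to (ip, timestamp, request) tuples.

    Within each ip, events are ordered by time; a new session starts whenever
    the time since the previous event exceeds `gap`.
    """
    result = []
    ordered = sorted(events, key=itemgetter(0, 1))
    for ip, group in groupby(ordered, key=itemgetter(0)):
        session_no, previous = 0, None
        for _, ts, request in group:
            if previous is not None and ts - previous > gap:
                session_no += 1
            result.append((ip, ts, request, f"{ip}-{session_no}"))
            previous = ts
    return result


# Made-up page views
events = [
    ("192.0.2.1", datetime(2015, 5, 1, 9, 0), "/a"),
    ("192.0.2.1", datetime(2015, 5, 1, 9, 30), "/b"),
    ("192.0.2.1", datetime(2015, 5, 1, 11, 0), "/c"),
    ("192.0.2.2", datetime(2015, 5, 1, 9, 5), "/a"),
]
sessions = [row[3] for row in sessionize(events)]
```

The third view from 192.0.2.1 arrives 90 minutes after the second, so it opens a second session for that address.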

Page 89: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Converting the Pig Script to an ODI Mapping

• Not an obvious translation - Pig data flows don’t map 1:1 with Hive set-based transformations
  ‣ Pig aliases use lazy execution: intermediate results aren’t materialised as Hive tables
  ‣ Some concepts - GENERATE FLATTEN etc - don’t translate to SQL expressions
  ‣ DataFu and Piggybank UDFs don’t have equivalent Hive versions

clean_logs_sessionized = FOREACH (GROUP clean_logs BY remoteAddr) {
  ordered = ORDER clean_logs BY time;
  GENERATE FLATTEN(Sessionize(ordered))
    AS (time, remoteAddr, request, status, bytes_string, referrer, browser, sessionId);
};

vs.

select sum(f.flights)
from flight_performance f
join origin o on (f.origin = o.origin)
where o.origin = 'SFO';

Page 90: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

ODI 12.1.3.0.1 Logical Mapping for Log Sessionization

• Expression operator used instead of Hive table target; generated as ALIAS when deployed as Pig Latin mapping
• Table Function operator used to generate another ALIAS by running input attributes through arbitrary Pig Latin script
• Only data materialised is in Hive table, at end of dataflow

Page 91: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Expression Mapping Operator Used to Create Next Alias

• Using Expression rather than datastore operator creates transformation “in-line”
• With Pig execution, generates expression as ALIAS
• Allows use of expressions (e.g. CustomFormatToISO Piggybank UDF)
• Filters etc included in ALIAS definition

Page 92: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Table Function Operator used for Executing Pig Commands

• Table function operator processes input attributes through arbitrary script
• In Pig mappings, allows use of more complex Pig transformations
  ‣ GENERATE FLATTEN, use of DataFu Sessionize UDF
• Final ALIAS defined within Pig Latin script has to match name of Table Function operator

Page 93: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Pig Latin Generated Script for Sessionization Task

• Creates single dataflow using series of ALIASes
• Includes Pig Latin commands added through Table Function
• Matches logic and approach of original hand-coded Pig script, but now managed within ODI

Page 94: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Create ODI Package for Processing Steps, and Execute

• Create ODI Package or Load Plan to run steps in sequence
  ‣ With load plan, can also add exceptions and recoverability
• Execute package to load data into final Hive tables

Page 95: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Summary : Data Processing Phase

• We’ve now processed the incoming data, filtering it and transforming to required state
• Joined (“mashed-up”) datasets from website activity, and social media mentions
• Ingestion and the load/processing stages are now complete
• Now we want to make the Hadoop output available to a wider, non-technical audience…

Page 96: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Part 3 Reporting and Dashboards across the Data Reservoir using Oracle Big Data SQL + OBIEE

Page 97: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Options for Sharing Data Reservoir Data with Users

• Several options for reporting on the content in the data reservoir and DW
  ‣ Using a reporting & dashboarding tool compatible with Hive + DW, e.g. OBIEE11g
  ‣ Using a search/data discovery tool, for example Big Data Discovery
  ‣ Export Hadoop/Hive data into Oracle and report from there

[Diagram: the Oracle IM + Big Data Reference Architecture from Part 1 (Event Engine, Data Reservoir, Data Factory, Enterprise Information Store, Reporting, Discovery Lab)]

Page 98: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Alternative to Reporting Against Hadoop : Export to Data Mart

• In most cases, for general reporting access, exporting into RDBMS makes sense
• Export Hive data from Hadoop into Oracle Data Mart or Data Warehouse
• Use Oracle RDBMS for high-value data analysis, full access to RDBMS optimisations
• Potentially use Exalytics for in-memory RDBMS access

[Diagram: Loading Stage (Real-Time Logs / Events, RDBMS Imports, File / Unstructured Imports), Processing Stage, Store / Export Stage (RDBMS Exports, File Exports)]

Page 99: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Using the Right Server for the Right Job

• Hadoop for large scale, high-speed data ingestion and processing
• Oracle RDBMS and Exadata for long-term storage of high-value data
• Oracle Exalytics for speed-of-thought analytics in TimesTen and Oracle Essbase

Page 100: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Oracle Business Intelligence and Big Data Sources

• OBIEE 11g from 11.1.1.7 can connect to Hadoop sources
  ‣ OBIEE 11.1.1.7+ supports Hive/Hadoop as a data source, via specific Hive ODBC drivers and Apache Hive Physical Layer database type
  ‣ But practically, it comes with limitations
  ‣ Current 11.1.1.7 version of OBIEE only ships with HiveServer1 ODBC drivers
  ‣ HiveQL is a limited subset of ISO/Oracle SQL
  ‣ … and Hive access is really slow

Page 101: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Configuring OBIEE for Hive Access

• As of OBIEE 11.1.1.7, access is through Oracle-supplied Data Direct Drivers
  ‣ Not compatible with HiveServer2 protocol used by CDH4+
  ‣ As workaround, use Windows version of OBIEE and Cloudera ODBC drivers
  ‣ OBIEE 11.1.1.9 will come with HiveServer2 drivers (hopefully)
• Need to configure on both server, and BI Administration workstation

Page 102: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Setting up the ODBC Connection to Hadoop Environment

• Example uses OBIEE 11.1.1.7 on Windows, to allow use of Cloudera Hive ODBC drivers (HiveServer2)
  ‣ Linux OBIEE 11g version only allows use of Oracle-supplied HiveServer1 drivers
• Install ODBC drivers, create system DSN
• Use username/password authentication, or Kerberos if required

Page 103: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Importing Hive Metadata

1. Use BI Administration tool, File > Import Metadata
2. Select DSN previously created for Hive datasource
3. Import table metadata from correct Hive database
4. Set Database Type to Apache Hadoop

Page 104: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Testing Hive Connection & Data Retrieval

• Confirm that Hive table data can be returned by the BI Administration tool
  ‣ Basic check before carrying on; should also check with the RPD online too (for BI Server)

Page 105: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Building an Initial Business Model from Hive Tables

• Main fact table is based on page requests (ACCESS_PER_POST)
• Pages dimension table (POSTS)
• Simple counts of pages viewed per author, post category etc

Page 106: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Federated Hive and Oracle Data via BI Server

• Oracle Database has a table containing HTTP status codes
• Import into RPD to include in business model

Page 107: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Join Hive Fact (Log) Data to Oracle Reference Data

• BI Server issues two separate queries; one to Hive, one to Oracle
• Returned datasets then joined (stitch-join) by BI Server and returned as single resultset
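The stitch-join described above can be pictured with a small Python sketch: two result sets arrive separately and are joined in memory on a shared key. This is only a conceptual illustration, not how the BI Server is implemented; the `stitch_join` function and the sample rows are hypothetical.

```python
def stitch_join(fact_rows, dim_rows, key):
    """Inner-join two already-fetched result sets in memory on `key`.

    fact_rows stands in for rows returned by the Hive query, dim_rows for
    rows returned by the Oracle query. Both are lists of dicts.
    """
    # Index the (small) reference result set by the join key.
    lookup = {row[key]: row for row in dim_rows}

    joined = []
    for fact in fact_rows:
        dim = lookup.get(fact[key])
        if dim is not None:  # inner join: drop facts with no match
            joined.append({**dim, **fact})
    return joined


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    hive_rows = [{"status": 200, "hits": 310}, {"status": 404, "hits": 33}]
    oracle_rows = [{"status": 200, "dsc": "OK"}, {"status": 404, "dsc": "Not Found"}]
    for row in stitch_join(hive_rows, oracle_rows, "status"):
        print(row)
```

The point of the sketch is that the join work happens in the middle tier after both sources have responded, so the overall response time is bounded by the slower source.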

Page 108: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

How Can This Be Improved On?

• Gives the ability to supplement Hadoop data with reference data from Oracle, Excel etc
• But response time is still quite slow
• What about faster versions of Hive - Cloudera Impala for example?
• Cloudera’s answer to Hive query response time issues
• MPP SQL query engine running on Hadoop, bypasses MapReduce for direct data access
• Mostly in-memory, but spills to disk if required
• Uses Hive metastore to access Hive table metadata
• Similar SQL dialect to Hive - not as rich though and no support for Hive SerDes, storage handlers etc

Page 109: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

How Impala Works

• A replacement for Hive, but uses Hive concepts and data dictionary (metastore)
• MPP (Massively Parallel Processing) query engine that runs within Hadoop
  ‣ Uses same file formats, security, resource management as Hadoop
• Processes queries in-memory
• Accesses standard HDFS file data
• Option to use Apache AVRO, RCFile, LZO or Parquet (column-store)
• Designed for interactive, real-time SQL-like access to Hadoop

[Diagram: BI Server and Presentation Svr, Cloudera Impala ODBC Driver, and a Multi-Node Hadoop Cluster in which each node runs Impala on Hadoop (HDFS etc)]

Page 110: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Enabling Hive Tables for Impala

• Log into Impala Shell, run INVALIDATE METADATA command to refresh Impala table list
• Run SHOW TABLES Impala SQL command to view tables available
• Run COUNT(*) on main ACCESS_PER_POST table to see typical response time

[oracle@bigdatalite ~]$ impala-shell
Starting Impala Shell without Kerberos authentication

[bigdatalite.localdomain:21000] > invalidate metadata;
Query: invalidate metadata

Fetched 0 row(s) in 2.18s
[bigdatalite.localdomain:21000] > show tables;
Query: show tables
+-----------------------------------+
| name                              |
+-----------------------------------+
| access_per_post                   |
| access_per_post_cat_author        |
| …                                 |
| posts                             |
+-----------------------------------+
Fetched 45 row(s) in 0.15s

[bigdatalite.localdomain:21000] > select count(*) from access_per_post;
Query: select count(*) from access_per_post
+----------+
| count(*) |
+----------+
| 343      |
+----------+
Fetched 1 row(s) in 2.76s

Page 111: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Setting up an ODBC Connection to Impala

• Download ODBC drivers for Impala from Cloudera Website
  ‣ Windows, Linux, Mac, AIX
• Create system DSN as normal, use port 21050
• Configure authentication
  ‣ For unsecured cluster, use “No Authentication”
  ‣ For secured, use Kerberos, etc
• Test datasource to check successful connectivity
• Complete on both Windows workstation, and server hosting BI Server component

Page 112: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Recreate Business Model, Re-run Basic Report

• Significant improvement over Hive response time
• Now makes Hadoop suitable for ad-hoc querying

Simple Two-Table Join against Hive Data Only
Logical Query Summary Stats: Elapsed time 50, Response time 49, Compilation time 0 (seconds)

vs

Simple Two-Table Join against Impala Data Only
Logical Query Summary Stats: Elapsed time 2, Response time 1, Compilation time 0 (seconds)

Page 113: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Re-Create Oracle Query Federation, and Retest

• Add Oracle HTTP Status table to business model sourced from Impala data
• Join HTTP Status table to Impala fact table in Physical layer
• Recreate query to compare response time to Hive + Oracle version

Federated Query joining Hive + Oracle Data
Logical Query Summary Stats: Elapsed time 102, Response time 102, Compilation time 0 (seconds)

vs

Federated Query joining Impala + Oracle Data
Logical Query Summary Stats: Elapsed time 1, Response time 1, Compilation time 0 (seconds)

Page 114: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Any Way We Can Improve This Further?

• If available, use Oracle Big Data SQL to query Hive data only, or federated Hive + Oracle
• Access Hive data through Big Data SQL SmartScan feature, for Exadata-type response time
• Use standard Oracle SQL across both Hive and Oracle data
• Also extends to data in Oracle NoSQL database

Page 115: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Oracle Big Data SQL

• Part of Oracle Big Data 4.0 (BDA-only)
  ‣ Also requires Oracle Database 12c, Oracle Exadata Database Machine
• Extends Oracle Data Dictionary to cover Hive
• Extends Oracle SQL and SmartScan to Hadoop
• Extends Oracle Security Model over Hadoop
  ‣ Fine-grained access control
  ‣ Data redaction, data masking
  ‣ Uses fast c-based readers where possible (vs. Hive MapReduce generation)
  ‣ Map Hadoop parallelism to Oracle PQ
  ‣ Big Data SQL engine works on top of YARN
  ‣ Like Spark, Tez, MR2

[Diagram labels: SQL Queries, Exadata Database Server, Oracle Big Data SQL, SmartScan on Exadata Storage Servers, SmartScan on Hadoop Cluster]

Page 116: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

View Hive Table Metadata in the Oracle Data Dictionary

• Oracle Database 12c 12.1.0.2.0 with Big Data SQL option can view Hive table metadata
  ‣ Linked by Exadata configuration steps to one or more BDA clusters
• DBA_HIVE_TABLES and USER_HIVE_TABLES expose Hive metadata
• Oracle SQL*Developer 4.0.3, with Cloudera Hive drivers, can connect to Hive metastore

SQL> col database_name for a30
SQL> col table_name for a30
SQL> select database_name, table_name
  2  from dba_hive_tables;

DATABASE_NAME                  TABLE_NAME
------------------------------ ------------------------------
default                        access_per_post
default                        access_per_post_categories
default                        access_per_post_full
default                        apachelog
default                        categories
default                        countries
default                        cust
default                        hive_raw_apache_access_log

Page 117: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Hive Access through Oracle External Tables + Hive Driver

• Big Data SQL accesses Hive tables through external table mechanism
  ‣ ORACLE_HIVE external table type imports Hive metastore metadata
  ‣ ORACLE_HDFS requires metadata to be specified
• Access parameters cluster and tablename specify Hive table source and BDA cluster

CREATE TABLE access_per_post_categories(
  hostname     varchar2(100),
  request_date varchar2(100),
  post_id      varchar2(10),
  title        varchar2(200),
  author       varchar2(100),
  category     varchar2(100),
  ip_integer   number)
organization external
  (type oracle_hive
   default directory default_dir
   access parameters(com.oracle.bigdata.tablename=default.access_per_post_categories));

Page 118: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Big Data SQL Server Dataflow

• Read data from HDFS Data Node
  ‣ Direct-path reads
  ‣ C-based readers when possible
  ‣ Use native Hadoop classes otherwise
• Translate bytes to Oracle
• Apply SmartScan to Oracle bytes
  ‣ Apply filters
  ‣ Project columns
  ‣ Parse JSON/XML
  ‣ Score models

[Diagram labels: Disks, Data Node, Big Data SQL Server, External Table Services, RecordReader, SerDe, Smart Scan; steps numbered 1 to 3]

Page 119: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Use Rich Oracle SQL Dialect over Hadoop (Hive) Data

• Ranking Functions
  ‣ rank, dense_rank, cume_dist, percent_rank, ntile
• Window Aggregate Functions
  ‣ Avg, sum, min, max, count, variance, first_value, last_value
• LAG/LEAD Functions
• Reporting Aggregate Functions
  ‣ Sum, Avg, ratio_to_report
• Statistical Aggregates
  ‣ Correlation, linear regression family, covariance
• Linear Regression
  ‣ Fitting of ordinary-least-squares regression line to set of number pairs
• Descriptive Statistics
• Correlations
  ‣ Pearson’s correlation coefficients
• Crosstabs
  ‣ Chi squared, phi coefficient
• Hypothesis Testing
  ‣ Student t-test, Binomial test
• Distribution
  ‣ Anderson-Darling test - etc.

Page 120: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Leverages Hive Metastore for Hadoop Java Access Classes

• As with other next-gen SQL access layers, uses common Hive metastore table metadata
• Provides route to underlying Hadoop data for Oracle Big Data SQL c-based SmartScan

Page 121: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Extending SmartScan, and Oracle SQL, Across All Data

• Brings query-offloading features of Exadata to Oracle Big Data Appliance
• Query across both Oracle and Hadoop sources
• Intelligent query optimisation applies SmartScan close to ALL data
• Use same SQL dialect across both sources
• Apply same security rules, policies, user access rights across both sources

Page 122: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Example Usage : Use Big Data SQL for Geocoding Exercise

• Earlier on we used ODI and Big Data SQL to join incoming log data to Geocoding table
• Big Data SQL used as it enabled Hive data to use BETWEEN join
• We will now reproduce using OBIEE environment
• Benefit is doing on the fly, outside of ETL

[Diagram: Hive Weblog Activity table, Oracle Geocoding lookup tables, Combined output in report form]

Page 123: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Create ORACLE_HIVE External Table over Hive Table

• Use the ORACLE_HIVE access driver type to create Oracle external table over Hive table
• ACCESS_PER_POST_EXTTAB and POSTS_EXTTAB now appear in Oracle data dictionary

Page 124: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Import Oracle Tables, Create RPD joining Tables Together

• No need to use Hive ODBC drivers - Oracle OCI connection instead
• No issue around HiveServer1 vs HiveServer2
• Big Data SQL handles authentication with Hadoop cluster in background, Kerberos etc
• Transparent to OBIEE - all appear as Oracle tables
• Join across schemas if required

Page 125: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Create Physical Data Model from Imported Table Metadata

•Join ORACLE_HIVE external tables to reference table from Oracle DB

Page 126: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Recreate Business Model, All Sourced From Oracle

• Map incoming physical tables into a star schema
• Add aggregation method for fact measures
• Add logical keys for logical dimension tables
• Remove columns from fact table that aren’t measures

Page 127: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Create Report against Oracle + Big Data SQL Tables

• BI Server thinks that all data is sourced from Oracle
• Uses full Oracle SQL features, guarantees all Oracle-sourced reports will work if DW data is offloaded to Hadoop (Hive)
• Fast access through SmartScan feature

WITH SAWITH0 AS (
  select count(T45134.TIME) as c1,
         T45146.POST_AUTHOR as c2,
         T44832.DSC as c3
  from BDA_OUTPUT.POSTS_EXTTAB T45146,
       BLOG_REFDATA.HTTP_STATUS_CODES T44832,
       BDA_OUTPUT.ACCESS_PER_POST_EXTTAB T45134
  where ( T44832.STATUS = T45134.STATUS and T45134.POST_ID = T45146.POST_ID )
  group by T44832.DSC, T45146.POST_AUTHOR)
select D1.c1 as c1, D1.c2 as c2, D1.c3 as c3, D1.c4 as c4
from ( select distinct 0 as c1, D1.c2 as c2, D1.c3 as c3, D1.c1 as c4
       from SAWITH0 D1
       order by c3, c2 ) D1
where rownum <= 65001

Page 129: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Uses Concept of Query Franchising vs Query Federation

• Oracle Database handles all queries for client tool, then offloads to Hive if needed
• Contrast with Query federation - BI Server has to issue separate SQL queries for each source, then stitch-join results
  ‣ And be aware of different SQL dialects, DB features etc
• Only columns (projection) and rows (filtering) required to answer query sent back to Exadata
• Storage Indexes used on both Exadata Storage Servers and BDA nodes to skip block reads for irrelevant data
• HDFS caching used to speed up access to commonly-used HDFS data
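The projection, filtering and storage-index points above can be illustrated with a toy Python model. It keeps a minimum and maximum value per block, skips any block whose range cannot contain the requested value, and returns only the requested columns from matching rows. This is a simplified sketch of the general idea and makes no claim about how Oracle implements SmartScan or Storage Indexes; the function names and sample data are hypothetical.

```python
def build_storage_index(blocks, column):
    """Record (min, max) of `column` for each block of rows."""
    return [
        (min(row[column] for row in block), max(row[column] for row in block))
        for block in blocks
    ]


def smart_scan(blocks, index, column, wanted, projection):
    """Return projected rows where row[column] == wanted.

    Blocks whose (min, max) range excludes `wanted` are skipped without
    being read. Also returns how many blocks were actually read.
    """
    result = []
    blocks_read = 0
    for block, (lowest, highest) in zip(blocks, index):
        if wanted < lowest or wanted > highest:
            continue  # storage index says this block is irrelevant
        blocks_read += 1
        for row in block:
            if row[column] == wanted:  # filtering
                result.append({c: row[c] for c in projection})  # projection
    return result, blocks_read


if __name__ == "__main__":
    # Hypothetical sample data: two blocks of log rows.
    blocks = [
        [{"status": 200, "post_id": 1}, {"status": 301, "post_id": 2}],
        [{"status": 404, "post_id": 3}, {"status": 500, "post_id": 4}],
    ]
    index = build_storage_index(blocks, "status")
    print(smart_scan(blocks, index, "status", 404, ["post_id"]))
```

Only the second block is read in this example, and only the `post_id` column travels back to the caller.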

Page 130: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Create Initial Analyses Against Combined Dataset

• Create analyses using full SQL features
• Access to Oracle RDBMS Advanced Analytics functions through EVALUATE, EVALUATE_AGGR etc
• Big Data SQL SmartScan feature provides fast, ad-hoc access to Hive data, avoiding MapReduce

Page 131: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Prepare Physical Model for Big Data SQL Join to GEOIP Data

• Create SELECT table view in RPD over ACCESS_PER_POST_EXTTAB table to derive IP address integer from hostname IP address
  ‣ Also add in a conversion of access date field - for later…
• Import GEOIP_COUNTRY reference table into RPD
• Join on

Page 132: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Access to Full Set of Oracle Join Types

• No longer restricted to HiveQL equi-joins - Big Data SQL supports all Oracle join operators
• Use to join Hive data (using View over external table) to the IP range country lookup table using BETWEEN join operator
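The logic behind the BETWEEN join can be sketched in Python: convert the dotted IP address to an integer, then find the lookup row whose start and end values bracket it. This is an illustration of the range-lookup idea only. The `GEO_RANGES` rows, the country labels and the function names are hypothetical and are not taken from the GEOIP_COUNTRY table.

```python
import bisect
import ipaddress


def ip_to_int(dotted):
    """Convert a dotted IPv4 address such as '1.2.3.4' to its integer form."""
    return int(ipaddress.IPv4Address(dotted))


# Hypothetical lookup rows: (range_start, range_end, country),
# sorted by range_start and non-overlapping.
GEO_RANGES = [
    (ip_to_int("10.0.0.0"), ip_to_int("10.0.0.255"), "Country A"),
    (ip_to_int("10.0.1.0"), ip_to_int("10.0.1.255"), "Country B"),
]
RANGE_STARTS = [start for start, _, _ in GEO_RANGES]


def lookup_country(dotted):
    """Equivalent of: ip_integer BETWEEN range_start AND range_end."""
    ip = ip_to_int(dotted)
    # Find the last range whose start is <= ip, then check its end.
    position = bisect.bisect_right(RANGE_STARTS, ip) - 1
    if position >= 0:
        start, end, country = GEO_RANGES[position]
        if start <= ip <= end:
            return country
    return None  # no range contains this address


if __name__ == "__main__":
    print(ip_to_int("1.2.3.4"))
    print(lookup_country("10.0.1.7"))
```

A join of this kind compares a value against a range rather than testing equality, which is why it cannot be written as an equi-join.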

Page 133: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Reports Now Include Country Data via IP Geocoding

• Makes use of Oracle SQL’s BETWEEN join operator
• Underlying log + posts data still sourced from Hive, via Big Data SQL Query Franchising

Page 134: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Add In Time Dimension Table

• Enables time-series reporting; pre-req for forecasting (linear regression-type queries)
• Map to Date field in view over ORACLE_HIVE table
  ‣ Convert incoming Hive STRING field to Oracle DATE for better time-series manipulation

Page 135: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Now Enables Time-Series Reporting and Country Lookups

Page 136: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Use Exalytics In-Memory Aggregate Cache if Required

• If further query acceleration is required, Exalytics In-Memory Cache can be used
• Enabled through Summary Advisor, caches commonly-used aggregates in in-memory cache
• Options for TimesTen or Oracle Database 12c In-Memory Option
• Returns aggregated data “at the speed of thought”

Page 137: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Part 4 Discovering and Analyzing the Data Reservoir using Oracle Big Data Discovery

Page 138: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Enable Incoming Site Activity Data for Data Discovery

• Another use-case for Hadoop data is “data discovery”
  ‣ Load data into the data reservoir
  ‣ Catalog and understand separate datasets
  ‣ Enrich data using graphical tools
  ‣ Join separate datasets together
  ‣ Present textual data alongside measures and key attributes
  ‣ Explore and analyse using faceted search

[Diagram, step 2: Combine with site content, semantics, text enrichment. Catalog and explore using Oracle Big Data Discovery.]

Why is some content more popular? Does sentiment affect viewership? What content is popular, where?

Page 139: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Oracle Big Data Discovery

• “The Visual Face of Hadoop” - cataloging, analysis and discovery for the data reservoir
• Runs on Cloudera CDH5.3+ (Hortonworks support coming soon)
• Combines Endeca Server + Studio technology with Hadoop-native (Spark) transformations

Page 140: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Data Sources used for Data Discovery Exercise

[Diagram: Cloudera CDH5.3 BDA Hadoop Cluster (three nodes, each with Spark, Hive, HDFS and BDD Data Processing) and a BDD Node (BDD DGraph Gateway, BDD Studio Web UI, Hive Client, HDFS Client). Labelled flows: ingest semi-processed logs (1m rows); ingest processed Twitter activity; write-back transformations to full datasets; upload site page and comment contents; persist uploaded DGraph content in Hive / HDFS; data discovery using Studio web-based app.]

Page 141: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Oracle Big Data Discovery Architecture

• Adds additional nodes into the CDH5.3 cluster, for running DGraph and Studio
• DGraph engine based on Endeca Server technology, can also be clustered
• Hive (HCatalog) used for reading table metadata, mapping back to underlying HDFS files
• Apache Spark then used to upload (ingest) data into DGraph, typically 1m row sample
• Data then held for online analysis in DGraph
• Option to write-back transformations to underlying Hive/HDFS files using Apache Spark

Page 142: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Ingesting & Sampling Datasets for the DGraph Engine

• Datasets in Hive have to be ingested into DGraph engine before analysis, transformation
• Can either define an automatic Hive table detector process, or manually upload
• Typically ingests 1m row random sample
  ‣ 1m row sample provides > 99% confidence that answer is within 2% of value shown, no matter how big the full dataset (1m, 1b, 1q+)
  ‣ Makes interactivity cheap - representative dataset

[Chart: Cost and Accuracy against Amount of data queried, annotated “The 100% premium”]
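One way to sanity-check the sampling claim is the textbook margin-of-error formula for an estimated proportion under simple random sampling. This is a sketch under that assumption; it is not taken from the BDD documentation, and the slide's "within 2%" figure may be defined differently. The function name and the worst-case proportion of 0.5 are choices made for the example.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(sample_size: int, confidence: float,
                    proportion: float = 0.5) -> float:
    """Half-width of the confidence interval for an estimated proportion.

    Assumes a simple random sample drawn from a much larger population;
    proportion=0.5 is the worst case (widest interval).
    """
    # Two-sided critical value, e.g. about 2.576 for 99% confidence
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sqrt(proportion * (1 - proportion) / sample_size)


moe_1m = margin_of_error(1_000_000, 0.99)   # roughly 0.0013, i.e. about 0.13%
moe_1k = margin_of_error(1_000, 0.99)       # roughly 0.041, i.e. about 4.1%
```

The full dataset size does not appear in the formula, which is why a fixed 1m-row sample stays representative whether the underlying table holds a billion rows or far more.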

Page 143: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Ingesting Site Activity and Tweet Data into DGraph

• Two output datasets from ODI process have to be ingested into DGraph engine
• Upload triggered by manual call to BDD Data Processing CLI
  ‣ Runs Oozie job in the background to profile, enrich and then ingest data into DGraph

[oracle@bddnode1 ~]$ cd /home/oracle/Middleware/BDD1.0/dataprocessing/edp_cli
[oracle@bddnode1 edp_cli]$ ./data_processing_CLI -t access_per_post_cat_author
[oracle@bddnode1 edp_cli]$ ./data_processing_CLI -t rm_linked_tweets

[Diagram: Hive table (pageviews, X rows) → Apache Spark: sample (pageviews, >1m rows) → Profiling (pageviews, >1m rows) → Enrichment (pageviews, >1m rows) → BDD dataset (pageviews, >1m rows)]

{
  "@class" : "com.oracle.endeca.pdi.client.config.workflow.ProvisionDataSetFromHiveConfig",
  "hiveTableName" : "rm_linked_tweets",
  "hiveDatabaseName" : "default",
  "newCollectionName" : "edp_cli_edp_a5dbdb38-b065…",
  "runEnrichment" : true,
  "maxRecordsForNewDataSet" : 1000000,
  "languageOverride" : "unknown"
}

Page 144: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Ingesting Site Activity and Tweet Data into DGraph

[Diagram: Hive (Full Table) → Apache Spark (Sampled Table → Profiling: Profiled Sampled Table → Enrichment: Enriched Sampled Table) → BDD (BDD Dataset); steps numbered 1 and 2]

Page 145: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Ingesting and Sampling Hive Data into Big Data Discovery

[oracle@bigdatalite ~]$ cd /home/oracle/movie/Middleware/BDD1.0/dataprocessing/edp_cli
[oracle@bigdatalite edp_cli]$ ./data_processing_CLI -t access_per_post_cat_author
[oracle@bigdatalite edp_cli]$ ./data_processing_CLI -t rm_linked_tweets
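If several Hive tables need ingesting, the manual calls above could be wrapped in a small script. The sketch below is one possible approach, not part of BDD: it uses only the directory, executable name and `-t` flag shown on the slide, and the Python function names are hypothetical.

```python
import subprocess

# Directory and table names as shown on the slide; adjust for your install.
EDP_CLI_DIR = "/home/oracle/movie/Middleware/BDD1.0/dataprocessing/edp_cli"
TABLES = ["access_per_post_cat_author", "rm_linked_tweets"]


def build_command(table: str) -> list:
    """Build the argument list for one ingest run (-t is the flag shown above)."""
    return ["./data_processing_CLI", "-t", table]


def ingest_all(tables=TABLES, cwd=EDP_CLI_DIR) -> None:
    """Run the Data Processing CLI once per table, stopping on the first failure."""
    for table in tables:
        # check=True raises CalledProcessError if the CLI exits non-zero
        subprocess.run(build_command(table), cwd=cwd, check=True)
```

Calling `ingest_all()` on the BDD node would run the two ingests in sequence; nothing is executed until that call is made.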

Page 146: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

View Ingested Datasets, Create New Project

• Ingested datasets are now visible in Big Data Discovery Studio
• Create new project from first dataset, then add second

Page 147: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Automatic Enrichment of Ingested Datasets

• Ingestion process has automatically geo-coded host IP addresses
• Other automatic enrichments run after initial discovery step, based on datatypes, content

Page 148: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Initial Data Exploration On Uploaded Dataset Attributes

• For the ACCESS_PER_POST_CAT_AUTHORS dataset, 18 attributes now available
• Combination of original attributes, and derived attributes added by enrichment process

Page 149: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Explore Attribute Values, Distribution using Scratchpad

• Click on individual attributes to view more details about them
• Add to scratchpad, automatically selects most relevant data visualisation

Page 150: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Filter (Refine) Visualizations in Scratchpad

• Click on the Filter button to display a refinement list

Page 151: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Display Refined Data Visualization

• Select refinement (filter) values from refinement pane
• Visualization in scratchpad now filtered by that attribute
  ‣ Repeat to filter by multiple attribute values

Page 152: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Save Scratchpad Visualization to Discovery Page

• For visualisations you want to keep, you can add them to Discovery page
• Dashboard / faceted search part of BDD Studio - we’ll see more later

Page 153: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Select Multiple Attributes for Same Visualization

• Select AUTHOR attribute, see initial ordered values, distribution
• Add attribute POST_DATE
  ‣ choose between multiple instances of first attribute split by second
  ‣ or one visualisation with multiple series

Page 154: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Data Transformation & Enrichment

• Data ingest process automatically applies some enrichments - geocoding etc
• Can apply others from Transformation page - simple transformations & Groovy expressions

Page 155: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Standard Transformations - Simple & Using Editor

• Group and bin attribute values; filter on attribute values, etc
• Use Transformation Editor for custom transformations (Groovy, incl. enrichment functions)

Page 156: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Datatype Conversion Example : String to Date / Time

• Datatypes can be converted into other datatypes, with data transformed if required
• Example : convert Apache Combined Format Log date/time to Java date/time
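In Studio this conversion is written as a Groovy transformation; the Python sketch below only illustrates the format mapping involved and is not the expression used in the demo. The sample timestamp and function name are invented. Apache's Combined Log Format writes the request time as `[day/month/year:hour:minute:second zone]`.

```python
from datetime import datetime

# Pattern matching the Combined Log Format time field,
# e.g. 25/Jul/2015:10:15:32 +0100 (month names assume an English locale)
APACHE_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def parse_apache_time(raw: str) -> datetime:
    """Parse a log timestamp, with or without its surrounding brackets."""
    return datetime.strptime(raw.strip("[]"), APACHE_TIME_FORMAT)


ts = parse_apache_time("[25/Jul/2015:10:15:32 +0100]")
```

The result is a timezone-aware value, so it can be sorted, bucketed by hour or day, or converted to UTC, none of which is reliable on the raw string.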

Page 157: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Transformations using Text Enrichment / Parsing

• Uses Salience text engine under the covers
• Extract terms, sentiment, noun groups, positive / negative words etc

Page 158: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Create New Attribute using Derived (Transformed) Values

• Choose option to Create New Attribute, to add derived attribute to dataset
• Preview changes, then save to transformation script

Page 159: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Commit Transforms to DGraph, or Create New Hive Table

• Transformation changes have to be committed to DGraph sample of dataset
  ‣ Project transformations kept separate from other project copies of dataset
• Transformations can also be applied to full dataset, using Apache Spark
  ‣ Creates new Hive table of complete dataset

Page 160: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Upload Additional Datasets

• Users can upload their own datasets into BDD, from MS Excel or CSV file
• Uploaded data is first loaded into Hive table, then sampled/ingested as normal

Page 161: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Join Datasets On Common Attributes

• Used to create a dataset based on the intersection (typically) of two datasets
• Not required to just view two or more datasets together - think of this as a JOIN and SELECT

Page 162: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Join Example : Add Post + Author Details to Tweet URL

• Tweets ingested into data reservoir can reference a page URL
• Site Content dataset contains title, content, keywords etc for RM website pages
• We would like to add these details to the tweets where an RM web page was mentioned
  ‣ And also add page author details missing from the site contents upload

[Diagram: three datasets joined together]
• Tweets: main “driving” dataset; contains tweet user details, tweet text, hashtags, URL referenced, location of tweeter etc
• Site Content: contains full details of each site page, including URL, title, content, category (join on URL referenced in tweet)
• Posts: contains the post author details missing from the Site Content dataset (join on internal Page ID)

Page 163: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Multi-Dataset Join Step 1 : Join Site Contents to Posts

• Site contents dataset needs to gain access to the page author attribute only found in Posts
• Create join in the Dataset Relationships panel, using Post ID as the common attribute
• Join from Site contents to Posts, to create left-outer join from first to second table

[Screenshot callout: previews rows from the join, based on post_id = a (post_id column)]

Page 164: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Multi-Dataset Join Step 2 : Standardise URL Formats

• URLs in Twitter dataset have trailing ‘/’, whereas URLs in RM site data do not
• Use the Transformation feature in Studio to add trailing ‘/’ to RM site URLs
• Select option to replace the current URL values and overwrite within project dataset

Page 165: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Multi-Dataset Join Step 3 : Join Tweets to Site Content

• Join on the standardised-format URL attributes in the two datasets
• Data view will now contain the page content and author for each tweet mentioning RM
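The three join steps can be mimicked on plain Python dictionaries to show what each one contributes. This is an illustrative sketch only: Studio performs these joins through its Dataset Relationships panel, the sample rows and example.com URLs are invented, the helper names are hypothetical, and step 3 is assumed here to be a left outer join from tweets.

```python
def with_trailing_slash(url: str) -> str:
    """Step 2: standardise URLs so both datasets end with '/'."""
    return url if url.endswith("/") else url + "/"


def left_outer_join(left, right, key):
    """Left outer join of two lists of dicts; right is assumed unique on key."""
    index = {row[key]: row for row in right}
    # Unmatched left rows are kept unchanged; left values win on name clashes
    return [{**index.get(row[key], {}), **row} for row in left]


# Invented sample rows standing in for the three datasets
site_contents = [
    {"post_id": 1, "url": "http://example.com/post-one", "title": "Post One"},
    {"post_id": 2, "url": "http://example.com/post-two", "title": "Post Two"},
]
posts = [{"post_id": 1, "author": "Author A"}]
tweets = [
    {"tweet": "Great read", "url": "http://example.com/post-one/"},
    {"tweet": "Unrelated", "url": "http://example.com/other/"},
]

# Step 1: site contents gain the author attribute from posts
pages = left_outer_join(site_contents, posts, "post_id")
# Step 2: add the trailing '/' to the site URLs
pages = [{**page, "url": with_trailing_slash(page["url"])} for page in pages]
# Step 3: tweets gain page title and author via the standardised URL
tweets_with_pages = left_outer_join(tweets, pages, "url")
```

Without step 2 the first tweet would find no matching page, because the site URL would lack the trailing slash present in the tweet's URL.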

Page 166: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Key BDD Studio Differentiator : Faceted Search Across Hadoop

• BDD Studio dashboards support faceted search across all attributes, refinements
• Auto-filter dashboard contents on selected attribute values - for data discovery
• Fast analysis and summarisation through Endeca Server technology

[Screenshot callouts: 3 - Further refinement on “OBIEE” in post keywords; 4 - Results now filtered on two refinements]

Page 167: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Create Discovery Pages for Dataset Analysis

• Select from palette of visualisation components
• Select measures, attributes for display

Page 168: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Select From Multiple Visualisation Types

Page 169: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Faceted Search Across Project

• Search on attribute values, text in attributes across all datasets
• Extracted keywords, free text field search

Page 170: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Key BDD Studio Differentiator : Faceted Search Across Hadoop

• BDD Studio dashboards support faceted search across all attributes, refinements
• Auto-filter dashboard contents on selected attribute values
• Fast analysis and summarisation through Endeca Server technology

[Screenshot callouts: 1 - “Mark Rittman” selected from Post Authors; 2 - Results filtered on selected refinement]
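The refinement behaviour shown in the screenshots (select an author, then narrow further by keyword) can be approximated in a few lines of Python. This is a conceptual sketch of faceted filtering, not how the DGraph engine is implemented; the records are invented apart from the two refinement values named on the slides, and the function names are hypothetical.

```python
from collections import Counter

# Invented records; "Author B" and the keyword lists are placeholders
RECORDS = [
    {"author": "Mark Rittman", "keywords": ["OBIEE", "Hadoop"]},
    {"author": "Mark Rittman", "keywords": ["ODI"]},
    {"author": "Author B", "keywords": ["OBIEE"]},
]


def refine(records, refinements):
    """Keep only records matching every selected attribute value."""
    def matches(record, attribute, value):
        field = record.get(attribute)
        # Multi-valued attributes match if any value equals the refinement
        return value in field if isinstance(field, list) else field == value

    return [r for r in records
            if all(matches(r, a, v) for a, v in refinements.items())]


def facet_counts(records, attribute):
    """Count remaining values of an attribute, as shown in a refinement list."""
    counts = Counter()
    for record in records:
        field = record.get(attribute)
        counts.update(field if isinstance(field, list) else [field])
    return counts


by_author = refine(RECORDS, {"author": "Mark Rittman"})
by_author_and_keyword = refine(RECORDS, {"author": "Mark Rittman",
                                         "keywords": "OBIEE"})
remaining_keywords = facet_counts(by_author, "keywords")
```

After the first refinement, `remaining_keywords` lists only the keywords still reachable, which is what lets a user keep narrowing the result set one attribute at a time.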

Page 171: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Summary

• Oracle Big Data, together with OBIEE, ODI and Oracle Big Data Discovery
• Complete end-to-end solution with engineered hardware, and Hadoop-native tooling

[Diagram: 1 - Combine with Oracle Big Data SQL for structured OBIEE dashboard analysis; 2 - Combine with site content, semantics, text enrichment; catalog and explore using Oracle Big Data Discovery]

Page 172: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

… And Finally

• Additional Resources
• How to Contact Us

Page 173: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Additional Resources

• Articles on the Rittman Mead Blog
  ‣ http://www.rittmanmead.com/category/oracle-big-data-appliance/
  ‣ http://www.rittmanmead.com/category/big-data/
  ‣ http://www.rittmanmead.com/category/oracle-big-data-discovery/
• Slides will be on the BI Forum USB sticks
• Rittman Mead offer consulting, training and managed services for Oracle Big Data
  ‣ Oracle & Cloudera partners
  ‣ http://www.rittmanmead.com/bigdata

Page 174: Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture

Thank You for Attending!

• Thank you for attending this presentation; more information can be found at http://www.rittmanmead.com
• Contact us at [email protected] or [email protected]
• Look out for our book, “Oracle Business Intelligence Developers Guide”, out now!
• Follow us on Twitter (@rittmanmead) or Facebook (facebook.com/rittmanmead)