R and Big Data: Using Revolution R Enterprise with Hadoop

Revolution Confidential. Revolution Analytics: Bringing the Analytical Power of R to the Hadoop Platform. Simon Field, Technical Director, Revolution Analytics. June 14, 2013.


DESCRIPTION

Find out how Revolution Analytics is making it easier to work with Hadoop frameworks with Revolution R Enterprise.

TRANSCRIPT

Revolution Confidential

Revolution Analytics

Bringing the Analytical Power of R to the Hadoop Platform

Simon Field, Technical Director, Revolution Analytics

June 14, 2013

Vigorous Growth of Big Data…

The global Big Data market revenue is expected to grow from $1.56 billion in 2012 to $13.95 billion in 2017, at an estimated CAGR of 54.9% from 2012 to 2017.

- Marketsandmarkets.com study, 14 April 2013

“…the market for Big Data technology will reach $16.9 billion by 2015, up from $3.2 billion in 2010. That is a 40 percent-a-year growth rate, about seven times the estimated growth rate for the overall information technology and communications business.”

- IDC study, March 2012
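The two growth rates quoted above can be checked with the standard compound annual growth rate formula. The snippet below is a minimal Python sketch, not part of the original deck; the function name `cagr` is illustrative, and the inputs are the dollar figures (in billions) taken from the two quotes.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate as a fraction (0.40 means 40% a year)."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Figures quoted on the slide, in billions of US dollars.
marketsandmarkets = cagr(1.56, 13.95, 2017 - 2012)
idc = cagr(3.2, 16.9, 2015 - 2010)

print(f"Marketsandmarkets.com figures: {marketsandmarkets:.1%} per year")
print(f"IDC figures: {idc:.1%} per year")
```

Both results land close to the rates stated in the quotes (about 55% and about 40% a year respectively).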

Big Data = Opportunity + Disruption

Huge New Data Assets
• Internet: Commerce, Communications, Collaboration
• Social Media: Personal, Presence, New Social Networks
• Ubiquitous Telemetry: Machines Everywhere

Rapidly-Evolving Platforms
• “Data Lake” vs. “Warehouse” vs. “Big Data App. Platforms”
• Vast Choices Among Open Source Platforms
• Eliminate Time-Consuming Data Movements

Emerging Business Opportunities
• Data Science Unlocks New Insight
• Big Data Drives Better Decision-Making
• Platforms Evolve Rationally Toward Big Data Vision

Hadoop Analytics Platforms: Disruption, Challenge, Growth & Opportunity At Once

• Java Skill Requirements
• Hadoop’s Innovation Pace
• Analytical
• Write Once, Deploy Anywhere

Growth: Skill Development

• EDW Saturation
• Limited Analytical Capabilities
• Data Science Skill Shortage
• MapReduce Paradigm

Disruption: Evolving Ecosystems

• Designed for Massive Scale
• Commodity Foundations
• Built for Data Variety
• Open Source Innovation Pace

Challenge: Big Data Readiness

• Descriptive -> Predictive
• Short Analytical Cycle Time
• Ubiquitous Analytical Decisions
• Low-Latency Analytics

Opportunity: New, More Capable Analytic Foundation

What We Need: Convergence

Data Science
• With business solutions that fuse statistics, mathematics and software into meaningful applications.

Software Engineering
• With tools and frameworks to create agile, scalable analytics-based applications.

IT Operations Management
• Deployment platforms that are integrated, cost-effective, secure and ubiquitous.

What is the R Statistics Language?

The R Language:
• Straightforward Procedural Language for Stats, Math and Data Science
• Open Source

The R Community:
• 2M Users with the skill to tackle big data mathematical / statistical and ML needs.
• Began on workstation / modest SMP servers

The R Ecosystem:
• 4500+ Freely Available Algorithms in CRAN
• Applicable to Big Data if scaled

Why R and Hadoop?

• Hadoop dominates Big Data storage and computational platforms.
• R dominates Data Science, providing a language, users, and thousands of pre-built algorithms.
• Bringing them together is our goal today.

Mission

Company Confidential – Do not distribute

Revolution R Enterprise is the only commercial big data analytics platform based on the open source R statistical computing language.
• Enterprise-ready
• Multi-platform
• Scalable from desktop to big data
• Delivers high performance analytics
• Easier to build and deploy analytic applications

Who we are
• Leading provider of a commercial analytics platform based on the open source R statistical computing language

Global Industries Served
• Financial Services, Digital Media, Government, Health & Life Sciences, High Tech, Manufacturing, Retail, Telco

Our Software Delivers
• Power: Distributed, scalable high performance advanced analytics
• Productivity: Easier to build and deploy analytic applications
• Enterprise Readiness: Multi-platform

Our Philosophy
• Customer-centric innovation
• Easy to do business with

Our Investors
• Intel Capital, North Bridge, Presidio Ventures

Customers
• 200+ Global 2000

Global Presence
• North America / EMEA / APAC

Our Services Deliver
• Knowledge: Our experts enable you to be experts
• Time-to-Value: Our Quickstart projects give you a jumpstart
• Guidance: Our customer support team is here to help you

Big Data Speed and Scale with Revolution R Enterprise

• Fast Math Libraries
• Parallelized Algorithms
• In-Database Execution
• Multi-Threaded Execution
• Multi-Core Execution
• In-Hadoop Execution
• Memory Management
• Parallelized User Code

Revolution R Enterprise Propels Enterprises into the Future

[Diagram: layered architecture. Decision: Analytic Applications. Integration: Middleware. Analytics: Revolution R Enterprise High Performance Analytics Platform. Data: Hadoop, Data Warehouse, Other Data Sources.]

200+ Corporate Customers and Growing

• Digital Media & Retail
• Finance & Insurance
• Healthcare & Life Sciences
• Manufacturing & High Tech
• Academic & Gov’t

Revolution R Enterprise and R MapReduce

Bringing The R Language to the Hadoop Environment.


R MapReduce: Fast, Agile Analytics for Hadoop Today

R MapReduce Enables R-Based Analytics In Hadoop:
• Use R to Explore and Visualize Data to Develop Insights
• Build Models Using Widely-Available Techniques
• Score Data Directly in Hadoop Using R Models
• Run R as Mappers and Reducers in Hadoop

Advantages:
• No Data Movement Needed
• Connects R to HDFS, HBase and Hive
• Run Standard MapReduce Jobs
• R Programmers Need Not Learn Java
• Need Not Rewrite R into Java, Pig or SQL to Score Data
• Accelerates Projects Leveraging Libraries By Bringing 4500+ Open Source R Algorithms in CRAN¹ to Hadoop

[Diagram: Data, Analytics and Applications layers. Elements shown: Hadoop with MapReduce, R MapReduce (RMR), Other MapReduce Jobs, HDFS, HBase and Hive; Data Warehouse; Other Data Sources.]

¹ CRAN: Comprehensive R Archive Network, an open source collection of 4500+ R-based statistics, analytics, graphics and data manipulation algorithms for R users.

R MapReduce: Build MapReduce Jobs Entirely In R

Your Creativity + Your Code + 4500+ R Packages in CRAN = Rich, Powerful Data Analytics That Runs in MapReduce.

[Diagram: Elements shown: Revolution R Enterprise, CRAN Packages, R MapReduce (RMR) with MAP and REDUCE tasks, Hadoop, HDFS, HBase, Hive.]

Why Build MapReduce Jobs using R?

What can you do with it?
• Transform, Aggregate, Regress, Cluster, Filter, Simulate, Model, Score …
• Run R Programs While Leveraging Hadoop’s Scalability
  • Big I/O: Score data files containing billions of rows
  • Big Math: Run compute-intensive algorithms in parallel (Monte Carlo, Random Trees, etc.)
• Deliver results to BI or Visualization Tools and Production Applications

When to choose RMR:
• Need to Develop Analytics in R, on Big Data in Hadoop
• Stringent Latency Requirements
• Scarce R and Java Developers Need to Collaborate, Not Duplicate
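The "Big Math" case (compute-intensive algorithms such as Monte Carlo run in parallel) maps naturally onto MapReduce: each map task runs an independent batch of simulations and the reduce step combines the partial counts. The following is a minimal Python sketch of that pattern, run locally rather than on Hadoop. It is not RMR code; the function names, seeds and batch sizes are illustrative assumptions.

```python
import random
from typing import Iterable, Tuple


def simulate_chunk(seed: int, n_draws: int) -> Tuple[int, int]:
    """Map step: one independent simulation batch; returns (hits, draws)."""
    rng = random.Random(seed)  # separate seed per task keeps batches independent
    hits = 0
    for _ in range(n_draws):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:  # point falls inside the quarter circle
            hits += 1
    return hits, n_draws


def combine(partials: Iterable[Tuple[int, int]]) -> float:
    """Reduce step: add up hits and draws, then estimate pi."""
    total_hits = 0
    total_draws = 0
    for hits, draws in partials:
        total_hits += hits
        total_draws += draws
    return 4.0 * total_hits / total_draws


# Eight "map tasks" of 50,000 draws each, executed sequentially here.
partials = [simulate_chunk(seed, 50_000) for seed in range(8)]
pi_estimate = combine(partials)
print(pi_estimate)
```

Because the batches share nothing, adding more map tasks increases the number of draws without changing the reduce logic.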

R MapReduce: Create Mappers and Reducers Using R

How:
• Build R Code Using Revolution R Enterprise
• Use Open Source Algorithms From CRAN project.
• Leverage HDFS and MapReduce Directly
• Deploy R Mappers & Reducers in Hadoop

[Diagram: Data, Analytics and Applications layers. Elements shown: Revolution R Enterprise (RRE) with R Code, R Packages and CRAN Packages; Hadoop with MapReduce, R MapReduce (RMR), Other MapReduce Jobs, HDFS, HBase and Hive; Data Warehouse; Other Data Sources.]

Mappers & Reducers: 100% R. 100% Hadoop.

For Hadoop Users:
• Integrates R with Hadoop via Hadoop Streaming
• Creates MapReduce Jobs Compatible with JobTracker
• No Need to Recode Models
• No Latency to Move Data

For R Programmers:
• No Need for Java Programming
• Serializes & Deserializes Data Between HDFS and R
• Handles Standard HDFS Read & Write Transparently
• Provides Explicit Access to HDFS, HBase and Hive via Packages
• Access to CRAN Algorithm Library

[Diagram: Elements shown: Mapper or Reducer; Hadoop Streaming; R Code; Revolution R Enterprise; Data Deserialization; Data Serialization; High-Speed Connectors; HBase; Hive; HDFS; CRAN.]
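Hadoop Streaming, which the slide names as the integration point, runs any executable as a mapper or reducer: the process reads lines on standard input and writes tab-separated key/value lines on standard output, and the reducer receives its input sorted by key. The sketch below illustrates that contract with a word count in Python, standing in for the R code the deck describes. The function names are illustrative, and the `sorted` call is a local stand-in for Hadoop's shuffle and sort; in a real streaming job each function would read `sys.stdin` and print its output.

```python
import io
from itertools import groupby


def mapper(lines):
    """Emit one 'key<TAB>value' line per word, as a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the values for each key; the input must already be sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


# Local stand-in for a job: map, then sort by key (the shuffle), then reduce.
text = io.StringIO("R on Hadoop\nHadoop runs R\n")
shuffled = sorted(mapper(text))
counts = list(reducer(shuffled))
print(counts)
```

Serialization here is plain text; the slide indicates that RMR handles serialization and deserialization between HDFS and R on the programmer's behalf.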

Leveraging R with Hadoop

With R “Inside” Hadoop…
• In-Place ETL
  • Data Transformation in R
  • Enrichment and Correlation Using Other Data In Hadoop
• Simulation/Experimentation
  • Execute Complex Simulations on Massively-Parallel Hadoop Clusters
• Scoring
  • Run Scoring Models Directly in Hadoop. No Movement Penalty

How?
• Write Mappers & Reducers in R and Deploy Using R MapReduce
• Augment Hadoop with CRAN¹ Packages

¹ Use of CRAN algorithms limited to non-graphical, parallelizable algorithms
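Scoring "directly in Hadoop" is typically a map-only job: every record is scored independently, so no reduce step or data movement is needed. The following Python sketch shows the shape of such a mapper. The logistic model, its coefficients and the record fields are hypothetical values chosen for illustration, not anything taken from the deck.

```python
import math

# Hypothetical logistic-regression coefficients, assumed to be fitted elsewhere.
MODEL = {"intercept": -1.0, "amount": 0.002, "foreign": 1.5}


def score(record: dict) -> float:
    """Return the predicted probability for a single record."""
    z = (
        MODEL["intercept"]
        + MODEL["amount"] * record["amount"]
        + MODEL["foreign"] * record["foreign"]
    )
    return 1.0 / (1.0 + math.exp(-z))


def scoring_mapper(records):
    """Map-only step: emit (record id, score); records are never aggregated."""
    for record in records:
        yield record["id"], score(record)


# Two illustrative records; on a cluster each mapper would see one input split.
records = [
    {"id": "t1", "amount": 500.0, "foreign": 0},
    {"id": "t2", "amount": 500.0, "foreign": 1},
]
scores = dict(scoring_mapper(records))
print(scores)
```

Since each record is handled on its own, the job scales with the number of input splits, which is what makes scoring files with billions of rows practical.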

Limitations of R MapReduce

• R Programmer Must “Think MapReduce”: Dividing Work into Cascades of Map, Reduce, Repeat.
• Algorithms Must be Designed for Parallelism, Including External Packages Used.

Fits:
• Hadoop-Literate Teams or Those With Good Support

Non-Fits:
• Analytics Teams Tinkering with Hadoop on Short Timeframes.
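A small example of what "thinking MapReduce" involves: a mean cannot be computed by averaging per-partition means, because partitions differ in size. The map step has to emit a (sum, count) pair per key, and the reduce step divides only at the end. This Python sketch illustrates the idea locally; the function names and data are illustrative and it does not use any RMR API.

```python
from collections import defaultdict


def map_partition(rows):
    """Map: emit (key, (sum, count)) per partition instead of a local mean."""
    partial = defaultdict(lambda: [0.0, 0])
    for key, value in rows:
        partial[key][0] += value
        partial[key][1] += 1
    return [(key, (total, count)) for key, (total, count) in partial.items()]


def reduce_means(pairs):
    """Reduce: add the partial sums and counts, then divide once per key."""
    totals = defaultdict(lambda: [0.0, 0])
    for key, (total, count) in pairs:
        totals[key][0] += total
        totals[key][1] += count
    return {key: total / count for key, (total, count) in totals.items()}


# Two partitions of unequal size, as two map tasks might see them.
partitions = [
    [("a", 1.0), ("a", 2.0), ("a", 3.0)],
    [("a", 10.0), ("b", 4.0)],
]
emitted = [pair for part in partitions for pair in map_partition(part)]
means = reduce_means(emitted)
print(means)
```

Averaging the two partition means for key "a" (2.0 and 10.0) would give 6.0, while the correct mean of the four values is 4.0. An algorithm or external package that was not written with this decomposition in mind needs the same kind of rework before it can run as mappers and reducers.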

[Diagram: Data, Analytics and Applications layers. Elements shown: Hadoop with MapReduce, R MapReduce (RMR), Other MapReduce Jobs, HDFS, HBase and Hive; Data Warehouse; Other Data Sources.]

More Ways to Leverage R with Hadoop: “Beside” Architectures

Inside Hadoop
• In-Place ETL
  • Data Transformation in R
  • Enrichment and Correlation Using Other Data In Hadoop
• Simulation/Experimentation
  • Execute Complex Simulations on Massively-Parallel Hadoop Clusters
• Scoring
  • Run Scoring Models Directly in Hadoop. No Movement Penalty
• How?
  • Write Mappers & Reducers in R and Deploy Using R MapReduce
  • Augment Hadoop with CRAN¹ Packages

“Beside” Architectures

Drivers:
• Large or Unpredictable R Workloads
• Modest Hadoop Cluster
• Shared Production Hadoop Cluster
• Hadoop Novice
• Large Numbers of R Users
• Modest Data Sets To Be Scored
• Movement Penalty Isn’t Prohibitive
• Maximized Computational Scale
• Access to ScaleR Parallel External Memory Algorithms (PEMAs)

Advantages:
• Makes Hadoop Easier to Administer
• Stabilizes Hadoop Resource Availability

Two Additional “Beside” Architectures

Alternatives:
• RRE “Beside” Hadoop
• RRE Both “Beside” and “Inside” Hadoop with RMR

“Beside” Usage:
• Sample into “Beside” Server or Cluster
• Analyze and Model on R Server or Cluster
• Score Data on R Server or Cluster
• Results to Hadoop for Use.

“Both” Usage - Same As Above Except:
• Move Model to Data on Hadoop
• Score Data In-Place on Hadoop

Why multiple options?
• Greatest Flexibility
• Optimize Skill Sets
• Scale Clusters Independently
• Control Concurrency and Security
• Optimize Utilization
• Same R Code Can Run in Both
• Balance Ease of Use/Development and Resulting Performance & Scale
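The “Both” usage pattern (sample and model beside Hadoop, then move the model to the data and score in place) can be sketched in a few lines. The Python example below is an illustration under stated assumptions: the data is synthetic, the model is a one-variable least-squares line, and JSON is used as a hypothetical export format for the fitted model. None of these choices come from the deck.

```python
import json
import random


def fit_line(sample):
    """Ordinary least squares for y = intercept + slope * x on a small sample."""
    n = len(sample)
    mean_x = sum(x for x, _ in sample) / n
    mean_y = sum(y for _, y in sample) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in sample)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in sample)
    slope = sxy / sxx
    return {"intercept": mean_y - slope * mean_x, "slope": slope}


def score_in_place(model_json, xs):
    """Runs where the data lives: load the exported model and score each value."""
    model = json.loads(model_json)
    for x in xs:
        yield x, model["intercept"] + model["slope"] * x


# "Beside": draw a sample from the full data set and fit on the analytics server.
rng = random.Random(42)
full_data = [(float(x), 3.0 + 2.0 * x) for x in range(1000)]  # synthetic data
sample = rng.sample(full_data, 50)
exported = json.dumps(fit_line(sample))  # only the small model moves, not the data

# "Inside": the exported model is applied to records on the cluster.
scored = list(score_in_place(exported, [10.0, 20.0]))
print(scored)
```

The design point is that only the model (a few numbers) crosses between the two clusters, while the large data set stays where it is.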

RRE “Beside” Hadoop

• Separate Hadoop & R Clusters
• Connectors: HDFS, HBase & Hive
• Explore & Model Data on R server(s)
• Return Scored Data to HDFS/HBase/Hive

When To Use:
• Small, Shared or Production Hadoop Cluster
• Need Parallelized Algorithms
• Heavy Random Workloads
• Extensive “Sandboxing”
• Modest Data Scoring
• Data Security Constraints
• … while awaiting YARN…

Advantages:
• Concurrency By Separation
• Security By Separation
• Independent Scalability
• ScaleR Parallel Algorithms

[Diagram: Data, Analytics and Applications layers. Elements shown: Hadoop Cluster with MapReduce, Other MapReduce Jobs, HDFS, HBase and Hive; ConnectR (HBase, HDFS), ODBC & High-Speed Connectors; Analytics Server or Cluster (Linux, Windows, LSF or Azure) running Revolution R Enterprise (RRE) with CRAN Packages for Data Manipulation and Analysis; Analytics Apps.; BI & Visualization; Data Warehouse; Other Data Sources.]

RRE “Beside” and “Inside”

• Both “Inside” and “Beside” Platforms
• Connect a Compute Cluster to Hadoop to Run R
• Move Models to Score Big Data on Hadoop

When To Use:
• Production Hadoop Cluster
• Need Parallelized Algorithms
• Heavy Random Workloads
• Extensive “Sandboxing”
• Large Data Scoring
• Data Security Constraints
• … while awaiting YARN…

Advantages:
• Concurrency & Security
• Independent Scalability
• Big Data Scoring
• Flexibility
• Low Latency

[Diagram: Data, Analytics and Applications layers. Elements shown: Hadoop Cluster with MapReduce, R MapReduce (RMR), Other MapReduce Jobs, HDFS, HBase and Hive; ConnectR (HBase, HDFS), ODBC & High-Speed Connectors; Analytics Server or Cluster (Linux, Windows, LSF or Azure) running Revolution R Enterprise (RRE) with CRAN Packages; Analytics Apps.; BI & Visualization; Data Warehouse; Other Data Sources.]

Typical Predictive Analytics Workflow

• Data Prep: Ingest, Format, Enrich, Filter, Aggregate, Profile
• Explore: Sample, Cluster, Visualize, Correlate, Sandboxing
• Model: Segment, Categorize, Select Features, Simulate, Predict, Validate
• Deploy: Deploy, Score, Integrate
• Improve: Measure Accuracy, Iterate

‘Beside’ and/or ‘Inside’: Dominant Usage Patterns Observed

Use Case 1: Real-Time Scoring Example – Fraud Prevention

Use Case 2: Modeling and Scoring Example – Attribution Analysis

Use Case 3: Production Analytics Example – Telematics-Assisted Underwriting


Example 1: Card Fraud Detection

[Diagram: fraud-detection data flow.
Data sources shown: In-House Systems (Transaction History); Weblog Data; Personal Data (Credit-worthiness, Banking); Transaction Data; Mortgage Data; Demographic Data.
Hadoop: MapReduce, HDFS, HBase, R MapReduce (RMR), Other MapReduce Jobs.
Beside Hadoop: Revolution R Enterprise on an R Workstation; ConnectR (HBase, HDFS); ODBC & High-Speed Connectors.
Steps shown: Ingest; Filter & Xform; Correlate & Rate; Develop Risk Models; Execute Models; Filter & Score Transactions; Deliver & Integrate.
Also shown: Authorization Systems; BI & Visualization.]

Example 2: Attribution Analysis “Beside” Hadoop

[Diagram: attribution-analysis data flow.
Data sources shown: In-House Systems (EDW, CRM, Datamarts); Weblog Data; Call Center Data; Marketing Service Provider Feeds (Acxiom, Experian, ExactTarget, Monitored Responses, CoreMetrics, Dotomi, DoubleClick).
Hadoop: MapReduce, HDFS, HBase, Java MapReduce Jobs.
Beside Hadoop: Revolution R Enterprise on a Linux Server or Cluster Server; ConnectR (HBase, HDFS); ODBC & High-Speed Connectors.
Steps shown: Ingest; Filter & Transform; Sessionize; Aggregate, Profile & Enrich; Load Analysis Environment; Develop Attribution Models; Score; Deliver to Users.
Also shown: Analytics Apps.; BI & Visualization.]

Example 3: Telematics-Enhanced Underwriting

[Diagram: underwriting data flow.
Data sources shown: Policy Origination Data; Vehicle Sensor Data (Speed, Time, Acceleration, Location); Creditworthiness Data; Insured Data (Loss History, Payment History, Credit File, Demographics).
Hadoop: MapReduce, HDFS, HBase, R MapReduce (RMR), Other MapReduce Jobs.
Beside Hadoop: Revolution R Enterprise on a Linux Server or Cluster Server; ConnectR (HBase, HDFS); ODBC & High-Speed Connectors.
Steps shown: Ingest; Correlate Sources; Filter, Aggregate & Profile; Load Model Environment; Develop Risk Models; Export Models; Score Large Datasets; Deliver to Underwriting & Call Response Systems.
Also shown: Underwriting Applications.]

Conclusion

• Big Data Is Hard. Hadoop is Key to Managing It. R is Key to Applying It.
• Revolution R on Hadoop Brings Data Science to Big Data
  • Hadoop Brings Parallel Performance to R
  • R Brings a Community with Know-How to Hadoop
• Revolution Analytics Can Deliver Convergence Today.
• … and the Future of R on Hadoop is Even Brighter…

Thank you.

www.revolutionanalytics.com  650.646.9545 Twitter: @RevolutionR

The leading commercial provider of software and support for the popular open source R statistics language.