R and Big Data: Using Revolution R Enterprise with Hadoop
DESCRIPTION
Find out how Revolution Analytics is making it easier to work with Hadoop frameworks with Revolution R Enterprise.
TRANSCRIPT
Revolution Confidential
Revolution Analytics
Bringing the Analytical Power of R to the Hadoop Platform
Simon Field, Technical Director, Revolution Analytics
June 14, 2013
Vigorous Growth of Big Data…
The global Big Data Market revenue is expected to grow from $1.56 billion in 2012 to $13.95 billion in 2017, at an estimated CAGR of
54.9% from 2012 to 2017.
- Marketsandmarkets.com study, 14 April 2013
“…the market for Big Data technology will reach $16.9 billion by 2015, up from $3.2 billion in 2010. That is a 40 percent-a-year
growth rate – about seven times the estimated growth rate for the overall information technology and communications business.”
– IDC study, March 2012
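The growth rates quoted above can be checked with the standard compound annual growth rate formula, CAGR = (end / start)^(1 / years) - 1. The short Python sketch below is only an arithmetic check on the published figures; the function name `cagr` is illustrative and not part of either study.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values over a number of years."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Marketsandmarkets figures: $1.56 billion (2012) to $13.95 billion (2017).
mm_rate = cagr(1.56, 13.95, 2017 - 2012)

# IDC figures: $3.2 billion (2010) to $16.9 billion (2015).
idc_rate = cagr(3.2, 16.9, 2015 - 2010)

print(f"Marketsandmarkets implied CAGR: {mm_rate:.1%}")
print(f"IDC implied CAGR: {idc_rate:.1%}")
```

Both results land close to the quoted rates: roughly 55% a year for the first study and just under 40% a year for the second.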
Big Data = Opportunity + Disruption
Huge New Data Assets
• Internet – Commerce, Communications, Collaboration
• Social Media – Personal, Presence, New Social Networks
• Ubiquitous Telemetry – Machines Everywhere
Rapidly-Evolving Platforms
• “Data Lake” vs. “Warehouse” vs. “Big Data App. Platforms”
• Vast Choices Among Open Source Platforms
• Eliminate Time-Consuming Data Movements
Emerging Business Opportunities
• Data Science Unlocks New Insight
• Big Data Drives Better Decision-Making
• Platforms Evolve Rationally Toward Big Data Vision
Hadoop Analytics Platforms: Disruption, Challenge, Growth & Opportunity At Once
Growth: Skill Development
• Java Skill Requirements
• Hadoop’s Innovation Pace
• Analytical
• Write Once, Deploy Anywhere
Disruption: Evolving Ecosystems
• EDW Saturation
• Limited Analytical Capabilities
• Data Science Skill Shortage
• MapReduce Paradigm
Challenge: Big Data Readiness
• Designed for Massive Scale
• Commodity Foundations
• Built for Data Variety
• Open Source Innovation Pace
Opportunity: New, More Capable Analytic Foundation
• Descriptive -> Predictive
• Short Analytical Cycle Time
• Ubiquitous Analytical Decisions
• Low-Latency Analytics
What We Need: Convergence
• Data Science: With business solutions that fuse statistics, mathematics and software into meaningful applications.
• Software Engineering: With tools and frameworks to create agile, scalable analytics-based applications.
• IT Operations Management: Deployment platforms that are integrated, cost-effective, secure and ubiquitous.
What is the R Statistics Language?
The R Language:
• Straightforward Procedural Language for Stats, Math and Data Science
• Open Source
The R Community:
• 2M Users with the skill to tackle big data mathematical / statistical and ML needs.
• Began on workstation / modest SMP servers
The R Ecosystem:
• 4500+ Freely Available Algorithms in CRAN
• Applicable to Big Data if scaled
Why R and Hadoop?
• Hadoop dominates Big Data storage and computational platforms.
• R dominates Data Science, providing a language, users and thousands of pre-built algorithms.
• Bringing them together is our goal today.
Mission
Revolution R Enterprise is the only commercial big data analytics platform based on the open source R statistical computing language.
• Enterprise-ready
• Multi-platform
• Scalable from desktop to big data
• Delivers high performance analytics
• Easier to build and deploy analytic applications
Company Confidential – Do not distribute
Global Industries Served: Financial Services, Digital Media, Government, Health & Life Sciences, High Tech, Manufacturing, Retail, Telco
Our Software Delivers:
• Power: Distributed, scalable high performance advanced analytics
• Productivity: Easier to build and deploy analytic applications
• Enterprise Readiness: Multi-platform
Our Philosophy:
• Customer-centric innovation
• Easy to do business with
Our Investors: Intel Capital, North Bridge, Presidio Ventures
Who we are: Leading provider of a commercial analytics platform based on the open source R statistical computing language
Customers: 200+ Global 2000
Global Presence: North America / EMEA / APAC
Our Services Deliver:
• Knowledge: Our experts enable you to be experts
• Time-to-Value: Our Quickstart projects give you a jumpstart
• Guidance: Our customer support team is here to help you
Big Data Speed and Scale with Revolution R Enterprise
• Fast Math Libraries
• Parallelized Algorithms
• In-Database Execution
• Multi-Threaded Execution
• Multi-Core Execution
• In-Hadoop Execution
• Memory Management
• Parallelized User Code
Revolution R Enterprise Propels Enterprises into the Future
[Diagram: layered platform stack]
• Decision: Analytic Applications
• Integration: Middleware
• Data: Hadoop, Data Warehouse, Other Data Sources
• Analytics: Revolution R Enterprise High Performance Analytics Platform
200+ Corporate Customers and Growing
• Finance & Insurance
• Healthcare & Life Sciences
• Digital Media & Retail
• Manufacturing & High Tech
• Academic & Gov’t
Revolution R Enterprise and R MapReduce
Bringing the R Language to the Hadoop Environment.
R MapReduce: Fast, Agile Analytics for Hadoop Today
R MapReduce Enables R-Based Analytics In Hadoop:
• Use R to Explore and Visualize Data to Develop Insights
• Build Models Using Widely-Available Techniques
• Score Data Directly in Hadoop Using R Models
• Run R as Mappers and Reducers in Hadoop
Advantages:
• No Data Movement Needed
• Connects R to HDFS, HBase and Hive
• Run Standard MapReduce Jobs
• R Programmers Need Not Learn Java
• Need Not Rewrite R into Java, Pig or SQL to Score Data
• Accelerates Projects Leveraging Libraries By Bringing 4500+ Open Source R Algorithms in CRAN¹ to Hadoop
[Diagram: Hadoop (HDFS, HBase, Hive) with a MapReduce layer running R MapReduce (RMR) alongside other MapReduce jobs; Data Warehouse and Other Data Sources shown as additional data; layers labeled Data, Analytics and Applications.]
¹ CRAN: Comprehensive R Archive Network – an open source collection of 4500+ R-based statistics, analytics, graphics and data manipulation algorithms for R users.
R MapReduce: Build MapReduce Jobs Entirely In R
Your Creativity + Your Code + 4500+ R Packages in CRAN = Rich, Powerful Data Analytics That Runs in MapReduce.
[Diagram: Revolution R Enterprise and CRAN packages supplying Map and Reduce tasks through R MapReduce (RMR) in Hadoop, over HDFS, HBase and Hive.]
Why Build MapReduce Jobs using R?
What can you do with it?
• Transform, Aggregate, Regress, Cluster, Filter, Simulate, Model, Score …
• Run R Programs While Leveraging Hadoop’s Scalability
• Big I/O: Score data files containing billions of rows
• Big Math: Run compute-intensive algorithms in parallel – Monte Carlo, Random Trees, etc.
• Deliver results to BI or Visualization Tools and Production Applications
When to choose RMR:
• Need to Develop Analytics in R, on Big Data in Hadoop
• Stringent Latency Requirements
• Scarce R and Java Developers Need to Collaborate, Not Duplicate
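The “Big Math” case above (compute-intensive work such as Monte Carlo run in parallel) maps naturally onto map and reduce steps: each map task runs an independent batch of trials, and one reduce step combines the counts. The sketch below is in Python rather than R, runs in a single process, and does not use Hadoop or the RMR API; `simulate_batch`, `combine` and the seeds are illustrative assumptions.

```python
import random
from functools import reduce


def simulate_batch(seed, trials):
    """Map step: run one independent batch of trials and return (hits, trials).

    Each batch estimates pi by sampling points in the unit square and
    counting how many fall inside the quarter circle.
    """
    rng = random.Random(seed)  # separate generator per task
    hits = 0
    for _ in range(trials):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits, trials


def combine(left, right):
    """Reduce step: add partial (hits, trials) pairs."""
    return left[0] + right[0], left[1] + right[1]


# Eight hypothetical map tasks, each with its own seed.
partials = [simulate_batch(seed, 20_000) for seed in range(8)]
total_hits, total_trials = reduce(combine, partials)
pi_estimate = 4.0 * total_hits / total_trials
print(f"pi is approximately {pi_estimate:.3f} from {total_trials} trials")
```

Because the batches share no state, they could run on separate nodes; only the small (hits, trials) pairs need to be brought together at the end.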
R MapReduce: Create Mappers and Reducers Using R
How:
• Build R Code Using Revolution R Enterprise
• Use Open Source Algorithms From the CRAN Project
• Leverage HDFS and MapReduce Directly
• Deploy R Mappers & Reducers in Hadoop
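The mapper-and-reducer division described above follows the general MapReduce shape: a map function emits key/value pairs, the framework groups values by key, and a reduce function summarizes each group. The following is a minimal in-memory sketch in Python, not R and not the RMR API; the record layout and the names `map_record`, `reduce_group` and `run_job` are hypothetical.

```python
from collections import defaultdict


def map_record(record):
    """Map step: emit (key, value) pairs for one input record."""
    customer, amount = record
    yield customer, amount


def reduce_group(key, values):
    """Reduce step: summarize all values that share a key."""
    return key, sum(values)


def run_job(records, mapper, reducer):
    """Run map, group-by-key (the shuffle) and reduce in one process."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


# Hypothetical transactions: (customer id, amount).
transactions = [("a", 10.0), ("b", 5.0), ("a", 2.5), ("c", 1.0), ("b", 5.0)]
totals = run_job(transactions, map_record, reduce_group)
print(totals)
```

On a cluster the grouping step is performed by the framework between the map and reduce phases, so the analyst supplies only the two functions.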
[Diagram: R code and R packages from Revolution R Enterprise (RRE) and CRAN deployed as R MapReduce (RMR) jobs in Hadoop alongside other MapReduce jobs, over HDFS, HBase and Hive; Data Warehouse and Other Data Sources shown as additional data.]
Mappers & Reducers: 100% R. 100% Hadoop.
For Hadoop Users:
• Integrates R with Hadoop via Hadoop Streaming
• Creates MapReduce Jobs Compatible with JobTracker
• No Need to Recode Models
• No Latency to Move Data
For R Programmers:
• No Need for Java Programming
• Serializes & Deserializes Data Between HDFS and R
• Handles Standard HDFS Read & Write Transparently
• Provides Explicit Access to HDFS, HBase and Hive via Packages
• Access to CRAN Algorithm Library
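Hadoop Streaming, mentioned above, runs any executable as a mapper or reducer by passing records as lines of text on standard input and output, by default with a tab separating key from value, and it delivers the reducer's input sorted by key. The sketch below mimics that contract in Python on in-memory lines; it is not the serialization layer the slide describes, and the CSV layout and function names are assumptions for illustration.

```python
from itertools import groupby


def mapper(lines):
    """Turn 'card_id,amount' text lines into 'key<TAB>value' lines."""
    for line in lines:
        card_id, amount = line.strip().split(",")
        yield f"{card_id}\t{amount}"


def reducer(sorted_lines):
    """Sum amounts per key; expects input already sorted by key."""
    pairs = (line.split("\t") for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(float(value) for _, value in group)
        yield f"{key}\t{total:.2f}"


raw_lines = ["c1,10.00", "c2,4.50", "c1,2.25"]
mapped = list(mapper(raw_lines))
shuffled = sorted(mapped)  # stands in for Hadoop's sort between phases
reduced = list(reducer(shuffled))
print(reduced)
```

In a real streaming job the two functions would live in separate scripts that read from standard input and print to standard output.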
[Diagram: R code in Revolution R Enterprise running as a mapper or reducer under Hadoop Streaming, with data serialization and deserialization, high-speed connectors to HDFS, HBase and Hive, and access to CRAN.]
Leveraging R with Hadoop
With R “Inside” Hadoop…
In-Place ETL
• Data Transformation in R
• Enrichment and Correlation Using Other Data In Hadoop
Simulation/Experimentation
• Execute Complex Simulations on Massively-Parallel Hadoop Clusters
Scoring
• Run Scoring Models Directly in Hadoop. No Movement Penalty
How?
• Write Mappers & Reducers in R and Deploy Using RMapReduce
• Augment Hadoop with CRAN¹ Packages
¹ Use of CRAN algorithms limited to non-graphical, parallelizable algorithms
Limitations of R MapReduce
• R Programmer Must “Think MapReduce” – Dividing Work into Cascades of Map, Reduce, Repeat.
• Algorithms Must be Designed for Parallelism, Including External Packages Used.
Fits:
• Hadoop Literate Teams or Those With Good Support
Non-Fits:
• Analytics Teams Tinkering with Hadoop on Short Timeframes.
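“Designed for parallelism” in practice means a statistic has to be built from partial results that combine correctly. A mean is a small example: averaging per-partition means is wrong when partitions differ in size, whereas carrying (sum, count) pairs is exact. A Python sketch with made-up partitions (not R, and not tied to any RMR function):

```python
def partial_sum_count(partition):
    """Map step: reduce one partition to a (sum, count) pair."""
    return sum(partition), len(partition)


def merge(left, right):
    """Reduce step: (sum, count) pairs combine by simple addition."""
    return left[0] + right[0], left[1] + right[1]


# Hypothetical data split unevenly across three partitions.
partitions = [[1.0, 2.0, 3.0, 4.0], [10.0], [20.0, 30.0]]

total, count = 0.0, 0
for part in partitions:
    total, count = merge((total, count), partial_sum_count(part))
parallel_mean = total / count

# Naive approach: average the per-partition means.
mean_of_means = sum(sum(p) / len(p) for p in partitions) / len(partitions)

print(parallel_mean, mean_of_means)
```

Here the combined pairs give the true mean of 10.0, while the mean of means gives 12.5. The same check applies to any external package used inside a mapper or reducer.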
[Diagram: R MapReduce (RMR) running in Hadoop's MapReduce layer alongside other MapReduce jobs, over HDFS, HBase and Hive; Data Warehouse and Other Data Sources shown as additional data.]
More Ways to Leverage R with Hadoop: “Beside” Architectures
Inside Hadoop
In-Place ETL
• Data Transformation in R
• Enrichment and Correlation Using Other Data In Hadoop
Simulation/Experimentation
• Execute Complex Simulations on Massively-Parallel Hadoop Clusters
Scoring
• Run Scoring Models Directly in Hadoop. No Movement Penalty
How?
• Write Mappers & Reducers in R and Deploy Using RMapReduce
• Augment Hadoop with CRAN¹ Packages
“Beside” Architectures
Drivers:
• Large or Unpredictable R Workloads
• Modest Hadoop Cluster
• Shared Production Hadoop Cluster
• Hadoop Novice
• Large Numbers of R Users
• Modest Data Sets To Be Scored; Movement Penalty Isn’t Prohibitive
• Maximized Computational Scale; Access to ScaleR Parallel External Memory Algorithms (PEMAs)
Advantages:
• Makes Hadoop Easier to Administer
• Stabilizes Hadoop Resource Availability
Two Additional “Beside” Architectures
Alternatives:
• RRE “Beside” Hadoop
• RRE Both “Beside” and “Inside” Hadoop with RMR
“Beside” Usage:
• Sample into “Beside” Server or Cluster
• Analyze and Model on R Server or Cluster
• Score Data on R Server or Cluster
• Results to Hadoop for Use.
“Both” Usage - Same As Above Except:
• Move Model to Data on Hadoop
• Score Data In-Place on Hadoop
Why multiple options?
• Greatest Flexibility
• Optimize Skill Sets
• Scale Clusters Independently
• Control Concurrency and Security
• Optimize Utilization
• Same R Code Can Run in Both
• Balance Ease of Use/Development and Resulting Performance & Scale
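The “Both” pattern above (model on a sample beside Hadoop, then move the model to the data) can be sketched as: fit on a small sample, serialize only the fitted parameters, and apply them record by record where the data lives. This Python sketch uses a one-variable least-squares fit and JSON purely for illustration; it is not how RRE exports models, and all names and data are hypothetical.

```python
import json


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return {"intercept": mean_y - slope * mean_x, "slope": slope}


def score_record(model, x):
    """Map-style scoring: needs only the small model, not the training data."""
    return model["intercept"] + model["slope"] * x


# "Beside": fit on a small sample pulled out of the cluster.
sample_x, sample_y = [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]
model_text = json.dumps(fit_line(sample_x, sample_y))

# "Inside": ship the serialized model and score the full data in place.
model = json.loads(model_text)
big_data = [10.0, 20.0, 30.0]
scores = [score_record(model, x) for x in big_data]
print(scores)
```

The point of the design is that only the few bytes of `model_text` cross between the two environments, while the large data set never leaves the cluster.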
RRE “Beside” Hadoop
• Separate Hadoop & R Clusters
• Connectors: HDFS, HBase & Hive
• Explore & Model Data on R server(s)
• Return Scored Data to HDFS/HBase/Hive
When To Use:
• Small, Shared or Production Hadoop Cluster
• Need Parallelized Algorithms
• Heavy Random Workloads
• Extensive “Sandboxing”
• Modest Data Scoring
• Data Security Constraints.
• … while awaiting YARN…
Advantages:
• Concurrency By Separation
• Security By Separation
• Independent Scalability
• ScaleR Parallel Algorithms
[Diagram: Hadoop cluster (HDFS, HBase, Hive, MapReduce, other MapReduce jobs) connected through ConnectR (HBase, HDFS), ODBC and high-speed connectors to a Revolution R Enterprise analytics server or cluster (Linux, Windows, LSF or Azure) with CRAN packages, serving analytics apps, data manipulation and analysis, and BI & visualization; Data Warehouse and Other Data Sources also shown.]
RRE “Beside” and “Inside”
Both “Inside” and “Beside” Platforms
• Connect a Compute Cluster to Hadoop to Run R
• Move Models to Score Big Data on Hadoop
When To Use:
• Production Hadoop Cluster
• Need Parallelized Algorithms
• Heavy Random Workloads
• Extensive “Sandboxing”
• Large Data Scoring
• Data Security Constraints.
• … while awaiting YARN…
Advantages:
• Concurrency & Security
• Independent Scalability
• Big Data Scoring
• Flexibility
• Low Latency
[Diagram: Hadoop cluster (HDFS, HBase, Hive, MapReduce) running R MapReduce (RMR) with RRE and CRAN packages alongside other MapReduce jobs, connected through ConnectR (HBase, HDFS), ODBC and high-speed connectors to a Revolution R Enterprise analytics server or cluster (Linux, Windows, LSF or Azure) serving analytics apps and BI & visualization; Data Warehouse and Other Data Sources also shown.]
Typical Predictive Analytics Workflow
• Data Prep: Ingest, Format, Enrich, Filter, Aggregate, Profile
• Explore: Sample, Cluster, Visualize, Correlate, Sandboxing
• Model: Segment, Categorize, Select Features, Simulate, Predict, Validate
• Deploy: Deploy, Score, Integrate
• Improve: Measure Accuracy, Iterate
‘Beside’ and/or ‘Inside’: Dominant Usage Patterns Observed
• Use Case 1: Real-Time Scoring (Example: Fraud Prevention)
• Use Case 2: Modeling and Scoring (Example: Attribution Analysis)
• Use Case 3: Production Analytics (Example: Telematics-Assisted Underwriting)
Example 1: Card Fraud Detection
[Diagram: card fraud detection flow; step numbering not recoverable from the extraction]
• Data sources: In-House Systems (Transaction History), Weblog Data, Personal Data (Credit-worthiness, Banking), Transaction Data, Mortgage Data, Demographic Data
• Steps shown: Ingest; Filter & Transform; Correlate & Rate; Develop Risk Models; Execute Models; Filter & Score Transactions; Deliver & Integrate
• Components: Hadoop (HDFS, HBase, MapReduce, R MapReduce (RMR), other MapReduce jobs); Revolution R Enterprise on an R workstation with ConnectR (HBase, HDFS), ODBC and high-speed connectors; Authorization Systems; BI & Visualization
Example 2: Attribution Analysis “Beside” Hadoop
[Diagram: attribution analysis flow; step numbering not recoverable from the extraction]
• Data sources: In-House Systems (EDW, CRM, Datamarts), Weblog Data, Call Center Data, Marketing Service Provider Feeds (Acxiom, Experian, ExactTarget, Monitored Responses, CoreMetrics, Dotomi, DoubleClick)
• Steps shown: Ingest; Filter & Transform; Sessionize; Aggregate, Profile & Enrich; Load Analysis Environment; Develop Attribution Models; Score; Deliver to Users
• Components: Hadoop (HDFS, HBase, MapReduce, Java MapReduce jobs); Revolution R Enterprise on a Linux server or cluster server with ConnectR (HBase, HDFS), ODBC and high-speed connectors; Analytics Apps; BI & Visualization
Example 3: Telematics-Enhanced Underwriting
[Diagram: telematics underwriting flow; step numbering only partly recoverable from the extraction]
• Data sources: Policy Origination Data; Vehicle Sensor Data (Speed, Time, Acceleration, Location); Creditworthiness Data; Insured Data (Loss History, Payment History, Credit File, Demographics)
• Steps shown: Ingest; Correlate Sources; Filter, Aggregate & Profile; Load Model Environment; Develop Risk Models; Export Models; Score Large Datasets; Deliver to Underwriting & Call Response Systems
• Components: Hadoop (HDFS, HBase, MapReduce, R MapReduce (RMR), other MapReduce jobs); Revolution R Enterprise on a Linux server or cluster server with ConnectR (HBase, HDFS), ODBC and high-speed connectors; Underwriting Applications
Conclusion
• Big Data Is Hard. Hadoop is Key to Managing It. R is Key to Applying It.
• Revolution R on Hadoop Brings Data Science to Big Data
• Hadoop Brings Parallel Performance to R
• R Brings a Community with Know-How to Hadoop
• Revolution Analytics Can Deliver Convergence Today.
… and the Future of R on Hadoop is Even Brighter…