The Modern Data Architecture for Predictive Analytics with Hortonworks and Revolution Analytics
© Hortonworks Inc. 2013
Modern Data Architecture… for Predictive Analytics
David Smith, VP Marketing and Community, Revolution Analytics
John Kreisa, VP Strategic Marketing, Hortonworks
Your Presenters
• David Smith (@revodavid)
  – VP Marketing and Community at Revolution Analytics
  – Data Scientist, Blogger and co-author of An Introduction to R
• John Kreisa (@marked_man)
  – VP Strategic Marketing, Hortonworks
  – Over 20 years in data management as a developer and a marketer
  – Avid camper
Today’s Topics
• Introduction
• Drivers for the Modern Data Architecture (MDA)
• Apache Hadoop in the MDA
• R’s role in the MDA
• Q&A
Poll #1: What stage are you at in looking into Hadoop?
• Research
• Evaluation
• Trial
• Haven’t started research
Existing Data Architecture
[Diagram: three layers.
 APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
 DATA SYSTEM: Repositories (RDBMS, EDW, MPP); Operational Tools (Manage & Monitor); Dev & Data Tools (Build & Test)
 SOURCES: Existing Sources (CRM, ERP, Clickstream, Logs)]
Existing Data Architecture
[Diagram: three layers.
 APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
 DATA SYSTEM: Repositories (RDBMS, EDW, MPP)
 SOURCES: Existing Sources (CRM, ERP, Clickstream, Logs)]
Source: IDC
2.8 ZB in 2012
85% from New Data Types
15x Machine Data by 2020
40 ZB by 2020
Modern Data Architecture Enabled
[Diagram: three layers.
 APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
 DATA SYSTEM: Repositories (RDBMS, EDW, MPP); Operational Tools (Manage & Monitor); Dev & Data Tools (Build & Test)
 SOURCES: Existing Sources (CRM, ERP, Clickstream, Logs); Emerging Sources (Sensor, Sentiment, Geo, Unstructured)]
Hadoop Powers Modern Data Architecture
Apache Hadoop is an open source project governed by the Apache Software Foundation (ASF) that allows you to gain insight from massive amounts of structured and unstructured data quickly and without significant investment.
[Diagram: Hadoop Cluster made up of many nodes, each providing compute & storage]
Hadoop clusters provide scale-out storage and distributed data processing on commodity hardware
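As a rough illustration of the scale-out idea on this slide, the following Python sketch splits records across a few simulated nodes, lets each node count words in its own share, and then merges the partial counts. It is not Hadoop code: the node count, the sample log lines, and every function name here are assumptions made for this example, and everything runs in a single process.

```python
from collections import Counter
from functools import reduce


def split_into_blocks(records, num_nodes):
    """Deal records round-robin to simulated nodes (a stand-in for block placement)."""
    return [records[i::num_nodes] for i in range(num_nodes)]


def count_words_on_node(block):
    """Local step: one node counts the words in the records it holds."""
    counts = Counter()
    for line in block:
        counts.update(line.split())
    return counts


def merge_counts(partials):
    """Combine step: add up the per-node partial counts."""
    return reduce(lambda a, b: a + b, partials, Counter())


# Hypothetical sample data, invented for this sketch.
log_lines = ["GET /home", "GET /cart", "POST /cart", "GET /home"]

blocks = split_into_blocks(log_lines, num_nodes=2)
partials = [count_words_on_node(block) for block in blocks]
totals = merge_counts(partials)
```

Because each node only touches its own block and the merge is a simple sum, adding nodes spreads both the storage and the counting work.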
Drivers for Hadoop Adoption

Driving Efficiency: Modern Data Architecture
Hadoop has a central role in next-generation data architectures while integrating with existing data systems

Driving Opportunity: Business Applications
Use Hadoop to extract insights that enable new customer value and competitive edge

Big Data Sets: Existing, Traditional, Server log, Clickstream, Emerging, Sentiment/Social, Machine/Sensor, Geo-locations
Opportunity in types of data
1. Sentiment: Understand how your customers feel about your brand and products, right now
2. Clickstream: Capture and analyze website visitors’ data trails and optimize your website
3. Sensor/Machine: Discover patterns in data streaming automatically from remote sensors and machines
4. Geographic: Analyze location-based data to manage operations where they occur
5. Server Logs: Research logs to diagnose process failures and prevent security breaches
6. Unstructured (text, video, pictures, etc.): Understand patterns in files across millions of web pages, emails, and documents
Efficiency in the Modern Data Architecture
[Diagram: three layers.
 APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
 DATA SYSTEM: Repositories (RDBMS, EDW, MPP)
 SOURCES: Existing Sources (CRM, ERP, Clickstream, Logs); Emerging Sources (Sensor, Sentiment, Geo, Unstructured)]
• Drive efficiency via modern data architecture
• Store data once and access it in many ways
• Often referred to as a data lake or data repository
• Infrastructure platform driven
• IT-oriented, TCO based
Engineered for Interoperability
[Diagram: layers APPLICATIONS, DATA SYSTEM, SOURCES and INFRASTRUCTURE. Components shown: BusinessObjects BI, HANA, RDBMS, EDW, MPP, OPERATIONAL TOOLS, DEV & DATA TOOLS, Existing Sources (CRM, ERP, Clickstream, Logs), Emerging Sources (Sensor, Sentiment, Geo, Unstructured)]
Requirements for Hadoop Adoption

3 Requirements for Hadoop’s Role in the Modern Data Architecture
• Integrated: Interoperable with existing data center investments
• Skills: Leverage your existing skills in development, operations, and analytics
• Key Services: Platform, operational and data services essential for the enterprise
Revolution R Enterprise Architecture
[Diagram: three layers.
 APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
 DATA SYSTEM: Repositories (RDBMS, EDW, MPP); Operational Tools (Manage & Monitor); Dev & Data Tools (Build & Test)
 SOURCES: Existing Sources (CRM, ERP, Clickstream, Logs); Emerging Sources (Sensor, Sentiment, Geo, Unstructured)
 Legend: marked components = Revolution R Enterprise]
Today’s Topics
• Introduction
• Drivers for the Modern Data Architecture (MDA)
• Apache Hadoop’s role in the MDA
• R’s role in the MDA
• Q&A
Poll #2: Which of the following best describes your use of R and Hadoop?
• We have R + Hadoop in production
• We are testing R + Hadoop
• We have started to investigate but nothing is implemented
• No current plans
Revolution Confidential
What is the Open Source R Project?
• The R Language: Object-Oriented Language for Stats, Math and Data Science
  – Comprehensive data visualization and statistical modeling capabilities
• The R Community: 2M+ Users with the Skill to Tackle Big Data Statistical and Numerical Analysis and Machine Learning Projects
  – New graduates with data skills learn R
• The R Ecosystem: 5000+ Freely Available Algorithms in CRAN
  – Specialized methods for finance, economics, genomics, linguistics, and every data-driven domain
R is open source and drives analytic innovation but has some limitations for Enterprises
• Bigger data sizes: Memory Bound → Big Data
• Speed of analysis: Single Threaded → Scale out, parallel processing, high speed
• Production support: Community Support → Commercial production support
• Innovation and scale: Innovative, 5000+ packages, Exponential growth → Combines with open source R packages where needed
Revolution R Enterprise
Revolution R Enterprise is the only commercial big data analytics platform based on the open source R statistical computing language
• Enterprise-Ready
• Cross-Platform
• Big Data Analytics
• High Performance Analytics
• Easier Build & Deploy
Modern Data Architecture: Extract and Analyze
• Ad-hoc Data Distillation
• Exploratory Data Analysis / Data Visualization
• Model Development
[Diagram: Hadoop data flow, with these labels.
 SOURCE DATA: Sensor Logs, Clickstream, Flat Files, Unstructured, Sentiment, Customer, Inventory; DBs, JMS Queues, Files
 LOAD: SQOOP, FLUME, WebHDFS, NFS
 DATA REFINEMENT: HIVE, PIG, CUSTOM
 STRUCTURE: HCATALOG (metadata services)
 INTERACTIVE: HIVE Server2
 ANALYTICAL: Analytical Tools, rHadoop
 Query/Visualization/Reporting/Analytical Tools and Apps
 Data Sources: CSV, DATABASES (LOAD: SQOOP/Hive, WebHDFS)
 Other labels: AMBARI, MAPREDUCE, YARN, HDFS, REST, HTTP, STREAM]
The Data Scientist’s Big Data Toolkit
• Statistical Tests
• Machine Learning
• Simulation
• Descriptive Statistics
• Data Visualization
• R Data Step
• Predictive Models
• Sampling
Parallel External-Memory Algorithms
[Diagram: SMP SERVER with multiple CPUs]
Parallel External-Memory Algorithms
[Diagram: HADOOP CLUSTER with multiple HADOOP NODEs]
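The slides name parallel external-memory algorithms without showing one, so here is a minimal sketch of the general pattern, assuming a simple one-predictor linear regression. Each chunk of rows is reduced to a handful of sums, the sums are merged, and the model is fitted from the merged totals, so the full data set never has to sit in memory at once. This is written in Python purely for illustration and is not Revolution R Enterprise code; the `Stats` class, the function names, the chunk size, and the sample data are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Stats:
    """Sufficient statistics for fitting y = slope * x + intercept."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0

    def merge(self, other):
        """Combine two partial results; order of merging does not matter."""
        return Stats(self.n + other.n, self.sx + other.sx, self.sy + other.sy,
                     self.sxx + other.sxx, self.sxy + other.sxy)


def chunk_stats(chunk):
    """Work done per chunk (on one CPU or one node): accumulate the sums."""
    s = Stats()
    for x, y in chunk:
        s.n += 1
        s.sx += x
        s.sy += y
        s.sxx += x * x
        s.sxy += x * y
    return s


def read_chunks(rows, chunk_size):
    """Stand-in for reading a large file block by block."""
    for start in range(0, len(rows), chunk_size):
        yield rows[start:start + chunk_size]


def fit_line(stats):
    """Least-squares slope and intercept computed from the merged sums."""
    denom = stats.n * stats.sxx - stats.sx ** 2
    slope = (stats.n * stats.sxy - stats.sx * stats.sy) / denom
    intercept = (stats.sy - slope * stats.sx) / stats.n
    return slope, intercept


# Hypothetical data lying exactly on y = 2x + 1.
rows = [(float(x), 2.0 * x + 1.0) for x in range(10)]

total = Stats()
for chunk in read_chunks(rows, chunk_size=3):
    total = total.merge(chunk_stats(chunk))

slope, intercept = fit_line(total)
```

Since `merge` is just addition, the per-chunk step could run on separate CPUs of an SMP server or on separate Hadoop nodes, with only the small `Stats` objects moving between them.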
Modern Data Architecture with RRE7: In-Hadoop Predictive Analytics
• Production Data Distillation (e.g. Semantic Analysis)
• Production Model Processing / Re-Estimation
• Production Model Scoring
[Diagram: Hadoop data flow, with these labels.
 SOURCE DATA: Sensor Logs, Clickstream, Flat Files, Unstructured, Sentiment, Customer, Inventory; DBs, JMS Queues, Files
 LOAD: SQOOP, FLUME, WebHDFS, NFS
 DATA REFINEMENT: HIVE, PIG, CUSTOM; DISTILLED DATA FILES
 STRUCTURE: HCATALOG (metadata services)
 INTERACTIVE: HIVE Server2
 ANALYTICAL: Analytical Tools, Revolution R Enterprise
 Query/Visualization/Reporting/Analytical Tools and Apps
 Data Sources: CSV, DATABASES (LOAD: SQOOP/Hive, WebHDFS)
 Other labels: AMBARI, MAPREDUCE, YARN, HDFS, REST, HTTP, STREAM]
Hadoop As An R Engine
• Use Revolution R Enterprise PEMAs in Hadoop
  – No need to change existing R code
• Simple R programming
  – No need to “Think In MapReduce”
• Eliminate data movement to slash latencies
• Use Hadoop nodes as parallel R computation engines
Requirements for Hadoop Adoption

3 Requirements for Hadoop’s Role in the Modern Data Architecture
• Integrated: Interoperable with existing data center investments
• Skills: Leverage your existing skills in development, operations, and analytics
• Key Services: Platform, operational and data services essential for the enterprise
Poll #3: Which of the following would you most like to accomplish with R + Hadoop?
• Build a model to be put in production in Hadoop
• Build a model to be put in production elsewhere
• Create new data from Hadoop to supplement an existing analytics process
• Something else
Next Steps:
• More about Revolution Analytics and Hadoop: http://www.revolutionanalytics.com/products/r-for-hadoop.php
• Get started on Hadoop with Hortonworks Sandbox: http://hortonworks.com/sandbox
• Follow us: @hortonworks, @RevolutionR