© 2016 IBM Corporation
Accelerating and Scaling R Analytics
Using Spark R, In-memory Columnar Databases,
and Hadoop
Dan Gouveia, Ironside
Chi Shu, Ironside
Rich Tarro, IBM
Agenda
R Analytics
– Overview
– Challenges
Scaling R
dashDB Overview
Demonstration: R with dashDB In-Database Analytics
Spark Overview
– Spark DataFrames
– SparkR
Demonstration: SparkR, using Jupyter Notebook
Demonstration: SparkR with dashDB, using RStudio
What is R?
Interpreted programming language for statistical computing and
graphics
Freely available under the GNU General Public License
Widely used among statisticians and data miners for data and
statistical analysis
The capabilities of R are extended through user-created packages
– R includes a core set of packages
– More than 7,801 additional packages (as of January 2016) available at the
Comprehensive R Archive Network (CRAN)
R's popularity has increased substantially in recent years
Data Science Tool Adoption
R Challenges
R is single-threaded
R requires data to be loaded into memory
– Objects must all fit in memory
R can only process data on a single machine
Previous approaches for scaling R for Big Data
RHIPE and RHadoop implementations
– Both include an R API for writing MapReduce jobs from R
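To make the MapReduce model behind these earlier approaches concrete, here is a minimal single-process sketch in Python. It is not RHIPE or RHadoop code; the function names and the sample records are hypothetical, and a real framework would run the map and reduce phases on many machines.

```python
from collections import defaultdict

def map_phase(records):
    """Emit (key, value) pairs; here the value is (measurement, count of 1)."""
    for group, value in records:
        yield group, (value, 1)

def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Combine the values for each key into a per-group mean."""
    result = {}
    for key, values in grouped.items():
        total = sum(v for v, _ in values)
        count = sum(n for _, n in values)
        result[key] = total / count
    return result

# Hypothetical sample data: (group label, measurement)
records = [("a", 1.0), ("b", 4.0), ("a", 3.0), ("b", 6.0), ("c", 5.0)]
means = reduce_phase(shuffle(map_phase(records)))
print(means)
```

Even a simple grouped mean has to be expressed as separate map, shuffle, and reduce steps, which is part of why higher-level interfaces are attractive.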
Introducing 2 Ways to Scale R Analytics
dashDB: a fully managed cloud data warehouse, purpose-built for analytics
Spark: an open source cluster-computing framework
IBM dashDB – Analytics Warehouse as a Service
For apps that need:
• Elastic scalability
• High availability
• Data model flexibility
• Data mobility
• Text search
• Geospatial
Available as:
• Fully managed DBaaS
• On-premises private cloud
• Hybrid architecture
BLU Acceleration
Netezza In-Database Analytics
Cloudant NoSQL Integration
In-database analytics capabilities for best performance atop a fully-managed warehouse
dashDB MPP
Fully-managed data warehouse on cloud
BLU Acceleration columnar technology + Netezza in-database analytics
BLU in-memory processing, data skipping, actionable compression,
parallel vector processing, “Load & Go” administration
Netezza predictive analytic algorithms
Fully integrated RStudio & R language
Oracle compatibility
Massively Parallel Processing (MPP)
On-disk data encryption and secure connectivity
MPP for IBM dashDB
Massively Parallel Processing
– Coordination of multiple CPU cores and servers, working together to solve complex tasks & queries
– Add more servers for additional processing power!
Traditional Approach: Parallelization of Cores (query takes 1 hour)
• For smaller data sets < 12 TB
• Generally less expensive
• Slower performance
MPP Approach: Parallelization of Cores and Servers (query takes 15 min)
• For larger data sets > 4 TB
• Larger monthly budget
• Very high performance
Query is segmented into smaller tasks
Four (4) servers work together on separate tasks of the original query
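The idea of segmenting one query into smaller tasks can be sketched as follows. This is an illustration only, not how dashDB distributes work: threads stand in for the four servers, and `partition`, `partial_aggregate`, and `mpp_average` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(rows, n_servers):
    """Split rows into n_servers slices (round-robin)."""
    return [rows[i::n_servers] for i in range(n_servers)]

def partial_aggregate(rows):
    """Work done by one 'server': a partial SUM and COUNT over its slice."""
    return sum(rows), len(rows)

def mpp_average(rows, n_servers=4):
    """Coordinator: fan the query out, then merge the partial results."""
    slices = partition(rows, n_servers)
    with ThreadPoolExecutor(max_workers=n_servers) as pool:
        partials = list(pool.map(partial_aggregate, slices))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

rows = list(range(1, 101))   # stand-in for a large table column
avg = mpp_average(rows)
print(avg)
```

An average cannot be merged from per-server averages directly, so each worker returns a sum and a count, and the coordinator combines them.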
IBM Netezza Advanced Analytics Built In!
k-Means Clustering
Linear Regression
Decision Tree
Geospatial
Anatomy of dashDB’s Analytic Warehouse
Execute entire custom(er) analytic programs inside of the database!
Analytic Applications send SQL to the Database (BLU), which holds:
– Analytic Code & Algorithms: Canned Algorithms and the Language Framework (UDX & AE)
– Analytic Data: Data
Deploy custom(er) code and execute jobs via special SQL function interfaces
BENEFIT: Bring custom-designed analytic functions and programs directly to the data!
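As a rough analogy for deploying custom code behind a SQL function interface, the sketch below registers a user-defined aggregate with SQLite from Python's standard library. SQLite is only a stand-in here; the dashDB UDX and AE interfaces are not shown, and the `value_range` function and `sales` table are hypothetical.

```python
import sqlite3

class ValueRange:
    """Custom aggregate: the engine calls step() per row and finalize() once per group."""
    def __init__(self):
        self.lo = None
        self.hi = None

    def step(self, value):
        if value is None:
            return
        self.lo = value if self.lo is None else min(self.lo, value)
        self.hi = value if self.hi is None else max(self.hi, value)

    def finalize(self):
        return None if self.lo is None else self.hi - self.lo

conn = sqlite3.connect(":memory:")
# Deploy the custom code: it becomes callable from SQL as value_range(x)
conn.create_aggregate("value_range", 1, ValueRange)

conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10.0), ("east", 25.0), ("west", 7.0), ("west", 9.5)],
)

# The analytic runs where the data lives; only the small result comes back
result = conn.execute(
    "SELECT region, value_range(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(result)
```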
dashDB – Integrated Analytics Environment with
Open-Source R
IBM DBR Package (ibmdbR)
Allows you to perform the following…
Create in-database Data Frames
Query a database, creating an R data frame
Sample, merge, create contingency table, etc…
Modeling
– K-Means
– Association
– Linear Regression
– Decision Tree
IBM DBR Package - Example
Running the LM function against ~40K database
records, using 12GB memory…
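One way to see why a regression can run against database records without loading them into client memory: simple linear regression needs only a handful of sums, which the database can compute. The sketch below uses SQLite as a stand-in and synthetic data; it is an assumption-laden illustration, not the algorithm the package actually uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (x REAL, y REAL)")
# Hypothetical data lying exactly on y = 2x + 1
conn.executemany(
    "INSERT INTO obs VALUES (?, ?)",
    [(float(x), 2.0 * x + 1.0) for x in range(10)],
)

# The database scans the rows; only five numbers are returned to the client
n, sx, sy, sxx, sxy = conn.execute(
    "SELECT COUNT(*), SUM(x), SUM(y), SUM(x * x), SUM(x * y) FROM obs"
).fetchone()

# Closed-form least-squares estimates from the sufficient statistics
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n
print(slope, intercept)
```

The client-side work is constant in the number of rows, so the table size is bounded by the database, not by R's memory.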
Demonstration: R with dashDB In-Database Analytics
What this demonstration will cover:
Walk through ibmdbR package concepts
– Brief discussion of the dataset
– Brief discussion of the tasks that will be performed
• Making a long dataset wide
• Modeling
• Scoring
– Connect RStudio to dashDB
– Create ida data frame
– Subset ida data frame (column-wise)
– Merge ida data frame
– Subset ida data frame (row-wise)
– Modeling
– Scoring
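The in-database data frame idea used in this demonstration can be sketched as a proxy object that accumulates SQL instead of holding rows. The `LazyTable` class below is hypothetical and uses SQLite as a stand-in; it only illustrates column-wise and row-wise subsetting, not merging, modeling, or the package's real API.

```python
import sqlite3

class LazyTable:
    """Hypothetical proxy for a database table; it builds SQL instead of holding rows."""
    def __init__(self, conn, table, columns=("*",), where=None):
        self.conn = conn
        self.table = table
        self.columns = tuple(columns)
        self.where = where

    def select(self, *columns):
        """Column-wise subset: returns a new proxy, no data is fetched."""
        return LazyTable(self.conn, self.table, columns, self.where)

    def filter(self, condition):
        """Row-wise subset: AND a trusted condition string into the WHERE clause."""
        where = condition if self.where is None else f"({self.where}) AND ({condition})"
        return LazyTable(self.conn, self.table, self.columns, where)

    def to_sql(self):
        sql = f"SELECT {', '.join(self.columns)} FROM {self.table}"
        return sql if self.where is None else f"{sql} WHERE {self.where}"

    def collect(self):
        """Only now does the query run and rows come back to the client."""
        return self.conn.execute(self.to_sql()).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (carrier TEXT, dep_delay REAL, distance REAL)")
conn.executemany(
    "INSERT INTO flights VALUES (?, ?, ?)",
    [("AA", 5.0, 300.0), ("UA", 45.0, 900.0), ("AA", 60.0, 1200.0)],
)

late = LazyTable(conn, "flights").select("carrier", "dep_delay").filter("dep_delay > 30")
sql = late.to_sql()
rows = late.collect()
print(sql)
print(rows)
```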
Spark – Let’s get on the same page
Apache Spark is an open source parallel processing
framework that enables users to run large-scale data
analytics applications across clustered computers.
It can process data from:
– Hadoop Distributed File System (HDFS)
– NoSQL databases
– Relational data stores (e.g. Apache Hive)
It can process data in-memory or on-disk.
Spark Background
Started as a research project in 2009, open sourced in 2010
– General purpose cluster computing system
– Generalizes MapReduce
– Batch oriented processing
– Main concept: Resilient Distributed Datasets (RDDs)
Apache incubator project in June 2013
– Apache top level project Feb 27, 2014
Current version 1.6.1
– Requires Scala 2.10.x, Maven
– Languages supported: Java, Scala, Python, R (Java 7+, Python 2.6+, R 3.1+)
– May need additional libraries for Python, e.g. numpy
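The "resilient" part of an RDD comes from lineage: each dataset remembers the transformations that produced it, so a lost partition can be recomputed from its source. The toy class below is a single-process sketch of that idea with hypothetical names; it is not the Spark API.

```python
class MiniRDD:
    """Toy stand-in for an RDD: source partitions plus a lineage of transformations."""
    def __init__(self, source_partitions, lineage=()):
        self.source_partitions = source_partitions  # stable input, e.g. files on HDFS
        self.lineage = tuple(lineage)               # functions applied to each partition
        self.cache = {}                             # partition index -> computed data

    def map(self, fn):
        step = lambda part: [fn(x) for x in part]
        return MiniRDD(self.source_partitions, self.lineage + (step,))

    def filter(self, pred):
        step = lambda part: [x for x in part if pred(x)]
        return MiniRDD(self.source_partitions, self.lineage + (step,))

    def compute(self, index):
        """Return one partition, replaying the lineage if it is not cached."""
        if index not in self.cache:
            data = self.source_partitions[index]
            for step in self.lineage:
                data = step(data)
            self.cache[index] = data
        return self.cache[index]

    def collect(self):
        n = len(self.source_partitions)
        return [x for i in range(n) for x in self.compute(i)]

rdd = MiniRDD([[1, 2, 3], [4, 5, 6]]).map(lambda x: x * 10).filter(lambda x: x > 20)
first = rdd.collect()
del rdd.cache[1]            # simulate losing the node that held partition 1
recovered = rdd.collect()   # partition 1 is rebuilt from its lineage
print(first, recovered)
```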
Key reasons for interest in Spark
Performance
– In-memory architecture greatly reduces disk I/O
– Anywhere from 20-100x faster for common tasks
Productivity
– Concise and expressive syntax, especially compared to prior approaches
– Single programming model across a range of use cases and steps in data lifecycle
– Integrated with common programming languages: Java, Python, Scala, R
– New tools continually reduce skill barrier for access (e.g. SQL for analysts)
Leverages existing investments
– Works well within existing Hadoop ecosystem
Improves with age
– Large and growing community of contributors continuously improves the full analytics stack and extends capabilities
Spark Programming Languages
Scala **
Java
Python
R
Language 2014 2015
Scala 84% 71%
Java 38% 31%
Python 38% 58%
R unknown 18%
Survey done by Databricks,
summer 2015
** Spark written in Scala
Spark Application Architecture
A Spark application is initiated from a driver program
Spark execution modes:
– Standalone with the built-in cluster manager
– Use Mesos as the cluster manager
– Use YARN as the cluster manager
– Standalone cluster on Amazon EC2
Spark DataFrames
Distributed collection of data organized in named columns
– Conceptually equivalent to a relational table or an R/Python data frame
Supported formats and sources
– Can be created from an SQLContext
– From sources such as: JSON, Hive, JDBC, Parquet, etc.
Benefits:
– Easier manipulation interface (similar to SQL)
– Higher abstraction for possible optimization
Spark SQL
Provides for relational queries expressed in SQL, HiveQL, and Scala
Seamlessly mix SQL queries with Spark programs
DataFrames provide a single interface for efficiently working with
structured data including Apache Hive, Parquet and JSON files
Leverages Hive frontend and metastore
– Compatibility with Hive data, queries, and UDFs
– HiveQL limitations may apply
– Not ANSI SQL compliant
Standard connectivity through JDBC/ODBC
SparkR
SparkR is an R package that provides a light-weight front-end to use
Apache Spark from R
Exposes the Spark API and allows users to interactively run jobs
Provides a distributed data frame implementation that supports
operations like selection, filtering, aggregation etc. (similar to R data
frames, dplyr) but on large datasets
– Conceptually equivalent to a table in a relational database or a data frame in R,
but with richer optimizations under the hood
– DataFrames can be constructed from a wide array of sources such as:
structured data files, tables in Hive, external databases, or existing local R data
frames
SparkR also supports distributed machine learning using MLlib
Running SQL Queries from SparkR
A SparkR DataFrame can also be registered as a temporary table in
Spark SQL
Registering a DataFrame as a table allows you to run SQL queries
over its data
The sql function enables applications to run SQL queries
programmatically and returns the result as a DataFrame.
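The register-then-query pattern described above can be sketched outside Spark. The example below uses SQLite as a stand-in engine; `register_temp_table` and `sql` are hypothetical helpers named to mirror the concept, not the SparkR functions, and the `people` data is made up.

```python
import sqlite3

def register_temp_table(conn, name, columns, rows):
    """Expose in-memory rows to SQL under a table name (analogous to registering a DataFrame)."""
    col_defs = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f"CREATE TEMP TABLE {name} ({col_defs})")
    conn.executemany(f"INSERT INTO {name} VALUES ({placeholders})", rows)

def sql(conn, query):
    """Run a query and return the result as a list of dicts keyed by column name."""
    cursor = conn.execute(query)
    names = [d[0] for d in cursor.description]
    return [dict(zip(names, row)) for row in cursor.fetchall()]

conn = sqlite3.connect(":memory:")
people = [("Ann", 34), ("Bob", 17), ("Cy", 15)]   # hypothetical local data
register_temp_table(conn, "people", ["name", "age"], people)

teenagers = sql(
    conn,
    "SELECT name FROM people WHERE age >= 13 AND age <= 19 ORDER BY name",
)
print(teenagers)
```

The query result comes back in the same tabular shape as the input, so SQL steps and programmatic steps can be mixed freely.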
RStudio
RStudio is a free and open-source integrated development environment (IDE) for R
Web-Based Notebooks
Notebooks:
“interactive computational environment, in which you can combine code
execution, rich text, mathematics, plots and rich media”
Zeppelin
– Apache incubator project
– Supports multiple interpreters
• Scala (with Apache Spark), Python (with Apache Spark), SparkSQL, Hive, Markdown, and Shell
Jupyter
– Based on IPython
– Supports multiple interpreters
• Python, Scala, R
Demonstration: SparkR, using a Jupyter Notebook
What this demonstration will cover:
Explore the concept of a notebook
Walk through high-level SparkR concepts
– Data access (file)
– Aggregated Calculations
– Transformations
– Enriching data (adding weather)
– Visualizing data (ggplot.SparkR)
Demonstration: SparkR with dashDB, using RStudio
What this demonstration will cover:
Walk through high-level concepts
– Connecting R to Spark
– Data access (dashDB)
– Aggregated Calculations
– Enriching data (adding weather via API)
– Visualizing data (ggplot)
Wrap-up
Summary
Key Concepts Covered:
1. R is a popular and powerful data science tool
2. R has limitations
– Single-threaded, in-memory solution
– Not scalable on its own
3. IBM’s dashDB provides a means of scaling R
– Can use R to develop in-database analytics applications
– Leverage the powerful MPP capabilities
4. Spark (SparkR) provides another means of scaling R
– Can use R to develop within the Spark framework
– Take advantage of clustered computing power
– Parallels between R data frame and Spark DataFrame
Next Big Data Developers Meetup
Building a Recommendation Engine
with Spark MLlib
Tuesday, June 6, 2016 @ 6 PM
IBM Client Center, 1 Rogers St., Cambridge, MA
Backup