Post on 20-May-2020

© 2016 IBM Corporation

Accelerating and Scaling R Analytics

Using Spark R, In-memory Columnar Databases, and Hadoop

Dan Gouveia, Ironside

Chi Shu, Ironside

Rich Tarro, IBM

Agenda

R Analytics

– Overview

– Challenges

Scaling R

dashDB Overview

Demonstration: R with dashDB In-Database Analytics

Spark Overview

– Spark DataFrames

– SparkR

Demonstration: SparkR, using Jupyter Notebook

Demonstration: SparkR with dashDB, using RStudio

What is R?

Interpreted programming language for statistical computing and graphics

Freely available under the GNU General Public License

Widely used among statisticians and data miners for data and statistical analysis

The capabilities of R are extended through user-created packages

– R includes a core set of packages

– More than 7,801 additional packages (as of January 2016) available at the Comprehensive R Archive Network (CRAN)

R's popularity has increased substantially in recent years

Data Science Tool Adoption

R Challenges

R is single threaded

R requires data to be loaded into memory

– Objects must all fit in memory

R can only process data on a single machine

Previous approaches for scaling R for Big Data include an R API for writing MapReduce from R:

– RHIPE implementation

– RHadoop implementation
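To make the MapReduce programming model behind these approaches concrete, here is a minimal single-machine sketch in plain Python. It is not the RHIPE or RHadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample lines are made up for illustration, and a real Hadoop job would distribute each phase across a cluster.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit one (key, value) pair per word in each input line."""
    for line in records:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's list of values into one result."""
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical input: three short lines of text.
lines = ["R scales with Hadoop", "Hadoop runs MapReduce", "R runs MapReduce"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)
```

The point of the model is that the map and reduce steps only ever see one record or one key at a time, which is what lets a framework spread them over many machines.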

Introducing 2 Ways to scale R Analytics

dashDB: A fully-managed cloud data warehouse, purpose-built for analytics

Spark: An open source cluster-computing framework

IBM dashDB – Analytics Warehouse as a Service

For apps that need:

• Elastic scalability

• High availability

• Data model flexibility

• Data mobility

• Text search

• Geospatial

Available as:

• Fully managed DBaaS

• On-premises private cloud

• Hybrid architecture

BLU Acceleration

Netezza In-Database Analytics

Cloudant NoSQL Integration

In-database analytics capabilities for best performance atop a fully-managed warehouse

dashDB MPP

Fully-managed data warehouse on cloud

BLU Acceleration columnar technology + Netezza in-database analytics

BLU in-memory processing, data skipping, actionable compression, parallel vector processing, “Load & Go” administration

Netezza predictive analytic algorithms

Fully integrated RStudio & R language

Oracle compatibility

Massively Parallel Processing (MPP)

On-disk data encryption and secure connectivity

MPP for IBM dashDB

Massively Parallel Processing

– Coordination of multiple CPU cores and servers, working together to solve complex tasks & queries

– Add more servers for additional processing power!

Query takes 1 hour

Query takes 15 min

Traditional Approach

Parallelization of Cores

• For smaller data sets < 12 TB

• Generally less expensive

• Slower performance

MPP Approach

Parallelization of Cores and Servers

• For larger data sets > 4 TB

• Larger monthly budget

• Very high performance

Query is segmented into smaller tasks

Four (4) servers work together on separate tasks of the original query
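The idea of segmenting one query into smaller tasks can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not how dashDB is implemented: four threads stand in for the four servers, the `sales` values are made up, and the function names are hypothetical. Each worker computes a partial SUM and COUNT over its own slice, and a coordinator combines the partial results into the final average.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(rows, n_workers):
    """Split the rows into n_workers slices, one per (simulated) server."""
    return [rows[i::n_workers] for i in range(n_workers)]


def partial_aggregate(rows):
    """Work done by one server: a partial SUM and COUNT over its slice."""
    return sum(rows), len(rows)


def mpp_average(rows, n_workers=4):
    """Run the partial aggregates concurrently, then combine them."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_aggregate, partition(rows, n_workers)))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


# Made-up data: the values 1..100 stand in for a column of a large table.
sales = list(range(1, 101))
print(mpp_average(sales))
```

Threads are used here only to show the split-and-combine structure; no speedup is claimed for this toy example. The design point is that an average cannot be combined directly from per-server averages, so each server returns a sum and a count instead.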

IBM Netezza Advanced Analytics Built In!

k-Means Clustering

Linear Regression

Decision Tree

Geospatial

Anatomy of dashDB’s Analytic Warehouse

Execute entire custom(er) analytic programs inside of the database!

Deploy custom(er) code and execute jobs via special SQL function interfaces

BENEFIT: Bring custom-designed analytic functions and programs directly to the data!

[Diagram labels: Analytic Applications; SQLs; Database (BLU); Analytic Code & Algorithms: Canned Algorithms, Language Framework (UDX & AE); Analytic Data: Data]

dashDB – Integrated Analytics Environment with Open-Source R

ibmdbR Package

Allows you to perform the following…

Create in-database Data Frames

Query a database, creating an R data frame

Sample, merge, create contingency table, etc…

Modeling

– K-Means

– Association

– Linear Regression

– Decision Tree

ibmdbR Package - Example

Running the LM function against ~40K database records, using 12GB memory…

Demonstration: R with dashDB In-Database Analytics

What this demonstration will cover:

Walk through ibmdbR package concepts

– Brief discussion of the dataset

– Brief discussion of the tasks that will be performed

• Making a long dataset wide

• Modeling

• Scoring

– Connect RStudio to dashDB

– Create ida data frame

– Subset ida data frame (column-wise)

– Merge ida data frame

– Subset ida data frame (row-wise)

– Modeling

– Scoring

Spark – Let’s get on the same page

Apache Spark is an open source parallel processing framework that enables users to run large-scale data analytics applications across clustered computers.

It can process data from:

– Hadoop Distributed File System (HDFS)

– NoSQL databases

– Relational data stores (e.g. Apache Hive)

It can process data in-memory or on-disk.

Spark Background

Started as a research project in 2009, open source in 2010

– General purpose cluster computing system

– Generalizes MapReduce

– Batch oriented processing

– Main concept: Resilient Distributed Datasets (RDDs)

Apache incubator project in June 2013

– Apache top level project Feb 27, 2014

Current version 1.6.1

– Requires Scala 2.10.x, Maven

– Languages supported: Java, Scala, Python, R (Java 7+, Python 2.6+, R 3.1+)

– May need additional libraries for Python, e.g. numpy

Key reasons for interest in Spark

Performance

– In-memory architecture greatly reduces disk I/O

– Anywhere from 20-100x faster for common tasks

Productivity

– Concise and expressive syntax, especially compared to prior approaches

– Single programming model across a range of use cases and steps in data lifecycle

– Integrated with common programming languages – Java, Python, Scala, R

– New tools continually reduce skill barrier for access (e.g. SQL for analysts)

Leverages existing investments

– Works well within existing Hadoop ecosystem

Improves with age

– Large and growing community of contributors continuously improves the full analytics stack and extends capabilities

Spark Programming Languages

Scala **, Java, Python, R

Language   2014      2015
Scala      84%       71%
Java       38%       31%
Python     38%       58%
R          unknown   18%

Survey done by Databricks, summer 2015

** Spark written in Scala

Spark Application Architecture

A Spark application is initiated from a driver program

Spark execution modes:

– Standalone with the built-in cluster manager

– Use Mesos as the cluster manager

– Use YARN as the cluster manager

– Standalone cluster on Amazon EC2

Spark DataFrames

Distributed collection of data organized in named columns

– Conceptually equivalent to a relational table or an R/Python data frame

Supported formats and sources

– Can be created from an SQLContext

– From sources such as: JSON, Hive, JDBC, Parquet, etc.

Benefits:

– Easier manipulation interface (similar to SQL)

– Higher abstraction for possible optimization

Spark SQL

Provides for relational queries expressed in SQL, HiveQL and Scala

Seamlessly mix SQL queries with Spark programs

DataFrames provide a single interface for efficiently working with structured data including Apache Hive, Parquet and JSON files

Leverages Hive frontend and metastore

– Compatibility with Hive data, queries, and UDFs

– HiveQL limitations may apply

– Not ANSI SQL compliant

Standard connectivity through JDBC/ODBC

SparkR

SparkR is an R package that provides a light-weight front-end to use Apache Spark from R

Exposes the Spark API and allows users to interactively run jobs

Provides a distributed data frame implementation that supports operations like selection, filtering, aggregation etc. (similar to R data frames, dplyr) but on large datasets

– Conceptually equivalent to a table in a relational database or a data frame in R, but with richer optimizations under the hood

– DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing local R data frames

SparkR also supports distributed machine learning using MLlib

Running SQL Queries from SparkR

A SparkR DataFrame can also be registered as a temporary table in Spark SQL

Registering a DataFrame as a table allows you to run SQL queries over its data

The sql function enables applications to run SQL queries programmatically and returns the result as a DataFrame.
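The register-then-query pattern described on this slide can be tried without a Spark cluster. The sketch below is an analogy only: it uses Python's built-in `sqlite3` module as a stand-in for Spark SQL, so it is not SparkR code and says nothing about SparkR's own function signatures. The `flights` table, its columns, and its rows are made up for illustration.

```python
import sqlite3

# Hypothetical rows standing in for the contents of a DataFrame.
flights = [("BOS", 12), ("BOS", 30), ("JFK", 5), ("JFK", 15), ("ORD", 45)]

conn = sqlite3.connect(":memory:")

# "Register" the data under a table name so SQL statements can refer to it.
conn.execute("CREATE TABLE flights (origin TEXT, delay INTEGER)")
conn.executemany("INSERT INTO flights VALUES (?, ?)", flights)

# Run a SQL query programmatically; the result comes back as tabular rows.
query = """
    SELECT origin, AVG(delay) AS avg_delay
    FROM flights
    GROUP BY origin
    ORDER BY origin
"""
result = conn.execute(query).fetchall()
print(result)

conn.close()
```

In Spark the same two steps apply to a distributed DataFrame, and the query result is itself a DataFrame that can be processed further rather than a local list of rows.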

RStudio

RStudio is a free and open-source integrated development environment (IDE) for R

Web-Based Notebooks

Notebooks:

“interactive computational environment, in which you can combine code execution, rich text, mathematics, plots and rich media”

Zeppelin

– Apache incubator project

– Supports multiple interpreters

• Scala (with Apache Spark), Python (with Apache Spark), SparkSQL, Hive, Markdown and Shell

Jupyter

– Based on IPython

– Supports multiple interpreters

• Python, Scala, R

Demonstration: SparkR, using a Jupyter Notebook

What this demonstration will cover:

Explore the concept of a notebook

Walk through high-level SparkR concepts

– Data access (file)

– Aggregated Calculations

– Transformations

– Enriching data (adding weather)

– Visualizing data (ggplot.SparkR)

Demonstration: SparkR with dashDB, using RStudio

What this demonstration will cover:

Walk through high-level concepts

– Connecting R to Spark

– Data access (dashDB)

– Aggregated Calculations

– Enriching data (adding weather via API)

– Visualizing data (ggplot)

Wrap-up

Summary

Key Concepts Covered:

1. R is a popular and powerful data science tool

2. R has limitations

– Single threaded, in-memory solution

– Not scalable on its own

3. IBM’s dashDB provides a means of scaling R

– Can use R to develop in-database analytics applications

– Leverage the powerful MPP capabilities

4. Spark (SparkR) provides another means of scaling R

– Can use R to develop within the Spark framework

– Take advantage of clustered computing power

– Parallels between R data frame and Spark DataFrame

Next Big Data Developers Meetup

Building a Recommendation Engine with Spark MLlib

Tuesday, June 6, 2016 @ 6 PM

IBM Client Center, 1 Rogers St., Cambridge, MA

Backup
