Predictive Analytics and Machine Learning…with SAS and Apache Hadoop

Upload: hortonworks

Post on 27-Aug-2014

DESCRIPTION

In this interactive webinar, we'll walk through use cases showing how you can use advanced analytics such as SAS Visual Statistics and SAS In-Memory Statistics for Hadoop with the Hortonworks Data Platform (HDP) to reveal insights in your big data and redefine how your organization solves complex problems.

TRANSCRIPT

Page 1: Predictive Analytics and Machine Learning…with SAS and Apache Hadoop

Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Predictive Analytics and Machine Learning …with SAS and Apache Hadoop

Spring 2014 Version 1.5

We do Hadoop.

Page 2: Predictive Analytics and Machine Learning…with SAS and Apache Hadoop

Your speakers…

Ofer Mendelevitch, Director of Data Science, Hortonworks

Wayne Thompson, Chief Data Scientist, SAS

Page 3: Predictive Analytics and Machine Learning…with SAS and Apache Hadoop

A data architecture under pressure from new data

APPLICATIONS
• Business Analytics
• Custom Applications
• Packaged Applications

DATA SYSTEM
• Repositories: RDBMS, EDW, MPP

SOURCES
• Existing Sources (CRM, ERP, Clickstream, Logs)
• OLTP, ERP, CRM Systems
• Unstructured documents, emails
• Clickstream
• Server logs
• Sentiment, Web Data
• Sensor, Machine Data
• Geo-location

Source: IDC
• 2.8 ZB in 2012
• 85% from New Data Types
• 15x Machine Data by 2020
• 40 ZB by 2020

Page 4: Predictive Analytics and Machine Learning…with SAS and Apache Hadoop

Hadoop within an emerging Modern Data Architecture

OPERATIONS TOOLS: Provision, Manage & Monitor

DEV & DATA TOOLS: Build & Test

APPLICATIONS
• Business Analytics
• Custom Applications
• Packaged Applications

DATA SYSTEM
• Repositories: RDBMS, EDW, MPP
• Governance & Integration
• Security
• Operations
• Data Access
• Data Management

SOURCES
• OLTP, ERP, CRM Systems
• Documents, Emails
• Web Logs, Click Streams
• Social Networks
• Machine Generated
• Sensor Data
• Geolocation Data

Data Lake: An architectural shift in the data center that uses Hadoop to deliver deeper insight across a large, broad, diverse set of data at efficient scale

Page 5: Predictive Analytics and Machine Learning…with SAS and Apache Hadoop

Hadoop unlocks a new approach: Iterative Analytics

Current Reality: SQL
• Single Query Engine
• Apply schema on write
• Dependent on IT
• Repeatable Linear Process: Determine list of questions; Design solutions; Collect structured data; Ask questions from list; Detect additional questions

Augment w/ Hadoop
• Multiple Query Engines: Batch, Interactive, Real-time, Streaming
• Apply schema on read
• Support range of access patterns to data stored in HDFS: polymorphic access
• Iterative Process: Explore, Transform, Analyze

Page 6: Predictive Analytics and Machine Learning…with SAS and Apache Hadoop

Apache Hadoop for Data Science

• Hadoop’s schema on read reduces cycle times
• Hadoop is ideal for pre-processing of raw data
• Improved models with larger datasets

Page 7: Predictive Analytics and Machine Learning…with SAS and Apache Hadoop

Hadoop’s schema-on-read accelerates innovation

“Schema change” project (Start, 6 months, 9 months)
• I need new data
• Finally, we start collecting
• Let me see… is it any good?

3 months
• Let’s just put it in a folder on HDFS
• Let me see… is it any good?
• My model is awesome!
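A minimal sketch of the schema-on-read idea from this slide, assuming raw events are stored as JSON lines. The field names, the `read_with_schema` helper, and the sample records are hypothetical and not part of the webinar; a real deployment would read from HDFS through Hive, Pig, or a similar engine rather than from a Python list.

```python
import json
from typing import Dict, Iterator, List

# Raw events as they might land in a folder: no table definition up front.
# The field names here are hypothetical.
raw_lines = [
    '{"user": "U101", "page": "/home", "ms": "120"}',
    '{"user": "U102", "page": "/cart", "ms": "340", "referrer": "ad"}',
    'not json at all',
    '{"user": "U101", "page": "/cart"}',
]


def read_with_schema(lines: List[str], schema: Dict[str, type]) -> Iterator[dict]:
    """Apply a schema at read time: parse, project, and cast each record."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed rows are skipped at read time, not rejected at load time
        if not isinstance(record, dict):
            continue
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            row[field] = cast(value) if value is not None else None
        yield row


# Two different schemas applied to the same raw data, with no reload in between.
latency_view = list(read_with_schema(raw_lines, {"user": str, "ms": int}))
referrer_view = list(read_with_schema(raw_lines, {"user": str, "referrer": str}))
```

The point of the sketch is that a new question (here, the referrer view) only needs a new read-time schema, not a change to how the data was stored.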

Page 8: Predictive Analytics and Machine Learning…with SAS and Apache Hadoop

Hadoop ideal for large scale pre-processing

Raw Data → Feature Matrix
• Join
• Normalize
• OCR
• Sample
• Aggregate
• NLP
• Transform
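As an illustration of three of the steps listed on this slide (aggregate, join, normalize), here is a small single-machine sketch that turns raw records into a feature matrix. The customer and transaction data, the feature choice, and the `min_max` helper are all made up for the example; on a cluster the same steps would typically run in Pig, Hive, or MapReduce.

```python
from collections import defaultdict

# Hypothetical raw inputs: a customer table and a transaction log.
customers = {"U101": {"age": 45}, "U102": {"age": 25}, "U103": {"age": 65}}
transactions = [("U101", 20.0), ("U102", 5.0), ("U101", 40.0), ("U103", 15.0)]

# Aggregate: total spend and purchase count per customer.
spend = defaultdict(float)
count = defaultdict(int)
for user, amount in transactions:
    spend[user] += amount
    count[user] += 1

# Join: one row per customer with raw features [age, total_spend, n_purchases].
users = sorted(customers)
rows = [[customers[u]["age"], spend[u], count[u]] for u in users]


def min_max(column):
    """Normalize: rescale a column to [0, 1]; a constant column maps to 0."""
    lo, hi = min(column), max(column)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in column]


# Normalize column by column, then transpose back to one row per customer.
columns = [min_max(list(col)) for col in zip(*rows)]
feature_matrix = [list(r) for r in zip(*columns)]
```

Each row of `feature_matrix` lines up with the corresponding entry of `users`, which is the shape most modeling tools expect as input.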

Page 9: Predictive Analytics and Machine Learning…with SAS and Apache Hadoop

Why big data science? Larger datasets → better outcomes

Banko & Brill, 2001
• More examples
• More features

Page 10: Predictive Analytics and Machine Learning…with SAS and Apache Hadoop

A (partial) map of data science “tasks”

Discovery

Clustering Detect natural groupings

Outlier detection Detect anomalies

Association rule mining Co-occurrence patterns

Prediction

Classification Predict a category

Regression Predict a value

Recommendation Predict a preference

Big Data Science: High energy physics, Genomics, etc.

Page 11: Predictive Analytics and Machine Learning…with SAS and Apache Hadoop

Typical iterative flow in data science

Visualize, Explore

Hypothesize; Model

Measure/Evaluate

Acquire Data

Clean Data

Deploy & Monitor

Page 12: Predictive Analytics and Machine Learning…with SAS and Apache Hadoop

SAS in-memory and Visual Statistics

HDP 2.1 Hortonworks Data Platform

GOVERNANCE & INTEGRATION
• Data Workflow, Lifecycle & Governance: Falcon, Sqoop, Flume, NFS, WebHDFS

DATA ACCESS
• Batch: Map Reduce
• Script: Pig
• SQL: Hive/Tez, HCatalog
• NoSQL: HBase, Accumulo
• Stream: Storm
• Search: Solr
• Others: In-Memory Analytics, ISV engines

DATA MANAGEMENT
• YARN: Data Operating System
• HDFS (Hadoop Distributed File System)

SECURITY
• Authentication, Authorization, Accounting, Data Protection
• Storage: HDFS; Resources: YARN; Access: Hive, …; Pipeline: Falcon; Cluster: Knox

OPERATIONS
• Provision, Manage & Monitor: Ambari, Zookeeper
• Scheduling: Oozie

Deployment Choice: Linux, Windows, On-Premise, Cloud

SAS® Visual Statistics

SAS® In-Memory Statistics for Hadoop

• Provide powerful advanced analytics integrated directly on HDP

Page 13: Predictive Analytics and Machine Learning…with SAS and Apache Hadoop

Copyright © 2012, SAS Institute Inc. All rights reserved.

BIG ANALYTICS + HORTONWORKS DATA PLATFORM (HDP) = BIG OPPORTUNITIES

Page 14: Predictive Analytics and Machine Learning…with SAS and Apache Hadoop

WHAT IS IT?

Provides a single interactive analytical platform on Hadoop to perform
• analytical data preparation
• variable transformations
• exploratory analysis
• statistical modeling and machine learning
• integrated modeling comparison and scoring

Takes advantage of distributed in-memory computing optimized for analytical workloads

SAS® IN-MEMORY ANALYTICS: PREPARE DATA, EXPLORE DATA, DEVELOP MODELS, SCORE, TEXT

Governance & Integration, Security, Operations, Data Access, Data Management

Page 15: Predictive Analytics and Machine Learning…with SAS and Apache Hadoop

SAS® IN-MEMORY ANALYTICS

INTEGRATED USER EXPERIENCE

Data Preparation, Exploration/Visualization, Modeling, Deployment

• Users: STATISTICIAN, DATA SCIENTIST / PROGRAMMER
• Products: SAS® Visual Statistics, SAS® In-Memory Statistics for Hadoop
• Interfaces: GUI, PROGRAMMING

Page 16: Predictive Analytics and Machine Learning…with SAS and Apache Hadoop

SAS IN-MEMORY STATISTICS FOR HADOOP

ANALYTICAL LIFE CYCLE: Data Management & Exploration; Modeling; Model Evaluation & Deployment

Data Management
• Aggregate
• Compute
• Update
• Append
• Set
• Schema
• DeleteRows
• DropTables
• PurgeTempTables

Data Exploration
• Boxplot
• Corr
• Crosstab
• Distinct
• Fetch
• Frequency
• Histogram
• KDE
• MDSummary
• Percentile
• Summary
• TopK

Descriptive Modeling
• Association
• Path Analysis
• Clustering (k-means)
• Clustering (DBSCAN)

Predictive Modeling
• Decision Tree
• Forecast
• Gen Linear Model
• Linear Regression
• Logistic Regression
• Random Forests

Recommendation Systems
• Association
• Clustering
• kNN
• SVD
• Ensemble

Text Analytics
• Parsing
• SVD
• Topic generation
• Document projection

Evaluation, Deployment
• Assess: Misclassification matrix, Lift, ROC, Concordance
• Score
• Training / Validation

Utilities
• Where
• GroupBy
• TableInfo, ColumnInfo, ServerInfo
• Partition, Balance
• Store, Replay, Free
• Table, Promote

HDFS I/O
• Sasiola
• Sashdat
• Anyfile Reader

Page 17: Predictive Analytics and Machine Learning…with SAS and Apache Hadoop

SAS ON HADOOP: IN-MEMORY, CLIENT-SERVER, WEB-BASED

• Web Clients
• Edge Node: SAS® In-Memory Analytic Products (SAS® Visual Analytics, SAS® Visual Statistics, SAS® In-Memory Statistics)
• Memory: SAS® LASR™ Analytic Server (Head node, Data Nodes)
• Hortonworks Data Platform: Data

Page 19: Predictive Analytics and Machine Learning…with SAS and Apache Hadoop

SAS ON HADOOP: IN-MEMORY, CLIENT-SERVER, WEB-BASED

• Web Clients
• Edge Node: SAS® In-Memory Analytic Products (SAS® Visual Analytics, SAS® Visual Statistics, SAS® In-Memory Statistics)
• Memory: SAS® LASR™ Analytic Server (Head node, Data Nodes)
• Hortonworks Data Platform: Data
• Diagram labels: task, result

Page 20: Predictive Analytics and Machine Learning…with SAS and Apache Hadoop

SAS ON HADOOP

SUMMARY STATISTICS

Web Clients / SAS® In-Memory Analytic Products:

proc imstat;
  table dat1;
  summary X / mean;
run;

Edge Node (task, result):
• Send request SampleMean(X) to LASR
• Waiting..
• Receive $\bar{X}$
• OUTPUT

SAS® LASR™ Analytic Server, in Memory (Head node broadcasts to Data Nodes):
A) Request $S_X = \sum_i x_i$ from data nodes (Broadcast..)
B) Data node $j$ computes $S_{X,j} = \sum_i x_{i,j}$, $j = 1, 2, 3, 4$
C) Aggregate $\bar{X} = \sum_j S_{X,j} / N$
D) Send $\bar{X}$ back to Edge
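Steps A through D can be sketched on a single machine as follows. This is an illustration of the partial-sum pattern only, not SAS LASR code: the function names, the four hypothetical partitions, and the choice to have each node return its row count alongside its sum (so the head node can form N) are assumptions made for the example.

```python
from typing import List, Tuple


def node_partial_sum(values: List[float]) -> Tuple[float, int]:
    """Step B: one data node returns the sum and the count of its local rows."""
    return sum(values), len(values)


def aggregate_mean(partials: List[Tuple[float, int]]) -> float:
    """Step C: the head node combines the partial sums into the global mean."""
    total = sum(s for s, _ in partials)
    n = sum(c for _, c in partials)
    if n == 0:
        raise ValueError("cannot compute a mean over zero rows")
    return total / n


# Hypothetical split of column X across four data nodes (j = 1, 2, 3, 4).
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0], [7.0, 8.0, 9.0, 10.0]]

# Step A: the head node asks every data node for its partial sum.
partials = [node_partial_sum(p) for p in partitions]

# Steps C and D: aggregate, then hand the result back to the caller.
x_bar = aggregate_mean(partials)
```

Only one number pair per node crosses the network, which is why the mean can be computed without moving the underlying rows.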

Page 21: Predictive Analytics and Machine Learning…with SAS and Apache Hadoop

SAS ON HADOOP

PRINCIPLES OF THE DESIGN

Web Clients
• Thin Clients
• Multi-user
• Interactive
• Real-time
• Point-and-click or programming

Edge Node / SAS® In-Memory Analytic Products (task, result)
• Receive requests from a UI or SAS program.
• Work on light computations (interactive trees)

SAS® LASR™ Analytic Server, in Memory (Head node broadcasts to Data Nodes)
• NO MAP REDUCE
• One data copy
• Concurrency
• Temporary tables or columns
• MPP or SMP

Page 22: Predictive Analytics and Machine Learning…with SAS and Apache Hadoop

Use Case #1: Recommendation systems

Why recommender systems?
• 5 – 20% increase in sales
• 60% use “recommendations” to determine suitable product
• In 2011 15% of customers admitted to buying recommended products, 2013 nearly 30%

• 36 Million subscribers
• 60-70% view results from recommendation

• Tens of Billions “Thumbs up”
• 60 Million active users
• 3.8 billion hours of music (last Qtr)
• 47% uptick in active users
• 67% increase in music served

• 25% YOY Growth
• Trip Advisor collaborates with EBAY, ORBITZ and others.

Page 23: Predictive Analytics and Machine Learning…with SAS and Apache Hadoop

Pre-processing raw data for recommendation

• Inputs:
  • Explicit product ratings (when provided)
  • Implicit information: purchase transactions, page views, comments

Raw data: Ratings, Page views, Forum Comments

Ratings matrix (titles: Epic, X-Men, Hobbit, Argo, Pirates; users: U101, U102, U103, U104, U105, …), cell values as extracted from the slide:
5 2 4 ? ? ? ? 5 2 ? 1 2 ? ? 3 ? 2 3 1 5

Page 24: Predictive Analytics and Machine Learning…with SAS and Apache Hadoop

Goal: predict a preference

Ratings matrix with unknown entries (titles: Epic, X-Men, Hobbit, Argo, Pirates; users: U101, U102, U103, U104, U105, …), cell values as extracted from the slide:
5 2 4 ? ? ? ? 5 2 ? 1 2 ? ? 3 ? 2 3 1 5

Same matrix with the unknown entries filled in:
5 2 4 1 3 4 1 5 2 3 1 2 4 1 3 3 2 3 1 5
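To make the goal concrete, here is a minimal sketch that fills unknown ratings with a simple baseline: the user's mean rating plus the item's offset from the global mean. This is not the method used in the webinar (the deck lists association, clustering, kNN, SVD, and ensembles as options), and the 3 x 3 ratings table below is hypothetical rather than the slide's matrix, whose row and column layout did not survive extraction.

```python
from typing import List, Optional

Ratings = List[List[Optional[float]]]


def fill_missing(ratings: Ratings) -> List[List[float]]:
    """Predict each missing rating as user mean + (item mean - global mean)."""
    known = [v for row in ratings for v in row if v is not None]
    global_mean = sum(known) / len(known)

    def mean_or(values, default):
        vals = [v for v in values if v is not None]
        return sum(vals) / len(vals) if vals else default

    user_means = [mean_or(row, global_mean) for row in ratings]
    item_means = [mean_or(col, global_mean) for col in zip(*ratings)]

    filled = []
    for u, row in enumerate(ratings):
        out = []
        for i, v in enumerate(row):
            if v is None:
                guess = user_means[u] + (item_means[i] - global_mean)
                v = min(5.0, max(1.0, guess))  # keep predictions on the 1-5 scale
            out.append(float(v))
        filled.append(out)
    return filled


# Hypothetical ratings (rows: users, columns: titles); None marks an unknown rating.
ratings = [
    [5, 2, None],
    [None, 1, 4],
    [4, None, 5],
]
predicted = fill_missing(ratings)
```

Known ratings pass through unchanged; only the `None` cells are replaced, mirroring how the second matrix on the slide keeps the observed values and fills the question marks.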

Page 25: Predictive Analytics and Machine Learning…with SAS and Apache Hadoop

PREDICTIVE ANALYTICS & MACHINE LEARNING: RECOMMENDATION SYSTEM DEMO

MACHINE LEARNING INTEGRATION

• DATA WRANGLING: Data Director* (Convert Json Files, Load LASR, Standardize)
• SAS In-Memory Statistics
• SAS Visual Analytics
• Deployment: Relevant, Real-time, Interactions

Figure labels:
• Users: Tony, Patty, George
• Business: Tony’s Bar, Trees Lounge, The Tropicana, Blue Parrot; Beer & Wine, Chinese Food, Mexican Food
• Topics: LOUNGE, PUB, BEER, DRINK, GAME, MUSIC, PINT, BAND, PLAY, GLASS, VODKA, PATIO, KARAOKE, COCKTAIL, WINGS, LIQUOR, ALCOHOL, BARTENDER, DRAFT, TAP, FUN, LIVE, SCENE, POOL
• REVIEWS

* New SAS Product

Page 26: Predictive Analytics and Machine Learning…with SAS and Apache Hadoop

PREDICTIVE ANALYTICS & MACHINE LEARNING: RECOMMENDATION SYSTEM DEMO

John Clark

Recommendation History

Review History
1. Oyster Bar
2. The Brick
3. Trees Lounge
4. Blue Parrot
5. Winchester Club
6. Starlight Lounge
7. Tony’s Bar
8. Lucy’s
9. The Tropicana

Recommendation: Rank 1, 2, 3, …

Page 27: Predictive Analytics and Machine Learning…with SAS and Apache Hadoop

Use Case # 2: Building a prediction model

Labeled Data

Customer ID   Age   Gender   Loyalty Card   More features…   Buys organic
11001         45    M        Yes                             Yes
11002         43    M        No                              Yes
11003         65    F        Yes                             No
…             …     …        …

Model

Unseen data (the model predicts “Buys organic”)

Customer ID   Age   Gender   Home Owner   More features…
11004         33    M        No           …
11005         25    F        No           …
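The labeled-data, model, unseen-data flow on this slide can be sketched with a one-level decision tree (a "stump"). This is a toy under loud assumptions: it uses only the three labeled rows shown on the slide, it splits on gender simply because that column appears in both tables, and `train_stump` is a hypothetical helper, not a SAS procedure. Three rows are far too few to support any real conclusion.

```python
from collections import Counter

# Labeled rows transcribed from the slide. Only age and gender are kept because
# those are the columns the labeled and unseen tables share.
labeled = [
    ({"age": 45, "gender": "M"}, "Yes"),
    ({"age": 43, "gender": "M"}, "Yes"),
    ({"age": 65, "gender": "F"}, "No"),
]
unseen = [{"age": 33, "gender": "M"}, {"age": 25, "gender": "F"}]


def majority(labels):
    """Return the most common label in a list."""
    return Counter(labels).most_common(1)[0][0]


def train_stump(rows, feature):
    """Fit a one-level tree on a categorical feature: majority label per value."""
    by_value = {}
    for features, label in rows:
        by_value.setdefault(features[feature], []).append(label)
    default = majority([label for _, label in rows])  # used for unseen values
    table = {value: majority(labels) for value, labels in by_value.items()}
    return lambda features: table.get(features[feature], default)


# Train on labeled data, then score the unseen customers.
model = train_stump(labeled, "gender")
predictions = [model(row) for row in unseen]
```

The same train-then-score shape applies to any of the predictive models named earlier in the deck; only the fitting step changes.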

Page 28: Predictive Analytics and Machine Learning…with SAS and Apache Hadoop

Demo #2: Predicting who buys organic products?

• Dataset: grocery transaction and customer data

• Goals:
  • Understand customer propensity to buy organic products
  • Develop segments using an interactive decision
  • Develop stratified models to predict organic purchases

• Why is it useful?
  • Inventory strategy
  • Store layout planning
  • Provider management

Page 29: Predictive Analytics and Machine Learning…with SAS and Apache Hadoop

PREDICTIVE ANALYTICS & MACHINE LEARNING

SAS VISUAL STATISTICS 6.4 – ORGANICS PURCHASE DEMO

Page 30: Predictive Analytics and Machine Learning…with SAS and Apache Hadoop

Wrap up: SAS and Hortonworks Data Platform

• Increase productivity for data scientists
  • Users can concurrently & interactively analyze traditional & new data sets in HDP to help businesses quickly discover and capitalize on new business insights from their data

• Increase efficiency
  • Avoid unnecessary, multiple passes through the data
  • SAS in-memory infrastructure running on top of Hadoop eliminates costly data movement and persists data in-memory for the entire analytics session

• Capture and analyze new data types
  • HDP + SAS enables data scientists to look at more of their enterprise data

• Leverage 100 percent open-source Apache Hadoop
  • SAS customers can now embrace Hadoop as a core platform in their data architecture

Page 31: Predictive Analytics and Machine Learning…with SAS and Apache Hadoop

How should you get started? Next steps…

•  Get the Data

•  Formulate a well-defined business objective

•  Data exploration: integrate and fuse heterogeneous data types

•  Pre-process: generate features from raw data

•  Manage the long-tail distribution and data imbalance

•  Modeling: remember model building is cyclical

•  Evaluate your results

•  Work with IT to move analytics from research and into operations

Page 32: Predictive Analytics and Machine Learning…with SAS and Apache Hadoop

More details..

Download the Hortonworks Sandbox

Learn Hadoop

Build Your Analytic App

Try Hadoop 2

More about SAS Software & Hortonworks http://hortonworks.com/partner/SAS/

Contact us: [email protected]