Data Science at Scale on MPP Databases - Use Cases & Open Source Tools

© Copyright 2016 Pivotal. All rights reserved. Esther Vasiete, Pivotal Data Scientist. Structure Data 2016. Data Science at Scale on MPP Databases – Use Cases & Open Source Tools. Joint work with Pivotal Data Science.

Upload: esther-vasiete

Post on 09-Jan-2017


TRANSCRIPT

Page 1: Data Science at Scale on MPP databases - Use Cases & Open Source Tools


Esther Vasiete Pivotal Data Scientist Structure Data 2016

Data Science at Scale on MPP Databases – Use Cases & Open Source Tools

Joint work with Pivotal Data Science

Page 2: Data Science at Scale on MPP databases - Use Cases & Open Source Tools

Agenda

•  Introduction

•  Open Source Data Science Toolkit

•  Real world applications
   –  Predictive maintenance of automobiles
   –  Predicting insurance claims
   –  Predicting customer churn

•  Data science deep-dive with Jupyter notebooks
   –  Text analytics on MPP (github.com/vatsan)
   –  Image processing on MPP (github.com/gautamsm)

Page 3: Data Science at Scale on MPP databases - Use Cases & Open Source Tools

Pivotal Data Science

Our Charter: Pivotal Data Science is Pivotal’s differentiated and highly opinionated data-centric service delivery organization (part of Pivotal Labs).

Our Goals: Expedite customer time-to-value and ROI, by driving business-aligned innovation and solutions assurance within Pivotal’s Data Fabric technologies.

Drive customer adoption and autonomy across the full spectrum of Pivotal Data technologies through best-in-class data science and data engineering services, with a deep emphasis on knowledge transfer.

[Figure: Data Science, Data Engineering, App Dev]

Page 4: Data Science at Scale on MPP databases - Use Cases & Open Source Tools


Pivotal Data Science Knowledge Development

Page 5: Data Science at Scale on MPP databases - Use Cases & Open Source Tools

Use Case: Preventive Maintenance for Connected Vehicles

•  Customer vehicles transmit Diagnostic Trouble Codes (DTCs) and vehicle status data to the Pivotal analytics environment

•  Can the DTC data be leveraged to predict the presence of potential problems in vehicles?

•  Set up a data science framework on the Pivotal analytics environment that would enable the customer data science team to continuously monitor problems in their vehicles using DTC data

Page 6: Data Science at Scale on MPP databases - Use Cases & Open Source Tools

Problem Setup – Predicting Job Type from Diagnostic Trouble Codes (DTCs)

[Figure: a timeline (axis: Time) of DTC observations with codes B, P, C and U, followed by repair jobs labeled Job Type: Transmission, Engine and Body. For each job the figure asks: can the DTCs observed here predict this Job Type?]

Page 7: Data Science at Scale on MPP databases - Use Cases & Open Source Tools

Data Parallelism

•  One or more jobs can occur on the same day, so this is a multi-labeling problem

•  One-vs-rest classifiers are built in parallel

[Figure: One-vs-Rest Classification. A 0/1 label matrix over Class 1, Class 2 and Class 3; Red vs. Non Red is built on Segment 1, Green vs. Non Green on Segment 2, and Blue vs. Non Blue on Segment N.]
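The one-vs-rest setup on this slide can be sketched in plain Python. This is a minimal illustration, not the method used in the engagement: the toy rows, the job-type names as labels, the tiny gradient-descent logistic regression, and the thread pool standing in for MPP segments are all assumptions made for the example.

```python
import math
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy data: each row flags which DTC families (B, P, C, U)
# were seen for a vehicle; each label set lists the job types that followed.
FEATURES = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
]
JOB_SETS = [
    {"Body"},
    {"Engine"},
    {"Transmission"},
    {"Body", "Engine"},
    {"Engine", "Transmission"},
    {"Body", "Transmission"},
    set(),
]
CLASSES = ["Body", "Engine", "Transmission"]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_one_vs_rest(job_type, epochs=500, lr=0.5):
    """Fit a small logistic regression for `job_type` vs. everything else."""
    targets = [1.0 if job_type in jobs else 0.0 for jobs in JOB_SETS]
    weights = [0.0] * len(FEATURES[0])
    bias = 0.0
    n = len(FEATURES)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(FEATURES, targets):
            err = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            grad_b += err
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return job_type, (weights, bias)


# The binary problems are independent, so they can be trained concurrently.
# Threads stand in for the database segments here.
with ThreadPoolExecutor(max_workers=len(CLASSES)) as pool:
    MODELS = dict(pool.map(train_one_vs_rest, CLASSES))


def predict_jobs(x, threshold=0.5):
    """Return every job type whose classifier scores at or above threshold."""
    predicted = set()
    for job_type, (weights, bias) in MODELS.items():
        prob = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
        if prob >= threshold:
            predicted.add(job_type)
    return predicted
```

Because each classifier only needs its own binary target column, a multi-label row (two jobs on the same day) simply counts as a positive example for two of the classifiers.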

Page 8: Data Science at Scale on MPP databases - Use Cases & Open Source Tools

Model Scoring Pipeline

[Figure: incoming DTCs (for example B; B, P, C; U) enter at Ingest and are scored in real time against cached models for Body, Axle and Engine (Model Caching in GPDB/HAWQ). A job type is flagged when Prob >= Threshold, and results flow to the Sink, a web or mobile app dashboard.]
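A rough sketch of the scoring step in the pipeline, assuming the cached models are simple per-job-type linear scorers. The weights, the threshold value, and the list used as a sink are hypothetical placeholders chosen for illustration; the slide does not specify any of them.

```python
import math

# Hypothetical cached models: per job type, a weight for each DTC family
# (B, P, C, U) plus a bias. On the slide the cache lives in GPDB/HAWQ.
MODEL_CACHE = {
    "Body": ({"B": 2.0, "P": 0.0, "C": 0.5, "U": 0.0}, -1.0),
    "Axle": ({"B": 0.0, "P": 0.0, "C": 2.5, "U": 0.0}, -1.5),
    "Engine": ({"B": 0.0, "P": 2.0, "C": 0.0, "U": 0.5}, -1.0),
}
THRESHOLD = 0.6  # assumed operating point, not taken from the slides


def score(dtc_codes):
    """Return the probability of each job type for one set of DTC codes."""
    probs = {}
    for job_type, (weights, bias) in MODEL_CACHE.items():
        z = bias + sum(weights[code] for code in set(dtc_codes))
        probs[job_type] = 1.0 / (1.0 + math.exp(-z))
    return probs


def run_pipeline(events, sink):
    """Score each ingested event and push flagged job types to the sink."""
    for vehicle_id, dtc_codes in events:
        for job_type, prob in sorted(score(dtc_codes).items()):
            if prob >= THRESHOLD:
                sink.append((vehicle_id, job_type, round(prob, 3)))


EVENTS = [("v1", ["B"]), ("v2", ["B", "P", "C"]), ("v3", ["U"])]
ALERTS = []
run_pipeline(EVENTS, ALERTS)
```

Keeping the threshold outside the model makes it easy to tune per job type without retraining.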

Page 9: Data Science at Scale on MPP databases - Use Cases & Open Source Tools

MPP Architectural Overview

Think of it as multiple PostgreSQL servers

[Figure: a Master node above a set of Segments/Workers]

Rows are distributed across segments by a particular field (or randomly)
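The idea of distributing rows by a field can be illustrated with a short sketch. It assumes a CRC32 hash as a stand-in for whatever hash the database actually uses, and the row layout and field names are made up for the example.

```python
import zlib
from collections import defaultdict


def segment_for(key, n_segments):
    """Map a distribution-key value to a segment using a stable hash."""
    return zlib.crc32(str(key).encode("utf-8")) % n_segments


def distribute(rows, key_field, n_segments):
    """Group rows by the segment their distribution key hashes to."""
    segments = defaultdict(list)
    for row in rows:
        segments[segment_for(row[key_field], n_segments)].append(row)
    return segments


# Hypothetical rows keyed by vehicle.
ROWS = [
    {"vehicle_id": "v1", "dtc": "B"},
    {"vehicle_id": "v2", "dtc": "P"},
    {"vehicle_id": "v1", "dtc": "U"},
    {"vehicle_id": "v3", "dtc": "C"},
    {"vehicle_id": "v2", "dtc": "B"},
]
PLACEMENT = distribute(ROWS, "vehicle_id", n_segments=4)
```

Hashing on the key guarantees that all rows for one vehicle land on the same segment, which is what lets per-vehicle computations run without moving data between segments.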

Page 10: Data Science at Scale on MPP databases - Use Cases & Open Source Tools

IT TAKES MORE THAN ONE TOOL

Page 11: Data Science at Scale on MPP databases - Use Cases & Open Source Tools

Open Source Data Science Toolkit

[Figure labels: Key Languages; Key Tools (Modeling Tools, Visualization Tools); Platform; MLlib; PL/X; Pivotal Big Data Suite; GemFire]

Page 12: Data Science at Scale on MPP databases - Use Cases & Open Source Tools

Scalable, In-Database Machine Learning

Apache (incubating)

•  Open Source https://github.com/madlib/madlib

•  Works on Greenplum DB, Apache HAWQ and PostgreSQL

•  In active development by Pivotal

•  MADlib is now an Apache Software Foundation incubator project!

Page 13: Data Science at Scale on MPP databases - Use Cases & Open Source Tools

Functions

Predictive Analytics Library (Aug 2015), @MADlib_analytic

Supervised Learning

Regression Models
•  Cox Proportional Hazards Regression
•  Elastic Net Regularization
•  Generalized Linear Models
•  Linear Regression
•  Logistic Regression
•  Marginal Effects
•  Multinomial Regression
•  Ordinal Regression
•  Robust Variance, Clustered Variance
•  Support Vector Machines

Tree Methods
•  Decision Tree
•  Random Forest

Other Methods
•  Conditional Random Field
•  Naïve Bayes

Unsupervised Learning
•  Association Rules (Apriori)
•  Clustering (K-means)
•  Topic Modeling (LDA)

Statistics

Descriptive
•  Cardinality Estimators
•  Correlation
•  Summary

Inferential
•  Hypothesis Tests

Other Statistics
•  Probability Functions

Other Modules
•  Conjugate Gradient
•  Linear Solvers
•  PMML Export
•  Random Sampling
•  Term Frequency for Text

Time Series
•  ARIMA

Data Types and Transformations
•  Array Operations
•  Dimensionality Reduction (PCA)
•  Encoding Categorical Variables
•  Matrix Operations
•  Matrix Factorization (SVD, Low Rank)
•  Norms and Distance Functions
•  Sparse Vectors

Model Evaluation
•  Cross Validation

Page 14: Data Science at Scale on MPP databases - Use Cases & Open Source Tools

Use Case: Predicting insurance claim amounts using structured and unstructured data

•  Using features from structured and unstructured data sources associated with claims, build the capability to predict claim amounts

Page 15: Data Science at Scale on MPP databases - Use Cases & Open Source Tools

Text analytics on MPP

•  Unstructured data in the form of claim comments and claim descriptions (text)

•  Use a bag-of-words approach (unigrams, bigrams)

•  tf-idf for more meaningful insights
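As a small illustration of the bag-of-words and tf-idf steps, the sketch below computes weights over unigrams and bigrams in plain Python. The claim texts are invented, and the particular weighting (term frequency times log of inverse document frequency, with whitespace tokenization) is one common variant assumed for the example; it is not taken from the notebook referenced in the talk.

```python
import math
from collections import Counter


def ngrams(text, n_max=2):
    """Return the unigrams and bigrams of a whitespace-tokenized text."""
    tokens = text.lower().split()
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def tf_idf(documents):
    """Compute a tf-idf weight for each n-gram in each document."""
    bags = [Counter(ngrams(doc)) for doc in documents]
    doc_freq = Counter()
    for bag in bags:
        doc_freq.update(bag.keys())  # count documents, not occurrences
    n_docs = len(documents)
    weights = []
    for bag in bags:
        total = sum(bag.values())
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in bag.items()
        })
    return weights


# Hypothetical claim descriptions.
CLAIMS = [
    "claim rear bumper damage",
    "claim water damage in kitchen",
    "claim rear window cracked",
]
WEIGHTS = tf_idf(CLAIMS)
```

A term that appears in every claim (here "claim") gets weight zero, while a bigram unique to one claim (here "rear bumper") is weighted more heavily than the more widespread unigram "rear", which is the sense in which tf-idf surfaces more meaningful terms than raw counts.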

Page 16: Data Science at Scale on MPP databases - Use Cases & Open Source Tools


Code walkthrough: Text analytics on MPP

github.com/vatsan/text_analytics_on_mpp/tree/master/vector_space_models

We’ll walk through this Jupyter notebook

Page 17: Data Science at Scale on MPP databases - Use Cases & Open Source Tools

Use Case: Churn prediction

•  Build a churn model to predict which customers are most likely to churn

•  Provide insights into key factors responsible for churn to potentially intervene prior to churn

Page 18: Data Science at Scale on MPP databases - Use Cases & Open Source Tools

Usage Time Series Data

•  Aggregate weekly usage by user

•  Compute descriptive statistics

•  Extract features based on business expertise
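The three steps above can be sketched with the Python standard library. The usage log, the choice of ISO weeks, and the feature names (in particular `last_minus_first` as a stand-in for a business-driven feature) are assumptions made for illustration only.

```python
from collections import defaultdict
from datetime import date
from statistics import mean, pstdev

# Hypothetical usage log: (user_id, day, minutes used).
USAGE = [
    ("u1", date(2016, 1, 4), 30),
    ("u1", date(2016, 1, 6), 50),
    ("u1", date(2016, 1, 12), 20),
    ("u1", date(2016, 1, 19), 10),
    ("u2", date(2016, 1, 5), 15),
    ("u2", date(2016, 1, 13), 15),
]


def weekly_usage(events):
    """Sum usage per user and ISO (year, week)."""
    totals = defaultdict(lambda: defaultdict(int))
    for user, day, minutes in events:
        iso_year, iso_week, _ = day.isocalendar()
        totals[user][(iso_year, iso_week)] += minutes
    return totals


def usage_features(events):
    """Turn each user's weekly series into a few descriptive features."""
    features = {}
    for user, weeks in weekly_usage(events).items():
        series = [weeks[w] for w in sorted(weeks)]
        features[user] = {
            "weeks_active": len(series),
            "mean": mean(series),
            "std": pstdev(series),
            # Assumed business feature: change from first to last week.
            "last_minus_first": series[-1] - series[0],
        }
    return features


FEATURES = usage_features(USAGE)
```

In this toy data user u1 drops from 80 minutes in the first week to 10 in the third, the kind of pattern a churn model could pick up from such features.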

Page 19: Data Science at Scale on MPP databases - Use Cases & Open Source Tools

Open Source Analytics Ecosystem

Companies benefit from algorithmic breadth and scalability for building and socializing data science models

[Figure labels: Algorithms, Visualization, MLlib, PL/X]

Best of breed in-memory and in-database tools for an MPP platform

Page 20: Data Science at Scale on MPP databases - Use Cases & Open Source Tools

Data Parallelism through PL/X: X in Python, R, Java, C/C++ and pgSQL

•  For embarrassingly parallel tasks, we can use procedural languages to easily parallelize any stand-alone library in Java, Python, R, pgSQL or C/C++

•  The interpreter/VM of the language ‘X’ is installed on each node of the MPP environment

•  plpython and python are loaded as dynamic libraries on the master and segment nodes (libpython.so and plpython.so are under $GPHOME/ext/python)

[Figure: SQL arrives at the Master Host (with a Standby Master), which is connected through the Interconnect to four Segment Hosts, each running two Segments.]
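The data-parallel pattern described here, the same function running independently on each segment's local rows, can be mimicked in ordinary Python. In this sketch a thread pool stands in for the segments, and `udf_mean` is a hypothetical placeholder for whatever library call the UDF body would make; none of these names come from Greenplum or HAWQ.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical layout: each segment holds its own slice of the rows,
# and each row carries an array of values.
SEGMENT_ROWS = {
    0: [[1.0, 2.0, 3.0], [4.0, 4.0]],
    1: [[10.0, 20.0]],
    2: [[5.0], [6.0, 8.0]],
}


def udf_mean(values):
    """Stand-in for a per-row UDF body written against any library."""
    return sum(values) / len(values)


def run_on_segment(rows):
    """Apply the UDF to every row stored on one segment."""
    return [udf_mean(row) for row in rows]


def parallel_apply(segment_rows):
    """Run the UDF on all segments at once and gather the results."""
    with ThreadPoolExecutor(max_workers=len(segment_rows)) as pool:
        per_segment = pool.map(run_on_segment, segment_rows.values())
    return [result for chunk in per_segment for result in chunk]


RESULTS = parallel_apply(SEGMENT_ROWS)
```

No segment needs data from another, which is what makes the task embarrassingly parallel: adding segments shrinks each slice without changing the function.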

Page 21: Data Science at Scale on MPP databases - Use Cases & Open Source Tools

User Defined Functions (UDFs) in PL/Python

•  Procedural languages need to be installed on each database in which they are used.

•  The syntax is like a normal Python function with the function definition line replaced by a SQL wrapper. Equivalently, it is a SQL user-defined function with Python inside.

    CREATE FUNCTION seasonality (x float[])
        RETURNS float[] AS $$
        import statsmodels.api as sm
        s = sm.tsa.seasonal_decompose(x).seasonal
        return s
    $$ LANGUAGE plpythonu;

The CREATE FUNCTION ... AS $$ lines and the closing $$ LANGUAGE plpythonu; line are the SQL wrapper; the lines between them are normal Python.
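To make the relationship between the Python body and the SQL wrapper concrete, here is a small sketch that assembles the statement from its parts. The helper `wrap_plpython` is hypothetical and exists only for this illustration; it is not part of PL/Python or any Pivotal tool, and it only builds text without talking to a database.

```python
import textwrap


def wrap_plpython(name, args, returns, body):
    """Wrap a Python function body in a CREATE FUNCTION statement."""
    indented = textwrap.indent(textwrap.dedent(body).strip(), "    ")
    return (
        f"CREATE FUNCTION {name} ({args})\n"
        f"    RETURNS {returns} AS $$\n"
        f"{indented}\n"
        f"$$ LANGUAGE plpythonu;"
    )


# The body is exactly what would follow a `def seasonality(x):` line.
BODY = """
    import statsmodels.api as sm
    s = sm.tsa.seasonal_decompose(x).seasonal
    return s
"""
DDL = wrap_plpython("seasonality", "x float[]", "float[]", BODY)
print(DDL)
```

The argument list and return type move from the Python `def` line into SQL types, while the body is passed through unchanged.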

Page 22: Data Science at Scale on MPP databases - Use Cases & Open Source Tools

Usage Time Series Data with PL/X

•  Easily harness your UDF with open source libraries (for machine learning, signal processing...)

•  Runs at scale through data parallelism

Page 23: Data Science at Scale on MPP databases - Use Cases & Open Source Tools


Code walkthrough: Image processing on MPP

github.com/gautamsm/data-science-on-mpp/tree/master/image_processing

In-database Canny edge detection with OpenCV inside a PL/C function

Page 24: Data Science at Scale on MPP databases - Use Cases & Open Source Tools


Pivotal Data Science Blogs

1.  Scaling native (C++) apps on Pivotal MPP

2.  Predicting commodity futures through Tweets

3.  A pipeline for distributed topic & sentiment analysis of tweets on Greenplum

4.  Using data science to predict TV viewer behavior

5.  Twitter NLP: Scaling part-of-speech tagging

6.  Distributed deep learning on MPP and Hadoop

7.  Multi-variate time series forecasting

8.  Pivotal for good – Crisis Text Line

http://blog.pivotal.io/data-science-pivotal

Page 25: Data Science at Scale on MPP databases - Use Cases & Open Source Tools


Thank You!

Page 26: Data Science at Scale on MPP databases - Use Cases & Open Source Tools

A NEW PLATFORM FOR A NEW ERA