data science at scale on mpp databases - use cases & open source tools

© Copyright 2016 Pivotal. All rights reserved.

Esther Vasiete, Pivotal Data Scientist
Structure Data 2016

Data Science at Scale on MPP Databases – Use Cases & Open Source Tools

Joint work with Pivotal Data Science


Agenda

•  Introduction
•  Open Source Data Science Toolkit
•  Real world applications
   –  Predictive maintenance of automobiles
   –  Predicting insurance claims
   –  Predicting customer churn
•  Data science deep-dive with Jupyter notebooks
   –  Text analytics on MPP (github.com/vatsan)
   –  Image processing on MPP (github.com/gautamsm)


Pivotal Data Science

Our Charter: Pivotal Data Science is Pivotal’s differentiated and highly opinionated data-centric service delivery organization (part of Pivotal Labs).

Our Goals:
•  Expedite customer time-to-value and ROI by driving business-aligned innovation and solutions assurance within Pivotal’s Data Fabric technologies.
•  Drive customer adoption and autonomy across the full spectrum of Pivotal Data technologies through best-in-class data science and data engineering services, with a deep emphasis on knowledge transfer.

[Figure: Data Science, Data Engineering, App Dev]


Pivotal Data Science Knowledge Development


Use Case: Preventive Maintenance for Connected Vehicles

•  Customer vehicles transmit Diagnostic Trouble Codes (DTC) and vehicle status data to the Pivotal analytics environment
•  Can the DTC data be leveraged to predict the presence of potential problems in vehicles?
•  Set up a data science framework on the Pivotal analytics environment that would enable the customer data science team to continuously monitor problems in their vehicles using DTC data


Problem Setup – Predicting Job Type from Diagnostic Trouble Codes (DTCs)

[Figure: timeline (axis: Time) of DTC observations, each a set of code letters from B, P, C, U, recorded ahead of repair jobs. Job Types shown: Transmission, Engine, Body. For each job the question is: can the DTCs observed beforehand predict this Job Type?]


Data Parallelism

•  One or more jobs can occur on the same day, so this is a multi-labeling problem
•  One-vs-rest classifiers are built in parallel

[Figure: One-vs-Rest Classification. Binary (1/0) targets for Class 1, Class 2, Class 3, with one classifier per class, each built on its own segment: Red vs. Non Red on Segment 1, Green vs. Non Green on Segment 2, Blue vs. Non Blue on Segment N.]
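The one-vs-rest setup can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's code: the sample records and the helper name `one_vs_rest_targets` are assumptions made up for the example. It turns multi-label visits into one binary target vector per job type, which is the input each per-class classifier would train on.

```python
# Hypothetical sample: each record is (DTC codes seen before a visit, job types performed).
records = [
    ({"B"}, {"Body"}),
    ({"B", "P", "C"}, {"Transmission", "Engine"}),  # two jobs on the same day
    ({"U"}, {"Transmission"}),
]

def one_vs_rest_targets(records):
    """Return {job_type: [0/1 per record]}, one binary target vector per class."""
    classes = sorted({job for _, jobs in records for job in jobs})
    return {c: [1 if c in jobs else 0 for _, jobs in records] for c in classes}

targets = one_vs_rest_targets(records)
for job_type, y in targets.items():
    # Each (job_type, y) pair is an independent binary problem,
    # so the classifiers can be trained in parallel.
    print(job_type, y)
```

Because the binary problems do not depend on each other, each one can be assigned to a different segment, as in the Red / Green / Blue picture.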


Model Scoring Pipeline

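[Figure: scoring pipeline. Incoming DTC sets (e.g. B; B, P, C; U) are ingested and scored against cached models (Model Caching in GPDB/HAWQ) for job types such as Body, Axle, and Engine. A job type is flagged where Prob >= Threshold, and real-time scoring feeds a web or mobile app dashboard. Stage labels: Ingest, Sink.]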
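The "Prob >= Threshold" step can be sketched as follows. The function name, the probabilities, and the per-class thresholds are all hypothetical values chosen for illustration; the deck does not say which thresholds were used.

```python
def predict_job_types(probabilities, thresholds, default_threshold=0.5):
    """Return the job types whose predicted probability meets its threshold."""
    return sorted(
        job for job, p in probabilities.items()
        if p >= thresholds.get(job, default_threshold)
    )

# Hypothetical per-class scores for one vehicle, one score per one-vs-rest model.
scores = {"Body": 0.82, "Axle": 0.10, "Engine": 0.55}
thresholds = {"Body": 0.5, "Axle": 0.5, "Engine": 0.6}

flagged = predict_job_types(scores, thresholds)
print(flagged)
```

Keeping one threshold per class lets each job type trade precision for recall independently, which suits a multi-label problem.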


MPP Architectural Overview

•  Think of it as multiple PostgreSQL servers: one master and many segments (workers)
•  Rows are distributed across segments by a particular field (or randomly)
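Distribution by a field can be pictured with a toy hash function. This sketch uses CRC32 purely for illustration and is not the hash Greenplum or HAWQ actually uses; `segment_for`, the `vehicle_id` key, and the segment count are assumptions.

```python
import zlib
from collections import Counter

def segment_for(key, num_segments):
    """Map a distribution-key value to a segment index (illustrative hash only)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_segments

# Hypothetical table of 1000 rows, distributed by vehicle_id over 4 segments.
rows = [{"vehicle_id": i, "dtc": "B"} for i in range(1000)]
num_segments = 4
placement = Counter(segment_for(r["vehicle_id"], num_segments) for r in rows)

for segment, count in sorted(placement.items()):
    print("segment", segment, "holds", count, "rows")
```

Rows with the same key always land on the same segment, so per-key work such as one model per vehicle can run without moving data between segments.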


IT TAKES MORE THAN ONE TOOL


Open Source Data Science Toolkit

[Figure: toolkit overview. Key languages and key tools (modeling tools, visualization tools, MLlib, PL/X) sit on top of the platform layer (Pivotal Big Data Suite, GemFire).]


Scalable, In-Database Machine Learning
Apache MADlib (incubating)

•  Open source: https://github.com/madlib/madlib
•  Works on Greenplum DB, Apache HAWQ and PostgreSQL
•  In active development by Pivotal
•  MADlib is now an Apache Software Foundation incubator project!


Functions
Predictive Analytics Library (Aug 2015), @MADlib_analytic

Supervised Learning
   Regression Models
   •  Cox Proportional Hazards Regression
   •  Elastic Net Regularization
   •  Generalized Linear Models
   •  Linear Regression
   •  Logistic Regression
   •  Marginal Effects
   •  Multinomial Regression
   •  Ordinal Regression
   •  Robust Variance, Clustered Variance
   •  Support Vector Machines
   Tree Methods
   •  Decision Tree
   •  Random Forest
   Other Methods
   •  Conditional Random Field
   •  Naïve Bayes

Unsupervised Learning
   •  Association Rules (Apriori)
   •  Clustering (K-means)
   •  Topic Modeling (LDA)

Statistics
   Descriptive
   •  Cardinality Estimators
   •  Correlation
   •  Summary
   Inferential
   •  Hypothesis Tests
   Other Statistics
   •  Probability Functions

Time Series
   •  ARIMA

Data Types and Transformations
   •  Array Operations
   •  Dimensionality Reduction (PCA)
   •  Encoding Categorical Variables
   •  Matrix Operations
   •  Matrix Factorization (SVD, Low Rank)
   •  Norms and Distance Functions
   •  Sparse Vectors

Model Evaluation
   •  Cross Validation

Other Modules
   •  Conjugate Gradient
   •  Linear Solvers
   •  PMML Export
   •  Random Sampling
   •  Term Frequency for Text


Use Case: Predicting insurance claim amounts using structured and unstructured data

•  Using features from structured and unstructured data sources associated with claims, build the capability to predict claim amounts


Text analytics on MPP

•  Unstructured data in the form of claim comments and claim descriptions (text)
•  Use a bag-of-words approach (unigrams, bigrams)
•  tf-idf for more meaningful insights
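A bag-of-words model with unigrams, bigrams, and tf-idf weights fits in a short plain-Python sketch. This is not the notebook's code: the claim texts are invented, tokenization is a bare whitespace split, and the weight is the plain variant tf * log(N / df), which is only one of several common tf-idf definitions.

```python
import math
from collections import Counter

def ngrams(text, n_max=2):
    """Lowercase, split on whitespace, and return all 1..n_max-grams as strings."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def tfidf(docs, n_max=2):
    """Return one {term: tf-idf weight} dict per document."""
    bags = [Counter(ngrams(d, n_max)) for d in docs]
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for bag in bags for term in bag)
    n_docs = len(docs)
    return [{t: (count / sum(bag.values())) * math.log(n_docs / doc_freq[t])
             for t, count in bag.items()}
            for bag in bags]

# Hypothetical claim descriptions.
claims = ["rear bumper damage", "water damage in kitchen", "rear window broken"]
weights = tfidf(claims)
print(sorted(weights[0].items(), key=lambda kv: -kv[1]))
```

Terms that occur in every document get weight zero, while a bigram unique to one claim (such as "rear bumper") outranks a unigram shared by several claims.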


Code walkthrough: Text analytics on MPP

github.com/vatsan/text_analytics_on_mpp/tree/master/vector_space_models

We’ll walk through this Jupyter notebook


Use Case: Churn prediction

•  Build a churn model to predict which customers are most likely to churn
•  Provide insights into key factors responsible for churn to potentially intervene prior to churn


Usage Time Series Data

•  Aggregate weekly usage by user
•  Compute descriptive statistics
•  Extract features based on business expertise
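The first two steps, weekly aggregation and descriptive statistics, can be sketched with the standard library. The event tuples, the choice of ISO weeks, and the particular statistics are assumptions for illustration; in the database the same logic would typically be a GROUP BY.

```python
import statistics
from collections import defaultdict
from datetime import date

# Hypothetical usage events: (user_id, day, minutes used).
events = [
    ("u1", date(2016, 1, 4), 30.0),
    ("u1", date(2016, 1, 6), 10.0),
    ("u1", date(2016, 1, 12), 20.0),
    ("u2", date(2016, 1, 5), 5.0),
]

def weekly_usage(events):
    """Sum usage per (user, ISO year, ISO week)."""
    totals = defaultdict(float)
    for user, day, amount in events:
        iso_year, iso_week, _ = day.isocalendar()
        totals[(user, iso_year, iso_week)] += amount
    return dict(totals)

def describe(weekly):
    """Per-user descriptive statistics over the weekly totals."""
    per_user = defaultdict(list)
    for (user, _, _), total in sorted(weekly.items()):
        per_user[user].append(total)
    return {u: {"weeks": len(v), "mean": statistics.mean(v),
                "min": min(v), "max": max(v),
                "stdev": statistics.pstdev(v)}
            for u, v in per_user.items()}

weekly = weekly_usage(events)
stats = describe(weekly)
print(stats)
```

The resulting per-user dictionary is the kind of row-level feature vector a churn model could consume alongside features chosen from business expertise.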


Open Source Analytics Ecosystem

•  Companies benefit from algorithmic breadth and scalability for building and socializing data science models
•  Best of breed in-memory and in-database tools for an MPP platform

[Figure: Algorithms (including MLlib and PL/X) and Visualization tools]


Data Parallelism through PL/X : X in Python, R, Java, C/C++ and pgSQL

•  For embarrassingly parallel tasks, we can use procedural languages to easily parallelize any stand-alone library in Java, Python, R, pgSQL or C/C++
•  The interpreter/VM of the language ‘X’ is installed on each node of the MPP environment
•  plpython and python are loaded as dynamic libraries on the master and segment nodes (libpython.so and plpython.so are under $GPHOME/ext/python)

[Figure: SQL arrives at the Master Host (with a Standby Master), which communicates over the Interconnect with the Segment Hosts, each running several Segments.]


User Defined Functions (UDFs) in PL/Python

•  Procedural languages need to be installed on each database used.
•  The syntax is like a normal Python function with the function definition line replaced by a SQL wrapper. Alternatively, think of it as a SQL User Defined Function with Python inside.

    -- SQL wrapper
    CREATE FUNCTION seasonality (x float[])
    RETURNS float[]
    AS $$
    # Normal Python
    import statsmodels.api as sm
    s = sm.tsa.seasonal_decompose(x).seasonal
    return s
    $$ LANGUAGE plpythonu;  -- SQL wrapper


Usage Time Series Data with PL/X

•  Easily harness your UDF with open source libraries (for machine learning, signal processing...)
•  Runs at scale through data parallelism
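To make the "normal Python inside a SQL wrapper" idea concrete, here is a function body that can be tested outside the database first. It is a deliberately simplified stand-in, not statsmodels' `seasonal_decompose`: the name `seasonal_means`, the `period` parameter, and the sample series are assumptions for illustration.

```python
def seasonal_means(x, period):
    """Naive seasonal component: mean deviation from the overall mean
    at each position in the cycle, repeated along the series."""
    overall = sum(x) / len(x)
    buckets = [[] for _ in range(period)]
    for i, value in enumerate(x):
        buckets[i % period].append(value - overall)
    means = [sum(b) / len(b) if b else 0.0 for b in buckets]
    return [means[i % period] for i in range(len(x))]

# Hypothetical weekly usage for one user, alternating low and high weeks.
weekly_minutes = [10.0, 20.0, 10.0, 20.0, 10.0, 20.0]
seasonal = seasonal_means(weekly_minutes, period=2)
print(seasonal)
```

Once the logic is verified locally, the same body could be placed between the `$$` markers of a `CREATE FUNCTION ... LANGUAGE plpythonu` wrapper and applied to one array per user, so that each segment processes its own rows in parallel.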


Code walkthrough: Image processing on MPP

github.com/gautamsm/data-science-on-mpp/tree/master/image_processing

In-database Canny edge detection with OpenCV inside a PL/C function


Pivotal Data Science Blogs

1.  Scaling native (C++) apps on Pivotal MPP

2.  Predicting commodity futures through Tweets

3.  A pipeline for distributed topic & sentiment analysis of tweets on Greenplum

4.  Using data science to predict TV viewer behavior

5.  Twitter NLP: Scaling part-of-speech tagging

6.  Distributed deep learning on MPP and Hadoop

7.  Multi-variate time series forecasting

8.  Pivotal for good – Crisis Textline

http://blog.pivotal.io/data-science-pivotal


Thank You!

A NEW PLATFORM FOR A NEW ERA
