© Copyright 2016 Pivotal. All rights reserved.
Data Science at Scale on MPP Databases – Use Cases & Open Source Tools
Esther Vasiete, Pivotal Data Scientist
Structure Data 2016
Joint work with Pivotal Data Science
Agenda
• Introduction
• Open Source Data Science Toolkit
• Real world applications
  – Predictive maintenance of automobiles
  – Predicting insurance claims
  – Predicting customer churn
• Data science deep-dive with Jupyter notebooks
  – Text analytics on MPP (github.com/vatsan)
  – Image processing on MPP (github.com/gautamsm)
Pivotal Data Science
Our Charter: Pivotal Data Science is Pivotal’s differentiated and highly opinionated data-centric service delivery organization (part of Pivotal Labs).
Our Goals:
• Expedite customer time-to-value and ROI by driving business-aligned innovation and solutions assurance within Pivotal’s Data Fabric technologies.
• Drive customer adoption and autonomy across the full spectrum of Pivotal Data technologies through best-in-class data science and data engineering services, with a deep emphasis on knowledge transfer.
[Diagram: Data Science, Data Engineering, App Dev]
Pivotal Data Science Knowledge Development
Use Case: Preventive Maintenance for Connected Vehicles
• Customer vehicles transmit Diagnostic Trouble Codes (DTC) and vehicle status data to the Pivotal analytics environment
• Can the DTC data be leveraged to predict the presence of potential problems in vehicles?
• Set up a data science framework on the Pivotal analytics environment that would enable the customer data science team to continuously monitor problems in their vehicles using DTC data
Problem Setup – Predicting Job Type from Diagnostic Trouble Codes (DTCs)
[Figure: timeline of DTC observations (families B, P, C, U) leading up to repair jobs of type Transmission, Engine and Body. For each job the question is: can the DTCs observed beforehand predict this job type?]
Data Parallelism
• One or more jobs on the same day: a multi-labeling problem
• One-vs-rest classifiers built in parallel
[Figure: One-vs-Rest Classification. Each class gets its own binary (1/0) classifier, built on its own segment: Class 1, Red vs. Non Red, on Segment 1; Class 2, Green vs. Non Green, on Segment 2; Class 3, Blue vs. Non Blue, on Segment N.]
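The one-vs-rest setup can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the model from the talk: the training rows, the 0/1 encoding of DTC families and the perceptron learner are all hypothetical stand-ins. What it shows is that every job type gets its own binary target, so the per-class models do not depend on each other and could each be fit on a different segment.

```python
# One-vs-rest sketch: one independent binary classifier per job type.
# The data, feature encoding and perceptron learner are illustrative
# assumptions, not taken from the talk.

DTC_FAMILIES = ["B", "P", "C", "U"]

# (DTC families seen before the job, job types performed that day)
TRAINING = [
    ({"B"}, {"Body"}),
    ({"P"}, {"Engine"}),
    ({"U"}, {"Transmission"}),
    ({"B", "P"}, {"Body", "Engine"}),
    ({"P", "U"}, {"Engine", "Transmission"}),
]


def encode(dtcs):
    """Turn a set of DTC families into a 0/1 feature vector."""
    return [1 if fam in dtcs else 0 for fam in DTC_FAMILIES]


def score(model, x):
    """Linear score of feature vector x under a (weights, bias) model."""
    weights, bias = model
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def train_binary(rows, job_type, epochs=20):
    """Perceptron for '<job_type> vs. rest'; returns (weights, bias)."""
    weights, bias = [0] * len(DTC_FAMILIES), 0
    for _ in range(epochs):
        for dtcs, jobs in rows:
            x = encode(dtcs)
            target = 1 if job_type in jobs else 0
            predicted = 1 if score((weights, bias), x) > 0 else 0
            error = target - predicted
            weights = [w + error * xi for w, xi in zip(weights, x)]
            bias += error
    return weights, bias


def train_one_vs_rest(rows):
    """Fit one binary model per job type found in the training rows."""
    job_types = sorted({job for _, jobs in rows for job in jobs})
    # Each entry depends only on its own binary target, so on an MPP
    # database every model could be fit on a separate segment.
    return {job: train_binary(rows, job) for job in job_types}


def predict(models, dtcs):
    """Return every job type whose binary model fires for these DTCs."""
    x = encode(dtcs)
    return {job for job, model in models.items() if score(model, x) > 0}


models = train_one_vs_rest(TRAINING)
print(predict(models, {"B", "P"}))
```

Because a day can carry several job types, `predict` returns a set rather than a single label, which is what makes this a multi-label rather than a multi-class setup.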
Model Scoring Pipeline
[Figure: Ingest → real-time scoring of incoming DTCs (e.g. B; B, P, C; U) with one model per job type (Body, Axle, Engine), each flagged when Prob >= Threshold → Sink → web or mobile app dashboard. Model caching: GPDB/HAWQ.]
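The thresholding step in the pipeline can be sketched as follows. The probabilities and thresholds are made-up example values, and `flag_job_types` is a hypothetical helper name; the slide only states that a job type is flagged when Prob >= Threshold.

```python
# Thresholding step of a scoring pipeline (illustrative sketch).
# The probabilities and thresholds below are made-up example values.

def flag_job_types(probabilities, thresholds, default_threshold=0.5):
    """Return job types whose predicted probability meets its threshold."""
    return sorted(
        job
        for job, prob in probabilities.items()
        if prob >= thresholds.get(job, default_threshold)
    )


# Hypothetical model outputs for one vehicle's latest DTCs.
scores = {"Body": 0.82, "Axle": 0.10, "Engine": 0.55}
thresholds = {"Body": 0.7, "Axle": 0.7, "Engine": 0.5}

print(flag_job_types(scores, thresholds))  # ['Body', 'Engine']
```

Keeping one threshold per job type lets each one-vs-rest model be tuned separately, for example stricter for costly repairs.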
MPP Architectural Overview
• Think of it as multiple PostgreSQL servers: a master and a set of segments (workers)
• Rows are distributed across segments by a particular field (or randomly)
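A rough Python sketch of distributing rows by a field is given here. It is only an illustration: `zlib.crc32` is a stand-in hash picked for the example and is not the hash function the database uses, and the `vehicle_id` column and row values are hypothetical.

```python
# Sketch of hash-distributing rows across segments by a key column.
# zlib.crc32 is a stand-in hash chosen for this example; it is not the
# hash function an MPP database actually uses.
import zlib
from collections import defaultdict


def segment_for(key, n_segments):
    """Map a distribution-key value to a segment index."""
    return zlib.crc32(str(key).encode("utf-8")) % n_segments


def distribute(rows, key_field, n_segments):
    """Group rows by the segment their distribution key hashes to."""
    segments = defaultdict(list)
    for row in rows:
        segments[segment_for(row[key_field], n_segments)].append(row)
    return dict(segments)


rows = [
    {"vehicle_id": "V1", "dtc": "B"},
    {"vehicle_id": "V2", "dtc": "P"},
    {"vehicle_id": "V1", "dtc": "U"},
    {"vehicle_id": "V3", "dtc": "C"},
]

placed = distribute(rows, "vehicle_id", 4)
for segment, segment_rows in sorted(placed.items()):
    print(segment, segment_rows)
```

All rows sharing a key value land on the same segment, so a per-key computation (for example, everything about one vehicle) can run without moving data between segments.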
IT TAKES MORE THAN ONE TOOL
Open Source Data Science Toolkit
[Figure: logo grid with the sections Key Languages, Key Tools (Modeling Tools and Visualization Tools, including MLlib and PL/X) and Platform (Pivotal Big Data Suite, GemFire)]
Scalable, In-Database Machine Learning
Apache MADlib (incubating)
• Open Source: https://github.com/madlib/madlib
• Works on Greenplum DB, Apache HAWQ and PostgreSQL
• In active development by Pivotal
• MADlib is now an Apache Software Foundation incubator project!
Functions
Predictive Analytics Library (@MADlib_analytic), Aug 2015
Supervised Learning
• Regression Models: Cox Proportional Hazards Regression; Elastic Net Regularization; Generalized Linear Models; Linear Regression; Logistic Regression; Marginal Effects; Multinomial Regression; Ordinal Regression; Robust Variance, Clustered Variance; Support Vector Machines
• Tree Methods: Decision Tree; Random Forest
• Other Methods: Conditional Random Field; Naïve Bayes
Unsupervised Learning
• Association Rules (Apriori)
• Clustering (K-means)
• Topic Modeling (LDA)
Statistics
• Descriptive: Cardinality Estimators; Correlation; Summary
• Inferential: Hypothesis Tests
• Other Statistics: Probability Functions
Other Modules
• Conjugate Gradient
• Linear Solvers
• PMML Export
• Random Sampling
• Term Frequency for Text
Time Series
• ARIMA
Data Types and Transformations
• Array Operations
• Dimensionality Reduction (PCA)
• Encoding Categorical Variables
• Matrix Operations
• Matrix Factorization (SVD, Low Rank)
• Norms and Distance Functions
• Sparse Vectors
Model Evaluation
• Cross Validation
Use Case: Predicting insurance claim amounts using structured and unstructured data
• Using features from structured and unstructured data sources associated with claims, build the capability to predict claim amounts
Text analytics on MPP
• Unstructured data in the form of claim comments and claim descriptions (text)
• Use a bag-of-words approach (unigrams, bigrams)
• Weight terms by tf-idf, so that terms common to most claims count less than distinctive ones
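The bag-of-words and tf-idf steps can be sketched in plain Python. This is a small stand-alone illustration, not the code from the notebook: the claim comments are invented, tokenization is a simple lowercase split, and the idf variant used here is the unsmoothed log(N / df), one of several common definitions.

```python
# Bag-of-words tf-idf over unigrams and bigrams (pure-Python sketch).
# The claim comments are invented; tokenization is a plain lowercase split.
import math
from collections import Counter


def ngrams(text, max_n=2):
    """Return the unigrams and bigrams of a whitespace-tokenized text."""
    tokens = text.lower().split()
    grams = []
    for n in range(1, max_n + 1):
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams


def tf_idf(documents):
    """Return one {term: tf-idf weight} dict per document."""
    bags = [Counter(ngrams(doc)) for doc in documents]
    n_docs = len(bags)
    # Number of documents each term appears in.
    doc_freq = Counter(term for bag in bags for term in bag)
    weights = []
    for bag in bags:
        total = sum(bag.values())
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in bag.items()
        })
    return weights


comments = [
    "water damage in kitchen",
    "water damage in basement",
    "vehicle collision rear bumper",
]

claim_weights = tf_idf(comments)
for doc_weights in claim_weights:
    top = sorted(doc_weights.items(), key=lambda kv: -kv[1])[:3]
    print(top)
```

In the first comment, "kitchen" outweighs "water" because "water" also appears in another claim; a term present in every document gets weight zero under this idf variant.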
Code walkthrough: Text analytics on MPP
github.com/vatsan/text_analytics_on_mpp/tree/master/vector_space_models
We’ll walk through this Jupyter notebook
Use Case: Churn prediction
• Build a churn model to predict which customers are most likely to churn
• Provide insights into the key factors responsible for churn, so that the business can intervene before customers leave
Usage Time Series Data
• Aggregate weekly usage by user
• Compute descriptive statistics
• Extract features based on business expertise
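The three steps can be sketched as follows. The event records, the use of ISO weeks and the feature names (including `first_to_last` as an example of a hand-crafted feature) are assumptions made for illustration; the talk does not specify them.

```python
# Weekly usage aggregation and simple descriptive features per user.
# The event records and feature names are illustrative assumptions.
from collections import defaultdict
from datetime import date
from statistics import mean


def weekly_usage(events):
    """Sum usage per user and (ISO year, ISO week)."""
    totals = defaultdict(lambda: defaultdict(float))
    for user, day, amount in events:
        iso_year, iso_week = day.isocalendar()[:2]
        totals[user][(iso_year, iso_week)] += amount
    return {user: dict(weeks) for user, weeks in totals.items()}


def describe(weeks):
    """Descriptive statistics over one user's weekly totals."""
    values = [weeks[key] for key in sorted(weeks)]
    return {
        "n_weeks": len(values),
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
        # Example of a hand-crafted feature: change from first to last week.
        "first_to_last": values[-1] - values[0],
    }


events = [
    ("alice", date(2016, 1, 4), 10.0),
    ("alice", date(2016, 1, 6), 5.0),
    ("alice", date(2016, 1, 12), 7.0),
    ("bob", date(2016, 1, 5), 3.0),
]

weekly = weekly_usage(events)
features = {user: describe(weeks) for user, weeks in weekly.items()}
print(features["alice"])
```

A falling `first_to_last` value is the kind of signal a churn model might pick up, although which features matter is a business question that this sketch does not answer.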
Open Source Analytics Ecosystem
Companies benefit from algorithmic breadth and scalability for building and socializing data science models
[Figure: algorithm and visualization tool logos, including MLlib and PL/X]
Best of breed in-memory and in-database tools for an MPP platform
Data Parallelism through PL/X: X in Python, R, Java, C/C++ and pgSQL
• For embarrassingly parallel tasks, we can use procedural languages to easily parallelize any stand-alone library in Java, Python, R, pgSQL or C/C++
• The interpreter/VM of the language ‘X’ is installed on each node of the MPP environment
• plpython and python are loaded as dynamic libraries on the master and segment nodes (libpython.so and plpython.so are under $GPHOME/ext/python)
[Diagram: SQL enters at the Master Host (with a Standby Master); the Interconnect links it to the Segment Hosts, each running multiple Segments]
User Defined Functions (UDFs) in PL/Python
• Procedural languages need to be installed on each database used.
• The syntax is like a normal Python function with the function definition line replaced by a SQL wrapper. Alternatively, think of it as a SQL User Defined Function with Python inside.
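CREATE FUNCTION seasonality (x float[])    -- SQL wrapper
RETURNS float[]
AS $$
    # Normal Python
    import statsmodels.api as sm
    # x is passed in without a time index, so seasonal_decompose also
    # needs the seasonal period given explicitly (omitted on the slide).
    s = sm.tsa.seasonal_decompose(x).seasonal
    return s
$$ LANGUAGE plpythonu;                      -- SQL wrapper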
Usage Time Series Data with PL/X
• Easily harness open source libraries (for machine learning, signal processing...) inside your UDF
• Runs at scale through data parallelism
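The data-parallel pattern behind a PL/X UDF can be mimicked in plain Python: collect each user's series, then apply the same function to every series independently. In the sketch below, `seasonal_means` is a deliberately naive stand-in for a library call such as a seasonal decomposition, and `apply_per_key` and the usage rows are hypothetical names and data chosen for illustration.

```python
# Data-parallel pattern behind a PL/X UDF: group rows by key, then apply
# the same function to each group independently.
# seasonal_means is a deliberately naive stand-in for a library call
# such as a seasonal decomposition; it is not the method from the talk.
from collections import defaultdict


def seasonal_means(series, period):
    """Average of each position in the cycle, centered on the overall mean."""
    overall = sum(series) / len(series)
    seasonal = []
    for offset in range(period):
        values = series[offset::period]
        seasonal.append(sum(values) / len(values) - overall)
    return seasonal


def apply_per_key(rows, func, **kwargs):
    """Collect each key's values in order, then run func once per key."""
    grouped = defaultdict(list)
    for key, value in rows:
        grouped[key].append(value)
    # Each call touches only one key's data, so the calls can run on
    # different segments at the same time.
    return {key: func(values, **kwargs) for key, values in grouped.items()}


usage = [
    ("alice", 1.0), ("alice", 2.0), ("alice", 3.0),
    ("alice", 1.0), ("alice", 2.0), ("alice", 3.0),
    ("bob", 4.0), ("bob", 0.0), ("bob", 4.0), ("bob", 0.0),
]

result = apply_per_key(usage, seasonal_means, period=3)
print(result["alice"])  # [-1.0, 0.0, 1.0]
```

In the database, the grouping would be done by SQL and the function body would live inside the UDF, with each segment processing the users whose rows it holds.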
Code walkthrough: Image processing on MPP
github.com/gautamsm/data-science-on-mpp/tree/master/image_processing
In-database Canny edge detection with OpenCV inside a PL/C function
Pivotal Data Science Blogs
1. Scaling native (C++) apps on Pivotal MPP
2. Predicting commodity futures through Tweets
3. A pipeline for distributed topic & sentiment analysis of tweets on Greenplum
4. Using data science to predict TV viewer behavior
5. Twitter NLP: Scaling part-of-speech tagging
6. Distributed deep learning on MPP and Hadoop
7. Multi-variate time series forecasting
8. Pivotal for good – Crisis Textline
http://blog.pivotal.io/data-science-pivotal
Thank You!
A NEW PLATFORM FOR A NEW ERA