Flink Forward 2016
AMIDST Toolbox – Flink Forward
A Java Toolbox for Scalable Probabilistic Machine Learning
12-14 SEP 2016, Berlin
Ana M. Martínez, Aalborg University
RUNNING USE CASE
Predicting Defaulting Clients
Predicts probability a customer will default within 2 years
§ Daily data for millions of clients.
§ Tons of missing data.
§ Odd distributions.
AMIDST APPROACH
Data Knowledge
Openbox Models
Blackbox Inference Engine (Powered by Flink)
PGMS
Probabilistic graphical models (PGMs)
Specify your model using probabilistic graphical models with latent variables and temporal dependencies.
Custom Gaussian Mixture Model
§ Hij defines a local mixture.
§ Hi defines a global mixture.
RUNNING CODE EXAMPLE
// Set up the Flink session.
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// Load the data stream.
String filename = "hdfs://dataFlink_month0.arff";
DataFlink<DataInstance> data = DataFlinkLoader.loadDataFromFolder(env, filename, false);

// Build the model.
Model model = new CustomGaussianMixture(data.getAttributes());
SCALABLE INFERENCE
Scalable Learning
Perform Bayesian inference on your probabilistic models with powerful approximate and scalable algorithms.
d-VMP Algorithm: coded as an iterative map-reduce task
A state-of-the-art distributed variational message passing algorithm.
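To give a feel for what "iterative map-reduce" means here, the following is a minimal sketch and not the d-VMP algorithm itself. It swaps in plain EM for a two-component one-dimensional Gaussian mixture, and it uses Java streams in place of a Flink cluster. The class name, the method names, and the toy data are all assumptions made for illustration. The shape is the part that carries over: each partition maps its data to local sufficient statistics, a reduce step sums them, and the global parameters are updated before the next pass.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: EM for a two-component 1-D Gaussian mixture, written as
// an iterative map-reduce loop. This is NOT d-VMP and NOT AMIDST code.
final class IterativeMapReduceSketch {

    // Gaussian density with mean m and variance v.
    static double pdf(double x, double m, double v) {
        return Math.exp(-(x - m) * (x - m) / (2 * v)) / Math.sqrt(2 * Math.PI * v);
    }

    // MAP: one partition turns its data into expected sufficient statistics
    // {n0, sum0, sumSq0, n1, sum1, sumSq1} under the current global parameters.
    static double[] map(double[] part, double[] weights, double[] means, double[] variances) {
        double[] s = new double[6];
        for (double x : part) {
            double p0 = weights[0] * pdf(x, means[0], variances[0]);
            double p1 = weights[1] * pdf(x, means[1], variances[1]);
            double r0 = p0 / (p0 + p1); // responsibility of component 0
            double[] r = {r0, 1 - r0};
            for (int k = 0; k < 2; k++) {
                s[3 * k] += r[k];
                s[3 * k + 1] += r[k] * x;
                s[3 * k + 2] += r[k] * x * x;
            }
        }
        return s;
    }

    // REDUCE: statistics from two partitions combine by element-wise addition.
    static double[] reduce(double[] a, double[] b) {
        double[] out = new double[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = a[i] + b[i];
        }
        return out;
    }

    // Driver: repeat map -> reduce -> global update; returns the two fitted means.
    static double[] fit(List<double[]> partitions, int iterations) {
        double[] weights = {0.5, 0.5};
        double[] means = {-1.0, 1.0};
        double[] variances = {1.0, 1.0};
        for (int it = 0; it < iterations; it++) {
            double[] s = partitions.stream()
                    .map(p -> map(p, weights, means, variances))
                    .reduce(new double[6], IterativeMapReduceSketch::reduce);
            double n = s[0] + s[3];
            for (int k = 0; k < 2; k++) {
                weights[k] = s[3 * k] / n;
                means[k] = s[3 * k + 1] / s[3 * k];
                // Small floor keeps the variance strictly positive.
                variances[k] = Math.max(s[3 * k + 2] / s[3 * k] - means[k] * means[k], 1e-6);
            }
        }
        return means;
    }

    public static void main(String[] args) {
        // Made-up data, split into two "partitions".
        List<double[]> partitions = Arrays.asList(
                new double[] {-2.1, -1.9, -2.0, 3.1},
                new double[] {2.9, 3.0, -2.2, 3.2});
        System.out.println(Arrays.toString(fit(partitions, 20)));
    }
}
```

Only the small statistics arrays cross partition boundaries, never the raw data, which is what makes this pattern a natural fit for a distributed engine.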
RUNNING CODE EXAMPLE
// Set up the Flink session.
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// Load the data stream.
String filename = "hdfs://dataFlink_month0.arff";
DataFlink<DataInstance> data = DataFlinkLoader.loadDataFromFolder(env, filename, false);

// Build the model.
Model model = new CustomGaussianMixture(data.getAttributes());

// Learn the model.
model.updateModel(data);
DATA STREAMS
Data Streams
Update your models when new data is available. This makes our toolbox appropriate for learning from data streams.
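One common way to update a Bayesian model as new data arrives is to treat the posterior after one batch as the prior for the next. The sketch below illustrates that idea with a deliberately tiny Beta-Bernoulli model. It is an assumption-laden toy and not AMIDST code: the class StreamingUpdateSketch, its updateModel method, and the monthly batches are all hypothetical and only mirror the naming used on the slides.

```java
// Illustrative only: a Beta-Bernoulli model updated batch by batch, where the
// posterior after one batch becomes the prior for the next. Not AMIDST code.
final class StreamingUpdateSketch {
    private double alpha; // pseudo-count of positive outcomes (e.g. defaults)
    private double beta;  // pseudo-count of negative outcomes

    StreamingUpdateSketch(double priorAlpha, double priorBeta) {
        this.alpha = priorAlpha;
        this.beta = priorBeta;
    }

    // Fold one new batch of binary outcomes into the current posterior.
    void updateModel(boolean[] batch) {
        for (boolean outcome : batch) {
            if (outcome) {
                alpha += 1;
            } else {
                beta += 1;
            }
        }
    }

    // Posterior mean of the probability of a positive outcome.
    double posteriorMean() {
        return alpha / (alpha + beta);
    }

    public static void main(String[] args) {
        StreamingUpdateSketch model = new StreamingUpdateSketch(1.0, 1.0);
        boolean[][] monthlyBatches = { // made-up data, one row per month
            {true, false, false, false},
            {false, false, true, false},
            {false, false, false, false}
        };
        for (boolean[] batch : monthlyBatches) {
            model.updateModel(batch);
            System.out.println(model.posteriorMean());
        }
    }
}
```

Because the update only touches the running counts, earlier batches never need to be revisited, which is the property that makes this style of learning suitable for streams.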
RUNNING CODE EXAMPLE
// Set up the Flink session.
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// Load the data stream.
String filename = "hdfs://dataFlink_month0.arff";
DataFlink<DataInstance> data = DataFlinkLoader.loadDataFromFolder(env, filename, false);

// Build the model.
Model model = new CustomGaussianMixture(data.getAttributes());

// Learn the model.
model.updateModel(data);

// Update the model with each subsequent month of data.
for (int i = 1; i < 12; i++) {
    filename = "dataFlink_month" + i + ".arff";
    data = DataFlinkLoader.loadDataFromFolder(env, filename, false);
    model.updateModel(data);
}
RUNNING USE CASE
Predicting Defaulting Clients
§ BCC’s old models, based on logistic regression, achieved an AUC of 0.816.
§ AMIDST’s models achieve an AUC of 0.952.
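For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen defaulting client receives a higher score than a randomly chosen non-defaulting one. The sketch below computes it by direct pairwise comparison. It is a generic illustration with made-up scores, not the evaluation code behind the figures above, and the class and method names are assumptions.

```java
// Illustrative only: AUC computed by comparing every positive/negative pair.
final class AucSketch {

    // AUC = P(score of a random positive > score of a random negative);
    // tied scores count as half a win.
    static double auc(double[] scores, boolean[] labels) {
        if (scores.length != labels.length) {
            throw new IllegalArgumentException("scores and labels must have equal length");
        }
        double wins = 0.0;
        long pairs = 0;
        for (int i = 0; i < scores.length; i++) {
            if (!labels[i]) {
                continue; // outer loop visits positives only
            }
            for (int j = 0; j < scores.length; j++) {
                if (labels[j]) {
                    continue; // inner loop visits negatives only
                }
                pairs++;
                if (scores[i] > scores[j]) {
                    wins += 1.0;
                } else if (scores[i] == scores[j]) {
                    wins += 0.5;
                }
            }
        }
        if (pairs == 0) {
            throw new IllegalArgumentException("need at least one positive and one negative");
        }
        return wins / pairs;
    }

    public static void main(String[] args) {
        double[] scores = {0.9, 0.8, 0.35, 0.1};      // made-up predicted probabilities
        boolean[] labels = {true, false, true, false}; // true = client defaulted
        System.out.println(auc(scores, labels));       // 3 of 4 pairs ranked correctly
    }
}
```

The pairwise loop is quadratic in the number of examples, so a rank-based formulation would be the usual choice for data at the scale described in this talk.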
LARGE-SCALE DATA
Scalability analysis
Use your defined models to process massive data sets in a distributed computer cluster using Flink.
One-billion-node probabilistic model
Experiment on a Flink cluster with 16 nodes on AWS.
§ Speedup (with respect to 2 nodes)
[Figure: speedup relative to 2 nodes for 4, 8, and 16 nodes; vertical axis from 0 to 8.]
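Speedup relative to a baseline is conventionally the baseline running time divided by the running time on the larger cluster. With a 2-node baseline, perfectly linear scaling to 16 nodes would give a speedup of 8, which matches the top of the axis range. The sketch below shows the arithmetic only. The running times in it are invented for illustration and are not measurements from the experiment, and the class and method names are assumptions.

```java
// Illustrative only: speedup arithmetic with a 2-node baseline.
final class SpeedupSketch {

    // Measured speedup: baseline running time divided by the new running time.
    static double speedup(double baselineSeconds, double seconds) {
        return baselineSeconds / seconds;
    }

    // Ideal (linear) speedup when going from baselineNodes to nodes.
    static double idealSpeedup(int baselineNodes, int nodes) {
        return (double) nodes / baselineNodes;
    }

    public static void main(String[] args) {
        int baselineNodes = 2;
        double baselineSeconds = 1000.0;                   // hypothetical timing
        int[] nodes = {4, 8, 16};
        double[] seconds = {520.0, 270.0, 145.0};          // hypothetical timings
        for (int i = 0; i < nodes.length; i++) {
            System.out.println(nodes[i] + " nodes: speedup "
                    + speedup(baselineSeconds, seconds[i])
                    + ", ideal " + idealSpeedup(baselineNodes, nodes[i]));
        }
    }
}
```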
MODULAR DESIGN
Modular Design
The AMIDST Toolbox follows a modular design. This makes the following easier:
§ Maintaining and enhancing the software.
§ Integrating with external software: HUGIN, MOA, Weka, R.
CONCEPT DRIFT DETECTION
Tracking Concept Drift
Detects changes in customer profiles during the Spanish financial crisis.
Hidden variables are used to capture changes in customer profiles.
MODEL
RUNNING CODE
// Set up the Flink session.
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// Load the data stream.
String filename = "hdfs://dataFlink_month0.arff";
DataFlink<DataInstance> data = DataFlinkLoader.loadDataFromFolder(env, filename, false);

// Build the model.
Model model = new ConceptDriftDetector(data.getAttributes());

// Learn the model.
model.updateModel(data);

// Update the model each month and print the posterior of the hidden variable.
for (int i = 1; i < 12; i++) {
    filename = "dataFlink_month" + i + ".arff";
    data = DataFlinkLoader.loadDataFromFolder(env, filename, false);
    model.updateModel(data);
    System.out.println(model.getPosteriorDistribution("hiddenVar").toString());
}
Hidden Variable Captures Concept Drift
Drift Pattern: Seasonal + Global trend
RESULTS
Unemployment rate is the main driver of concept drift
The hidden variable correlates with the unemployment rate (rho = 0.961).
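The slide does not say which correlation coefficient rho denotes. Assuming it is the Pearson correlation between two time series, it could be computed along the lines of the sketch below. The input series are made up for illustration and the class and method names are assumptions. This is not the analysis code behind the reported value.

```java
// Illustrative only: Pearson correlation between two equally long series.
final class PearsonSketch {

    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two series of equal length >= 2");
        }
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < x.length; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= x.length;
        meanY /= y.length;

        double sxy = 0.0; // sum of products of deviations
        double sxx = 0.0; // sum of squared deviations of x
        double syy = 0.0; // sum of squared deviations of y
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    public static void main(String[] args) {
        double[] hiddenVariableMean = {0.1, 0.4, 0.5, 0.9}; // made-up series
        double[] unemploymentRate = {8.0, 11.0, 13.0, 18.0}; // made-up series
        System.out.println(pearson(hiddenVariableMean, unemploymentRate));
    }
}
```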
RESULTS
COLLABORATE
www.amidsttoolbox.com
github.com/amidst/toolbox
Apache License 2.0
Thanks for your attention
@AmidstToolbox
www.amidsttoolbox.com