visualizing the model selection process
TRANSCRIPT
![Page 1: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/1.jpg)
Visualizing the Model Selection
ProcessBenjamin Bengfort
@bbengfort District Data Labs
![Page 2: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/2.jpg)
Abstract
Machine learning is the hacker art of describing the features of instances that we want to make predictions about, then fitting the data that describes those instances to a model form. Applied machine learning has come a long way from it's beginnings in academia, and with tools like Scikit-Learn, it's easier than ever to generate operational models for a wide variety of applications. Thanks to the ease and variety of the tools in Scikit-Learn, the primary job of the data scientist is model selection. Model selection involves performing feature engineering, hyperparameter tuning, and algorithm selection. These dimensions of machine learning often lead computer scientists towards automatic model selection via optimization (maximization) of a model's evaluation metric. However, the search space is large, and grid search approaches to machine learning can easily lead to failure and frustration. Human intuition is still essential to machine learning, and visual analysis in concert with automatic methods can allow data scientists to steer model selection towards better fitted models, faster. In this talk, we will discuss interactive visual methods for better understanding, steering, and tuning machine learning models.
![Page 3: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/3.jpg)
So I read about this great ML model
![Page 4: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/4.jpg)
Koren, Yehuda, Robert Bell, and Chris Volinsky. "Matrix factorization techniques for recommender systems." Computer 42.8 (2009): 30-37.
![Page 5: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/5.jpg)
def nnmf(R, k=2, steps=5000, alpha=0.0002, beta=0.02):
n, m = R.shape
P = np.random.rand(n,k)
Q = np.random.rand(m,k).T
for step in range(steps):
for idx in range(n):
for jdx in range(m):
if R[idx][jdx] > 0:
eij = R[idx][jdx] - np.dot(P[idx,:], Q[:,jdx])
for kdx in range(K):
P[idx][kdx] = P[idx][kdx] + alpha * (2 * eij * Q[kdx][jdx] - beta * P[idx][kdx])
Q[kdx][jdx] = Q[kdx][jdx] + alpha * (2 * eij * P[idx][kdx] - beta * Q[kdx][jdx])
e = 0
for idx in range(n):
for jdx in range(m):
if R[idx][jdx] > 0:
e += (R[idx][jdx] - np.dot(P[idx,:], Q[:,jdx])) ** 2
if e < 0.001:
break
return P, Q.T
![Page 6: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/6.jpg)
Life with Scikit-Learn
![Page 7: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/7.jpg)
from sklearn.decomposition import NMF
model = NMF(n_components=2, init='random', random_state=0)
model.fit(R)
![Page 8: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/8.jpg)
from sklearn.decomposition import NMF, TruncatedSVD, PCA
models = [
NMF(n_components=2, init='random', random_state=0),
TruncatedSVD(n_components=2),
PCA(n_components=2),
]
for model in models:
model.fit(R)
![Page 9: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/9.jpg)
So now I’m all
![Page 10: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/10.jpg)
Made Possible by the Scikit-Learn API
Buitinck, Lars, et al. "API design for machine learning software: experiences from the scikit-learn project." arXiv preprint arXiv:1309.0238 (2013).
class Estimator(object):
def fit(self, X, y=None):
"""
Fits estimator to data.
"""
# set state of self
return self
def predict(self, X):
"""
Predict response of X
"""
# compute predictions pred
return pred
class Transformer(Estimator):
def transform(self, X):
"""
Transforms the input data.
"""
# transform X to X_prime
return X_prime
class Pipeline(Transfomer):
@property
def named_steps(self):
"""
Returns a sequence of estimators
"""
return self.steps
@property
def _final_estimator(self):
"""
Terminating estimator
"""
return self.steps[-1]
![Page 11: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/11.jpg)
Algorithm design stays in the hands of
Academia
![Page 12: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/12.jpg)
Wizardry When Applied
![Page 13: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/13.jpg)
The Model Selection TripleArun Kumar http://bit.ly/2abVNrI
Feature Analysis
Algorithm Selection
Hyperparameter Tuning
![Page 14: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/14.jpg)
The Model Selection Triple
- Define a bounded, high dimensional feature space that can be effectively modeled.
- Transform and manipulate the space to make modeling easier.
- Extract a feature representation of each instance in the space.
Feature Analysis
![Page 15: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/15.jpg)
Algorithm Selection
The Model Selection Triple
- Select a model family that best/correctly defines the relationship between the variables of interest.
- Define a model form that specifies exactly how features interact to make a prediction.
- Train a fitted model by optimizing internal parameters to the data.
![Page 16: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/16.jpg)
Hyperparameter Tuning
The Model Selection Triple
- Evaluate how the model form is interacting with the feature space.
- Identify hyperparameters (parameters that affect training or the prior, not prediction)
- Tune the fitting and prediction process by modifying these params.
![Page 17: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/17.jpg)
Can it be automated?
![Page 18: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/18.jpg)
Regularization is a form of automatic feature analysis.
X0
X1
X0
X1
L1 NormalizationPossibility that a feature is eliminated by setting its
coefficient equal to zero.
L2 NormalizationFeatures are kept balanced by minimizing the relative change of coefficients during learning.
![Page 19: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/19.jpg)
Automatic Model Selection Criteria
from sklearn.cross_validation import KFold
kfolds = KFold(n=len(X), n_folds=12)
scores = [
model.fit(
X[train], y[train]
).score(
X[test], y[test]
)
for train, test in kfolds
]
F1
R2
![Page 20: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/20.jpg)
Automatic Model Selection: Try Them All!
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn import cross_validation as cv
classifiers = [
KNeighborsClassifier(5),
SVC(kernel="linear", C=0.025),
RandomForestClassifier(max_depth=5),
AdaBoostClassifier(),
GaussianNB(),
]
kfold = cv.KFold(len(X), n_folds=12)
max([
cv.cross_val_score(model, X, y, cv=kfold).mean
for model in classifiers
])
![Page 21: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/21.jpg)
Automatic Model Selection: Search Param Space
from sklearn.feature_extraction.text import *
from sklearn.linear_model import SGDClassifier
from sklearn.grid_search import GridSearchCV
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
('vect', CountVectorizer()),
('tfidf', TfidfTransformer()),
('model', SGDClassifier()),
])
parameters = {
'vect__max_df': (0.5, 0.75, 1.0),
'vect__max_features': (None, 5000, 10000),
'tfidf__use_idf': (True, False),
'tfidf__norm': ('l1', 'l2'),
'model__alpha': (0.00001, 0.000001),
'model__penalty': ('l2', 'elasticnet'),
}
search = GridSearchCV(pipeline, parameters)
search.fit(X, y)
![Page 22: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/22.jpg)
Maybe not so Wizard?
![Page 23: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/23.jpg)
Automatic Model Selection: Search?
Search is difficult particularly in high dimensional space.
Even with techniques like genetic algorithms or particle swarm optimization, there is no guarantee of a solution.
As the search space gets larger, the amount of time increases exponentially.
![Page 24: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/24.jpg)
Anscombe, Francis J. "Graphs in statistical analysis." The American Statistician 27.1 (1973): 17-21.
Anscombe’s Quartet
![Page 25: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/25.jpg)
Through visualization we can steer the model
selection process
![Page 26: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/26.jpg)
Model Selection Management Systems
Kumar, Arun, et al. "Model selection management systems: The next frontier of advanced analytics." ACM SIGMOD Record 44.4 (2016): 17-22.
Optimized Implementations
User Interfaces and DSLs
Model Selection Triples{ {FE} x {AS} X {HT} }
![Page 27: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/27.jpg)
Can we visualize machine learning?
![Page 28: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/28.jpg)
Data ManagementWrangling
StandardizationNormalization
Selection & Joins
Model Evaluation + Hyperparameter Tuning
Model Selection
Feature Analysis
Linear Models
Nearest Neighbors SVM
Ensemble Trees Bayes
Feature Analysis
Feature Selection
Model Selection
Revisit Features
Iterate!
Initial Model
Model Storage
![Page 29: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/29.jpg)
Data and Model Management
![Page 30: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/30.jpg)
Is “GitHub for Data” Enough?
![Page 31: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/31.jpg)
Visualizing Feature Analysis
![Page 32: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/32.jpg)
SPLOM (Scatterplot Matrices)
![Page 33: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/33.jpg)
Seo, Jinwook, and Ben Shneiderman. "A rank-by-feature framework for interactive exploration of multidimensional data." Information visualization 4.2 (2005): 96-113.
Visual Rank by Feature: 1 Dimension
Rank by:
1. Normality of distribution (Shapiro-Wilk and Kolmogorov-Smirnov)
2. Uniformity of distribution (entropy)
3. Number of potential outliers 4. Number of hapaxes 5. Size of gap
![Page 34: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/34.jpg)
Seo, Jinwook, and Ben Shneiderman. "A rank-by-feature framework for interactive exploration of multidimensional data." Information visualization 4.2 (2005): 96-113.
Visual Rank by Feature: 1 Dimension
Rank by:
1. Normality of distribution (Shapiro-Wilk and Kolmogorov-Smirnov)
2. Uniformity of distribution (entropy)
3. Number of potential outliers 4. Number of hapaxes 5. Size of gap
![Page 35: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/35.jpg)
Seo, Jinwook, and Ben Shneiderman. "A rank-by-feature framework for interactive exploration of multidimensional data." Information visualization 4.2 (2005): 96-113.
Visual Rank by Feature: 2 Dimensions
Rank by:
1. Correlation Coefficient (Pearson, Spearman)
2. Least-squares error3. Quadracity 4. Density based outlier
detection. 5. Uniformity (entropy of grids)6. Number of items in the most
dense region of the plot.
![Page 36: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/36.jpg)
Joint Plots: Diving Deeper after Rank by FeatureSpecial thanks to Seaborn for doing statistical visualization right!
![Page 37: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/37.jpg)
Detecting Separablity
![Page 38: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/38.jpg)
Radviz: Radial Visualization
![Page 39: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/39.jpg)
Parallel Coordinates
![Page 40: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/40.jpg)
Decomposition (PCA, SVD) of Feature Space
![Page 41: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/41.jpg)
Visualizing Model Selection
![Page 42: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/42.jpg)
Confusion Matrices
![Page 43: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/43.jpg)
Receiver Operator Characteristic (ROC) and Area Under Curve (AUC)
![Page 44: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/44.jpg)
Prediction Error Plots
![Page 45: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/45.jpg)
Visualizing Residuals
![Page 46: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/46.jpg)
Model Families vs. Model Forms vs. Fitted ModelsRebecca Bilbro http://bit.ly/2a1YoTs
![Page 48: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/48.jpg)
Visualizing Evaluation/Tuning
![Page 49: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/49.jpg)
Cross Validation Curves
![Page 50: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/50.jpg)
Visual Grid Search
![Page 51: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/51.jpg)
Integrating Visual Model Selection with Scikit-Learn
Yellowbrick
![Page 52: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/52.jpg)
Scikit-Learn Pipelines: fit() and predict()
Data Loader
Transformer
Transformer
Estimator
Data Loader
Transformer
Transformer
Estimator
Transformer
![Page 53: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/53.jpg)
Yellowbrick Visual Transformers
Data Loader
Transformer(s)
Feature Visualization
Estimator
fit() draw()
predict()
Data Loader
Transformer(s)
EstimatorCV
Evaluation Visualization
fit() predict()score()draw()
![Page 54: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/54.jpg)
Model Selection Pipelines
Multi-Estimator Visualization
Data Loader
Transformer(s)
EstimatorEstimatorEstimatorEstimator
Cross Validation Cross Validation Cross Validation Cross Validation
![Page 55: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/55.jpg)
Employ Interactivity to Visualize More
Health and Wealth of Nations Recreated by Mike Bostock Originally by Hans Rosling http://bit.ly/29RYBJD
![Page 56: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/56.jpg)
Visual Analytics Mantra: Overview First; Zoom & Filter; Details on Demand
Heer, Jeffrey, and Ben Shneiderman. "Interactive dynamics for visual analysis." Queue 10.2 (2012): 30.
![Page 57: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/57.jpg)
Codename TrinketVisual Model Management System
![Page 58: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/58.jpg)
Yellowbrickhttp://bit.ly/2a5otxB
DDL Trinkethttp://bit.ly/2a2Y0jy
DDL Open Source Projects on GitHub
![Page 59: Visualizing the Model Selection Process](https://reader034.vdocuments.site/reader034/viewer/2022042907/5870669c1a28ab48378b5127/html5/thumbnails/59.jpg)
Questions!