Denny Britz, John Duchi, Michael J. Franklin, Rean Griffith, Joseph Gonzalez, Michael I. Jordan, Tim Kraska, Xinghao Pan, Virginia Smith,
Evan Sparks, Ameet Talwalkar, Andre Wibisono
MLbase
UC Berkeley
<www.mlbase.org>
The Problem
Build a Classifier for X
What you want to do vs. what you have to do:
• Learn the internals of ML: classification algorithms, sampling, feature selection, cross-validation, …
• Potentially learn Spark/Hadoop/…
• Implement 3-4 algorithms
• Implement grid search to find the right algorithm parameters
• Implement validation algorithms
• Experiment with different sampling sizes, algorithms, features
• …
and in the end
Ask For Help
Use Cases
Supervised Classification: ALS Prediction

var X = load("als_clinical", 2 to 10)
var y = load("als_clinical", 1)
var (fn-model, summary) = top(doClassify(X, y), 5min)

Unsupervised Feature Extraction: Twitter

var G = loadGraph("twitter_network")
var hub-nodes = findTopKDegreeNodes(G, k = 1000)
var text-features = textFeaturize(load("twitter_tweet_data"))
var T-hub = join(hub-nodes, "u-id", text-features, "u-id")
findTopFeatures(T-hub)
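The top(doClassify(X, y), 5min) call hides the whole search loop from the user. As a rough illustration of the semantics (not MLbase's actual implementation), a time-budgeted model search in Python might look like this; the candidate interface and the toy majority-class learner are assumptions for illustration:

```python
import time

def top(candidates, budget_s, X, y):
    """Hypothetical sketch of top(doClassify(X, y), budget): try candidate
    learners until the time budget expires and return the best
    (fn-model, summary) pair seen so far."""
    best_fn, best_acc = None, -1.0
    start = time.time()
    for name, fit in candidates:
        if time.time() - start > budget_s:
            break  # budget exhausted: return the best result so far
        fn, acc = fit(X, y)
        if acc > best_acc:
            best_fn, best_acc = fn, acc
    return best_fn, {"best_accuracy": best_acc}

# Toy candidate learner: always predict the majority class (illustrative only).
def majority(X, y):
    label = max(set(y), key=y.count)
    return (lambda x: label), sum(1 for t in y if t == label) / len(y)

fn_model, summary = top([("majority", majority)], budget_s=1.0,
                        X=[[0], [1], [2]], y=[1, 1, 0])
```

The anytime flavor matters: whatever the budget, the user always gets back the best fn-model found so far plus a summary.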
MLbase Architecture

(Diagram: the user issues a declarative ML task, and the ML developer contributes ML contract + code. The master server parses the task, optimizes it with the COML optimizer (consulting meta-data and statistics), and executes and monitors logical (LLP) and physical (PLP) plans against the binders of algorithms, running on distributed runtimes; the result (e.g., fn-model & summary) is returned to the user.)
• Binders full of algorithms allow adding more operators
• Result: no single correct answer; instead, approximation and continuous refinement
• Statistics about algorithms and data
• Adaptive optimizer estimates run-time and quality improvement
• Distributed runtime built for fast (in-memory) iteration
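The "adaptive optimizer estimates run-time and quality improvement" idea can be sketched with a toy scoring rule. The gain-per-second heuristic and the history numbers below are assumptions for illustration, not MLbase's actual cost model:

```python
def pick_next(history, candidates):
    """Toy adaptive choice: estimate quality gain per second of runtime from
    past runs and pick the candidate with the best ratio (illustrative
    heuristic only)."""
    def score(c):
        # Unknown candidates get a small default prior.
        past = history.get(c, {"gain": 0.01, "seconds": 1.0})
        return past["gain"] / past["seconds"]
    return max(candidates, key=score)

# Hypothetical history: AdaBoost gave a small gain quickly, SVM a larger
# gain but far more slowly.
history = {"AdaBoost": {"gain": 0.05, "seconds": 2.0},
           "SVM-RBF": {"gain": 0.20, "seconds": 40.0}}
best = pick_next(history, ["AdaBoost", "SVM-RBF"])
```

Under this rule the optimizer tries the cheap, quick-win algorithm first, which mirrors the refinement strategy described later in the preliminary results.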
Binders Full of Algorithms
ML Developer provides:

Contract
• Type (e.g., classification)
• Parameters
• Runtime (e.g., O(n))
• Input-Specification
• Output-Specification
• …

+ Implementation using high-level patterns provided by MLbase
Common to state-of-the-art algorithms:
• SVMs, Logistic Regression, Naïve Bayes, LogitBoost, Linear Regression, Ridge Regression, LASSO, Matrix Factorization via SGD, DFC, K-Means, DP-Means
• More to come
• Standalone implementations on Spark (soon!)

Today: Half-Full Binders
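One way to picture the contract an ML developer supplies alongside an implementation is as a small record; the field names below are assumptions drawn from the bullet list above, not MLbase's actual schema:

```python
from dataclasses import dataclass

@dataclass
class MLContract:
    """Sketch of an algorithm's contract; field names are assumptions
    based on the contract bullets above."""
    algo_type: str    # e.g., "classification"
    parameters: dict  # tunable parameters and candidate values
    runtime: str      # e.g., "O(n)"
    input_spec: str
    output_spec: str

# Hypothetical contract an SVM implementation might ship with.
svm = MLContract(
    algo_type="classification",
    parameters={"lambda": [1e-6, 1e-3, 1, 1e3, 1e6],
                "kernel": ["RBF", "linear"]},
    runtime="O(n^2)",
    input_spec="feature matrix X, binary labels y",
    output_spec="fn-model: x -> label",
)
```

The optimizer only needs the contract, never the implementation internals, to decide whether an algorithm is applicable and how expensive it will be.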
(1) MQL:

var X = load("als_clinical", 2 to 10)
var y = load("als_clinical", 1)
var (fn-model, summary) = top(doClassify(X, y), 10min)

Optimization → (2) Generic Logical Plan:

(Diagram: load(als_clinical) yields (X, y); down-sample to (X', y'); featurization (original, normalized, binned, …); grid-search over techniques and parameters (SVM with RBF or linear kernel and regularization; Adaboost with stumps and rounds) to configure the model; train model; cross-validate; top-1 selects the fn-model; model/data interpretation produces (fn-model, summary).)
Possible Optimizations (classification)

(Diagram, optimized plan: load(als_clinical) yields (X, y); down-sample 10% to (X', y'); standard feature normalizer; create 10 folds and store normalized folds; cross-validation on (X'', y'') for each candidate configuration, e.g., SVM (kernel: RBF, λ = 10⁶, σ = 1/d × 10⁶), SVM (kernel: RBF, λ = 10³, σ = 1/d × 10⁶), AdaBoost (rounds = 20), …, plus baseline checks (nearest neighbor; most common label); each branch trains a model and calculates the misclassification rate, yielding (model-params, cross-validation-summary); top-1 picks the fn-model, and (fn-model, summary) is returned.)
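The two baseline checks in the optimized plan, most-common-label and nearest-neighbor, are simple enough to sketch directly. The implementations and the tiny dataset below are illustrative, not MLbase code:

```python
def most_common_label_baseline(y_train):
    """Baseline check: always predict the most frequent training label."""
    label = max(set(y_train), key=y_train.count)
    return lambda x: label

def nearest_neighbor_baseline(X_train, y_train):
    """Baseline check: 1-nearest-neighbor under squared Euclidean distance."""
    def predict(x):
        dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in X_train]
        return y_train[dists.index(min(dists))]
    return predict

# Tiny illustrative dataset.
X = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
y = [0, 0, 1]
```

Running such cheap baselines alongside cross-validation gives the optimizer a sanity floor: any tuned model that cannot beat them is not worth returning.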
Relational Optimizations (Top-K pushdown, join ordering, …)

Static ML Selection Rules
• Imbalance of labels
• SVMs are more sensitive to the scale parameter than AdaBoost is to rounds
• If SVM → normalize data between [-1, 1]
• If data contains outliers → pre-clean data or forego AdaBoost
• …

Run-Time Optimization Rules
• Caching: if 2nd run and deterministic, start with the previously most successful model
• Set sample size to fit input data as well as intermediate results in memory
• Partition data according to cross-validation
• …

Cost-based Optimization Rules
• Expected quality improvement based on the history
• Consider cost of pre-cleaning, normalization, algorithm complexity, …
• …
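A couple of the static selection rules above can be written down as ordinary code. The rule wording is taken from the list, but the 90% imbalance threshold is an illustrative assumption:

```python
def static_rules(algo, X, y):
    """Sketch of two static ML-selection rules; the 0.9 imbalance
    threshold is an illustrative assumption."""
    plan = []
    counts = {label: y.count(label) for label in set(y)}
    if max(counts.values()) / len(y) > 0.9:
        plan.append("warn: imbalanced labels")
    if algo == "SVM":
        # Rule from the list above: if SVM -> normalize data between [-1, 1].
        plan.append("normalize features to [-1, 1]")
    return plan

plan = static_rules("SVM", X=[[1], [2], [3]], y=[0, 1, 1])
```

Because such rules are cheap to evaluate, they can prune or reorder the candidate plans before any expensive training starts.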
Why Optimize?
Dataset      SVM (original)   SVM (scaled)   AdaBoost
a1a              82.93            82.93        82.87
australian       85.22            85.51        86.23
breast           70.13            97.22        96.48
diabetes         76.44            77.61        76.17
fourclass       100.00            99.77        91.19
splice           88.00            87.60        91.20

Table 3: Classifier accuracy using SVM with an RBF kernel and using AdaBoost
3.6.3 Preliminary Results
To demonstrate the possible advantages of an optimizer just for selecting among different ML algorithms, even without considering the system aspect, we implemented a prototype using two algorithms: SVM and AdaBoost. For both algorithms, we used publicly available implementations: LIBSVM [33] for SVM and the ML AdaBoost Toolbox [1] for AdaBoost. We evaluated the optimizer for a classification task similar to the one in Figure 2 with 6 datasets from the LIBSVM website: 'a1a', 'australian', 'breast-cancer', 'diabetes', 'fourclass', and 'splice'. To better visualize the impact of finding the best ML model, we performed a full grid search over a fixed set of algorithm parameters, i.e., number of rounds (r) for AdaBoost and regularization (λ) and RBF scale (σ) parameters for SVM. Specifically, we tested r = {25, 50, 100, 200}, λ = {10⁻⁶, 10⁻³, 1, 10³, 10⁶}, and σ = 1/d × {10⁻⁶, 10⁻³, 1, 10³, 10⁶}, where d is the number of features in the dataset. For each algorithm, set of features, and parameter setting, we performed 5-fold cross-validation, and report the average results across the held-out folds.
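The size of this grid search follows directly from the parameter sets: 5 × 5 = 25 SVM configurations plus 4 AdaBoost configurations, each cross-validated 5-fold. A small sketch of how the grids are enumerated (d = 8 is a purely illustrative feature count; d is dataset-specific):

```python
from itertools import product

# Parameter grids from the text; d (number of features) varies per dataset,
# d = 8 here is purely illustrative.
d = 8
rounds = [25, 50, 100, 200]                                   # AdaBoost r
lams = [1e-6, 1e-3, 1, 1e3, 1e6]                              # SVM regularization lambda
sigmas = [(1.0 / d) * s for s in [1e-6, 1e-3, 1, 1e3, 1e6]]   # RBF scale sigma

svm_grid = list(product(lams, sigmas))        # 5 x 5 = 25 SVM configurations
ada_grid = [(r,) for r in rounds]             # 4 AdaBoost configurations
n_fits = 5 * (len(svm_grid) + len(ada_grid))  # 5-fold CV per configuration
```

Even this modest search already requires 145 model fits per featurization of each dataset, which is exactly the cost an optimizer can avoid paying exhaustively.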
Table 3 shows the best accuracy after tuning the parameters using grid search for the different datasets and algorithms, with and without scaling the features (the best combination is marked in bold). The results show, first, that there is no dominant combination for all datasets: sometimes AdaBoost outperforms SVM, and sometimes scaling the features helps, sometimes it does not.
Next we turn to understanding the search problem for the parameters themselves, depicted in Figures 4(a) and 4(b). Figure 4(a) shows, for fixed regularization λ, the impact of the σ parameter in the RBF kernel on the accuracy, whereas Figure 4(b) visualizes the accuracy when varying the number of rounds r for AdaBoost. As shown in Figure 4(a), the choice of σ in the SVM problem clearly has a huge impact on quality; automatically selecting σ is important. On the other hand, for the same datasets, it appears that the number of rounds in AdaBoost is not quite as significant once r ≥ 25 (shown in Figure 4(b)). Hence, an optimizer might decide to initially use AdaBoost - without scaling and with a fixed round parameter - in order to quickly provide the user with a first classifier. Afterwards, the system might explore SVMs with scaled features to improve the model, before extending the search space to the remaining combinations.
The general accuracy of algorithms is just one of the aspects an optimizer may take into account. Statistics about the dataset itself, different data layouts, algorithm speed, and parallel execution strategies (as described in the next section) are just a few additional dimensions the optimizer may exploit to improve the learning process. In this project, we will evaluate these freedoms of choice and build the foundation for cost-based (query) optimization for machine learning.
Why Optimize?
(Figure 4: Parameter Impact. (a) Impact of different σ = 1/d × {10⁻⁶, 10⁻³, 1, 10³, 10⁶} on the SVM accuracy with an RBF kernel and λ = 10⁻⁶ on LIBSVM datasets. (b) Impact of r = {25, 50, 100, 200} on AdaBoost on LIBSVM datasets.)
4 Related Work
COML is not the first system trying to make machine learning more accessible, but it is the first to free users from algorithm choices and to automatically optimize for distributed execution. Probably most related to COML are Weka [4], MADLib [47], and Mahout [3]. Weka is a collection of ML tools for data mining that simplifies their usage by providing a simple UI. Weka, however, requires expert knowledge to choose and configure the ML algorithm and is a single-node system. On the database and distributed side, Mahout's goal is to build a scalable ML library on top of Hadoop, while MADLib provides an ML library for relational database systems. Neither system addresses the (difficult but necessary) challenge of optimizing the learning algorithms.
Google Predict [2] is Google's proprietary web service for prediction problems, but it restricts the maximum training data size to 250MB. In [18], the authors make the case that databases should natively support predictive models and present a first prototype called Longview. We extend this vision by supporting all kinds of ML algorithms, not just predictive models. Furthermore, the focus of this project is on the optimization for ML instead of the language integration within the relational model.
Recently, there have been efforts to build distributed run-times for more advanced analytical tasks. For example, Hyracks [27], HaLoop [32], and AMPLab's Spark [72, 6] have special iterative in-memory operations to better support ML algorithms. As mentioned earlier, the goal of this project is not to invent a new run-time for machine learning; instead we will use Spark.
SystemML [46] proposes an R-like language and shows how it can be optimized and compiled down to MapReduce. However, SystemML aims to support ML experts in developing efficient distributed algorithms and does not aim at simplifying the use of ML, for example, by automatically tuning the training step. Still, the ideas of SystemML are compelling and we might leverage them as part of our physical plan optimization. In [41], the authors show how many ML algorithms can be expressed as a relational-friendly convex-optimization problem, whereas the authors of [70] present techniques to optimize inference algorithms in a probabilistic DBMS. We leverage these techniques in our run-time, but our system aims beyond a single machine and extends the presented optimization techniques.
This project builds on the strong foundation of declarative languages and query optimization.
(Figure: SVM and AdaBoost parameter sensitivity: regularization and scale-factor.)
Summary

• MLbase is a first declarative machine-learning system
• It simplifies ML in the same way as databases simplify data management
• MLbase is currently under heavy development, but a first prototype is up and running:
  – Initial language for classification, clustering, and matrix factorization
  – Rule-based optimizer
  – Distributed algorithms for: k-means clustering, LogitBoost, various matrix factorization algorithms, and support vector machines
  – Spark and GraphLab on Spark as the run-time
• We plan to release a first prototype by this summer
tim_kraska@brown.edu <www.mlbase.org>