Simpler Machine Learning with SKLL 1.0
Dan Blanchard, Educational Testing Service
[email protected]
PyData NYC 2014


DESCRIPTION

As the popularity of machine learning techniques spreads to new areas of industry and science, the number of potential machine learning users is growing rapidly. While the fantastic scikit-learn library is widely used in the Python community for tackling such tasks, there are two significant hurdles in place for people working on new machine learning problems:

• Scikit-learn requires writing a fair amount of boilerplate code to run even simple experiments.
• Obtaining good performance typically requires tuning various model parameters, which can be particularly challenging for beginners.

SciKit-Learn Laboratory (SKLL) is an open source Python package, originally developed by the NLP & Speech group at the Educational Testing Service (ETS), that addresses these issues by providing the ability to run scikit-learn experiments with tuned models without writing any code beyond what generates the features. This talk provides an overview of performing common machine learning tasks with SKLL and highlights some of the new features present as of the 1.0 release.

TRANSCRIPT

Page 1: Simpler Machine Learning with SKLL 1.0


Pages 2–5: Simpler Machine Learning with SKLL 1.0

(Image slides; page 5 introduces two categories: Survived and Perished.)

Pages 6–7: Simpler Machine Learning with SKLL 1.0

Survived or Perished? Example passengers:

• first class, female, 1 sibling, 35 years old
• third class, female, 2 siblings, 18 years old
• second class, male, 0 siblings, 50 years old

Can we predict survival from data?

Page 8: Simpler Machine Learning with SKLL 1.0

SciKit-Learn Laboratory

Page 9: Simpler Machine Learning with SKLL 1.0

SKLL

It's where the learning happens

Page 10: Simpler Machine Learning with SKLL 1.0

Learning to Predict Survival

1. Split up the given training set: train (80%) and dev (20%)

$ ./make_titanic_example_data.py
Loading train.csv... done
Writing titanic/train/socioeconomic.csv...done
Writing titanic/train/family.csv...done
Writing titanic/train/vitals.csv...done
Writing titanic/train/misc.csv...done
Writing titanic/train+dev/socioeconomic.csv...done
Writing titanic/train+dev/family.csv...done
Writing titanic/train+dev/vitals.csv...done
Writing titanic/train+dev/misc.csv...done
Writing titanic/dev/socioeconomic.csv...done
Writing titanic/dev/family.csv...done
Writing titanic/dev/vitals.csv...done
Writing titanic/dev/misc.csv...done
Loading test.csv... done
Writing titanic/test/socioeconomic.csv...done
Writing titanic/test/family.csv...done
Writing titanic/test/vitals.csv...done
Writing titanic/test/misc.csv...done
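The split itself is the standard 80/20 shuffle-and-cut. A minimal sketch in plain Python (not the actual script, which also writes each split out as four separate feature files); with the 891 passengers in the Kaggle training set it reproduces the 712/179 sizes seen later in the talk:

```python
import random

def train_dev_split(rows, train_frac=0.8, seed=12345):
    """Shuffle rows and split them into train/dev portions."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(891))  # the Kaggle Titanic training set has 891 passengers
train, dev = train_dev_split(rows)
print(len(train), len(dev))  # 712 179
```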

Page 11: Simpler Machine Learning with SKLL 1.0

Learning to Predict Survival

2. Pick classifiers to try:

1. Decision Tree
2. Naive Bayes
3. Random Forest
4. Support Vector Machine (SVM)

Page 12: Simpler Machine Learning with SKLL 1.0

Learning to Predict Survival

3. Create configuration file for SKLL

[General]
experiment_name = Titanic_Evaluate_Untuned
task = evaluate

[Input]
train_directory = train
test_directory = dev
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
learners = ["RandomForestClassifier", "DecisionTreeClassifier", "SVC", "MultinomialNB"]
id_col = PassengerId
label_col = Survived

[Output]
results = output
models = output
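The configuration file is standard INI syntax, so it can be inspected with Python's built-in configparser; a sketch, making no claims about how SKLL's own parser works internally. The list-valued options are JSON strings, so json.loads recovers them here:

```python
import configparser
import json

config_text = """
[General]
experiment_name = Titanic_Evaluate_Untuned
task = evaluate

[Input]
train_directory = train
test_directory = dev
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
learners = ["RandomForestClassifier", "DecisionTreeClassifier", "SVC", "MultinomialNB"]
id_col = PassengerId
label_col = Survived

[Output]
results = output
models = output
"""

parser = configparser.ConfigParser()
parser.read_string(config_text)

# Scalar options come back as plain strings
print(parser["General"]["task"])  # evaluate

# List-valued options are JSON strings; decode them to Python lists
learners = json.loads(parser["Input"]["learners"])
print(learners[0])  # RandomForestClassifier
```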

Pages 13–22: Simpler Machine Learning with SKLL 1.0

Pages 13–22 repeat the same configuration file, each annotating one setting:

• train_directory: directory with feature files for training learner
• test_directory: directory with feature files for evaluating performance
• family.csv: # of siblings, spouses, parents, children
• misc.csv: departure port
• socioeconomic.csv: fare & passenger class
• vitals.csv: sex & age
• results: directory to store evaluation results
• models: directory to store trained models
Page 23: Simpler Machine Learning with SKLL 1.0

Learning to Predict Survival

4. Run the configuration file with run_experiment

$ run_experiment evaluate.cfg
Loading train/family.csv... done
Loading train/misc.csv... done
Loading train/socioeconomic.csv... done
Loading train/vitals.csv... done
Loading dev/family.csv... done
Loading dev/misc.csv... done
Loading dev/socioeconomic.csv... done
Loading dev/vitals.csv... done
Loading train/family.csv... done
Loading train/misc.csv... done
Loading train/socioeconomic.csv... done
Loading train/vitals.csv... done
Loading dev/family.csv... done
...

Page 24: Simpler Machine Learning with SKLL 1.0

Learning to Predict Survival

5. Examine results

Experiment Name: Titanic_Evaluate_Untuned
SKLL Version: 1.0.0
Training Set: train (712)
Test Set: dev (179)
Feature Set: ["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]
Learner: RandomForestClassifier
Scikit-learn Version: 0.15.2
Total Time: 0:00:02.065403

+-------+------+------+-----------+--------+-----------+
|       |  0.0 |  1.0 | Precision | Recall | F-measure |
+-------+------+------+-----------+--------+-----------+
| 0.000 | [96] |   19 |     0.865 |  0.835 |     0.850 |
+-------+------+------+-----------+--------+-----------+
| 1.000 |   15 | [49] |     0.721 |  0.766 |     0.742 |
+-------+------+------+-----------+--------+-----------+
(row = reference; column = predicted)

Accuracy = 0.8100558659217877
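The per-class columns in that table follow directly from the confusion matrix; a quick check in plain Python:

```python
# Confusion matrix from the results above (row = reference, column = predicted)
conf = [[96, 19],   # true class 0.0: 96 correct, 19 predicted as 1.0
        [15, 49]]   # true class 1.0: 15 predicted as 0.0, 49 correct

def class_metrics(conf, k):
    """Precision, recall, and F-measure for class index k."""
    tp = conf[k][k]
    precision = tp / sum(row[k] for row in conf)  # over everything predicted k
    recall = tp / sum(conf[k])                    # over everything actually k
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

accuracy = sum(conf[i][i] for i in range(2)) / sum(map(sum, conf))
print(round(accuracy, 4))                             # 0.8101
print([round(m, 3) for m in class_metrics(conf, 0)])  # [0.865, 0.835, 0.85]
print([round(m, 3) for m in class_metrics(conf, 1)])  # [0.721, 0.766, 0.742]
```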

Page 25: Simpler Machine Learning with SKLL 1.0

Aggregate Evaluation Results

Learner                    Dev. Accuracy
RandomForestClassifier     0.8101
DecisionTreeClassifier     0.7989
SVC                        0.7709
MultinomialNB              0.7095

Page 26: Simpler Machine Learning with SKLL 1.0

Tuning the Learner

Can we do better than default hyperparameters?

[General]
experiment_name = Titanic_Evaluate
task = evaluate

[Input]
train_directory = train
test_directory = dev
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
learners = ["RandomForestClassifier", "DecisionTreeClassifier", "SVC", "MultinomialNB"]
id_col = PassengerId
label_col = Survived

[Tuning]
grid_search = true
objective = accuracy

[Output]
results = output

Page 27: Simpler Machine Learning with SKLL 1.0

Tuned Evaluation Results

Learner                    Untuned Accuracy   Tuned Accuracy
RandomForestClassifier     0.8101             0.8380
DecisionTreeClassifier     0.7989             0.7989
SVC                        0.7709             0.8156
MultinomialNB              0.7095             0.7095
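With grid_search = true, SKLL tries a grid of hyperparameter values for each learner and keeps the combination that maximizes the chosen objective. The idea, sketched in plain Python over a made-up parameter grid and a stand-in scoring function (neither is SKLL's actual grid or objective):

```python
from itertools import product

# Hypothetical hyperparameter grid for an SVM-like learner
grid = {"C": [0.01, 0.1, 1.0, 10.0], "gamma": [0.001, 0.01, 0.1]}

def toy_objective(C, gamma):
    """Stand-in for cross-validated accuracy; peaks at C=1.0, gamma=0.01."""
    return 1.0 - 0.01 * abs(C - 1.0) - abs(gamma - 0.01)

# Score every point on the grid and keep the best setting
candidates = [{"C": C, "gamma": g} for C, g in product(grid["C"], grid["gamma"])]
best_params = max(candidates, key=lambda p: toy_objective(p["C"], p["gamma"]))
print(best_params)  # {'C': 1.0, 'gamma': 0.01}
```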

Page 28: Simpler Machine Learning with SKLL 1.0

Using All Available Data

Use training and dev to generate predictions on test

[General]
experiment_name = Titanic_Predict
task = predict

[Input]
train_directory = train+dev
test_directory = test
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
learners = ["RandomForestClassifier", "DecisionTreeClassifier", "SVC", "MultinomialNB"]
id_col = PassengerId
label_col = Survived

[Tuning]
grid_search = true
objective = accuracy

[Output]
results = output

Page 29: Simpler Machine Learning with SKLL 1.0

Test Set Accuracy

                           Train only          Train + Dev
Learner                    Untuned   Tuned     Untuned   Tuned
RandomForestClassifier     0.727     0.756     0.746     0.780
DecisionTreeClassifier     0.703     0.742     0.670     0.742
SVC                        0.608     0.679     0.612     0.679
MultinomialNB              0.627     0.627     0.622     0.622

Page 30: Simpler Machine Learning with SKLL 1.0

Advanced SKLL Features

• Read & write .arff, .csv, .jsonlines, .libsvm, .megam, .ndj, and .tsv data
• Parameter grids for all supported scikit-learn learners
• Custom learners
• Parallelize experiments on DRMAA clusters via GridMap
• Ablation experiments
• Collapse/rename classes from config file
• Feature scaling
• Rescale predictions to be closer to observed data
• Command-line tools for joining, filtering, and converting feature files
• Python API
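A featureset like the one used throughout this talk merges several feature files by example ID before training. The join idea can be sketched with the csv module; the file contents here are tiny inline stand-ins, not SKLL's actual implementation:

```python
import csv
import io

# Two tiny feature "files" (inline for the example), sharing PassengerId
family_csv = "PassengerId,SibSp\n1,1\n2,2\n"
vitals_csv = "PassengerId,Sex,Age\n1,female,35\n2,female,18\n"

def read_features(text, id_col="PassengerId"):
    """Read a feature file into {example_id: {feature: value}}."""
    rows = {}
    for row in csv.DictReader(io.StringIO(text)):
        ex_id = row.pop(id_col)
        rows[ex_id] = row
    return rows

def join_features(*feature_dicts):
    """Merge feature dicts keyed by example ID."""
    joined = {}
    for d in feature_dicts:
        for ex_id, feats in d.items():
            joined.setdefault(ex_id, {}).update(feats)
    return joined

merged = join_features(read_features(family_csv), read_features(vitals_csv))
print(merged["1"])  # {'SibSp': '1', 'Sex': 'female', 'Age': '35'}
```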

Page 31: Simpler Machine Learning with SKLL 1.0

Currently Supported Learners

Classifiers                      Regressors
Linear Support Vector Machine    Elastic Net
Logistic Regression              Lasso
Multinomial Naive Bayes          Linear

Available as both classifiers and regressors: AdaBoost, Decision Tree, Gradient Boosting, K-Nearest Neighbors, Random Forest, Stochastic Gradient Descent, Support Vector Machine

Page 32: Simpler Machine Learning with SKLL 1.0

Contributors

• Nitin Madnani
• Mike Heilman
• Nils Murrugarra Llerena
• Aoife Cahill
• Diane Napolitano
• Keelan Evanini
• Ben Leong

Page 33: Simpler Machine Learning with SKLL 1.0

References

• Dataset: kaggle.com/c/titanic-gettingStarted
• SKLL GitHub: github.com/EducationalTestingService/skll
• SKLL Docs: skll.readthedocs.org
• Titanic configs and data splitting script in examples dir on GitHub

Twitter: @dsblanch
GitHub: dan-blanchard

Page 34: Simpler Machine Learning with SKLL 1.0

Bonus Slides

Page 35: Simpler Machine Learning with SKLL 1.0

SKLL API

from skll import Learner, Reader

# Load training examples
train_examples = Reader.for_path('myexamples.megam').read()

# Train a linear SVM
learner = Learner('LinearSVC')
learner.train(train_examples)

# Load test examples and evaluate
test_examples = Reader.for_path('test.tsv').read()
conf_matrix, accuracy, prf_dict, model_params, obj_score = learner.evaluate(test_examples)

Pages 36–40: Simpler Machine Learning with SKLL 1.0

Pages 36–40 repeat the code above, annotating the five values returned by learner.evaluate():

• conf_matrix: confusion matrix
• accuracy: overall accuracy on test set
• prf_dict: precision, recall, f-score for each class
• model_params: tuned model parameters
• obj_score: objective function score on test set

Pages 41–45: Simpler Machine Learning with SKLL 1.0

Pages 41–45 extend the example:

# Generate predictions from trained model
predictions = learner.predict(test_examples)

# Perform 10-fold cross-validation with a radial SVM
learner = Learner('SVC')
fold_result_list, grid_search_scores = learner.cross_validate(train_examples)

Annotations for learner.cross_validate():

• fold_result_list: per-fold evaluation results
• grid_search_scores: per-fold training set obj. scores

Page 46: Simpler Machine Learning with SKLL 1.0

SKLL API

import numpy as np
from os.path import join
from skll import FeatureSet, NDJWriter, Writer

# Create some training examples
labels = []
ids = []
features = []
for i in range(num_train_examples):
    label = "dog" if i % 2 == 0 else "cat"
    labels.append(label)
    ids.append("{}{}".format(label, i))
    features.append({"f1": np.random.randint(1, 4),
                     "f2": np.random.randint(1, 4)})
feat_set = FeatureSet('training', ids, labels=labels, features=features)

# Write them to a file
train_path = join(_my_dir, 'train', 'test_summary.jsonlines')
Writer.for_path(train_path, feat_set).write()
# Or, equivalently:
NDJWriter(train_path, feat_set).write()
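The .jsonlines/.ndj format written above is just one JSON object per line, so it can be read back without SKLL. A sketch using in-memory lines; the id/y/x key names here are an assumption about the on-disk layout, so check the SKLL docs before relying on them:

```python
import json

# Hypothetical .jsonlines content: one JSON object per example,
# with assumed "id" (example ID), "y" (label), and "x" (features) keys
lines = [
    '{"id": "dog0", "y": "dog", "x": {"f1": 2, "f2": 3}}',
    '{"id": "cat1", "y": "cat", "x": {"f1": 1, "f2": 2}}',
]

examples = [json.loads(line) for line in lines]
labels = [ex["y"] for ex in examples]
print(labels)  # ['dog', 'cat']
```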