Simpler Machine Learning with SKLL 1.0
DESCRIPTION
As the popularity of machine learning techniques spreads to new areas of industry and science, the number of potential machine learning users is growing rapidly. While the fantastic scikit-learn library is widely used in the Python community for tackling such tasks, there are two significant hurdles in place for people working on new machine learning problems:
• Scikit-learn requires writing a fair amount of boilerplate code to run even simple experiments.
• Obtaining good performance typically requires tuning various model parameters, which can be particularly challenging for beginners.
SciKit-Learn Laboratory (SKLL) is an open source Python package, originally developed by the NLP & Speech group at the Educational Testing Service (ETS), that addresses these issues by providing the ability to run scikit-learn experiments with tuned models without writing any code beyond what generates the features. This talk will provide an overview of performing common machine learning tasks with SKLL, and highlight some of the new features that are present as of the 1.0 release.
TRANSCRIPT
Simpler Machine Learning with SKLL 1.0
Dan Blanchard Educational Testing Service
PyData NYC 2014
Survived or Perished?
• first class, female, 1 sibling, 35 years old
• third class, female, 2 siblings, 18 years old
• second class, male, 0 siblings, 50 years old
Can we predict survival from data?
SciKit-Learn Laboratory
SKLL
It's where the learning happens
Learning to Predict Survival

$ ./make_titanic_example_data.py
Loading train.csv... done
Writing titanic/train/socioeconomic.csv...done
Writing titanic/train/family.csv...done
Writing titanic/train/vitals.csv...done
Writing titanic/train/misc.csv...done
Writing titanic/train+dev/socioeconomic.csv...done
Writing titanic/train+dev/family.csv...done
Writing titanic/train+dev/vitals.csv...done
Writing titanic/train+dev/misc.csv...done
Writing titanic/dev/socioeconomic.csv...done
Writing titanic/dev/family.csv...done
Writing titanic/dev/vitals.csv...done
Writing titanic/dev/misc.csv...done
Loading test.csv... done
Writing titanic/test/socioeconomic.csv...done
Writing titanic/test/family.csv...done
Writing titanic/test/vitals.csv...done
Writing titanic/test/misc.csv...done
1. Split the given training set into train (80%) and dev (20%)
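The 80/20 split in step 1 can be sketched in plain Python. This is a dependency-free illustration of the idea, not SKLL's actual `make_titanic_example_data.py` script; the helper name and seed are hypothetical:

```python
import random

def train_dev_split(rows, train_fraction=0.8, seed=123456789):
    """Shuffle the rows, then split them into train and dev lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train_fraction)
    return rows[:n_train], rows[n_train:]

# The Kaggle Titanic training set has 891 rows
passengers = list(range(891))
train, dev = train_dev_split(passengers)
print(len(train), len(dev))  # -> 712 179
```

With 891 rows, an 80% cut yields the 712-example train set and 179-example dev set seen later in the results slides.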
Learning to Predict Survival
2. Pick classifiers to try:
   1. Decision Tree
   2. Naive Bayes
   3. Random Forest
   4. Support Vector Machine (SVM)
Learning to Predict Survival
3. Create configuration file for SKLL

[General]
experiment_name = Titanic_Evaluate_Untuned
task = evaluate

[Input]
train_directory = train
test_directory = dev
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
learners = ["RandomForestClassifier", "DecisionTreeClassifier", "SVC", "MultinomialNB"]
id_col = PassengerId
label_col = Survived

[Output]
results = output
models = output

Slide callouts:
• train_directory: directory with feature files for training learners
• test_directory: directory with feature files for evaluating performance
• family.csv: # of siblings, spouses, parents, children
• misc.csv: departure port
• socioeconomic.csv: fare & passenger class
• vitals.csv: sex & age
• results: directory to store evaluation results
• models: directory to store trained models
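SKLL configuration files are INI-style files readable with Python's standard configparser, with list-valued options written as JSON. As a sketch, the config above can be inspected programmatically; the literal config text is inlined here for illustration:

```python
import configparser
import json

config_text = """
[General]
experiment_name = Titanic_Evaluate_Untuned
task = evaluate

[Input]
train_directory = train
test_directory = dev
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
learners = ["RandomForestClassifier", "DecisionTreeClassifier", "SVC", "MultinomialNB"]
id_col = PassengerId
label_col = Survived

[Output]
results = output
models = output
"""

parser = configparser.ConfigParser()
parser.read_string(config_text)

# List-valued options are JSON strings; decode them into Python lists
learners = json.loads(parser["Input"]["learners"])
featuresets = json.loads(parser["Input"]["featuresets"])

print(parser["General"]["task"])  # -> evaluate
print(learners[0])                # -> RandomForestClassifier
print(len(featuresets[0]))        # -> 4
```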
Learning to Predict Survival
4. Run the configuration file with run_experiment

$ run_experiment evaluate.cfg
Loading train/family.csv... done
Loading train/misc.csv... done
Loading train/socioeconomic.csv... done
Loading train/vitals.csv... done
Loading dev/family.csv... done
Loading dev/misc.csv... done
Loading dev/socioeconomic.csv... done
Loading dev/vitals.csv... done
Loading train/family.csv... done
Loading train/misc.csv... done
Loading train/socioeconomic.csv... done
Loading train/vitals.csv... done
Loading dev/family.csv... done
...
Learning to Predict Survival
5. Examine results

Experiment Name: Titanic_Evaluate_Untuned
SKLL Version: 1.0.0
Training Set: train (712)
Test Set: dev (179)
Feature Set: ["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]
Learner: RandomForestClassifier
Scikit-learn Version: 0.15.2
Total Time: 0:00:02.065403

+-------+------+------+-----------+--------+-----------+
|       | 0.0  | 1.0  | Precision | Recall | F-measure |
+-------+------+------+-----------+--------+-----------+
| 0.000 | [96] |  19  |   0.865   | 0.835  |   0.850   |
+-------+------+------+-----------+--------+-----------+
| 1.000 |  15  | [49] |   0.721   | 0.766  |   0.742   |
+-------+------+------+-----------+--------+-----------+
(row = reference; column = predicted)
Accuracy = 0.8100558659217877
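The per-class metrics in that table follow directly from the confusion matrix; a quick sanity check in plain Python, treating class 0 as the "positive" class:

```python
# Confusion matrix from the results above (row = reference, column = predicted)
#            pred 0   pred 1
# true 0       96       19
# true 1       15       49

tp0 = 96   # class 0 correctly predicted
fn0 = 19   # class 0 predicted as class 1
fp0 = 15   # class 1 predicted as class 0

precision = tp0 / (tp0 + fp0)          # 96 / 111
recall = tp0 / (tp0 + fn0)             # 96 / 115
f_measure = 2 * precision * recall / (precision + recall)
accuracy = (96 + 49) / (96 + 19 + 15 + 49)

print(round(precision, 3))  # -> 0.865
print(round(recall, 3))     # -> 0.835
print(round(f_measure, 3))  # -> 0.85
print(round(accuracy, 4))   # -> 0.8101
```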
Aggregate Evaluation Results

Learner                   Dev. Accuracy
RandomForestClassifier    0.8101
DecisionTreeClassifier    0.7989
SVC                       0.7709
MultinomialNB             0.7095
Tuning the learner
Can we do better than default hyperparameters?

[General]
experiment_name = Titanic_Evaluate
task = evaluate

[Input]
train_directory = train
test_directory = dev
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
learners = ["RandomForestClassifier", "DecisionTreeClassifier", "SVC", "MultinomialNB"]
id_col = PassengerId
label_col = Survived

[Tuning]
grid_search = true
objective = accuracy

[Output]
results = output
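With grid_search = true, SKLL tunes each learner's hyperparameters by cross-validated grid search against the chosen objective (under the hood this goes through scikit-learn's grid search machinery). The core idea reduces to trying every parameter combination and keeping the best one; a toy, dependency-free sketch with a hypothetical scoring function standing in for cross-validated accuracy:

```python
from itertools import product

def toy_grid_search(score_fn, param_grid):
    """Try every combination in param_grid; return the best-scoring one."""
    names = sorted(param_grid)
    best_score, best_params = float("-inf"), None
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

# Hypothetical objective that peaks at C=1.0, gamma=0.1
def fake_accuracy(C, gamma):
    return 1.0 - abs(C - 1.0) * 0.1 - abs(gamma - 0.1)

best, score = toy_grid_search(fake_accuracy,
                              {"C": [0.1, 1.0, 10.0],
                               "gamma": [0.01, 0.1, 1.0]})
print(best)  # -> {'C': 1.0, 'gamma': 0.1}
```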
Tuned Evaluation Results

Learner                   Untuned Accuracy   Tuned Accuracy
RandomForestClassifier    0.8101             0.8380
DecisionTreeClassifier    0.7989             0.7989
SVC                       0.7709             0.8156
MultinomialNB             0.7095             0.7095
Using All Available Data
Use training and dev to generate predictions on test

[General]
experiment_name = Titanic_Predict
task = predict

[Input]
train_directory = train+dev
test_directory = test
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
learners = ["RandomForestClassifier", "DecisionTreeClassifier", "SVC", "MultinomialNB"]
id_col = PassengerId
label_col = Survived

[Tuning]
grid_search = true
objective = accuracy

[Output]
results = output
Test Set Accuracy

                          Train only          Train + Dev
Learner                   Untuned   Tuned     Untuned   Tuned
RandomForestClassifier    0.727     0.756     0.746     0.780
DecisionTreeClassifier    0.703     0.742     0.670     0.742
SVC                       0.608     0.679     0.612     0.679
MultinomialNB             0.627     0.627     0.622     0.622
Advanced SKLL Features
• Read & write .arff, .csv, .jsonlines, .libsvm, .megam, .ndj, and .tsv data
• Parameter grids for all supported scikit-learn learners
• Custom learners
• Parallelize experiments on DRMAA clusters via GridMap
• Ablation experiments
• Collapse/rename classes from config file
• Feature scaling
• Rescale predictions to be closer to observed data
• Command-line tools for joining, filtering, and converting feature files
• Python API
Currently Supported Learners

Classifiers only:
• Linear Support Vector Machine
• Logistic Regression
• Multinomial Naive Bayes

Regressors only:
• Elastic Net
• Lasso
• Linear Regression

Both classifiers and regressors:
• AdaBoost
• Decision Tree
• Gradient Boosting
• K-Nearest Neighbors
• Random Forest
• Stochastic Gradient Descent
• Support Vector Machine
Contributors
• Nitin Madnani
• Mike Heilman
• Nils Murrugarra Llerena
• Aoife Cahill
• Diane Napolitano
• Keelan Evanini
• Ben Leong
References
• Dataset: kaggle.com/c/titanic-gettingStarted
• SKLL GitHub: github.com/EducationalTestingService/skll
• SKLL Docs: skll.readthedocs.org
• Titanic configs and data splitting script in examples dir on GitHub
@dsblanch
dan-blanchard
Bonus Slides
SKLL API

from skll import Learner, Reader

# Load training examples
train_examples = Reader.for_path('myexamples.megam').read()

# Train a linear SVM
learner = Learner('LinearSVC')
learner.train(train_examples)

# Load test examples and evaluate
test_examples = Reader.for_path('test.tsv').read()
conf_matrix, accuracy, prf_dict, model_params, obj_score = learner.evaluate(test_examples)
# conf_matrix:   confusion matrix
# accuracy:      overall accuracy on test set
# prf_dict:      precision, recall, f-score for each class
# model_params:  tuned model parameters
# obj_score:     objective function score on test set

# Generate predictions from trained model
predictions = learner.predict(test_examples)

# Perform 10-fold cross-validation with a radial SVM
learner = Learner('SVC')
fold_result_list, grid_search_scores = learner.cross_validate(train_examples)
# fold_result_list:    per-fold evaluation results
# grid_search_scores:  per-fold training set obj. scores
SKLL API

import numpy as np
from os.path import join
from skll import FeatureSet, NDJWriter, Writer

# Create some training examples
num_train_examples = 100
labels = []
ids = []
features = []
for i in range(num_train_examples):
    label = "dog" if i % 2 == 0 else "cat"
    labels.append(label)
    ids.append("{}{}".format(label, i))
    features.append({"f1": np.random.randint(1, 4),
                     "f2": np.random.randint(1, 4)})
feat_set = FeatureSet('training', ids, labels=labels, features=features)

# Write them to a file
train_path = join(_my_dir, 'train', 'test_summary.jsonlines')
Writer.for_path(train_path, feat_set).write()
# Or
NDJWriter(train_path, feat_set).write()
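The .jsonlines/.ndj format that NDJWriter produces is simply one JSON object per line. A minimal stdlib-only sketch of a write/read round trip; the "id", "y", and "x" field names mirror what SKLL uses on disk, but treat the exact layout as an assumption and check the SKLL docs for your version:

```python
import io
import json

examples = [
    {"id": "dog0", "y": "dog", "x": {"f1": 2, "f2": 3}},
    {"id": "cat1", "y": "cat", "x": {"f1": 1, "f2": 2}},
]

# Write: one JSON object per line (using an in-memory buffer as a stand-in for a file)
buf = io.StringIO()
for ex in examples:
    buf.write(json.dumps(ex) + "\n")

# Read it back, one line at a time
buf.seek(0)
loaded = [json.loads(line) for line in buf]
print(loaded[0]["y"], loaded[1]["x"]["f2"])  # -> dog 2
```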