Nobody Knows What It’s Like To Be the Bad Man: The Development Process for the caret Package

Max Kuhn, Pfizer Global R&D, max.kuhn@pfizer.com
Zachary Deane–Mayer, Cognius, [email protected]


Post on 17-Jul-2015


Page 1: Nobody Knows What It’s Like To Be the Bad Man: The Development Process for the Caret Package

Nobody Knows What It’s Like To Be the Bad Man
The Development Process for the caret Package

Max Kuhn
Pfizer Global R&D
max.kuhn@pfizer.com

Zachary Deane–Mayer
Cognius
[email protected]

Page 2

Model Function Consistency

Since there are many modeling packages in R written by different people, there are inconsistencies in how models are specified and predictions are created.

For example, many models have only one method of specifying the model (e.g. formula method only)
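As a small illustration of the two specification styles, base R's own linear model code can be called either way. This is only a sketch using the stats package and the built-in mtcars data; the object names are chosen for illustration and do not come from the slides.

```r
# Formula interface: the model is described symbolically.
fit_formula <- lm(mpg ~ wt + hp, data = mtcars)

# Non-formula (x/y) interface: the caller builds the design matrix.
x <- cbind("(Intercept)" = 1, as.matrix(mtcars[, c("wt", "hp")]))
y <- mtcars$mpg
fit_xy <- lm.fit(x = x, y = y)

# Both routes estimate the same coefficients.
print(coef(fit_formula))
print(fit_xy$coefficients)
```

A package that offers only one of these entry points forces every caller to adapt to it, which is the kind of inconsistency described above.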

Kuhn & Deane–Mayer (Pfizer / Cognius) caret 2 / 19

Page 3

Generating Class Probabilities Using Different Packages

Function              predict Function Syntax
MASS::lda             predict(obj) (no options needed)
stats::glm            predict(obj, type = "response")
gbm::gbm              predict(obj, type = "response", n.trees)
mda::mda              predict(obj, type = "posterior")
rpart::rpart          predict(obj, type = "prob")
RWeka::Weka           predict(obj, type = "probability")
caTools::LogitBoost   predict(obj, type = "raw", nIter)
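To make one row of this table concrete, here is a minimal sketch for the glm case using only base R and the built-in mtcars data (the data set and variable names are illustrative assumptions). It shows that the default predict call does not return probabilities, so the caller has to know the right type value for each package.

```r
# Logistic regression on the built-in mtcars data (am is coded 0/1).
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# The default predict() output is on the link (log-odds) scale ...
log_odds <- predict(fit, newdata = mtcars)

# ... so probabilities must be requested explicitly.
probs <- predict(fit, newdata = mtcars, type = "response")

# The two are related by the inverse logit.
stopifnot(isTRUE(all.equal(plogis(log_odds), probs)))
print(head(round(probs, 3)))
```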

Page 4

The caret Package

The caret package was developed to:

create a unified interface for modeling and prediction (interfaces to 180 models)

streamline model tuning using resampling

provide a variety of “helper” functions and classes for day-to-day model building tasks

increase computational efficiency using parallel processing

First commits within Pfizer: 6/2005, First version on CRAN: 10/2007

Website: http://topepo.github.io/caret/

JSS Paper: http://www.jstatsoft.org/v28/i05/paper

Model List: http://topepo.github.io/caret/bytag.html

Many computing sections in APM (Applied Predictive Modeling)
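The unified interface and resampling-based tuning listed above might look roughly like this in use. This is a hedged sketch, not code from the slides: it assumes the caret and MASS packages are installed (the block does nothing otherwise), and the choice of the iris data, the "lda" method and 5-fold cross-validation is purely illustrative.

```r
# Runs only when caret and MASS are available; otherwise it is skipped.
if (requireNamespace("caret", quietly = TRUE) &&
    requireNamespace("MASS", quietly = TRUE)) {
  set.seed(1)

  # Resampling scheme used to evaluate the model.
  ctrl <- caret::trainControl(method = "cv", number = 5, classProbs = TRUE)

  # The same train() call works across models; only `method` changes.
  fit_caret <- caret::train(Species ~ ., data = iris,
                            method = "lda", trControl = ctrl)

  # Class probabilities use one syntax regardless of the underlying package.
  probs_caret <- predict(fit_caret, newdata = head(iris), type = "prob")
  print(probs_caret)
}
```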

Page 5

Package Dependencies

One thing that makes caret different from most other packages is that it uses code from an abnormally large number (> 80) of other packages.

Originally, these were in the Depends field of the DESCRIPTION file, which caused all of them to be loaded with caret.

Later, and for many years, they were listed in Suggests instead, which solved that issue.

However, their formal dependency in the DESCRIPTION file required CRAN to install hundreds of other packages to check caret. The maintainers were not pleased.

Page 6

Package Dependencies

This problem was somewhat alleviated at the end of 2013 when custom methods were expanded.

Although this functionality had already existed in the package for some time, it was refactored to be more user friendly.

In the process, much of the modeling code was moved out of caret’s R files and into R objects, eliminating the formal dependencies.

Right now, the total number of dependencies is much smaller (2 Depends, 7 Imports, and 25 Suggests).

This still affects testing though (described later). Also:

1 package is needed for this model and is not installed. (gbm).

Would you like to try to install it now?
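The idea of keeping model code in R objects and checking for packages only when a model is requested can be sketched as follows. This is a simplified, hypothetical structure: the field names and the missing_packages helper are assumptions made for illustration, not caret's actual internals.

```r
# Hypothetical, simplified model specification stored as a plain R object.
# The field names are illustrative, not caret's exact structure.
gbm_spec <- list(
  label   = "Stochastic Gradient Boosting",
  library = "gbm",
  fit     = function(x, y, ...) {
    # gbm is only referenced when the model is actually fit,
    # so it need not be a formal dependency of the host package.
    gbm::gbm.fit(x = x, y = y, ...)
  }
)

# Report which of a model's packages are not installed.
missing_packages <- function(spec) {
  installed <- vapply(spec$library, requireNamespace,
                      logical(1), quietly = TRUE)
  spec$library[!installed]
}

# Demonstration with one package that ships with R and one made-up name.
demo_spec <- list(library = c("stats", "notARealPackage123"))
print(missing_packages(demo_spec))
```

A front end could use the returned names to produce a prompt like the one above before attempting the fit.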

Page 7

The Basic Release Process

1. create a few dynamic man pages
2. use R CMD check --as-cran to ensure passing CRAN tests and unit tests
3. update all packages (and R)
4. run regression tests and evaluate results
5. send to CRAN
6. repeat
7. repeat
8. install passed caret version
9. generate HTML documentation and sync github io branch
10. profit!

Page 8

Toolbox

RStudio projects: a clean, self contained package development environment

- No more setwd('/path/to/some/folder') in scripts

- Keep track of project-wide standards, e.g. code formatting

- An RStudio project was the first thing we added after moving the caret repository to github

devtools: automate boring tasks

testthat: automated unit testing

roxygen2: combine source code with documentation

github: source control

Page 9

devtools

devtools::install builds package and installs it locally

devtools::check:
1. Builds documentation
2. Runs unit tests
3. Builds tarball
4. Runs R CMD check

devtools::release builds package and submits it to CRAN

devtools::install_github enables non-CRAN code distribution or distribution of private packages

devtools::use_travis enables automated unit testing through travis-CI and test coverage reports through coveralls

Page 10

Testing

Testing for the package occurs in a few different ways:

unit tests via testthat and travis-CI

regression tests for consistency

Automated unit testing via testthat

devtools::use_testthat

Unit tests prevent new features from breaking old code

All functions should have associated tests

Run during R CMD check --as-cran

Can specify that certain tests be skipped on CRAN

caret is slowly adding more testthat tests
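A unit test of the kind described on this slide might look like the sketch below. The rmse helper is a hypothetical function written for this example (it is not part of caret), and the block assumes testthat is installed; it does nothing otherwise.

```r
# Hypothetical helper under test (not a caret function).
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

if (requireNamespace("testthat", quietly = TRUE)) {
  library(testthat)

  test_that("rmse is zero for perfect predictions", {
    expect_equal(rmse(1:5, 1:5), 0)
  })

  test_that("rmse matches a hand computation", {
    # Marks the test to be skipped when the checks run on CRAN,
    # which is useful for slow tests.
    skip_on_cran()
    expect_equal(rmse(c(0, 0), c(3, 4)), sqrt(12.5))
  })
}
```

In a package these calls would normally live in files under tests/testthat/ so that they run during R CMD check.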

Page 11

github + travis + coveralls

Travis and coveralls are tools for automated unit testing

Travis reports test failures and CRAN errors/warnings

Coveralls reports % of code covered by unit tests

Both automatically comment on the PR itself

Contributor submits code via a pull request

Travis notifies them of test failures

Coveralls notifies them to write tests for new functions

Automated feedback on code quality allows rapid iteration

Code review once unit tests and R CMD CHECK pass

github supports line-by-line comments

Usually several more iterations here

Page 12

Regression Testing

Prior to CRAN release (or whenever required), a comprehensive set of regression tests can be conducted.

All modeling packages are updated to their current CRAN versions.

For each model accessed by train, rfe, and/or sbf, a set of test cases are computed with the production version of caret and the devel version.

First, test cases are evaluated to make sure that nothing has been broken by updated versions of the constituent packages.

Diffs of the model results are computed to assess any differences in caret versions.

This process takes approximately 3 hours to complete using make -j 12 on a Mac Pro.
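The diff step could be sketched along these lines. Everything here is an assumption for illustration: the slides do not show how results are stored or compared, the changed_models helper is hypothetical, and the numbers are invented placeholders rather than real test output.

```r
# Placeholder results keyed by model code, standing in for output from
# the production and devel versions. The values are invented.
prod_results  <- list(lda = c(Accuracy = 0.980, Kappa = 0.970),
                      gbm = c(Accuracy = 0.940, Kappa = 0.910))
devel_results <- list(lda = c(Accuracy = 0.980, Kappa = 0.970),
                      gbm = c(Accuracy = 0.925, Kappa = 0.890))

# Return the models whose results differ beyond a numeric tolerance.
changed_models <- function(old, new, tolerance = 1e-8) {
  models <- intersect(names(old), names(new))
  same <- vapply(models, function(m) {
    isTRUE(all.equal(old[[m]], new[[m]], tolerance = tolerance))
  }, logical(1))
  models[!same]
}

print(changed_models(prod_results, devel_results))
```

Using a tolerance rather than exact equality avoids flagging differences that come only from floating point noise.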

Page 13

Regression Testing

$ R CMD BATCH move_files.R

$ cd ~/tmp/2015_04_19_09__6.0-41/

$ make -j 12 -i

2015-04-19 09:13:44: Starting ada

2015-04-19 09:13:44: Starting AdaBag

2015-04-19 09:13:44: Starting AdaBoost.M1

2015-04-19 09:13:44: Starting ANFIS

:

make: [FH.GBML.RData] Error 1 (ignored)

:

2015-04-19 12:03:52: Finished WM

2015-04-19 12:04:48: Finished xyf

Page 14

Documentation

caret originally contained four package vignettes with in-depth descriptions of functionality with examples.

However, this added time to R CMD check and was a general pain for CRAN.

Efforts to make the vignettes more computationally efficient (e.g. reducing the number of examples, resamples, etc.) diminished the effectiveness of the documentation.

Page 15

Documentation

The documentation was moved out of the package and to the github IO page.

These pages are built using knitr whenever a new version is sent to CRAN. Some advantages are:

longer and more relevant examples are available

update schedule is under my control

dynamic documentation (e.g. D3 network graphs, JS tables)

better formatting

It currently takes about 4 hours to create these (using parallel processing when possible).

Page 16

Backup Slides

Page 17

roxygen2

Simplified package documentation

Automates many parts of the documentation process

Special comment block above each function

Name, description, arguments, etc.

Code and documentation are in the same source file

A must have for new packages but hard to convert existing packages

caret has 92 .Rd files

I’m not in a hurry to re-write them all in roxygen2 format
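For readers who have not seen one, the special comment block mentioned above can be sketched as follows. The accuracy function is a hypothetical example written for this illustration and is not part of caret.

```r
#' Classification accuracy
#'
#' Hypothetical example function, not part of caret.
#'
#' @param obs Vector of observed classes.
#' @param pred Vector of predicted classes, the same length as \code{obs}.
#' @return The proportion of predictions that match the observed classes.
#' @examples
#' accuracy(c("a", "b", "b"), c("a", "b", "a"))
#' @export
accuracy <- function(obs, pred) {
  stopifnot(length(obs) == length(pred))
  mean(obs == pred)
}
```

roxygen2 turns the #' lines into the corresponding .Rd file, so the code and its documentation are edited in one place.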

Page 18

Required “Optimizations”

There is one check that produces a large number of false positive warnings. For example:

> bwplot.diff.resamples <- function (x, data, metric = x$metric, ...) {
+   ## some code
+   plotData <- subset(plotData, Metric %in% metric)
+   ## more code
+ }

will trigger a warning that “bwplot.diff.resamples: no visible binding for global variable 'Metric'”.

The “solution” is to have a file that is sourced first in the package (e.g. aaa.R) with the line

> Metric <- NULL
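To see why the check flags this pattern, the sketch below uses the codetools package, which ships with R and performs this kind of static analysis. The bwplot_sketch function is a simplified stand-in written for this example, and the data values are invented.

```r
# Simplified stand-in for the function on the slide.
bwplot_sketch <- function(x, metric = x$metric) {
  plotData <- x$values
  subset(plotData, Metric %in% metric)
}

# findGlobals() lists symbols that are not defined inside the function.
# Static analysis cannot know that subset() will find Metric in the data.
globals <- codetools::findGlobals(bwplot_sketch, merge = FALSE)
print(globals$variables)

# The function still works, because subset() evaluates Metric
# as a column of plotData. The values below are invented.
x <- list(metric = "RMSE",
          values = data.frame(Metric = c("RMSE", "Rsquared"),
                              value = c(1.2, 0.8)))
print(bwplot_sketch(x))
```

Inside a package, calling utils::globalVariables("Metric") in a top-level file is another common way to silence the same note.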

Page 19

Judging the Severity of Problems

It’s hard to tell which warnings should be ignored and which should not. There is also the issue of inconsistencies related to who is “on duty” when you submit your package.

It is hard to believe that someone took the time for this:

Description Field: "My awesome R package"
R CMD check: "Malformed Description field: should contain one or more complete sentences."

Description Field: "This package is awesome"
R CMD check: "The Description field should not start with the package name, 'This package' or similar."

Description Field: "Blah blah blah."
R CMD check: PASS
