scientific software development

Jeff Allen Quantitative Biomedical Research Center UT Southwestern Medical Center BSCI5096 - 3.26.2013 Avoiding Big Mistakes in Scientific Computing Or: How to Write Code That Doesn’t Jeopardize Your Professional Reputation or Patient’s Lives

Upload: jalle6

Post on 01-Nov-2014

DESCRIPTION

Introduction to proper software development practices in scientific computing -- revision control, unit testing in R, code reviews, reproducibility, and replicability.

TRANSCRIPT

Page 1: Scientific Software Development

Jeff Allen
Quantitative Biomedical Research Center
UT Southwestern Medical Center
BSCI5096 - 3.26.2013

Avoiding Big Mistakes in Scientific Computing
Or: How to Write Code That Doesn’t Jeopardize Your Professional Reputation or Patient’s Lives

Page 2: Scientific Software Development

Motivation

• Anil Potti scandal at Duke
– Genomic signature reported to identify the best chemotherapy based on a patient’s genes.
– Over 100 patients enrolled in clinical trials.
– Later discovered gross mishandling of data and invalidating bugs in software
– Alleged manipulation of data
– Watch: Lecture from Keith Baggerly

Page 3: Scientific Software Development

Outline

• Revision Control
• Reproducibility and Replicability
• Ensuring Code Quality
• Resources

Page 4: Scientific Software Development

Outline

• Revision Control
– Introduction & Concepts
– Git & GitHub
• Reproducibility and Replicability
• Ensuring Code Quality
• Resources

Page 5: Scientific Software Development

Revision Control

• Tracks changes to files over time
• Keeps a complete log of all changes ever made to any file in a project
• Supports more collaboration on projects
– Provides an authoritative repository for the code
– Gracefully catches and handles conflicts in files
• Various forms in use today including Mercurial, Git, Subversion

Page 6: Scientific Software Development

Git

• Modern distributed revision control system
– “Distributed” means you have the entire history of the project on your local machine.
– Don’t have to be online to develop.
• Makes improvements in performance and usability on past systems.
• Open-Source and free

Page 7: Scientific Software Development

GitHub

• A website that hosts Git repositories.
• You can “push” your own Git repositories to their site to gain:
– A web interface – easier way to view your files and track changes
– Control who has access to which projects
– Project organization – hosts documentation, bug-tracking, etc.
– Social platform – the “Facebook” of coding
– Client-Side graphical user interface

Page 8: Scientific Software Development

GITHUB DEMONSTRATION

Page 9: Scientific Software Development

GitHub Client - GUI

• Only works with GitHub.
• Much easier to use and navigate.
• Mac and Windows versions.
• On campus: Need to open Git Shell and run:
git config --global http.proxy http://proxy.swmed.edu:3128

Page 10: Scientific Software Development

GitHub Client

Page 11: Scientific Software Development

GITHUB CLIENT DEMO

Page 12: Scientific Software Development

Use Cases

• “This function used to work.”
– Look at the changes made to that file since it last worked.
• “Please send me the code used in this publication.”
– Revert the project back to any point in its history
• “I found a bug and fixed it.”
– (Optionally) Allow others to contribute to your projects.

Page 13: Scientific Software Development

Outline

• Revision Control
• Reproducibility and Replicability
– Replicability
– Reproducibility
• Ensuring Code Quality
• Resources

Page 14: Scientific Software Development

C. TITUS BROWN http://ivory.idyll.org/blog/replication-i.html

“‘Replicable’ means ‘other people get exactly the same results when doing exactly the same thing’, while ‘reproducible’ means ‘something similar happens in other people's hands.’ The latter is far stronger, in general, because it indicates that your results are not merely some quirk of your setup and may actually be right.”

Page 15: Scientific Software Development

Replicability

• In order for analysis to be replicable, another researcher must have access to:
– The exact same code you used
– The exact same data you used
• Any changes (including bug-fixes and other corrections) in your code or data from what you provide will make your results irreplicable.
– Must track in a revision control system
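
One way to pin down "the exact same data" is to record a checksum of each input file alongside the results. The following is a minimal sketch using only base R; the file name and its contents are hypothetical stand-ins, not part of the original slides.

```r
# Hypothetical stand-in for a real input file.
data_file <- file.path(tempdir(), "expression_data.csv")
write.csv(data.frame(gene = c("A", "B"), value = c(1.5, 2.5)),
          data_file, row.names = FALSE)

# Record an MD5 checksum of the input so that anyone re-running the
# analysis can confirm they hold byte-identical data.
recorded_md5 <- unname(tools::md5sum(data_file))

# Later, or on another machine: verify the data before analysing it.
current_md5 <- unname(tools::md5sum(data_file))
stopifnot(identical(current_md5, recorded_md5))

# Record the R version and attached packages used for the run.
print(sessionInfo())
```

Committing the recorded checksum and session information to the same repository as the code keeps code, data fingerprint, and environment description together.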

Page 16: Scientific Software Development

Reproducibility

• Requires much more time and effort
• Independently arrive at the same conclusions
– Potentially using the same data
– Using different techniques and parameters
• May take as much time to reproduce results as it did to produce them the first time
• Should be done in high-stakes (e.g. clinical) applications

Page 17: Scientific Software Development

Recommended Practices

a. Use a revision control system such as Git (hosted, for example, on GitHub)
b. To ensure replicability, clone your repository on another computer and re-run all your analysis. Ensure you get the same results.
• This is a good test of replicability.
• Knowing you’ll have to do this will make you write better organized code.
c. If it’s really important, ask a colleague to reproduce your results independently.
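
The re-run check in (b) can be partly automated. The sketch below assumes a hypothetical `run_analysis()` function and a hypothetical result file; it fixes the random seed, saves the first result, and compares a second run against it.

```r
# Hypothetical analysis: any function whose output should be replicable.
run_analysis <- function(seed = 42) {
  set.seed(seed)                 # fix the random number stream
  x <- rnorm(100)
  c(mean = mean(x), sd = sd(x))
}

# First run: save the result next to the code.
result_file <- file.path(tempdir(), "result.rds")
saveRDS(run_analysis(), result_file)

# Re-run (ideally from a fresh clone on another computer) and compare.
rerun    <- run_analysis()
original <- readRDS(result_file)
stopifnot(isTRUE(all.equal(rerun, original)))
```

`all.equal()` compares numbers within a small tolerance, which is usually the appropriate check when the two runs happen on different machines.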

Page 18: Scientific Software Development

Outline

• Revision Control
• Reproducibility and Replicability
• Ensuring Code Quality
– Automated Testing
– Code reviews
• Resources

Page 19: Scientific Software Development

Automated Testing

• Unit testing
– Very specific target
– May have multiple tests per function
• Many unit testing frameworks
– In R: testthat and RUnit

install.packages("testthat")

library(testthat)
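
As an illustration of "multiple tests per function", here is a minimal sketch using testthat's `test_that()` and `expect_equal()`. The function `rescale01()` is a hypothetical example, not from the slides, and the sketch assumes the testthat package is installed.

```r
library(testthat)

# Hypothetical function under test: rescale a numeric vector to [0, 1].
rescale01 <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Each test_that() block targets one specific behaviour.
test_that("rescale01() maps the minimum to 0 and the maximum to 1", {
  expect_equal(rescale01(c(0, 5, 10)), c(0, 0.5, 1))
  expect_equal(range(rescale01(c(3, 7, 11, 20))), c(0, 1))
})

test_that("rescale01() preserves the ordering of values", {
  x <- c(3, 1, 2)
  expect_equal(order(rescale01(x)), order(x))
})
```

A failing expectation raises an error that names the test, which is what makes small, narrowly targeted tests easy to diagnose.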

Page 20: Scientific Software Development

Testing Example - Square

Code

square <- function(x){
  sq <- 0
  for (i in 1:x){
    sq <- sq + x
  }
  return(sq)
}

Page 21: Scientific Software Development

Testing Example - Square

Code

square <- function(x){
  sq <- 0
  for (i in 1:x){
    sq <- sq + x
  }
  return(sq)
}

Tests

expect_that(square(3), equals(9)) #Passes

Page 22: Scientific Software Development

Testing Example - Square

Code

square <- function(x){
  sq <- 0
  for (i in 1:x){
    sq <- sq + x
  }
  return(sq)
}

Tests

expect_that(square(3), equals(9)) #Passes
expect_that(square(5), equals(25)) #Passes

Page 23: Scientific Software Development

Test-Driven Development (TDD)

• If you see a bug:
1. Write a test that fails
2. Fix the bug
3. Show that the test now passes
4. Commit to revision control

Page 25: Scientific Software Development

Testing Example - Square

Code

square <- function(x){
  sq <- 0
  for (i in 1:x){
    sq <- sq + x
  }
  return(sq)
}

Tests

expect_that(square(3), equals(9)) #Passes
expect_that(square(5), equals(25)) #Passes
expect_that(square(2.5), equals(6.25)) #Fails

Page 26: Scientific Software Development

Testing Example - Square

Code

square <- function(x){
  sq <- 0
  for (i in 1:x){
    sq <- sq + x
  }
  return(sq)
}

Tests

expect_that(square(3), equals(9)) #Passes
expect_that(square(5), equals(25)) #Passes
expect_that(square(2.5), equals(6.25)) #Fails
expect_that(square(-2), equals(4)) #Fails
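
The two failures come from how `1:x` behaves for non-integer and negative `x`. The sketch below uses base R only and repeats the loop-based `square()` from the slide to show what the loop actually iterates over.

```r
# Loop-based version from the slide.
square <- function(x) {
  sq <- 0
  for (i in 1:x) {
    sq <- sq + x
  }
  return(sq)
}

# 1:x counts in steps of 1 from 1 towards x, so the loop does not
# run exactly x times when x is non-integer or negative.
print(1:2.5)        # 1 2        -> two iterations, not 2.5
print(1:-2)         # 1 0 -1 -2  -> four iterations, counting down

print(square(2.5))  # 5   (2.5 + 2.5), expected 6.25
print(square(-2))   # -8  (four times -2), expected 4
```

The tests for 3 and 5 pass only because the loop happens to run exactly `x` times for positive whole numbers.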

Page 28: Scientific Software Development

Testing Example - Square

Code

square <- function(x){
  sq <- x * x
  return(sq)
}

Page 31: Scientific Software Development

Testing Example - Square

Code

square <- function(x){
  sq <- x * x
  return(sq)
}

Tests

expect_that(square(3), equals(9)) #Passes
expect_that(square(5), equals(25)) #Passes
expect_that(square(2.5), equals(6.25)) #Passes
expect_that(square(-2), equals(4)) #Passes

Page 33: Scientific Software Development

Test-Driven Development (TDD)

• Advantages
– Ensure that problematic areas are well-tested
– Regression testing – ensure old bugs don’t ever come back
– Confidently approach old code
– More assured in handling someone else’s code
– Saves you time over manual testing
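
Regression testing can be sketched by keeping one permanent test per bug that has been fixed. The example below uses the corrected `square()` from the slides with testthat's `test_that()` and `expect_equal()`; the test descriptions are illustrative, and the sketch assumes the testthat package is installed.

```r
library(testthat)

# Corrected implementation from the slides.
square <- function(x) {
  sq <- x * x
  return(sq)
}

# One permanent test per past bug: if a later change reintroduces
# the loop-based logic, these fail immediately.
test_that("square() handles non-integer input (regression)", {
  expect_equal(square(2.5), 6.25)
})

test_that("square() handles negative input (regression)", {
  expect_equal(square(-2), 4)
})
```

Saved in a test file and committed with the code, such tests can be re-run after every change, for example with `testthat::test_file()`.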

Page 34: Scientific Software Development

Code Reviews

• Get more than one set of eyes on your code

• Lightweight
– Email to get quick feedback
– GitHub is great for this
• Formal
– Have a meeting to audit
– Less than 500 LOC per meeting

Page 35: Scientific Software Development

Extreme – Pair Programming

• Two programmers share a single workstation
• Both participate, though only one can type
• Significant learning opportunities for both
• Can strategically pair:
– Senior with Junior, mentoring
– Statistician with Developer, mutual learning
• Improvements in code quality compensate for short-term efficiency loss
– fewer bugs, easier code to maintain

Page 36: Scientific Software Development

Testing Example - Square

Code

square <- function(x){
  sq <- x * x
  return(sq)
}

Tests

expect_that(square(3), equals(9)) #Passes
expect_that(square(5), equals(25)) #Passes
expect_that(square(2.5), equals(6.25)) #Passes
expect_that(square(-2), equals(4)) #Passes

Page 37: Scientific Software Development

Testing Example - Square

Code

square <- function(x){
  x^2
}

Tests

expect_that(square(3), equals(9)) #Passes
expect_that(square(5), equals(25)) #Passes
expect_that(square(2.5), equals(6.25)) #Passes
expect_that(square(-2), equals(4)) #Passes

Page 38: Scientific Software Development

Outline

• Revision Control
• Reproducibility and Replicability
• Ensuring Code Quality
• Resources

Page 39: Scientific Software Development

Resources

• Software Carpentry
– www.software-carpentry.org
– Volunteer organization focused on teaching these topics to scientific audiences
– Contact us ([email protected]) if you’d be interested in attending a local Boot Camp
• GitHub Documentation
– https://help.github.com/
– Great documentation on how to use Git and/or GitHub

Page 41: Scientific Software Development

Suggested Next Steps

• Watch Lecture from Keith Baggerly
• Register for a GitHub account (free), explore
• Write an R function and cover it with unit tests using the testthat framework
• Then check into a public GitHub repo