Research Code

Andrew Rosenberg

with

RA Manual: Notes on Writing Code by Matthew Gentzkow and Jesse Shapiro Chicago Booth

and

http://betterexplained.com/articles/a-visual-guide-to-version-control/


Research Code has a Bad Reputation

• Research code is usually not written to be robust, reusable, or long-lived in development and version control repositories.

• Its consumer is usually its own author, or in some cases a few others in the lab.

• http://bytesizebio.net/index.php/2012/08/24/can-we-make-research-software-accountable/

Mistakes (Research) Programmers Make

• I just need to do this specific thing one time.

• I’ll remember what I did, if I need to do it again.

• No one is interested in this code.

• No one will ever see this code.

What research code looks like

• This is not application development.

• Often research code involves:
– a series of small scripts,
– linking together existing open source toolkits,
– reformatting input and output,
– generating plots and graphs.

• Where is the “software”?

• The contribution of the paper may be:
– an extension of an existing codebase,
– a set of small scripts and reformatting one-liners,
– implemented in multiple languages.


A new way of doing business

• These are bad excuses.

• There is a movement to encourage and incentivize the distribution of source code with publications.

• And facilities to encourage it.


Source Code dissemination

• Host it yourself.

• www.runmycode.org

• http://www.ipol.im/

• (many, many more)


What is good enough?

• Right now:
– ANYTHING.

• Ideally:
– “Production level” code that can be run or compiled on a standard configuration.
– Thorough documentation.


Intellectual Property and Licensing

• GPL – copyleft

• Apache

• many, many more

• You have copyright over your code.

• A license allows someone else to use it.

• Disclosures can limit your ability to patent.


Version Control

• Version control allows multiple users to edit the same content.

• Allows for coding in the open.

• Subversion, Git, and many more.


Coding for the User

• Code for your future self.

• You are your most important user.


Don’t try to be clever

• Write simple, understandable code.

• Efficiency in number of lines is not important.

• Efficiency in number of operations or memory also might not be important.


There are many ways to skin a cat

print "Just another Perl hacker,";

$_='987;s/^(\d+)/$1-1/e;$1?eval:print"Just another Perl hacker,"';eval;

$_ = "wftedskaebjgdpjgidbsmnjgc";tr/a-z/oh, turtleneck Phrase Jar!/; print;


Establish a coding style.

• ClassName

• nameMethodsUsingVerbs

• underscored_lowercase_variable_names

• CONSTANTS

• Spacing
– x_mean=x_total/n
– x_mean = x_total / n

• More than anything, be consistent
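
A minimal sketch of these conventions in Python. The `HeightSurvey` class, its methods, and the `MAX_HEIGHT_CM` constant are hypothetical names invented for illustration; none of them come from the slides. Python's own style guide (PEP 8) prefers lowercase_with_underscores for method names, so the camelCase methods here follow the slide rather than PEP 8. Either choice is fine as long as it is applied consistently.

```python
# Hypothetical example: every name below is illustrative only.

MAX_HEIGHT_CM = 250.0  # CONSTANTS: upper case


class HeightSurvey:  # ClassName: capitalized words
    def __init__(self):
        self.heights_cm = []  # underscored_lowercase_variable_names

    def addHeight(self, height_cm):  # nameMethodsUsingVerbs
        if not 0.0 < height_cm <= MAX_HEIGHT_CM:
            raise ValueError(f"implausible height: {height_cm}")
        self.heights_cm.append(height_cm)

    def computeMean(self):
        x_total = sum(self.heights_cm)
        n = len(self.heights_cm)
        x_mean = x_total / n  # spaces around operators
        return x_mean
```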


Testing

• Unit tests.
– Small pieces of code that test “atomic” functionality of a program.

void testAddWorksCorrectly() {
    assertEquals(4, add(2, 2));
}

void testConstructorInitializesNameFieldToDefault() {
    Person p = new Person();
    assertEquals("John Smith", p.getName());
}


Why write tests?

• Identify problems.

• Easier Changes.

• Simple integration.

• Documentation.


Test Driven Development

• Write a test.

• Run the tests and check that the new test fails.

• Write as little code as possible.

• Make the tests pass (go green).

• Refactor the code.

• Repeat.

[wikipedia]


Bug fixes and Testing

• When you find a bug in your code:

• Write a test that “catches the bug”.
– It fails.

• The bug is fixed when the test passes.

• And as long as you keep running the test, the bug cannot silently come back.
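
The bug-fix workflow above can be sketched with Python's standard `unittest` module. The function `compute_mean` and the bug it once had (dividing by zero on an empty list) are assumptions made up for this example; the slides do not describe a specific bug.

```python
import unittest


def compute_mean(values):
    """Return the arithmetic mean of values, or None for an empty list.

    Hypothetical function. The earlier, buggy version divided by
    len(values) unconditionally and crashed on empty input.
    """
    if not values:
        return None
    return sum(values) / len(values)


class ComputeMeanTest(unittest.TestCase):
    def testMeanOfEmptyListIsNone(self):
        # Written first to reproduce the bug: it failed with
        # ZeroDivisionError until the guard above was added.
        self.assertIsNone(compute_mean([]))

    def testMeanOfTwoValues(self):
        self.assertEqual(3.0, compute_mean([2.0, 4.0]))
```

Saved in a file, these tests can be run with `python -m unittest <module name>`. Keeping the first test in the suite is what guards against the bug being reintroduced later.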


Refactoring

• Just because code works doesn’t mean it’s done.

• Consolidate code to increase modularity.
– Eliminate code duplication.

• Some examples:
– Extract Class
– Extract Method
– Move/Rename Method
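
As an illustration of "Extract Method", here is a small before-and-after sketch in Python. The function names and the normalization task are hypothetical, chosen only to show duplicated logic being pulled into one place.

```python
# Before: the same normalization logic is written out twice.
def report_before(train_scores, test_scores):
    train_total = sum(train_scores)
    train_norm = [s / train_total for s in train_scores]
    test_total = sum(test_scores)
    test_norm = [s / test_total for s in test_scores]
    return train_norm, test_norm


# After "Extract Method": the duplicated logic lives in one function.
def normalize(scores):
    """Scale scores so that they sum to 1."""
    total = sum(scores)
    return [s / total for s in scores]


def report_after(train_scores, test_scores):
    return normalize(train_scores), normalize(test_scores)
```

Both versions return the same values. The difference is that a later change, such as handling a total of zero, now has to be made in only one place.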


Code Review

• Give your code to another person for feedback.

• Companies do this to ensure consistent style and correctness.

• Research labs rarely do.


Some specific advice.

• Take an enormous amount of notes.

– What did you do?
– What did you learn?
– What bugs did you fix?
– What new issues did you find?
– What questions did you come up with?


Specifics

• Copy and Paste is your enemy.
– If you are copying and pasting in code, you have probably made a mistake.


Specifics

• Use CONSTANTS
– Never encode constants inline in your code.

mean_height = total_height / 15

num_people = 13
mean_height = total_height / num_people

data[17] = 'Andrew'
data[18] = 1.78

name_idx = 17
score_idx = 18
data[name_idx] = 'Andrew'
data[score_idx] = 1.78


Specifics

• Don’t use global variables
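
One way to see the problem with globals is to compare two versions of the same small function. This is a sketch under assumed names: `threshold`, `count_positive_global`, and `count_positive` are hypothetical and not from the slides.

```python
# Hypothetical example: names and values are illustrative only.

# With a global: the result depends on hidden state that any other
# code can change, which makes the function hard to test and reuse.
threshold = 0.5


def count_positive_global(scores):
    return sum(1 for s in scores if s > threshold)


# Without a global: everything the function needs is an argument.
def count_positive(scores, threshold):
    return sum(1 for s in scores if s > threshold)
```

With the second version, running the same experiment under two thresholds is just two calls with different arguments, and neither call can affect the other.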


Specifics

• Use sensible function names

start()
step1()
step2()
step3()
wrapup()

initializeParameters()
setPaths()
calculateRHS()
calculateLHS()
writeResults()


Specifics

• Use sensible variable names

x1 = income / population
ipc = income / population
income_per_capita = income / population


Specifics

• Serialize Frequently.

main() {
    preprocessData()
    extractFeatures()
    runBaselineExperiment()
    runNewExperiment()
    evaluateResults()
}

preprocess files.data > clean_files.data
extractFeatures clean_files.data > features.csv
runBaseline features.csv > baseline.results
runNewExperiment features.csv > new.results
evaluate baseline.results > baseline.report
evaluate new.results > new.report
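
The same idea can be applied inside a single program by writing each intermediate result to disk and reloading it on later runs. The sketch below is one possible approach, not the method from the slides: the helper `load_or_compute`, the stand-in `extract_features` step, and the choice of JSON as the file format are all assumptions made for illustration.

```python
import json
import os
import tempfile


def load_or_compute(path, compute):
    """Return the result stored at path, computing and saving it if absent."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    result = compute()
    with open(path, "w") as f:
        json.dump(result, f)
    return result


def extract_features():
    # Hypothetical stand-in for an expensive step.
    return {"lengths": [len(word) for word in ["research", "code"]]}


# Usage: the first call computes and writes, the second reads from disk.
work_dir = tempfile.mkdtemp()
features_path = os.path.join(work_dir, "features.json")
features = load_or_compute(features_path, extract_features)
features_again = load_or_compute(features_path, extract_features)
```

When a later stage fails or changes, only that stage needs to be rerun, because everything before it is already on disk. Deleting a file forces the corresponding step to be recomputed.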


Specifics

• When things get slow, use a profiler.
– Identify slow functions, and fix them.
– Some code needs to do a lot, so it can be slow.
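
For Python code, the standard library's `cProfile` and `pstats` modules are one way to do this. In the sketch below, `slow_sum_of_squares`, `run_experiment`, and the `profile` helper are hypothetical names; only the `cProfile` and `pstats` calls are real library API.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    # Deliberately naive stand-in for a slow research function.
    total = 0
    for i in range(n):
        total += i * i
    return total


def run_experiment():
    return [slow_sum_of_squares(20_000) for _ in range(20)]


def profile(func):
    """Run func under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = func()
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Show the five entries with the largest cumulative time.
    stats.sort_stats("cumulative").print_stats(5)
    return result, buffer.getvalue()


result, report = profile(run_experiment)
print(report)
```

The report lists how often each function was called and how much time it took, which shows where optimization effort would actually pay off. An existing script can also be profiled without modification by running `python -m cProfile -s cumulative` followed by the script name.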


Recap

• Research code should be released.
– This is becoming more common, expected and, sometimes, required.

• Research code needs to be good code.
– So you can reuse it.
– So you can release it.