CODE AND DATA MANAGEMENT Toni Rosati Lynn Yarmey


Post on 01-Jan-2017

Page 1: CODE AND DATA MANAGEMENT

CODE AND DATA MANAGEMENT Toni Rosati Lynn Yarmey

Page 2: CODE AND DATA MANAGEMENT

Data Management is Important! Because…

• Reproducibility is the foundation of science
• Journals are starting to require data deposit
• You want to get credit for producing data (data citations)
• Others can use and build on your work (data reuse)
• Recreating a figure from a 2006 paper shouldn’t be painful
• Funders tell us so (see NSF, NIH, NOAA, etc.)

Page 3: CODE AND DATA MANAGEMENT

Outline

• Back up often
• Sharing code
• File naming
• Metadata
• Sharing data
• A data search tool

Page 4: CODE AND DATA MANAGEMENT
Page 5: CODE AND DATA MANAGEMENT

Back up

Tips:
- 1 working copy on your computer
- 1 copy on infrastructure near you
- 1 copy on infrastructure far away

But why would you only back up when you can do so much more?

SHARE!!
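The three-copy tip from the Back up slide can be sketched in a few lines of Python. This is a minimal illustration rather than a recommended backup tool: the function names `sha256_of`, `back_up`, and `demo` are hypothetical, and the "near" and "far" locations are stood in for by two local folders (in practice they might be a departmental server and an off-site service).

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 checksum of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(working_copy, backup_dirs):
    """Copy one file into each backup directory and verify every copy."""
    working_copy = Path(working_copy)
    expected = sha256_of(working_copy)
    copies = []
    for directory in backup_dirs:
        directory = Path(directory)
        directory.mkdir(parents=True, exist_ok=True)
        target = directory / working_copy.name
        shutil.copy2(working_copy, target)  # copy2 also keeps timestamps
        if sha256_of(target) != expected:
            raise IOError(f"Checksum mismatch for {target}")
        copies.append(target)
    return copies


def demo():
    """Back up a small file to two stand-in locations (both hypothetical)."""
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "working" / "data.csv"
        source.parent.mkdir()
        source.write_text("site,temp_c\nA,1.5\n")
        copies = back_up(source, [Path(tmp) / "near", Path(tmp) / "far"])
        return [copy.read_text() for copy in copies]
```

Verifying the checksum after each copy is the design choice worth keeping: a backup that silently differs from the working copy is not a backup.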

Page 6: CODE AND DATA MANAGEMENT

Why Share Code?

• Good backup
• Collaboration
• People don’t have to contact you to get and understand the code
• Faster and easier than other options (emailing individuals or sharing on servers)
• …

Page 7: CODE AND DATA MANAGEMENT

Why Share Code?

• Version control
• Commenting gives a brief, public history
• Work on multiple computers with the same code: flexibility in where you work (no USB drive necessary)
• Keep code with metadata/user instructions
• No bureaucracy
• FREE!

Page 8: CODE AND DATA MANAGEMENT

What is Git?

• Git is a distributed revision control and source code management (SCM) system capable of dealing with non-linear workflows
• “As with most other distributed revision control systems, and unlike most client-server systems, every Git working directory is a full-fledged repository with complete history and full version tracking capabilities, independent of network access or a central server.” (Wikipedia)
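The point that every working directory carries its full history, with no server involved, can be tried out with a short Python sketch that drives the `git` command line. It assumes `git` is installed and returns `None` otherwise. The file name `analysis.py`, the commit messages, and the user identity are made up for illustration, and `run_git` and `demo_local_history` are hypothetical helper names.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def run_git(args, cwd):
    """Run one git command in cwd and return its standard output."""
    command = [
        "git",
        "-c", "user.name=Example User",        # hypothetical identity
        "-c", "user.email=user@example.org",
        *args,
    ]
    result = subprocess.run(
        command, cwd=cwd, check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def demo_local_history():
    """Commit twice in a throwaway repository; return messages, newest first."""
    if shutil.which("git") is None:
        return None  # git is not installed, nothing to demonstrate
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        run_git(["init", "--quiet"], repo)
        script = repo / "analysis.py"
        script.write_text("print('v1')\n")
        run_git(["add", "analysis.py"], repo)
        run_git(["commit", "--quiet", "-m", "Add first version of analysis script"], repo)
        script.write_text("print('v2')\n")
        run_git(["commit", "--quiet", "-am", "Update analysis output"], repo)
        # The whole history is read from the local .git directory.
        return run_git(["log", "--format=%s"], repo).splitlines()
```

Nothing in this sketch touches the network. Publishing the same repository to a hosting service such as GitHub would be a separate, later step.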

Page 9: CODE AND DATA MANAGEMENT

GitHub

Page 10: CODE AND DATA MANAGEMENT

Sharing Code – GitHub.com

Page 11: CODE AND DATA MANAGEMENT

Sharing Code – GitHub.com

GitHub serves as the location of record for VIC at: https://github.com/UW-Hydro/VIC

Page 12: CODE AND DATA MANAGEMENT
Page 13: CODE AND DATA MANAGEMENT

File Naming

• Make names unique and meaningful!
• Include (as appropriate):
- Project name or acronym
- Study title
- Location
- Data type
- Researcher initials
- Date
- Data stage
- Version number
- File type

Think “long-term”
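One way to apply the naming checklist consistently is to build names from their parts instead of typing them by hand. The sketch below is an assumption about what such a helper could look like. The function `build_file_name`, the underscore separator, the `YYYYMMDD` date form, and all example values are illustrative choices, not a standard.

```python
import re
from datetime import date


def build_file_name(project, location, data_type, initials,
                    when, stage, version, extension):
    """Join the naming parts into one sortable, space-free file name."""
    parts = [
        project,
        location,
        data_type,
        initials,
        when.strftime("%Y%m%d"),   # year first, so names sort by date
        stage,
        f"v{version:02d}",         # zero-padded version number
    ]
    # Replace runs of spaces or punctuation inside a part with a hyphen.
    cleaned = [re.sub(r"[^A-Za-z0-9]+", "-", part).strip("-") for part in parts]
    return "_".join(cleaned) + "." + extension.lstrip(".")


example = build_file_name(
    "PROJ", "Site A", "soil temp", "TR",
    date(2013, 5, 14), "raw", 1, "csv",
)
# example == "PROJ_Site-A_soil-temp_TR_20130514_raw_v01.csv"
```

Putting the date in year-month-day order and zero-padding the version keeps files in a sensible order in any directory listing, which helps with the “long-term” view.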

Page 14: CODE AND DATA MANAGEMENT

Metadata

What would someone unfamiliar with your data need in order to evaluate, understand, and reuse them? How about someone:
- who works in your lab?
- from a different lab in your field?
- who is in a related interdisciplinary field?
- who researches a completely different area?
- who works for a newspaper? Congress?

Page 15: CODE AND DATA MANAGEMENT

Metadata is the difference between:

Page 16: CODE AND DATA MANAGEMENT

Metadata is Data about Data

• Units?
• Resolution?
• What do the column names mean?
• Caveats? Known data issues or missing values?
• How were the data collected?
• Where did the forcing data come from?
• How many layers were used in this model?

“Information that describes the content, quality, condition, origin, and other characteristics of data or other pieces of information. Metadata for spatial data may describe and document its subject matter; how, when, where, and by whom the data was collected; availability and distribution information; its projection, scale, resolution, and accuracy; and its reliability with regard to some standard. Metadata consists of properties and documentation. Properties are derived from the data source (for example, the coordinate system and projection of the data), while documentation is entered by a person (for example, keywords used to describe the data).” (Esri)
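The questions on this slide can be turned into a small metadata file that travels with the data. The Python sketch below writes such a "sidecar" file as JSON and refuses to write it if basic answers are missing. The field names, the `.meta.json` suffix, the checklist in `REQUIRED_FIELDS`, and every example value are assumptions made for illustration. A real project would normally follow a community metadata standard instead.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical checklist, drawn from the questions on the slide.
REQUIRED_FIELDS = [
    "columns", "resolution", "missing_value",
    "collection_method", "known_issues",
]


def missing_fields(metadata):
    """Return the required fields that are absent or left empty."""
    return [name for name in REQUIRED_FIELDS
            if metadata.get(name) in (None, "", [], {})]


def write_sidecar(data_path, metadata):
    """Write metadata as JSON next to the data file; return the new path."""
    gaps = missing_fields(metadata)
    if gaps:
        raise ValueError(f"Metadata is missing: {', '.join(gaps)}")
    sidecar = Path(f"{data_path}.meta.json")
    sidecar.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return sidecar


# Made-up example values for a made-up soil temperature file.
EXAMPLE = {
    "columns": {
        "time": {"description": "Observation time (UTC)", "units": "ISO 8601"},
        "t_soil": {"description": "Soil temperature at 10 cm depth", "units": "degC"},
    },
    "resolution": "hourly",
    "missing_value": -9999,
    "collection_method": "Thermistor probe logged by an automated station",
    "known_issues": "Sensor drift suspected after the first winter",
}


def demo():
    """Write the example sidecar to a temporary folder and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        sidecar = write_sidecar(Path(tmp) / "soil_temp.csv", EXAMPLE)
        return json.loads(sidecar.read_text(encoding="utf-8"))
```

Model output would need further fields, for example the source of the forcing data and the number of model layers, which could be added to the checklist in the same way.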

Page 17: CODE AND DATA MANAGEMENT

Metadata

• What happens without good metadata?
• You have no idea what the data mean
• You think you understand the data, so you use them…
• …but you use them totally wrong
• You waste hours (or days) trying to find out more about the data

Page 18: CODE AND DATA MANAGEMENT

Sharing Data

These days, Dr. Hodes said, “the old model in which researchers jealously guarded their data is no longer applicable.” http://www.nytimes.com/2011/04/04/health/04alzheimer.html

Page 19: CODE AND DATA MANAGEMENT

Sharing/Finding Data

www.nsidc.org/acadis/search

Page 20: CODE AND DATA MANAGEMENT

Organize now… or…

Thank you!

Page 21: CODE AND DATA MANAGEMENT

Data Reuse

• Our team enables Arctic sciences by ensuring datasets are well documented and can be understood by re-users.
• The trick with data re-use is to find the dataset…
• then become familiar enough with a dataset…
• to be able to combine it with other data…
• and extract accurate results.

Page 22: CODE AND DATA MANAGEMENT

Data Curation

• Metadata
• Usability
• Documentation
• Training
• Re-use
• Tools
• A little marketing
• Partnering
• Consensus building
• Data management plans for grant proposals
• Integrating social and physical sciences
• Data quality checks
• Data analysis

Page 23: CODE AND DATA MANAGEMENT

DOIs and Citations

• Digital Object Identifiers (DOIs) officially name a resource.
• A DOI works like a stable, permanent URL: it is an identifier that resolves to wherever the resource currently lives.
• Information about a digital object may change over time, including where to find it, but its DOI name will not change.
• “The DOI System provides a framework for persistent identification, managing intellectual content, managing metadata, linking customers with content suppliers, facilitating electronic commerce, and enabling automated management of media.” (DataCite.org)
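To make the link between a DOI name and a citable URL concrete, the Python sketch below turns a DOI name into a link through the public resolver at https://doi.org/ and assembles a simple data citation. The DOI `10.1234/example.5678`, the creators, the title, and the publisher are invented placeholders. The citation layout is one common pattern, not a required format, and `doi_to_url` and `format_citation` are hypothetical helper names.

```python
from urllib.parse import quote

RESOLVER = "https://doi.org/"
# Prefixes people commonly paste in front of a DOI name.
PREFIXES = (
    "https://doi.org/", "http://doi.org/",
    "https://dx.doi.org/", "http://dx.doi.org/",
    "doi:",
)


def doi_to_url(doi):
    """Turn a DOI name, with or without a prefix, into a resolver URL."""
    name = doi.strip()
    for prefix in PREFIXES:
        if name.lower().startswith(prefix):
            name = name[len(prefix):]
            break
    if not name.startswith("10.") or "/" not in name:
        raise ValueError(f"Not a DOI name: {doi!r}")
    # Percent-encode characters that are unsafe in a URL path.
    return RESOLVER + quote(name, safe="/")


def format_citation(creators, year, title, publisher, doi):
    """Assemble a data citation: Creators (Year): Title. Publisher. URL"""
    names = "; ".join(creators)
    return f"{names} ({year}): {title}. {publisher}. {doi_to_url(doi)}"


citation = format_citation(
    ["Doe, J.", "Roe, R."], 2013,
    "Example Soil Temperature Dataset", "Example Data Center",
    "doi:10.1234/example.5678",
)
```

Because the citation stores the DOI rather than the landing-page address, it stays valid even if the data center later moves the dataset.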

Page 24: CODE AND DATA MANAGEMENT

Beyond ACADIS – Other Resources

General Info and help -
Earth Science Information Partners (ESIP): http://wiki.esipfed.org/
UVA Libraries: http://www2.lib.virginia.edu/brown/data/

Data Management Plan and other tools -
DMP Tool: https://dmp.cdlib.org/
DataOne: https://www.dataone.org/cattools/Data%20and%20Metadata%20Management

Metadata -
Excel Plug-in tool (in development): http://www.cdlib.org/cdlinfo/2011/09/01/facilitating-data-management-dcxl/
Lists of Standards (not complete!) for bio, climate, ecology, oceanography - http://marinemetadata.org/conventions
Stanford-based portal for medical/bio - http://bioportal.bioontology.org/resources