data management (from day 0)
DESCRIPTION
Practical experiences around data handling from a chemist, now working in biology. The story starts on the day that I started managing my PhD thesis research in a version control system, originally Subversion, later Git. It then moves on to all the issues around data in publishing, data licensing (be sure you understand what you're using), online repositories, a bit of data citation, repository-integrated data analysis, finding a conclusion, and returning to day 1.
TRANSCRIPT
Department of Bioinformatics - BiGCaT 1
Data Management (from day 0)
Egon Willighagen (@egonwillighagen)
3 April 2014, Masterclass RDM in NL
Day 0: data plan
Before you start doing an experiment, you get a lab notebook.
(Some universities already require electronic lab notebooks!)
Day 1: the electronic lab notebook
• Version Control System
  – Allows backups
  – Allows annotation
  – Dated changes
Day 2: be careful what data you use
• Availability in 4 years?
  – Your Library/University has a copy?
• Can you read the format?
• Can you copy the data and share (e.g. with collaborators)?
• What if the journal you publish in requires you to share data?
Day 3: store everything
• Experiments
  – Description
  – Results (images, measurements, …)
• Written output
  – Reports, papers, presentations
Day 4: Analyse data directly from a repository
Willighagen E. (2014) Accessing biological data in R with semantic web technologies. PeerJ PrePrints 2:e185v3. doi:10.7287/peerj.preprints.185v3
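# Requires the biomaRt (Bioconductor) and rrdf packages.
library(rrdf)

# Look up five BRCA1 SNPs in the Ensembl SNP mart.
mart = biomaRt::useMart(biomart="snp", dataset="hsapiens_snp")
brca1 = c("rs16940", "rs16941", "rs16942", "rs799916", "rs799917")
# Assumption: the slide does not show `attribs`; these are example attribute names.
attribs = c("refsnp_id", "chr_name", "chrom_start")
data = biomaRt::getBM(attributes=attribs, filters=c("snp_filter"),
                      values=brca1, mart=mart)

# List every predicate/object pair of one ChEMBL assay via a remote SPARQL endpoint.
results = sparql.remote(
  "http://rdf.farmbio.uu.se/chembl/sparql", paste(
    "SELECT DISTINCT ?predicate ?object WHERE {",
    "  ?assay <http://www.w3.org/2000/01/rdf-schema#label> \"CHEMBL615603\" ;",
    "         ?predicate ?object . }"
  ))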
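The SPARQL query above hard-codes one assay label. A minimal sketch of how the same query text could be built for any label is shown here; the helper name `assay_query` and the rule that labels are plain alphanumeric identifiers are assumptions for illustration, not part of the original slides.

```r
# Hypothetical helper (not from the slides): build the SPARQL query text
# for a single assay label.
assay_query <- function(label) {
  # Assumption: labels are plain identifiers such as "CHEMBL615603".
  # Anything else is rejected so it cannot break out of the string literal.
  stopifnot(is.character(label), length(label) == 1,
            grepl("^[A-Za-z0-9_]+$", label))
  paste(
    "SELECT DISTINCT ?predicate ?object WHERE {",
    sprintf("  ?assay <http://www.w3.org/2000/01/rdf-schema#label> \"%s\" ;", label),
    "         ?predicate ?object . }",
    sep = "\n"
  )
}

query <- assay_query("CHEMBL615603")
cat(query, "\n")
```

The resulting string can be passed as the second argument of `sparql.remote()` in place of the inline `paste(...)` call.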
Day 4: Analyses inside your report
http://yihui.name/knitr/
<p>We can also produce plots (centered by the option <code>fig.align='center'</code>):</p>
<!--begin.rcode html-cars-scatter, message=FALSE, fig.align='center'
library(ggplot2)
plot(mpg~hp, mtcars)
qplot(hp, mpg, data=mtcars)+geom_smooth()
end.rcode-->
Day 5: Large Repositories
• UniProt, ChEMBL, Gene Ontology
  – Is there a deposition workflow?
• Growing repositories
  – WikiPathways
• Set up a new database (paper+1)
  – e.g. DrugMet
  – Problem: what about small data?
• Journal driven
  – CSD
  – PDB
Day 5: Database Seeds
• Set up a new database (paper += 1)
  – e.g. DrugMet
S. Lampa + me
CC-SA, but data CC0
Day 5: National Repositories
Day 5: Small Data @ FigShare
Day 5: Scientific dissemination
• Data sharing: copyright
  – Can data be copyrighted?
  – Data source: you, lab mates, others?
  – Ownership
• Data sharing: license
  – Do you want your data reused?
  – And be modified (format!)?
  – Commercial use?
Day 6: Format? Why not SemWeb?
• 5 Star Open Data (5stardata.info)
  – openly available, reusable, open format, URIs (ontologies etc.), linked data
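To make the "URIs" and "linked data" stars concrete, here is a minimal sketch of turning a small table into N-Triples, a line-based RDF format. The example rows, the `http://example.org/compound/` namespace, and the helper name `to_ntriples` are all assumptions made for illustration; a real data set would use a namespace you control and would need proper escaping of literals.

```r
# Made-up example data: two rows with hypothetical identifiers and labels.
compounds <- data.frame(
  id = c("c1", "c2"),
  label = c("aspirin", "caffeine"),
  stringsAsFactors = FALSE
)

# Assumed base URI (placeholder) and the standard rdfs:label predicate.
base_uri <- "http://example.org/compound/"
rdfs_label <- "http://www.w3.org/2000/01/rdf-schema#label"

# Hypothetical helper: one N-Triples line per row,
# in the form <subject> <predicate> "literal" .
# Note: literals are not escaped here, so labels must not contain quotes.
to_ntriples <- function(df, base, predicate) {
  sprintf("<%s%s> <%s> \"%s\" .", base, df$id, predicate, df$label)
}

triples <- to_ntriples(compounds, base_uri, rdfs_label)
writeLines(triples)
```

Each row gets its own URI as subject, so other data sets can link to it by that URI rather than by a local row number.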
Linked Open Data Cloud
Day 7: are people using your work?
Day 8: back to step 0
• Take feedback (“peer review”), study new uses
• Plan your next study
CC-BY frankensteinnn@flickr