Laurie Goodman at NDIC: Big Data Publishing, Handling & Reuse
TRANSCRIPT
Big Data Publishing, Handling, & Reuse
Laurie Goodman, PhD, Editor-in-Chief, GigaScience
[email protected] | ID: 0000-0001-9724-5976
Beyond Data Release Mandates
What is the point of publishing?
• To disseminate information, knowledge, and ideas.
• To present material so it can be reasonably assessed for its level of quality (and interest).
• To gain credit for career advancement.
What goes into a research article?
+ Area of Interest/Question
Data & Metadata Collection
Analysis/Hypothesis/Analysis
Conclusions
Scientific Communication Via Publication
• “Scholarly articles are merely advertisement of scholarship. The actual scholarly artefacts, i.e. the data and computational methods, which support the scholarship, remain largely inaccessible.” (Jon B. Buckheit and David L. Donoho, WaveLab and Reproducible Research, 1995)
• Core scientific statements or assertions are intertwined and hidden in the conventional scholarly narratives
• Lack of transparency, lack of credit for anything other than “regular” dead tree publication
Kahn, Goodman, & Mittleman (2014). Dragging Scientific Publishing into the 21st Century. http://genomebiology.com/2014/15/12/556
From Journal Delivery to PDF Delivery
Lack of Data and Software Availability Impacts Reproducibility
1. Ioannidis et al. (2009). Repeatability of published microarray gene expression analyses. Nature Genetics 41: 142. 2. Ioannidis JPA (2005). Why Most Published Research Findings Are False. PLoS Med 2(8).
Out of 18 microarray papers, results from 10 could not be reproduced.
Retractions are on the Rise: >15X increase in the last decade
1. Science publishing: The trouble with retractions. http://www.nature.com/news/2011/111005/full/478026a.html
2. Retracted Science and the Retraction Index. http://iai.asm.org/content/79/10/3855.abstract
Deconstructing a paper into accessible, useable, trackable, interlinked units
Need to provide credit to reward sharing and proper organization of:
• Narrative
• Data/Metadata availability/curation
• Software availability
• Interoperability
• Availability of workflows
• Transparent analyses
[Diagram: the paper deconstructed into Data/MetaData, Software, Methods, and Narrative units]
Deconstructing a paper into accessible, useable, trackable, interlinked units
Currently we provide credit for this:
• Narrative
• Data/Metadata availability/curation
• Software availability
• Interoperability
• Availability of workflows
• Transparent analyses
Sometimes we publish these as Methods Papers
Beyond the Narrative: Data and Tools
Promoting Data Release
Data Citation
But, publishing ‘Data’ is “Salami Slicing”!!
What is Salami Slicing?
• Publishing research in several different papers that should form a single cohesive paper.
Why is it ‘unethical’?
• It fragments the scientific literature, wasting researchers’ time as they try to gather all the information related to a very specific topic/dataset/method.
• It can give the appearance (given there are multiple publications) that there is large support for a particular hypothesis.
• It pads a researcher’s publication record unfairly.
Publishing ‘Data’ is “Salami Slicing”! Baloney.
1. Those guidelines were developed prior to the year 2000:
• More than 15 years ago, at a time when the data set sizes and data types collected in the life sciences by a single research group were relatively small and primarily suitable for a single or narrow range of disciplines or hypotheses.
• Most journals were not online (which allows easier identification of, and access to, closely related articles) until the late ‘90s.
2. In 2005, COPE* ruled that a paper whose data had been used and described, at least in part, in a previous publication was not unethical. *Committee on Publication Ethics. http://www.publicationethics.org/case/salami-publication
3. Data collection can be (should be!!) a scholarly pursuit:
• Data that are broadly reusable require care, thought, training, time, and money to be properly collected, curated, stored, and shared.
Contrary to popular belief…
There are very few (if any) ‘push-a-button-and-get-it’ reusable data resources.
You’re not supposed to just collect samples! *Collect ALL available metadata*
Help Develop a Digital Data Curation Team at your Institution’s Library (they may already have one…)
Back to Darwin
[Diagram: Area of Interest/Question → Data & Metadata Collection/Experiments → Analysis/Hypothesis/Analysis → Conclusions; 1839 → 1859: 20 yrs.]
Say… was this a Data Publication?
The most curious fact is the perfect gradation in the size of the beaks in the different species of Geospiza, from one as large as that of a hawfinch to that of a chaffinch, and (if
Mr. Gould is right in including his sub-group, Certhidea, in the main group) even to that of a warbler. The largest beak in the genus Geospiza is shown in Fig. 1, and the smallest in Fig. 3; but instead of there being only one intermediate species, with a beak of the size shown in Fig. 2, there are no less than six species with insensibly graduated beaks. (The Voyage of the Beagle, Chapter 17)
DataCite and DOIs
• Aims to “increase acceptance of research data as legitimate, citable contributions to the scholarly record”.
• “data generated in the course of research are just as valuable to the ongoing academic discourse as papers and monographs”.
Citing Data Isn’t New
The Physical Sciences have been doing this for a while…
What we’re doing: Mandating and Aiding Data Release
Requiring all data supporting the work to be freely available in a publicly available repository.
– How we’re helping to do this:
• Journal-dedicated data and software repository, GigaDB, that hosts ALL data types.
• Have biocurators to aid in handling metadata.
• All datasets are provided a Digital Object Identifier (DOI), making them citable and countable.
• All material in GigaDB is available under a CC0 waiver.
• Data types with a community-approved database must be submitted there as well.
• Provide direct links to all associated information.
What we’re doing: Mandating and Aiding Software Release
Requiring all software and workflows to be freely available in a publicly available repository.
– How we’re promoting this:
• All software created by authors must be 100% OSI-compliant.
• Journal-dedicated repository GigaDB hosts software so it can be downloaded.
• Software and workflows are provided a DOI, making them citable and countable (reward).
• Journal-dedicated Galaxy platform to run tools.
• Have a Data Manager and Data Scientist to wrap and deploy software tools.
• Have our own GitHub repository.
How we view publishing at GigaScience:
• Paper in GigaScience (Narrative + Methods): open-access journal
• Data sets in GigaDB (data publishing platform), linked to the paper
• Analyses/Workflows in GigaGalaxy (data computation and analysis platform), linked to the paper
Making the Data Itself Citable
We provide a linked journal database. This links the data directly to our papers to ease reproducibility, makes the data available at the time of review, and provides authors a place to submit data that has no sustainable ‘home’.
Note: there are many community-available databases, so in principle any journal can do this by taking advantage of such resources.
These include the usual suspects: EBI, NCBI, DDBJ, etc. Databases that take all data types and provide data DOIs: Dryad, FigShare, etc. There are also numerous smaller community databases specific to different fields or data types.
Some of the Journals Currently Doing Data Publication
http://proj.badc.rl.ac.uk/preparde/blog/DataJournalsList
Citing Data in the References Allows Tracking
This rewards authors for making data available AND makes it easier to find
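Citing a dataset in the reference list works much like citing an article: DataCite recommends the form Creator (PublicationYear): Title. Publisher. Identifier. As a minimal sketch of assembling such a tracking-friendly citation string (the author names here are illustrative placeholders, not taken from the talk; the DOI and title are the GigaDB macaque example mentioned later):

```python
def format_data_citation(creators, year, title, publisher, doi):
    """Build a citation string in the DataCite-recommended style:
    Creator (PublicationYear): Title. Publisher. Identifier."""
    authors = "; ".join(creators)
    return f"{authors} ({year}): {title}. {publisher}. https://doi.org/{doi}"

# Illustrative example using the GigaDB dataset DOI from this talk;
# the creator names are hypothetical.
citation = format_data_citation(
    creators=["Smith J", "Jones A"],
    year=2011,
    title=("Genomic data from the crab-eating macaque/cynomolgus monkey "
           "(Macaca fascicularis)"),
    publisher="GigaScience Database",
    doi="10.5524/100003",
)
print(citation)
```

Because the DOI appears in the reference list in resolvable form, citation-indexing services can count dataset citations exactly as they count article citations.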
But is this being done? Yes.
Is cited data being tracked? Yes.
Improving Quality as Well as Availability
How Hard is Data and Software Review?
Not really that much harder than narrative review.
The GigaDB submission workflow:
1. Submission: the submitter logs in to the GigaDB website and uploads an Excel submission or uses the online wizard.
2. Validation checks: Fail – the submitter is provided an error report; Pass – the dataset is uploaded to GigaDB.
3. Curator review: the curator contacts the submitter with the DOI citation and to arrange file transfer (and to resolve any other questions/issues).
4. Files: the submitter provides files by ftp or Aspera.
5. DOI assigned: a DataCite XML file is generated and registered with DataCite.
6. Public GigaDB dataset: the curator makes the dataset public (can be set to a future date if required).
Example: DOI 10.5524/100003, Genomic data from the crab-eating macaque/cynomolgus monkey (Macaca fascicularis) (2011).
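The DataCite registration step amounts to generating an XML metadata record from the curated submission. The sketch below is a simplified illustration only: the real DataCite kernel schema is versioned, namespaced, and has more mandatory fields (creators, resourceType, etc.) than shown here.

```python
import xml.etree.ElementTree as ET

def build_datacite_record(doi, title, publisher, year):
    """Build a minimal DataCite-style XML record from curated metadata.
    Simplified sketch: omits the schema namespace and several fields
    the real DataCite kernel requires."""
    resource = ET.Element("resource")
    identifier = ET.SubElement(resource, "identifier", identifierType="DOI")
    identifier.text = doi
    titles = ET.SubElement(resource, "titles")
    ET.SubElement(titles, "title").text = title
    ET.SubElement(resource, "publisher").text = publisher
    ET.SubElement(resource, "publicationYear").text = str(year)
    return ET.tostring(resource, encoding="unicode")

xml_record = build_datacite_record(
    doi="10.5524/100003",
    title=("Genomic data from the crab-eating macaque/cynomolgus monkey "
           "(Macaca fascicularis)"),
    publisher="GigaScience Database",
    year=2011,
)
print(xml_record)
```

The generated record is what gets registered with DataCite so the DOI resolves and the dataset becomes citable and countable.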
Data must be available for review with the manuscript (and at the very least get a sanity check…)
Reviewing Data in More Detail
Issue: We can’t ask our reviewers to do that!
Our finding: Reviewers don’t mind.
Reviewer Dr. Christophe Pouzat on a neuroscience manuscript: “In addition to making the presented research trustworthy, the reproducible research paradigm definitely makes the reviewers job more fun!”
Can also use specific Data Reviewers (we have).
Reviewing Data AND Software: the SOAPdenovo2 example
• Open-Code: code in SourceForge under GPLv3 (http://soapdenovo2.sourceforge.net/), >5000 downloads; the code was picked apart by bloggers in a wiki (http://homolog.us/wiki/index.php?title=SOAPdenovo2).
• Open-Paper, Open-Review: DOI:10.1186/2047-217X-1-18, >35,000 accesses; 8 reviewers tested the data on an ftp server, and their named reports were published.
• Open-Pipelines/Open-Workflows: DOI:10.5524/100044
• Open-Data: DOI:10.5524/100038, 78 GB of CC0 data
Data sets and analyses are linked to the paper, each with its own DOI.
8 Reviewers! Holy cow, that must have taken forever!!
Submission: July 24. Final review: Aug 28.
These were reviewing teams from different labs, assessing the materials at multiple levels.
Is this really worth the effort?
Beyond Reproducibility: REUSE
Data Availability and Tools
http://blogs.biomedcentral.com/gigablog/2014/05/14/the-latest-weapon-in-publishing-data-the-polar-bear/
These data were released THREE YEARS before publication of the analysis article.
The polar bear DATA were released, prepublication, in 2011. They were used and cited in the following studies before the main paper on the sequencing was published:
Hailer, F et al., Nuclear genomic sequences reveal that polar bears are an old and distinct bear lineage. Science. 2012 Apr 20;336(6079):344-7. doi:10.1126/science.1216424.
Cahill, JA et al., Genomic evidence for island population conversion resolves conflicting theories of polar bear evolution. PLoS Genet. 2013;9(3):e1003345. doi:10.1371/journal.pgen.1003345.
Morgan, CC et al., Heterogeneous models place the root of the placental mammal phylogeny. Mol Biol Evol. 2013 Sep;30(9):2145-56. doi:10.1093/molbev/mst117.
Cronin, MA et al., Molecular Phylogeny and SNP Variation of Polar Bears (Ursus maritimus), Brown Bears (U. arctos), and Black Bears (U. americanus) Derived from Genome Sequences. J Hered. 2014; 105(3):312-23. doi:10.1093/jhered/est133.
Bidon, T et al., Brown and Polar Bear Y Chromosomes Reveal Extensive Male-Biased Gene Flow within Brother Lineages. Mol Biol Evol. 2014 Apr 4. doi:10.1093/molbev/msu109.
Even though the data had been released over two years earlier and cited in other papers, the main analysis paper was published in Cell.
Cell Press journals had indicated that publishing a dataset prior to publication could be considered prior publication.
• New sequencing technology: MinION, Oxford Nanopore
• New sequence data type: EBI and NCBI databases were not ready
• High community interest in testing the data
• >100 GB of data
Real-time use during the publication process:
• Uploaded prior to publication
• Deployed on Amazon CloudFront
• Ongoing testing/comparison/information sharing prior to publication
• When ready for the data, EBI used our cloud to upload it
• EBI transferred the data to NCBI when they were ready
Getting past…
…look but don't touch
Reproduce and Reuse Needs Much More
• Data: GigaDB
• Software: GitHub
• Workflows: Galaxy, executable documents, VMs
• Images: OMERO
• Cloud storage, tools, and compute power… needed so this can reach the smaller labs
github.com/gigascience/gigadb-cogini
More journals have introduced, or are starting to introduce, these and other tools. More is needed…
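Reuse at this scale also means recipients must be able to confirm that a multi-gigabyte transfer (like the 78 GB SOAPdenovo2 dataset) arrived intact. A generic sketch, not GigaScience's actual tooling, of checking files against the kind of MD5 manifest data repositories commonly ship alongside their files:

```python
import hashlib

def md5_of(data: bytes) -> str:
    """Hex MD5 digest of a byte string."""
    return hashlib.md5(data).hexdigest()

def verify_against_manifest(files: dict, manifest: dict) -> list:
    """Return the names of files whose checksum does not match the
    manifest (missing manifest entries also count as mismatches)."""
    return [name for name, data in files.items()
            if md5_of(data) != manifest.get(name)]

# Toy example with in-memory "files"; real use would stream from disk.
files = {"reads.fastq": b"hello"}
manifest = {"reads.fastq": "5d41402abc4b2a76b9719d911017c592"}  # md5(b"hello")
mismatches = verify_against_manifest(files, manifest)
print(mismatches)  # an empty list means every file verified
```

A streaming variant (hashing in chunks) would be needed for files that do not fit in memory, but the manifest-comparison logic is the same.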
Currently… it feels like this…
Well… …because it is like this.
If we want to move forward, we need to go through that to reach this. It will require researchers, institutions, publishers, and funders working together.
Thanks to:
Scott Edmunds, Executive Editor
Nicole Nogoy, Commissioning Editor
Peter Li, Lead Data Manager
Chris Hunter, Lead BioCurator
Rob Davidson, Data Scientist
Xiao (Jesse) Si Zhe, Database Developer
Amye Kenall, Journal Development Manager
Contact us: [email protected] | …@gigasciencejournal.com
Follow us: @GigaScience | facebook.com/GigaScience | blogs.openaccesscentral.com/blogs/gigablog
www.gigasciencejournal.com | www.gigadb.org