laurie goodman at ndic: big data publishing, handling & reuse


Upload: gigascience-bgi-hong-kong

Post on 03-Aug-2015


TRANSCRIPT

Page 1: Laurie Goodman at NDIC: Big Data Publishing, Handling & Reuse

Big Data Publishing, Handling, & Reuse

Laurie Goodman, PhD
Editor-in-Chief, GigaScience

[email protected]
ORCID ID: 0000-0001-9724-5976

Beyond Data Release Mandates

Page 2: Laurie Goodman at NDIC: Big Data Publishing, Handling & Reuse

What is the point of publishing?

• To disseminate information/knowledge/ideas.

• To present material so it can be reasonably assessed for its level of quality (and interest).

• To gain credit for career advancement.

Page 3: Laurie Goodman at NDIC: Big Data Publishing, Handling & Reuse

What goes into a research article?

+ Area of Interest/Question

Page 4: Laurie Goodman at NDIC: Big Data Publishing, Handling & Reuse

What goes into a research article?

+ Area of Interest/Question

Data & Metadata Collection

Analysis/Hypothesis/Analysis

Conclusions

Page 5: Laurie Goodman at NDIC: Big Data Publishing, Handling & Reuse

What goes into a research article?

Analysis/Hypothesis/Analysis

Conclusions

+ Area of Interest/Question

Data & Metadata Collection

Page 6: Laurie Goodman at NDIC: Big Data Publishing, Handling & Reuse

Scientific Communication Via Publication

• “Scholarly articles are merely advertisement of scholarship. The actual scholarly artefacts, i.e. the data and computational methods, which support the scholarship, remain largely inaccessible.” (Jon B. Buckheit and David L. Donoho, WaveLab and Reproducible Research, 1995)

• Core scientific statements or assertions are intertwined and hidden in the conventional scholarly narratives

• Lack of transparency, lack of credit for anything other than “regular” dead tree publication

Page 7: Laurie Goodman at NDIC: Big Data Publishing, Handling & Reuse

Kahn, Goodman, & Mittleman. Dragging Scientific Publishing into the 21st Century 2014 http://genomebiology.com/2014/15/12/556

From Journal Delivery to PDF Delivery

Page 8: Laurie Goodman at NDIC: Big Data Publishing, Handling & Reuse

Lack of Data and Software Availability Impacts Reproducibility

1. Ioannidis et al. (2009). Repeatability of published microarray gene expression analyses. Nature Genetics 41: 149.
2. Ioannidis JPA (2005). Why Most Published Research Findings Are False. PLoS Med 2(8).

Out of 18 microarray papers, results from 10 could not be reproduced

Page 9: Laurie Goodman at NDIC: Big Data Publishing, Handling & Reuse

Retractions are on the Rise

>15X increase in last decade

1. Science publishing: The trouble with retractions. http://www.nature.com/news/2011/111005/full/478026a.html
2. Retracted Science and the Retraction Index. http://iai.asm.org/content/79/10/3855.abstract?

Page 10: Laurie Goodman at NDIC: Big Data Publishing, Handling & Reuse

Deconstructing a paper into accessible, useable, trackable, interlinked units

Need to provide credit to reward sharing and proper organization of:
• Narrative
• Data/Metadata availability/curation
• Software availability
• Interoperability
• Availability of workflows
• Transparent analyses

Data/MetaData

Software

Methods

Narrative

Page 11: Laurie Goodman at NDIC: Big Data Publishing, Handling & Reuse

Deconstructing a paper into accessible, useable, trackable, interlinked units

Currently we provide credit for this:

• Narrative
• Data/Metadata availability/curation
• Software availability
• Interoperability
• Availability of workflows
• Transparent analyses

Data/MetaData

Software

Methods

Narrative

Sometimes we publish these as Methods Papers

Page 12: Laurie Goodman at NDIC: Big Data Publishing, Handling & Reuse

Beyond the Narrative

Data And Tools

Page 13: Laurie Goodman at NDIC: Big Data Publishing, Handling & Reuse

Promoting Data Release

Data Citation

Page 14: Laurie Goodman at NDIC: Big Data Publishing, Handling & Reuse

But- publishing ‘Data’ is “Salami Slicing”!!

What is Salami Slicing?
• Publishing research in several different papers that should form a single cohesive paper

Why is it ‘unethical’?
• It fragments the scientific literature, wasting researchers’ time as they try to gather all the information related to a very specific topic/dataset/method

• It can give the appearance (given there are multiple publications) that there is large support for a particular hypothesis

• It pads a researcher’s publication record unfairly

Page 15: Laurie Goodman at NDIC: Big Data Publishing, Handling & Reuse

Publishing ‘Data’ is “Salami Slicing”! Baloney

1. Those guidelines were developed prior to the year 2000:
• More than 15 years ago: at a time when data set sizes and data types collected in the life sciences by a single research group were relatively small and primarily suitable for a single or narrow range of disciplines or hypotheses.
• Most journals were not online (which allows easier identification of and access to closely related articles) until the late ‘90s.

2. In 2005, COPE* ruled that a paper containing data that had been used and described, at least in part, in a previous publication was not unethical.
*Committee on Publication Ethics. http://www.publicationethics.org/case/salami-publication

3. Data collection can be (should be!!) a scholarly pursuit:
• Data that is broadly reusable requires care, thought, training, time, and money to be properly collected, curated, stored, and shared.

Page 16: Laurie Goodman at NDIC: Big Data Publishing, Handling & Reuse

Contrary to popular belief…

There are very few —if any—

‘push-a-button-and-get-it’ reusable data resources

Page 17: Laurie Goodman at NDIC: Big Data Publishing, Handling & Reuse

You’re not supposed to just collect samples!

*Collect ALL available metadata*

Help Develop a Digital Data Curation Team at your Institution’s Library (they may already have one…)

Page 18: Laurie Goodman at NDIC: Big Data Publishing, Handling & Reuse

Back to Darwin

Data & Metadata Collection/Experiments

Analysis/Hypothesis/Analysis

Conclusions

+ Area of Interest/Question

1839

1859

20 Yrs.

Page 19: Laurie Goodman at NDIC: Big Data Publishing, Handling & Reuse

Say… was this a Data Publication?

Data & Metadata Collection/Experiments

Analysis/Hypothesis/Analysis

Conclusions

+ Area of Interest/Question

1839

1859

The most curious fact is the perfect gradation in the size of the beaks in the different species of Geospiza, from one as large as that of a hawfinch to that of a chaffinch, and (if Mr. Gould is right in including his sub-group, Certhidea, in the main group) even to that of a warbler. The largest beak in the genus Geospiza is shown in Fig. 1, and the smallest in Fig. 3; but instead of there being only one intermediate species, with a beak of the size shown in Fig. 2, there are no less than six species with insensibly graduated beaks. (Chapter 17)

Page 20: Laurie Goodman at NDIC: Big Data Publishing, Handling & Reuse

DataCite and DOIs

• Aims to “increase acceptance of research data as legitimate, citable contributions to the scholarly record”.

• “data generated in the course of research are just as valuable to the ongoing academic discourse as papers and monographs”.

Citing Data Isn’t New

The Physical Sciences have been doing this for a while…

Page 21: Laurie Goodman at NDIC: Big Data Publishing, Handling & Reuse
Page 22: Laurie Goodman at NDIC: Big Data Publishing, Handling & Reuse

What we’re doing: Mandating and Aiding Data Release

Requiring all data supporting the work to be freely available in a publicly available repository

– How we’re helping to do this:
• Journal-dedicated data and software repository, GigaDB, that hosts ALL data types.
• Have Biocurators to aid in handling metadata
• All datasets are provided a Digital Object Identifier (DOI), making them citable and countable
• All material in GigaDB is available under a CC0 Waiver
• Data for which a community-approved public database exists must be submitted there as well
• Provide direct links to all associated information
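Because every dataset gets a DOI, a citation for it can be assembled the same way as one for an article. The following is a minimal sketch of that idea, not GigaDB's own tooling: the `Dataset` class, the `format_citation` helper, the "Creator (Year): Title. Publisher. DOI" layout, the placeholder creator names, and the publisher string are all assumptions made for illustration. Only the DOI, title, and year come from the example dataset shown on the submission-workflow slide.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    creators: list[str]  # placeholder names; the slides do not list the authors
    year: int
    title: str
    publisher: str       # assumed repository name, used only for illustration
    doi: str


def format_citation(ds: Dataset) -> str:
    """Return a 'Creator (Year): Title. Publisher. DOI-URL' style citation."""
    authors = "; ".join(ds.creators)
    return f"{authors} ({ds.year}): {ds.title}. {ds.publisher}. https://doi.org/{ds.doi}"


example = Dataset(
    creators=["Creator A", "Creator B"],  # hypothetical
    year=2011,
    title="Genomic data from the crab-eating macaque/cynomolgus monkey (Macaca fascicularis)",
    publisher="GigaScience Database",     # assumption
    doi="10.5524/100003",
)
print(format_citation(example))
```

Placing a string like this in a paper's reference list is what makes the dataset "citable and countable": the DOI gives indexing services a stable key to count against.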

Page 23: Laurie Goodman at NDIC: Big Data Publishing, Handling & Reuse

What we’re doing: Mandating and Aiding Software Release

Requiring all software and work to be freely available in a publicly available repository

– How we’re promoting this:
• All software created by authors must be 100% OSI compliant
• Journal-dedicated repository GigaDB hosts software so it can be downloaded.
• Software and workflows are provided a DOI, making them citable and countable (reward)
• Journal-dedicated Galaxy platform to run tools
• Have a Data Manager and Data Scientist to wrap and deploy software tools
• Have our own GitHub repository

Page 24: Laurie Goodman at NDIC: Big Data Publishing, Handling & Reuse

How we view publishing at GigaScience

• Paper in GigaScience (Narrative + Methods): open-access journal
• Data Sets in GigaDB (linked to the paper): data publishing platform
• Analyses/Workflows in GigaGalaxy (linked to the paper): data computation analysis platform

Page 25: Laurie Goodman at NDIC: Big Data Publishing, Handling & Reuse

Making the Data Itself Citable

We provide a linked journal database. This is done to link the data directly to our papers to ease reproducibility, to make the data available at the time of review, and to give authors a place to submit data that have no sustainable ‘home’.

Note: there are many community-available databases, so in principle any journal can do this by taking advantage of such resources.

These include the usual suspects: EBI, NCBI, DDBJ, etc.
Databases that take all data types and provide data DOIs: Dryad, FigShare, etc.
There are also numerous smaller community databases specific to different fields or data types.

Page 26: Laurie Goodman at NDIC: Big Data Publishing, Handling & Reuse

Some of the Journals Currently Doing Data Publication

http://proj.badc.rl.ac.uk/preparde/blog/DataJournalsList

Page 27: Laurie Goodman at NDIC: Big Data Publishing, Handling & Reuse

Citing Data in the References Allows Tracking

This rewards authors for making data available AND makes it easier to find

But is this being done?

Page 28: Laurie Goodman at NDIC: Big Data Publishing, Handling & Reuse
Page 29: Laurie Goodman at NDIC: Big Data Publishing, Handling & Reuse

Yes:

Page 30: Laurie Goodman at NDIC: Big Data Publishing, Handling & Reuse
Page 31: Laurie Goodman at NDIC: Big Data Publishing, Handling & Reuse

Yes:

Page 32: Laurie Goodman at NDIC: Big Data Publishing, Handling & Reuse

Is Cited Data Being Tracked?

Yes:

Page 33: Laurie Goodman at NDIC: Big Data Publishing, Handling & Reuse

Improving Quality as Well as Availability

How Hard is Data and Software Review?

Not really that much harder than narrative review.

Page 34: Laurie Goodman at NDIC: Big Data Publishing, Handling & Reuse

[Flowchart: GigaDB submission workflow]

• Submission: Submitter logs in to GigaDB website and uploads Excel submission or uses online wizard
• Validation checks:
  – Fail: submitter is provided error report
  – Pass: dataset is uploaded to GigaDB
• DOI assigned
• Curator Review: Curator contacts submitter with DOI citation and to arrange file transfer (and resolve any other questions/issues)
• Files: Submitter provides files by ftp or Aspera
• DataCite XML file: XML is generated and registered with DataCite
• Curator makes dataset public (can be set as future date if required)
• Public GigaDB dataset, e.g. DOI 10.5524/100003: Genomic data from the crab-eating macaque/cynomolgus monkey (Macaca fascicularis) (2011)

Data must be available for review with the manuscript (and at the very least get a sanity check…)
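The validation and metadata-registration steps in this workflow can be sketched in a few lines. This is an illustrative outline only, not GigaDB's implementation: the required field names, the year check, and the `validate`, `build_metadata_xml`, and `process` helpers are hypothetical, and the XML is a simplified DataCite-like record that has not been validated against the DataCite schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical required fields; GigaDB's real checks are not described in the slides.
REQUIRED_FIELDS = ("title", "creators", "publication_year", "file_names")


def validate(submission: dict) -> list[str]:
    """Return a list of error messages; an empty list means the checks pass."""
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS
              if not submission.get(name)]
    year = submission.get("publication_year")
    if year and not (isinstance(year, int) and 1900 <= year <= 2100):
        errors.append("publication_year must be a four-digit year")
    return errors


def build_metadata_xml(submission: dict, doi: str) -> str:
    """Build a simplified, DataCite-like XML record (not schema-validated)."""
    resource = ET.Element("resource")
    identifier = ET.SubElement(resource, "identifier", identifierType="DOI")
    identifier.text = doi
    creators = ET.SubElement(resource, "creators")
    for name in submission["creators"]:
        creator = ET.SubElement(creators, "creator")
        ET.SubElement(creator, "creatorName").text = name
    titles = ET.SubElement(resource, "titles")
    ET.SubElement(titles, "title").text = submission["title"]
    ET.SubElement(resource, "publicationYear").text = str(submission["publication_year"])
    return ET.tostring(resource, encoding="unicode")


def process(submission: dict, doi: str) -> dict:
    """Fail with an error report, or pass and attach the metadata record."""
    errors = validate(submission)
    if errors:
        return {"status": "fail", "error_report": errors}
    return {"status": "pass", "doi": doi, "xml": build_metadata_xml(submission, doi)}


good = {
    "title": "Genomic data from the crab-eating macaque/cynomolgus monkey (Macaca fascicularis)",
    "creators": ["Creator A"],         # placeholder name
    "publication_year": 2011,
    "file_names": ["assembly.fa.gz"],  # hypothetical file name
}
bad = {"title": "Untitled"}

print(process(good, "10.5524/100003")["status"])
print(process(bad, "10.5524/100003")["error_report"])
```

Returning the error list instead of raising an exception mirrors the "Fail: submitter is provided error report" branch: the submitter sees every problem at once rather than one per attempt.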

Page 35: Laurie Goodman at NDIC: Big Data Publishing, Handling & Reuse

Reviewing Data in More Detail

Issue: We can’t ask our reviewers to do that!
Our finding: Reviewers don’t mind.

Reviewer Dr. Christophe Pouzat on a neuroscience manuscript: “In addition to making the presented research trustworthy, the reproducible research paradigm definitely makes the reviewers job more fun!”

We can also use specific Data Reviewers (we have).

Page 36: Laurie Goodman at NDIC: Big Data Publishing, Handling & Reuse

Reviewing Data AND Software

[Figure: paper, data sets, and analyses linked to one another via DOIs]

• Open-Paper: DOI:10.1186/2047-217X-1-18, >35,000 accesses
• Open-Review: 8 reviewers tested data in ftp server & named reports published
• Open-Data: DOI:10.5524/100038, 78GB CC0 data
• Open-Pipelines, Open-Workflows: DOI:10.5524/100044
• Open-Code: code in SourceForge under GPLv3: http://soapdenovo2.sourceforge.net/, >5000 downloads
• Enabled code to be picked apart by bloggers in wiki: http://homolog.us/wiki/index.php?title=SOAPdenovo2

Page 37: Laurie Goodman at NDIC: Big Data Publishing, Handling & Reuse

8 Reviewers! Holy Cow- that must have taken forever!!

Submission: July 24

Final review: Aug 28

These were reviewing teams from different labs, assessing the materials at multiple levels
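For a sense of scale, the turnaround implied by those two dates can be worked out directly. The slide gives only month and day, so the year below is an assumption; the interval is the same in any year.

```python
from datetime import date

REVIEW_YEAR = 2012  # assumption: the slide gives only month and day

submitted = date(REVIEW_YEAR, 7, 24)
final_review = date(REVIEW_YEAR, 8, 28)
elapsed = final_review - submitted

print(f"{elapsed.days} days (about {elapsed.days // 7} weeks)")  # 35 days (about 5 weeks)
```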

Page 38: Laurie Goodman at NDIC: Big Data Publishing, Handling & Reuse

Is this really worth the effort?

Beyond Reproducibility:

REUSE

Data Availability and Tools

Page 39: Laurie Goodman at NDIC: Big Data Publishing, Handling & Reuse

http://blogs.biomedcentral.com/gigablog/2014/05/14/the-latest-weapon-in-publishing-data-the-polar-bear/

These data were released THREE YEARS before publication of the analysis article

Page 40: Laurie Goodman at NDIC: Big Data Publishing, Handling & Reuse

The polar bear DATA were released (prepublication) in 2011.

They were used and cited in the following studies before the main paper on the sequencing was published:

Hailer, F et al., Nuclear genomic sequences reveal that polar bears are an old and distinct bear lineage. Science. 2012 Apr 20;336(6079):344-7. doi:10.1126/science.1216424.

Cahill, JA et al., Genomic evidence for island population conversion resolves conflicting theories of polar bear evolution. PLoS Genet. 2013;9(3):e1003345. doi:10.1371/journal.pgen.1003345.

Morgan, CC et al., Heterogeneous models place the root of the placental mammal phylogeny. Mol Biol Evol. 2013 Sep;30(9):2145-56. doi:10.1093/molbev/mst117.

Cronin, MA et al., Molecular Phylogeny and SNP Variation of Polar Bears (Ursus maritimus), Brown Bears (U. arctos), and Black Bears (U. americanus) Derived from Genome Sequences. J Hered. 2014; 105(3):312-23. doi:10.1093/jhered/est133.

Bidon, T et al., Brown and Polar Bear Y Chromosomes Reveal Extensive Male-Biased Gene Flow within Brother Lineages. Mol Biol Evol. 2014 Apr 4. doi:10.1093/molbev/msu109

http://blogs.biomedcentral.com/gigablog/2014/05/14/the-latest-weapon-in-publishing-data-the-polar-bear/

Page 41: Laurie Goodman at NDIC: Big Data Publishing, Handling & Reuse

Even though the data had been released over 2 years earlier and cited in other papers- the main analysis paper was published in Cell

Page 42: Laurie Goodman at NDIC: Big Data Publishing, Handling & Reuse

Cell Press Journals had indicated publishing a dataset prior to publication could be considered as prior publication

Page 43: Laurie Goodman at NDIC: Big Data Publishing, Handling & Reuse

Real-time use during the publication process

• New sequencing technology: MinION (Oxford Nanopore)
• New sequence data type: EBI and NCBI databases not ready
• High community interest in testing the data
• >100 GB of data
• Uploaded prior to publication
• Deployed on Amazon CloudFront
• Ongoing testing/comparison/information sharing prior to publication
• When EBI was ready for the data, it used our cloud to upload them
• EBI transferred the data to NCBI when they were ready

Page 44: Laurie Goodman at NDIC: Big Data Publishing, Handling & Reuse

Getting past…

…look but don't touch

Page 45: Laurie Goodman at NDIC: Big Data Publishing, Handling & Reuse

Reproduce and Reuse Needs Much More

• Data: GigaDB
• Software: GitHub
• Workflows
  – Galaxy
  – Executable Docs
  – VMs
• Images: OMERO
• Cloud storage, tools, and compute power…
• Need this to reach the smaller labs

github.com/gigascience/gigadb-cogini

More journals have introduced, or are starting to introduce, these and other tools. More is needed…

Page 46: Laurie Goodman at NDIC: Big Data Publishing, Handling & Reuse

Currently… it feels like this…

Well… …because it is like this

Page 47: Laurie Goodman at NDIC: Big Data Publishing, Handling & Reuse

If we want to move forward, we need to go through that to reach this:

It will require researchers, institutions, publishers, and funders working together.

Page 48: Laurie Goodman at NDIC: Big Data Publishing, Handling & Reuse

Thanks to:
Scott Edmunds, Executive Editor
Nicole Nogoy, Commissioning Editor
Peter Li, Lead Data Manager
Chris Hunter, Lead BioCurator
Rob Davidson, Data Scientist
Xiao (Jesse) Si Zhe, Database Developer
Amye Kenall, Journal Development Manager

Contact us:
[email protected]@gigasciencejournal.com
www.gigasciencejournal.com
www.gigadb.org

Follow us:
@GigaScience
facebook.com/GigaScience
blogs.openaccesscentral.com/blogs/gigablog