Laurie Goodman at NDIC: Big Data Publishing, Handling & Reuse
TRANSCRIPT
Big Data Publishing, Handling, & Reuse
Laurie Goodman, PhD, Editor-in-Chief, GigaScience
[email protected] | ID: 0000-0001-9724-5976
Beyond Data Release Mandates
What is the point of publishing?
• To disseminate information, knowledge, and ideas.
• To present material so it can be reasonably assessed for its level of quality (and interest).
• To gain credit for career advancement.
What goes into a research article?
+ Area of Interest/Question
Data & Metadata Collection
Analysis/Hypothesis/Analysis
Conclusions
Scientific Communication Via Publication
• “Scholarly articles are merely advertisement of scholarship. The actual scholarly artefacts, i.e. the data and computational methods, which support the scholarship, remain largely inaccessible.” (Jon B. Buckheit and David L. Donoho, WaveLab and Reproducible Research, 1995)
• Core scientific statements or assertions are intertwined and hidden in the conventional scholarly narratives
• Lack of transparency, lack of credit for anything other than “regular” dead tree publication
Kahn, Goodman, & Mittleman (2014). Dragging Scientific Publishing into the 21st Century. http://genomebiology.com/2014/15/12/556
From Journal Delivery to PDF Delivery
Lack of Data and Software Availability Impacts Reproducibility
1. Ioannidis et al. (2009). Repeatability of published microarray gene expression analyses. Nature Genetics 41: 142. 2. Ioannidis JPA (2005). Why Most Published Research Findings Are False. PLoS Med 2(8).
Out of 18 microarray papers, results from 10 could not be reproduced.
Retractions are on the Rise: >15X increase in the last decade
1. Science publishing: The trouble with retractions. http://www.nature.com/news/2011/111005/full/478026a.html
2. Retracted Science and the Retraction Index. http://iai.asm.org/content/79/10/3855.abstract
Deconstructing a paper into accessible, useable, trackable, interlinked units
Need to provide credit to reward sharing and proper organization of:
• Narrative
• Data/Metadata availability/curation
• Software availability
• Interoperability
• Availability of workflows
• Transparent analyses
[Diagram: the paper deconstructed into Data/MetaData, Software, Methods, and Narrative units]
Deconstructing a paper into accessible, useable, trackable, interlinked units
Currently we provide credit for this:
• Narrative
• Data/Metadata availability/curation
• Software availability
• Interoperability
• Availability of workflows
• Transparent analyses
Sometimes we publish these as Methods Papers
Beyond the Narrative: Data and Tools
Promoting Data Release
Data Citation
But, publishing ‘Data’ is “Salami Slicing”!!
What is Salami Slicing?
• Publishing research in several different papers that should form a single cohesive paper.
Why is it ‘unethical’?
• It fragments the scientific literature, wasting researchers’ time as they try to gather all the information related to a very specific topic/dataset/method.
• It can give the appearance (given there are multiple publications) that there is large support for a particular hypothesis.
• It pads a researcher’s publication record unfairly.
Publishing ‘Data’ is “Salami Slicing”! Baloney.
1. Those guidelines were developed prior to the year 2000:
• More than 15 years ago, at a time when the data set sizes and data types collected in the life sciences by a single research group were relatively small and primarily suitable for a single or narrow range of disciplines or hypotheses.
• Most journals were not online (which allows easier identification of, and access to, closely related articles) until the late ‘90s.
2. In 2005, COPE* ruled that a paper whose data had been used and described, at least in part, in a previous publication was not unethical. *Committee on Publication Ethics. http://www.publicationethics.org/case/salami-publication
3. Data collection can be (should be!!) a scholarly pursuit:
• Data that are broadly reusable require care, thought, training, time, and money to be properly collected, curated, stored, and shared.
Contrary to popular belief…
There are very few (if any) ‘push-a-button-and-get-it’ reusable data resources.
You’re not supposed to just collect samples! *Collect ALL available metadata*
Help Develop a Digital Data Curation Team at your Institution’s Library (they may already have one…)
Back to Darwin
[Diagram: Area of Interest/Question → Data & Metadata Collection/Experiments → Analysis/Hypothesis/Analysis → Conclusions; 1839 → 1859: 20 yrs.]
Say… was this a Data Publication?
The most curious fact is the perfect gradation in the size of the beaks in the different species of Geospiza, from one as large as that of a hawfinch to that of a chaffinch, and (if
Mr. Gould is right in including his sub-group, Certhidea, in the main group) even to that of a warbler. The largest beak in the genus Geospiza is shown in Fig. 1, and the smallest in Fig. 3; but instead of there being only one intermediate species, with a beak of the size shown in Fig. 2, there are no less than six species with insensibly graduated beaks. (The Voyage of the Beagle, Chapter 17)
DataCite and DOIs
• Aims to “increase acceptance of research data as legitimate, citable contributions to the scholarly record”.
• “data generated in the course of research are just as valuable to the ongoing academic discourse as papers and monographs”.
Citing Data Isn’t New
The Physical Sciences have been doing this for a while…
What we’re doing: Mandating and Aiding Data Release
Requiring all data supporting the work to be freely available in a publicly available repository.
– How we’re helping to do this:
• Journal-dedicated data and software repository, GigaDB, that hosts ALL data types.
• Have biocurators to aid in handling metadata.
• All datasets are provided a Digital Object Identifier (DOI), making them citable and countable.
• All material in GigaDB is available under a CC0 waiver.
• Data types with a community-approved database must be submitted there as well.
• Provide direct links to all associated information.
What we’re doing: Mandating and Aiding Software Release
Requiring all software and workflows to be freely available in a publicly available repository.
– How we’re promoting this:
• All software created by authors must be 100% OSI-compliant.
• Journal-dedicated repository GigaDB hosts software so it can be downloaded.
• Software and workflows are provided a DOI, making them citable and countable (reward).
• Journal-dedicated Galaxy platform to run tools.
• Have a Data Manager and Data Scientist to wrap and deploy software tools.
• Have our own GitHub repository.
How we view publishing at GigaScience:
• Paper in GigaScience (Narrative + Methods): open-access journal
• Data sets in GigaDB (data publishing platform), linked to the paper
• Analyses/Workflows in GigaGalaxy (data computation and analysis platform), linked to the paper
Making the Data Itself Citable
We provide a linked journal database. This links the data directly to our papers to ease reproducibility, makes the data available at the time of review, and provides authors a place to submit data that has no sustainable ‘home’.
Note: there are many community-available databases, so in principle any journal can do this by taking advantage of such resources.
These include the usual suspects: EBI, NCBI, DDBJ, etc. Databases that take all data types and provide data DOIs: Dryad, FigShare, etc. There are also numerous smaller community databases specific to different fields or data types.
Some of the Journals Currently Doing Data Publication
http://proj.badc.rl.ac.uk/preparde/blog/DataJournalsList
Citing Data in the References Allows Tracking
This rewards authors for making data available AND makes it easier to find
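Citing a dataset in the reference list works much like citing an article: DataCite recommends the form Creator (PublicationYear): Title. Publisher. Identifier. As a minimal sketch of assembling such a tracking-friendly citation string (the author names here are illustrative placeholders, not taken from the talk; the DOI and title are the GigaDB macaque example mentioned later):

```python
def format_data_citation(creators, year, title, publisher, doi):
    """Build a citation string in the DataCite-recommended style:
    Creator (PublicationYear): Title. Publisher. Identifier."""
    authors = "; ".join(creators)
    return f"{authors} ({year}): {title}. {publisher}. https://doi.org/{doi}"

# Illustrative example using the GigaDB dataset DOI from this talk;
# the creator names are hypothetical.
citation = format_data_citation(
    creators=["Smith J", "Jones A"],
    year=2011,
    title=("Genomic data from the crab-eating macaque/cynomolgus monkey "
           "(Macaca fascicularis)"),
    publisher="GigaScience Database",
    doi="10.5524/100003",
)
print(citation)
```

Because the DOI appears in the reference list in resolvable form, citation-indexing services can count dataset citations exactly as they count article citations.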
But is this being done? Yes.
Is cited data being tracked? Yes.
Improving Quality as Well as Availability
How Hard is Data and Software Review?
Not really that much harder than narrative review.
The GigaDB submission workflow:
1. Submission: the submitter logs in to the GigaDB website and uploads an Excel submission or uses the online wizard.
2. Validation checks: Fail – the submitter is provided an error report; Pass – the dataset is uploaded to GigaDB.
3. Curator review: the curator contacts the submitter with the DOI citation and to arrange file transfer (and to resolve any other questions/issues).
4. Files: the submitter provides files by ftp or Aspera.
5. DOI assigned: a DataCite XML file is generated and registered with DataCite.
6. Public GigaDB dataset: the curator makes the dataset public (can be set to a future date if required).
Example: DOI 10.5524/100003, Genomic data from the crab-eating macaque/cynomolgus monkey (Macaca fascicularis) (2011).
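The DataCite registration step amounts to generating an XML metadata record from the curated submission. The sketch below is a simplified illustration only: the real DataCite kernel schema is versioned, namespaced, and has more mandatory fields (creators, resourceType, etc.) than shown here.

```python
import xml.etree.ElementTree as ET

def build_datacite_record(doi, title, publisher, year):
    """Build a minimal DataCite-style XML record from curated metadata.
    Simplified sketch: omits the schema namespace and several fields
    the real DataCite kernel requires."""
    resource = ET.Element("resource")
    identifier = ET.SubElement(resource, "identifier", identifierType="DOI")
    identifier.text = doi
    titles = ET.SubElement(resource, "titles")
    ET.SubElement(titles, "title").text = title
    ET.SubElement(resource, "publisher").text = publisher
    ET.SubElement(resource, "publicationYear").text = str(year)
    return ET.tostring(resource, encoding="unicode")

xml_record = build_datacite_record(
    doi="10.5524/100003",
    title=("Genomic data from the crab-eating macaque/cynomolgus monkey "
           "(Macaca fascicularis)"),
    publisher="GigaScience Database",
    year=2011,
)
print(xml_record)
```

The generated record is what gets registered with DataCite so the DOI resolves and the dataset becomes citable and countable.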
Data must be available for review with the manuscript (and at the very least get a sanity check…)
Reviewing Data in More Detail
Issue: We can’t ask our reviewers to do that!
Our finding: Reviewers don’t mind.
Reviewer Dr. Christophe Pouzat on a neuroscience manuscript: “In addition to making the presented research trustworthy, the reproducible research paradigm definitely makes the reviewers job more fun!”
Can also use specific Data Reviewers (we have).
Reviewing Data AND Software: the SOAPdenovo2 example
• Open-Code: code in SourceForge under GPLv3 (http://soapdenovo2.sourceforge.net/), >5000 downloads; the code was picked apart by bloggers in a wiki (http://homolog.us/wiki/index.php?title=SOAPdenovo2).
• Open-Paper, Open-Review: DOI:10.1186/2047-217X-1-18, >35,000 accesses; 8 reviewers tested the data on an ftp server, and their named reports were published.
• Open-Pipelines/Open-Workflows: DOI:10.5524/100044
• Open-Data: DOI:10.5524/100038, 78 GB of CC0 data
Data sets and analyses are linked to the paper, each with its own DOI.
8 Reviewers! Holy cow, that must have taken forever!!
Submission: July 24. Final review: Aug 28.
These were reviewing teams from different labs, assessing the materials at multiple levels.
Is this really worth the effort?
Beyond Reproducibility: REUSE
Data Availability and Tools
http://blogs.biomedcentral.com/gigablog/2014/05/14/the-latest-weapon-in-publishing-data-the-polar-bear/
These data were released THREE YEARS before publication of the analysis article.
The polar bear DATA were released, prepublication, in 2011. They were used and cited in the following studies before the main paper on the sequencing was published:
Hailer, F et al., Nuclear genomic sequences reveal that polar bears are an old and distinct bear lineage. Science. 2012 Apr 20;336(6079):344-7. doi:10.1126/science.1216424.
Cahill, JA et al., Genomic evidence for island population conversion resolves conflicting theories of polar bear evolution. PLoS Genet. 2013;9(3):e1003345. doi:10.1371/journal.pgen.1003345.
Morgan, CC et al., Heterogeneous models place the root of the placental mammal phylogeny. Mol Biol Evol. 2013 Sep;30(9):2145-56. doi:10.1093/molbev/mst117.
Cronin, MA et al., Molecular Phylogeny and SNP Variation of Polar Bears (Ursus maritimus), Brown Bears (U. arctos), and Black Bears (U. americanus) Derived from Genome Sequences. J Hered. 2014; 105(3):312-23. doi:10.1093/jhered/est133.
Bidon, T et al., Brown and Polar Bear Y Chromosomes Reveal Extensive Male-Biased Gene Flow within Brother Lineages. Mol Biol Evol. 2014 Apr 4. doi:10.1093/molbev/msu109.
Even though the data had been released over two years earlier and cited in other papers, the main analysis paper was published in Cell.
Cell Press journals had indicated that publishing a dataset prior to publication could be considered prior publication.
• New sequencing technology: MinION, Oxford Nanopore
• New sequence data type: EBI and NCBI databases were not ready
• High community interest in testing the data
• >100 GB of data
Real-time use during the publication process:
• Uploaded prior to publication
• Deployed on Amazon CloudFront
• Ongoing testing/comparison/information sharing prior to publication
• When ready for the data, EBI used our cloud to upload it
• EBI transferred the data to NCBI when they were ready
Getting past…
…look but don't touch
Reproduce and Reuse Needs Much More
• Data: GigaDB
• Software: GitHub
• Workflows: Galaxy, executable documents, VMs
• Images: OMERO
• Cloud storage, tools, and compute power… needed so this can reach the smaller labs
github.com/gigascience/gigadb-cogini
More journals have introduced, or are starting to introduce, these and other tools. More is needed…
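Reuse at this scale also means recipients must be able to confirm that a multi-gigabyte transfer (like the 78 GB SOAPdenovo2 dataset) arrived intact. A generic sketch, not GigaScience's actual tooling, of checking files against the kind of MD5 manifest data repositories commonly ship alongside their files:

```python
import hashlib

def md5_of(data: bytes) -> str:
    """Hex MD5 digest of a byte string."""
    return hashlib.md5(data).hexdigest()

def verify_against_manifest(files: dict, manifest: dict) -> list:
    """Return the names of files whose checksum does not match the
    manifest (missing manifest entries also count as mismatches)."""
    return [name for name, data in files.items()
            if md5_of(data) != manifest.get(name)]

# Toy example with in-memory "files"; real use would stream from disk.
files = {"reads.fastq": b"hello"}
manifest = {"reads.fastq": "5d41402abc4b2a76b9719d911017c592"}  # md5(b"hello")
mismatches = verify_against_manifest(files, manifest)
print(mismatches)  # an empty list means every file verified
```

A streaming variant (hashing in chunks) would be needed for files that do not fit in memory, but the manifest-comparison logic is the same.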
Currently… it feels like this…
Well… …because it is like this.
If we want to move forward, we need to go through that to reach this. It will require researchers, institutions, publishers, and funders working together.
Thanks to:
Scott Edmunds, Executive Editor
Nicole Nogoy, Commissioning Editor
Peter Li, Lead Data Manager
Chris Hunter, Lead BioCurator
Rob Davidson, Data Scientist
Xiao (Jesse) Si Zhe, Database Developer
Amye Kenall, Journal Development Manager
Contact us: [email protected] | …@gigasciencejournal.com
Follow us: @GigaScience | facebook.com/GigaScience | blogs.openaccesscentral.com/blogs/gigablog
www.gigasciencejournal.com | www.gigadb.org