An Open Access publisher’s perspective on data publishing
Matthew Cockerill
Managing Director, BioMed Central
Dryad-UK meeting
HEFCE, London, 28 April 2010
About BioMed Central
Largest publisher of peer-reviewed open access research journals
Launched first open access journals in 2000
Part of Springer since October 2008
Now publishes 207 OA titles
~70,000 peer-reviewed OA articles published
All research articles Creative Commons licensed
Costs covered by 'article processing charge' (APC)
Data is a first class citizen in BioMed Central publications
Electronic version of the article is authoritative
“Additional files”, not “Supplementary material”
Additional files can be central to the reported findings of the paper
Where possible, the file is presented in a convenient embedded form (movies, chemical structures, KML etc.) while also remaining downloadable
“Mini-websites” provide a generic (too generic?) approach for the presentation of complex data
Efficient online publication processes can facilitate dataset publication
Only a fraction of experimental data sets make it into the literature
Many more datasets have the potential to be useful, but do not warrant a traditional publication
For certain standard types of data, appropriate databases exist (e.g. nucleotide sequences)
But what if such databases do not exist, or further description of the experimental context is required?
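To make the "appropriate databases" point concrete: where a standard repository such as GenBank exists, data deposited there can be retrieved programmatically from an accession number cited in the article. A minimal sketch using Biopython's Entrez interface follows; the accession number is purely illustrative and not tied to any BioMed Central article.

```python
# Minimal sketch: retrieving a deposited nucleotide sequence from GenBank
# via NCBI Entrez, using Biopython. The accession number is illustrative.
from Bio import Entrez, SeqIO

Entrez.email = "you@example.org"  # NCBI asks for a contact address

# Fetch the record cited in a paper's data-availability statement.
handle = Entrez.efetch(db="nucleotide", id="NM_000518",
                       rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()

print(record.id, record.description)
print(len(record.seq), "bp")
```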
Plans to extend reusability of data
BioMed Central aims to provide more explicit guidelines to facilitate data reuse, both generic and specific to particular disciplines and formats
Making authors' original vector-based figure files available expands their potential for reuse.
A similar possibility exists for data: make any table of data from within an article conveniently downloadable in spreadsheet form
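A minimal sketch of how such a download could be generated, assuming the article is available as plain HTML and using pandas; the URL is a placeholder, not a real BioMed Central endpoint.

```python
# Minimal sketch: extract the tables from an article's HTML and save each
# one as a spreadsheet-friendly CSV file. The URL is a placeholder.
import pandas as pd

article_url = "https://example.org/article/12345"  # placeholder
tables = pd.read_html(article_url)  # parses every <table> on the page

for i, table in enumerate(tables, start=1):
    table.to_csv(f"table_{i}.csv", index=False)
    print(f"Saved table {i}: {table.shape[0]} rows x {table.shape[1]} columns")
```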
Scientific cloud computing
Bioinformaticists have been rapid adopters of cloud computing (as they were of the web)
Cloud computing can reduce the barriers to reproducibility
Publications can include or refer to necessary datasets and the computational tools that can be fired up to carry out/reproduce the analysis
Large datasets can live in the cloud – take the analysis to the data, rather than vice versa
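A minimal sketch of taking the analysis to the data, assuming the dataset sits in a publicly readable S3 bucket and using boto3; the bucket and object names are invented.

```python
# Minimal sketch: run analysis next to data held in cloud object storage
# rather than downloading it to a local workstation.
# Bucket and key names are invented; assumes a publicly readable object.
from io import BytesIO

import boto3
import pandas as pd
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous access is enough for a public bucket.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
obj = s3.get_object(Bucket="example-trial-data", Key="expression/counts.csv")

# Load the object directly into memory on a compute node in the same
# region as the data, avoiding a large transfer out of the cloud.
counts = pd.read_csv(BytesIO(obj["Body"].read()))
print(counts.describe())
```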
Preservation
Publishers not best placed to run repositories for long term preservation of large datasets
Mirrors of publisher content not able to accept arbitrary amounts of additional data
Long term preservation presents a challenge with respect to continuity
Redundant international mirrors with independent governance and funding could help to reduce risk
Huge culture variation between disciplines
Value is maximized if everyone shares data
But cultural norms vary heavily by discipline
Prisoner's dilemma – if no one else is sharing their data, you have little to gain, and much to lose, by sharing your own
Funders are theoretically well placed to enforce norms for sharing data
But effectiveness of funder data sharing policies is questionable
Data sharing in medicine
Clinical trial data is one example of data which presents challenges re: privacy and consent
Perfect anonymization often impossible - certainly not without losing key aspects of the data (the sketch at the end of this section illustrates the trade-off)
Increasing collection of genomic data in trials accentuates this issue
Trial consent should include info re: limits of anonymizability
Full access to underlying data set could be made available for approved research purposes
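A minimal sketch of the anonymization trade-off mentioned above; the column names and values are invented. Dropping identifiers and generalizing quasi-identifiers reduces re-identification risk, but the coarsened data can no longer support analyses that need exact ages or locations.

```python
# Minimal sketch (invented data): anonymizing trial records by dropping
# identifiers and generalizing quasi-identifiers, at the cost of detail.
import pandas as pd

records = pd.DataFrame({
    "participant_id": [101, 102, 103, 104],
    "date_of_birth": ["1961-03-14", "1958-07-02", "1990-11-30", "1987-05-21"],
    "postcode": ["SW1A 1AA", "SW1A 2BB", "M1 3CC", "M1 4DD"],
    "systolic_bp": [148, 139, 121, 117],
})

anonymized = records.drop(columns=["participant_id"])  # direct identifier

# Generalize quasi-identifiers: exact birth date -> birth decade,
# full postcode -> outward code only. Re-identification risk falls,
# but so does the ability to study fine-grained age or local effects.
anonymized["birth_decade"] = (
    pd.to_datetime(anonymized["date_of_birth"]).dt.year // 10 * 10
)
anonymized["postcode_district"] = anonymized["postcode"].str.split().str[0]
anonymized = anonymized.drop(columns=["date_of_birth", "postcode"])

print(anonymized)
```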