Data Publication
COASP 2012
Publications
26 million abstracts
2.2 million full text articles
Citation networks
Database links
Text-mining
Timeline: 2006, 2011, 2012, 2016?
Europe PubMed Central
How many open access articles in UKPMC?
PubMed (995K)
UKPMC (18%,182K)
OA (9.6%, 96K)
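The percentages on this slide can be checked with a few lines of arithmetic. The sketch below assumes both percentages are taken against the 995K PubMed figure; the open access share of UKPMC itself is not stated on the slide and is derived here only for illustration.

```python
# Counts read off the slide, in thousands of articles.
pubmed = 995       # PubMed
ukpmc = 182        # full text in UKPMC
open_access = 96   # open access (OA)


def share(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


ukpmc_of_pubmed = share(ukpmc, pubmed)       # slide quotes 18%
oa_of_pubmed = share(open_access, pubmed)    # slide quotes 9.6%
oa_of_ukpmc = share(open_access, ukpmc)      # derived, not on the slide

print(f"UKPMC / PubMed: {ukpmc_of_pubmed:.1f}%")
print(f"OA / PubMed:    {oa_of_pubmed:.1f}%")
print(f"OA / UKPMC:     {oa_of_ukpmc:.1f}%")
```

On these figures, roughly half of the UKPMC full-text articles counted here are open access.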
Managing the public data ecosystem
[Diagram: Big Data: Deposition (Primary); Research articles; Big Data: Curated (Annotation); Unstructured Data; links numbered 1, 2, 3]
Literature citation from data(data annotation)
Links from Literature to Databases
• Proteins
• Nucleotides
• OMIM
• Chemicals
• Structure
• Clinical reviews
• Protein families
• Protein-protein interactions
• Gene expression experiments
800 K
370 K
110 K
Database crosslinks
Bibliography from P25106
Data citation from literature(provenance)
Semantic Type    Unique Terms    Articles     Annotations
Accession No.    233,017         66,356       387,787
Chemical         76,712          1,694,385    83,923,066
Disease          171,692         1,768,214    57,821,871
Gene/Protein     227,318         1,310,382    77,189,022
GO Terms         32,664          1,832,294    65,061,579
Organism         180,637         1,713,280    70,832,222
Text Mining in UKPMC (2.2 million articles)
Accession numbers stories: data citation in OA articles
Senay Kafkas Jee-Hyub Kim
[Two bar charts, axis 0 to 100. First panel categories: gen, pdb, sprot, genpept, geo, omim, pir, emblalign, pubchem, pmc. Second panel categories: gen, pdb, sprot, arrayexpress, pfam, interpro.]
publisher-annotated text-mined
Annotation of accession numbers (OA)
~10,000 articles >25,000 articles
• Névéol A, Wilbur WJ, Lu Z (2012) Improving links between literature and biological data with text mining: a case study with GEO, PDB and MEDLINE. Database 2012:bas026 (PMC3371192)
• Névéol A, Wilbur WJ, Lu Z (2011) Extraction of data deposition statements from the literature: a method for automatically tracking research results. Bioinformatics 27, 3306-3312 (PMC3223368)
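Text-mined accession annotation starts with spotting candidate identifiers in running text. The following is a minimal sketch, not the UKPMC pipeline (which these slides do not describe): it assumes only the classic GenBank nucleotide formats (one letter plus five digits, or two letters plus six digits), and the function name and sample sentence are made up for illustration.

```python
import re

# Assumed pattern: classic GenBank nucleotide accessions only
# (1 letter + 5 digits, or 2 letters + 6 digits), with an optional
# ".version" suffix that is matched but not returned.
ACCESSION = re.compile(r"\b([A-Z]\d{5}|[A-Z]{2}\d{6})(?:\.\d+)?\b")


def find_accessions(text):
    """Return candidate accessions in order of first appearance, no duplicates."""
    seen = []
    for match in ACCESSION.finditer(text):
        acc = match.group(1)
        if acc not in seen:
            seen.append(acc)
    return seen


# Illustrative sentence; U12345.1 is an invented example identifier.
sample = ("The sequence was deposited in GenBank under accession "
          "number AY387398. AY387398 was aligned against U12345.1.")
candidates = find_accessions(sample)
print(candidates)
```

Format-only matching of this kind produces false positives (grant numbers, catalogue codes), so a real system would also need context cues such as the database name or deposition wording near the match.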
bmc genomics
bmc evolutionary biology
the journal of cell biology
virology journal
bmc microbiology
the journal of experimental medicine
bmc bioinformatics
bmc plant biology
the journal of biological chemistry
bmc molecular biology
• plos one
acta crystallographica section e:
british journal of cancer
the journal of cell biology
environmental health perspectives
• nucleic acids research
the journal of experimental medicine
critical care
• emerging infectious diseases
bmc bioinformatics
• plos one
• nucleic acids research
bmc genomics
bmc evolutionary biology
the journal of cell biology
plos pathogens
bmc bioinformatics
virology journal
bmc microbiology
• emerging infectious diseases
Most publisher tags | Most articles | Most text-mined tags
BMC Genomics: 1,484 TM tags*, 4,337 articles
PLoS One: 4,226 TM tags*, 42,888 articles
Efficacy of Accession number tagging (OA)
Scientific:
Linking articles that cite the same data
Citation:
Data Citation as measure of impact (Thomson: Data citation index)
Context of data citation: submission, reuse, analysis
Operational:
Services for publishers to improve Accession number tagging
Editorial policies and adherence
Extension of NLM DTD
Lessons learned for considering unstructured data
Why is this important? Implications
That we can perform this analysis at all highlights a benefit of Open Access
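"Linking articles that cite the same data" can be pictured as an inverted index from accession number to citing articles. The sketch below is an illustration under stated assumptions, not the UKPMC implementation: the article IDs, the input mapping, and the function name are hypothetical, and "1ABC" and "9XYZ" are placeholder identifiers.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical input: article ID -> accession numbers found in that article.
article_accessions = {
    "PMC-A": {"AY387398", "P25106"},
    "PMC-B": {"AY387398"},
    "PMC-C": {"P25106", "1ABC"},
    "PMC-D": {"9XYZ"},
}


def link_by_shared_data(article_accessions):
    """Map each pair of articles to the accessions they both cite."""
    # Step 1: invert the mapping (accession -> articles that cite it).
    by_accession = defaultdict(set)
    for article, accessions in article_accessions.items():
        for acc in accessions:
            by_accession[acc].add(article)

    # Step 2: every pair of articles under one accession shares that data record.
    links = defaultdict(set)
    for acc, articles in by_accession.items():
        for a, b in combinations(sorted(articles), 2):
            links[(a, b)].add(acc)
    return dict(links)


links = link_by_shared_data(article_accessions)
for pair, shared in sorted(links.items()):
    print(pair, sorted(shared))
```

Articles that cite no shared record, like the hypothetical PMC-D, simply produce no links.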
AY387398: needle in a haystack
Unstructured data
Articles with supplemental data (UKPMC)
• 235,000 articles (50K+ in 2011)
• 718,511 files
• 459 extensions
• 0.8 TB (1200 CDs)
• (However most data in ~60 extension types)
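The supplemental data figures give a feel for scale when turned into averages. This is a rough sketch that assumes decimal units (1 TB = 10^12 bytes) and takes the slide's numbers at face value; the averages are derived here and do not appear on the slide.

```python
# Figures from the slide.
articles = 235_000        # articles with supplemental data
files = 718_511           # supplemental files
total_bytes = 0.8e12      # 0.8 TB, assuming 1 TB = 10**12 bytes
cds = 1200                # the slide's CD comparison

files_per_article = files / articles
mb_per_file = total_bytes / files / 1e6
mb_per_cd = total_bytes / cds / 1e6

print(f"Files per article:  {files_per_article:.2f}")
print(f"Average file size:  {mb_per_file:.2f} MB")
print(f"Implied CD size:    {mb_per_cd:.0f} MB")
```

The implied CD size of roughly 670 MB is in line with a standard CD-ROM, so the "1200 CDs" comparison is consistent with the 0.8 TB total.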
[Chart: % by publication year]
Managing the public data ecosystem
[Diagram: Big Data: Deposition (Primary); Research articles; Big Data: Curated (Annotation); Structured links; Unstructured Data; arrows labelled reuse, analysis, provenance]
• Open
• Citable
• Discoverable
• Reusable
People
• Paula Buttery
• Andrew Caines
• Norman Cobley
• Yuci Gou
• Senay Kafkas
• Jyothi Katuri
• Oliver Kilian
• Jee-Hyub Kim
• Nikos Marinos
• Jo McEntyre
• Xingjun Pi
• Philip Rossiter
• Rebholz Group
• Peter Stoehr
• University of Manchester• British Library
• OpenAIRE/OpenAIRE Plus
• NCBI, NLM