papermaker, beyondthepdf, rebholzschuhmann, 19jan2011
DESCRIPTION
Presentation on Whatizit, LexEBI, IeXML, CALBC, SESL, PaperMaker
TRANSCRIPT
PaperMaker: Validation of biomedical scientific
publications
January 19th, 2011
Workshop: „BeyondThePdf“
Dietrich Rebholz-Schuhmann, MD, PhD
Group Leader, Rebholz Group
European Bioinformatics Institute
Literature and Text Mining, BioCreative III, Rebholz
Publishing is about …
• ... Agreeing / disagreeing about current science
• Only peer review can judge current science
• ... Bringing new results
• Conceptual results are more difficult than new data
• ... Gaining new knowledge
• New data and new results can imply new knowledge of which even
the author is still unaware
• ... Rewarding the scientist
• Count whatever you can count that could have an impact.
• Validating the scientist’s claim is the key reward.
• Any scientist can fool any system, but (hopefully) only in the short term
20.01.2011
Future of biomedical text mining
Working towards ...
• ... Literature integration
• to make it a fully fledged part of bioinformatics data resources
• ... Cross-domain support
• to deliver the content to different scientific communities.
• ... Provenance
• to carry credit for findings into analytical biomedical research
• ... Inference & Reasoning
• to make use of the full semantic support in the scientific literature
Literature content in the Semantic Web
Terminologies vs. Ontologies
Database type Resource building
Terminologies, collection of terms
Automatic generation
Exploitation of terminological features
Standardisation of TM solutions
Interoperability with database resources
Ontological resources
Explicit semantics
Manual generation
Consistency, inference, reasoning
Interoperability with all semantic resources
Working towards a reasoning infrastructure
Efforts in the Rebholz group towards
interoperability of literature with bioinformatics
• Whatizit infrastructure
• Biomedical NER as a public, large-scale service
• LexEBI / BioLexicon (collab. w. NaCTeM, Pisa-U)
• Biomedical terminological resource, standardisation of semantics
• IeXML (BioLink SIG 2006, Brazil)
• Put the annotations into the document (inline annotations)
• CALBC project
• Collaborative annotation of a large-scale biomedical corpus
• UKPMC: UK PubMed Central (collab. w. NaCTeM, BL)
• Use of Whatizit, BioLexicon, IeXML, CALBC alignments for the delivery of quality annotation services to the public
• SESL project
• Joint project with pharma & publishers, literature content in a triple store
• PaperMaker
• Validation of the scientific literature against the above
1 Whatizit
Integrating biomedical literature and data
Rebholz-Schuhmann, D., et al. Text Processing through Web Services: Calling Whatizit. Bioinformatics 24, no. 2 (2008): 296-98.
2 BioLexicon / LexEBI
LexEBI: content
Group       Resource          # Labels # Variants     Total Total / Labels # Unique terms Uniq. T. / Labels
Gene/Prot.  GP 7.0             516,113  4,005,040 4,521,153           8.76      1,726,853              3.35
Gene/Prot.  GP 6.0             488,577  3,389,316 3,877,893           7.94      1,564,436              3.20
Chemicals   Jochem             278,578  1,691,980 1,970,558           7.07      1,527,752              5.48
Chemicals   ChEBI               19,645     94,748   114,393           5.82        101,307              5.16
Chemicals   ChEBI (all)        549,838  1,187,322 1,737,160           3.16              -                 -
Other       Enzymes              4,905      8,082    12,987           2.65         12,377              2.52
Other       Species            643,280    199,130   842,410           1.31        838,135              1.30
Other       Interpro            20,671          0    20,671           1.00         20,671              1.00
UMLS        Antineuro., Neo      4,718      6,488    11,206           2.38              -                 -
UMLS        Bio. Act.           54,148     87,209   141,357           2.61              -                 -
UMLS        Enzymes             26,065     56,332    82,397           3.16              -                 -
UMLS        Lipid, Carb.        11,518      9,770    21,288           1.85              -                 -
UMLS        Pharm. Act.        104,201    123,840   228,041           2.19              -                 -
UMLS        Vit., Horm.          6,877     10,258    17,135           2.49              -                 -
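The derived columns of the LexEBI table can be recomputed from the raw counts. The sketch below assumes that "Total" is labels plus variants and that both ratios are divided by the number of labels; the GP 7.0 figures are consistent with that reading. The function and key names are illustrative only.

```python
def lexicon_stats(labels, variants, unique_terms=None):
    """Recompute the derived columns for one row of the table."""
    total = labels + variants
    stats = {
        "total": total,
        "total_per_label": round(total / labels, 2),
    }
    # The number of unique terms is not reported for every resource.
    if unique_terms is not None:
        stats["unique_per_label"] = round(unique_terms / labels, 2)
    return stats


# GP 7.0 row: 516,113 labels, 4,005,040 variants, 1,726,853 unique terms.
gp7 = lexicon_stats(516_113, 4_005_040, 1_726_853)
print(gp7)
```

A high "Total / Labels" value, as for the gene and protein entries, indicates that each entry carries many synonyms and spelling variants.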
3 IeXML
IeXML: Annotating entities in text
• Inline annotations: any part of the document can carry
annotations, which stay inside the document
• No hassle with character or byte counts or layout
modifications to the document
• “Alignment” of annotated documents to
• Compare annotations
• Validate annotations
• Harmonise annotations (SESL project)
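The idea of inline annotation and alignment can be illustrated with a small sketch. It assumes a hypothetical `<e id="...">` element for entities inside a sentence element `<s>`; these tag and attribute names are placeholders chosen for the example and are not taken from the IeXML specification. Two annotated versions of the same text are reduced to plain text plus spans, and the spans are then compared.

```python
import xml.etree.ElementTree as ET


def read_inline(xml_text):
    """Return (plain_text, spans); spans are (start, end, id) over the plain text."""
    root = ET.fromstring(xml_text)
    parts, spans = [], []
    pos = 0

    def walk(node):
        nonlocal pos
        start = pos
        if node.text:
            parts.append(node.text)
            pos += len(node.text)
        for child in node:
            walk(child)
            if child.tail:
                parts.append(child.tail)
                pos += len(child.tail)
        if node.tag == "e":  # hypothetical entity element
            spans.append((start, pos, node.get("id")))

    walk(root)
    return "".join(parts), spans


def compare(xml_a, xml_b):
    """Align two annotated versions of one text and compare their annotations."""
    text_a, spans_a = read_inline(xml_a)
    text_b, spans_b = read_inline(xml_b)
    if text_a != text_b:
        raise ValueError("documents differ in their underlying text")
    a, b = set(spans_a), set(spans_b)
    return {"shared": sorted(a & b), "only_a": sorted(a - b), "only_b": sorted(b - a)}


doc_a = ('<s>Mutations in <e id="gene:BRCA1">BRCA1</e> increase '
         '<e id="disease:breast_cancer">breast cancer</e> risk.</s>')
doc_b = '<s>Mutations in <e id="gene:BRCA1">BRCA1</e> increase breast cancer risk.</s>'

text, _ = read_inline(doc_a)
result = compare(doc_a, doc_b)
print(text)
print(result)
```

Because the annotations travel inside the document, the offsets are derived only at comparison time and never have to be stored or maintained by hand.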
4 CALBC
The challenge
150,000 documents
or more ...
Test set for all systems
Assessment, benchmarking
CALBC Challenge II
(1) 75,000 documents training data
(2) 175,000 testing data
(3) Additional 700,000 testing data
• September 13th, 2010: Second harmonized corpus available for CALBC Challenge II
• December 15th, 2010: Challenge II closes
• March 2011: CALBC Workshop II
• June 30th, 2011: Final harmonized corpus available
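One simple way to picture a harmonized corpus is agreement voting across the participating annotation systems. The sketch below is an assumption made for illustration, not a description of the CALBC procedure: the system names, the span format (start, end, type) and the `min_votes` threshold are all hypothetical.

```python
from collections import Counter


def harmonise(annotations_by_system, min_votes=2):
    """Keep the spans that at least `min_votes` systems agree on."""
    votes = Counter()
    for spans in annotations_by_system.values():
        votes.update(set(spans))  # each system votes at most once per span
    return sorted(span for span, count in votes.items() if count >= min_votes)


# Hypothetical output of three systems on the same document.
systems = {
    "system_a": [(0, 5, "gene"), (20, 33, "disease")],
    "system_b": [(0, 5, "gene")],
    "system_c": [(0, 5, "gene"), (20, 33, "disease"), (40, 47, "chemical")],
}
consensus = harmonise(systems)
print(consensus)
```

Raising `min_votes` trades coverage for agreement: fewer annotations survive, but each is backed by more systems.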
5 UKPMC / ELIXIR
UKPMC
~10% of the size of PubMed
6 SESL
SESL Project: from publisher to pharma
[Architecture diagram. Labels: Content Suppliers; Business Rules; Open Standards; Std Public Vocabularies; Service Layer (RDF, Web 2.0); Common Service Broker; Assertions, SPARQL, Triple Store; Integration, Inference, Reasoning; Sharing of data; Knowledge Applications; Disease Dossier; Multiple Consumers]
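Storing literature content as assertions in a triple store can be sketched with a few lines of plain Python. An in-memory set stands in for the triple store and a wildcard match stands in for a SPARQL basic graph pattern; the class, the predicates and all identifiers (`doc:1`, `gene:TP53`, and so on) are made up for the example and do not come from the SESL project.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


class TinyTripleStore:
    """In-memory stand-in for a triple store (illustration only)."""

    def __init__(self):
        self._triples = set()

    def add(self, subject, predicate, obj):
        self._triples.add(Triple(subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return all triples matching the pattern; None acts as a wildcard."""
        return sorted(
            t for t in self._triples
            if subject in (None, t.subject)
            and predicate in (None, t.predicate)
            and obj in (None, t.obj)
        )


store = TinyTripleStore()
store.add("doc:1", "mentions", "gene:TP53")
store.add("doc:1", "mentions", "disease:lung_cancer")
store.add("doc:2", "mentions", "gene:TP53")
store.add("doc:1", "suppliedBy", "publisher:example")

# Which documents mention the gene, and which of those also mention the disease?
tp53_docs = [t.subject for t in store.match(predicate="mentions", obj="gene:TP53")]
both = [d for d in tp53_docs if store.match(d, "mentions", "disease:lung_cancer")]
print(tp53_docs, both)
```

The second query is a join over two patterns, which is the kind of integration step a real triple store would answer with a single SPARQL query.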
Literature content in the Semantic Web
7 PaperMaker
PaperMaker - Overview
• PaperMaker - a tool to support authors writing biomedical
papers:
• Interactive feedback on the contents of papers (related
work and concept annotations)
• Formal consistency criteria checking (spelling,
terminology, acronyms, references)
30.03.2009
Consistency parameters
Domain-independent
• General spelling and grammar
• General readability
• Appropriate use of references
• Finding and acknowledging related work
Consistency parameters
Domain-specific
• The use of terminology:
• Should be consistent with domain-specific naming guidelines
• Should not be ambiguous
• Should conform to the conventional usage (possible clashes
between naming guidelines and common-sense convention)
• Useful to resolve terminology to reference databases (e.g.
UniProt for protein names, ChEBI for chemical entities, etc.)
• The special case of acronyms
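The acronym case lends itself to a small illustration. The sketch below is not PaperMaker's implementation; it assumes the common convention "long form (ACRONYM)", takes as long form the same number of preceding words as the acronym has letters, and flags acronyms that are never defined or are used before their definition. The regular expressions and the function name are illustrative.

```python
import re

DEFINITION = re.compile(r"\(([A-Z]{2,10})\)")  # e.g. "(GO)"
USAGE = re.compile(r"\b[A-Z]{2,10}\b")         # any all-caps token


def check_acronyms(text):
    """Return a list of (acronym, problem) pairs found in `text`."""
    first_definition = {}
    issues = []
    for m in DEFINITION.finditer(text):
        acronym = m.group(1)
        # Heuristic: the long form is the len(acronym) words before "(".
        words = text[:m.start()].split()[-len(acronym):]
        initials = "".join(w[0].upper() for w in words)
        if initials != acronym:
            issues.append((acronym, "long form does not match initials"))
        first_definition.setdefault(acronym, m.start())

    reported = set()
    for m in USAGE.finditer(text):
        acronym = m.group(0)
        if acronym in reported:
            continue
        if acronym not in first_definition:
            problem = "used but never defined"
        elif m.start() < first_definition[acronym]:
            problem = "used before its definition"
        else:
            continue
        reported.add(acronym)
        issues.append((acronym, problem))
    return issues


print(check_acronyms("We used Gene Ontology (GO) terms. NER was run first."))
print(check_acronyms("GO terms were assigned. Gene Ontology (GO) is a vocabulary."))
```

The word-count heuristic is deliberately crude: it fails on hyphenated or mixed-case long forms, which is where a terminological resource would have to take over.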
Content feedback
• Resolving the contents to literature repositories
• Finding related work (document retrieval)
• Finding related ideas (passage retrieval)
• Resolving the contents to ontological reference databases
• MeSH descriptors have been demonstrated to improve biomedical information retrieval. Can we suggest MeSH terms directly to the authors?
• Gene Ontology (GO) terms are increasingly used in information extraction systems.
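Finding related work by document retrieval can be sketched as ranking a collection of abstracts against the draft. The example uses TF-IDF weights with cosine similarity, a standard retrieval technique chosen here as an assumption; the slides do not state which retrieval method PaperMaker uses. The document identifiers and texts are invented.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_related(draft, corpus):
    """Rank corpus documents by TF-IDF cosine similarity to the draft."""
    docs = {doc_id: Counter(tokenize(text)) for doc_id, text in corpus.items()}
    n = len(docs)
    # Document frequency: in how many corpus documents each term occurs.
    df = Counter(term for counts in docs.values() for term in counts)

    def weights(counts):
        # Smoothed inverse document frequency, so unseen terms are still defined.
        return {t: tf * (math.log((1 + n) / (1 + df[t])) + 1) for t, tf in counts.items()}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
        return dot / norm if norm else 0.0

    query = weights(Counter(tokenize(draft)))
    scored = [(cosine(query, weights(counts)), doc_id) for doc_id, counts in docs.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True) if score > 0]


# Hypothetical mini-corpus of abstracts.
corpus = {
    "pmid:A": "Named entity recognition of gene and protein names in biomedical text",
    "pmid:B": "Crystal structure of a bacterial membrane transporter",
    "pmid:C": "Protein name normalisation against UniProt",
}
related = rank_related("We evaluate gene name recognition in biomedical abstracts", corpus)
print(related)
```

Passage retrieval follows the same pattern with paragraphs or sentences as the retrieval unit instead of whole documents.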
PaperMaker workflow
Conclusions
• PaperMaker can help the author conform to the formal
requirements of paper writing with special emphasis on
the domain
• It also provides feedback on the contents by relating it to
reference resources and literature repositories
• It may improve the indexing of a paper in literature
repositories (less ambiguous terminology)
• http://www.ebi.ac.uk/Rebholz-srv/PaperMaker
Work in progress
8 Summary
Efforts in the Rebholz group towards
interoperability of literature with bioinformatics
• Whatizit infrastructure
• Biomedical NER as a public, large-scale service
• LexEBI / BioLexicon (collab. w. NaCTeM, Pisa-U)
• Biomedical terminological resource, standardisation of semantics
• IeXML (BioLink SIG 2006, Brazil)
• Put the annotations into the document (inline annotations)
• CALBC project
• Collaborative annotation of a large-scale biomedical corpus
• UKPMC: UK PubMed Central (collab. w. NaCTeM, BL)
• Use of Whatizit, BioLexicon, IeXML, CALBC alignments for the delivery of quality annotation services to the public
• SESL project
• Joint project with pharma & publishers, literature content in a triple store
• PaperMaker
• Validation of the scientific literature against the above