Ontology Engineering approaches based on semi-automated curation of the primary literature

Gully APC Burns, Tommy Ingulfsen, Donghui Feng and Ed Hovy
Biomedical Knowledge Engineering Group, Information Sciences Institute, University of Southern California


Post on 20-Mar-2016

Page 1

Ontology Engineering approaches based on semi-automated curation of the primary literature

Gully APC Burns, Tommy Ingulfsen, Donghui Feng and Ed Hovy
Biomedical Knowledge Engineering Group, Information Sciences Institute, University of Southern California

Page 2

Where’s all the knowledge?

Image taken from U.S. Geological Survey Energy Resource Surveys Program

The primary research literature...
• is the end-product of all scientific research
• forms the basis for human understanding of the subject
• is written in natural language
• is structured
• is interpretable
• is expensive
• is terse

Page 3

Precision and imprecision in biological representation

[Diagram: three levels of representation, running from imprecise to precise]

• Conceptual model (high-level concepts; imprecise): ‘stress’, ‘energy balance’, ‘homeostasis’, ‘glucoprivation’
• Assay: define model system (independent variables): 2-deoxyglucose (2DG) administered intravenously to rats; look for activation in ‘stress-responsive’ neurons
• Experiment: perform measurements (dependent variables; precise): MAP-K and pERK activate in neurons in PVH, BST and CEAl

Page 4

Partitioning the literature

Page 5

The problem with knowledge: an over-abundance of data

Page 6

Corpus Preparation for Natural Language Processing

The Journal of Comparative Neurology is the foremost international journal for neuroanatomy. We downloaded ~12,000 PDFs in total from 1970-2005.

We preprocessed papers with consistent formatting from vols. 204-490 (1982-2005), providing a corpus of 9,474 PDF files. This corpus contains 99,094,318 words.

Page 7

Active Learning / Information Extraction Methodology

Page 8

The logical structure of a tract-tracing experiment

• Tracer Chemical [1]
• Injection Site [1]
   • Location: brain structure, topography, side
• Labeled region [1...*]
   • Location: brain structure, topography, ipsi-contra relative to injection site?
   • Label type: ‘anterograde’ or ‘retrograde’
   • Label density
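The structure on this slide can be sketched as a small set of record types. This is only an illustrative reading of the diagram, not the authors' schema: the class and field names are invented here, the attachment of ‘anterograde’/‘retrograde’ to the label type is an assumption, and the example values at the bottom are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class LabelType(Enum):
    """Assumed values for 'Label type', taken from the slide's two quoted terms."""
    ANTEROGRADE = "anterograde"
    RETROGRADE = "retrograde"


@dataclass
class Location:
    brain_structure: str
    topography: Optional[str] = None
    # Injection site: side of the brain.
    # Labeled region: ipsi/contra relative to the injection site.
    side: Optional[str] = None


@dataclass
class LabeledRegion:
    location: Location
    label_type: LabelType
    label_density: Optional[str] = None


@dataclass
class TractTracingExperiment:
    tracer_chemical: str                      # cardinality [1]
    injection_site: Location                  # cardinality [1]
    labeled_regions: List[LabeledRegion] = field(default_factory=list)

    def __post_init__(self) -> None:
        # The slide gives cardinality [1...*] for labeled regions.
        if not self.labeled_regions:
            raise ValueError("at least one labeled region is required")


# Hypothetical record, for illustration only (not taken from any paper).
example = TractTracingExperiment(
    tracer_chemical="example-tracer",
    injection_site=Location("PVH", side="left"),
    labeled_regions=[
        LabeledRegion(
            location=Location("BST", side="ipsilateral"),
            label_type=LabelType.ANTEROGRADE,
            label_density="moderate",
        )
    ],
)
print(example.tracer_chemical, len(example.labeled_regions))
```

Making the cardinalities explicit in the types gives an extraction system a concrete target: each annotated field in the text fills one slot of such a record.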

Page 9

Annotated XML Example from Albanese & Minciacchi, 1983, JCN 216:406-420

[Figure: annotated XML excerpt. Legend labels as extracted: expt. label, delineation, injection, labeling description]

Page 10

Recall, Precision and F-Score
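The formulas on this slide did not survive extraction. As a reminder, and assuming the standard definitions with the balanced F-score, they can be written in terms of true positives (TP), false positives (FP) and false negatives (FN):

```latex
\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\]
```

The balanced form of F is an assumption, but it agrees with the Precision, Recall and F-Score values reported in the field labeling results to within rounding.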

Page 11

Field Labeling Results: overall label level

System Features                                          Precision  Recall  F-Score
Baseline                                                 0.3926     0.1673  0.2346
Lexicon                                                  0.5689     0.3771  0.4536
Lexicon + Surface Words                                  0.7415     0.6817  0.7103
Lexicon + Surface Words + Window Words                   0.7843     0.7039  0.7420
Lexicon + Surface + Window Words + Dependency features   0.7756     0.7347  0.7546

Preliminary data from a training set of 14 documents + testing on 16 documents
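As a quick consistency check on these results, the F-Score column can be recomputed from the Precision and Recall columns. The sketch below assumes the balanced F-score (harmonic mean); the function name and dictionary layout are illustrative only.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-score), copied from the slide.
results = {
    "Baseline": (0.3926, 0.1673, 0.2346),
    "Lexicon": (0.5689, 0.3771, 0.4536),
    "Lexicon + Surface Words": (0.7415, 0.6817, 0.7103),
    "Lexicon + Surface Words + Window Words": (0.7843, 0.7039, 0.7420),
    "Lexicon + Surface + Window Words + Dependency features":
        (0.7756, 0.7347, 0.7546),
}

for name, (p, r, reported) in results.items():
    computed = f_score(p, r)
    print(f"{name}: computed {computed:.4f}, reported {reported:.4f}")
```

The recomputed values match the reported ones to within rounding of the inputs. The table also shows that adding dependency features raises recall (0.7039 to 0.7347) while precision drops slightly (0.7843 to 0.7756).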

Page 12

Field Labeling Results: Confusion Matrices

Counts (axis labels on the slide: machine labels, human labels)

                     O      injectionLocation  injectionSpread  labelingDescription  labelingLocation  tracerChemical  Total
O                    41087  141                97               338                  1751              6               43420
injectionLocation    545    744                48               6                    820               1               2164
injectionSpread      126    43                 147              11                   155               0               482
labelingDescription  1121   5                  0                3773                 82                47              5028
labelingLocation     1988   224                110              27                   9251              0               11600
tracerChemical       108    1                  12               0                    0                 623             744
Total                44975  1158               414              4155                 12059             677
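The counts in this confusion matrix can be checked and summarized with a few lines of code. The sketch below is illustrative: the extracted text does not make clear which axis holds the human labels and which the machine labels, so it reports the diagonal as a fraction of both the row total and the column total without naming either one precision or recall.

```python
labels = [
    "O", "injectionLocation", "injectionSpread",
    "labelingDescription", "labelingLocation", "tracerChemical",
]

# Rows and columns follow the order of `labels`, as on the slide.
counts = [
    [41087, 141, 97, 338, 1751, 6],
    [545, 744, 48, 6, 820, 1],
    [126, 43, 147, 11, 155, 0],
    [1121, 5, 0, 3773, 82, 47],
    [1988, 224, 110, 27, 9251, 0],
    [108, 1, 12, 0, 0, 623],
]

row_totals = [sum(row) for row in counts]
col_totals = [sum(col) for col in zip(*counts)]
total = sum(row_totals)
diagonal = [counts[i][i] for i in range(len(labels))]

# Fraction of all counted items on which the two labelings agree.
agreement = sum(diagonal) / total
print(f"overall agreement: {agreement:.4f}")

for i, label in enumerate(labels):
    by_row = diagonal[i] / row_totals[i]
    by_col = diagonal[i] / col_totals[i]
    print(f"{label}: diag/row = {by_row:.4f}, diag/col = {by_col:.4f}")
```

The row and column sums reproduce the totals printed on the slide, which confirms that the matrix was recovered intact.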

Page 13

Generalizing the methodology: ‘Histology’

[from Gonzalo-Ruiz et al 1992, JCN 321: 300-311]

Page 14

The logical structure of a tract-tracing experiment

• Tracer Chemical [1]
• Injection Site [1]
   • Location: brain structure, topography, side
• Labeled region [1...*]
   • Location: brain structure, topography, ipsi-contra relative to injection site?
   • Label type: ‘anterograde’ or ‘retrograde’
   • Label density

Page 15

Time and effort
• Current performance achieved by annotating 40 documents
• Each document contains 97 sentences (in results section) on average
• Annotation rate
   • ~40 sentences/hr (no support)
   • ~115 sentences/hr (after 20 documents)
• Time taken to annotate documents to train the system to perform at this standard: ~65 hours with no support
• Estimate ~2 months for a 50% RA (20 hours/week)
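The effort figures on this slide can be related with simple arithmetic. The slide does not say how the ~65 hours was derived, so the two-phase reading below (the first 20 documents at the unsupported rate, the remaining 20 at the supported rate) is an assumption; the constant and function names are illustrative.

```python
DOCUMENTS = 40
SENTENCES_PER_DOC = 97        # average, results section only
RATE_UNSUPPORTED = 40         # sentences per hour, no support
RATE_SUPPORTED = 115          # sentences per hour, after 20 documents
SUPPORT_STARTS_AFTER = 20     # documents annotated before support helps


def annotation_hours(docs: int, sentences_per_doc: int, rate: float) -> float:
    """Hours needed to annotate `docs` documents at `rate` sentences/hour."""
    return docs * sentences_per_doc / rate


# Reading 1: every document annotated at the unsupported rate.
unsupported_only = annotation_hours(DOCUMENTS, SENTENCES_PER_DOC, RATE_UNSUPPORTED)

# Reading 2 (assumed): the rate changes after the first 20 documents.
mixed = (
    annotation_hours(SUPPORT_STARTS_AFTER, SENTENCES_PER_DOC, RATE_UNSUPPORTED)
    + annotation_hours(DOCUMENTS - SUPPORT_STARTS_AFTER, SENTENCES_PER_DOC,
                       RATE_SUPPORTED)
)

print(f"unsupported throughout: {unsupported_only:.1f} hours")
print(f"two-phase reading:      {mixed:.1f} hours")
```

Under these assumptions the two-phase reading comes out close to the ~65 hours quoted on the slide, while annotating all 40 documents at the unsupported rate would take noticeably longer.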

Page 16

Can we discover the schema from the text?

Given a large review or a grant proposal specific to a single laboratory

Annotate independent and dependent variables in papers.

Can we learn and extract these patterns?

Page 17

An example from current set of annotations

10 independent variables:
• age
• species
• sex
• weight
• agonist/antagonist combinations (9)
• primary antibody
• preparation
• protocol
• brain region

1 dependent variable:
• signal density

Page 18

Acknowledgements

Funding: Information Sciences Institute, seed funding * National Library of Medicine (RO1-LM07061) * NSF (LONI MAP project) HBP (USCBP)

Neuroscience consultants: Alan Watts * Larry Swanson * Arshad Khan * Rick Thompson * Joel Hahn * Lori Gorton * Kim Rapp

Computer Scientists: Eduard Hovy * Donghui Feng * Patrick Pantel

Developers: Tommy Ingulfsen * Wei-Cheng Cheng