Biomedical Information Extraction

Post on 20-Dec-2015


Outline
  Intro to biomedical information extraction
  PASTA [Demetriou and Gaizauskas]
  Biomedical named entities
    Name variability [Cohen, Dolbey, Acquaah-Mensah, and Hunter]
    Name tagging [Tanabe and Wilbur]

PASTA [Demetriou and Gaizauskas]
  Protein Active Site Template Acquisition

Extraction Tasks
  Terminological tagging: “entities”
  Template filling: “relationships”

Terminology Tagging
  protein
  species
  residue
  site
  region
  secondary structure
  supersecondary structure
  quaternary structure
  base
  atom
  non-protein compound
  interaction

Template Filling

  residue :=
    NAME: string
    SITE/FUN: string
    SEC_STRUCT: string
    QUAT_STRUCT: string
    REGION: string
    INTERACTION: string

  in_protein :=
    RESIDUE: residue
    PROTEIN: protein

  protein :=
    NAME: string

  species :=
    NAME: string

  in_species :=
    PROTEIN: protein
    SPECIES: species
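The template definitions above map naturally onto record types. The following is a minimal sketch, not PASTA's actual data model: the class and field names are Python renderings of the slot names, optional slots default to None by assumption, and the example values are illustrative rather than taken from the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Protein:
    name: str


@dataclass
class Species:
    name: str


@dataclass
class Residue:
    name: str
    # Assumption: every slot other than NAME may be left unfilled.
    site_fun: Optional[str] = None
    sec_struct: Optional[str] = None
    quat_struct: Optional[str] = None
    region: Optional[str] = None
    interaction: Optional[str] = None


@dataclass
class InProtein:
    """Relation template: a residue occurs in a protein."""
    residue: Residue
    protein: Protein


@dataclass
class InSpecies:
    """Relation template: a protein occurs in a species."""
    protein: Protein
    species: Species


# Illustrative values only (not from the slides).
protein = Protein(name="chymotrypsin")
residue = Residue(name="HIS-57", site_fun="active site")
in_protein = InProtein(residue=residue, protein=protein)
in_species = InSpecies(protein=protein, species=Species(name="cow"))
```

Keeping the relation templates (in_protein, in_species) separate from the entity templates mirrors the split between terminological tagging and template filling.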

PASTA Architecture
  Text Preprocessing
    Title, author, abstract
    Tokenization, sentence boundaries

PASTA Architecture
  Terminological Processing
    Morphological analysis
      biochemical morphemes, e.g. “-ase”
    Lexical lookup
      token lookup in databases
      token grammatical class tagging
    Terminology parsing
      create multi-token terms, rule-based parsing using grammatical tags
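As a rough illustration of the first two terminological steps, the sketch below combines a lexical lookup with a morphological cue for the "-ase" suffix. The lexicon contents, the class labels, and the function names are hypothetical stand-ins for the databases and tag set PASTA actually used; terminology parsing into multi-token terms is not covered.

```python
import re

# Hypothetical mini-lexicon standing in for the database lookups.
LEXICON = {
    "lysozyme": "protein",
    "serine": "residue",
    "helix": "secondary_structure",
}


def classify_token(token):
    """Return a terminology class for one token, or None if unknown."""
    word = token.lower()
    if word in LEXICON:  # lexical lookup
        return LEXICON[word]
    if len(word) > 4 and word.endswith("ase"):  # morphological cue: "-ase"
        return "protein"
    return None


def tag_tokens(text):
    """Tokenize on alphanumeric runs and attach a class to each token."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return [(token, classify_token(token)) for token in tokens]


tagged = tag_tokens("The kinase binds serine")
```

A suffix cue this crude also fires on ordinary words such as "disease" or "release", which is one reason the lookup and parsing stages are needed alongside it.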

PASTA Architecture
  Syntactic and Semantic Processing
    Part-of-speech tags
    Phrase structure
    Compositional semantics
  Discourse Processing
    Semantic representations incorporated into a discourse model of concept hierarchy and inference rules

PASTA Architecture
  Template Extraction
    Scan discourse model for template instances, check slots, build template

Performance (R = recall, P = precision, in percent)

               Dev        Inter-annotator   Test
  Terminology  88R/94P    92R/86P           82R/84P
  Template     69R/79P    78R/80P           69R/64P
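The slides report recall and precision only. To compare the columns with a single number, one common choice is the balanced F-measure (the harmonic mean of the two). The sketch below computes it from the figures in the table; the F-measure itself is an addition here, not something the slides report.

```python
def f1(recall, precision):
    """Harmonic mean of recall and precision (both given in percent)."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# (recall, precision) pairs copied from the performance table.
SCORES = {
    ("Terminology", "Dev"): (88, 94),
    ("Terminology", "Inter-annotator"): (92, 86),
    ("Terminology", "Test"): (82, 84),
    ("Template", "Dev"): (69, 79),
    ("Template", "Inter-annotator"): (78, 80),
    ("Template", "Test"): (69, 64),
}

for (task, split), (recall, precision) in SCORES.items():
    print(f"{task:12s} {split:16s} F1 = {f1(recall, precision):.1f}")
```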

PASTAWeb
  Index
    document -> terminology, templates
    terms -> templates from multiple documents
  IE tools need to be incorporated into effective interfaces for biology researchers
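The two index directions can be sketched as a pair of dictionaries. This is an illustration of the idea only, not PASTAWeb's implementation: the function name, the input shape, the document ids, and the example extractions are all hypothetical.

```python
from collections import defaultdict


def build_indexes(extractions):
    """Build both indexes from (doc_id, terms, templates) triples.

    terms is a set of strings; templates is a list of slot -> value dicts.
    """
    doc_index = {}                  # document -> its terms and templates
    term_index = defaultdict(list)  # term -> templates from any document
    for doc_id, terms, templates in extractions:
        doc_index[doc_id] = {"terms": sorted(terms), "templates": templates}
        for template in templates:
            for value in template.values():
                if value in terms:
                    term_index[value].append((doc_id, template))
    return doc_index, dict(term_index)


# Illustrative extractions (not from the slides).
extractions = [
    ("doc1", {"lysozyme", "GLU-35"},
     [{"RESIDUE": "GLU-35", "PROTEIN": "lysozyme"}]),
    ("doc2", {"lysozyme", "hen"},
     [{"PROTEIN": "lysozyme", "SPECIES": "hen"}]),
]
doc_index, term_index = build_indexes(extractions)
```

Looking up `term_index["lysozyme"]` then returns templates drawn from both documents, which is the cross-document view a researcher-facing interface would need.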

Indexing Problem
  Variations in expression of the same protein name

Contrast and Variability [Cohen, Dolbey, Acquaah-Mensah, and Hunter]
  Named entities
    location vs. identification

Variability
  somatotropin
  rat somatotropin
  growth hormone

Variability
  Non-contrast (synonyms)
    tumor protein homolog vs. tumour protein homologue
  Contrast (diffonyms?)
    ACE1 vs. ACE2

Transformations
  1. Remove first character
  2. Remove first word
  3. Remove last character
  4. Remove last word
  5. Replace sequence of vowels with one letter
  6. Replace hyphen with space
  7. Remove parenthesized material
  8. Convert to lowercase
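The eight transformations are simple string operations. The sketch below is one possible reading of them, not the authors' code: the function names are invented, the placeholder letter "V" for a vowel sequence is an assumption (the slides do not say which letter is used), and `linking_transformations` is a hypothetical helper that reports which transformations relate two names.

```python
import re

# One function per transformation, in the order listed above.
TRANSFORMATIONS = {
    "remove_first_char": lambda s: s[1:],
    "remove_first_word": lambda s: s.split(" ", 1)[1] if " " in s else "",
    "remove_last_char": lambda s: s[:-1],
    "remove_last_word": lambda s: s.rsplit(" ", 1)[0] if " " in s else "",
    # Assumption: "V" stands in for the unspecified replacement letter.
    "collapse_vowels": lambda s: re.sub(r"[aeiouAEIOU]+", "V", s),
    "hyphen_to_space": lambda s: s.replace("-", " "),
    "remove_parenthesized": lambda s: re.sub(r"\s*\([^)]*\)", "", s).strip(),
    "lowercase": str.lower,
}


def linking_transformations(a, b):
    """Names of transformations that turn one name into the other,
    or turn both into the same non-empty string."""
    links = []
    for label, transform in TRANSFORMATIONS.items():
        ta, tb = transform(a), transform(b)
        if ta == b or tb == a or (ta != "" and ta == tb):
            links.append(label)
    return links


print(linking_transformations("tumor", "tumour"))
print(linking_transformations("ACE1", "ACE2"))
print(linking_transformations("rat somatotropin", "somatotropin"))
```

On the slide's own examples, the synonym pair is linked by vowel collapsing, while the contrasting pair ACE1/ACE2 is linked only by removing the last character.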

Experiment
  Collect groups of synonymous gene names
  Get mouse, rat, and human genes from LocusLink
  Group OFFICIAL GENE NAME, PREFERRED GENE NAME, OFFICIAL SYMBOL, PREFERRED SYMBOL, PRODUCT, PREFERRED PRODUCT, ALIAS SYMBOL, ALIAS PROT entries together as synonyms

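Results
  LMW, RMC, RMW (apparently the first-word, last-character, and last-word removals) identify contrastive variability
    Contrasts are likely marked at name boundaries
  VS, HYPH, CASE, PM (vowel sequences, hyphens, case, parenthesized material) identify non-contrastive variability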

Pattern Heuristics
  1. Equivalence of vowel sequences
  2. Optionality of hyphens
  3. Optionality of parenthesized material
  4. Case insensitivity
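One way to use the four heuristics is to fold them into a single matching key, so that names differing only in non-contrastive ways compare equal. This is a sketch under assumptions: the function name `match_key` is invented, hyphens are treated as spaces, and "V" is an arbitrary placeholder for any vowel sequence.

```python
import re


def match_key(name):
    """Normalize a gene or protein name using the four pattern heuristics."""
    key = name.lower()                     # 4. case insensitivity
    key = re.sub(r"\([^)]*\)", " ", key)   # 3. parenthesized material is optional
    key = key.replace("-", " ")            # 2. hyphens are optional
    key = re.sub(r"[aeiou]+", "V", key)    # 1. vowel sequences are equivalent
    return " ".join(key.split())           # tidy up leftover whitespace


same_spelling = match_key("Tumour protein") == match_key("tumor protein")
same_hyphen = match_key("IL-2") == match_key("il 2")
still_distinct = match_key("ACE1") != match_key("ACE2")
```

The key deliberately leaves the contrastive positions alone, so ACE1 and ACE2 stay distinct. It does not unify every synonym pair: "homolog" and "homologue" still produce different keys because of the trailing vowel sequence.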

Tagging Genes and Proteins [Tanabe and Wilbur]
  ABGene
    Trained on MEDLINE abstracts
    Tested on PUBMED full texts

ABGene
  Transformation-based tagger
  False-positive and false-negative filters
  Compound term recovery
  Document ranking

Transformation-Based Tagging
  Learns a sequence of transformation rules of the form A -> B / C (change tag A to tag B in context C) greedily, based on the number of errors corrected in the training data tags
  Applies rules sequentially to tag new text
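The greedy learning loop can be sketched in a few lines. This toy version is not ABGene or Brill's tagger: it assumes a single rule template ("change tag A to B when the previous word is W"), whereas real transformation-based taggers search many templates, and all function names and example data are made up for illustration.

```python
def apply_rule(rule, words, tags):
    """Apply one rule: from_tag -> to_tag when the previous word matches."""
    from_tag, to_tag, prev_word = rule
    new_tags = list(tags)
    for i in range(1, len(words)):
        if tags[i] == from_tag and words[i - 1] == prev_word:
            new_tags[i] = to_tag
    return new_tags


def count_errors(tags, gold_tags):
    return sum(tag != gold for tag, gold in zip(tags, gold_tags))


def learn_rules(words, initial_tags, gold_tags, max_rules=10):
    """Greedily pick the rule that corrects the most errors, then repeat."""
    tags = list(initial_tags)
    learned = []
    for _ in range(max_rules):
        # Propose a candidate rule at every position that is currently wrong.
        candidates = {
            (tags[i], gold_tags[i], words[i - 1])
            for i in range(1, len(words))
            if tags[i] != gold_tags[i]
        }
        best_rule, best_gain = None, 0
        for rule in sorted(candidates):
            gain = count_errors(tags, gold_tags) - count_errors(
                apply_rule(rule, words, tags), gold_tags)
            if gain > best_gain:
                best_rule, best_gain = rule, gain
        if best_rule is None:  # no rule reduces the error count
            break
        tags = apply_rule(best_rule, words, tags)
        learned.append(best_rule)
    return learned


def tag(words, initial_tags, rules):
    """Tag new text by applying the learned rules in order."""
    tags = list(initial_tags)
    for rule in rules:
        tags = apply_rule(rule, words, tags)
    return tags


# Toy training data (illustrative only).
words = ["the", "gene", "BRCA1", "binds", "gene", "RAD51"]
initial = ["DT", "NN", "NNP", "VBZ", "NN", "NNP"]
gold = ["DT", "NN", "GENE", "VBZ", "NN", "GENE"]
rules = learn_rules(words, initial, gold)
```

Because each rule is scored by its net error reduction on the current tags, a rule that fixes some tokens but breaks others is only chosen if it helps overall.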

Gene Transformations
  GENE added as an additional POS tag
  NNP -> GENE / gene fgoodleft
  * -> GENE / hassuf -A
  * -> GENE / haspref c-
  NNP -> GENE / prev1or2wd genes
  NNP -> GENE / nextbigram ( GENE
  VBG -> JJ / nexttag GENE
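To make the rule notation concrete, the sketch below applies four of the listed rules to a toy sentence. The reading of each condition (hassuf as "word ends with", haspref as "word starts with", prev1or2wd as "one of the two preceding words", nexttag as "tag of the following token") is an assumption based on the rule names; the fgoodleft and nextbigram rules are left out, and the helper names and example sentence are invented.

```python
def has_suffix(suffix):
    return lambda words, tags, i: words[i].endswith(suffix)


def has_prefix(prefix):
    return lambda words, tags, i: words[i].startswith(prefix)


def prev_1_or_2_word(word):
    return lambda words, tags, i: word in words[max(0, i - 2):i]


def next_tag(tag):
    return lambda words, tags, i: i + 1 < len(tags) and tags[i + 1] == tag


# (from_tag, to_tag, condition); "*" matches any current tag.
RULES = [
    ("*", "GENE", has_suffix("-A")),
    ("*", "GENE", has_prefix("c-")),
    ("NNP", "GENE", prev_1_or_2_word("genes")),
    ("VBG", "JJ", next_tag("GENE")),
]


def apply_rules(words, tags, rules):
    """Apply each rule in order, scanning the sentence left to right."""
    tags = list(tags)
    for from_tag, to_tag, condition in rules:
        for i in range(len(words)):
            if from_tag in ("*", tags[i]) and condition(words, tags, i):
                tags[i] = to_tag
    return tags


# Toy input (illustrative only).
words = ["genes", "Ras", "and", "c-myc", "activating", "HLA-A"]
tags = ["NNS", "NNP", "CC", "NN", "VBG", "NN"]
retagged = apply_rules(words, tags, RULES)
```

Rule order matters: the last rule can only retag "activating" because an earlier rule has already marked the following token as GENE.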

Results
  Precision up to 0.74
  Recall up to 0.64
  Both depend on the score threshold

Problems in Full Text
  Terms that do not appear in abstracts
    restriction enzyme sites, lab protocol kits, primers, vectors, supply companies, chemical reagents
  Figures and tables

Summary
  Common thread in biomedical information extraction: normalization is hard!