BIOI7791 Spring 2005
Projects in bioinformatics:
natural language processing
March 31, 2005
© Kevin Cohen
Full parse, or shallow parse?
• [[More important] [points [about information extraction]]]
• [[More] [important points [about information extraction]]]
Which full parse is right?
• [[More important] [points [about information extraction]]]
• [[More] [important points [about information extraction]]]
Here are some important points about information extraction, in addition to the ones that I gave you on Tuesday.
• Message Understanding Conferences (MUC)
• Context-free grammars
• Structural ambiguity
• Resolving structural ambiguity
• Semantic grammars
• Some things that neither FSAs nor CFGs can handle
The MUCs
• Information extraction and related tasks
  – Entity identification
• Central task definition, central data production and distribution, central scoring
• Classic tasks:
  – Terrorist attacks
  – Mergers, “executive succession”
MUC = Message Understanding Conference
The MUCs
• Performance:
  – Best systems: ~.75
  – More typical: ~.65
• Why they ended:
  – Funding ended
  – Convergence of solutions led to loss of interest
Context-free grammars: The Chomsky hierarchy
• Turing-equivalent
• Context-sensitive grammars
• Context-free grammars
• Regular grammars
“The older I get, the further down the Chomsky hierarchy I go.”
--Aravind Joshi
Increasing “computational complexity” (from regular grammars at the bottom of the list up to Turing-equivalent at the top)
Structural ambiguity
• Transport of [peptides from cerebrospinal fluid]
• Transport of [peptides] [from cerebrospinal fluid]
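To see that a grammar really does license both bracketings, one can count parses with a small CKY chart. The sketch below is only illustrative: the three binary rules and the toy lexicon are assumptions made up for this example, not a grammar from the course.

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form (an assumption for this sketch).
BINARY_RULES = [
    ("NP", "NP", "PP"),   # noun phrase modified by a prepositional phrase
    ("PP", "P", "NP"),    # preposition plus noun phrase
    ("NP", "ADJ", "NP"),  # adjective plus noun phrase
]
LEXICON = {
    "transport": ["NP"],
    "peptides": ["NP"],
    "fluid": ["NP"],
    "cerebrospinal": ["ADJ"],
    "of": ["P"],
    "from": ["P"],
}

def count_parses(tokens, goal="NP"):
    """Count the distinct parse trees that cover all tokens with `goal`."""
    n = len(tokens)
    chart = defaultdict(int)  # (start, end, symbol) -> number of trees
    for i, token in enumerate(tokens):
        for symbol in LEXICON[token]:
            chart[i, i + 1, symbol] += 1
    for width in range(2, n + 1):
        for start in range(n - width + 1):
            end = start + width
            for mid in range(start + 1, end):
                for parent, left, right in BINARY_RULES:
                    chart[start, end, parent] += (
                        chart[start, mid, left] * chart[mid, end, right]
                    )
    return chart[0, n, goal]

phrase = "transport of peptides from cerebrospinal fluid".split()
print(count_parses(phrase))
```

The two trees differ only in where "from cerebrospinal fluid" attaches: to "peptides" or to "transport of peptides".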
Resolving structural ambiguity
• Guess (e.g., first one)
• Purely statistical
  – P(parse | “head” word)
• Knowledge-based rules, top-down development
  – If the pattern is “translocation of protein X from compartment Y…”
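The first two strategies can be combined in a few lines: estimate P(attachment | head word) by relative frequency, and fall back to guessing the first candidate when the head word has never been seen. This is a minimal sketch; the observation list is invented toy data, and the labels "high" and "low" are hypothetical names for the two attachment sites.

```python
from collections import Counter, defaultdict

def train(observations):
    """Count attachment decisions per head word from (head, attachment) pairs."""
    counts = defaultdict(Counter)
    for head, attachment in observations:
        counts[head][attachment] += 1
    return counts

def p_attachment(counts, head, attachment):
    """Relative-frequency estimate of P(attachment | head word)."""
    seen = counts.get(head, Counter())
    total = sum(seen.values())
    return seen[attachment] / total if total else 0.0

def choose(counts, head, candidates):
    """Pick the most probable attachment; guess the first one if head is unseen."""
    if not counts.get(head):
        return candidates[0]
    return max(candidates, key=lambda a: p_attachment(counts, head, a))

# Invented toy data: "high" = PP attaches to the head noun,
# "low" = PP attaches to the nearest noun.
toy = [("transport", "high")] * 3 + [("transport", "low")]
model = train(toy)
print(p_attachment(model, "transport", "high"))
print(choose(model, "transport", ["low", "high"]))
print(choose(model, "binding", ["low", "high"]))
```

A real system would need smoothing and far more data; the point is only that the statistic conditions on a single word.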
(shallow parsing can have ambiguity issues, too)
• Basic definition of the NounGroup: noun plus everything to its left
• gene
• liver gene
• human liver gene
• rat and human liver gene
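The basic NounGroup definition is easy to implement over part-of-speech tags, and doing so shows where it breaks. The sketch below assumes Penn Treebank-style tags supplied by hand (the tag assignments are an assumption for illustration); it collects maximal runs of adjectives and nouns that end in a noun.

```python
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
PREMODIFIER_TAGS = {"JJ"} | NOUN_TAGS

def _flush(current, groups):
    """Close the current run, trimming trailing words that are not nouns."""
    while current and current[-1][1] not in NOUN_TAGS:
        current = current[:-1]
    if current:
        groups.append(" ".join(word for word, _ in current))

def noun_groups(tagged):
    """Return noun groups: a head noun plus the adjectives/nouns to its left."""
    groups, current = [], []
    for word, tag in tagged:
        if tag in PREMODIFIER_TAGS:
            current.append((word, tag))
        else:
            _flush(current, groups)
            current = []
    _flush(current, groups)
    return groups

print(noun_groups([("human", "JJ"), ("liver", "NN"), ("gene", "NN")]))
print(noun_groups([("rat", "NN"), ("and", "CC"), ("human", "JJ"),
                   ("liver", "NN"), ("gene", "NN")]))
```

On the last example the conjunction splits the phrase into "rat" and "human liver gene", although "rat" most plausibly modifies "liver gene" as well. That is the ambiguity issue the slide title refers to.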
Semantic grammars: motivation
• Problems to solve:
  – Structural ambiguity
  – Irrelevant syntactic structure
  – How to make use of what we know about the domain/words?
• Solution:
  – “rules and constituents…correspond directly to entities and relations from the domain” (J&M)
  – “key semantic components occur together within single rules” (J&M)
  – Terminals and non-terminals mix freely
Semantic grammars: typical example
• Restaurant-finding system
  – InfoRequest → USER want to go to eat FOOD_TYPE TIME_EXPRESSION
  – I want to go to eat some Italian food today
  – FOOD_TYPE → (some)? (Italian|Chinese) (food)?
  – TIME_EXPRESSION → today|now|tonight
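Because none of these rules is recursive, the whole grammar can be compiled into one regular expression. The sketch below does that with Python's `re` module. Expanding USER to the literal word "I" is an assumption taken from the example sentence, and the group names `food` and `time` are invented for this sketch.

```python
import re

# Each non-terminal becomes a regex fragment; terminals are literal words.
USER = r"I"  # assumption: the slide does not expand USER
FOOD_TYPE = r"(?:some )?(?P<food>Italian|Chinese)(?: food)?"
TIME_EXPRESSION = r"(?P<time>today|now|tonight)"
INFO_REQUEST = re.compile(
    rf"^{USER} want to go to eat {FOOD_TYPE} {TIME_EXPRESSION}$"
)

def parse_request(sentence):
    """Return the semantic slots of an InfoRequest, or None if no match."""
    match = INFO_REQUEST.match(sentence)
    return match.groupdict() if match else None

print(parse_request("I want to go to eat some Italian food today"))
print(parse_request("I want to go to eat Chinese tonight"))
print(parse_request("I want to go to sleep now"))
```

The match yields the domain entities directly, with no intermediate syntactic tree, which is the point of a semantic grammar.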
Semantic grammars: biomedical example
• GENE_DISEASE_ASSOCIATION → GENE ROLE_PLAY_VERB_GROUP DISEASE
• GENE_PROCESS_ASSOCIATION → GENE ROLE_PLAY_VERB_GROUP BIOLOGICAL_PROCESS
• ROLE_PLAY_VERB_GROUP → (might|may)? plays? (a|an) (important|crucial)? role in
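A rough Python rendering of these three rules follows. The GENE, DISEASE and BIOLOGICAL_PROCESS lexicons are placeholder names invented for the sketch; in practice they would come from an entity identification step. The alternation between "a" and "an" is grouped explicitly so that it does not swallow the rest of the rule.

```python
import re

# Placeholder lexicons (hypothetical; not real gene or disease names).
GENES = ["GeneX", "GeneY"]
DISEASES = ["DiseaseA", "DiseaseB"]
PROCESSES = ["apoptosis", "cell migration"]

def alternatives(terms):
    """Build a regex alternation from a list of literal terms."""
    return "|".join(re.escape(term) for term in terms)

ROLE_PLAY_VERB_GROUP = (
    r"(?:(?:might|may) )?plays? (?:a|an) (?:(?:important|crucial) )?role in"
)
ASSOCIATION = re.compile(
    rf"(?P<gene>{alternatives(GENES)}) {ROLE_PLAY_VERB_GROUP} "
    rf"(?:(?P<disease>{alternatives(DISEASES)})"
    rf"|(?P<process>{alternatives(PROCESSES)}))"
)

def extract(sentence):
    """Return (rule name, gene, disease or process), or None if no rule fires."""
    match = ASSOCIATION.search(sentence)
    if match is None:
        return None
    if match.group("disease"):
        return ("GENE_DISEASE_ASSOCIATION", match.group("gene"), match.group("disease"))
    return ("GENE_PROCESS_ASSOCIATION", match.group("gene"), match.group("process"))

print(extract("GeneX may play an important role in DiseaseA"))
print(extract("GeneY plays a role in apoptosis"))
```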
Semantic grammars
• Scalability
  – How much has to be reinvented every time?
  – NP → AdjectivePhrase NP
  – NP → NP conjunction NP
  – NP → ((NP, )+ (and)?) NP
  – NP → NP PP
  – NP → NP RelativeClause
• Extensibility
  – If it works on transport texts, will it work on signal transduction?
  – If it works on GeneRIFs, will it work on abstracts?
  – If it works on abstracts, will it work on full text?
Do you want to repeat this for cargo and driver and cellular_compartment and cell_cycle_phase and species and cell line and posttranslational_modification and splice_variant and…
An interesting task: “tagging” roles
• ProteinX transports ProteinY
  – ProteinX: Subject; ProteinY: DirectObject
• ProteinY is transported by ProteinX
  – ProteinY: Subject; ProteinX: Oblique
These are syntactic roles… How useful are they?
An interesting task: “tagging” roles
• ProteinX transports ProteinY
  – ProteinX: Driver; ProteinY: Cargo
• ProteinY is transported by ProteinX
  – ProteinY: Cargo; ProteinX: Driver
These are semantic roles… How useful are they?
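A hand-written sketch makes the contrast concrete: the active and passive sentences put different proteins in subject position, yet both should yield the same Driver and Cargo. The two patterns below cover only the verb "transport" and single-word arguments, an assumption made to keep the example short. This is a rule-based illustration, not a learning-based role labeler.

```python
import re

# One pattern per voice; group names carry the semantic roles.
ACTIVE = re.compile(r"^(?P<driver>\w+) transports (?P<cargo>\w+)$")
PASSIVE = re.compile(r"^(?P<cargo>\w+) is transported by (?P<driver>\w+)$")

def semantic_roles(sentence):
    """Map a transport sentence to its Driver and Cargo, whatever the voice."""
    for pattern in (ACTIVE, PASSIVE):
        match = pattern.match(sentence)
        if match:
            return match.groupdict()
    return None

print(semantic_roles("ProteinX transports ProteinY"))
print(semantic_roles("ProteinY is transported by ProteinX"))
```

Both calls return the same role assignment, which is what makes semantic roles more directly useful than syntactic ones for filling an extraction template.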
How do you do semantic role assignment?
• Learning-based
  – E.g. Gildea and Jurafsky
  – Features include:
    – Head word (semantic)
    – Position in syntactic structure (syntactic)
Things that neither FSAs nor CFGs can handle
• “Template merging”
  – ProteinX is transported from the cell membrane
    – Transport(ProteinX, cell membrane, _)
  – ProteinX is translocated to the nucleus
    – Transport(ProteinX, _, nucleus)
  – Merged: Transport(ProteinX, cell membrane, nucleus)
• Coreference/anaphora
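Template merging from the example above happens after parsing, as a separate step over the filled templates. A minimal sketch, assuming the three slots of Transport are cargo, source and destination (the slot names are a guess; the slides give only positions), with `None` standing for the empty slot "_":

```python
from typing import NamedTuple, Optional

class Transport(NamedTuple):
    cargo: str
    source: Optional[str] = None
    destination: Optional[str] = None

def merge(a, b):
    """Merge two templates slot by slot; return None if any filled slots conflict."""
    merged = []
    for x, y in zip(a, b):
        if x is not None and y is not None and x != y:
            return None  # incompatible templates describe different events
        merged.append(x if x is not None else y)
    return Transport(*merged)

first = Transport("ProteinX", source="cell membrane")
second = Transport("ProteinX", destination="nucleus")
print(merge(first, second))
```

Deciding that the two sentences describe the same event in the first place is the hard part, and that is where coreference comes in.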
Coreference/anaphora
• Coreference: two linguistic things refer to the same thing.
• Anaphora: one of them has no independent meaning.
This week’s programming assignment: learn to use the Brill tagger
• Download from http://research.microsoft.com/%7Ebrill/
• Molecular biology data in Brill format: on compbio, /snurp/ie/GENETAG
• Convert last week’s data files to Brill format
• Tag
• Retrain tagger with molbio data
• Re-tag
• Turn in: a user manual (how to install, use, and train the Brill tagger)
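For the conversion step, a small helper along these lines may be a useful starting point. It assumes that Brill format means one sentence per line, with space-separated tokens written as word/TAG for tagged data, and that last week's data can be read as lists of (token, tag) pairs. Both are assumptions; check them against the tagger's own documentation and the GENETAG files.

```python
def to_brill_line(tagged_sentence, with_tags=True):
    """Render one sentence as a line of space-separated word/TAG tokens.

    With with_tags=False, emit only the tokens, as input for tagging.
    """
    parts = []
    for token, tag in tagged_sentence:
        if not token or any(ch.isspace() for ch in token):
            raise ValueError(f"token must be non-empty without whitespace: {token!r}")
        parts.append(f"{token}/{tag}" if with_tags else token)
    return " ".join(parts)

sentence = [("ProteinX", "NN"), ("transports", "VBZ"), ("ProteinY", "NN"), (".", ".")]
print(to_brill_line(sentence))
print(to_brill_line(sentence, with_tags=False))
```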