BIOI7791 Spring 2005 Projects in bioinformatics: natural language processing March 31, 2005 © Kevin Cohen


BIOI7791 Spring 2005

Projects in bioinformatics: natural language processing

March 31, 2005

© Kevin Cohen

Hong Yu job talk

Noon tomorrow!

More important points about information extraction

I’m SOOO frustrated…

Full parse, or shallow parse?

• [More important points] [about] [information extraction]

Full parse, or shallow parse?

• [[More important] [points [about information extraction]]]

• [[More] [important points [about information extraction]]]

Which full parse is right?

• [[More important] [points [about information extraction]]]

• [[More] [important points [about information extraction]]]

Here are some important points about information extraction, in addition to the ones that I gave you on Tuesday.

Thanks—I feel better now…

• Message Understanding Conferences (MUC)

• Context-free grammars

• Structural ambiguity

• Resolving structural ambiguity

• Semantic grammars

• Some things that neither FSAs nor CFGs can handle

The MUCs

• Information extraction and related tasks
  – Entity identification

• Central task definition, central data production and distribution, central scoring

• Classic tasks:
  – Terrorist attacks
  – Mergers, “executive succession”

MUC = Message Understanding Conference

The MUCs

• Performance:
  – Best systems: ~.75
  – More typical: ~.65

• Why they ended:
  – Funding ended
  – Convergence of solutions led to loss of interest

Context-free grammars: The Chomsky hierarchy

• Turing-equivalent

• Context-sensitive grammars

• Context-free grammars

• Regular grammars

“The older I get, the further down the Chomsky hierarchy I go.”

--Aravind Joshi

Increasing “computational complexity” from the bottom of this list (regular grammars) to the top (Turing-equivalent)

Context-free grammars: recursion

• NP → NP conjunction NP
  – p53 and BRCA1

• NP → NP PP

• PP → P NP
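A minimal sketch of why these rules are recursive: NP appears on both sides of a rule, so each extra level of derivation yields longer noun phrases. The rule shapes come from the slide; the toy lexicon (including the preposition "in") and the `expand` helper are assumptions made only for illustration.

```python
import itertools

# Rules from the slide plus a toy lexicon; the lexicon (including the
# preposition "in") is an assumption made only for illustration.
GRAMMAR = {
    "NP": [["NP", "CONJ", "NP"], ["NP", "PP"], ["N"]],
    "PP": [["P", "NP"]],
    "N": [["p53"], ["BRCA1"]],
    "CONJ": [["and"]],
    "P": [["in"]],
}


def expand(symbol, depth):
    """Yield token lists derivable from `symbol` with trees at most `depth` levels deep."""
    if symbol not in GRAMMAR:      # terminal: nothing left to rewrite
        yield [symbol]
        return
    if depth == 0:                 # depth budget used up before reaching words
        return
    for rhs in GRAMMAR[symbol]:
        # Expand every right-hand-side symbol, then combine the alternatives.
        options = [list(expand(child, depth - 1)) for child in rhs]
        for combo in itertools.product(*options):
            yield [token for part in combo for token in part]


for depth in (2, 3, 4):
    phrases = sorted({" ".join(tokens) for tokens in expand("NP", depth)})
    print(depth, len(phrases), phrases[:4])
```

Without the depth limit the enumeration would never finish, which is the point: a finite set of recursive rules describes an unbounded set of phrases.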

Structural ambiguity

• Transport of [peptides from cerebrospinal fluid]

• Transport of [peptides] [from cerebrospinal fluid]
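The two bracketings above fall out mechanically from the recursive rules NP → NP PP and PP → P NP. The sketch below enumerates every bracketing a tiny grammar allows for the phrase. The lexicon, the `parses` function, and the choice to treat "cerebrospinal fluid" as a single token are assumptions for illustration, not part of the lecture.

```python
from functools import lru_cache

# Toy lexicon (assumption): part of speech for each token in the example.
LEXICON = {
    "transport": "N",
    "peptides": "N",
    "cerebrospinal_fluid": "N",   # multiword noun pre-joined into one token
    "of": "P",
    "from": "P",
}


def parses(tokens):
    """Return every bracketing of `tokens` as an NP under NP -> N | NP PP, PP -> P NP."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def np(i, j):
        results = []
        if j - i == 1 and LEXICON.get(tokens[i]) == "N":
            results.append(tokens[i])                  # NP -> N
        for k in range(i + 1, j):                      # NP -> NP PP, split at k
            for left in np(i, k):
                for right in pp(k, j):
                    results.append(f"[{left} {right}]")
        return tuple(results)

    @lru_cache(maxsize=None)
    def pp(i, j):
        if j - i < 2 or LEXICON.get(tokens[i]) != "P":
            return ()
        return tuple(f"[{tokens[i]} {inner}]" for inner in np(i + 1, j))  # PP -> P NP

    return list(np(0, len(tokens)))


for tree in parses("transport of peptides from cerebrospinal_fluid".split()):
    print(tree)
```

Each additional prepositional phrase multiplies the number of bracketings, which is why a full parser needs some way to choose among them.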

Resolving structural ambiguity

• Guess (e.g., first one)

• Purely statistical
  – P(parse | “head” word)

• Knowledge-based rules, top-down development
  – If the pattern is “translocation of protein X from compartment Y…”
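A minimal sketch of the knowledge-based option. The slide leaves the consequence of the rule unstated, so the sketch assumes one: for a transport-type noun, a following "from ..." phrase names the source compartment and attaches to the transport noun, not to the cargo. The pattern, the word list, and the `attach_from_phrase` function are all hypothetical.

```python
import re

# Assumed domain knowledge: nouns that describe movement of a cargo.
TRANSPORT_NOUNS = r"(?:transport|translocation)"

# Hypothetical rule: "<transport noun> of X from Y" -> Y modifies the transport noun.
RULE = re.compile(
    rf"\b(?P<head>{TRANSPORT_NOUNS}) of (?P<cargo>.+?) from (?P<source>.+)",
    re.IGNORECASE,
)


def attach_from_phrase(phrase):
    """Return the attachment decision for `phrase`, or None if the rule does not apply."""
    m = RULE.search(phrase)
    if m is None:
        return None
    return {
        "head": m["head"].lower(),
        "cargo": m["cargo"],
        "source": m["source"],
        "from_attaches_to": "head",   # the assumed consequence of the rule
    }


print(attach_from_phrase("Transport of peptides from cerebrospinal fluid"))
print(attach_from_phrase("Expression of p53"))
```

A rule like this trades coverage for precision: it decides only the cases the domain expert anticipated and stays silent on everything else.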

(shallow parsing can have ambiguity issues, too)

• Basic definition of the NounGroup: noun plus everything to its left

• gene

• liver gene

• human liver gene

• rat and human liver gene
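A minimal sketch of the basic NounGroup definition, working on word/TAG input. It assumes Penn Treebank-style tags; treating only adjectives and nouns as "everything to its left" is a simplification, and the `noun_groups` function and the tags given to the example words are assumptions.

```python
# Tags that may appear inside a noun group, and tags that may end one.
MODIFIER_TAGS = {"JJ", "NN", "NNS", "NNP"}
NOUN_TAGS = {"NN", "NNS", "NNP"}


def _close(run):
    """Trim trailing non-nouns so the group ends on a noun; return 0 or 1 groups."""
    while run and run[-1][1] not in NOUN_TAGS:
        run = run[:-1]
    return [" ".join(word for word, _ in run)] if run else []


def noun_groups(tagged_text):
    """Group each noun with the adjectives and nouns immediately to its left."""
    groups, current = [], []
    for token in tagged_text.split():
        word, tag = token.rsplit("/", 1)   # assumes every token is word/TAG
        if tag in MODIFIER_TAGS:
            current.append((word, tag))
        else:                              # any other tag ends the current run
            groups.extend(_close(current))
            current = []
    groups.extend(_close(current))
    return groups


print(noun_groups("green/JJ fluorescent/JJ protein/NN"))
print(noun_groups("rat/NN and/CC human/JJ liver/NN gene/NN"))
```

The second call shows the ambiguity the slide title refers to: this baseline stops at the conjunction and returns "rat" and "human liver gene" separately, whereas the intended reading may be a single group covering the rat and the human gene.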

I’m SOOO frustrated…

I can’t remember what JJ means!

green/JJ fluorescent/JJ protein/NN

Semantic grammars: motivation

• Problems to solve:
  – Structural ambiguity
  – Irrelevant syntactic structure
  – How to make use of what we know about the domain/words?

• Solution:
  – “rules and constituents…correspond directly to entities and relations from the domain” (J&M)
  – “key semantic components occur together within single rules” (J&M)
  – Terminals and non-terminals mix freely

Semantic grammars: typical example

• Restaurant-finding system
  – InfoRequest → USER want to go to eat FOOD_TYPE TIME_EXPRESSION
  – I want to go to eat some Italian food today
  – FOOD_TYPE → (some)? Italian|Chinese (food)?
  – TIME_EXPRESSION → today|now|tonight
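Because these rules contain no recursion, the restaurant grammar above can be sketched directly as composed regular expressions. The slide does not expand USER, so the alternatives given for it here are an assumption, as is the `parse_request` function.

```python
import re

USER = r"(?:I|we)"   # assumption: the slide leaves USER unexpanded
FOOD_TYPE = r"(?:some )?(?P<food>Italian|Chinese)(?: food)?"
TIME_EXPRESSION = r"(?P<time>today|now|tonight)"

# InfoRequest -> USER want to go to eat FOOD_TYPE TIME_EXPRESSION
INFO_REQUEST = re.compile(
    rf"{USER} want to go to eat {FOOD_TYPE} {TIME_EXPRESSION}",
    re.IGNORECASE,
)


def parse_request(utterance):
    """Return the semantic slots of an InfoRequest, or None if it does not match."""
    m = INFO_REQUEST.fullmatch(utterance.strip())
    return m.groupdict() if m else None


print(parse_request("I want to go to eat some Italian food today"))
print(parse_request("I want to eat pizza"))
```

The named groups are the payoff of a semantic grammar: a successful match already is the filled-in request, with no separate interpretation step over a syntactic tree.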

Semantic grammars: biomedical example

• GENE_DISEASE_ASSOCIATION → GENE ROLE_PLAY_VERB_GROUP DISEASE

• GENE_PROCESS_ASSOCIATION → GENE ROLE_PLAY_VERB_GROUP BIOLOGICAL_PROCESS

• ROLE_PLAY_VERB_GROUP → (might|may)? plays? a|an (important|crucial)? role in
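A minimal sketch of the GENE_DISEASE_ASSOCIATION rule that returns a structured relation. The ROLE_PLAY_VERB_GROUP pattern follows the slide; the gene and disease lexicons, the example sentences, and the names `Association`, `alternatives`, and `extract_association` are assumptions for illustration. A real system would need much larger lexicons or a named-entity recognizer.

```python
import re
from typing import NamedTuple, Optional

GENES = ["BRCA1", "p53"]                      # toy lexicon (assumption)
DISEASES = ["breast cancer", "lung cancer"]   # toy lexicon (assumption)


class Association(NamedTuple):
    gene: str
    disease: str


def alternatives(items):
    """Build a non-capturing alternation that matches any literal in `items`."""
    return "(?:" + "|".join(re.escape(item) for item in items) + ")"


# ROLE_PLAY_VERB_GROUP -> (might|may)? plays? a|an (important|crucial)? role in
ROLE_PLAY_VERB_GROUP = (
    r"(?:(?:might|may) )?plays? (?:a|an) (?:(?:important|crucial) )?role in"
)

# GENE_DISEASE_ASSOCIATION -> GENE ROLE_PLAY_VERB_GROUP DISEASE
GENE_DISEASE_ASSOCIATION = re.compile(
    rf"(?P<gene>{alternatives(GENES)}) {ROLE_PLAY_VERB_GROUP} "
    rf"(?P<disease>{alternatives(DISEASES)})"
)


def extract_association(sentence) -> Optional[Association]:
    """Return the first gene-disease association found in `sentence`, if any."""
    m = GENE_DISEASE_ASSOCIATION.search(sentence)
    return Association(m["gene"], m["disease"]) if m else None


print(extract_association("BRCA1 plays an important role in breast cancer."))
print(extract_association("p53 may play a crucial role in lung cancer."))
```

Swapping the DISEASE lexicon for a BIOLOGICAL_PROCESS lexicon gives the second rule on the slide, which hints at the scalability question: every new relation type needs its own rule and its own lexicons.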

Semantic grammars

• Scalability
  – How much has to be reinvented every time?
  – NP → AdjectivePhrase NP
  – NP → NP conjunction NP
  – NP → ((NP, )+ (and)?) NP
  – NP → NP PP
  – NP → NP RelativeClause

• Extensibility
  – If it works on transport texts, will it work on signal transduction?
  – If it works on GeneRIFs, will it work on abstracts?
  – If it works on abstracts, will it work on full text?

Do you want to repeat this for cargo and driver and cellular_compartment and cell_cycle_phase and species and cell line and posttranslational_modification and splice_variant and…

An interesting task: “tagging” roles

• ProteinX transports ProteinY
  – ProteinX = Subject, ProteinY = DirectObject

• ProteinY is transported by ProteinX
  – ProteinY = Subject, ProteinX = Oblique

These are syntactic roles… How useful are they?

An interesting task: “tagging” roles

• ProteinX transports ProteinY
  – ProteinX = Driver, ProteinY = Cargo

• ProteinY is transported by ProteinX
  – ProteinY = Cargo, ProteinX = Driver

These are semantic roles… How useful are they?
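A minimal sketch contrasting the two kinds of role labels on the example sentences. It hard-codes the verb "transport" and two sentence patterns, which is an assumption made to keep the example small; the `label_roles` function is hypothetical.

```python
import re

# Two hand-written patterns for the single verb used on the slides (assumption).
ACTIVE = re.compile(r"(?P<subject>\w+) transports (?P<object>\w+)")
PASSIVE = re.compile(r"(?P<subject>\w+) is transported by (?P<oblique>\w+)")


def label_roles(sentence):
    """Return syntactic and semantic role labels, or None for unknown patterns."""
    m = PASSIVE.fullmatch(sentence)
    if m:
        return {
            "syntactic": {"Subject": m["subject"], "Oblique": m["oblique"]},
            # Passive voice: the subject is what gets moved.
            "semantic": {"Cargo": m["subject"], "Driver": m["oblique"]},
        }
    m = ACTIVE.fullmatch(sentence)
    if m:
        return {
            "syntactic": {"Subject": m["subject"], "DirectObject": m["object"]},
            # Active voice: the subject does the moving.
            "semantic": {"Driver": m["subject"], "Cargo": m["object"]},
        }
    return None


active = label_roles("ProteinX transports ProteinY")
passive = label_roles("ProteinY is transported by ProteinX")
print(active["syntactic"], passive["syntactic"])
print(active["semantic"] == passive["semantic"])
```

The syntactic labels differ between the two sentences while the semantic labels agree, which is why semantic roles are the more useful target for information extraction.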

How do you do semantic role assignment?

• Learning-based
  – E.g. Gildea and Jurafsky
  – Features include:
    – Head word (semantic)
    – Position in syntactic structure (syntactic)

Things that neither FSAs nor CFGs can handle

• “Template merging”
  – ProteinX is transported from the cell membrane
    – Transport(ProteinX, cell membrane, _)
  – ProteinX is translocated to the nucleus
    – Transport(ProteinX, _, nucleus)
  – Merged: Transport(ProteinX, cell membrane, nucleus)

• Coreference/anaphora
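A minimal sketch of the template-merging step described above: two partially filled Transport templates about the same protein are combined slot by slot. The field names `cargo`, `source`, and `destination`, and the policy of refusing to merge when filled slots disagree, are assumptions; the slides give only the positional form.

```python
from typing import NamedTuple, Optional


class Transport(NamedTuple):
    cargo: str
    source: Optional[str] = None        # None plays the role of "_" on the slide
    destination: Optional[str] = None


def merge(a: Transport, b: Transport) -> Optional[Transport]:
    """Merge two templates about the same cargo; return None if they conflict."""
    if a.cargo != b.cargo:
        return None
    merged = []
    for x, y in zip(a[1:], b[1:]):      # compare the source and destination slots
        if x is not None and y is not None and x != y:
            return None                 # both filled, but with different values
        merged.append(x if x is not None else y)
    return Transport(a.cargo, *merged)


first = Transport("ProteinX", source="cell membrane")
second = Transport("ProteinX", destination="nucleus")
print(merge(first, second))
```

The merge itself is simple; the hard part, and the reason this is beyond FSAs and CFGs, is deciding that two templates from different sentences describe the same event, which depends on discourse-level information such as coreference.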

Coreference/anaphora

• Coreference: two linguistic expressions refer to the same entity.

• Anaphora: one of the two expressions (e.g., a pronoun) has no independent meaning and is interpreted through the other.

This week’s programming assignment: learn to use the Brill tagger

• Download from http://research.microsoft.com/%7Ebrill/

• Molecular biology data in Brill format: on compbio, /snurp/ie/GENETAG

• Convert last week’s data files to Brill format

• Tag

• Retrain tagger with molbio data

• Re-tag

• Turn in: a user manual covering how to install, use, and train the Brill tagger
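A minimal sketch of the conversion step, under stated assumptions. The target is taken to be the word/TAG notation shown earlier in these slides (green/JJ fluorescent/JJ protein/NN), one sentence per line. The input layout (one token per line as word, tab, tag, with a blank line between sentences) is hypothetical, since the format of last week's files is not given here; check the tagger's own documentation before relying on either side.

```python
def to_brill_lines(lines):
    """Convert token-per-line input into one word/TAG sentence per line.

    Assumed input: "word<TAB>tag" on each line, blank line between sentences.
    """
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():               # blank line: sentence boundary
            if current:
                sentences.append(" ".join(current))
                current = []
            continue
        word, tag = line.split("\t")
        current.append(f"{word}/{tag}")
    if current:                            # flush the last sentence
        sentences.append(" ".join(current))
    return sentences


sample = ["green\tJJ\n", "fluorescent\tJJ\n", "protein\tNN\n", "\n", "p53\tNN\n"]
for sentence in to_brill_lines(sample):
    print(sentence)
```

For untagged input to the tagger, the same loop can emit the words alone by dropping the `/{tag}` part.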

Hong Yu job talk

Noon tomorrow!