BIOI 7791 Projects in bioinformatics Spring 2005 March 22 © Kevin B. Cohen

Page 1

BIOI 7791 Projects in bioinformatics

Spring 2005

March 22

© Kevin B. Cohen

Page 2

PGES upregulates PGE2 production in human thyrocytes

(GeneRIF: 12145315)

Page 3

• Syntax: what are the relationships between words/phrases?

• Parsing: figuring out the structure
  – Full parse
  – Shallow parse

• Shallow parse
• Partial parse
• Syntactic chunking

Page 4

Full parse

PGES upregulates PGE2 production in human thyrocytes

Page 5

Shallow parse

[NounGroup PGES] [VerbGroup upregulates] [NounGroup PGE2 production] [PrepositionalGroup in [NounGroup human thyrocytes]]
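A shallow parse like the one above can be approximated by grouping runs of part-of-speech tags. This is a minimal sketch, not the method used in the course: the tags are hand-assigned assumptions, `group_label` and `chunk` are hypothetical names, and the preposition is left outside any group rather than attached into a PrepositionalGroup.

```python
# Hand-assigned tags for the example sentence (an assumption, not tagger output).
TAGGED = [("PGES", "NNP"), ("upregulates", "VBZ"), ("PGE2", "NNP"),
          ("production", "NN"), ("in", "IN"), ("human", "JJ"),
          ("thyrocytes", "NNS")]

def group_label(tag):
    """Map a POS tag to the kind of base phrase it can belong to."""
    if tag.startswith("NN") or tag.startswith("JJ") or tag == "DT":
        return "NounGroup"
    if tag.startswith("VB"):
        return "VerbGroup"
    return None  # e.g. prepositions stay outside any group in this sketch

def chunk(tagged):
    """Group adjacent tokens that share a group label into flat chunks."""
    chunks = []
    for word, tag in tagged:
        label = group_label(tag)
        if label is not None and chunks and chunks[-1][0] == label:
            chunks[-1][1].append(word)   # extend the current chunk
        else:
            chunks.append((label, [word]))  # start a new chunk
    return chunks
```

Calling `chunk(TAGGED)` yields flat groups only; nothing is built above the base phrase.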

Page 6

Shallow vs. full parsing

• Different depths
  – Full parse goes down to level of individual words
  – Shallow parse doesn’t go down any further than the base phrase

• Different “heights”
  – Full parse goes “up” to root node
  – Shallow parse doesn’t (generally) go further up than base phrase

Page 7

Shallow vs. full parsing

• Different number of levels of structure
  – Full parse has many levels
  – Shallow parse has far fewer
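To make "levels of structure" concrete, the sketch below counts levels in two hand-built structures for the example sentence. Both bracketings are illustrative assumptions (the tree drawn on the slide is not reproduced here, and the attachment of the prepositional phrase is a guess); `depth` is a hypothetical helper.

```python
# Trees as nested tuples: (label, child, child, ...); leaves are plain strings.
# Both bracketings are illustrative assumptions, not the trees from the slides.
FULL = ("S",
        ("NP", ("NNP", "PGES")),
        ("VP", ("VBZ", "upregulates"),
               ("NP", ("NP", ("NNP", "PGE2"), ("NN", "production")),
                      ("PP", ("IN", "in"),
                             ("NP", ("JJ", "human"), ("NNS", "thyrocytes"))))))

SHALLOW = [("NounGroup", "PGES"),
           ("VerbGroup", "upregulates"),
           ("NounGroup", "PGE2", "production"),
           ("PrepositionalGroup", "in", ("NounGroup", "human", "thyrocytes"))]

def depth(node):
    """Number of labelled levels above the deepest word under this node."""
    if isinstance(node, str):
        return 0
    return 1 + max(depth(child) for child in node[1:])
```

Under these assumptions the full parse has six levels, while no shallow chunk has more than two.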

Page 8

Shallow vs. full parsing

• Either way, you need POS information…

Page 9

POS tagging: why you need it

• All syntax is built on it

• Overcome sparseness problem by abstracting away from specific words

• Help you decide how to stem

• Potential basis for entity identification

Page 10

What “POS tagging” is

• POS: part of speech

• School: 8 (noun, verb, adjective, interjection…)

• Real life: 40 or more

Page 11

How do you get from 8 to 80?

• Noun
  – NN (noun, singular or mass)
  – NNS (plural noun)
  – NNP (proper noun)
  – NNPS (plural proper noun)

Page 12

How do you get from 8 to 80?

• Verb
  – VB (base form)
  – VBD (past tense)
  – VBG (gerund or present participle)
  – VBN (past participle)
  – VBP (singular present-tense non-3rd-person)
  – VBZ (3rd-person singular present tense)

Page 13

Others that are good to recognize

• Adjective
  – JJ (adjective)
  – JJR (comparative adjective)
  – JJS (superlative adjective)

Page 14

Others that are good to recognize

• Coordinating conjunctions: CC
• Determiners: DT
• Prepositions: IN
• To: TO
• Punctuation:
  – , (comma)
  – . (sentence-final)
  – : (sentence-medial)
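One way to see the relationship between the school categories and the fine-grained tags is a lookup table that collapses each tag listed on these slides back to its coarse category. A small sketch; the category names and the `coarsen` helper are assumptions made for illustration, and only the tags named above are covered.

```python
# Fine-grained tags from the slides, mapped back to a coarse category.
COARSE = {
    "NN": "noun", "NNS": "noun", "NNP": "noun", "NNPS": "noun",
    "VB": "verb", "VBD": "verb", "VBG": "verb", "VBN": "verb",
    "VBP": "verb", "VBZ": "verb",
    "JJ": "adjective", "JJR": "adjective", "JJS": "adjective",
    "CC": "conjunction", "DT": "determiner", "IN": "preposition", "TO": "to",
}

def coarsen(tag):
    """Collapse a fine-grained tag; anything unlisted is returned unchanged."""
    return COARSE.get(tag, tag)
```

Going the other way (from coarse to fine) is what requires a tagger: "noun" alone does not say whether a token is NN or NNS.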

Page 15

POS tagging

• Definition: assigning POS “tags” to a string of tokens

• Input:
  – string of tokens
  – tag set

• Output:
  – Best tag for each token

Page 16

How do you define noun, verb, etc.?

• Semantic:
  – “A noun is a person, place, or thing…”
  – “A verb is…”

• Distributional characteristics:
  – “A noun can take the plural and genitive morphemes”
  – “A noun can appear in the environment All of my twelve hairy ___ left before noon”

Page 17

Why’s it hard?

Time flies/VBZ like/IN an arrow, but fruit flies/NNS like/VBP a banana.

Page 18

POS tagging: rule-based

1. Assign each word its list of potential parts of speech

2. Use rules to remove potential tags from the list

The EngCG system:

• 56,000-item dictionary

• 3,744 rules

Note that all taggers need a way to deal with unknown words (OOV or “out-of-vocabulary”).
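The two steps above can be sketched in a few lines. This is a toy illustration only: the three-word lexicon and the single rule are made up, have nothing to do with EngCG's actual dictionary or rule set, and unknown words are simply assumed to be nouns.

```python
# Toy lexicon: every tag each word could take (illustrative, not EngCG's).
LEXICON = {
    "the": {"DT"},
    "running": {"VBG", "NN"},
    "of": {"IN"},
}

def candidates(tokens, default=frozenset({"NN"})):
    """Step 1: look up each token's possible tags; unknown words get `default`."""
    return [set(LEXICON.get(t.lower(), default)) for t in tokens]

def no_verb_after_determiner(cands):
    """Step 2 (one hand-written rule): after an unambiguous determiner,
    discard verb readings, provided some other reading remains."""
    for i in range(1, len(cands)):
        if cands[i - 1] == {"DT"}:
            remaining = {t for t in cands[i] if not t.startswith("VB")}
            if remaining:
                cands[i] = remaining
    return cands
```

A real system applies many such rules; a token is fully disambiguated once only one candidate tag is left.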

Page 19

As always, (about) two approaches….

• Rule-based

• Learning-based

Page 20

An aside: tagger input formats

apoptosis in a human tumor cell line .

apoptosis/NN in/IN a/DT human/JJ tumor/NN cell/NN line/NN ./.

apoptosis  NN
in         IN
a          DT
human      JJ
tumor      NN
cell       NN
line       NN
.          .
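Converting between the slash format and the one-token-per-line format is mostly string splitting. A minimal sketch with hypothetical helper names; the tab separator in the vertical format is an assumption, since the slide does not say which separator a given tagger expects.

```python
def parse_slash(line):
    """Split 'word/TAG word/TAG ...' into (word, tag) pairs.

    rpartition splits on the last slash, so a token that itself contains
    a slash keeps it in the word part.
    """
    pairs = []
    for item in line.split():
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))
    return pairs

def to_vertical(pairs):
    """Render pairs one per line, token and tag separated by a tab."""
    return "\n".join(f"{word}\t{tag}" for word, tag in pairs)
```

Splitting on the last slash matters for biomedical text, where tokens with internal slashes are common.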

Page 21

Just how ambiguous is natural language?

• Most English words are not ambiguous…

• …but, many of the most common ones are.

• Brown corpus: only 11.5% of word types ambiguous…

• …but > 40% of tokens ambiguous.

Dictionary doesn’t give you a good estimate of the problem space…

…but corpus data does.

Empirical question: how ambiguous is biomedical text?
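The type/token distinction is easy to measure on any tagged corpus, which is one way to approach the empirical question. A sketch assuming the corpus is already a list of (word, tag) pairs; the `ambiguity` function and the toy corpus are made up for illustration and say nothing about real biomedical text.

```python
from collections import Counter, defaultdict

def ambiguity(tagged_tokens):
    """Return (fraction of word types seen with more than one tag,
               fraction of tokens whose type is seen with more than one tag)."""
    tags_seen = defaultdict(set)
    freq = Counter()
    for word, tag in tagged_tokens:
        tags_seen[word].add(tag)
        freq[word] += 1
    ambiguous = {w for w, tags in tags_seen.items() if len(tags) > 1}
    type_rate = len(ambiguous) / len(tags_seen)
    token_rate = sum(freq[w] for w in ambiguous) / sum(freq.values())
    return type_rate, token_rate

# Toy corpus (made up for illustration): one frequent ambiguous word.
TOY = [("that", "DT"), ("cell", "NN"), ("that", "IN"), ("that", "DT"),
       ("gene", "NN"), ("that", "WDT"), ("protein", "NN"), ("assay", "NN")]
```

In the toy corpus one type in five is ambiguous, but it accounts for half of the tokens, which is the same pattern the Brown corpus figures show.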

Page 22

A statistical approach: TnT

• Second-order Markov model

• Smoothing by linear interpolation of ngrams
• λ estimated by deleted interpolation
• Tag probabilities learned for word endings; used for unknown words
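Linear interpolation means the trigram tag probability is a weighted sum of the unigram, bigram, and trigram maximum-likelihood estimates. The sketch below shows only that sum: the function names are hypothetical, the λ weights are passed in by hand rather than estimated by deleted interpolation, and this is not TnT's code.

```python
from collections import Counter

def ngram_counts(tag_sequences):
    """Count tag unigrams, bigrams and trigrams over a list of tag sequences."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for tags in tag_sequences:
        uni.update(tags)
        bi.update(zip(tags, tags[1:]))
        tri.update(zip(tags, tags[1:], tags[2:]))
    return uni, bi, tri

def interpolated(t1, t2, t3, counts, lambdas):
    """P(t3 | t1, t2) as a weighted sum of unigram, bigram and trigram
    estimates; an estimate with a zero denominator counts as 0."""
    uni, bi, tri = counts
    l1, l2, l3 = lambdas  # assumed to be non-negative and to sum to 1
    total = sum(uni.values())
    p_uni = uni[t3] / total if total else 0.0
    p_bi = bi[(t2, t3)] / uni[t2] if uni[t2] else 0.0
    p_tri = tri[(t1, t2, t3)] / bi[(t1, t2)] if bi[(t1, t2)] else 0.0
    return l1 * p_uni + l2 * p_bi + l3 * p_tri
```

Because the unigram term is never zero for a tag seen in training, an unseen trigram still gets some probability mass.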

Page 23

TnT

• Ngram: an n-tag or n-word sequence
• N = 1
  – DET
  – NOUN
  – role
• Bigrams
  – DET NOUN
  – NOUN PREPOSITION
  – a role
• Trigrams

Page 24

The Brill Tagger

Page 25

The Brill tagger

• Uses rules

• …but the set of rules is induced.

Page 27

The Brill tagger

• Iterative error reduction
  1. Assign most common tags, then
  2. Evaluate performance, then
  3. Propose rules to fix errors
  4. Evaluate performance, then
  5. If you’ve improved, GOTO 3, else END
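The loop above can be sketched as a greedy search over candidate rules. This is a simplified illustration, not Brill's implementation: it assumes a single rule template ("change tag A to B when the preceding tag is C"), scores rules on the training data itself, and all function names are hypothetical.

```python
from collections import Counter, defaultdict

def most_common_tags(gold):
    """Step 1: each word's most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for word, tag in gold:
        counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def apply_rule(rule, tags):
    """Rule (old, new, prev): change old to new when the preceding tag is prev.
    Contexts are read from the tags as they were before this rule fired."""
    old, new, prev = rule
    return [new if i > 0 and t == old and tags[i - 1] == prev else t
            for i, t in enumerate(tags)]

def errors(tags, gold_tags):
    """Steps 2 and 4: number of tokens tagged incorrectly."""
    return sum(t != g for t, g in zip(tags, gold_tags))

def learn(gold, max_rules=10):
    """Greedy loop: keep adding the rule that removes the most errors."""
    words = [w for w, _ in gold]
    gold_tags = [t for _, t in gold]
    lexicon = most_common_tags(gold)
    tags = [lexicon[w] for w in words]
    rules = []
    while len(rules) < max_rules:
        # Step 3: propose one candidate rule per remaining error.
        proposals = {(tags[i], gold_tags[i], tags[i - 1])
                     for i in range(1, len(tags)) if tags[i] != gold_tags[i]}
        scored = [(errors(apply_rule(r, tags), gold_tags), r)
                  for r in sorted(proposals)]
        if not scored or min(scored)[0] >= errors(tags, gold_tags):
            break  # Step 5: no improvement, so stop.
        best = min(scored)[1]
        tags = apply_rule(best, tags)
        rules.append(best)
    return rules
```

The output is an ordered list of rules; at tagging time they are applied in the order they were learned.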

Page 28

The Brill tagger

• Change Determiner Verb “of”

• …to…

• Determiner Noun “of”

The/Determiner running/Verb of/IN

The/Determiner running/Noun of/IN
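Applied to a tagged sentence, the rule above looks like this. A small sketch using the slide's own tag names; the function name is hypothetical, and the case-insensitive match on “of” is an assumption.

```python
def determiner_verb_of(tagged):
    """Retag Verb as Noun when preceded by a Determiner and followed by 'of'."""
    out = list(tagged)
    for i in range(1, len(tagged) - 1):
        word, tag = tagged[i]
        if (tag == "Verb" and tagged[i - 1][1] == "Determiner"
                and tagged[i + 1][0].lower() == "of"):
            out[i] = (word, "Noun")
    return out
```

The rule inspects both a neighbouring tag and a neighbouring word, which is why it can only run after an initial tagging pass.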

Page 29

An aside: evaluating POS taggers

• Accuracy
• Confusion matrix
• How hard is the task? Domain/genre-specific…
  – Baseline: give each word its most common tag
  – Ceiling: interannotator agreement, usually high 90’s
  – State of the art:
    • 96-97% total accuracy
    • Lower for non-punctuation

Low 90’s on some corpora!
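Accuracy and the most-common-tag baseline both fit in a few lines. A sketch with hypothetical function names; tagging unknown words as NN is an assumption, and a fair evaluation would score the tagger on text held out from training.

```python
from collections import Counter, defaultdict

def train_baseline(tagged, default="NN"):
    """Return a tagger that gives each known word its most common training
    tag and every unknown word `default` (NN is an assumption)."""
    counts = defaultdict(Counter)
    for word, tag in tagged:
        counts[word][tag] += 1
    best = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    return lambda words: [best.get(w, default) for w in words]

def accuracy(predicted, gold):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(predicted) != len(gold):
        raise ValueError("sequences must be the same length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)
```

Any tagger worth using has to beat this baseline on the same test data.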

Page 30

Confusion matrix

       JJ     NN     VBD
JJ     --     .6     4.6
NN     .5     --
VBD    5.4    .01    --

Columns = tagger output

Rows = right answer
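A confusion matrix is just a count of (right answer, tagger output) pairs. A minimal sketch with hypothetical function names; normalising each cell as a percentage of all errors is one common convention and an assumption here, since the slide does not say how its cells are computed.

```python
from collections import Counter

def confusion(gold, predicted):
    """Count (right answer, tagger output) pairs for the errors only."""
    return Counter((g, p) for g, p in zip(gold, predicted) if g != p)

def as_percent_of_errors(matrix):
    """Express each cell as a percentage of all errors."""
    total = sum(matrix.values())
    return {cell: 100 * n / total for cell, n in matrix.items()}
```

Keys are (row, column) pairs, matching the layout above: the first element is the right answer and the second is what the tagger said.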

Page 31

An aside: unknown words

• Call them all nouns

• Learn most common POS from training data

• Use morphology

• Suffix trees

• Other features, e.g. hyphenation (JJ in Brown; biomed?), capitalization…
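Several of these strategies combine naturally: learn the most common tag for each word ending, and fall back to "noun" when no ending matches. A sketch under stated assumptions: the function names are hypothetical, endings are limited to three letters, and a flat dictionary of endings stands in for a real suffix tree.

```python
from collections import Counter, defaultdict

def train_suffix_guesser(tagged, max_len=3, default="NN"):
    """Learn the most common tag for each word ending of up to max_len letters."""
    by_suffix = defaultdict(Counter)
    for word, tag in tagged:
        for n in range(1, min(max_len, len(word)) + 1):
            by_suffix[word[-n:].lower()][tag] += 1

    def guess(word):
        # Try the longest known ending first, then back off to shorter ones.
        for n in range(min(max_len, len(word)), 0, -1):
            seen = by_suffix.get(word[-n:].lower())
            if seen:
                return seen.most_common(1)[0][0]
        return default  # "call them all nouns"
    return guess
```

Features such as hyphenation or capitalization could be added as further back-off steps in `guess`.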

Page 32

POS tagging: extension(s)

• Entity identification

• What else??

Page 33

• First step in any POS tagging effort:
  – Tokenization
  – …maybe sentence segmentation
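A deliberately naive tokenizer shows where the difficulty starts. The pattern below is an assumption chosen for illustration: it keeps hyphenated words together and splits off every other punctuation character, which already mishandles decimals, slashes, and many gene and chemical names.

```python
import re

# Words may contain internal hyphens (e.g. "IL-2"); any other non-space,
# non-word character becomes its own token. A deliberately naive pattern.
TOKEN = re.compile(r"\w+(?:-\w+)*|[^\w\s]")

def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN.findall(text)
```

Sentence segmentation is left out; a period split off here could end a sentence or belong to an abbreviation.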

Page 34

First programming assignment: tokenization

• What was hard?

• What if I told you that dictionaries don’t work for recognizing gene names, chemicals, or other “entities”?