CS460/IT632 Natural Language Processing/Language Technology for the Web Lecture 2 (06/01/06) Prof. Pushpak Bhattacharyya IIT Bombay Part of Speech (PoS) Tagging


CS460/IT632 Natural Language Processing/Language Technology for the Web

Lecture 2 (06/01/06)

Prof. Pushpak Bhattacharyya

IIT Bombay

Part of Speech (PoS) Tagging


Tagging or Annotation

● The purpose is disambiguation.
● A word can have a number of labels.
● The problem is to give each word a unique label.
● PoS tagging makes use of the “local context”, whereas sense tagging needs “long-distance dependencies” and is hence more difficult.
● PoS tagging is needed mainly in parsing, and also in other applications.


Approaches

● Rule-based approach
● Statistical approach
– We will mainly focus on the statistical approach.


Types of Tagging Tasks

● PoS
● Named entity
● Sense
● Parse tree


PoS Tagging

● Example:
– “The Orange ducks clean the bills.”
● Assign tags to each word from the lexicon; multiple possibilities exist.


Lexicon dictionary

● The:
– DT (Determiner)
● Orange:
– NN (Noun)
– JJ (Adjective)
● Duck:
– NN
– VB (Base verb)
● Clean:
– NN
– VB
● Bill:
– NN
– VB

JJ, VB and NN are called syntactic entities or PoS tags.


PoS tagging as a sequence labelling task

● The task is to assign the correct PoS tag sequence to the words.
● It can be:
– Unigram: consider one word at a time while deciding the sequence.
– Multigram: consider multiple words.
● 16 (=1*2*2*2*1*2) possible sequences for the “Duck” example.
● It is a classification problem: classify each word into the right tag category.
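The count of 16 can be checked by enumerating every combination of candidate tags. The following is a minimal sketch, assuming the lexicon from the earlier slide and a hypothetical `root` helper that stands in for real morphological analysis (it only lower-cases the word and strips a plural "s").

```python
from itertools import product

# Lexicon from the slide: each root word maps to its candidate PoS tags.
lexicon = {
    "the": ["DT"],
    "orange": ["NN", "JJ"],
    "duck": ["NN", "VB"],
    "clean": ["NN", "VB"],
    "bill": ["NN", "VB"],
}


def root(word):
    """Hypothetical stand-in for morphological analysis (not from the lecture)."""
    w = word.lower()
    if w not in lexicon and w.endswith("s"):
        w = w[:-1]  # crude plural stripping: "ducks" -> "duck"
    return w


sentence = ["The", "Orange", "ducks", "clean", "the", "bills"]

# One list of candidate tags per word, then the Cartesian product of those lists.
candidates = [lexicon[root(w)] for w in sentence]
sequences = list(product(*candidates))

print(len(sequences))  # number of possible tag sequences
```

A tagger has to pick exactly one of these sequences; the number of candidates grows multiplicatively with sentence length, which is why exhaustive enumeration is only practical for toy examples.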


Challenges

● Lexical ambiguity: multiple choices
● Morphology analysis: find the root word
● Tokenization: find word boundaries
– In the Thai language there is no blank space between words.
– Non-trivial (example: capturing boundaries when the word is continued to the next line with a “-”)
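The line-break case can be illustrated with a naive tokenizer. This is only a sketch: the function names and the regular expressions are assumptions made for illustration, not a method given in the lecture.

```python
import re


def join_line_break_hyphens(text):
    """Naively rejoin a word split as 'tokeni-' + newline + 'zation'."""
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


def tokenize(text):
    """Split into words (keeping in-line hyphenated words whole) and punctuation."""
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", join_line_break_hyphens(text))


print(tokenize("Tagging needs tokeni-\nzation first."))
print(tokenize("a long-\ndistance link"))
```

The second call shows why the problem is non-trivial: the naive rule also merges a genuinely hyphenated word such as "long-distance" into "longdistance", so a real tokenizer needs a lexicon or other evidence to decide whether the hyphen belongs to the word.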


Named Entity tagging

● Example 1:
– “Mohan went to school in Kolkata”
● Tagged as:
– “Mohan_Person went to school_Place in Kolkata_Place”.
● Example 2:
– “Kolkata bore the brunt of 1947 riots when 1947 children died at Kolkata.”
– “Kolkata_? bore the brunt of 1947_year riots when 1947_num children died at Kolkata_Place.”


Sense tagging

● Detecting the meaning.
● Our example tagged as:
– The Orange_{colour} ducks_{bird} clean the bills_{body_part}

● Sense tagging has been done by means of hypernymy.

● Semantic relations like hypernymy are stored in the lexical resource called “WordNet”.


Parse Tree tagging

● Example parse tree: [figure not reproduced in this transcript]


Parse Tree tagging (contd.)

● Given a grammar, one can construct the parse tree.
● Annotation will produce the following structure:
– [ [The_DT [Orange_JJ ducks_NN]NP]NP [clean_VB [the_DT [bills_NN]NP]NP]VP]S
● This structure is called the Penn Treebank form.
● From the Treebank form, one can arrive at a grammar through learning.
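One simple way to see how a grammar can be read off bracketed trees is to count the productions that occur in them. The sketch below is an illustration under assumptions: it rewrites the slide's example in the common parenthesized notation `(LABEL child ...)` rather than the label-after-bracket notation shown above, and the helper names `parse` and `productions` are hypothetical.

```python
from collections import Counter


def parse(tokens):
    """Consume tokens for one bracketed constituent; return (label, children)."""
    assert tokens.pop(0) == "("
    label = tokens.pop(0)
    children = []
    while tokens[0] != ")":
        if tokens[0] == "(":
            children.append(parse(tokens))   # nested constituent
        else:
            children.append(tokens.pop(0))   # a word (leaf)
    tokens.pop(0)  # consume ")"
    return (label, children)


def productions(tree, counts):
    """Count every rule LHS -> RHS that appears in the tree."""
    label, children = tree
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    counts[(label, rhs)] += 1
    for child in children:
        if isinstance(child, tuple):
            productions(child, counts)


# The slide's example, rewritten in parenthesized form (an assumed equivalent).
text = ("(S (NP (DT The) (NP (JJ Orange) (NN ducks))) "
        "(VP (VB clean) (NP (DT the) (NP (NN bills)))))")
tokens = text.replace("(", " ( ").replace(")", " ) ").split()
tree = parse(tokens)

counts = Counter()
productions(tree, counts)
for (lhs, rhs), n in sorted(counts.items()):
    print(lhs, "->", " ".join(rhs), n)
```

Dividing each rule count by the total count of rules with the same left-hand side would give relative-frequency rule probabilities; with a single tree the numbers are of course not meaningful.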


Statistical Formulation of the PoS tagging problem

● Input:
– $W_1, W_2, \ldots, W_n$: words
– $C_1, C_2, \ldots, C_m$: lexical tag repository (DT, JJ, NN, etc.)
● Output:
– The “best” PoS tag sequence $C_{i_1}, C_{i_2}, C_{i_3}, \ldots, C_{i_n}$ for the given words.
● Best means:
– $P(C_{i_1}, C_{i_2}, C_{i_3}, \ldots, C_{i_n} \mid W_1, W_2, \ldots, W_n)$ is the maximum over all possible C-sequences.


Statistical Formulation of the PoS tagging problem (contd.)

● Example:
– P(DT JJ NN | The Orange duck) > P(DT NN VB | The Orange duck) is required.
● Why?
– Because given the phrase “The orange duck”, there is overwhelming evidence in the corpus that “DT JJ NN” is the right tag sequence.


Mathematical machinery


Bayes Theorem

● $P(A \mid B) = \dfrac{P(A)\,P(B \mid A)}{P(B)}$, where
– $P(A)$: prior probability
– $P(A \mid B)$: posterior probability
– $P(B \mid A)$: likelihood
● Why apply Bayes' theorem?
– This is the generative vs. discriminative model question.
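The theorem can be checked numerically on a small table of counts. The counts below are hypothetical, invented only for illustration; `A` is taken to be the event "the tag is NN" and `B` the event "the word is duck".

```python
from fractions import Fraction

# Hypothetical (tag, word) co-occurrence counts; not taken from any real corpus.
counts = {
    ("NN", "duck"): 8,
    ("VB", "duck"): 2,
    ("NN", "bill"): 6,
    ("VB", "bill"): 4,
}
total = sum(counts.values())

# Prior P(A): probability that the tag is NN.
p_nn = Fraction(sum(n for (t, _), n in counts.items() if t == "NN"), total)
# Evidence P(B): probability that the word is "duck".
p_duck = Fraction(sum(n for (_, w), n in counts.items() if w == "duck"), total)
# Likelihood P(B|A): probability of "duck" given that the tag is NN.
p_duck_given_nn = Fraction(counts[("NN", "duck")], p_nn * total)

# Posterior via Bayes' theorem: P(A|B) = P(A) * P(B|A) / P(B).
posterior = p_nn * p_duck_given_nn / p_duck

# Posterior computed directly from the counts, for comparison.
direct = Fraction(counts[("NN", "duck")], p_duck * total)

print(posterior, direct)
```

Both routes give the same value, which is the point of the theorem: the posterior can be obtained from the prior and the likelihood when it is inconvenient to estimate directly.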


Apply Bayes theorem

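$P(C_{i_1}, C_{i_2}, C_{i_3}, \ldots, C_{i_n} \mid W_1, W_2, \ldots, W_n) = P(C \mid W) = \dfrac{P(C)\,P(W \mid C)}{P(W)}$

where,

$C = \langle C_{i_1}, C_{i_2}, C_{i_3}, \ldots, C_{i_n} \rangle$

$W = \langle W_1, W_2, \ldots, W_n \rangle$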


Best tag sequence

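$C^* = \langle C_{i_1}, C_{i_2}, C_{i_3}, \ldots, C_{i_n} \rangle^*$, where * signifies the best C-sequence

$= \arg\max_{C} P(C \mid W)$

● The denominator $P(W)$ is common to all the tag sequences, so it does not affect the maximization. Therefore,

$C^* = \arg\max_{C} P(C)\,P(W \mid C)$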
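The effect of dropping the denominator can be seen in a small sketch. The scores below are hypothetical values of P(C).P(W|C) for three candidate tag sequences of “The Orange duck”, and the sketch assumes these three are the only candidates, so that P(W) is simply their sum.

```python
# Hypothetical unnormalised scores P(C) * P(W|C); the numbers are invented.
scores = {
    ("DT", "JJ", "NN"): 0.012,
    ("DT", "NN", "VB"): 0.003,
    ("DT", "NN", "NN"): 0.005,
}

# P(W) = sum over all candidate C of P(C) * P(W|C) (assumed candidate set).
p_w = sum(scores.values())

# Posterior P(C|W) for each candidate, obtained by dividing by P(W).
posterior = {c: s / p_w for c, s in scores.items()}

best_unnormalised = max(scores, key=scores.get)
best_posterior = max(posterior, key=posterior.get)

print(best_unnormalised, best_posterior)
```

Dividing every score by the same positive constant preserves their order, so the two maximizations select the same sequence and P(W) never has to be computed.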


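Processing the 1st part

$P(C) = P(C_{i_1}, C_{i_2}, C_{i_3}, \ldots, C_{i_n})$

$= P(C_{i_1}) \cdot P(C_{i_2} \mid C_{i_1}) \cdot P(C_{i_3} \mid C_{i_1}, C_{i_2}) \cdots P(C_{i_n} \mid C_{i_1}, C_{i_2}, \ldots, C_{i_{n-1}})$

(on applying the chain rule of probability)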

Ex: P(DT JJ NN) = P(DT).P(JJ|DT).P(NN|DT JJ)


Markov assumption

● A tag depends only on a window of neighbouring tags, not on the entire history that the chain rule of probability demands.
● The Kth-order Markov assumption, as used here, considers only a window of K tags: the current tag and the K-1 tags before it.
● Typical values: K = 3 for English, and (it seems) 5 for Hindi.


Apply assumption

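With K=2 (a window of two tags: the current tag and the one before it), our problem will be:

$P(C) = \prod_{i=1}^{n} P(C_i \mid C_{i-1})$

where $C_i$ denotes the tag at position $i$, $i = 1, \ldots, n$, and $C_0$ is the sentence-beginning marker.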
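Under this assumption, P(C) needs only tag-bigram probabilities, which can be estimated by relative frequency from tagged text. The sketch below uses a tiny hypothetical set of tag sequences; the `START` symbol "^" plays the role of C0, and the function names are illustrative assumptions.

```python
from collections import Counter
from fractions import Fraction

START = "^"  # stands for C0, the sentence-beginning marker


def train(tag_sequences):
    """Count tag bigrams (prev, tag) and how often each prev has a successor."""
    bigrams, contexts = Counter(), Counter()
    for tags in tag_sequences:
        prev = START
        for tag in tags:
            bigrams[(prev, tag)] += 1
            contexts[prev] += 1
            prev = tag
    return bigrams, contexts


def sequence_probability(tags, bigrams, contexts):
    """P(C) as the product of P(C_i | C_{i-1}), using relative frequencies."""
    prob = Fraction(1)
    prev = START
    for tag in tags:
        if contexts[prev] == 0:
            return Fraction(0)  # previous tag never seen with a successor
        prob *= Fraction(bigrams[(prev, tag)], contexts[prev])
        prev = tag
    return prob


# Hypothetical tag sequences standing in for a tagged corpus.
corpus = [
    ["DT", "JJ", "NN", "VB"],
    ["DT", "NN", "VB", "DT", "NN"],
    ["DT", "JJ", "NN"],
]
bigrams, contexts = train(corpus)

print(sequence_probability(["DT", "JJ", "NN"], bigrams, contexts))
print(sequence_probability(["JJ", "DT", "NN"], bigrams, contexts))
```

Unseen bigrams get probability zero here; a practical tagger would smooth these estimates, which this sketch deliberately leaves out.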


Exercise given in the lecture

● Contrast PoS tagging with sense tagging.
● Find an example to show the difference.