Part-Of-Speech Tagging and Chunking using CRF & TBL

Avinesh.PVS, Karthik.G
LTRC, IIIT Hyderabad
{avinesh,karthikg}students.iiit.ac.in




Page 2: Outline

1. Introduction
2. Background
3. Architecture of the System
4. Experiments
5. Conclusion

Page 3: Introduction

POS-Tagging:

It is the process of assigning a part-of-speech tag to each word of natural-language (NL) text, based on both the word's definition and its context.

Uses: parsing of sentences, machine translation (MT), information retrieval (IR), word sense disambiguation, speech synthesis, etc.

Methods:
1. Statistical approach
2. Rule-based approach

Page 4: Introduction (cont.)

Chunking or Shallow Parsing:

It is the task of identifying and segmenting the text into syntactically correlated word groups.

Ex:

[NP He ] [VP reckons ] [NP the current account deficit ] [VP will narrow ] [PP to ] [NP only # 1.8 billion ] [PP in ] [NP September ] .
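
The bracketed example above can also be written one token per line with B-/I- labels, the form used for the chunker output later in these slides. A minimal sketch of that conversion follows; the function name `brackets_to_bio` and the "O" label for tokens outside any chunk are assumptions made for illustration, not part of the original system.

```python
import re

def brackets_to_bio(text):
    """Convert '[NP He ] [VP reckons ] .' into (token, label) pairs."""
    pairs = []
    # Each match is either a bracketed chunk or a single token outside any chunk.
    for m in re.finditer(r"\[(\S+)\s+([^\]]*?)\s*\]|(\S+)", text):
        label, body, outside = m.groups()
        if label is None:
            pairs.append((outside, "O"))  # assumed label for unchunked tokens
            continue
        for i, tok in enumerate(body.split()):
            # First token of a chunk gets B-, the remaining tokens get I-.
            pairs.append((tok, ("B-" if i == 0 else "I-") + label))
    return pairs

example = "[NP He ] [VP reckons ] [NP the current account deficit ] ."
for token, tag in brackets_to_bio(example):
    print(token, tag)
```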

Page 5: Background

A lot of work has been done for English and other European languages using various machine learning approaches such as HMMs, MEMMs, CRFs, TBL, etc.

Page 6: Drawbacks for Indian Languages

These techniques don’t work well when only a small amount of tagged data is available to estimate the parameters.

Free word order.

Page 7: So what to do?

Add more information:

Morphological information: root, affixes

Length of the word: adverbs and post-positions are 2-3 chars long

Contextual and lexical rules

Page 8: Our Approach

Page 9: POS-Tagger

[Architecture diagram of the POS tagger. Box labels recovered from the slide:]

Training Corpus, Features, CRF’s Training, Model

Training Corpus, TBL (Building Rules), Lexical & Contextual Rules

Test Corpus, CRF’s Testing

Pruning CRF output using TBL Rules

Final Output

Page 10: Chunker

[Architecture diagram of the chunker. Box labels recovered from the slide:]

HMM Based Chunk Boundary Identification

Training Corpus, Features, CRF’s Training, Model

Test Corpus, CRF’s Testing

Final Output

Page 11: Experiments

POS-Tagging:

a) Features for CRF:

1) A basic template combining the surrounding words was used, i.e. window sizes of 2, 4 and 6 were tried with all possible combinations (4 was best for Telugu).

Ex:
Window size of 2 : W-1, cW, W+1
Window size of 4 : W-2, W-1, cW, W+1, W+2
Window size of 6 : W-3, W-2, W-1, cW, W+1, W+2, W+3

cW : current word
W-1 : previous word, W-2 : 2nd previous word, W-3 : 3rd previous word
W+1 : next word, W+2 : 2nd next word, W+3 : 3rd next word

Accuracy: 62.89% (5193 test data)
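
As a rough illustration of the window template, the sketch below collects the words around position i. The function name, the feature keys and the padding token are assumptions; a CRF toolkit would normally express this as a feature template rather than as Python code.

```python
def window_features(words, i, size=4, pad="<PAD>"):
    """Context-word features for position i; `size` counts the neighbours of cW."""
    half = size // 2
    feats = {}
    for off in range(-half, half + 1):
        j = i + off
        name = "cW" if off == 0 else f"W{off:+d}"  # W-2, W-1, cW, W+1, W+2
        # Positions outside the sentence get a padding token (an assumption).
        feats[name] = words[j] if 0 <= j < len(words) else pad
    return feats

sentence = ["pUrwi", "cesi", "aMxiMcamani"]
print(window_features(sentence, 1, size=2))
# {'W-1': 'pUrwi', 'cW': 'cesi', 'W+1': 'aMxiMcamani'}
```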

Page 12: Experiments (cont.)

2) n-Suffix information:

This feature consists of the last 1, 2, 3 and 4 chars of a word. (Here "suffix" means a statistical suffix, not a linguistic suffix.)

Reason: Due to the agglutinative nature of Telugu, considering the suffixes increases the accuracy.

Ex:
ivvalsociMdi (had to give) : VRB
ravalsociMdi (had to come) : VRB

Accuracy: 73.45%
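
A minimal sketch of the statistical-suffix feature, assuming plain character slicing of the transliterated word; the function name and feature keys are illustrative, not taken from the original system.

```python
def suffix_features(word, max_len=4):
    """Last 1..max_len characters of the word (statistical, not linguistic, suffixes)."""
    # Suffixes longer than the word itself are skipped.
    return {f"suf{n}": word[-n:] for n in range(1, max_len + 1) if len(word) >= n}

a = suffix_features("ivvalsociMdi")
b = suffix_features("ravalsociMdi")
print(a)                        # {'suf1': 'i', 'suf2': 'di', 'suf3': 'Mdi', 'suf4': 'iMdi'}
print(a["suf4"] == b["suf4"])   # True: both verb forms share the ending iMdi
```

The two example words differ at the start but share all four suffix features, which is the kind of regularity this feature is meant to expose.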

Page 13: Experiments (cont.)

3) n-Prefix information:

This feature consists of the first 1, 2, 3, and so on up to the first 7 chars of a word. (Here "prefix" means a statistical prefix, not a linguistic prefix.)

Reason: Usually the vibakthis (case markers) get added to the end of nouns, so the leading characters of the different inflected forms stay the same.

Ex:
puswakAlalo (in the books) NN
puswakAmnu (the book) NN

Accuracy: 75.35%
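
A minimal sketch of the statistical-prefix feature, again assuming plain character slicing; the function name and feature keys are illustrative.

```python
def prefix_features(word, max_len=7):
    """First 1..max_len characters of the word (statistical, not linguistic, prefixes)."""
    # Prefixes longer than the word itself are skipped.
    return {f"pre{n}": word[:n] for n in range(1, max_len + 1) if len(word) >= n}

a = prefix_features("puswakAlalo")
b = prefix_features("puswakAmnu")
print(a["pre7"], b["pre7"])   # puswakA puswakA
```

Both inflected forms of the example noun share the same 7-character prefix even though their endings differ.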

Page 14: Experiments (cont.)

4) Word Length:

All words with length <= 3 are tagged as Less and the rest are tagged as More.

Reason: This is to account for the large number of function words in Indian languages.

Accuracy: 76.23%
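
The length feature is a single binary value per word. A minimal sketch, with the function name and the `threshold` parameter as illustrative assumptions:

```python
def length_feature(word, threshold=3):
    """'Less' for words of at most `threshold` characters, otherwise 'More'."""
    return "Less" if len(word) <= threshold else "More"

for w in ["lo", "ce", "cesi", "aMxiMcamani"]:
    print(w, length_feature(w))
```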

Page 15: Experiments (cont.)

5) Morph Root & Expected Tags:

The root word and the best three expected lexical categories are extracted using the morphological analyzer and are added as features.

Reason: It is similar to the concept of the prefix and suffix, but here the root is extracted using the morph analyzer. The expected tags can be used to bind (constrain) the output of the tagger.

Accuracy: 76.78%
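
A minimal sketch of how analyzer output might be turned into features. The slides do not name the morphological analyzer or its interface, so the lookup table `TOY_ANALYSES`, its entries, the feature keys and the "NONE" filler are all hypothetical.

```python
# Hypothetical stand-in for a morphological analyzer: word -> (root, ranked categories).
# The entries are placeholders invented for illustration, not real analyzer output.
TOY_ANALYSES = {
    "wordform1": ("root1", ["NN", "JJ", "VRB", "PREP"]),
    "wordform2": ("root2", ["VRB"]),
}

def morph_features(word, analyses=TOY_ANALYSES, max_tags=3):
    """Root plus the best `max_tags` expected categories, padded with 'NONE'."""
    # Unknown words fall back to the surface form with no expected tags.
    root, tags = analyses.get(word, (word, []))
    feats = {"root": root}
    for k in range(max_tags):
        feats[f"exp{k + 1}"] = tags[k] if k < len(tags) else "NONE"
    return feats

print(morph_features("wordform1"))
print(morph_features("unseen"))
```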

Page 16: Experiments (cont.)

b) Pruning:

The next step is pruning the CRF output using the rules generated by TBL, i.e. the contextual and the lexical rules.

Ex:

VJJ to VAUX when bigram is lo unne

JJ to NN when next tag is PREP

Accuracy: 77.37%
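
The two example rules can be read as "change tag X to tag Y when a condition holds". The sketch below applies such rules in order to a tagged sentence. The rule representation is hypothetical (it is not the format of any TBL toolkit), and it assumes that "bigram is lo unne" means the previous word is lo and the current word is unne. The tokens w1 and w2 are placeholders.

```python
def next_tag_is(tag):
    """Condition: the tag immediately to the right equals `tag`."""
    return lambda words, tags, i: i + 1 < len(tags) and tags[i + 1] == tag

def bigram_is(prev_word, cur_word):
    """Condition (assumed reading): previous word and current word form the bigram."""
    return lambda words, tags, i: i > 0 and words[i - 1] == prev_word and words[i] == cur_word

# (old tag, new tag, condition), taken from the two examples on the slide.
RULES = [
    ("VJJ", "VAUX", bigram_is("lo", "unne")),
    ("JJ", "NN", next_tag_is("PREP")),
]

def prune(words, tags, rules=RULES):
    """Apply each rule in order, left to right; changes are visible to later positions and rules."""
    tags = list(tags)  # do not modify the caller's list
    for old, new, condition in rules:
        for i, tag in enumerate(tags):
            if tag == old and condition(words, tags, i):
                tags[i] = new
    return tags

print(prune(["w1", "lo", "unne"], ["NN", "PREP", "VJJ"]))  # ['NN', 'PREP', 'VAUX']
print(prune(["w1", "w2"], ["JJ", "PREP"]))                 # ['NN', 'PREP']
```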

Page 17: Tagging Errors

Issues regarding the nouns / compound nouns / adjectives. Tag pairs shown on the slide:

NN, NNP
NNC, NN
NN, JJ

And also:

VRB, VFM; VFM, VAUX, etc.

Page 18: Experiments (Chunking)

1) Chunk boundary identification

Initially we tried an HMM model for identifying the chunk boundaries.

First level:
pUrwi NVB B
cesi VRB I
aMxiMcamani VRB I
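
The slides do not describe the HMM itself. As a generic illustration only, the sketch below runs Viterbi decoding over two states, B (chunk begins) and I (inside a chunk), with POS tags as observations. Treating the POS tag as the observation, and every probability in the tables, are assumptions invented for this example and are not the authors' model.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p, floor=1e-6):
    """Most likely state sequence for `obs` under a first-order HMM (log space)."""
    def lp(table, key):
        # Unseen events get a small floor probability instead of zero.
        return math.log(table.get(key, floor))

    # best[t][s] = (log prob of the best path ending in s at time t, previous state)
    best = [{s: (lp(start_p, s) + lp(emit_p[s], obs[0]), None) for s in states}]
    for o in obs[1:]:
        prev = best[-1]
        row = {}
        for s in states:
            score, arg = max((prev[r][0] + lp(trans_p[r], s), r) for r in states)
            row[s] = (score + lp(emit_p[s], o), arg)
        best.append(row)

    # Backtrack from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for row in reversed(best[1:]):
        state = row[state][1]
        path.append(state)
    return path[::-1]

# Toy parameters, invented for illustration.
STATES = ["B", "I"]
START = {"B": 0.99, "I": 0.01}
TRANS = {"B": {"B": 0.4, "I": 0.6}, "I": {"B": 0.5, "I": 0.5}}
EMIT = {"B": {"NVB": 0.6, "NN": 0.3, "VRB": 0.1},
        "I": {"VRB": 0.7, "NN": 0.2, "NVB": 0.1}}

print(viterbi(["NVB", "VRB", "VRB"], STATES, START, TRANS, EMIT))  # ['B', 'I', 'I']
```

With these toy numbers the decoder reproduces the B I I labels of the first-level example; real parameters would be estimated from the training corpus.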

Page 19: Experiments (Chunking, cont.)

2) Chunk labeling using CRFs

Features used in the CRF-based approach are:

Word window of 4 : W-2, W-1, cW, W+1, W+2

POS-tag window of 5 : P-3, P-2, P-1, cP, P+1, P+2

The chunk boundary label from the first level is also used as a feature.

Second level:
pUrwi NVB B-VG
cesi VRB I-VG
aMxiMcamani VRB I-VG
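
A minimal sketch of the per-token feature set listed above: a word window of 4, the asymmetric POS-tag window P-3 to P+2, and the first-level boundary label. The function name, feature keys and padding token are assumptions made for illustration.

```python
def chunk_features(words, pos, boundaries, i, pad="<PAD>"):
    """Features for labeling the chunk of token i."""
    def at(seq, j):
        # Positions outside the sentence get a padding token (an assumption).
        return seq[j] if 0 <= j < len(seq) else pad

    feats = {}
    for off in range(-2, 3):   # W-2 .. W+2
        feats["cW" if off == 0 else f"W{off:+d}"] = at(words, i + off)
    for off in range(-3, 3):   # P-3 .. P+2
        feats["cP" if off == 0 else f"P{off:+d}"] = at(pos, i + off)
    feats["boundary"] = boundaries[i]  # B or I from the first level
    return feats

words = ["pUrwi", "cesi", "aMxiMcamani"]
pos = ["NVB", "VRB", "VRB"]
boundaries = ["B", "I", "I"]
print(chunk_features(words, pos, boundaries, 1))
```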

Page 20: Results

Fig.1 Results of the POS-Tagging

Fig.2 Chunking Results

* The same model is used for Telugu, Hindi and Bengali except for variations in the window size, i.e. for Hindi, Bengali and Telugu we used window sizes of 6, 6 and 4 respectively.

* Using the gold-standard tags, the accuracy for the Telugu tagger was 90.65%.

Page 21: Conclusion

The best accuracies were achieved with the use of morphologically rich features such as suffix and prefix information, coupled with various efficient machine learning techniques.

A sandhi splitter could be used to improve the results further.

Eg:

1: pAxaprohAlace (NN) = pAxaprahArAliiu (NN) + ce (PREP)

2: vAllumtAru (V) = vAlylyu (NN) + uM-tAru (V)

Page 22: Thank You!!

Queries???