
Part-Of-Speech Tagging and Chunking using CRF & TBL

Avinesh.PVS, Karthik.G

LTRC, IIIT Hyderabad

{avinesh,karthikg}@students.iiit.ac.in

Outline

1. Introduction
2. Background
3. Architecture of the System
4. Experiments
5. Conclusion

Introduction

POS-Tagging:

It is the process of assigning a part-of-speech tag to each word of natural language (NL) text, based on both its definition and its context.

Uses: Parsing of sentences, MT, IR, word sense disambiguation, speech synthesis, etc.

Methods:
1. Statistical approach
2. Rule-based approach

Cont..

Chunking or Shallow Parsing: It is the task of identifying and segmenting the text into syntactically correlated word groups.

Ex:

[NP He ] [VP reckons ] [NP the current account deficit ] [VP will narrow ] [PP to ] [NP only # 1.8 billion ] [PP in ] [NP September ] .
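The bracketed example can also be written as one tag per word, which is the form sequence labelers work with. The sketch below is only an illustration: the function name `brackets_to_iob` is hypothetical, the B-/I- prefixes follow the labels used later in these slides, and the "O" tag for words outside any chunk is an assumption borrowed from the common IOB convention.

```python
import re

def brackets_to_iob(text):
    """Convert '[NP He ] [VP reckons ] .' into a list of (word, tag) pairs."""
    pairs = []
    # Each match is either a bracketed chunk or a bare token outside any chunk.
    for m in re.finditer(r"\[(\w+)\s+([^\]]*?)\s*\]|(\S+)", text):
        label, body, outside = m.groups()
        if label is not None:
            # First word of a chunk gets B-<label>, the rest get I-<label>.
            for i, word in enumerate(body.split()):
                pairs.append((word, ("B-" if i == 0 else "I-") + label))
        else:
            # Assumption: tokens outside every chunk are tagged "O".
            pairs.append((outside, "O"))
    return pairs

sentence = "[NP He ] [VP reckons ] [NP the current account deficit ] ."
for word, tag in brackets_to_iob(sentence):
    print(word, tag)
```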

Background

A lot of work has been done for English and other European languages using various machine learning approaches such as:
HMMs
MEMMs
CRFs
TBL, etc.

Drawbacks for Indian languages:
These techniques don't work well when only a small amount of tagged data is available to estimate the parameters.
Free word order.

So what to do???

Add more information…

Morphological information: root, affixes

Length of the word: adverbs and post-positions are 2-3 chars long.

Contextual and lexical rules

OUR APPROACH

POS-Tagger
[Flow diagram. Components: Training Corpus, Features, CRF's Training, Model, Test Corpus, CRF's Testing, TBL (Building Rules), Lexical & Contextual Rules, Pruning CRF output using TBL Rules, Final Output]

Chunker
[Flow diagram. Components: HMM Based Chunk Boundary Identification, Training Corpus, Features, CRF's Training, Model, Test Corpus, CRF's Testing, Final Output]

Experiments

Pos-Tagging:

a) Features for CRF:

1) Basic template: combinations of the surrounding words have been used, i.e. window sizes of 2, 4 and 6 were tried with all possible combinations. (4 was best for Telugu)

Ex:
Window size of 2 : W-1, cW, W+1
Window size of 4 : W-2, W-1, cW, W+1, W+2
Window size of 6 : W-3, W-2, W-1, cW, W+1, W+2, W+3

cW : Current word
W-1: Previous word, W-2: 2nd previous word, W-3: 3rd previous word
W+1: Next word, W+2: 2nd next word, W+3: 3rd next word

Accuracy: 62.89% (5193 test data)
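A minimal sketch of how such a word window could be extracted for one position. The function name `window_features` and the "<PAD>" symbol for positions outside the sentence are assumptions made for illustration. Only the individual window words are produced here, not the combinations of them that the template also uses.

```python
def window_features(words, i, half_width):
    """Word-window features around position i.

    half_width=1 gives the 'window size of 2' above (W-1, cW, W+1),
    half_width=2 gives the 'window size of 4', and so on.
    """
    feats = {}
    for offset in range(-half_width, half_width + 1):
        j = i + offset
        # Assumption: positions outside the sentence get a padding symbol.
        value = words[j] if 0 <= j < len(words) else "<PAD>"
        name = "cW" if offset == 0 else f"W{offset:+d}"
        feats[name] = value
    return feats

words = ["pUrwi", "cesi", "aMxiMcamani"]
print(window_features(words, 1, 1))  # window size of 2 around "cesi"
print(window_features(words, 0, 2))  # window size of 4 around "pUrwi"
```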

2) n-Suffix information: This feature consists of the last 1, 2, 3 and 4 chars of a word. (Here suffix means a statistical suffix, not the linguistic suffix.)

Reason: Due to the agglutinative nature of Telugu, considering the suffixes increases the accuracy.

Ex:
ivvalsociMdi (had to give) : VRB
ravalsociMdi (had to come) : VRB

Accuracy: 73.45%
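A short sketch of the statistical suffix feature, using the two example words above. The function name `suffix_features` and the feature names `suf1` to `suf4` are hypothetical. Words shorter than a given suffix length simply get no feature for that length, which is an assumption.

```python
def suffix_features(word, max_len=4):
    """Last 1..max_len characters of the word, one feature per length."""
    return {f"suf{n}": word[-n:] for n in range(1, max_len + 1) if len(word) >= n}

for w in ["ivvalsociMdi", "ravalsociMdi"]:
    print(w, suffix_features(w))
```

Both verbs end up with the same 4-char suffix feature, which is the kind of shared evidence the tagger can use for the VRB tag.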

3) n-Prefix information:

This feature consists of the first 1, 2, 3, and so on up to the first 7 chars of the word. (Here prefix means a statistical prefix, not the linguistic prefix.)

Reason: Usually the vibakthis (case markers) get added to nouns.

Ex:
puswakAlalo (in the books) : NN
puswakAmnu (the book) : NN

Accuracy: 75.35%
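The prefix feature mirrors the suffix one. The sketch below, with the hypothetical name `prefix_features`, checks that the two example nouns share their first 7 characters even though their case markers differ.

```python
def prefix_features(word, max_len=7):
    """First 1..max_len characters of the word, one feature per length."""
    return {f"pre{n}": word[:n] for n in range(1, max_len + 1) if len(word) >= n}

a = prefix_features("puswakAlalo")
b = prefix_features("puswakAmnu")
# Both nouns share the 7-char statistical prefix "puswakA".
print(a["pre7"], b["pre7"], a["pre7"] == b["pre7"])
```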

4) Word Length:

All words with length <= 3 are tagged as Less and the rest are tagged as More.

Reason:

This is to account for the large number of function words in Indian languages.
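As a sketch, the length feature is a single threshold test. The function name `length_feature` is hypothetical, and the length is assumed to be counted in characters of the romanized form used in these slides.

```python
def length_feature(word, threshold=3):
    """Return 'Less' for words of length <= threshold, otherwise 'More'."""
    return "Less" if len(word) <= threshold else "More"

for w in ["lo", "unne", "puswakAlalo"]:
    print(w, length_feature(w))
```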


5) Morph Root & Expected Tags:

The root word and the best three expected lexical categories are extracted using the morphological analyzer and are added as features.

Reason:

It is similar to the concept of the prefix and suffix, but here the root is extracted using the morph analyzer. The expected tags can be used to constrain the output of the tagger.


b) Pruning:

The next step is pruning the output using the rules generated by TBL, i.e. the contextual and the lexical rules.

Ex:

VJJ to VAUX when bigram is lo unne

JJ to NN when next tag is PREP
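A rough sketch of how the two example rules could be applied to the CRF output. This is not the TBL learner itself, only the rule application step. The function name `apply_rules` is hypothetical, and "bigram" is assumed here to mean the previous word followed by the current word. The tokens "w1", "w2", "w3" in the usage lines are placeholders, not real data.

```python
def apply_rules(words, tags):
    """Apply the two example transformation rules, in order, to a tagged sentence."""
    tags = list(tags)  # work on a copy so the CRF output is left untouched
    # Rule 1: VJJ -> VAUX when the word bigram is "lo unne".
    # Assumption: the bigram is (previous word, current word).
    for i in range(1, len(words)):
        if tags[i] == "VJJ" and (words[i - 1], words[i]) == ("lo", "unne"):
            tags[i] = "VAUX"
    # Rule 2: JJ -> NN when the next tag is PREP.
    for i in range(len(words) - 1):
        if tags[i] == "JJ" and tags[i + 1] == "PREP":
            tags[i] = "NN"
    return tags

print(apply_rules(["w1", "lo", "unne"], ["NN", "PREP", "VJJ"]))
print(apply_rules(["w2", "w3"], ["JJ", "PREP"]))
```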


Tagging Errors:

Issues regarding the nouns/compound nouns/adjectives. Confused tag pairs:
NN / NNP, NNC / NN, NN / JJ

And also:
VRB / VFM; VFM / VAUX, etc.

Experiments…(chunking)

1) Chunk Boundary Identification

Initially we tried out an HMM model for identifying the chunk boundary.

First level:
pUrwi NVB B
cesi VRB I
aMxiMcamani VRB I

2) Chunk Labeling Using CRFs

Features used in the CRF based approach are:
Word window of 4 : W-2, W-1, cW, W+1, W+2
Pos-tag window of 5 : P-3, P-2, P-1, cP, P+1, P+2

We used the chunk boundary label as a feature.

Second level:
pUrwi NVB B-VG
cesi VRB I-VG
aMxiMcamani VRB I-VG
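The feature set for the second level can be sketched as follows, using the three-word example from the slides. The function name `chunk_features`, the feature name `boundary` and the "<PAD>" symbol are assumptions for illustration. The first-level B/I labels are taken as given input here.

```python
def chunk_features(words, pos_tags, boundaries, i):
    """Features for labeling position i: word window, POS window, boundary label."""
    def at(seq, j):
        # Assumption: positions outside the sentence get a padding symbol.
        return seq[j] if 0 <= j < len(seq) else "<PAD>"

    feats = {}
    for off in range(-2, 3):   # W-2 .. W+2
        feats["cW" if off == 0 else f"W{off:+d}"] = at(words, i + off)
    for off in range(-3, 3):   # P-3 .. P+2
        feats["cP" if off == 0 else f"P{off:+d}"] = at(pos_tags, i + off)
    feats["boundary"] = boundaries[i]  # first-level B/I label
    return feats

words = ["pUrwi", "cesi", "aMxiMcamani"]
pos_tags = ["NVB", "VRB", "VRB"]
boundaries = ["B", "I", "I"]
print(chunk_features(words, pos_tags, boundaries, 1))
```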

Results

Fig.1 Results of the POS-Tagging

Fig.2 Chunking Results

* The same model is used for Telugu, Hindi and Bengali except for variations in the window size, i.e. for Hindi, Bengali and Telugu we used window sizes of 6, 6 and 4 respectively.

* Using the gold-standard tags, the accuracy for the Telugu tagger was 90.65%.

Conclusion

The best accuracies were achieved with the use of morphologically rich features like suffix and prefix information, coupled with various efficient machine learning techniques.

A sandhi splitter could be used to improve the results further.

Eg:

1: pAxaprohAlace (NN) = pAxaprahArAliiu (NN) + ce (PREP)

2: vAllumtAru (V) = vAlylyu (NN) + uM-tAru (V)

Thank You!!

Queries???
