Part-of-Speech Tagging with Limited Training Corpora Robert Staubs Period 1

Upload: mavis-wade

Post on 18-Jan-2018

TRANSCRIPT

Page 1:

Part-of-Speech Tagging with Limited Training Corpora

Robert Staubs

Period 1

Page 2:

Pay Attention!

- cor·pus n., pl. cor·po·ra

– A large collection of writings of a specific kind or on a specific subject.

– A collection of writings or recorded remarks used for linguistic analysis.

– ...

Page 3:

Corpus Linguistics

- The study of language based on sample texts of the language in use, often with computational methods.

• “Text” in this case means any sample of the language.

• One of the bodies of text may be called a corpus—Latin for “body”. The plural is corpora.

Page 4:

Part-of-Speech (POS) Tagging

- The processing of a corpus to apply tags to words or other semantic units corresponding to part-of-speech.

• “Part-of-speech” here has a very general meaning.

– Not the seven classes you learn in English class.

– Not the categories of transformational syntax (within “pure” linguistics).

– Such categories as are best suited for computer processing of language (esp. parsing).

Page 5:

Training Corpora

• Traditional POS taggers must be trained on a tagged corpus.

• This corpus will usually be a list of words in the order of their appearance in the text with the tag of each word next to it. Other information may be included.

• Most taggers work using corpora that have been hand-tagged by humans.

• Some can make do with partially or imperfectly machine-tagged corpora.
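
To make the word-plus-tag layout concrete, the sketch below reads a short tagged text into (word, tag) pairs. The `word/TAG` notation, the tag names, and the function name `read_tagged` are assumptions made for this example; the slides do not say which file format or tag set was used.

```python
def read_tagged(text, sep="/"):
    """Split whitespace-separated word/TAG tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # rpartition splits on the LAST separator, so a word that itself
        # contains "/" (e.g. "1/2/CD") keeps its internal slash.
        word, found, tag = token.rpartition(sep)
        if not found:
            # Untagged token: keep the word and mark the tag as missing.
            word, tag = token, None
        pairs.append((word.lower(), tag))
    return pairs


# Hypothetical sample text; the tags are illustrative only.
sample = "The/DET dog/NOUN barks/VERB ./PUNCT"
corpus = read_tagged(sample)
print(corpus)
```

Lowercasing the words is a design choice of the sketch, not something the slides prescribe; it reduces the number of distinct word forms, which matters when the training corpus is small.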

Page 6:

Limited Training Corpora

• Modern tagging methods call for the largest possible amount of training data.

• Oftentimes data can be hard to obtain. (Not quite the case with English anymore, but other languages can be more problematic.)

• The behavior of POS taggers trained on smaller corpora may differ from that of taggers trained on larger ones.

Page 7:

Genre and Tagging

• Many corpora are divided by the genre of the source texts: transcribed speech, newswire articles, fiction, etc.

• Training knowledge about a specific genre can be—and usually is—more valuable than general knowledge.

• When the training corpus becomes small enough, general knowledge may in some cases become more valuable than genre knowledge.

Page 8:

Implementation Basics

• The frequencies of each tag, word, tag-tag transition, and word-tag correspondence are needed.

• These frequencies are extended to the general case by using them to create probabilities of a certain word having a certain tag and of a certain tag following another tag.

• For each word, the probability of each candidate tag is combined with the probabilities of the surrounding tag transitions, and the most probable tag sequence is chosen.

• This is the Viterbi algorithm, which treats the system as a hidden Markov model.
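
The steps listed on this slide can be sketched roughly as follows. This is a minimal bigram illustration, not the author's implementation: the function names `train` and `viterbi`, the `<s>` start marker, the toy sentences, and the add-`alpha` smoothing (used so that unseen words and transitions do not get zero probability) are all assumptions introduced for the example.

```python
import math
from collections import Counter


def train(sentences):
    """Count tags, tag-tag transitions, and word-tag pairs."""
    tag_counts, trans, emit = Counter(), Counter(), Counter()
    for sent in sentences:
        prev = "<s>"  # hypothetical start-of-sentence marker
        tag_counts[prev] += 1
        for word, tag in sent:
            tag_counts[tag] += 1
            trans[(prev, tag)] += 1
            emit[(tag, word)] += 1
            prev = tag
    return tag_counts, trans, emit


def viterbi(words, tag_counts, trans, emit, alpha=0.1):
    """Return the most probable tag sequence for `words` under a bigram HMM."""
    if not words:
        return []
    tags = [t for t in tag_counts if t != "<s>"]
    vocab_size = len({w for (_, w) in emit}) + 1  # +1 slot for unseen words

    def log_trans(prev, tag):
        # Smoothed estimate of P(tag | previous tag).
        return math.log((trans[(prev, tag)] + alpha)
                        / (tag_counts[prev] + alpha * len(tags)))

    def log_emit(tag, word):
        # Smoothed estimate of P(word | tag).
        return math.log((emit[(tag, word)] + alpha)
                        / (tag_counts[tag] + alpha * vocab_size))

    # best[i][tag] = (log-probability of the best path ending in tag, previous tag)
    best = [{t: (log_trans("<s>", t) + log_emit(t, words[0]), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            score, prev = max((best[-1][p][0] + log_trans(p, t), p) for p in tags)
            column[t] = (score + log_emit(t, word), prev)
        best.append(column)

    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


# Toy training data; words and tags are illustrative only.
sentences = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]
model = train(sentences)
print(viterbi(["the", "cat", "barks"], *model))  # ['DET', 'NOUN', 'VERB']
```

Working in log-probabilities avoids numeric underflow on long sentences. The smoothing constant is a free parameter of the sketch, and with a small training corpus its value can noticeably affect how unseen words are tagged.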

Page 9:

Intermission

• Who names their kid Corpus?

• I mean, seriously?

Page 10:

Questions for Investigation

• How do POS taggers trained on limited corpora perform?

• How does performance differ amongst various genres, and between general and genre-specific tagging?

• Is this in accordance with theories developed for large corpora?

• Can a transitional point be found—the minimal size?

• How general is such information?

Page 11:

Results

• Results were found for genre-specific tagging for each of the four genres in the corpus as well as for general tagging on an exemplar of each genre.

• A “primed” genre-specific case wherein the system is trained on the target text as well is included for comparison.
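
One way the three training conditions could be set up is sketched below. The data layout (a dict from genre label to a list of tagged texts), the function names `training_sets` and `accuracy`, and the choice to let the general set include same-genre texts other than the target are assumptions for illustration; the slides do not describe the exact procedure.

```python
def training_sets(corpus, genre, target_index):
    """Build genre-specific, general, and primed training sets for one target text.

    corpus: dict mapping a genre label to a list of texts,
            where each text is a list of (word, tag) pairs.
    """
    target = corpus[genre][target_index]
    # Genre-specific: every other text of the same genre.
    specific = [t for i, t in enumerate(corpus[genre]) if i != target_index]
    # General: texts of all genres, still excluding the target itself.
    general = [t for g, texts in corpus.items()
               for i, t in enumerate(texts)
               if not (g == genre and i == target_index)]
    # Primed: the genre-specific set plus the target text.
    primed = specific + [target]
    return specific, general, primed


def accuracy(predicted, gold):
    """Fraction of positions where the predicted tag equals the gold tag."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Toy corpus with hypothetical genre labels and one-word texts.
toy = {
    "A": [[("news", "NOUN")], [("reports", "VERB")]],
    "N": [[("ran", "VERB")]],
}
specific, general, primed = training_sets(toy, "A", 0)
print(len(specific), len(general), len(primed))
```

Because the primed set contains the text being tagged, its accuracy is best read as an upper reference point rather than as a fair measure of performance.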

Page 12:
Page 13:

Comparison of Taggings

• Primed genre-specific taggings are most accurate in all cases.

• General tagging is somewhat less accurate in two cases, very slightly less accurate in one case, and somewhat more accurate in one case.

Page 14:
Page 15:
Page 16:
Page 17:

Training Corpus Size

• No clear trend can be drawn from the sizes investigated.

• The small differences involved are largely inconsequential compared to differences between texts.

Page 18:

Conclusions

• Performance Evaluation:

– Relatively poor compared to large-corpora taggers but much better than random.

• Effect of Genres:

– Genre-specific taggings seem to work somewhat better, although the numbers found are not overwhelmingly convincing.

– Existence of a “transition point” is still uncertain.

– Best for G (biographies; memoirs) and N (adventure; fiction), second is A (press reportage), last is J (learned writing).

Page 19:

Potential Improvements

• This tagger was designed to be purely statistical. Given some human knowledge it could improve.

• Basic conjugation, declension, and other word-formation methods would be very helpful in determinations of POS.

• Using further depths of association among sequences of tags would also help.
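
As a rough illustration of the word-formation idea, the sketch below guesses a tag for a word never seen in training by looking at its ending. The suffix table, the tag names, and the function name `guess_tag` are hypothetical and far too crude for real use; they only show where such knowledge could plug into a purely statistical tagger.

```python
# Hypothetical suffix-to-tag table for English; illustrative only.
SUFFIX_TAGS = [
    ("ing", "VERB"),
    ("ed", "VERB"),
    ("ly", "ADV"),
    ("ness", "NOUN"),
    ("tion", "NOUN"),
    ("s", "NOUN"),
]


def guess_tag(word, default="NOUN"):
    """Guess a tag for an unseen word from its suffix, longest suffix first."""
    word = word.lower()
    for suffix, tag in sorted(SUFFIX_TAGS, key=lambda st: -len(st[0])):
        # Require at least two characters of stem so that very short
        # words such as "is" do not match a suffix rule.
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return tag
    return default


for w in ["quickly", "running", "happiness", "walked"]:
    print(w, guess_tag(w))
```

In a tagger like the one described here, such a guess could replace or reweight the word-tag probability for unknown words while the tag-transition probabilities stay unchanged.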

Page 20:

Final Words

• Limited-corpora training is inadequate for many uses if one uses the standard methods.

• An efficient, possibly improved implementation could be helpful as preliminary work for Baum-Welch re-estimation or for tagging with a larger corpus.

Page 21:

Quiz and Q&A

• Define corpus.

• Any questions?