introduction to computational linguistics

Introduction to Computational Linguistics Dipti Misra Sharma IIIT, Hyderabad <[email protected] > IASNLP 05-07-2012

Page 1: Introduction to Computational Linguistics

Introduction to Computational Linguistics

Dipti Misra Sharma

IIIT, Hyderabad

<[email protected]>

IASNLP 05-07-2012

Page 2: Introduction to Computational Linguistics

Outline

Background

What is Computational Linguistics (CL)?

What do the Computational Linguists do?

What are the issues in processing natural languages?

What can we do with CL?

Approaches in CL?

Page 3: Introduction to Computational Linguistics

Background

Language is a means of communication

Therefore, one can say

It encodes what is communicated <information>

We apply the processes of

Analysis (decoding) for understanding

Synthesis (encoding) for expression (speaking)

Page 4: Introduction to Computational Linguistics

What do we communicate ?

Information (SPAIN delivered a football masterclass at Euro 2012)

Intention <purpose>

Emphasis/focus (Euro 2012 won by Spain / Spain bags Euro 2012)

Introduces variation

Page 5: Introduction to Computational Linguistics

How do we communicate ?

We use linguistic elements such as

Words (country, park, the, is, Bandipur, of, as, and, considered, National, a, spot, beautiful, tourist, life, in, best, wild, sanctuaries, the, one)

Arrangement of the words (Sentences)

Words are related to each other to provide the composite meaning (Bandipur National park is a beautiful tourist spot and considered as one of the best wild life sanctuaries in the country)

Page 6: Introduction to Computational Linguistics

How do we communicate ?

Arrangement of sentences (Discourse)

Sentences or parts of sentences are related to each other to provide a cohesive meaning

*(Considered as one of the best wild life sanctuaries in the country. It is a national park covering an area of about 874 km. Bandipur National park is a beautiful tourist spot.)

(Bandipur National park is a beautiful tourist spot and considered as one of the best wild life sanctuaries in the country. It is a national park covering an area of about 874 km)

Languages differ in the way they organise information in these entities

All of these interact in the organisation of information

Page 7: Introduction to Computational Linguistics

What is Computational Linguistics?

Computational linguistics is the scientific study of language from a computational perspective.

Page 8: Introduction to Computational Linguistics

What does it mean?

Scientific: Provides explanation for a linguistic or psycholinguistic phenomenon

Computational: Develops computational models/techniques for linguistic phenomena

Human language is the subject of study

Page 9: Introduction to Computational Linguistics

In other words

Computational linguistics is the application of linguistic theories and computational techniques to problems of natural language processing.

http://www.ba.umist.ac.uk/public/departments/registrars/academicoffice/uga/lang.htm

Page 10: Introduction to Computational Linguistics

What do the Computational Linguists do?

Linguistic research

Develop language models for processing natural languages

Develop language resources for NLP research/applications

Understand and develop models for analysis and generation of natural languages by the computers

Page 11: Introduction to Computational Linguistics

So,

A Computational Linguist needs to understand

How language works

What information is available in the language?

How do languages encode information?

How can this knowledge/information be represented for computational processing?

Page 12: Introduction to Computational Linguistics

Information in Language (1/4)

Languages encode information

cuuhe maarate haiN kutte

rats kill dogs

The Hindi sentence is ambiguous

Possible interpretations

Dogs kill rats

Rats kill dogs

However,

English sentence is not ambiguous

Page 13: Introduction to Computational Linguistics

Information in Language (2/4)

Ambiguity in Hindi is resolved if,

cuuhe maarate haiN kuttoN ko

rats kill dogs acc

English encodes information in positions

Hindi in morphemes

Languages encode information differently
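The contrast between position and case marking can be sketched in a few lines of Python. This is a toy illustration, not a parser: the function names and the small noun list are assumptions made for this example, and only the accusative marker ko from the slide is handled.

```python
# Toy sketch: English roles come from position, Hindi roles from the marker "ko".
# HINDI_NOUNS and both function names are assumptions made for this example.
HINDI_NOUNS = {"cuuhe", "kutte", "kuttoN"}


def english_roles(tokens):
    """Assume a bare Subject-Verb-Object sentence: position decides the role."""
    subject, verb, obj = tokens
    return {"subject": subject, "verb": verb, "object": obj}


def hindi_roles(tokens):
    """Return every possible (subject, object) pair for a two-noun sentence."""
    nouns = [t for t in tokens if t in HINDI_NOUNS]
    # A noun directly followed by "ko" is marked as the object.
    marked = [
        t for i, t in enumerate(tokens)
        if t in HINDI_NOUNS and i + 1 < len(tokens) and tokens[i + 1] == "ko"
    ]
    if marked:
        obj = marked[0]
        subj = next(n for n in nouns if n != obj)
        return [(subj, obj)]
    # No case marker: either noun may be the subject, so both readings survive.
    first, second = nouns
    return [(first, second), (second, first)]
```

Without the marker the function returns two readings for cuuhe maarate haiN kutte; with kuttoN ko it returns one.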

Page 14: Introduction to Computational Linguistics

Information in Language (3/4)

Another example,

This chair has been sat on

– The chair has been used for sitting

– X sat on this chair, and it is known

– The sentence does not mention X

Languages encode information partially

Page 15: Introduction to Computational Linguistics

Information in Language (4/4)

English pronouns he, she, it

Hindi pronoun vaha

He is going to Delhi ==> vaha dillii jaa rahaa hai

She is going to Delhi ==> vaha dillii jaa rahii hai

It broke ==> vaha TuuTa ??

Information does not always map fully from one language into another

Conceptual worlds may be different

Page 16: Introduction to Computational Linguistics

Differences ?

Words

English      Hindi                                Telugu
boys         laDake/laDakoN
<n,pl>       <n,sg/pl,case>
He/she/it    vaha                                 atanu/aame/adi
is/am/are    hai/huuN/haiN/ho
is going     jaa rahaa hai/rahii hai/rahe haiN

Page 17: Introduction to Computational Linguistics

Indian Languages

Relatively flexible word order

1. a) baccaa phala khaataa hai

‘child’ ‘fruit’ ‘eat+hab’ ‘pres’

The child eats fruits

b) phala baccaa khaataa hai

c) phala khaataa hai baccaa

d) baccaa khaataa hai phala

Page 18: Introduction to Computational Linguistics

Some structural differences

English

Declarative : Ravi is coming today

Interrogative : Is Ravi coming today ?

Change in the position of ‘is’ brings the change in meaning

Hindi

Declarative : ravi aaj aa rahaa hai

Interrogative : kyaa ravi aaj aa rahaa hai ?

Word ‘kyaa’ encodes the question information

Alternatively, more natural spoken form in Hindi

ravi aaj aa rahaa hai ? (with appropriate intonation)

OR

ravi aaj aa rahaa hai kyaa ?

Page 19: Introduction to Computational Linguistics

Post nominal modification

'ing' clauses

I know [the man playing guitar]

Hindi, on the other hand

maiN [giTaar bajaa rahe vyakti ko] jaanataa huuN

Page 20: Introduction to Computational Linguistics

Clauses having 'un-' negative constructions

English

Unless you reach there the job will not be done

Hindi

jab tak tum vahaaN nahiiN pahuNcate, kaam nahiiN hogaa

Page 21: Introduction to Computational Linguistics

Languages Differ

Different languages have different mechanisms/devices to encode information

Some devices are common across certain languages and some are different

There are alternative ways of expressing the same meaning within the same language

Languages show preferences for one device over the others

English exploits ‘position’ for encoding information

Hindi uses ‘words’ more effectively

Thus, differences in grammatical structures

Page 22: Introduction to Computational Linguistics

Ambiguity in Natural Language (1/2)

Look at the word 'plot' in the following examples

(a) The plot having rocks and boulders is not good.

(b) The plot having twists and turns is interesting.

'plot' in (a) means 'a piece of land' and in (b) 'an outline of the events in a story'
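One simple way to let context pick the sense is to count how many clue words each sense shares with the sentence. The sketch below is a minimal overlap-based illustration in Python; the sense labels, the clue-word sets and the function name are assumptions made up for this example and are not taken from any real lexicon.

```python
# Toy sense inventory for 'plot'. The clue words are assumptions chosen to
# cover the two example sentences; a real system would use a lexicon or corpus.
SENSES = {
    "piece of land": {"rocks", "boulders", "land", "house", "acre"},
    "outline of a story": {"twists", "turns", "story", "novel", "film"},
}


def disambiguate(sentence, senses=SENSES):
    """Pick the sense whose clue words overlap most with the sentence."""
    words = {w.strip(".,").lower() for w in sentence.split()}
    scores = {sense: len(words & clues) for sense, clues in senses.items()}
    best = max(scores, key=scores.get)
    # With no overlapping clue word there is no evidence for either sense.
    return best if scores[best] > 0 else None
```

The third case in the test, a sentence with no clue words, shows the limit of the approach: context that the inventory does not anticipate leaves the word ambiguous.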

Page 23: Introduction to Computational Linguistics

Ambiguity in Natural Language (2/2)

Lexical level

Sentence level

Structural differences between the source language (SL) and the target language (TL) in a Machine Translation system.

Page 24: Introduction to Computational Linguistics

Lexical ambiguity

Lexical ambiguity can be both for

Content words – nouns, verbs etc

Function words – prepositions, TAMs etc

Content words' ambiguity is of two types

Homonymy

Polysemy

Page 25: Introduction to Computational Linguistics

Homonymy

A word has two or more unrelated senses

Example : I was walking on the bank (river-bank)

I deposited the money in the bank (money-bank)

Page 26: Introduction to Computational Linguistics

Polysemy

A word having two or more related senses

Example : English word 'issue', noun

1. The issue is under discussion (muddaa)

2. The latest issue of the journal is out (aNka)

3. He buys stamps on the day of the issue (vimocan)

4. The couple has no issue even after five years of marriage (saNtaan)

Page 27: Introduction to Computational Linguistics

Information Flow and Ambiguity

1. He scratched a figure on the rock (engrave)

2. She scratched the figure on the rock (scrape)

• Other words in the context make a difference

• Change of 'a' (in 1) to 'the' (in 2) changes the meaning of 'scratched'

Page 28: Introduction to Computational Linguistics

Function words can also pose problems (1/4)

Function words can also be ambiguous

For example – English preposition 'in'

(a) I met him in the garden

maiN usase bagiice meiN milaa

(b) I met him in the morning

maiN usase subaha 0 milaa

'Ambiguity' here refers to the 'appropriate correspondence' in the target language.

Page 29: Introduction to Computational Linguistics

Function words can also pose problems (2/4)

1. He bought a shirt with tiny collars.

usane chote kaular vaalii kamiiz khariidii

‘he tiny collars with shirt bought’

‘with’ gets translated as ‘vaalii’ in Hindi

2. He washed a shirt with soap.

usane saabun se kamiiz dhoii

‘he soap with shirt washed’

‘with’ gets translated as ‘se’ .

Page 30: Introduction to Computational Linguistics

Function words can also pose problems (3/4)

TAM Markers mark tense, aspect and modality

– Consist of inflections and/or auxiliary verbs in Hindi

– An important source of information

– Narrow down the meaning of a verb (eg. lied, lay)

Page 31: Introduction to Computational Linguistics

Function words can also pose problems (4/4)

English Simple Past vs Habitual

1a. He stayed in the guest house during his visit to our University in Jan (rahaa)

1b. He stayed in the guest house whenever he visited us (rahataa thaa)

2a. He went to the school just now (gayaa)

2b. He went to the school everyday (jaataa thaa)

Page 32: Introduction to Computational Linguistics

Sentence level ambiguity

I met the girl in the store

Possible readings

a) I met the girl who works in the store

b) I met the girl while I was in the store

Time flies like an arrow.

Possible parses:

a) Time flies like an arrow (N V Prep Det N)

b) Time flies like an arrow (N N V Det N)

c) Time flies like an arrow (V N Prep Det N) (flies are like an arrow)

d) Time flies like an arrow (V N Prep Det N) (manner of timing)
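The source of these competing parses is that several words allow more than one category. A minimal Python sketch can enumerate every category sequence a small lexicon permits; the lexicon entries and the function name are assumptions for this example, and a real parser would go on to filter the candidates with a grammar.

```python
from itertools import product

# Toy lexicon: the category options per word are assumptions for this example.
LEXICON = {
    "time": ["N", "V"],
    "flies": ["N", "V"],
    "like": ["V", "Prep"],
    "an": ["Det"],
    "arrow": ["N"],
}


def candidate_tag_sequences(sentence):
    """Return every tag sequence the lexicon allows, before any grammar applies."""
    words = sentence.lower().split()
    return [tuple(tags) for tags in product(*(LEXICON[w] for w in words))]
```

Three two-way ambiguous words give 2 x 2 x 2 = 8 candidate sequences; the three distinct tag patterns listed above are among them, and the rest would be ruled out by grammatical constraints.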

Page 33: Introduction to Computational Linguistics

Thus,

Languages encode information differently

Languages code information only partially

Tension between BREVITY and PRECISION

Brevity wins, leading to inherent ambiguity at different levels

Page 34: Introduction to Computational Linguistics

Human beings use

World knowledge

Context (both linguistic and extra-linguistic)

Cultural knowledge and

Language conventions to resolve ambiguities

Can all this knowledge be provided to the machine ? Computational Linguistics aims for this.

Page 35: Introduction to Computational Linguistics

How to provide this knowledge ? (1/2)

Analyse language at various levels (word, phrase, sentence etc)

Build Tools for analysing the natural language at various levels in a text

POS tagger (category marking)

Morphological analysers (analysis of a word)

Morphological generators (word generators)

Chunkers (shallow parsers)

Parsers (syntactic analysis)

Filters (markers for special expressions)

Sense Disambiguation Algorithms

Etc

The tools need linguistic knowledge

Page 36: Introduction to Computational Linguistics

How to provide this knowledge ? (2/2)

Build language resources

Machine Readable Lexicon

Rules for various levels of linguistic analysis

Computational Grammars

Mapping rules for the concerned language pair for an MT system

Sense Disambiguation Rules

Annotated corpora

Etc

Page 37: Introduction to Computational Linguistics

POS Tagger

What is a POS? Take the following English sentence

My old friend Ram recently bought a book on Indian snakes for his cousin from London from the new bookshop.

Each word in the above sentence belongs to a word class, also called a part of speech (POS)

The class to which a word belongs is based on its morphological and syntactic behavior

Morphological

Kind of affixes a word takes, for example, boy, boys; girl, girls; book, books (noun class)

Syntactic

How it is distributed in a sentence

He chairs the next session (verb)

The chairs are new (noun)

Page 38: Introduction to Computational Linguistics

Why is POS relevant in CL/NLP ? (1/2)

• Word class information of a given word in a sentence helps to predict its neighbour

• WSD (word sense disambiguation)

He runs a mile every day (verb)

Their team made 250 runs (noun)

Time flies like an arrow (n v prep det n)

• Helps in further processing – chunking, morph pruning, sentence parsing

• IR (information retrieval)

A POS tagger automatically marks the POS of all the words in a text

Page 39: Introduction to Computational Linguistics

POS tagged sentence

My possessive pronoun

old adjective

friend noun

Ram proper noun

recently adverb

bought verb

a determiner

book noun

on preposition

Indian adjective

snakes noun

for preposition

his possessive pronoun

cousin noun

from preposition

London proper noun

, punctuation

from preposition

the determiner

new adjective

bookshop noun

in preposition

town noun

Page 40: Introduction to Computational Linguistics

POS Tagging Approaches

Rule Based

Statistical

Transformation Based

Page 41: Introduction to Computational Linguistics

Rule Based POS Tagging

Two-stage architecture (Harris, 1962; Klein and Simmons, 1963; Greene and Rubin, 1971)

Stage 1: assign POS by referring to the dictionary

E.g. dictionary entry for the English word 'that'

that Conj, Adv, Pronoun

Stage 2: disambiguate, using manually crafted rules
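The two stages can be sketched as follows in Python. Only the dictionary entry for 'that' comes from the slide; the other dictionary entries, the function names and the single hand-written rule are assumptions made for illustration, and a real rule-based tagger would need many more rules.

```python
# Stage 1 dictionary. Only the entry for 'that' is from the slide; the other
# entries are assumptions added so the example sentences can be tagged.
DICTIONARY = {
    "that": ["Conj", "Adv", "Pronoun"],
    "i": ["Pronoun"],
    "he": ["Pronoun"],
    "it": ["Pronoun"],
    "know": ["Verb"],
    "left": ["Verb"],
    "is": ["Verb"],
    "not": ["Adv"],
    "good": ["Adj"],
}


def stage1(words):
    """Look every word up in the dictionary and keep all candidate tags."""
    return [(w, list(DICTIONARY.get(w.lower(), ["Unknown"]))) for w in words]


def stage2(candidates):
    """Resolve ambiguous words with hand-written rules (here: only 'that')."""
    tagged = []
    for i, (word, tags) in enumerate(candidates):
        if len(tags) > 1 and word.lower() == "that":
            next_tags = candidates[i + 1][1] if i + 1 < len(candidates) else []
            prev_tag = tagged[-1][1] if tagged else None
            if "Adj" in next_tags:
                tag = "Adv"        # 'that good'
            elif prev_tag == "Verb" and "Pronoun" in next_tags:
                tag = "Conj"       # 'know that he ...'
            else:
                tag = "Pronoun"    # 'That is ...'
        else:
            tag = tags[0]          # unambiguous, or no rule written yet
        tagged.append((word, tag))
    return tagged


def tag(sentence):
    return stage2(stage1(sentence.split()))
```

The rule inspects only the immediate neighbours of 'that', which is the kind of local context such hand-crafted rules typically use.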

Page 42: Introduction to Computational Linguistics

Statistical

Taggers use probabilities for tagging

The tagger picks the most likely tag for a given word in a context

HMM-based algorithms are the most commonly used for the POS tagging task

They require a manually tagged corpus
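To make the idea concrete, here is a minimal bigram HMM tagger in Python, trained by counting on a tiny hand-tagged corpus and decoded with the Viterbi algorithm. The corpus (adapted from example sentences in these slides), the coarse tag names and the add-one smoothing are all assumptions for this sketch; a usable tagger needs a far larger corpus.

```python
import math
from collections import Counter, defaultdict

# Tiny hand-tagged corpus: an assumption for this sketch, adapted from
# example sentences used elsewhere in the slides.
CORPUS = [
    [("he", "PRON"), ("runs", "VERB"), ("a", "DET"), ("mile", "NOUN")],
    [("their", "PRON"), ("team", "NOUN"), ("made", "VERB"), ("runs", "NOUN")],
    [("he", "PRON"), ("chairs", "VERB"), ("the", "DET"), ("session", "NOUN")],
    [("the", "DET"), ("chairs", "NOUN"), ("are", "VERB"), ("new", "ADJ")],
]


def train(corpus):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    trans, emit, vocab = defaultdict(Counter), defaultdict(Counter), set()
    for sentence in corpus:
        prev = "<s>"
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    return trans, emit, vocab


def viterbi(words, trans, emit, vocab):
    """Return the most likely tag sequence under add-one smoothed estimates."""
    tags = list(emit)
    vocab_size = len(vocab) + 1  # +1 reserves probability mass for unseen words

    def log_t(prev, tag):
        return math.log((trans[prev][tag] + 1) / (sum(trans[prev].values()) + len(tags)))

    def log_e(tag, word):
        return math.log((emit[tag][word] + 1) / (sum(emit[tag].values()) + vocab_size))

    # best[tag] = (log probability, tag path) of the best path ending in tag
    best = {t: (log_t("<s>", t) + log_e(t, words[0]), [t]) for t in tags}
    for word in words[1:]:
        best = {
            t: max(
                ((score + log_t(p, t) + log_e(t, word), path + [t])
                 for p, (score, path) in best.items()),
                key=lambda item: item[0],
            )
            for t in tags
        }
    return max(best.values(), key=lambda item: item[0])[1]
```

The word 'chairs' is seen once as a verb and once as a noun in the corpus, so the choice between the two is made by the surrounding tags, which is the point of tagging a word in its context.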

Page 43: Introduction to Computational Linguistics

Annotating Corpus for POS

Annotated corpora are useful for developing statistical POS taggers

Tagging scheme

Set of POS Tags

Guidelines for the annotators

The tagged corpora should be

High quality (in terms of tagging accuracy)

Consistent

Page 44: Introduction to Computational Linguistics

POS Tags for English

English

Penn Treebank – 45 tags

C5 (Lancaster) – 61 tags – used in CLAWS

Basic tagset used for BNC http://view.byu.edu/bnc_tags.htm

C7 – 147 tags – Leech

http://www.comp.lancs.ac.uk/ucrel/claws7tags.html

Page 45: Introduction to Computational Linguistics

Penn Treebank Tags

My PP$

old JJ

friend NN

Ram NNP

recently RB

bought VBD

a DT

book NN

on IN

Indian JJ

snakes NNS

for IN

his PP$

cousin NN

from IN

London NNP

, ,

from IN

the DT

new JJ

bookshop NN

in IN

town NN

Page 46: Introduction to Computational Linguistics

POS Tags for Indian Languages

Objective

To arrive at a standard POS and Chunk tagging scheme for all Indian languages

Assumption

Commonality in Indian Languages

Page 47: Introduction to Computational Linguistics

Issues in Tag Set Design (1/2)

Linguistic knowledge coarse vs fine

Syntactic function vs lexical category (for POS tags)

New tags vs tags close to existing English tags

Should be comprehensive/complete

Page 48: Introduction to Computational Linguistics

Issues in Tag Set Design (2/2)

Simple

Less effort in manual tagging

Number of tags

Common for all Indian languages

Page 49: Introduction to Computational Linguistics

Linguistic Knowledge :Fine vs Coarse (1/2)

Example

Only noun (NN): laDakA, laDake, laDakoM, laDakI, laDakiyAM, laDakiyoM

OR

Noun with gender, number, case information

(NNM) laDakA, laDake, laDakoM

(NNMS) laDakA, laDake

(NNMP) laDake, laDakoM

(NNMSD) laDakA

(NNMSO) laDake

(NNMPD) laDake

(NNMPO) laDakoM

The decision has implications for the size of corpora and machine learning

Page 50: Introduction to Computational Linguistics

Linguistic Knowledge :Fine vs Coarse (2/2)

Alternatives

Coarse - NN (advantages/disadvantages)

Fine - NNMSD (advantages/disadvantages)

Hierarchical

Example: NN_m_sg_d

Hierarchical tag set provides the possibility for underspecification
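Underspecification can be illustrated with a short Python sketch in which a coarse tag matches any finer tag that extends it. The field order (category, gender, number, case) is read off the example NN_m_sg_d; the field names and function names are assumptions for this illustration.

```python
# Assumed layout, following the example NN_m_sg_d:
# category _ gender _ number _ case. The field names are assumptions.
FIELDS = ("category", "gender", "number", "case")


def parse_tag(tag):
    """Split a hierarchical tag into the fields it actually specifies."""
    return dict(zip(FIELDS, tag.split("_")))


def matches(specific, general):
    """True if `specific` agrees with every field that `general` specifies."""
    s, g = parse_tag(specific), parse_tag(general)
    return all(s.get(field) == value for field, value in g.items())
```

An annotator or a tool that cannot decide case or number can stop at NN or NN_m, and the coarse tag stays compatible with the fully specified NN_m_sg_d.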

Page 51: Introduction to Computational Linguistics

Considerations

POS tagger is NOT a replacement for a morph analyzer

Coarse analysis to begin with

Expandable if needed

If the information can be obtained from elsewhere, it need not be included in the POS tag

Page 52: Introduction to Computational Linguistics

Syntactic function vs lexical category

Example

harijana bAlaka ‘harijan’ ‘child’

Decision : Lexical category

Helps achieve

Consistency in annotation

Better learning

Page 53: Introduction to Computational Linguistics

New tags vs tags close to existing English tags

New tags

Noun, Pron, Adj, Adv

Familiar tags (Penn Treebank tags)

NN, PRP, JJ, RB

Decision : Penn tags for common lexical types

New tags for certain IL specific cases

Page 54: Introduction to Computational Linguistics

Comprehensive/Complete

All the lexical items occurring in a sentence should be marked for their POS, including punctuation.

If the language has some special cases, these should also be captured – Reduplications in ILs

Page 55: Introduction to Computational Linguistics

Simple

Why simple ?

The tags are designed for some manual annotation

Ease of learning

Consistency in annotation

Page 56: Introduction to Computational Linguistics

Less Effort in Manual Tagging

The annotators should not have to

Write too much

Take too many steps in annotating a lexical item

Page 57: Introduction to Computational Linguistics

Number of Tags

Number of tags makes a difference both for the man and the machine

For the man in decision making

For the machine in learning for automatic tagging

Page 58: Introduction to Computational Linguistics

Common for All Indian Languages

Indian languages belong to various language families

Share linguistic features

However, there are differences

Some languages have quotatives, some don't

Some have classifiers, some don't

Page 59: Introduction to Computational Linguistics

Chunking

What forms a chunk ?

Non-recursive phrase ((det adj noun))

Partial structure without distorting the dependencies

Include inflections (postposition/auxiliaries) with a lexical category

Example : ((mere choTe bhaaii ne))_NP

((jaa rahaa hai))_VG

Page 60: Introduction to Computational Linguistics

Chunker

A Chunker automatically groups words in a sentence as chunks and labels them

((My old friend Ram))_NP ((recently bought))_VG ((a book))_NP on ((Indian snakes))_NP for ((his cousin))_NP from ((London))_NP from ((the new bookshop))_NP.
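A chunker of this kind can be approximated with a few patterns over POS tags. The Python sketch below takes the sentence with the Penn-style tags shown on the earlier tag slide and reproduces the bracketing above; the two patterns (NP = optional determiner or possessive, adjectives, nouns; VG = optional adverb plus verbs) and all function names are assumptions for this example, not the scheme's actual rules.

```python
NOUN_TAGS = {"NN", "NNS", "NNP"}
DET_TAGS = {"DT", "PP$"}


def np_end(tags, i):
    """End index of a noun chunk starting at i, or i if none starts here."""
    j = i
    if j < len(tags) and tags[j] in DET_TAGS:
        j += 1
    while j < len(tags) and tags[j] == "JJ":
        j += 1
    k = j
    while k < len(tags) and tags[k] in NOUN_TAGS:
        k += 1
    return k if k > j else i  # at least one noun is required


def vg_end(tags, i):
    """End index of a verb group (optional adverb + verbs) starting at i."""
    j = i
    if j < len(tags) and tags[j] == "RB":
        j += 1
    k = j
    while k < len(tags) and tags[k].startswith("VB"):
        k += 1
    return k if k > j else i  # at least one verb is required


def chunk(tagged):
    """Group (word, tag) pairs into labelled chunks; other words pass through."""
    words = [w for w, _ in tagged]
    tags = [t for _, t in tagged]
    out, i = [], 0
    while i < len(tagged):
        for label, matcher in (("NP", np_end), ("VG", vg_end)):
            j = matcher(tags, i)
            if j > i:
                out.append("((" + " ".join(words[i:j]) + "))_" + label)
                i = j
                break
        else:
            out.append(words[i])  # e.g. prepositions and punctuation
            i += 1
    return " ".join(out)


TAGGED = [
    ("My", "PP$"), ("old", "JJ"), ("friend", "NN"), ("Ram", "NNP"),
    ("recently", "RB"), ("bought", "VBD"), ("a", "DT"), ("book", "NN"),
    ("on", "IN"), ("Indian", "JJ"), ("snakes", "NNS"), ("for", "IN"),
    ("his", "PP$"), ("cousin", "NN"), ("from", "IN"), ("London", "NNP"),
    ("from", "IN"), ("the", "DT"), ("new", "JJ"), ("bookshop", "NN"),
    (".", "."),
]
```

Here the adverb is grouped with the verb, as in the example above; chunking adverbs separately would only require dropping the optional RB step from vg_end.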

Page 61: Introduction to Computational Linguistics

IL Chunk Tags (1/2)

NP      noun chunk             bahut acchii kitaab
JJP     adjective chunk        bahut sundar sii
RBP     adverb chunk           dhiire-dhiire
NEGP    chunk for negatives    nahiiN
CCP     conjunct chunks        raam aur shyaam
BLK     miscellaneous          interjections etc

Page 62: Introduction to Computational Linguistics

IL Chunk Tags (2/2)

VGF     Finite verb chunk        jaa rahaa hai
VGNF    Non finite verb chunk    jaate hue
VGINF   Infinitive verb chunk    jaanaa
VGNN    Gerunds                  jaanaa
FRAGP   Discontiguous fragments of a chunk    raama (meraa bhaaii) ne

Page 63: Introduction to Computational Linguistics

Some Issues

How to chunk the following ?

Adverbs

within a verb chunk or separately

Eg ((recently bought)) or ((recently)) ((bought))

Punctuation

Particles – hii (only), to, bhii (also) etc

Page 64: Introduction to Computational Linguistics

Current approach

Punctuation – chunk it with the preceding chunk

Adverbs – chunk them separately

Particles – chunk them with the chunk to which they belong

((raam ne bhii)) ((jaa hii rahaa thaa))

Page 65: Introduction to Computational Linguistics

Issues

• Verb Negation

1. nahiiN jaa rahaa ‘not going’

2. kahaa hii nahiiN ‘just did not mention’

3. kaha to nahiiN rahaa thaa ‘was not saying’ (emphatic)

4. binaa yaha baata kahe ‘without saying this’

5. yahii nahiiN, balki likhita ruup meiN bhii yah miltaa hai ‘Not only this, in fact, this is also found in writing’

Page 66: Introduction to Computational Linguistics

Current approach

For cases 1 to 3, chunk NEG with the verb group

For 4, chunk the NEG separately in a chunk

For 5, also a separate NEGP chunk will work

NOUN NEGATION ???

Page 67: Introduction to Computational Linguistics

Chunking Co-ordinate Constructions

1. word1 CC word2 raam aur shyaam

((raam))_NP ((aur))_CCP ((shyaam))_NP

2. phrase CC phrase

meraa bhaaii shyaam aur tumhaaraa bhaaii mohan

((meraa bhaaii shyaam))_NP ((aur))_CCP ((tumhaaraa bhaaii mohan))_NP

3. clause CC clause

Page 68: Introduction to Computational Linguistics

Discontiguous Phrases

What about cases such as ' X (Y) Z' ?

where X = noun, Y = a phrase, Z = postposition

raam (meraa dillii vaalaa bhaaii) ne

OR

isa 'upanyaas – samraaT' shabda kaa

FRAGP

Page 69: Introduction to Computational Linguistics

Chunking Conjunct Verbs

Conjunct verbs

A verb composed of a noun/adj and a verb (sviikaar karnaa 'accept')

Should the conjunct verbs be tagged as a single chunk or two chunks?

'prawIkSA karanA', 'kSamA karanA' etc

‘to wait’ ‘to forgive’

Page 70: Introduction to Computational Linguistics

What about genitives ?

raam kaa betaa

'son of Ram'

usakaa betaa

'his/her son'

mere bhaaii raam kaa betaa

'my brother Ram's son'

iske pahale

'before this'

mez ke uupar

'above/on the table'

ravi ke saath

'with Ravi'

Page 71: Introduction to Computational Linguistics

Chunking Numbers/Quantifiers (1/2)

Numerals, quantifiers may occur as follows

a) ek laDakaa 'one boy'

b) 1 laDakaa '1 boy'

c) pahalaa laDakaa 'first boy'

d) karoDoN log 'crores (tens of millions) of people'

e) 1962 meiN 'in 1962'

Page 72: Introduction to Computational Linguistics

Chunking Numbers/Quantifiers (2/2)

The POS tags for numerals and quantifiers are QC (numerals) and QF (other quantifiers) in IL POS tagset

Examples (d) and (e) on the previous slide show cases where the quantifier behaves like a noun

The issue :

Should the quantifiers in cases such as (d) and (e) be tagged as a Q* or as NN since the chunk itself is a noun chunk ?

Page 73: Introduction to Computational Linguistics

Summary

A scheme needs to be designed for annotating POS and chunks

While doing so, the following issues need to be considered:

Definition of 'chunk'

Elements which together can form a chunk type

Whether to include postpositions, punctuation etc inside a chunk or treat them as independent chunks

POS/Chunk tag labels

Page 74: Introduction to Computational Linguistics

Approaches in Computational Linguistics (for Tools)

Two major approaches

Rule based

Requires manually crafted rules

Explicit linguistic knowledge

Needs manual time and effort

Trained manpower

High precision

Less robust

Page 75: Introduction to Computational Linguistics

Approaches in Computational Linguistics (for Tools)

Data driven approach

Uses statistical methods or machine learning

Requires less human effort

Often requires large scale data sources (manually annotated corpora, lexicons etc)

Linguistic knowledge is implicit

More adaptive to noisy text

More robust

Page 76: Introduction to Computational Linguistics

Computational Linguistics Application Areas

Is useful for communication between

Man – machine

Question answering systems, interactive railway reservation

Text summarization

Web applications

Intelligent search engines

Cross lingual search

Man – man

Machine translation