
Introduction to Computational Linguistics

Dipti Misra Sharma

IIIT, Hyderabad

<[email protected]>

IASNLP 05-07-2012


Outline

Background

What is Computational Linguistics (CL)?

What do the Computational Linguists do?

What are the issues in processing natural languages?

What can we do with CL?

Approaches in CL?

Background

Language is a means of communication

Therefore, one can say

It encodes what is communicated <information>

We apply the processes of

Analysis (decoding) for understanding

Synthesis (encoding) for expression (speaking)

What do we communicate ?

Information (SPAIN delivered a football masterclass at Euro 2012)

Intention <purpose>

Emphasis/focus (Euro 2012 won by Spain / Spain bags Euro 2012)

Introduces variation

How do we communicate ?

We use linguistic elements such as

Words (country, park, the, is, Bandipur, of, as, and, considered, National, a, spot, beautiful, tourist, life, in, best, wild, sanctuaries, the, one)

Arrangement of the words (Sentences)

Words are related to each other to provide the composite meaning (Bandipur National park is a beautiful tourist spot and considered as one of the best wild life sanctuaries in the country)

How do we communicate ?

Arrangement of sentences (Discourse) Sentences or parts of sentences are related to each other to provide a cohesive meaning

*(Considered as one of the best wild life sanctuaries in the country. It is a national park covering an area of about 874 sq km. Bandipur National park is a beautiful tourist spot.)

(Bandipur National park is a beautiful tourist spot and considered as one of the best wild life sanctuaries in the country. It is a national park covering an area of about 874 sq km)

Languages differ in the way they organise information in these entities

All of these interact in the organisation of information

What is Computational Linguistics?

Computational linguistics is the scientific study of language from a computational perspective.

What does it mean?

Scientific: Provides explanation for a linguistic or psycholinguistic phenomenon

Computational: Develops computational models/techniques for linguistic phenomena

Human language is the subject of study

In other words

Computational linguistics is the application of linguistic theories and computational techniques to problems of natural language processing.

http://www.ba.umist.ac.uk/public/departments/registrars/academicoffice/uga/lang.htm

What do the Computational Linguists do?

Linguistic research

Develop language models for processing natural languages

Develop language resources for NLP research/applications

Understand and develop models for analysis and generation of natural languages by the computers

So,

A Computational Linguist needs to understand

How language works

What information is available in the language?

How do languages encode information?

How can this knowledge/information be represented for computational processing?

Information in Language (1/4)

Languages encode information

cuuhe maarate haiN kutte

rats kill dogs

Hindi sentence is ambiguous

Possible interpretations

Dogs kill rats

Rats kill dogs

However,

English sentence is not ambiguous


Ambiguity in Hindi is resolved if,

cuuhe maarate haiN kuttoN ko

rats kill dogs acc

English encodes information in positions

Hindi in morphemes

Languages encode information differently
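The contrast between positional and morphological encoding can be sketched in a few lines. This is only an illustration under strong assumptions: a toy three-word English clause, a two-word Hindi verb lexicon, and the marker 'ko' as the only case cue. The names `english_roles`, `hindi_readings` and `VERB_WORDS` are hypothetical and not part of the slides.

```python
VERB_WORDS = {"maarate", "haiN"}  # toy verb lexicon (assumption)


def english_roles(words):
    """English: roles are read off positions (subject, verb, object)."""
    subject, verb, obj = words
    return {"agent": subject, "patient": obj}


def hindi_readings(words):
    """Hindi: a noun followed by 'ko' is the patient wherever it stands.

    Without 'ko', both nouns remain candidates for either role,
    so two readings are returned.
    """
    nouns, patient = [], None
    for i, word in enumerate(words):
        if word == "ko" or word in VERB_WORDS:
            continue
        if i + 1 < len(words) and words[i + 1] == "ko":
            patient = word
        else:
            nouns.append(word)
    if patient is not None:
        return [{"agent": nouns[0], "patient": patient}]
    first, second = nouns
    return [
        {"agent": first, "patient": second},
        {"agent": second, "patient": first},
    ]


print(english_roles("rats kill dogs".split()))
print(hindi_readings("cuuhe maarate haiN kutte".split()))
print(hindi_readings("cuuhe maarate haiN kuttoN ko".split()))
```

The unmarked Hindi clause yields two readings, while adding 'ko' leaves exactly one, which mirrors the disambiguation shown above.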


Another example,

This chair has been sat on

– The chair has been used for sitting

– X sat on this chair, and it is known

– The sentence does not mention X

Languages encode information partially


English pronouns he, she, it

Hindi pronoun vaha

He is going to Delhi ==> vaha dillii jaa rahaa hai

She is going to Delhi ==> vaha dillii jaa rahii hai

It broke ==> vaha TuuTa ??

Information does not always map fully from one language into another

Conceptual worlds may be different

Differences ?

Words

English        Hindi                                Telugu

boys <n,pl>    laDake/laDakoN <n,sg/pl,case>

He/she/it      vaha                                 atanu/aame/adi

is/am/are      hai/huuN/haiN/ho

is going       jaa rahaa hai/rahii hai/rahe haiN

Indian Languages

Relatively flexible word order

1. a) baccaa phala khaataa hai

‘child’ ‘fruit’ ‘eat+hab’ ‘pres’

The child eats fruits

b) phala baccaa khaataa hai

c) phala khaataa hai baccaa

d) baccaa khaataa hai phala

Some structural differences

English

Declarative : Ravi is coming today

Interrogative : Is Ravi coming today ?

Change in the position of ‘is’ brings the change in meaning

Hindi

Declarative : ravi aaj aa rahaa hai

Interrogative : kyaa ravi aaj aa rahaa hai ?

Word ‘kyaa’ encodes the question information

Alternatively, more natural spoken form in Hindi

ravi aaj aa rahaa hai ? (with appropriate intonation)

OR

ravi aaj aa rahaa hai kyaa ?

Post nominal modification

'ing' clauses

I know [the man playing guitar]

Hindi, on the other hand

maiN [giTaar bajaa rahe vyakti ko] jaanataa huuN

Clauses having 'un-' negative constructions

English

Unless you reach there the job will not be done

Hindi

jab tak tum vahaaN nahiiN pahuNcate, kaam nahiiN hogaa

Languages Differ

Different languages have different mechanisms/devices to encode information

Some devices are common across certain languages and some are different

There are alternative ways of expressing the same meaning within the same language

Languages show preferences for one device over the others

English exploits ‘position’ for encoding information

Hindi uses ‘words’ more effectively

Thus, differences in grammatical structures

Ambiguity in Natural Language (1/2)

Look at the word 'plot' in the following examples

(a) The plot having rocks and boulders is not good.

(b) The plot having twists and turns is interesting.

'plot' in (a) means 'a piece of land' and in (b) 'an outline of the events in a story'
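How context words select a sense can be sketched with a simple cue-word overlap count. This is a minimal illustration, not a real disambiguation algorithm: the cue lists in `SENSE_CUES` are assumptions (only 'rocks', 'boulders', 'twists' and 'turns' come from the examples), and `disambiguate` is a hypothetical helper.

```python
# Cue words per sense of 'plot'. Words beyond those in the two example
# sentences are illustrative assumptions.
SENSE_CUES = {
    "piece of land": {"rocks", "boulders", "acre", "fence"},
    "story outline": {"twists", "turns", "novel", "characters"},
}


def disambiguate(sentence, cues=SENSE_CUES):
    """Pick the sense whose cue words overlap most with the sentence.

    Returns None when no cue word occurs at all.
    """
    tokens = set(sentence.lower().replace(".", "").split())
    scores = {sense: len(tokens & words) for sense, words in cues.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(disambiguate("The plot having rocks and boulders is not good."))
print(disambiguate("The plot having twists and turns is interesting."))
```

A sentence with no cue word, such as "The plot is interesting", returns None, which is exactly where world knowledge or wider context would be needed.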

Ambiguity in Natural Language (2/2)

Lexical level

Sentence level

Structural differences between SL and TL in a Machine Translation system.

Lexical ambiguity

Lexical ambiguity can be both for

Content words – nouns, verbs etc

Function words – prepositions, TAMs etc

Content words' ambiguity is of two types

Homonymy

Polysemy

Homonymy

A word has two or more unrelated senses

Example : I was walking on the bank (river-bank)

I deposited the money in the bank (money-bank)

Polysemy

A word having two or more related senses

Example : English word 'issue', noun

1. The issue is under discussion (muddaa)

2. The latest issue of the journal is out (aNka)

3. He buys stamps on the day of the issue (vimocan)

4. The couple has no issue even after five years of marriage (saNtaan)

Information Flow and Ambiguity

1. He scratched a figure on the rock (engrave)

2. She scratched the figure on the rock (scrape)

• Other words in the context make a difference

• Change of 'a' (in 1) to 'the' (in 2) changes the meaning of 'scratched'

Function words can also pose problems (1/4)

Function words can also be ambiguous

For example – English preposition 'in'

(a) I met him in the garden maiN usase bagiice meiN milaa

(b) I met him in the morning maiN usase subaha 0 milaa

'Ambiguity' here refers to the 'appropriate correspondence' in the target language.
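One way to picture this correspondence problem is a lookup keyed on the semantic class of the noun that follows 'in'. The sketch below is illustrative only: the two-entry lexicon `NOUN_CLASS`, the class labels and the function `translate_in` are assumptions, and a real MT system would need far richer information.

```python
# Toy noun lexicon with semantic classes (assumption).
NOUN_CLASS = {"garden": "place", "morning": "time"}

# Hindi correspondence of English 'in' per noun class.
# The empty string stands for the '0' (no overt postposition) above.
IN_BY_CLASS = {"place": "meiN", "time": ""}


def translate_in(noun):
    """Return the Hindi postposition for 'in <noun>', or None if unknown."""
    noun_class = NOUN_CLASS.get(noun)
    if noun_class is None:
        return None
    return IN_BY_CLASS[noun_class]


print(repr(translate_in("garden")))
print(repr(translate_in("morning")))
```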


1. He bought a shirt with tiny collars.

usane chote kaular vaalii kamiiz khariidii

‘he tiny collars with shirt bought’

‘with’ gets translated as ‘vaalii’ in Hindi

2. He washed a shirt with soap.

usane saabun se kamiiz dhoii

‘he soap with shirt washed’

‘with’ gets translated as ‘se’ .


TAM Markers mark tense, aspect and modality

– Consist of inflections and/or auxiliary verbs in Hindi

– An important source of information

– Narrow down the meaning of a verb (eg. lied, lay)


English Simple Past vs Habitual

1a. He stayed in the guest house during his visit to our University in Jan (rahaa)

1b. He stayed in the guest house whenever he visited us (rahataa thaa)

2a. He went to the school just now (gayaa)

2b. He went to the school everyday (jaataa thaa)

Sentence level ambiguity

I met the girl in the store

+ Possible readings

a) I met the girl who works in the store

b) I met the girl while I was in the store

Time flies like an arrow.

+ Possible parses:

a) Time flies like an arrow (N V Prep Det N)

b) Time flies like an arrow (N N V Det N)

c) Time flies like an arrow (V N Prep Det N) (flies are like an arrow)

d) Time flies like an arrow (V N Prep Det N) (manner of timing)
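The source of these competing parses can be made concrete by enumerating every tag sequence that an ambiguous lexicon allows. The sketch assumes a small hand-written `LEXICON` (the tag choices per word are illustrative) and a hypothetical helper `candidate_taggings`; a parser would still have to filter the sequences that form a grammatical sentence.

```python
from itertools import product

# Possible word classes per word (illustrative assumption).
LEXICON = {
    "time": ["N", "V"],
    "flies": ["N", "V"],
    "like": ["Prep", "V"],
    "an": ["Det"],
    "arrow": ["N"],
}


def candidate_taggings(sentence, lexicon=LEXICON):
    """Return every tag sequence licensed by the lexicon for the sentence."""
    words = sentence.lower().split()
    return [tuple(tags) for tags in product(*(lexicon[w] for w in words))]


taggings = candidate_taggings("Time flies like an arrow")
print(len(taggings))
for tags in taggings:
    print(" ".join(tags))
```

With two options each for 'time', 'flies' and 'like', the lexicon alone licenses 2 x 2 x 2 = 8 sequences, among them the three distinct tag patterns listed above.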

Thus,

Languages encode information differently

Languages code information only partially

Tension between BREVITY and PRECISION

Brevity wins, leading to inherent ambiguity at different levels

Human beings use

World knowledge

Context (both linguistic and extra-linguistic)

Cultural knowledge and

Language conventions to resolve ambiguities

Can all this knowledge be provided to the machine ? Computational Linguistics aims for this.

How to provide this knowledge ? (1/2)

Analyse language at various levels (word, phrase, sentence etc)

Build Tools for analysing the natural language at various levels in a text

POS tagger (category marking)

Morphological analysers (analysis of a word)

Morphological generators (word generators)

Chunkers (shallow parsers)

Parsers (syntactic analysis)

Filters (markers for special expressions)

Sense Disambiguation Algorithms

Etc

The tools need linguistic knowledge

How to provide this knowledge ? (2/2)

Build language resources

Machine Readable Lexicon

Rules for various levels of linguistic analysis

Computational Grammars

Mapping rules for the concerned language pair for an MT system

Sense Disambiguation Rules

Annotated corpora

Etc

POS Tagger

What is a POS? Take the following English sentence

My old friend Ram recently bought a book on Indian snakes for his cousin from London from the new bookshop .

Each word in the above sentence belongs to a word class, also called a part of speech (POS)

The class to which a word may belong is based on its morphological and syntactic behavior

Morphological

Kind of affixes a word takes, for example,

boy, boys; girl, girls; book, books (noun class)

Syntactic

How it is distributed in a sentence

He chairs the next session (verb)

The chairs are new (noun)

Why is POS relevant in CL/NLP ? (1/2)

• Word class information of a given word in a sentence helps to predict its neighbour

• WSD

He runs a mile every day (verb)

Their team made 250 runs (noun)

Time flies like an arrow (n v prep det n)

• Helps in further processing – chunking, morph pruning, sentence parsing

• IR

A POS tagger automatically marks the POS of all the words in a text

POS tagged sentence

My possessive pronoun

old adjective

friend noun

Ram proper noun

recently adverb

bought verb

a determiner

book noun

on preposition

Indian adjective

snakes noun

for preposition

his possessive pronoun

cousin noun

from preposition

London proper noun

, punctuation

from preposition

the determiner

new adjective

bookshop noun

in preposition

town noun

POS Tagging Approaches

Rule Based

Statistical

Transformation Based

Rule Based POS Tagging

Two-stage architecture (Harris, 1962; Klein and Simmons, 1963; Green and Rubin, 1971)

Stage 1: assign POS by referring to the dictionary

E.g. dictionary entry for the English word 'that'

that Conj, Adv, Pronoun

Stage 2: disambiguate, using manually crafted rules
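The two stages can be sketched as a dictionary lookup followed by hand-written rules. Only the entry for 'that' comes from the slide; the rest of `DICTIONARY`, the default tag for unknown words and the three rules in `stage2` are illustrative assumptions, not the rules of any of the cited systems.

```python
DICTIONARY = {
    "that": ["Conj", "Adv", "Pronoun"],  # entry from the slide
    # Remaining entries are a toy lexicon (assumption).
    "i": ["Pronoun"], "he": ["Pronoun"], "it": ["Pronoun"],
    "know": ["Verb"], "is": ["Verb"], "came": ["Verb"],
    "not": ["Adv"], "big": ["Adj"],
}


def stage1(words):
    """Stage 1: list all dictionary tags; unknown words default to Noun."""
    return [DICTIONARY.get(w.lower(), ["Noun"]) for w in words]


def stage2(words, candidates):
    """Stage 2: resolve ambiguous words with hand-written context rules."""
    tags = []
    for i, (word, options) in enumerate(zip(words, candidates)):
        if len(options) == 1:
            tags.append(options[0])
            continue
        following = candidates[i + 1] if i + 1 < len(words) else []
        if word.lower() == "that":
            if not following or following == ["Verb"]:
                choice = "Pronoun"   # "I know that", "that is big"
            elif following == ["Adj"]:
                choice = "Adv"       # "not that big"
            else:
                choice = "Conj"      # "I know that he came"
        else:
            choice = options[0]      # no rule: fall back to the first tag
        tags.append(choice)
    return tags


def tag(sentence):
    words = sentence.split()
    return list(zip(words, stage2(words, stage1(words))))


print(tag("I know that he came"))
print(tag("It is not that big"))
print(tag("I know that"))
```

Every new ambiguous word needs its own rules, which is where the manual effort of the rule-based approach goes.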

Statistical

Taggers use probabilities for tagging

The tagger picks the most likely tag for a given word in a context

HMM-based algorithms are the most commonly used for the POS tagging task

Requires manually tagged corpus
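A bigram HMM tagger with Viterbi decoding can be sketched as follows. The four-sentence corpus `TRAIN` is invented for illustration, the add-one smoothing is just one simple choice, and the function names are hypothetical; a usable tagger would be trained on a large manually tagged corpus.

```python
import math
from collections import Counter, defaultdict

# Tiny hand-tagged corpus (invented for illustration).
TRAIN = [
    [("he", "PRP"), ("runs", "VBZ"), ("a", "DT"), ("mile", "NN")],
    [("the", "DT"), ("team", "NN"), ("made", "VBD"), ("runs", "NNS")],
    [("he", "PRP"), ("made", "VBD"), ("a", "DT"), ("book", "NN")],
    [("the", "DT"), ("team", "NN"), ("runs", "VBZ")],
]


def train(corpus):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    trans, emit = defaultdict(Counter), defaultdict(Counter)
    for sentence in corpus:
        prev = "<s>"
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    return trans, emit


def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability of `key` under `counter`."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def viterbi(words, trans, emit):
    """Return the most likely tag sequence for `words`."""
    tags = sorted(emit)
    n_tags = len(tags)
    # Vocabulary size plus one slot for unknown words.
    n_words = len({w for c in emit.values() for w in c}) + 1
    # Each column maps tag -> (best log score, best previous tag).
    columns = [{
        t: (log_prob(trans["<s>"], t, n_tags)
            + log_prob(emit[t], words[0], n_words), None)
        for t in tags
    }]
    for word in words[1:]:
        column = {}
        for t in tags:
            score, prev = max(
                (columns[-1][p][0] + log_prob(trans[p], t, n_tags), p)
                for p in tags
            )
            column[t] = (score + log_prob(emit[t], word, n_words), prev)
        columns.append(column)
    # Follow the back-pointers from the best final tag.
    best = max(tags, key=lambda t: columns[-1][t][0])
    path = [best]
    for column in reversed(columns[1:]):
        best = column[best][1]
        path.append(best)
    return list(reversed(path))


trans, emit = train(TRAIN)
print(viterbi(["he", "runs", "a", "mile"], trans, emit))
print(viterbi(["the", "team", "made", "runs"], trans, emit))
```

On this toy data the word 'runs' is tagged as a verb after 'he' and as a plural noun after 'made', because the transition counts differ in the two contexts.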

Annotating Corpus for POS

Annotated corpora are useful for developing statistical POS taggers

Tagging scheme

Set of POS Tags

Guidelines for the annotators

The tagged corpora should be

High quality (in terms of tagging accuracy)

Consistent

POS Tags for English

English

Penn Tree Bank – 45 tags

C5 - Lancaster – 61 tags – used in CLAWS

Basic tagset used for BNC http://view.byu.edu/bnc_tags.htm

- C7 – 147 tags – Leech

http://www.comp.lancs.ac.uk/ucrel/claws7tags.html

Penn Treebank Tags

My PP$

old JJ

friend NN

Ram NNP

recently RB

bought VBD

a DT

book NN

on IN

Indian JJ

snakes NNS

for IN

his PP$

cousin NN

from IN

London NNP

, ,

from IN

the DT

new JJ

bookshop NN

in IN

town NN

POS Tags for Indian Languages

Objective

To arrive at a standard POS and Chunk tagging scheme for all Indian languages

Assumption

Commonality in Indian Languages

Issues in Tag Set Design (1/2)

Linguistic knowledge: coarse vs fine

Syntactic function vs lexical category (for POS tags)

New tags vs tags close to existing English tags

Should be comprehensive/complete

Issues in Tag Set Design (2/2)

Simple

Less effort in manual tagging

Number of tags

Common for all Indian languages

Linguistic Knowledge :Fine vs Coarse (1/2)

Example

Only noun (NN): laDakA, laDake, laDakoM, laDakI, laDakiyAM, laDakiyoM

OR

Noun with gender, number, case information

(NNM) laDakA, laDake, laDakoM

(NNMS) laDakA, laDake

(NNMP) laDake, laDakoM

(NNMSD) laDakA

(NNMSO) laDake

(NNMPD) laDake

(NNMPO) laDakoM

The decision has implications for the size of corpora and machine learning

Linguistic Knowledge :Fine vs Coarse (2/2)

Alternatives

Coarse - NN (advantages/disadvantages)

Fine - NNMSD (advantages/disadvantages)

Hierarchical

Example: NN_m_sg_d

Hierarchical tag set provides the possibility for underspecification
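Underspecification with a hierarchical tag can be sketched by splitting the tag into features and comparing only the features a query specifies. The field names in `FIELDS` and the helpers `parse_tag` and `matches` are assumptions for illustration; in this sketch a tag can only be underspecified from the right.

```python
# Feature order in a tag such as NN_m_sg_d (field names are assumptions).
FIELDS = ("category", "gender", "number", "case")


def parse_tag(tag):
    """Split a hierarchical tag into a feature dictionary."""
    return dict(zip(FIELDS, tag.split("_")))


def matches(query, tag):
    """True if every feature specified in `query` agrees with `tag`."""
    query_features, tag_features = parse_tag(query), parse_tag(tag)
    return all(tag_features.get(k) == v for k, v in query_features.items())


print(parse_tag("NN_m_sg_d"))
print(matches("NN", "NN_m_sg_d"))      # coarse query matches the fine tag
print(matches("NN_m", "NN_m_sg_d"))    # partially specified query
print(matches("NN_f", "NN_m_sg_d"))    # gender conflict
```

An annotator or a learner can thus work at the coarse level (NN) while the same corpus still supports fine-grained queries where the features are present.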

Considerations

POS tagger is NOT a replacement for a morph analyzer

Coarse analysis to begin with

Expandable if needed

If the information can be obtained from elsewhere, it need not be included in the POS tag

Syntactic function vs lexical category

Example

harijana bAlaka ‘harijan’ ‘child’

Decision : Lexical category

Helps achieve

Consistency in annotation

Better learning

New tags vs tags close to existing English tags

New tags

Noun, Pron, Adj, Adv

Familiar tags (Penn Treebank tags)

NN, PRP, JJ, RB

Decision : Penn tags for common lexical types

New tags for certain IL specific cases

Comprehensive/Complete

All the lexical items occurring in a sentence should be marked for their POS, including punctuation.

If the language has some special cases, these should also be captured – Reduplications in ILs

Simple

Why simple ?

The tags are designed for some manual annotation

Ease of learning

Consistency in annotation

Less Effort in Manual Tagging

The annotators should not have to

Write too much

Take too many steps in annotating a lexical item

Number of Tags

Number of tags makes a difference both for the man and the machine

For the man in decision making

For the machine in learning for automatic tagging

Common for All Indian Languages

Indian languages belong to various language families

Share linguistic features

However, there are differences

Some languages have quotatives, some don't

Some have classifiers, some don't

Chunking

What forms a chunk ?

Non-recursive phrase ((det adj noun))

Partial structure without distorting the dependencies

Include inflections (postposition/auxiliaries) with a lexical category

Example : ((mere choTe bhaaii ne))_NP

((jaa rahaa hai))_VG

Chunker

A Chunker automatically groups words in a sentence as chunks and labels them

((My old friend Ram))_NP ((recently bought))_VG ((a book))_NP on ((Indian snakes))_NP for ((his cousin))_NP from ((London))_NP from ((the new bookshop))_NP.
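A rule-based chunker over POS-tagged input can be sketched by grouping adjacent words whose tags belong to the same chunk type. The tag sets `NP_TAGS` and `VG_TAGS`, the rule that a determiner or possessive opens a new noun chunk, and the function names are assumptions; the adverb is kept inside the verb group only to follow the grouping in the example above.

```python
# Which POS tags may appear inside which chunk type (assumption).
NP_TAGS = {"PP$", "DT", "JJ", "NN", "NNS", "NNP"}
VG_TAGS = {"RB", "VBD"}
NP_STARTERS = {"PP$", "DT"}  # these always open a new noun chunk


def chunk_label(tag):
    if tag in NP_TAGS:
        return "NP"
    if tag in VG_TAGS:
        return "VG"
    return None  # e.g. prepositions stay outside any chunk


def chunk(tagged):
    """Group (word, tag) pairs into a list of (label, [words])."""
    chunks = []
    for word, tag in tagged:
        label = chunk_label(tag)
        continues = bool(chunks) and label is not None and chunks[-1][0] == label
        if continues and tag not in NP_STARTERS:
            chunks[-1][1].append(word)
        else:
            chunks.append((label, [word]))
    return chunks


def render(chunks):
    """Print chunks in the ((...))_LABEL notation used on the slides."""
    return " ".join(
        f"(({' '.join(words)}))_{label}" if label else " ".join(words)
        for label, words in chunks
    )


tagged = [
    ("My", "PP$"), ("old", "JJ"), ("friend", "NN"), ("Ram", "NNP"),
    ("recently", "RB"), ("bought", "VBD"), ("a", "DT"), ("book", "NN"),
    ("on", "IN"), ("Indian", "JJ"), ("snakes", "NNS"),
]
print(render(chunk(tagged)))
```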

IL Chunk Tags (1/2)

NP      noun chunk              bahut acchii kitaab

JJP     adjective chunk         bahut sundar sii

RBP     adverb chunk            dhiire – dhiire

NEGP    chunk for negatives     nahiiN

CCP     conjunct chunks         raam aur shyaam

BLK     miscellaneous           interjections etc

IL Chunk Tags (2/2)

VGF     Finite verb chunk                    jaa rahaa hai

VGNF    Non finite verb chunk                jaate hue

VGINF   Infinitive verb chunk                jaanaa

VGNN    Gerunds                              jaanaa

FRAGP   Discontiguous fragments of a chunk   raama (meraa bhaaii) ne

Some Issues

How to chunk the following ?

Adverbs

within a verb chunk or separately

Eg ((recently bought)) or ((recently)) ((bought))

Punctuations

Particles – hii (only), to, bhii (also) etc

Current approach

For punctuation – chunk them with the preceding chunk

Adverbs – chunk them separately

Particles – chunk them with the chunk to which they belong

((raam ne bhii)) ((jaa hii rahaa thaa))

Issues

• Verb Negation

1. nahiiN jaa rahaa ‘not going’

2. kahaa hii nahiiN ‘just did not mention’

3. kaha to nahiiN rahaa thaa ‘was not saying’ (emphatic)

4. binaa yaha baata kahe ‘without saying this’

5. yahii nahiiN, balki likhita ruup meiN bhii yah miltaa hai

‘Not only this, in fact, this is also found in writing’

Current approach

For cases 1 to 3, chunk NEG with the verb group

For 4, chunk the NEG separately in a chunk

For 5, also a separate NEGP chunk will work

NOUN NEGATION ???

Chunking Co-ordinate Constructions

1. word1 CC word2 raam aur shyaam

((raam))_NP ((aur))_CCP ((shyaam))_NP

2. phrase CC phrase

meraa bhaaii shyaam aur tumhaaraa bhaaii mohan

((meraa bhaaii shyaam))_NP ((aur))_CCP ((tumhaaraa bhaaii mohan))_NP

3. clause CC clause

Discontiguous Phrases

What about cases such as ' X (Y) Z' ?

where X = noun, Y = a phrase, Z = postposition

raam (meraa dillii vaalaa bhaaii) ne

OR

isa 'upanyaas – samraaT' shabda kaa

FRAGP

Chunking Conjunct Verbs

Conjunct verbs

A verb composed of a noun/adj and a verb (sviikaar karnaa 'accept')

Should the conjunct verbs be tagged as a single chunk or two chunks?

'prawIkSA karanA', 'kSamA karanA' etc

‘to wait’ ‘to forgive’

What about genitives ?

raam kaa betaa

'son of Ram'

usakaa betaa

'his/her son'

mere bhaaii raam kaa betaa

'my brother Ram's son'

iske pahale

'before this'

mez ke uupar

'above/on the table'

ravi ke saath

'with Ravi'

Chunking Numbers/Quantifiers (1/2)

Numerals, quantifiers may occur as follows

a) ek laDakaa 'one boy'

b) 1 laDakaa '1 boy'

c) pahalaa laDakaa 'first boy'

d) karoDoN log 'crores (tens of millions) of people'

e) 1962 meiN 'in 1962'

Chunking Numbers/Quantifiers (2/2)

The POS tags for numerals and quantifiers are QC (numerals) and QF (other quantifiers) in IL POS tagset

Examples (d) and (e) on the previous slide show cases where the quantifier behaves like a noun

The issue :

Should the quantifiers in cases such as (d) and (e) be tagged as a Q* or as NN since the chunk itself is a noun chunk ?

Summary

A scheme needs to be designed for annotating POS and chunks

While doing so, the following issues need to be considered:

Definition of 'chunk'

Elements which together can form a chunk type

Whether to include postpositions, punctuations etc inside a chunk or form them as independent chunks

POS/Chunk tag labels

Approaches in Computational Linguistics (for Tools)

Two major approaches

Rule based

Requires manually crafted rules

Explicit linguistic knowledge

Needs manual time and effort

Trained manpower

High precision

Less robust

Approaches in Computational Linguistics (for Tools)

Data driven approach

Uses statistical methods or machine learning

Requires less human effort

Often requires large scale data sources (manually annotated corpora, lexicons etc)

Linguistic knowledge is implicit

More adaptive to noisy text

More robust

Computational Linguistics Application Areas

Is useful for communication between

Man – machine

Question answering systems, interactive railway reservation

Text summarization

Web applications

Intelligent search engines

Cross lingual search

Man – man

Machine translation
