Introduction to Computational Linguistics
Dipti Misra Sharma
IIIT, Hyderabad
IASNLP 05-07-2012
Outline
Background
What is Computational Linguistics (CL)?
What do the Computational Linguists do?
What are the issues in processing natural languages?
What can we do with CL?
Approaches in CL?
Background
Language is a means of communication
Therefore, one can say
It encodes what is communicated <information>
We apply the processes of
Analysis (decoding) for understanding
Synthesis (encoding) for expression (speaking)
What do we communicate ?
Information (SPAIN delivered a football masterclass at Euro 2012)
Intention <purpose>
Emphasis/focus (Euro 2012 won by Spain / Spain bags Euro 2012)
Introduces variation
How do we communicate ?
We use linguistic elements such as
Words (country, park, the, is, Bandipur, of, as, and, considered, National, a, spot, beautiful, tourist, life, in, best, wild, sanctuaries, the, one)
Arrangement of the words (Sentences)
Words are related to each other to provide the composite meaning
(Bandipur National park is a beautiful tourist spot and considered as one of the best wild life sanctuaries in the country)
How do we communicate ?
Arrangement of sentences (Discourse) Sentences or parts of sentences are related to each other to provide a cohesive meaning
*(Considered as one of the best wild life sanctuaries in the country. It is a national park covering an area of about 874 sq km. Bandipur National park is a beautiful tourist spot.)
(Bandipur National park is a beautiful tourist spot and considered as one of the best wild life sanctuaries in the country. It is a national park covering an area of about 874 sq km.)
Languages differ in the way they organise information in these entities
All of these interact in the organisation of information
What is Computational Linguistics?
Computational linguistics is the scientific study of language from a computational perspective.
What does it mean?
Scientific: provides an explanation for a linguistic or psycholinguistic phenomenon
Computational: develops computational models/techniques for linguistic phenomena
Human language is the subject of study
In other words
Computational linguistics is the application of linguistic theories and computational techniques
to problems of natural language processing.
http://www.ba.umist.ac.uk/public/departments/registrars/academicoffice/uga/lang.htm
What do the Computational Linguists do?
Linguistic research
Develop language models for processing natural languages
Develop language resources for NLP research/applications
Understand and develop models for analysis and generation of natural languages by the computers
So,
A Computational Linguist needs to understand
How language works
What information is available in the language?
How do languages encode information?
How can this knowledge/information be represented for computational processing?
Information in Language (1/4)
Languages encode information
cuuhe maarate haiN kutte
rats kill dogs
The Hindi sentence is ambiguous
Possible interpretations
Dogs kill rats
Rats kill dogs
However,
English sentence is not ambiguous
Information in Language (2/4)
Ambiguity in Hindi is resolved if,
cuuhe maarate haiN kuttoN ko
rats kill dogs acc
English encodes information in positions
Hindi in morphemes
Languages encode information differently
Information in Language (3/4)
Another example,
This chair has been sat on
– The chair has been used for sitting
– X sat on this chair, and it is known
– The sentence does not mention X
Languages encode information partially
Information in Language (4/4)
English pronouns he, she, it
Hindi pronoun vaha
He is going to Delhi ==> vaha dillii jaa rahaa hai
She is going to Delhi ==> vaha dillii jaa rahii hai
It broke ==> vaha TuuTa ??
Information does not always map fully from one language into another
Conceptual worlds may be different
Differences ?
Words
English | Hindi | Telugu
boys <n,pl> | laDake/laDakoN <n,sg/pl,case> |
he/she/it | vaha | atanu/aame/adi
is/am/are | hai/huuN/haiN/ho |
is going | jaa rahaa hai/rahii hai/rahe haiN |
Indian Languages
Relatively flexible word order
1. a) baccaa phala khaataa hai
‘child’ ‘fruit’ ‘eat+hab’ ‘pres’
The child eats fruits
b) phala baccaa khaataa hai
c) phala khaataa hai baccaa
d) baccaa khaataa hai phala
Some structural differences
English
Declarative : Ravi is coming today
Interrogative : Is Ravi coming today ?
Change in the position of ‘is’ brings the change in meaning
Hindi
Declarative : ravi aaj aa rahaa hai
Interrogative : kyaa ravi aaj aa rahaa hai ?
Word ‘kyaa’ encodes the question information
Alternatively, more natural spoken form in Hindi
ravi aaj aa rahaa hai ? (with appropriate intonation)
OR
ravi aaj aa rahaa hai kyaa ?
Post nominal modification
'ing' clauses
I know [the man playing guitar]
Hindi, on the other hand
maiN [giTaar bajaa rahe vyakti ko] jaanataa huuN
Clauses having 'un-' negative constructions (e.g. 'unless')
English
Unless you reach there the job will not be done
Hindi
jab tak tum vahaaN nahiiN pahuNcate, kaam nahiiN hogaa
Languages Differ
Different languages have different mechanisms/devices to encode information
Some devices are common across certain languages and some are different
There are alternative ways of expressing the same meaning within the same language
Languages show preferences for one device over the others
English exploits ‘position’ for encoding information
Hindi uses ‘words’ more effectively
Thus, differences in grammatical structures
Ambiguity in Natural Language (1/2)
Look at the word 'plot' in the following examples
(a) The plot having rocks and boulders is not good.
(b) The plot having twists and turns is interesting.
'plot' in (a) means 'a piece of land' and in (b) 'an outline of the events in a story'
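One simple way to approximate this kind of disambiguation is to count how many words around 'plot' also appear in a short list of cue words for each sense. The sketch below only illustrates that overlap idea; the cue-word lists and the function name `disambiguate` are assumptions made for this example and do not come from the slides.

```python
# Toy sense disambiguation by context-word overlap.
# The cue words per sense are hand-picked assumptions for illustration.
SENSE_CUES = {
    "piece of land": {"rocks", "boulders", "soil", "acre", "fence"},
    "outline of events in a story": {"twists", "turns", "novel", "story", "characters"},
}


def disambiguate(sentence, cues=SENSE_CUES):
    """Return the sense whose cue words overlap most with the sentence."""
    words = {w.strip(".,").lower() for w in sentence.split()}
    scores = {sense: len(words & cue_set) for sense, cue_set in cues.items()}
    # On a tie, max() keeps the first sense listed in the dictionary.
    return max(scores, key=scores.get)


print(disambiguate("The plot having rocks and boulders is not good."))
print(disambiguate("The plot having twists and turns is interesting."))
```

A real system would need far larger sense inventories and context models; the point here is only that neighbouring words carry the disambiguating information.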
Ambiguity in Natural Language (2/2)
Lexical level
Sentence level
Structural differences between SL and TL in a Machine Translation system.
Lexical ambiguity
Lexical ambiguity can be both for
Content words – nouns, verbs etc
Function words – prepositions, TAMs etc
Content words' ambiguity is of two types
Homonymy
Polysemy
Homonymy
A word has two or more unrelated senses
Example : I was walking on the bank (river-bank)
I deposited the money in the bank (money-bank)
Polysemy
A word having two or more related senses
Example : English word 'issue', noun
1. The issue is under discussion (muddaa)
2. The latest issue of the journal is out (aNka)
3. He buys stamps on the day of the issue (vimocan)
4. The couple has no issue even after five years of marriage (saNtaan)
Information Flow and Ambiguity
1. He scratched a figure on the rock (engrave)
2. She scratched the figure on the rock (scrape)
• Other words in the context make a difference
• Change of 'a' (in 1) to 'the' (in 2) changes the meaning of 'scratched'
Function words can also pose problems (1/4)
Function words can also be ambiguous
For example – English preposition 'in'
(a) I met him in the garden maiN usase bagiice meiN milaa
(b) I met him in the morning maiN usase subaha 0 milaa
'Ambiguity' here refers to the 'appropriate correspondence' in the target language.
Function words can also pose problems (2/4)
1. He bought a shirt with tiny collars.
usane chote kaular vaalii kamiiz khariidii
‘he tiny collars with shirt bought’
‘with’ gets translated as ‘vaalii’ in Hindi
2. He washed a shirt with soap.
usane saabun se kamiiz dhoii
‘he soap with shirt washed’
‘with’ gets translated as ‘se’ .
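The choice between 'vaalii' and 'se' can be thought of as a mapping rule that looks at the noun governed by 'with'. The sketch below is a minimal illustration of such a rule; the semantic classes ("instrument", "attribute"), the small lexicon, and the function name `translate_with` are all assumptions introduced for this example.

```python
# Toy transfer rule for the English preposition "with" (English -> Hindi).
# The semantic classes below are hand-assigned assumptions for illustration.
NOUN_CLASS = {
    "soap": "instrument",
    "knife": "instrument",
    "collars": "attribute",
    "pockets": "attribute",
}

# Hindi correspondents taken from the two example sentences above.
WITH_TRANSLATION = {
    "instrument": "se",      # He washed a shirt with soap
    "attribute": "vaalii",   # He bought a shirt with tiny collars
}


def translate_with(object_noun):
    """Pick a Hindi correspondent for 'with' from the class of its object noun."""
    noun_class = NOUN_CLASS.get(object_noun.lower())
    # Return None when the lexicon has no entry, rather than guessing.
    return WITH_TRANSLATION.get(noun_class)


print(translate_with("soap"))
print(translate_with("collars"))
```

The form 'vaalii' here agrees with 'kamiiz', so a fuller rule would also have to handle agreement with the modified noun.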
Function words can also pose problems (3/4)
TAM Markers mark tense, aspect and modality
– Consist of inflections and/or auxiliary verbs in Hindi
– An important source of information
– Narrow down the meaning of a verb (eg. lied, lay)
Function words can also pose problems (4/4)
English Simple Past vs Habitual
1a. He stayed in the guest house during his visit to our University in Jan (rahaa)
1b. He stayed in the guest house whenever he visited us (rahataa thaa)
2a. He went to the school just now (gayaa)
2b. He went to the school everyday (jaataa thaa)
Sentence level ambiguity
I met the girl in the store
+ Possible readings
a) I met the girl who works in the store
b) I met the girl while I was in the store
Time flies like an arrow.
+ Possible parses:
a) Time flies like an arrow (N V Prep Det N)
b) Time flies like an arrow (N N V Det N)
c) Time flies like an arrow (V N Prep Det N) (flies are like an arrow)
d) Time flies like an arrow (V N Prep Det N) (manner of timing)
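Part of this ambiguity comes from individual words belonging to more than one category. As a rough illustration, the sketch below enumerates every category sequence that a small lexicon allows for the sentence. The lexicon entries and the function name `candidate_tag_sequences` are assumptions for this example; it lists candidates only and does not decide which ones are grammatical.

```python
from itertools import product

# Toy lexicon: the categories each word can take. The entries are an
# assumption for illustration, limited to the readings listed above.
LEXICON = {
    "time": ["N", "V"],
    "flies": ["N", "V"],
    "like": ["V", "Prep"],
    "an": ["Det"],
    "arrow": ["N"],
}


def candidate_tag_sequences(sentence, lexicon=LEXICON):
    """Return every tag sequence licensed by the lexicon, word by word."""
    words = sentence.lower().split()
    return [list(tags) for tags in product(*(lexicon[w] for w in words))]


sequences = candidate_tag_sequences("Time flies like an arrow")
print(len(sequences))
for seq in sequences:
    print(" ".join(seq))
```

Only some of the enumerated sequences correspond to acceptable parses; filtering the rest is the job of a tagger or parser.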
Thus,
Languages encode information differently
Languages code information only partially
Tension between BREVITY and PRECISION
Brevity wins, leading to inherent ambiguity at different levels
Human beings use
World knowledge
Context (both linguistic and extra-linguistic)
Cultural knowledge and
Language conventions to resolve ambiguities
Can all this knowledge be provided to the machine ? Computational Linguistics aims for this.
How to provide this knowledge ? (1/2)
Analyse language at various levels (word, phrase, sentence etc)
Build Tools for analysing the natural language at various levels in a text
POS tagger (category marking)
Morphological analysers (analysis of a word)
Morphological generators (word generators)
Chunkers (shallow parsers)
Parsers (syntactic analysis)
Filters (markers for special expressions)
Sense Disambiguation Algorithms
Etc
The tools need linguistic knowledge
How to provide this knowledge ? (2/2)
Build language resources
Machine Readable Lexicon
Rules for various levels of linguistic analysis
Computational Grammars
Mapping rules for the concerned language pair for an MT system
Sense Disambiguation Rules
Annotated corpora
Etc
POS Tagger
What is a POS? Take the following English sentence
My old friend Ram recently bought a book on Indian snakes for his cousin from London from the new bookshop .
Each word in the above sentence belongs to a word class (also called a Part Of Speech (POS))
The class to which a word may belong is based on its morphological and syntactic behavior
Morphological
Kind of affixes a word takes, for example,
boy, boys; girl, girls; book, books (noun class)
Syntactic
How it is distributed in a sentence
He chairs the next session (verb)
The chairs are new (noun)
Why is POS relevant in CL/NLP ? (1/2)
• Word class information of a given word in a sentence helps to predict its neighbour
• Word sense disambiguation (WSD)
He runs a mile every day (verb)
Their team made 250 runs (noun)
Time flies like an arrow (n v prep det n)
• Helps in further processing – chunking, morph pruning, sentence parsing
• Information retrieval (IR)
A POS tagger automatically marks the POS of all the words in a text
POS tagged sentence
My possessive pronoun
old adjective
friend noun
Ram proper noun
recently adverb
bought verb
a determiner
book noun
on preposition
Indian adjective
snakes noun
for preposition
his possessive pronoun
cousin noun
from preposition
London proper noun
, punctuation
from preposition
the determiner
new adjective
bookshop noun
in preposition
town noun
POS Tagging Approaches
Rule Based
Statistical
Transformation Based
Rule Based POS Tagging
Two-stage architecture algorithms
(Harris, 1962; Klein and Simmons, 1963; Green and Rubin, 1971)
Stage 1: assign POS by referring to the dictionary
Eg Dictionary entry for Eng word 'that'
that Conj, Adv, Pronoun
Stage 2: disambiguate, using manually crafted rules
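The two stages described above can be sketched in a few lines. This is a minimal illustration, not a reconstruction of any of the cited systems: the dictionary entries, the tag names, the two rules, and the function names `apply_rules` and `tag` are all assumptions chosen to handle the 'chairs' examples from the earlier slide.

```python
# Stage 1 resource: a toy dictionary listing every possible tag per word.
# Entries and tag names are assumptions made for this illustration.
DICTIONARY = {
    "he": ["Pron"],
    "the": ["Det"],
    "chairs": ["Noun", "Verb"],
    "next": ["Adj"],
    "session": ["Noun"],
    "are": ["Verb"],
    "new": ["Adj"],
}


def apply_rules(candidates, previous_tag):
    """Stage 2: hand-written rules choosing among the candidate tags."""
    if len(candidates) == 1:
        return candidates[0]
    # Rule 1: after a determiner, prefer the noun reading.
    if previous_tag == "Det" and "Noun" in candidates:
        return "Noun"
    # Rule 2: after a pronoun, prefer the verb reading.
    if previous_tag == "Pron" and "Verb" in candidates:
        return "Verb"
    # Fallback: keep the first tag listed in the dictionary.
    return candidates[0]


def tag(sentence):
    """Run both stages left to right and return (word, tag) pairs."""
    tagged = []
    previous_tag = None
    for word in sentence.lower().split():
        candidates = DICTIONARY.get(word, ["Unknown"])  # Stage 1: lookup
        chosen = apply_rules(candidates, previous_tag)   # Stage 2: rules
        tagged.append((word, chosen))
        previous_tag = chosen
    return tagged


print(tag("He chairs the next session"))
print(tag("The chairs are new"))
```

Even this tiny version shows the cost of the approach: every new ambiguity needs another manually written rule.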
Statistical
Taggers use probabilities for tagging
The tagger picks the most likely tag for a given word in a context
HMM-based algorithms are most commonly used for the POS tagging task
Requires a manually tagged corpus
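To make "picks the most likely tag in a context" concrete, the sketch below runs Viterbi decoding over a toy hidden Markov model. All probabilities are invented for illustration (in practice they would be estimated from the manually tagged corpus), and the tag names and the `UNSEEN` smoothing constant are assumptions of this example.

```python
import math

TAGS = ["Det", "Noun", "Verb", "Pron"]

# Toy model parameters. In practice these are estimated from a manually
# tagged corpus; the numbers here are invented for illustration only.
START = {"Det": 0.5, "Noun": 0.1, "Verb": 0.1, "Pron": 0.3}
TRANSITION = {
    "Det": {"Det": 0.01, "Noun": 0.9, "Verb": 0.08, "Pron": 0.01},
    "Noun": {"Det": 0.1, "Noun": 0.2, "Verb": 0.6, "Pron": 0.1},
    "Verb": {"Det": 0.5, "Noun": 0.3, "Verb": 0.1, "Pron": 0.1},
    "Pron": {"Det": 0.05, "Noun": 0.05, "Verb": 0.85, "Pron": 0.05},
}
EMISSION = {
    "Det": {"the": 0.9},
    "Noun": {"chairs": 0.3, "session": 0.3},
    "Verb": {"chairs": 0.1, "are": 0.5},
    "Pron": {"he": 0.8},
}
UNSEEN = 1e-6  # small probability for word/tag pairs missing from the table


def viterbi(words):
    """Return the most likely tag sequence for the words under the toy HMM."""
    # best[i][tag] = (log-probability of best path ending in tag, backpointer)
    best = [{}]
    for tag in TAGS:
        emit = EMISSION[tag].get(words[0], UNSEEN)
        best[0][tag] = (math.log(START[tag]) + math.log(emit), None)
    for i in range(1, len(words)):
        best.append({})
        for tag in TAGS:
            emit = math.log(EMISSION[tag].get(words[i], UNSEEN))
            score, prev = max(
                (best[i - 1][p][0] + math.log(TRANSITION[p][tag]) + emit, p)
                for p in TAGS
            )
            best[i][tag] = (score, prev)
    # Trace back from the best final tag.
    last = max(TAGS, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))


print(viterbi(["he", "chairs", "the", "session"]))
print(viterbi(["the", "chairs", "are"]))
```

With these toy numbers, 'chairs' comes out as a verb after 'he' and as a noun after 'the', because the transition probabilities outweigh the word's own tag preference.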
Annotating Corpus for POS
Annotated corpora are useful for developing statistical POS taggers
Tagging scheme
Set of POS Tags
Guidelines for the annotators
The tagged corpora should be
High quality (in terms of tagging accuracy)
Consistent
POS Tags for English
English
Penn Treebank – 45 tags
C5 (Lancaster) – 61 tags – used in CLAWS
Basic tagset used for BNC http://view.byu.edu/bnc_tags.htm
C7 – 147 tags – Leech
http://www.comp.lancs.ac.uk/ucrel/claws7tags.html
Penn Treebank Tags
My PP$
old JJ
friend NN
Ram NNP
recently RB
bought VBD
a DT
book NN
on IN
Indian JJ
snakes NNS
for IN
his PP$
cousin NN
from IN
London NNP
, ,
from IN
the DT
new JJ
bookshop NN
in IN
town NN
POS Tags for Indian Languages
Objective
To arrive at a standard POS and Chunk tagging scheme for all Indian languages
Assumption
Commonality in Indian Languages
Issues in Tag Set Design (1/2)
Linguistic knowledge: coarse vs fine
Syntactic function vs lexical category (for POS tags)
New tags vs tags close to existing English tags
Should be comprehensive/complete
Issues in Tag Set Design (2/2)
Simple
Less effort in manual tagging
Number of tags
Common for all Indian languages
Linguistic Knowledge :Fine vs Coarse (1/2)
Example
Only noun (NN): laDakA, laDake, laDakoM, laDakI, laDakiyAM, laDakiyoM
OR
Noun with gender, number, case information
(NNM) laDakA, laDake, laDakoM
(NNMS) laDakA, laDake
(NNMP) laDake, laDakoM
(NNMSD) laDakA
(NNMSO) laDake
(NNMPD) laDake
(NNMPO) laDakoM
The decision has implications for the size of corpora and machine learning
Linguistic Knowledge :Fine vs Coarse (2/2)
Alternatives
Coarse - NN (advantages/disadvantages)
Fine - NNMSD (advantages/disadvantages)
Hierarchical
Example: NN_m_sg_d
Hierarchical tag set provides the possibility for underspecification
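Underspecification with a hierarchical tag can be illustrated by treating the tag as an ordered list of features, where trailing features may simply be left out. In the sketch below, the feature order (category, gender, number, case) is read off the example NN_m_sg_d; the function names `parse_tag` and `subsumes` are assumptions made for this example.

```python
# Feature order for the hierarchical tag, following the example NN_m_sg_d:
# category, gender, number, case. The helper names are assumptions.
FEATURES = ["category", "gender", "number", "case"]


def parse_tag(tag):
    """Split a hierarchical tag into a feature dictionary.

    Missing trailing features are left out, i.e. underspecified.
    """
    return dict(zip(FEATURES, tag.split("_")))


def subsumes(general, specific):
    """True if every feature given in `general` has the same value in `specific`."""
    general_feats = parse_tag(general)
    specific_feats = parse_tag(specific)
    return all(specific_feats.get(k) == v for k, v in general_feats.items())


print(parse_tag("NN_m_sg_d"))
print(subsumes("NN", "NN_m_sg_d"))
print(subsumes("NN_m_sg", "NN_m_sg_d"))
print(subsumes("NN_m_pl", "NN_m_sg_d"))
```

An annotator or tool that is unsure about case can stop at NN_m_sg, and the coarse tag NN still matches every finer variant.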
Considerations
POS tagger is NOT a replacement for a morph analyzer
Coarse analysis to begin with
Expandable if needed
If the information can be obtained from elsewhere, it need not be included in the POS tag
Syntactic function vs lexical category
Example
harijana bAlaka ‘harijan’ ‘child’
Decision : Lexical category
Helps achieve
Consistency in annotation
Better learning
New tags vs tags close to existing English tags
New tags
Noun, Pron, Adj, Adv
Familiar tags (Penn Treebank tags)
NN, PRP, JJ, RB
Decision : Penn tags for common lexical types
New tags for certain IL specific cases
Comprehensive/Complete
All the lexical items occurring in a sentence should be marked for their POS, including punctuation.
If the language has some special cases, these should also be captured – Reduplications in ILs
Simple
Why simple ?
The tags are designed for some manual annotation
Ease of learning
Consistency in annotation
Less Effort in Manual Tagging
The annotators should not have to
Write too much
Take too many steps in annotating a lexical item
Number of Tags
Number of tags makes a difference both for the human annotator and for the machine
For the annotator in decision making
For the machine in learning for automatic tagging
Common for All Indian Languages
Indian languages belong to various language families
Share linguistic features
However, there are differences
Some languages have quotatives, some don't
Some have classifiers, some don't
Chunking
What forms a chunk ?
Non-recursive phrase ((det adj noun))
Partial structure without distorting the dependencies
Include inflections (postposition/auxiliaries) with a lexical category
Example : ((mere choTe bhaaii ne))_NP
((jaa rahaa hai))_VG
Chunker
A Chunker automatically groups words in a sentence as chunks and labels them
((My old friend Ram))_NP ((recently bought))_VG ((a book))_NP on ((Indian snakes))_NP for ((his cousin))_NP from ((London))_NP from ((the new bookshop))_NP.
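A very small rule-based chunker can be written as a left-to-right scan over POS-tagged words. The sketch below handles noun chunks only and is a simplification: the grouping rule (optional determiners, possessives and adjectives followed by one or more nouns), the tag sets, and the function name `chunk_noun_phrases` are assumptions for this example, with tag names following the Penn-style tags used in the tagged example earlier.

```python
# Tags that may start or continue a noun chunk before its head noun.
# The tag names follow the Penn-style tags in the tagged example above;
# the grouping rule itself is a simplified assumption.
PRE_NOUN = {"DT", "PP$", "JJ"}
NOUN = {"NN", "NNS", "NNP"}


def chunk_noun_phrases(tagged):
    """Group runs of (determiner/possessive/adjective)* noun+ into NP chunks.

    Returns a list whose items are either ("NP", [words]) or a plain
    (word, tag) pair for tokens left outside any noun chunk.
    """
    result = []
    i = 0
    while i < len(tagged):
        j = i
        while j < len(tagged) and tagged[j][1] in PRE_NOUN:
            j += 1
        k = j
        while k < len(tagged) and tagged[k][1] in NOUN:
            k += 1
        if k > j:  # at least one noun found: emit the whole span as an NP
            result.append(("NP", [word for word, _ in tagged[i:k]]))
            i = k
        else:      # no noun follows: leave the current token unchunked
            result.append(tagged[i])
            i += 1
    return result


sentence = [
    ("My", "PP$"), ("old", "JJ"), ("friend", "NN"), ("Ram", "NNP"),
    ("recently", "RB"), ("bought", "VBD"), ("a", "DT"), ("book", "NN"),
    ("on", "IN"), ("Indian", "JJ"), ("snakes", "NNS"),
]
for item in chunk_noun_phrases(sentence):
    print(item)
```

Verb groups, postpositions and particles would need further rules of the same kind, which is where the annotation decisions discussed in the following slides come in.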
IL Chunk Tags (1/2)
NP noun chunk bahut acchii kitaab
JJP adjective chunk bahut sundar sii
RBP adverb chunk dhiire-dhiire
NEGP chunk for negatives nahiiN
CCP conjunct chunks raam aur shyaam
BLK miscellaneous interjections etc
IL Chunk Tags (2/2)
VGF Finite verb chunk jaa rahaa hai
VGNF Non finite verb chunk jaate hue
VGINF Infinitive verb chunk jaanaa
VGNN Gerunds jaanaa
FRAGP Discontiguous fragments of a chunk raama (meraa bhaaii) ne
Some Issues
How to chunk the following ?
Adverbs
within a verb chunk or separately
Eg ((recently bought)) or ((recently)) ((bought))
Punctuations
Particles – hii (only), to, bhii (also) etc
Current approach
For punctuation – chunk them with the preceding chunk
Adverbs – chunk them separately
Particles – chunk them with the chunk to which they belong
((raam ne bhii)) ((jaa hii rahaa thaa))
Issues
• Verb Negation
1. nahiiN jaa rahaa ‘not going’
2. kahaa hii nahiiN ‘just did not mention’
3. kaha to nahiiN rahaa thaa ‘was not saying’ (emphatic)
4. binaa yaha baata kahe ‘without saying this’
5. yahii nahiiN, balki likhita ruup meiN bhii yah miltaa hai ‘Not only this, in fact, this is also found in writing’
Current approach
For cases 1 to 3, chunk NEG with the verb group
For 4, chunk the NEG separately, in its own chunk
For 5, a separate NEGP chunk will also work
NOUN NEGATION ???
Chunking Co-ordinate Constructions
1. word1 CC word2 raam aur shyaam
((raam))_NP ((aur))_CCP ((shyaam))_NP
2. phrase CC phrase
meraa bhaaii shyaam aur tumhaaraa bhaaii mohan
((meraa bhaaii shyaam))_NP ((aur))_CCP ((tumhaaraa bhaaii mohan))_NP
3. clause CC clause
Discontiguous Phrases
What about cases such as ' X (Y) Z' ?
where X = noun, Y = a phrase, Z = postposition
raam (meraa dillii vaalaa bhaaii) ne
OR
isa 'upanyaas – samraaT' shabda kaa
FRAGP
Chunking Conjunct Verbs
Conjunct verbs
A verb composed of a noun/adj and a verb (sviikaar karnaa 'accept')
Should the conjunct verbs be tagged as a single chunk or two chunks?
'prawIkSA karanA', 'kSamA karanA' etc
‘to wait’ ‘to forgive’
What about genitives ?
raam kaa betaa
'Ram's son'
usakaa betaa
'his/her son'
mere bhaaii raam kaa betaa
'my brother Ram's son'
iske pahale
'before this'
mez ke uupar
'above/on the table'
ravi ke saath
'with Ravi'
Chunking Numbers/Quantifiers (1/2)
Numerals, quantifiers may occur as follows
a) ek laDakaa 'one boy'
b) 1 laDakaa '1 boy'
c) pahalaa laDakaa 'first boy'
d) karoDoN log 'crores (tens of millions) of people'
e) 1962 meiN 'in 1962'
Chunking Numbers/Quantifiers (2/2)
The POS tags for numerals and quantifiers are QC (numerals) and QF (other quantifiers) in IL POS tagset
Examples (d) and (e) in the previous slide show cases where the quantifier behaves like a noun
The issue :
Should the quantifiers in cases such as (d) and (e) be tagged as a Q* or as NN since the chunk itself is a noun chunk ?
Summary
For annotating POS and chunks, a scheme needs to be designed
While doing so, the following issues need to be considered:
Definition of 'chunk'
Elements which together can form a chunk type
Whether to include postpositions, punctuations etc inside a chunk or form them as independent chunks
POS/Chunk tag labels
Approaches in Computational Linguistics (for Tools)
Two major approaches
Rule based
Requires manually crafted rules
Explicit linguistic knowledge
Needs manual time and effort
Trained manpower
High precision
Less robust
Approaches in Computational Linguistics (for Tools)
Data driven approach
Uses statistical methods or machine learning
Requires less human effort
Often requires large scale data sources (manually annotated corpora, lexicons etc)
Linguistic knowledge is implicit
More adaptive to noisy text
More robust
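As a contrast with hand-written rules, the simplest data-driven tagger just learns, for each word, the tag it carries most often in an annotated corpus. The sketch below shows that baseline; the three-sentence corpus is invented for illustration, and the function names `train` and `tag` and the default tag for unseen words are assumptions of this example.

```python
from collections import Counter, defaultdict

# A tiny hand-tagged corpus, invented for illustration; real systems need
# a much larger manually annotated corpus.
TAGGED_CORPUS = [
    [("he", "PRP"), ("runs", "VBZ"), ("a", "DT"), ("mile", "NN")],
    [("she", "PRP"), ("runs", "VBZ"), ("every", "DT"), ("day", "NN")],
    [("their", "PP$"), ("team", "NN"), ("made", "VBD"), ("runs", "NNS")],
]


def train(corpus):
    """Count how often each word carries each tag."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        for word, tag in sentence:
            counts[word][tag] += 1
    return counts


def tag(words, counts, default="NN"):
    """Assign each word its most frequent tag; unseen words get `default`."""
    return [
        (w, counts[w].most_common(1)[0][0] if w in counts else default)
        for w in words
    ]


model = train(TAGGED_CORPUS)
print(tag(["he", "made", "runs"], model))
```

No rule was written by hand, but the baseline tags 'runs' as a verb even after 'made' because it ignores context; models that use the neighbouring tags are meant to fix exactly this.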
Computational Linguistics Application Areas
Is useful for communication between
Man – machine
Question answering systems, interactive railway reservation
Text summarization
Web applications
Intelligent search engines
Cross lingual search
Man – man
Machine translation