l1 intro 2x2
TRANSCRIPT
-
8/2/2019 l1 Intro 2x2
1/28
Natural Language Processing Applications, Lecture 1
Claire Gardent
CNRS/LORIA, Campus Scientifique,
BP 239, F-54506 Vandœuvre-lès-Nancy, France
2007/2008
1/111
Today's lecture
Administrative issues
Course Overview
What is NLP? How is it done?
A brief history of NLP
Linguistics in NLP
Why is it hard?
What are the applications?
2/111
Documentation
Webpage of the course
www.loria.fr/gardent/applicationsTAL
Slides of the lecture will be handed out at the beginning of each lecture.
Thursday Nancy NLP Seminar
www.loria.fr/gardent/Seminar/content/seminar07-2.php
First seminar: this week; Guillaume Pitel (in English), on using Latent Semantic Analysis to bootstrap a FrameNet for French
3/111
Course Overview
Theory
What is NLP? Why is it hard? How is it done? What are the applications?
Symbolic approaches. Exemplified by natural language generation (Meaning → Text)
Statistical approaches. Exemplified by information retrieval and information extraction (Text → Meaning)
Practice
Python and NLTK (Natural language toolkit)
Software project
Presentations
4/111
Computers, accounts and Lab sessions
You should all have a login account on the UHP machines. Nancy 2 students first need to register at Nancy 1 (UHP); registration is free.
From this account you should all be able to access Python, NLTK and whatever is needed for the exercises and the project.
If not, tell us!
Room I100 is reserved for you every Wednesday morning until Christmas.
Optional lab sessions with Bertrand Gaiffe, Wednesday morning from 10 to 12 in Room I100. Starting next week.
5/111
Assessment
Grades will be calculated as follows
Final exam : 60%
Presentations : 10%
Project : 30%
6/111
Presentations
Each student must present a paper on either Question Answering or Natural Language Generation.
A list of papers suggested for presentation will be given shortly on the course web site. If you prefer, you can choose some other paper on either QA or NLG, but you must then first run it by me for approval.
I will collect your choices at the end of the second week.
Presentations will be held on 4th (QA) and 16th October (NLG).
More about presentations and about their grading at http://www.loria.fr/gardent/applicationsTAL
7/111
Software Project
A list of software projects will be presented at the end of the second week.
You should gather into groups of 2 or 3 and choose a topic. If desired, there can also be individual projects.
I will collect your choices at the end of the third week (4 October).
Each group will give a short oral presentation (intermediate report) of their project at the end of the 5th week (18 October).
The results (program and output) of each group on the project will be returned at the end of the semester (roughly end of January).
More about projects at http://www.loria.fr/gardent/applicationsTAL
8/111
Course schedule
Mo 17 September, 2pm. What is NLP? Why is it hard? How is it done? An overview of NLP applications.
Tue 18 September, 10am. Python fundamentals.
Th 20 September, 10am. Regular expressions.
Mo 24 September, 2pm. Corpus processing and tokenization with NLTK.
Tue 25 September, 10am. Tagging and chunking with NLTK.
Th 27 September, 10am. Information Retrieval. Presentation of topics for software projects. Collecting individual choices for presentations.
Mo 1 October, 10am. Information extraction.
Tue 2 October, 10am. Information extraction.
Th 4 October, 10am. Presentations on Question Answering.
9/111
Course schedule
Mo 8 October, 2pm. NLG: Introduction.
Tue 9 October, 10am. NLG: Macro- and microplanning.
Th 11 October, 10am. NLG: Referential expressions.
Mo 15 October, 2pm. NLG: Surface realisation.
Tue 16 October, 10am. Presentations on NLG.
Th 18 October, 10am. Software project presentations.
10/111
Contact information
Need to find me?
LORIA, Room B234
Phone: 03 83 59 20 39
email: Claire.Gardent AT loria.fr
Web: www.loria.fr/gardent
11/111
What is NLP?
Natural Language Processing aims at making computers talk and, more precisely, at endowing them with the linguistic ability of humans.
12/111
Two illustrative examples of NLP applications
1966: ELIZA, the first chatterbot, i.e., a computer program that mimics human conversation. Developed by Joseph Weizenbaum from the Massachusetts Institute of Technology.
1973: SHRDLU, a computer program that allows the user to interact with a robot in a block world. Developed by Terry Winograd (MIT).
13/111
1966 ELIZA
Eliza, the first chatterbot: a computer program that mimics human conversation.
Joseph Weizenbaum, Massachusetts Institute of Technology
The user types in some statement or set of statements in natural language.
ELIZA then analyzes the user's statement and generates some response, which it types out.
http://www-ai.ijs.si/eliza/eliza.html
14/111
Sample conversation
Men are all alike.
IN WHAT WAY?
They're always bugging us about something or other.
CAN YOU THINK OF A SPECIFIC EXAMPLE?
Well, my boyfriend made me come here.
YOUR BOYFRIEND MADE YOU COME HERE
He says I'm depressed much of the time.
I AM SORRY TO HEAR YOU ARE DEPRESSED
15/111
ELIZA
It's true. I am unhappy.
DO YOU THINK COMING HERE WILL HELP YOU NOT TO BE UNHAPPY
I need some help, that much seems certain.
WHAT WOULD IT MEAN TO YOU IF YOU GOT SOME HELP
Perhaps I could learn to get along with my mother.
TELL ME MORE ABOUT YOUR FAMILY
16/111
ELIZA
My mother takes care of me.
WHO ELSE IN YOUR FAMILY TAKES CARE OF YOU
My father.
YOUR FATHER
You are like my father in some ways.
WHAT RESEMBLANCE DO YOU SEE
You are not very aggressive but I think you don't want me to notice that.
WHAT MAKES YOU THINK I AM NOT VERY AGGRESSIVE
You don't argue with me.
17/111
ELIZA
WHY DO YOU THINK I DON'T ARGUE WITH YOU
You are afraid of me.
DOES IT PLEASE YOU TO BELIEVE I AM AFRAID OF YOU
My father is afraid of everybody.
WHAT ELSE COMES TO YOUR MIND WHEN YOU THINK OF YOUR FATHER
Bullies.
DOES THAT HAVE ANYTHING TO DO WITH THE FACT THAT YOUR BOYFRIEND MADE YOU COME HERE
18/111
Eliza's strategy: keywords
The input is read and inspected for the presence of a keyword.
When a keyword is found, the sentence is transformed according to a rule associated with the keyword (rewriting of regular expressions).
Otherwise a content-free remark or, under certain conditions, an earlier transformation is retrieved.
The text so computed or retrieved is then printed out.
19/111
Example transformations
I am X → How long have you been X?
I am very unhappy.
HOW LONG HAVE YOU BEEN VERY UNHAPPY?

X you Y me → What makes you think I Y you?
It seems that you hate me.
WHAT MAKES YOU THINK THAT I HATE YOU?
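The keyword-and-transformation strategy can be sketched in a few lines of Python. This is a toy reconstruction, not Weizenbaum's original script: the two rules and the fallback remark are illustrative.

```python
import re

# Toy ELIZA-style rules: (keyword pattern -> response template).
# Illustrative only; the real ELIZA script had many more rules
# and a ranking mechanism for competing keywords.
RULES = [
    (re.compile(r"i am (.*)", re.I), "HOW LONG HAVE YOU BEEN {0}?"),
    (re.compile(r".*\byou\b (.*)\bme\b.*", re.I), "WHAT MAKES YOU THINK I {0}YOU?"),
]
FALLBACK = "PLEASE GO ON."  # content-free remark when no keyword matches

def respond(statement: str) -> str:
    text = statement.rstrip(".!?")
    for pattern, template in RULES:
        m = pattern.match(text)
        if m:
            # Transform the matched sentence using the rule's template.
            return template.format(*m.groups()).upper()
    return FALLBACK

respond("I am very unhappy.")  # HOW LONG HAVE YOU BEEN VERY UNHAPPY?
```

Note how the fallback reproduces the "Please go on" behaviour seen in the sample conversations: any statement without a keyword gets a canned reply.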
20/111
Why does it work?
Limited to psychiatrist talk
Can pretend to know almost nothing of the real world:
I went for a long boat ride
Tell me about boats
Not a real dialog (no real communication):
ELIZA generates a reaction (often a rather noncommittal one) to the user's statement
The user cannot ask for information. The user cannot issue an order.
21/111
Does it really work?
Eliza: Hello. I am ELIZA. How can I help you?
You: I went to Paris yesterday
Eliza: Can you elaborate on that?
You: I took the train
Eliza: Can you elaborate on that?
You: The train was late
Eliza: Please go on.
You: All the seats were full
Eliza: Please go on.
You: A little boy was having a tantrum
Eliza: Please go on.
22/111
Does it really work?
Indexicals (I, you, here, now)
User: I am having fun right now.
Eliza: Does it please you to believe that I am having fun right now.
World knowledge
User: I am a human.
Eliza: How long have you been a human?
23/111
When Eliza fails
Yes/no questions
User: Are you a good therapist?
Eliza: Would you prefer it if I were a good therapist?
User: I'm wondering about your skills as a therapist.
Eliza: Does it please you to believe I am wondering about my skills as a therapist?
Wh-questions
User: Who are you?
Eliza: Would you prefer it if I were not?
User: What do you want me to tell?
Eliza: Does that question interest you?
24/111
When Eliza fails
Complex assertions
User: I don't have a problem. I'm quite happy, I have a girlfriend, I study. Everything is OK!
Eliza: Why do you not have any problem, I'm quite happy, you have a girlfriend, you study. Everything is OK!
25/111
When Eliza fails
Ill-formed input
User: azefzaer is qgqdg in the agag.
Eliza: Please go on.
Ill-formed output
User: Doesn't my story interest you?
Eliza: Why do you stay your story interest I?
User: I think that you can help me but I'm not sure that you can understand me.
Eliza: Do you wish to but im not sure that me
26/111
Is it NLP?
Ad hoc transformations
No linguistic (e.g., syntactic, semantic, morphological) knowledge/processing involved
Not portable: would not work for another domain, e.g., a train scheduling information system:
I want to go to Paris
??? Tell me about Paris
27/111
1973 SHRDLU
Terry Winograd, MIT. Interaction with a robot in a block world. The user can:
instruct the robot to manipulate the blocks
ask it about the block configurations
ask it about its reasoning
update it with facts
http://hci.stanford.edu/winograd/shrdlu/
28/111
SHRDLU A sample dialogue
1s. Pick up a red block.
1u. OK. (does it)
Understands the instruction; identifies the referent of a description
2s. Grasp the pyramid.
2u. I DON'T UNDERSTAND WHICH PYRAMID YOU MEAN.
Knows that a definite ('the') must refer to a uniquely identifiable object
3s. Find a block which is taller than the one you are holding and put it into the box.
3u. BY IT, I ASSUME YOU MEAN THE BLOCK WHICH IS TALLER THAN THE ONE I AM HOLDING.
Identifies two possible antecedents for 'the one'
29/111
SHRDLU A sample dialogue
4s. What does the box contain?
4u. THE BLUE PYRAMID AND THE BLUE BOX.
Keeps track of how the context is affected by the dialog
5s. What is the pyramid supported by?
5u. THE BOX.
Handles the passive/active variation; correctly identifies 'the pyramid' as the pyramid just mentioned
30/111
SHRDLU A sample dialogue
How many blocks are not in the box?
FOUR OF THEM.
Is at least one of them narrower than the one which I told you to pick up?
YES, THE RED CUBE.
Is it supported?
YES, BY THE TABLE.
Can the table pick up blocks?
NO.
Can reason about the situation
31/111
Is it NLP?
Real communicative abilities: the user can request information, issue instructions and update the system with new information
The system understands language in a limited domain by using syntactic parsing and semantic reasoning:
Large-scale grammar of English + parser
Procedural semantics for words and phrases
32/111
Early NLP 1950s
Machine Translation (MT): one of the earliest applications of computers
Major attempts in the US and USSR: Russian to English and the reverse
Georgetown University, Washington system:
- Translated sample texts in 1954
- Euphoria: lots of funding, many groups in the US and USSR
- But: the system could not be scaled up.
33/111
1964: The ALPAC report
Assessed the research results of groups working on MT
Concluded: MT not possible in the near future.
Funding should cease for MT!
Basic research should be supported.
Word-to-word translation does not work: linguistic knowledge is needed
34/111
60-80: Linguistics and CL
1957 Noam Chomsky's Syntactic Structures: a formal definition of grammars and languages. Provides the basis for automatic syntactic processing of NL expressions
Montague's PTQ: formal semantics for NL. Basis for a logical treatment of NL meaning
1967 Woods' procedural semantics: a procedural approach to the meaning of a sentence. Provides the basis for automatic semantic processing of NL expressions
35/111
Some successful early CL systems
1970 TAUM Meteo: machine translation of weather reports (Canada)
1970s SYSTRAN: MT system; still used by Google
1973 Lunar: to question an expert system on rock analyses from Moon samples
1973 SHRDLU (T. Winograd): instructing a robot to move toy blocks
36/111
1980s: Symbolic NLP
Formally grounded and reasonably computationally tractable linguistic formalisms (Lexical Functional Grammar, Head-Driven Phrase Structure Grammar, Tree Adjoining Grammar, etc.)
Linguistic/Logical paradigm extensively pursued
Not robust enough
Few applications
37/111
1980s: Corpora and Resources
Disk space becomes cheap
Machine-readable text becomes ubiquitous
US funding emphasises large scale evaluation on real data
1994 The British National Corpus is made available: a balanced corpus of British English
Mid 1990s WordNet (Fellbaum & Miller): a computational thesaurus developed by psycholinguists
Early 2000s The World Wide Web used as a corpus
38/111
1990s Statistical NLP
The following factors promote the emergence of statistical NLP:
Speech recognition shows that, given enough data, simple statistical techniques work
US funding emphasises speech-based interfaces and information extraction
Large digitised corpora are available
39/111
CL History Summary
50s Machine translation; ended by the ALPAC report
60s Applications use linguistic techniques (ELIZA, SHRDLU) from Chomsky (formal grammars, parsers); procedural semantics (Woods) also important. Approaches only work on restricted domains. Not portable.
70s/80s Symbolic NLP. Applications based on extensive linguistic and real-world knowledge. Not robust enough. Lexical acquisition bottleneck.
90s to now. Statistical NLP. Applications based on statistical methods and large (annotated) corpora
40/111
Symbolic vs. statistical approaches
Symbolic
Based on hand-written rules
Requires linguistic expertise
No frequency information
More brittle and slower than statistical approaches
Often more precise than statistical approaches
Error analysis is usually easier than for statistical approaches
Statistical
Supervised or unsupervised
Rules acquired from large corpora
Not much linguistic expertise required
Robust and quick
Requires large (annotated) corpora
Error analysis is often difficult
41/111
Linguistics in NLP
NLP applications use knowledge about language to process language
All levels of linguistic knowledge are relevant:
Phonetics, Phonology: the study of linguistic sounds and of their relation to words
Morphology: the study of word components
Syntax: the study of the structural relationships between words
Semantics: the study of meaning
Pragmatics: the study of how language is used to accomplish goals and of the influence of context on meaning
Discourse: the study of linguistic units larger than a single utterance
42/111
Phonetics/phonology
Phonetics: the study of the speech sounds used in the languages of the world
How to transcribe those sounds (IPA, the International Phonetic Alphabet)
How sounds are produced (articulatory phonetics)
Phonology: the study of the way a sound is realised in different environments
A sound (phone) can usually be realised in different ways (allophones) depending on its context
E.g., the hand-transcribed Switchboard corpus of English telephone speech lists 16 ways of pronouncing 'because' and 'about'
43/111
Phonetics/phonology
An example illustrating the sound-to-text mapping issue:
(1) a. Recognise speech.
b. Wreck a nice peach.
Phonetics and phonology can be used either to map words into sounds (speech synthesis) or to map sounds onto words (speech recognition).
44/111
Morphology
Study of the structure of words
Two types of morphology:
Inflectional: decomposes a word into a lemma and one or more grammatical affixes giving information about tense, gender, number, etc.
E.g., cats → lemma cat + affix s
Derivational: decomposes a word into a lemma and one or more affixes giving information about meaning and/or category.
E.g., unfair → prefix un + lemma fair
45/111
Morphology: main issues
Exceptions and irregularities:
women → woman, plural
aren't → are not
Ambiguity:
saw → saw, noun, sg, neuter
saw → saw, verb, past (1st/2nd/3rd person, sg and pl)
46/111
Morphology: Methods and Tools
Methods
Lemmatisation (morphological analysis)
Stemming: an approximation
Tools
Finite state transducers
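The stemming approximation can be illustrated with a naive suffix stripper. This is a toy sketch with a made-up suffix list; real systems use finite state transducers plus exception lexicons for irregular forms like "women" → "woman".

```python
# A naive suffix-stripping stemmer: a crude approximation of
# morphological analysis. The suffix list is illustrative only.
SUFFIXES = ["ing", "ies", "es", "ed", "ly", "s"]  # checked longest-ish first

def stem(word: str) -> str:
    word = word.lower()
    for suf in SUFFIXES:
        # Only strip if a plausible stem (>= 3 letters) remains.
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

stem("cats")  # 'cat'
```

A form like "gives" comes out as "giv", which is exactly why stemming is called an approximation: it conflates related forms without producing a real lemma.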
47/111
Morphology: Applications
In CL applications, morphological information is useful e.g.:
to resolve anaphora:
(2) Sarah met the women in the street. She did not like them. [she(sg) = Sarah(sg); them(pl) = the women(pl)]
for spell checking and for generation:
*The women(pl) is(sg)
48/111
Syntax
Captures structural relationships between words and phrases
Describes the constituent structure of NL expressions
Grammars are used to describe the syntax of a language
Syntactic analysers and surface realisers assign a syntactic structure to a string/semantic representation on the basis of a grammar
49/111
Syntactic tree example
[S [NP John] [VP [Adv often] [V gives] [NP [Det a] [N book]] [PP [Prep to] [NP Mary]]]]
(the tree for 'John often gives a book to Mary')
50/111
Methods in Syntax
Words → syntactic tree
Algorithm: parser
Resources used: lexicon + grammar
Symbolic: hand-written grammar and lexicon
Statistical: grammar acquired from a treebank
Difficulty: coverage and ambiguity
51/111
Syntax
In CL applications, syntactic information is useful e.g. ...
for spell checking (e.g., subject-verb agreement)
to construct the meaning of a sentence
to generate a grammatical sentence
52/111
Spell checking
(3) *Its a fair exchange. → no syntactic tree
It's a fair exchange. → OK, syntactic tree
(4) *My friends is unhappy.
The number of my friends who were unhappy was amazing.
The man who greets my friends is amazing.
→ subject-verb agreement
53/111
Syntax and Meaning
John loves Mary → love(j,m) [Agent = Subject]
≠ Mary loves John → love(m,j) [Agent = Subject]
Mary is loved by John → love(j,m) [Agent = By-Object]
54/111
Lexical semantics
The study of word meanings and of their interaction with context
Words have several possible meanings
Early methods use selectional restrictions to identify the meaning intended in a given context:
(5) a. The astronomer saw the star.
b. The astronomer married the star.
Statistical methods use cooccurrence information derived from corpora annotated with word senses:
(6) e. John sat on the bank.
f. John went to the bank.
g. ?? King Kong sat on the bank.
Lesk algorithm: word overlap between the words appearing in the definitions of the ambiguous word and the words surrounding this word in the text
55/111
Lexical semantics
Lexical relations, i.e., relations between word meanings, are also very important for CL-based applications
The most used lexical relations are:
Hyponymy (ISA), e.g., dog is a hyponym of animal
Meronymy (part-of), e.g., arm is a meronym of body
Synonymy, e.g., eggplant and aubergine
Antonymy, e.g., big and little
56/111
Lexical semantics
In NLP applications, the most commonly used lexical relation is hyponymy, which is used:
for semantic classification (e.g., selectional restrictions, named entity recognition)
for shallow inference (e.g., 'X murdered Y' implies 'X killed Y')
for word sense disambiguation
for machine translation (if a term cannot be translated, substitute a hypernym)
57/111
Compositional Semantics
The semantics of phrases
Useful to reason about the meaning of an expression (e.g., to improve the accuracy of a question answering system)
(7) a. John saw Mary.
b. Mary saw John.
Same words, different meanings.
58/111
Pragmatics
Compositional semantics delivers the literal meaning of an utterance
NL phrases are often used non-literally
Examples:
(8) a. Can you pass the salt?
b. You are standing on my foot.
Speech act analysis and plan recognition are needed to determine the full meaning of an utterance
59/111
Discourse
Much of language interpretation is dependent on the preceding discourse/dialogue
Example: anaphora resolution.
(9) a. The councillors refused the women a permit because they feared revolution.
b. The councillors refused the women a permit because they advocated revolution.
60/111
Linguistics in deep symbolic NLP systems
The various types of linguistic knowledge are put to work in deep NLP systems
Deep natural language processing systems build a meaning representation (needed e.g. for NL interfaces to databases, question answering and good MT) from the user input and produce some feedback to the user
In a deep NLP system, each type of linguistic knowledge is encoded in a knowledge base which can be used by one or several modules of the system
61/111
Two main problems
Ambiguity: the same linguistic unit (word, constituent, sentence, etc.) can be interpreted/categorised in several competing ways
Paraphrase: the same content can be expressed in different ways.
62/111
Problem 1: Ambiguity
The same sentence can mean different things.
La belle ferme la porte.
(La belle)Subj (ferme la porte)VP ['the beauty closes the door']
(La belle ferme)Subj (la porte)VP ['the beautiful farm carries it']
63/111
Ambiguity pervades all levels of linguistic analysis
Phonological: the same sounds can mean different things.
Recognise speech or Wreck a nice peach?
Lexical semantics: the same word can mean different things.
étoile: sky star or celebrity?
Part of speech: the same word can belong to different parts of speech.
la: pronoun, noun or determiner?
Syntax: the same sentence can have different syntactic structures.
Jean regarde (la fille avec des lunettes)
Jean ((regarde la fille) avec des lunettes)
Semantics: the same sentence can have different meanings.
La belle ferme la porte
64/111
A combinatorial problem
Ambiguities multiply out thereby inducing a combinatorialissue.
Example: La p orte que la belle ferme presente ferme mal.
la porter que ferme presente mal
Nb of POS 3 3 3 5 2 2 Nb de combinaisons possibles: 3 x 3 x 3 x 3 x 3 x 5 x 2 x 5 x 2
= 24 300
The combinatorics is high
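The multiplication can be checked directly; the per-word part-of-speech counts below are the ones from the slide (taken at face value, not re-verified against a lexicon):

```python
from math import prod

# Part-of-speech ambiguity per word for
# "La porte que la belle ferme présente ferme mal"
pos_counts = [("la", 3), ("porte", 3), ("que", 3), ("la", 3), ("belle", 3),
              ("ferme", 5), ("présente", 2), ("ferme", 5), ("mal", 2)]

# Each word's readings combine independently, so the number of
# POS assignments for the whole sentence is the product.
combinations = prod(n for _, n in pos_counts)
print(combinations)  # 24300
```

This is why exhaustive enumeration of analyses does not scale: nine short words already yield tens of thousands of POS combinations before any syntactic ambiguity is considered.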
65/111
Problem 2: Paraphrase
There are many ways of saying the same thing. Example:
Quand mon laptop arrivera-t-il? ('When will my laptop arrive?')
Pourriez-vous me dire quand je peux espérer recevoir mon laptop? ('Could you tell me when I can hope to receive my laptop?')
In generation (Meaning → Text), this implies making choices. Again, the combinatorics is high.
66/111
Some NLP applications
Useful systems have been built, e.g., for:
Spelling and grammar checking
Speech recognition
Spoken Language Dialog Systems
Machine Translation
Text summarisation
Information retrieval and extraction
Question answering
67/111
NLP applications
Three main types of applications:
1. Language input technologies
2. Language processing technologies
3. Language output technologies
68/111
Language input technologies
Speech recognition
Optical character recognition
Handwriting recognition
Retroconversion
69/111
Speech recognition (1)
Key focus: spoken utterance → text
Two main types of applications:
Desktop control: dictation, voice control, navigation
Telephony-based transactions: travel reservation, remote banking, pizza ordering, voice control
70/111
Speech recognition (2)
Cheap PC desktop software available
60-90% accuracy. Good enough for dictation and simple transactions, but depends on the speaker and the circumstances
Speech recognition is not understanding!
71/111
Speech recognition
Based on statistical techniques and very large corpora
Works for many languages
Accuracy depends on audio conditions (robustness problem)
cf. the PAROLE team (Yves Laprie)
72/111
Speech recognition (3)
Desktop control:
Philips FreeSpeech (www.speech.philips.com)
IBM ViaVoice (www.software.ibm.com/speech)
Scansoft's Dragon NaturallySpeaking (www.lhsl.com/naturallyspeaking)
See also the Google category:
http://directory.google.com/Top/Computers/SpeechTechnology/
73/111
Dictation
Dictation systems can do more than just transcribe what was said:
leave out the 'ums' and 'ehs'
implement corrections that are dictated
fill the information into forms
rephrase sentences (add missing articles, verbs and punctuation; remove redundant or repeated words and self-corrections)
Communicate what is meant, not what is said
Speech can be used both to dictate content and to issue commands to the word-processing application (speech macros, e.g. to insert frequently used blocks of text or to navigate through a form)
74/111
Speech recognition (4)
Telephony-based fielded products:
Nuance (www.nuance.com)
ScanSoft (www.scansoft.com)
Philips (www.speech.philips.com)
Telstra directory enquiry (tel. 12455)
See also the Google category:
http://directory.google.com/Top/Computers/SpeechTechnology/Telephony/
75/111
Optical character recognition (1)
Key focus: printed material → computer-readable representation
Applications:
Scanning (text → digitized format)
Business card readers (to scan the printed information from business cards into the correct fields of an electronic address book) www.cardscan.com
Website construction from printed documents
76/111
Optical character recognition (2)
Current state of the art:
90% accuracy on clean text
100-200 characters per second (as opposed to 3-4 for typing)
Fundamental issues:
Character segmentation and character recognition
Problems: unclean data and ambiguity
Many OCR systems use linguistic knowledge to correct recognition errors:
n-grams for word choice during processing
Spelling correction in post-processing
77/111
Optical character recognition (3)
Fielded products
Caere's OmniPage (www.scansoft.com)
Xerox TextBridge (www.scansoft.com)
ExperVision's TypeReader (www.expervision.com)
78/111
Handwriting recognition (1)
Key focus: human handwriting → computer-readable representation
Applications:
Forms processing
Mail routing
Personal digital assistants (PDAs)
79/111
Handwriting recognition: fundamental issues
Everyone writes differently!
Isolated letters vs. cursive script
Train the user or the system?
Most people type faster than they write: choose applications where keyboards are not appropriate
Need elaborate language models and writing-style models
80/111
Handwriting recognition (2)
5-6% error rate (on isolated letters)
Good typists tolerate up to a 1% error rate
Human subjects make 4-8% errors
81/111
Handwriting recognition (3)
Isolated letters:
Palm's Graffiti (www.palm.com)
Communication Intelligence Corporation's Jot (www.cic.com)
Cursive scripts:
Motorola's Lexicus
ParaGraph's CalliGrapher (www.paragraph.com)
cf. the READ team (Abdel Belaïd)
82/111
Retroconversion
Key focus: identify the logical and physical structure of the input text
Applications:
Recognising tables of contents
Recognising bibliographical references
Locating and recognising mathematical formulae
Document classification
83/111
Language processing technologies
Spelling and grammar checking
Spoken Language Dialog System
Machine Translation
Text Summarisation
Search and Information Retrieval
Question answering systems
84/111
Spelling and grammar checking
Various levels of sophistication:
Flag words which are not in the dictionary: *neccessary → dictionary lookup
In the case of a language with a rich morphology, flag words which are morphosyntactically incorrect, e.g., He *gived a book to Mary → morphological processing
85/111
Spelling and grammar checking
Syntax might be needed:
*Its a fair exchange → possessive pronoun distribution
My friend *were unhappy → subject/verb agreement
Word sense disambiguation:
*The tree's bows were heavy with snow vs.
The tree's boughs were heavy with snow
Existing spell checkers only handle a limited number of these problems.
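The first level of sophistication, plain dictionary lookup, is easy to sketch. The tiny lexicon below is illustrative only; a real checker would use a full word list plus morphological analysis for forms like "gived".

```python
# Dictionary-lookup spell checking: flag any word not in the lexicon.
# The lexicon here is a toy example, not a real dictionary.
LEXICON = {"it's", "a", "fair", "exchange", "necessary", "he",
           "gave", "book", "to", "mary"}

def flag_unknown(sentence: str) -> list[str]:
    words = sentence.lower().replace(".", "").split()
    return [w for w in words if w not in LEXICON]

flag_unknown("He gived a book to Mary.")  # ['gived']
```

Note that this level catches *gived but would never catch the syntax-dependent errors above ("My friend were unhappy") or real-word confusions ("bows"/"boughs"), since every word is individually in the dictionary.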
86/111
Spoken Language Dialog Systems
Goal: a system that you can talk to in order to carry out some task
Key focus:
Speech recognition
Speech synthesis
Dialogue management
Applications:
Information provision systems: provide information in response to a query (requests for timetable information, weather information)
Transaction-based systems: to undertake a transaction such as buying/selling stocks or reserving a seat on a plane
87/111
SLDSs: some problems
No training period possible in Phone-based systems
Error handling remains difficult
User initiative remains limited (or likely to result in errors)
88/111
Existing MT Systems
Bowne's iTranslator (www.itranslator.com)
Taum-Meteo (1979) (English/French):
Domain of weather reports
Highly successful
Systran (among several European languages):
Human-assisted translation
Rough translation
Used over the internet through AltaVista
http://babelfish.altavista.com
93/111
The limitations of Taum-Meteo
Exceptional domain: Limited language, large translation need.
A limited domain with enough material never again found.
The same group tried to build Taum-Aveo for aircraft maintenance manuals.
Only limited success.
94/111
The limitations of Systran
Two British undercover soldiers are arrested by Iraqi police in Basra following a car chase. They are reported to have fired on the police.
Deux soldats britanniques de capot interne sont arrêtés par la police d'Iraq à Bassora suivant une chasse de voiture. On rapporte qu'ils mettent le feu sur la police.
undercover / de capot interne: incorrect word translation
following / suivant (instead of suite à): gerund/preposition ambiguity wrongly resolved
car chase / chasse de voiture (instead of course en voiture): wrong recognition of the N-N compound
fire on / mettre le feu sur: non-recognition of the verbal locution
95/111
MT and lexical meaning
L'arbre est une structure très utilisée en linguistique. On l'utilise par exemple, pour représenter la structure syntaxique d'une expression ou, par le biais des formules logiques, pour représenter le sens des expressions de la langue naturelle.
The tree is a structure very much used in linguistics. It is used for example, to represent the syntactic structure of an expression or, by the means of the logical formulas, to represent the direction of the expressions of the natural language.
96/111
Word salad
Cette approche est particulièrement intéressante parce que, un peu comme les grammaires d'unification introduites il y a quelques décennies par Martin Kay, [...]. Cette vision qui est sans doute, celle de la plupart des linguistes, n'a malheureusement toujours pas trouvé de cadre informatique adéquat pour s'exprimer et s'instancier.
This approach is particularly interesting because, a little like introduced grammars of unification a few decades ago by Martin Kay, [...]. This vision which is undoubtedly, that of the majority of the linguists, unfortunately still did not find of data-processing framework adequate to be expressed and instancier.
97/111
MT State of the Art
Broad-coverage systems already available on the web (Systran)
Reasonable accuracy for specific domains (TAUM Meteo) or controlled languages
Machine-aided translation is mostly used
98/111
Text summarisation
Key issue: text → shorter version of the text
Applications:
to decide whether it is worth reading the original text
to read the summary instead of the full text
to automatically produce an abstract
99/111
Text summarisation
Three main steps
1. Extract important sentences (compute document keywords and score document sentences wrt these keywords)
2. Cohesion check: spot anaphoric references and modify the text accordingly (e.g., add the sentence containing a pronoun's antecedent; remove difficult sentences; remove pronouns)
3. Balance and coverage: modify the summary to obtain an appropriate text structure (delete redundant sentences; harmonise the tense of verbs; ensure balance and proper coverage)
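Step 1 (keyword-based sentence scoring) can be sketched as a bare-bones frequency model. The stopword list is a toy one, and steps 2 and 3 (cohesion, balance) are deliberately left out.

```python
from collections import Counter

STOPWORDS = {"the", "a", "of", "is", "and", "to", "in"}  # tiny illustrative list

def summarise(sentences: list[str], n: int = 1) -> list[str]:
    # Document keywords = frequencies of non-stopword terms.
    words = [w for s in sentences for w in s.lower().split() if w not in STOPWORDS]
    freq = Counter(words)
    # Score each sentence by the total corpus frequency of its words,
    # then keep the n best, preserving their original order.
    def score(s: str) -> int:
        return sum(freq[w] for w in s.lower().split() if w not in STOPWORDS)
    best = sorted(sentences, key=score, reverse=True)[:n]
    return [s for s in sentences if s in best]
```

For example, in a three-sentence document where "cat" recurs, the sentence packing the most frequent words is selected; the low discourse coherence noted on the next slide follows directly from such purely extractive scoring.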
100/ 111
Text summarisation
State of the art:
Sentences extracted on the basis of location, linguistic cues, statistical information
Low discourse coherence
Commercial systems:
British Telecom's ProSum (transend.labs.bt.com)
Copernic (www.copernic.com)
MS Word's summarisation tool
See also http://www.ics.mq.edu.au/swan/summarization/projects.h
101/ 111
Information Extraction/Retrieval and QA
Given an NL query and documents (e.g., web pages):
retrieve the documents containing the answer (retrieval)
fill in a template with the relevant information (extraction)
produce an answer to the query (Q/A)
Limited to factoid questions, e.g.:
Who invented the electric guitar?
How many hexagons are on a soccer ball?
Where did Bill Gates go to college?
Excludes: how-to questions, yes-no questions, questions that require complex reasoning
Highest possible accuracy estimated at around 70%
102/ 111
Information Extraction/Retrieval and QA
IR systems: Google, Yahoo, etc.
QA systems:
AskJeeves (www.askjeeves.com)
Artificial Life's Alife Sales Rep (www.artificial-life.com)
NativeMinds' vReps (www.nativeminds.com)
Soliloquy (www.soliloquy.com)
103/ 111
Language output technologies
Text-to-Speech
Tailored document generation
104/ 111
Text-to-Speech (1)
Key focus: text → natural-sounding speech
Applications:
Spoken rendering of email via desktop and telephone
Document proofreading
Voice portals
Computer-assisted language learning
105/ 111
Text-to-Speech (2)
Requires appropriate use of intonation and phrasing
Existing systems:
Scansoft's RealSpeak (www.lhsl.com/realspeak)
British Telecom's Laureate
AT&T Natural Voices (http://www.naturalvoices.att.com)
106/ 111
Tailored document generation
Key focus: document structure + parameters → individually tailored documents
Applications:
Personalised advice giving
Customised policy manuals
Web-delivered dynamic documents
107/ 111
Tailored document generation
KnowledgePoint (www.knowledgepoint.com): tailored job descriptions
CoGenTex (www.cogentex.com): project status reports, weather reports
108/ 111
CL Applications Summary
NLP applications process language using knowledge about language
All levels of linguistic knowledge are relevant
Two main problems: ambiguity and paraphrase
NLP applications use a mix of symbolic and statistical methods
Current applications are not perfect, as:
Symbolic processing is not robust/portable enough
Statistical processing is not accurate enough
Applications can be classified into two main types: aids to human users (e.g., spell checkers, machine-aided translation) and agents in their own right (e.g., NL interfaces to DBs, dialogue systems)
Useful applications have been built since the late 70s
Commercial success is harder to achieve
109/ 111