Natural Language Processing Applications: Lecture 1

    Claire Gardent

CNRS/LORIA, Campus Scientifique, BP 239, F-54506 Vandœuvre-lès-Nancy, France

2007/2008

Today's lecture

Administrative issues

Course Overview

What is NLP? How is it done? A brief history of NLP. Linguistics in NLP. Why is it hard? What are the applications?

    Documentation

    Webpage of the course

    www.loria.fr/gardent/applicationsTAL

Slides of the lecture will be handed out at the beginning of each lecture.

Thursday Nancy NLP Seminar

www.loria.fr/gardent/Seminar/content/seminar07-2.php

First seminar: this week; Guillaume Pitel (in English), on using Latent Semantic Analysis to bootstrap a FrameNet for French.

    Course Overview

    Theory

What is NLP? Why is it hard? How is it done? What are the applications?

Symbolic approaches, exemplified by natural language generation (Meaning → Text)

Statistical approaches, exemplified by information retrieval and information extraction (Text → Meaning)

    Practice

Python and NLTK (Natural Language Toolkit)

    Software project

    Presentations


    Computers, accounts and Lab sessions

You should all have a login account on the UHP machines. Nancy 2 students first need to register at Nancy 1 (UHP); registration is free.

From this account you should be able to access Python, NLTK, and whatever else is needed for the exercises and the project.

If not, tell us!

Room I100 is reserved for you every Wednesday morning until Christmas.

Optional lab sessions with Bertrand Gaiffe, Wednesday morning from 10 to 12 in Room I100. Starting next week.

    Assessment

Grades will be calculated as follows:

Final exam: 60%
Presentations: 10%
Project: 30%

    Presentations

Each student must present a paper on either Question Answering or Natural Language Generation.

A list of papers suggested for presentation will be given shortly on the course web site. If you prefer, you can choose some other paper on either QA or NLG, but you must then first run it by me for approval.

I will collect your choices at the end of the second week.

Presentations will be held on 4th (QA) and 16th October (NLG).

More about presentations and their grading at http://www.loria.fr/gardent/applicationsTAL

    Software Project

A list of software projects will be presented at the end of the second week.

You should gather into groups of 2 or 3 and choose a topic. If desired, there can also be individual projects.

I will collect your choices at the end of the third week (4 October).

Each group will give a short oral presentation (intermediate report) of their project at the end of the 5th week (18 October).

The results (program and output) of each group's project will be returned at the end of the semester (roughly the end of January).

More about projects at http://www.loria.fr/gardent/applicationsTAL

    Course schedule

Mo 17 September, 2pm. What is NLP? Why is it hard? How is it done? An overview of NLP applications.

Tue 18 September, 10am. Python fundamentals.

Th 20 September, 10am. Regular expressions.

Mo 24 September, 2pm. Corpus processing and tokenization with NLTK.

Tue 25 September, 10am. Tagging and chunking with NLTK.

Th 27 September, 10am. Information retrieval. Presentation of topics for software projects. Collecting individual choices for presentations.

Mo 1 October, 10am. Information extraction.

Tue 2 October, 10am. Information extraction.

Th 4 October, 10am. Presentations on Question Answering.

    Course schedule

Mo 8 October, 2pm. NLG: Introduction.

Tue 9 October, 10am. NLG: Macro- and microplanning.

Th 11 October, 10am. NLG: Referential expressions.

Mo 15 October, 2pm. NLG: Surface realisation.

Tue 16 October, 10am. Presentations on NLG.

Th 18 October, 10am. Software project presentations.

    Contact information

    Need to find me?

    LORIA, Room B234

    Phone: 03 83 59 20 39

    email: Claire.Gardent AT loria.fr

    Web: www.loria.fr/gardent


    What is NLP?

Natural Language Processing aims at making computers talk and, more precisely, at endowing them with the linguistic ability of humans.

Two illustrative examples of NLP applications

1966: Eliza, the first chatterbot, i.e., a computer program that mimics human conversation. Developed by Joseph Weizenbaum at the Massachusetts Institute of Technology.

1973: SHRDLU, a computer program that allows the user to interact with a robot in a block world. Developed by Terry Winograd (MIT).

    1966 ELIZA

Eliza, the first chatterbot: a computer program that mimics human conversation.

Joseph Weizenbaum, Massachusetts Institute of Technology

The user types in some statement or set of statements in natural language.

ELIZA then analyzes the user's statement and generates some response, which it types out.

http://www-ai.ijs.si/eliza/eliza.html

    Sample conversation

Men are all alike.
IN WHAT WAY?
They're always bugging us about something or other.
CAN YOU THINK OF A SPECIFIC EXAMPLE?
Well, my boyfriend made me come here.
YOUR BOYFRIEND MADE YOU COME HERE
He says I'm depressed much of the time.
I AM SORRY TO HEAR YOU ARE DEPRESSED

    ELIZA

It's true. I am unhappy.
DO YOU THINK COMING HERE WILL HELP YOU NOT TO BE UNHAPPY
I need some help, that much seems certain.
WHAT WOULD IT MEAN TO YOU IF YOU GOT SOME HELP
Perhaps I could learn to get along with my mother.
TELL ME MORE ABOUT YOUR FAMILY

    ELIZA

My mother takes care of me.
WHO ELSE IN YOUR FAMILY TAKES CARE OF YOU
My father.
YOUR FATHER
You are like my father in some ways.
WHAT RESEMBLANCE DO YOU SEE
You are not very aggressive but I think you don't want me to notice that.
WHAT MAKES YOU THINK I AM NOT VERY AGGRESSIVE
You don't argue with me.

    ELIZA

WHY DO YOU THINK I DON'T ARGUE WITH YOU
You are afraid of me.
DOES IT PLEASE YOU TO BELIEVE I AM AFRAID OF YOU
My father is afraid of everybody.
WHAT ELSE COMES TO YOUR MIND WHEN YOU THINK OF YOUR FATHER
Bullies.
DOES THAT HAVE ANYTHING TO DO WITH THE FACT THAT YOUR BOYFRIEND MADE YOU COME HERE

Eliza's strategy: keywords

The input is read and inspected for the presence of a keyword.

When a keyword is found, the sentence is transformed according to a rule associated with the keyword (regular-expression rewriting).

Otherwise, a content-free remark or, under certain conditions, an earlier transformation is retrieved.

The text so computed or retrieved is then printed out.

    Example transformations

I am X → How long have you been X?
I am very unhappy.
HOW LONG HAVE YOU BEEN VERY UNHAPPY?

X YOU Y ME → What makes you think I Y you?
It seems that you hate me.
WHAT MAKES YOU THINK THAT I HATE YOU?
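A minimal sketch of this keyword-and-transformation strategy in Python (the language used in this course); the two rules and the fallback below are illustrative stand-ins mirroring the examples above, not Weizenbaum's original script:

import re

# ELIZA-style rules: (keyword pattern, response template).
# The real script had many more keywords, each with a rank.
RULES = [
    (re.compile(r"\bi am (.+)", re.I), "HOW LONG HAVE YOU BEEN {0}?"),
    (re.compile(r"\byou (\w+) me\b", re.I), "WHAT MAKES YOU THINK THAT I {0} YOU?"),
]

def respond(sentence):
    # Inspect the input for a keyword; if one is found, apply its rule.
    for pattern, template in RULES:
        match = pattern.search(sentence)
        if match:
            return template.format(*match.groups()).upper()
    # Otherwise fall back on a content-free remark.
    return "PLEASE GO ON."

print(respond("I am very unhappy"))          # HOW LONG HAVE YOU BEEN VERY UNHAPPY?
print(respond("It seems that you hate me"))  # WHAT MAKES YOU THINK THAT I HATE YOU?
print(respond("The train was late"))         # PLEASE GO ON.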


    Why does it work?

    Limited to psychiatrist talk

Can pretend to know almost nothing of the real world:
I went for a long boat ride.
Tell me about boats.

Not a real dialogue (no real communication):

Eliza generates a reaction (often a rather noncommittal one) to the user's statement.

The user cannot ask for information. The user cannot issue an order.

    Does it really work?

Eliza: Hello. I am ELIZA. How can I help you?
You: I went to Paris yesterday
Eliza: Can you elaborate on that?
You: I took the train
Eliza: Can you elaborate on that?
You: The train was late
Eliza: Please go on.
You: All the seats were full
Eliza: Please go on.
You: A little boy was having a tantrum
Eliza: Please go on.

    Does it really work?

Indexicals (I, you, here, now):
User: I am having fun right now.
Eliza: Does it please you to believe that I am having fun right now.

World knowledge:
User: I am a human.
Eliza: How long have you been a human?

    When Eliza fails

Yes/No questions:
User: Are you a good therapist?
Eliza: Would you prefer it if I were a good therapist?
User: I'm wondering about your skills as a therapist.
Eliza: Does it please you to believe I am wondering about my skills of therapist?

Wh-questions:
User: Who are you?
Eliza: Would you prefer it if I were not?
User: What do you want me to tell?
Eliza: Does that question interest you?

    When Eliza fails

Complex assertions:
User: I don't have a problem. I'm quite happy, I have a girlfriend, I study. Everything is OK!
Eliza: Why do you not have any problem, I'm quite happy, you have a girlfriend, you study. Everything is OK!

    When Eliza fails

Ill-formed input:
User: azefzaer is qgqdg in the agag.
Eliza: Please go on.

Ill-formed output:
User: Doesn't my story interest you?
Eliza: Why do you stay your story interest I?
User: I think that you can help me but I'm not sure that you can understand me.
Eliza: Do you wish to but im not sure that me

    Is it NLP?

    Ad hoc transformations

No linguistic (e.g., syntactic, semantic, morphological) knowledge/processing involved.

Not portable: would not work for another domain, e.g., a train scheduling information system:
I want to go to Paris.
??? Tell me about Paris

    1973 SHRDLU

Terry Winograd, MIT. Interaction with a robot in a block world. The user can:

instruct the robot to manipulate the blocks

ask it about the blocks' configuration
ask it about its reasoning
update it with facts

http://hci.stanford.edu/winograd/shrdlu/

SHRDLU: a sample dialogue

1s. Pick up a red block.
1u. OK. (does it)

Understands the instruction; identifies the referent of a description.

2s. Grasp the pyramid.
2u. I DON'T UNDERSTAND WHICH PYRAMID YOU MEAN.

Knows that a definite description ("the X") must refer to a uniquely identifiable object.

3s. Find a block which is taller than the one you are holding and put it into the box.
3u. BY "IT", I ASSUME YOU MEAN THE BLOCK WHICH IS TALLER THAN THE ONE I AM HOLDING.

Identifies two possible antecedents for "the one".

SHRDLU: a sample dialogue

4s. What does the box contain?
4u. THE BLUE PYRAMID AND THE BLUE BOX.

Keeps track of how the context is affected by the dialogue.

5s. What is the pyramid supported by?
5u. THE BOX.

Handles passive/active variation; correctly identifies "the pyramid" as the pyramid just mentioned.

SHRDLU: a sample dialogue

How many blocks are not in the box?
FOUR OF THEM.
Is at least one of them narrower than the one which I told you to pick up?
YES, THE RED CUBE.
Is it supported?
YES, BY THE TABLE.
Can the table pick up blocks?
NO.

Can reason about the situation.

    Is it NLP?

Real communicative abilities: the user can request information, issue instructions, and update the system with new information.

The system understands language in a limited domain by using syntactic parsing and semantic reasoning:

large-scale grammar of English + parser
procedural semantics for words and phrases

    Early NLP 1950s

Machine Translation (MT): one of the earliest applications of computers.

Major attempts in the US and USSR: Russian to English and the reverse.

Georgetown University (Washington) system:

translated sample texts in 1954
euphoria: lots of funding, many groups in the US and USSR
but the system could not be scaled up

    1964: The ALPAC report

Assessed the research results of groups working on MT.

Concluded: MT not possible in the near future.

Funding should cease for MT!

Basic research should be supported.

Word-to-word translation does not work; linguistic knowledge is needed.

    60-80: Linguistics and CL

1957: Noam Chomsky's Syntactic Structures. A formal definition of grammars and languages; provides the basis for automatic syntactic processing of NL expressions.

Montague's PTQ: formal semantics for NL; the basis for a logical treatment of NL meaning.

1967: Woods' procedural semantics. A procedural approach to the meaning of a sentence; provides the basis for automatic semantic processing of NL expressions.

    Some successful early CL systems

1970: TAUM-Meteo. Machine translation of weather reports (Canada).

1970s: SYSTRAN. MT system; still used by Google.

1973: LUNAR. To question an expert system on rock analyses from Moon samples.

1973: SHRDLU (T. Winograd). Instructing a robot to move toy blocks.

    1980s: Symbolic NLP

Formally grounded and reasonably computationally tractable linguistic formalisms (Lexical Functional Grammar, Head-Driven Phrase Structure Grammar, Tree Adjoining Grammar, etc.)

    Linguistic/Logical paradigm extensively pursued

    Not robust enough

    Few applications


    1980s: Corpora and Resources

    Disk space becomes cheap

Machine-readable text becomes ubiquitous

    US funding emphasises large scale evaluation on real data

1994: The British National Corpus is made available, a balanced corpus of British English.

Mid-1990s: WordNet (Fellbaum & Miller), a computational thesaurus developed by psycholinguists.

Early 2000s: The World Wide Web is used as a corpus.

    1990s Statistical NLP

    The following factors promote the emergence of statistical NLP:

Speech recognition shows that, given enough data, simple statistical techniques work.

US funding emphasises speech-based interfaces and information extraction.

Large digitised corpora are available.

    CL History Summary

50s: Machine translation; ended by the ALPAC report.

60s: Applications use linguistic techniques (Eliza, SHRDLU) from Chomsky (formal grammars, parsers); procedural semantics (Woods) also important. Approaches only work in restricted domains; not portable.

70s/80s: Symbolic NLP. Applications based on extensive linguistic and real-world knowledge. Not robust enough; lexical acquisition bottleneck.

90s-now: Statistical NLP. Applications based on statistical methods and large (annotated) corpora.

    Symbolic vs. statistical approaches

Symbolic:

Based on hand-written rules

Requires linguistic expertise

No frequency information

More brittle and slower than statistical approaches

Often more precise than statistical approaches

Error analysis is usually easier than for statistical approaches

Statistical:

Supervised or unsupervised

Rules acquired from a large corpus

Not much linguistic expertise required

Robust and quick

Requires large (annotated) corpora

Error analysis is often difficult

    Linguistics in NLP

NLP applications use knowledge about language to process language.

All levels of linguistic knowledge are relevant:

Phonetics, phonology: the study of linguistic sounds and of their relation to words

Morphology: the study of word components

Syntax: the study of the structural relationships between words

Semantics: the study of meaning

Pragmatics: the study of how language is used to accomplish goals and of the influence of context on meaning

Discourse: the study of linguistic units larger than a single utterance

    Phonetics/phonology

Phonetics: the study of the speech sounds used in the languages of the world

How to transcribe those sounds (IPA, the International Phonetic Alphabet)

How sounds are produced (articulatory phonetics)

Phonology: the study of the way a sound is realised in different environments

A sound (phone) can usually be realised in different ways (allophones) depending on its context

E.g., the hand-transcribed Switchboard corpus of English telephone speech lists 16 ways of pronouncing "because" and "about".

    Phonetics/phonology

    An example illustrating the Sound-to-Text mapping issue.

(1) a. Recognise speech.
b. Wreck a nice peach.

Phonetics and phonology can be used either to map words into sound (speech synthesis) or to map sounds onto words (speech recognition).

    Morphology

    Study of the structure of words

Two types of morphology:

Inflectional: decomposes a word into a lemma and one or more grammatical affixes giving information about tense, gender, number, etc. E.g., cats → lemma "cat" + affix "-s".

Derivational: decomposes a word into a lemma and one or more affixes giving information about meaning and/or category. E.g., unfair → prefix "un-" + lemma "fair".

    Morphology: main issues

Exceptions and irregularities:

women → woman, plural
aren't → are not

Ambiguity:

saw → saw, noun, sg
saw → saw, verb, base form (to saw)
saw → see, verb, past (any person, sg or pl)

    Morphology: Methods and Tools

Methods:

Lemmatisation (morphological analysis)

Stemming: an approximation

Tools:

Finite-state transducers
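A quick illustration with NLTK, the toolkit used in this course; a sketch assuming the WordNet data has been installed (nltk.download('wordnet')):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming chops affixes by rule, so it is only an approximation:
print(stemmer.stem("women"))       # 'women'  (the irregular plural is missed)
print(stemmer.stem("generation"))  # 'gener'  (over-stemming)

# Lemmatisation consults a lexicon, so irregular forms are handled:
print(lemmatizer.lemmatize("women", pos="n"))  # 'woman'
print(lemmatizer.lemmatize("saw", pos="v"))    # 'see'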


    Morphology: Applications

In CL applications, morphological information is useful, e.g.,

to resolve anaphora:
(2) Sarah met the women in the street. She did not like them. [she_sg = Sarah_sg; them_pl = the women_pl]

for spell checking and for generation:
*The women_pl is_sg

    Syntax

    Captures structural relationships between words and phrases

    Describes the constituent structure of NL expressions

    Grammars are used to describe the syntax of a language

Syntactic analysers and surface realisers assign a syntactic structure to a string/semantic representation on the basis of a grammar.

Syntactic tree example (bracketed form):

[S [NP John] [VP [Adv often] [V gives] [NP [Det a] [N book]] [PP [Prep to] [NP Mary]]]]

    Methods in Syntax

Words → syntactic tree

Algorithm: parser

Resources used: lexicon + grammar

Symbolic: hand-written grammar and lexicon (see the sketch below)

Statistical: grammar acquired from a treebank

Difficulty: coverage and ambiguity
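As a sketch of the symbolic route, a hand-written toy grammar in NLTK that parses the example tree above; the grammar is invented to cover this one sentence, which is exactly the coverage problem just mentioned:

import nltk

# A toy hand-written grammar covering only the example sentence.
grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> Adv V NP PP
NP -> 'John' | 'Mary' | Det N
PP -> Prep NP
Det -> 'a'
N -> 'book'
V -> 'gives'
Adv -> 'often'
Prep -> 'to'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("John often gives a book to Mary".split()):
    print(tree)
# (S (NP John)
#    (VP (Adv often) (V gives) (NP (Det a) (N book)) (PP (Prep to) (NP Mary))))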


    Syntax

In CL applications, syntactic information is useful, e.g.,

for spell checking (e.g., subject-verb agreement)

to construct the meaning of a sentence

to generate a grammatical sentence

    Spell checking

(3) *Its a fair exchange. → no syntactic tree
It's a fair exchange. → OK syntactic tree

(4) *My friends is unhappy.
The number of my friends who were unhappy was amazing.
The man who greets my friends is amazing.
→ subject-verb agreement

    Syntax and Meaning

John loves Mary → love(j,m); Agent = Subject

≠ Mary loves John → love(m,j); Agent = Subject

Mary is loved by John → love(j,m); Agent = By-Object

    Lexical semantics

The study of word meanings and of their interaction with context

    Words have several possible meanings

Early methods use selectional restrictions to identify the meaning intended in a given context:

(5) a. The astronomer saw the star.
b. The astronomer married the star.

Statistical methods use cooccurrence information derived from corpora annotated with word senses:

(6) e. John sat on the bank.
f. John went to the bank.
g. ?? King Kong sat on the bank.

Lesk algorithm: word overlap between the words appearing in the definitions of the ambiguous word and the words surrounding this word in the text.
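NLTK ships a simplified Lesk implementation; a minimal sketch (requires the WordNet data; which sense wins depends entirely on the overlap with the WordNet definitions):

from nltk.wsd import lesk

context = "John went to the bank to deposit his money".split()
# Pick the WordNet sense of 'bank' whose definition overlaps most
# with the context words.
sense = lesk(context, "bank", pos="n")
print(sense, "-", sense.definition())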


    Lexical semantics

Lexical relations, i.e., relations between word meanings, are also very important for CL-based applications.

The most used lexical relations are:

Hyponymy (ISA): e.g., dog is a hyponym of animal
Meronymy (part-of): e.g., arm is a meronym of body
Synonymy: e.g., eggplant and aubergine
Antonymy: e.g., big and little
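All four relations can be queried in WordNet via NLTK; a sketch (the synset names are WordNet's own, and the exact output depends on the WordNet version installed):

from nltk.corpus import wordnet as wn

dog = wn.synset('dog.n.01')
print(dog.hypernyms())             # hyponymy, seen from below: what a dog ISA

arm = wn.synset('arm.n.01')
print(arm.part_holonyms())         # meronymy: the whole that an arm is a part of

print(wn.synsets('aubergine'))     # synonymy: shares its synsets with 'eggplant'

big = wn.synset('big.a.01')
print(big.lemmas()[0].antonyms())  # antonymy: e.g. [Lemma('small.a.01.small')]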


    Lexical semantics

In NLP applications, the most commonly used lexical relation is hyponymy, which is used:

for semantic classification (e.g., selectional restrictions, named entity recognition)

for shallow inference (e.g., "X murdered Y" implies "X killed Y")

for word sense disambiguation

for machine translation (if a term cannot be translated, substitute a hypernym)

    Compositional Semantics

    Semantics of phrases

Useful to reason about the meaning of an expression (e.g., to improve the accuracy of a question answering system)

(7) a. John saw Mary.
b. Mary saw John.

Same words, different meanings.

    Pragmatics

Compositional semantics delivers the literal meaning of an utterance.

NL phrases are often used non-literally.

Examples:

(8) a. Can you pass the salt?
b. You are standing on my foot.

Speech act analysis and plan recognition are needed to determine the full meaning of an utterance.

    Discourse

Much of language interpretation is dependent on the preceding discourse/dialogue.

Example: anaphora resolution.

(9) a. The councillors refused the women a permit because they feared revolution.
b. The councillors refused the women a permit because they advocated revolution.

    Linguistics in deep symbolic NLP systems

The various types of linguistic knowledge are put to work in deep NLP systems.

Deep Natural Language Processing systems build a meaning representation (needed, e.g., for NL interfaces to databases, question answering, and good MT) from user input and produce some feedback to the user.

In a deep NLP system, each type of linguistic knowledge is encoded in a knowledge base which can be used by one or several modules of the system.

    Two main problems

Ambiguity: the same linguistic unit (word, constituent, sentence, etc.) can be interpreted/categorised in several competing ways.

Paraphrase: the same content can be expressed in different ways.

    Problem 1: Ambiguity

The same sentence can mean different things.

La belle ferme la porte.

(La belle)_Subj (ferme la porte)_VP: "the beauty closes the door"

(La belle ferme)_Subj (la porte)_VP: "the beautiful farm carries it"

    Ambiguity pervades all levels of linguistic analysis

Phonological: the same sounds can mean different things.
Recognise speech or Wreck a nice peach?

Lexical semantics: the same word can mean different things.
étoile: sky star or celebrity?

Part of speech: the same word can belong to different parts of speech.
la: pronoun, noun, or determiner?

Syntax: the same sentence can have different syntactic structures.
Jean regarde (la fille avec des lunettes)
Jean ((regarde la fille) avec des lunettes)

Semantics: the same sentence can have different meanings.
La belle ferme la porte

    A combinatorial problem

Ambiguities multiply out, thereby inducing a combinatorial issue.

Example: La porte que la belle ferme présente ferme mal.

word:  la | porte | que | la | belle | ferme | présente | ferme | mal
# POS:  3 |   3   |  3  |  3 |   3   |   5   |    2     |   5   |  2

Number of possible combinations: 3 × 3 × 3 × 3 × 3 × 5 × 2 × 5 × 2 = 24,300

The combinatorics is high.
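The multiplication, spelled out in Python with the counts from the table above:

from math import prod

# Number of possible parts of speech for each word in
# "La porte que la belle ferme présente ferme mal"
pos_counts = [3, 3, 3, 3, 3, 5, 2, 5, 2]
print(prod(pos_counts))  # 24300 candidate POS sequences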


    Problem 2: Paraphrase

There are many ways of saying the same thing. Example:

Quand mon laptop arrivera-t-il? ("When will my laptop arrive?")

Pourriez-vous me dire quand je peux espérer recevoir mon laptop? ("Could you tell me when I can hope to receive my laptop?")

In generation (Meaning → Text), this implies making choices. Again, the combinatorics is high.

    Some NLP applications

Useful systems have been built, e.g., for:

    Spelling and grammar checking

    Speech recognition

    Spoken Language Dialog Systems

    Machine Translation

    Text summarisation

    Information retrieval and extraction

    Question answering


    NLP applications

    Three main types of applications:

1. Language input technologies
2. Language processing technologies
3. Language output technologies

    Language input technologies

    Speech recognition

    Optical character recognition

Handwriting recognition

Retroconversion

    Speech recognition (1)

Key focus: spoken utterance → text

Two main types of applications:

Desktop control: dictation, voice control, navigation

Telephony-based transactions: travel reservation, remote banking, pizza ordering, voice control

    Speech recognition (2)

    Cheap PC desktop software available

60-90% accuracy: good enough for dictation and simple transactions, but depends on the speaker and circumstances.

Speech recognition is not understanding!

    Speech recognition

Based on statistical techniques and very large corpora

Works for many languages

Accuracy depends on audio conditions (robustness problem)

cf. the PAROLE team (Yves Laprie)

    Speech recognition (3)

Desktop control:

Philips FreeSpeech (www.speech.philips.com)
IBM ViaVoice (www.software.ibm.com/speech)
ScanSoft's Dragon NaturallySpeaking (www.lhsl.com/naturallyspeaking)

See also the Google directory category:
http://directory.google.com/Top/Computers/SpeechTechnology/

    Dictation

Dictation systems can do more than just transcribe what was said:

leave out the "um"s and "eh"s
implement corrections that are dictated
fill the information into forms
rephrase sentences (add missing articles, verbs and punctuation; remove redundant or repeated words and self-corrections)

Communicate what is meant, not what is said.

Speech can be used both to dictate content and to issue commands to the word-processing application (speech macros, e.g., to insert frequently used blocks of text or to navigate through a form).

    Speech recognition (4)

Telephony-based fielded products:

Nuance (www.nuance.com)
ScanSoft (www.scansoft.com)
Philips (www.speech.philips.com)
Telstra directory enquiry (tel. 12455)

See also the Google directory category:
http://directory.google.com/Top/Computers/SpeechTechnology/Telephony/

    Optical character recognition (1)

Key focus: printed material → computer-readable representation

Applications:

Scanning (text → digitized format)

Business card readers, to scan the printed information from business cards into the correct fields of an electronic address book (www.cardscan.com)

Website construction from printed documents

    Optical character recognition (2)

Current state of the art:

90% accuracy on clean text
100-200 characters per second (as opposed to 3-4 for typing)

Fundamental issues:

character segmentation and character recognition
problems: unclean data and ambiguity

Many OCR systems use linguistic knowledge to correct recognition errors:

N-grams for word choice during processing
spelling correction in post-processing

    Optical character recognition (3)

    Fielded products

Caere's OmniPage (www.scansoft.com)

Xerox's TextBridge (www.scansoft.com)

ExperVision's TypeReader (www.expervision.com)

    Handwriting recognition (1)

Key focus: human handwriting → computer-readable representation

Applications:

Forms processing
Mail routing
Personal digital assistants (PDAs)

    Handwriting recognition: fundamental issues

Everyone writes differently!

Isolated letters vs. cursive script

Train the user or the system?

Most people type faster than they write: choose applications where keyboards are not appropriate.

Needs an elaborate language model and writing-style models.

    Handwriting recognition (2)

5-6% error rate (on isolated letters)

Good typists tolerate up to a 1% error rate

Human subjects make 4-8% errors

    Handwriting recognition (3)

Isolated letters:

Palm's Graffiti (www.palm.com)
Communication Intelligence Corporation's Jot (www.cic.com)

Cursive scripts:

Motorola's Lexicus
ParaGraph's CalliGrapher (www.paragraph.com)

cf. the READ team (Abdel Belaid)

    Retroconversion

Key focus: identify the logical and physical structure of the input text

Applications:

Recognising tables of contents
Recognising bibliographical references
Locating and recognising mathematical formulae
Document classification

    Language processing technologies

    Spelling and grammar checking

Spoken Language Dialog Systems

    Machine Translation

    Text Summarisation

    Search and Information Retrieval

    Question answering systems


    Spelling and grammar checking

Various levels of sophistication:

Flag words which are not in the dictionary (*neccessary) → dictionary lookup (a sketch follows)

For languages with rich morphology, flag words which are morphosyntactically incorrect, e.g., He *gived a book to Mary → morphological processing
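A minimal sketch of the first level (pure dictionary lookup) using NLTK's word list; it assumes nltk.download('words') has been run, and a real checker would also strip punctuation and handle inflected forms:

from nltk.corpus import words

DICTIONARY = set(w.lower() for w in words.words())

def flag_unknown(text):
    # Level 1: flag every token that the dictionary lookup misses.
    return [tok for tok in text.lower().split() if tok not in DICTIONARY]

print(flag_unknown("this is neccessary"))  # ['neccessary']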


    Spelling and grammar checking

Syntax might be needed:

*Its a fair exchange → possessive pronoun distribution
My friend *were unhappy → subject/verb agreement

Word sense disambiguation:

*The tree's bows were heavy with snow vs. The tree's boughs were heavy with snow

Existing spell checkers only handle a limited number of these problems.

    Spoken Language Dialog Systems

Goal: a system that you can talk to in order to carry out some task.

Key focus:

Speech recognition
Speech synthesis
Dialogue management

Applications:

Information provision systems: provide information in response to a query (requests for timetable information, weather information)

Transaction-based systems: to undertake transactions such as buying/selling stocks or reserving a seat on a plane.

SLDSs: some problems

No training period is possible in phone-based systems

Error handling remains difficult

User initiative remains limited (or is likely to result in errors)


    Existing MT Systems

Bowne's iTranslator (www.itranslator.com)

TAUM-Meteo (1979) (English/French):

domain of weather reports
highly successful

Systran (among several European languages):

human-assisted translation
rough translation
used over the internet through AltaVista: http://babelfish.altavista.com

    The limitations of Taum-Meteo

    Exceptional domain: Limited language, large translation need.

A limited domain with enough material was never found again.

The same group tried to build TAUM-Aviation for aircraft maintenance manuals.

Only limited success.

    The limitations of Systran

Two British undercover soldiers are arrested by Iraqi police in Basra following a car chase. They are reported to have fired on the police.

Deux soldats britanniques de capot interne sont arrêtés par la police d'Iraq à Bassora suivant une chasse de voiture. On rapporte qu'ils mettent le feu sur la police.

undercover/de capot interne: incorrect word translation

following/suivant (instead of suite à): gerund/preposition ambiguity wrongly resolved

car chase/chasse de voiture (instead of course en voiture): wrong recognition of an N-N compound

fire on/mettre le feu sur: non-recognition of a verbal locution

    MT and lexical meaning

L'arbre est une structure très utilisée en linguistique. On l'utilise, par exemple, pour représenter la structure syntaxique d'une expression ou, par le biais des formules logiques, pour représenter le sens des expressions de la langue naturelle.

The tree is a structure very much used in linguistics. It is used for example, to represent the syntactic structure of an expression or, by the means of the logical formulas, to represent the direction of the expressions of the natural language.

    Word salad

Cette approche est particulièrement intéressante parce que, un peu comme les grammaires d'unification introduites il y a quelques décennies par Martin Kay, [ ... ]. Cette vision, qui est sans doute celle de la plupart des linguistes, n'a malheureusement toujours pas trouvé de cadre informatique adéquat pour s'exprimer et s'instancier.

This approach is particularly interesting because, a little like introduced grammars of unification a few decades ago by Martin Kay, [ ... ]. This vision which is undoubtedly, that of the majority of the linguists, unfortunately still did not find of data-processing framework adequate to be expressed and instancier.

    MT State of the Art

    Broad coverage systems already available on the web (Systran)

Reasonable accuracy for specific domains (TAUM-Meteo) or controlled languages

Machine-aided translation is mostly used

    Text summarisation

Key issue: text → shorter version of the text

Applications:

to decide whether it's worth reading the original text
to read the summary instead of the full text
to automatically produce an abstract

    Text summarisation

    Three main steps

1. Extract important sentences (compute document keywords and score document sentences with respect to these keywords); a sketch of this step follows below.

2. Cohesion check: spot anaphoric references and modify the text accordingly (e.g., add the sentence containing the pronoun's antecedent; remove difficult sentences; remove the pronoun).

3. Balance and coverage: modify the summary to have an appropriate text structure (delete redundant sentences; harmonise the tense of verbs; ensure balance and proper coverage).
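A minimal sketch of step 1 in plain Python; the scoring scheme (raw keyword frequency, top-n sentences) is an illustrative choice, not a prescribed one:

import re
from collections import Counter

STOPWORDS = {"the", "a", "of", "to", "and", "is", "in", "it", "that"}

def summarise(text, n_sentences=2):
    # Split into sentences, compute keyword frequencies over content words,
    # then keep the n highest-scoring sentences in their original order.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    tokens = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    keywords = Counter(tokens)
    def score(sent):
        return sum(keywords[w] for w in re.findall(r"[a-z']+", sent.lower()))
    best = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
    return " ".join(s for s in sentences if s in best)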


    Text summarisation

State of the art:

Sentences extracted on the basis of location, linguistic cues, and statistical information
Low discourse coherence

Commercial systems:

British Telecom's ProSum (transend.labs.bt.com)
Copernic (www.copernic.com)
MS Word's summarisation tool

See also http://www.ics.mq.edu.au/swan/summarization/projects.h

    Information Extraction/Retrieval and QA

Given an NL query and documents (e.g., web pages):

retrieve the documents containing the answer (retrieval)
fill in a template with the relevant information (extraction)
produce an answer to the query (Q/A)

Limited to factoid questions, e.g.:

Who invented the electric guitar?
How many hexagons are on a soccer ball?
Where did Bill Gates go to college?

Excludes: how-to questions, yes-no questions, questions that require complex reasoning.

Highest possible accuracy estimated at around 70%.

    Information Extraction/Retrieval and QA

IR systems: Google, Yahoo, etc.

QA systems:

AskJeeves (www.askjeeves.com)
Artificial Life's ALife Sales Rep (www.artificial-life.com)
NativeMinds' vReps (www.nativeminds.com)
Soliloquy (www.soliloquy.com)

    Language output technologies

    Text-to-Speech

    Tailored document generation


    Text-to-Speech (1)

Key focus: text → natural-sounding speech

    Applications

Spoken rendering of email via desktop and telephone
Document proofreading
Voice portals
Computer-assisted language learning

    Text-to-Speech (2)

Requires appropriate use of intonation and phrasing.

Existing systems:

ScanSoft's RealSpeak (www.lhsl.com/realspeak)
British Telecom's Laureate
AT&T Natural Voices (http://www.naturalvoices.att.com)

    Tailored document generation

Key focus: document structure + parameters → individually tailored documents

Applications:

Personalised advice giving
Customised policy manuals
Web-delivered dynamic documents

    Tailored document generation

KnowledgePoint (www.knowledgepoint.com): tailored job descriptions

CoGenTex (www.cogentex.com): project status reports, weather reports

    CL Applications Summary

NLP applications process language using knowledge about language.

All levels of linguistic knowledge are relevant.

Two main problems: ambiguity and paraphrase.

NLP applications use a mix of symbolic and statistical methods.

Current applications are not perfect, as:
symbolic processing is not robust/portable enough
statistical processing is not accurate enough

Applications can be classified into two main types: aids to human users (e.g., spell checkers, machine-aided translation) and agents in their own right (e.g., NL interfaces to DBs, dialogue systems).

Useful applications have been built since the late 70s.

Commercial success is harder to achieve.