AI@Azusa Pacific University
Sheldon Liang, Ph.D., Computer Science Department

Upload: cuthbert-haynes

Post on 25-Dec-2015


Page 1

AI@Azusa Pacific University

Sheldon Liang, Ph.D.
Computer Science Department

Page 2

Sense, Communicate, Actuate

Page 3

Natural?

Natural Language: refers to the language spoken by people (e.g. English, Chinese, Swahili), as opposed to artificial languages such as C++ or Java.

Natural Language Processing: applications that deal with natural language in one way or another; a subfield of Artificial Intelligence.

Computational Linguistics: doing linguistics on computers. More on the linguistic side than NLP, but closely related.

Page 4

What is Artificial Intelligence?

• The use of computer programs and programming techniques to cast light on the principles of intelligence in general and human thought in particular (Boden)
• AI is the study of how to do things which at the moment people do better (Rich & Knight)
• AI is the science of making machines do things that would require intelligence if done by men. (Minsky)

Page 5

Why Natural Language Processing?

kJfmmfj mmmvvv nnnffn333 Uj iheale eleee mnster vensi credur Baboi oi cestnitze Coovoel2^ ekk; ldsllk lkdf vnnjfj? Fgmflmllk mlfm kfre xnnn!

Page 6

Computers Lack Knowledge!

Computers “see” text in English the same way you saw the previous text!

People have no trouble understanding language:
• Common sense knowledge
• Reasoning capacity
• Experience

Computers have:
• No common sense knowledge
• No reasoning capacity

Unless we teach them!

Page 7

Why Natural Language Processing?

Huge amounts of data:
• Internet = at least 8 billion pages
• Intranet

Applications for processing large amounts of text require NLP expertise:
• Classify text into categories
• Index and search large texts
• Automatic translation
• Speech understanding: understand phone conversations
• Information extraction: extract useful information from resumes
• Automatic summarization: condense 1 book into 1 page
• Question answering
• Knowledge acquisition
• Text generation / dialogs

Page 8

Where does it fit in the CS taxonomy?

Computers & Applications
  Databases
  Artificial Intelligence
    Robotics
    Search
    Natural Language Processing
      Information Retrieval
      Machine Translation
      Language Analysis
        Semantics
        Parsing
  Algorithms
  Networking

Page 9

Situating NLP

Disciplines related to NLP:
• computer science
• psychology/cognitive science
• linguistics
• math/statistics
• philosophy
• communication

Page 10

Theoretical foundations

• math: statistics, calculus, algebra, modeling
• computational paradigms: connectionist, rule-based, cognitively plausible
• linguistics: LFG, HPSG, GB, OT, CG, etc.
• architectures: stacks, automata, networks, compilers

Page 11

Some areas of research

• Corpora, tools, resources, standards
• Language/grammar engineering
• Machine (assisted) translation, tools
• Language modeling
• Lexicography
• Speech

Page 12

Linguistics Essentials

Page 13

The Description of Language

Language = Words and Rules = Dictionary (vocabulary) + Grammar

Dictionary: set of words defined in the language
• open (dynamic)
• Traditional: paper based
• Electronic: machine readable dictionaries; can be obtained from paper-based ones

Grammar: set of rules which describe what is allowable in a language
• Classic Grammars
  - meant for humans who know the language
  - definitions and rules are mainly supported by examples
  - no (or almost no) formal description tools; cannot be programmed
• Explicit Grammar (CFG, Dependency Grammars, Link Grammars, ...)
  - formal description
  - can be programmed & tested on data (texts)
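
To make "can be programmed & tested on data" concrete, here is a minimal sketch of an explicit grammar in Python. The rules and the lexicon are assumptions invented for illustration; they are not taken from the slides and cover only a handful of sentences.

```python
# Toy context-free grammar: every rule and word below is a made-up illustration.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"], ["Pron"]],
    "VP": [["V", "NP"], ["V"]],
}
LEXICON = {
    "Det": {"the", "a"},
    "N": {"dog", "bear"},
    "Pron": {"i"},
    "V": {"chased", "saw", "sleeps"},
}


def expand(symbol, words, start):
    """Yield every end position reachable by deriving `symbol` from words[start:]."""
    if symbol in LEXICON:
        if start < len(words) and words[start] in LEXICON[symbol]:
            yield start + 1
        return
    for rhs in GRAMMAR[symbol]:
        positions = {start}
        for part in rhs:
            positions = {end for pos in positions for end in expand(part, words, pos)}
        yield from positions


def is_grammatical(sentence):
    """True if the whole sentence can be derived from S."""
    words = sentence.lower().split()
    return len(words) in set(expand("S", words, 0))
```

Calling `is_grammatical("the dog chased the bear")` accepts the sentence, while a scrambled order such as `"dog the chased"` is rejected. A classic grammar written for human readers offers no such test.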

Page 14

Linguistics Levels of Analysis

Speech / Written language

• Phonology: sounds / letters / pronunciation
• Morphology: the structure of words
• Syntax: how these sequences are structured
• Semantics: meaning of the strings

Interaction between levels, where each level has an input and an output.

Page 15

Phonetics/Orthography

Input: acoustic signal (phonetics) / text (orthography)
Output: phonetic alphabet (phonetics) / text (orthography)

Deals with:
• Phonetics:
  - consonant & vowel (& others) formation in the vocal tract
  - classification of consonants, vowels, ... in relation to frequencies, shape & position of the tongue and various muscles
  - intonation
• Orthography: normalization, punctuation, etc.

Page 16

Phonology -- pronunciation

Input: sequence of phones/sounds (in a phonetic alphabet); or “normalized” text (sequence of (surface) letters in one language’s alphabet) [NB: phones vs. phonemes]
Output: sequence of phonemes (~ (lexical) letters; in an abstract alphabet)

Deals with:
• relation between sounds and phonemes (units which might have some function on the upper level)
• e.g.: [u] ~ oo (as in book), [æ] ~ a (cat); i ~ y (flies)

Page 17

Morphology -- the structure of words

Input: sequence of phonemes (~ (lexical) letters)
Output: sequence of pairs (lemma, (morphological) tag)

Deals with:
• composition of phonemes into word forms and their underlying lemmas (lexical units) + morphological categories (inflection, derivation, compounding)
• e.g. quotations ~ quote/V + -ation(der.V->N) + NNS.

Page 18

...and Beyond

Input: sentence structure (tree): annotated nodes (autosemantic lemmas, (morphosyntactic) tags, deep functions)
Output: logical form, which can be evaluated (true/false)

Deals with:
• assignment of objects from the real world to the nodes of the sentence structure
• e.g.: (I/Sg/Pat/t (see/Perf/Pred/t) Tom/Sg/Ag/f) ~ see(Mark-Twain[SSN:...],Tom-Sawyer[SSN:...])

Page 19

Phonology

(Surface ↔ Lexical) Correspondence
• “symbol-based” (no complex structures)
• Ex.: (stem-final change)
  lexical: b a b y + s (+ denotes start of ending)
  surface: b a b i e s (phonetic-related: bb0s)
• Arabic: (interfixing, inside-stem doubling)
  lexical: kTb+uu+CVCCVC (CVCC... vowel/consonant pattern)
  surface: kuttub

Page 20

Phonology Examples

• German (umlaut) (satz ~ sentence)
  lexical: s A t z + e (A denotes “umlautable” a)
  surface: s ä t z e (phonetic: zæc, vs. zac)
• Turkish (vowel harmony)
  lexical: e v + l A r (~house)
  surface: e v l e r
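
The lexical-to-surface correspondences above can be sketched as small rewrite rules. The two functions below are a hedged illustration only: the function names, the `stem+ending` input format, and the vowel sets are assumptions made for this sketch. They cover just the English stem-final y change and the Turkish plural, not a full two-level phonology.

```python
import re

# Assumed vowel classes for the Turkish harmony rule (illustrative, not exhaustive).
FRONT_VOWELS = set("eiöü")
BACK_VOWELS = set("aıou")


def english_plural(lexical):
    """Map e.g. 'baby+s' to 'babies': stem-final y after a consonant becomes ie."""
    stem, ending = lexical.split("+")
    if ending == "s" and re.search(r"[^aeiou]y$", stem):
        stem = stem[:-1] + "ie"
    return stem + ending


def turkish_harmony(lexical):
    """Resolve the abstract vowel 'A' to e or a, following the last stem vowel."""
    stem, ending = lexical.split("+")
    last_vowel = next(ch for ch in reversed(stem) if ch in FRONT_VOWELS | BACK_VOWELS)
    vowel = "e" if last_vowel in FRONT_VOWELS else "a"
    return stem + ending.replace("A", vowel)
```

With these rules, `english_plural("baby+s")` gives the surface form `babies` and `turkish_harmony("ev+lAr")` gives `evler`, matching the slide examples.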

Page 21

Morphology: Morphemes & Order

Scientific study of forms of words

Grouping of phonemes into morphemes
• sequence deliverables ~ deliver, able and s (3 units)
• could as well be some “ID” numbers: e.g. deliver ~ 23987, s ~ 12, able ~ 3456

Morpheme Combination
• certain combinations/sequencing possible, others not: deliver+able+s, but not able+deliver+s; noun+s, but not noun+ing
• typically fixed (in any given language)
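
One way to picture fixed morpheme order is a small finite-state check over morpheme classes. This is only a sketch: the class names and the transition table are hypothetical and cover nothing beyond the `deliver+able+s` example.

```python
# Hypothetical mini-lexicon: morpheme -> class (names invented for this sketch).
MORPHEME_CLASS = {"deliver": "verb_stem", "able": "adj_suffix", "s": "plural"}

# Which class may follow which; "START" and "END" bracket the word.
ALLOWED_NEXT = {
    "START": {"verb_stem"},
    "verb_stem": {"adj_suffix", "END"},
    "adj_suffix": {"plural", "END"},
    "plural": {"END"},
}


def is_valid_sequence(morphemes):
    """Accept a morpheme sequence only if every transition is allowed."""
    state = "START"
    for morpheme in morphemes:
        cls = MORPHEME_CLASS.get(morpheme)
        if cls not in ALLOWED_NEXT[state]:
            return False
        state = cls
    return "END" in ALLOWED_NEXT[state]
```

Here `["deliver", "able", "s"]` is accepted and `["able", "deliver", "s"]` is rejected, which is the ordering constraint the slide describes.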

Page 22

The Dictionary (or Lexicon)

Repository of information about words:
• Morphological:
  - description of morphological “behavior”: inflection patterns/classes
• Syntactic:
  - Part of Speech
  - relations to other words: subcategorization (or “surface valency frames”)
• Semantic:
  - semantic features
  - frames
• ...and any other! (e.g., translation)

Page 23

Sense, Communicate, Actuate

Page 24

(Surface) Syntax

Input: sequence of pairs (lemma, (morphological) tag)
Output: sentence structure (tree) with annotated nodes (all lemmas, (morphosyntactic) tags, functions), of various forms

Deals with:
• the relation between lemmas & morphological categories and the sentence structure
• uses syntactic categories such as Subject, Verb, Object, ...
• e.g.: I/PP1 see/VB a/DT dog/NN ~ ((I/sg)SB ((see/pres)V (a/ind dog/sg)OBJ)VP)S

Page 25

Issues in Syntax

“the dog ate my homework” - Who did what?

1. Identify the part of speech (POS)
   • Dog = noun; ate = verb; homework = noun
   • English POS tagging: 95%. Can be improved!
   • Part of speech tagging on other languages almost nonexistent

2. Identify collocations
   • mother in law, hot dog
   • Compositional versus non-compositional collocates

Page 26

Issues in Syntax

Shallow parsing:
• “the dog chased the bear”
• “the dog” “chased the bear”: subject - predicate
• Identify basic structures: NP-[the dog] VP-[chased the bear]
• Shallow parsing on new languages
• Shallow parsing with little training data
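
The NP/VP split above can be sketched as a rule-based chunker over POS-tagged tokens. This is a minimal illustration that assumes the input is already tagged with Penn-style tags (DT, JJ, NN, VB*). The chunk patterns are hand-written assumptions, not a trained shallow parser.

```python
def chunk(tagged):
    """Group (word, tag) pairs into NP chunks; other tokens stay on their own."""
    chunks, i = [], 0
    while i < len(tagged):
        j = i
        if tagged[j][1] == "DT":          # optional determiner
            j += 1
        while j < len(tagged) and tagged[j][1] == "JJ":   # any adjectives
            j += 1
        k = j
        while k < len(tagged) and tagged[k][1] == "NN":   # one or more nouns
            k += 1
        if k > j:                          # found at least one noun: emit an NP
            chunks.append(("NP", [word for word, _ in tagged[i:k]]))
            i = k
        else:                              # no NP starts here: keep the single token
            chunks.append((tagged[i][1], [tagged[i][0]]))
            i += 1
    return chunks


def shallow_parse(tagged):
    """Merge each verb with a directly following NP into a VP chunk."""
    chunks = chunk(tagged)
    result, i = [], 0
    while i < len(chunks):
        label, words = chunks[i]
        if label.startswith("VB"):
            if i + 1 < len(chunks) and chunks[i + 1][0] == "NP":
                words = words + chunks[i + 1][1]
                i += 1
            result.append(("VP", words))
        else:
            result.append((label, words))
        i += 1
    return result
```

For the tagged sentence `the/DT dog/NN chased/VBD the/DT bear/NN`, `shallow_parse` returns an NP chunk `the dog` followed by a VP chunk `chased the bear`, with no deeper tree structure.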

Page 27

Issues in Syntax

Full parsing: John loves Mary

Current precision: 85-88%

Helps figure out (automatically) questions like: Who did what and when?

Page 28

Meaning (semantics)

Input: sentence structure (tree) with annotated nodes (lemmas, (morphosyntactic) tags, surface functions)
Output: sentence structure (tree) with annotated nodes (semantic lemmas, (morpho-syntactic) tags, deep functions)

Deals with:
• relation between categories such as “Subject”, “Object” and (deep) categories such as “Agent”, “Effect”; adds other categories
• e.g. ((I)SB ((was seen)V (by Tom)OBJ)VP)S ~ (I/Sg/Pat/t (see/Perf/Pred/t) Tom/Sg/Ag/f)

Page 29

Issues in Semantics

Understand language! How?
• “plant” = industrial plant
• “plant” = living organism

Words are ambiguous

Importance of semantics?
• Machine Translation: wrong translations
• Information Retrieval: wrong information
• Anaphora Resolution: wrong referents

Page 30

Why Semantics?

English: The sea is home to million of plants and animals
  English → French [commercial MT system]
French: Le mer est a la maison de billion des usines et des animaux
  French → English
English: The sea is at the home for billions of factories and animals

Page 31

Issues in Semantics

How to learn the meaning of words? From dictionaries:
• plant, works, industrial plant -- (buildings for carrying on industrial labor; "they built a large plant to manufacture automobiles")
• plant, flora, plant life -- (a living organism lacking the power of locomotion)

• They are producing about 1,000 automobiles in the new plant
• The sea flora consists in 1,000 different plant species
• The plant was close to the farm of animals.

Page 32

Issues in Semantics

Learn from annotated examples:
• Assume 100 examples containing “plant” previously tagged by a human
• Train a learning algorithm
• Precisions in the range 60%-70%-(80%)
• How to choose the learning algorithm?
• How to obtain the 100 tagged examples?
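
Learning a sense tagger from annotated examples might look like the following sketch. It uses a Naive Bayes classifier with add-one smoothing, which is one possible choice of learning algorithm and not one the slides prescribe. The six training sentences and the sense labels `factory` and `flora` are invented stand-ins for the 100 human-tagged examples.

```python
import math
from collections import Counter, defaultdict

# Hypothetical hand-tagged examples; a real experiment would use many more.
TRAIN = [
    ("they built a large plant to manufacture automobiles", "factory"),
    ("the plant produces 1,000 automobiles a year", "factory"),
    ("workers at the plant went on strike", "factory"),
    ("the plant needs water and sunlight to grow", "flora"),
    ("a rare plant species grows in the sea", "flora"),
    ("water the plant every morning", "flora"),
]


def train(examples):
    """Count senses and the context words seen with each sense."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    for text, sense in examples:
        sense_counts[sense] += 1
        word_counts[sense].update(text.split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return sense_counts, word_counts, vocab


def classify(text, model):
    """Return the sense with the highest Naive Bayes log-score."""
    sense_counts, word_counts, vocab = model
    total = sum(sense_counts.values())
    best_sense, best_score = None, -math.inf
    for sense, n in sense_counts.items():
        denom = sum(word_counts[sense].values()) + len(vocab)
        score = math.log(n / total)
        for word in text.split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[sense][word] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense
```

After `model = train(TRAIN)`, a sentence such as "the new plant will manufacture automobiles" is tagged `factory`, because its context words were seen mostly with that sense. How well this generalizes depends heavily on how many tagged examples are available, which is the open question on the slide.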

Page 33

Issues in Learning Semantics

Learning?
• Assume a (large) amount of annotated data = training
• Assume a new text not annotated = test
• Learn from previous experience (training) to classify new data (test)
• Machine Learning: decision trees, memory based learning, neural networks
• Which one performs best?

Page 34

Issues in Semantics

Automatic annotation of data
• Active learning: identify only the hard examples
• Co-training: identify the examples where several techniques agree on the semantic tag
• Collecting from Web users: Open Mind Word Expert

Page 35

Problems faced by Natural Language-Understanding Systems

Page 36

Key NLP problem: ambiguity

Human Language is highly ambiguous at all levels
• acoustic level: recognize speech vs. wreck a nice beach
• morphological level: saw: to see (past), saw (noun), to saw (present, inf)
• syntactic level: I saw the man on the hill with a telescope
• semantic level: One book has to be read by every student


Page 38

Language Model

A formal model about language

Two types:
• Non-probabilistic
  - Allows one to compute whether a certain sequence (sentence or part thereof) is possible
  - Often grammar based
• Probabilistic
  - Allows one to compute the probability of a certain sequence
  - Often extends grammars with probabilities

Pages 39-41

Example of Bad Language Model

Page 42

A Good Language Model

Non-Probabilistic
• “I swear to tell the truth” is possible
• “I swerve to smell de soup” is impossible

Probabilistic
• P(I swear to tell the truth) ~ .0001
• P(I swerve to smell de soup) ~ 0
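
A probabilistic language model of this kind can be sketched with bigram counts. The three-sentence corpus below is a made-up assumption, and the estimate is plain maximum likelihood without smoothing, so any sequence containing an unseen bigram gets probability 0. The absolute numbers are therefore not comparable to the .0001 figure on the slide.

```python
from collections import Counter

# Tiny invented training corpus; real models are estimated from large corpora.
CORPUS = [
    "i swear to tell the truth",
    "i swear to tell the whole truth",
    "i want to tell the truth",
]


def train_bigrams(sentences):
    """Count context words and adjacent word pairs, with sentence boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens[:-1])              # every token that acts as a context
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def probability(sentence, unigrams, bigrams):
    """Maximum-likelihood bigram probability of a whole sentence."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        if unigrams[prev] == 0:                   # unknown context word
            return 0.0
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p
```

With this corpus, "i swear to tell the truth" gets a positive probability and "i swerve to smell de soup" gets 0. Reading "probability greater than 0" as "possible" gives the non-probabilistic view of the same model.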

Page 43

Language Model Application

• Spelling correction
• Mobile phone texting
• Speech recognition
• Handwriting recognition
• Disabled users
• …

Page 44

Speech & Text segmentation

In spoken language, sounds representing successive letters blend into each other. This makes the conversion of the analog signal to discrete characters very difficult.

Regarding text segmentation, some written languages like Chinese, Japanese and Thai don’t mark word boundaries explicitly. So any significant text parsing requires identifying word boundaries, which is often a non-trivial task.
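
A common baseline for finding word boundaries is greedy longest-match segmentation against a dictionary. The sketch below assumes a small hand-made dictionary; unknown characters fall back to single-character words.

```python
def max_match(text, dictionary):
    """Greedy left-to-right segmentation: always take the longest dictionary word."""
    max_len = max(map(len, dictionary))
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the loop makes progress.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words
```

Run on the unspaced English string `thetabledownthere` with a dictionary containing `the`, `theta`, `table`, `bled`, `own`, `down` and `there`, the greedy strategy yields `theta bled own there` and misses `the table down there`. This shows why the task is non-trivial even with a dictionary.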

Page 45

Word sense disambiguation

Word sense disambiguation is the problem of selecting a sense for a word from a set of predefined possibilities.
• Sense inventory usually comes from a dictionary or thesaurus.
• Knowledge intensive methods, supervised learning, and (sometimes) bootstrapping approaches

Word sense discrimination is the problem of dividing the usages of a word into different meanings, without regard to any particular existing sense inventory.
• Unsupervised techniques

Page 46

Word sense disambiguation: Computers versus Humans

• Polysemy – most words have many possible meanings.
• A computer program has no basis for knowing which one is appropriate, even if it is obvious to a human…
• Ambiguity is rarely a problem for humans in their day to day communication, except in extreme cases…

Page 47

Word sense disambiguation: Ambiguity for a Computer

• The fisherman jumped off the bank and into the water.
• The bank down the street was robbed!
• Back in the day, we had an entire bank of computers devoted to this problem.
• The bank in that road is entirely too steep and is really dangerous.
• The plane took a bank to the left, and then headed off towards the mountains.
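
A knowledge-based way to pick a sense is to compare the sentence with dictionary glosses, in the spirit of the simplified Lesk algorithm. In this sketch the two-sense inventory, the gloss wording, and the stopword list are all invented for illustration; a real system would take glosses from a dictionary or thesaurus.

```python
# Hypothetical sense inventory with hand-written glosses (not a real dictionary).
SENSES = {
    "river bank": "sloping land beside a river or lake water shore",
    "financial bank": "institution that keeps money deposits loans and can be robbed",
}
STOPWORDS = {"the", "a", "an", "and", "or", "of", "into", "off", "was", "that", "can", "be"}


def lesk(sentence, senses=SENSES):
    """Pick the sense whose gloss shares the most content words with the sentence."""
    context = {word.strip(".,!?").lower() for word in sentence.split()} - STOPWORDS

    def overlap(sense):
        return len(context & set(senses[sense].split()))

    return max(senses, key=overlap)
```

The first two example sentences resolve to `river bank` (via "water") and `financial bank` (via "robbed"). The remaining three senses of "bank" are not in this toy inventory, so those sentences would be forced into one of the two senses.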

Page 48

Syntactic ambiguity

There are often multiple possible parse trees for a given sentence.

Choosing the most appropriate one usually requires semantic and contextual information.

Specific problem components here are:
1. Sentence boundary disambiguation
2. Imperfect input
3. Foreign or regional accents etc.
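
To see how quickly parse trees multiply, the sketch below counts the trees a toy grammar assigns to a sentence, using a CYK-style chart. The grammar (in Chomsky normal form) and the lexicon are assumptions written for this one example.

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form: (parent, left child, right child).
BINARY = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),
    ("PP", "P", "NP"),
]
LEXICAL = {
    "i": ["NP"], "saw": ["V"], "the": ["Det"], "a": ["Det"],
    "man": ["N"], "hill": ["N"], "telescope": ["N"],
    "on": ["P"], "with": ["P"],
}


def count_parses(sentence, root="S"):
    """Count distinct parse trees for the sentence under the toy grammar."""
    words = sentence.lower().split()
    n = len(words)
    # chart[(i, j)][X] = number of trees deriving words[i:j] from category X
    chart = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        for category in LEXICAL.get(word, []):
            chart[(i, i + 1)][category] += 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for parent, left, right in BINARY:
                    chart[(i, j)][parent] += chart[(i, k)][left] * chart[(k, j)][right]
    return chart[(0, n)][root]
```

Under this grammar, "I saw the man on the hill with a telescope" has 5 parses, depending on where each prepositional phrase attaches, while "I saw the man" has exactly one.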

Page 49

Syntactic ambiguity

Page 50

Statistical NLP

• Statistical NLP uses stochastic, probabilistic and statistical methods to resolve some difficulties of NLP.
• Methods for disambiguation often involve the use of corpora & Markov models.
• Technology for statistical NLP comes from machine learning and data mining, both of which involve learning from data.

Page 51

Statistical NLP -- Corpus

Corpus: text collection for linguistic purposes

• Tokens: How many words are contained in Tom Sawyer? 71,370
• Types: How many different words are contained in T.S.? 8,018
• Hapax legomena: words appearing only once

Page 52

Statistical NLP – Word Counts

The most frequent words are function words

word  freq    word  freq
the   3332    in    906
and   2972    that  877
a     1775    he    877
to    1725    I     783
of    1440    his   772
was   1161    you   686
it    1027    Tom   679
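
Token, type and hapax counts like these can be computed in a few lines. The sketch below assumes a very simple tokenizer (lowercased runs of letters and apostrophes). Exact figures for a real book depend on the tokenization rules, so it will not necessarily reproduce the Tom Sawyer numbers.

```python
import re
from collections import Counter


def corpus_stats(text):
    """Return token count, type count, hapax legomena and the top words."""
    tokens = re.findall(r"[a-z']+", text.lower())   # assumed tokenization rule
    counts = Counter(tokens)
    hapax = sorted(word for word, c in counts.items() if c == 1)
    return {
        "tokens": len(tokens),
        "types": len(counts),
        "hapax_legomena": hapax,
        "most_common": counts.most_common(3),
    }
```

For the sentence "Tom saw the fence and the fence saw Tom and sighed" this reports 11 tokens, 6 types, and one hapax legomenon, `sighed`.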

Page 53

Major Tasks in NLP

• Speech Recognition
• Natural Language Generation
• Machine Translation
• Information Retrieval
• Information Extraction
• Text Simplification
• Automatic summarization
• Foreign Language Reading & writing aid

Page 54

Speech Recognition

It is the process of converting a speech signal to a sequence of words, by means of an algorithm (implemented as a computer program).

Applications are:
1. Voice dialing
2. Call routing
3. Simple data entry
4. Preparation of structured documents

Page 55

Natural Language generation

It is the task of generating natural language from a machine representation system such as a knowledge base or a logical form.

Ex: Choose randomly among outputs:
– Visitant which came into the place where it will be Japanese has admired that there was Mount Fuji.

Top 10 outputs according to bigram probabilities:
– Visitors who came in Japan admire Mount Fuji.
– Visitors who came in Japan admires Mount Fuji.
– Visitors who arrived in Japan admire Mount Fuji.
– A visitor who came in Japan admire Mount Fuji.
– The visitor who came in Japan admire Mount Fuji.
– Visitors who came in Japan admire Mount Fuji.
– The visitor who came in Japan admires Mount Fuji.
– Mount Fuji is admired by a visitor who came in Japan.

Page 56

Conclusion

Overview of some probabilistic and machine learning methods for NLP

Also very relevant to bioinformatics!
Analogy between parsing:
• A sentence
• A biological string (DNA, protein, mRNA, …)

Page 57

Sense, Communicate, Actuate

Page 58

Machine Translations

Machine Translation, or MT, is a sub-field of computational linguistics that investigates the use of computer software to translate text or speech from one natural language to another.

Page 59: 1 AI@Azusa Pacific University Sheldon Liang, Ph D Computer Science Department

59

Issues in Machine TranslationsIssues in Machine Translations

Text-to-text machine translation
Speech-to-speech machine translation
Most of the work has addressed pairs of widely spoken languages, such as English-French and English-Chinese
How to translate text?
  Learn from previously translated data
  Need parallel corpora
  French-English, Chinese-English have the Hansards
  Reasonable translations
Chinese-Hindi – no such tools available today!

Issues in Machine Translations

How to obtain parallel texts?
  From the Web! How?
  From Web users! How?
Once we have the texts, how to get the most out of them?
  Word alignments
  Obtain lexicons
  Import knowledge from well-studied languages
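One way to get word alignments and a lexicon out of parallel text is expectation-maximization in the style of IBM Model 1. The sketch below is a simplified illustration, not a method named in the slides: the three-sentence English-French corpus and the function names (`train_model1`, `lexicon`) are hypothetical, and the NULL word used in the full model is left out.

```python
from collections import defaultdict

# Toy English-French parallel corpus (hypothetical, for illustration only).
PARALLEL = [
    ("the house", "la maison"),
    ("the flower", "la fleur"),
    ("a flower", "une fleur"),
]

def train_model1(pairs, iterations=20):
    """Estimate t(f | e), the probability that English e translates to French f."""
    corpus = [(e.split(), f.split()) for e, f in pairs]
    f_vocab = {f for _, fs in corpus for f in fs}
    # Start from a uniform translation table.
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected number of (f, e) links
        total = defaultdict(float)  # expected number of links involving e
        # E-step: distribute each French word over the English words of its sentence.
        for es, fs in corpus:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    delta = t[(f, e)] / norm
                    count[(f, e)] += delta
                    total[e] += delta
        # M-step: re-estimate the table from the expected counts.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

def lexicon(t):
    """For each English word, keep its most probable French translation."""
    best = {}
    for (f, e), p in t.items():
        if e not in best or p > best[e][1]:
            best[e] = (f, p)
    return {e: f for e, (f, _) in best.items()}

if __name__ == "__main__":
    for english, french in sorted(lexicon(train_model1(PARALLEL)).items()):
        print(english, "->", french)
```

The corpus never says which word goes with which, yet "la" appearing with "the" in two different sentences is enough for the table to sort out the remaining pairs over a few iterations.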

Information Extraction

It’s a type of information retrieval whose goal is to automatically extract structured or semi-structured information from unstructured machine-readable documents.

Its significance is determined by the growing amount of information available in unstructured form, for instance on the Internet.

Issues in Information Extraction

“There was a group of about 8-9 people close to the entrance on Highway 75”

Who? “8-9 people”
Where? “Highway 75”
Extract information
Detect new patterns:
  Detect hacking / hidden information / etc.
Government and military put a lot of money into IE research
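The who/where extraction on the example sentence can be sketched with two hand-written patterns. The patterns and the `extract` function are assumptions made for illustration; real IE systems rely on many more patterns or on learned models.

```python
import re

SENTENCE = ("There was a group of about 8-9 people close to the entrance "
            "on Highway 75")

# Hand-written patterns (hypothetical; chosen only to cover the slide's example).
WHO = re.compile(r"\b\d+(?:-\d+)?\s+(?:people|persons|men|women)\b", re.IGNORECASE)
WHERE = re.compile(r"\b(?:Highway|Route|Interstate)\s+\d+\b", re.IGNORECASE)

def extract(text):
    """Fill a small who/where template from free text; missing slots are None."""
    who = WHO.search(text)
    where = WHERE.search(text)
    return {
        "who": who.group(0) if who else None,
        "where": where.group(0) if where else None,
    }

if __name__ == "__main__":
    print(extract(SENTENCE))
```

The output is a structured record (a template with named slots) built from unstructured text, which is the core idea of information extraction.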

Information Retrieval

Information Retrieval (IR) is the science of searching for information in documents, for documents themselves, for metadata, or within databases of any kind.

Issues in Information Retrieval

Index meaning
  Search for plant (= living organism) should not retrieve texts with plant (= industrial plant)
  But should retrieve documents including “flora” or other related terms
Index parsed relations
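Indexing by meaning rather than by surface word can be sketched as follows. Everything here is an assumption for illustration: the two-entry sense inventory, the cue words, the sample documents, and the naive disambiguation rule (pick the sense whose cue words overlap the text most) are hypothetical stand-ins for real word sense disambiguation.

```python
import re
from collections import defaultdict

# Hypothetical sense inventory: sense label -> (surface terms, context cue words).
SENSES = {
    "plant/organism": ({"plant", "plants", "flora"},
                       {"garden", "leaves", "grow", "species", "water"}),
    "plant/factory": ({"plant", "plants", "factory"},
                      {"industrial", "workers", "production", "steel"}),
}

# Hypothetical document collection.
DOCS = {
    1: "The plant needs water to grow new leaves",
    2: "Workers at the industrial plant stopped production",
    3: "The flora of the island includes rare species",
}

def senses_in(text):
    """Return the sense labels found in a text (ties go to the first sense listed)."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    found = set()
    for token in tokens:
        candidates = [s for s, (terms, _) in SENSES.items() if token in terms]
        if candidates:
            # Pick the candidate sense whose cue words overlap the text most.
            found.add(max(candidates, key=lambda s: len(SENSES[s][1] & tokens)))
    return found

def build_index(docs):
    """Map each sense label to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for sense in senses_in(text):
            index[sense].add(doc_id)
    return index

def search(index, sense):
    """Look up documents by sense label instead of by surface word."""
    return sorted(index.get(sense, set()))

if __name__ == "__main__":
    index = build_index(DOCS)
    print(search(index, "plant/organism"))
    print(search(index, "plant/factory"))
```

Because the index key is the sense, a search for the living-organism sense of "plant" reaches the "flora" document and skips the industrial one.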

Issues in Information Retrieval

Retrieve specific information
  Question Answering
  “What is the height of Mount Everest?”
  About 29,000 feet
  Current state-of-the-art 40-50%
Improve precision with the use of more common sense knowledge
Perform domain-specific question answering

Issues in Information Retrieval

Find information across languages!
  Cross-Language Information Retrieval
  “What is the minimum age requirement for car rental in Italy?”
  Search also Italian texts for “età minima per noleggio macchine”
Integrate a large number of languages
Integrate into high-performance IR engines

Automatic Summarization

It is the creation of a shortened version of a text by a computer program.

As access to data has increased, so has interest in automatic summarization. An example of the use of summarization technology is search engines such as Google.

Technologies that produce a coherent summary of any kind of text need to take into account several variables, such as length, writing style, and syntax, to make the summary useful.
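A very simple form of automatic summarization is extractive: score each sentence by how frequent its words are in the whole text and keep the top ones. The sketch below assumes this frequency-based approach purely for illustration; it is not how any particular search engine works, and the stopword list, sample text, and `summarize` function are hypothetical. The length variable is reduced here to a maximum number of sentences.

```python
import re
from collections import Counter

# Small stopword list (an assumption; real systems use much larger ones).
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "are",
             "it", "that", "for", "on", "as", "by", "with"}

TEXT = ("Summarization shortens a text. "
        "A summarization program scores each sentence of the text. "
        "The weather was nice yesterday.")

def content_words(text):
    """Lowercased words of the text, without stopwords."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]

def summarize(text, max_sentences=1):
    """Keep the max_sentences highest-scoring sentences, in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip())
                 if s.strip()]
    freq = Counter(content_words(text))

    def score(sentence):
        # Average document frequency of the sentence's content words.
        tokens = content_words(sentence)
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)

if __name__ == "__main__":
    print(summarize(TEXT, max_sentences=2))
```

Selected sentences are returned in their original order, a small step toward coherence; handling writing style and syntax would need much more than word counts.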

Foreign Language Writing Aid

It is a computer program that assists a non-native language user in writing in their target language.

Assistive operations can be classified into two categories: on-the-fly prompts and post-writing checks.

Assisted aspects of writing include lexical syntax, lexical semantics, idiomatic expression transfer, etc.

Online dictionaries can also be considered a type of foreign language writing aid.


Language & speech technology have advanced rapidly in the last decades.


This is EveR-2 Muse, a robot version of a Korean woman in her twenties (Eve + R for robot). She can hold a conversation, sing a song, make eye contact, and express anger, sorrow, and joy. But according to her creator, most Koreans found her homely in comparison to her predecessor.


Achievements of AI/NLP

Sphinx can recognise continuous speech without training for each speaker. It operates in near real time using a vocabulary of 1000 words and has 94% word accuracy.
Deep Thought is an international grandmaster chess player.
Navlab is a truck that can drive along a road at 55 mph in normal traffic.
Carlton and United Breweries use an AI planning system to plan production of their beer.
Natural language interfaces to databases can be obtained on a PC.
Machine learning methods have been used to build expert systems.
Expert systems are used regularly in finance, medicine, manufacturing, and agriculture.

If this dream comes alive…

Even a person who knows nothing about computers can interact with one through colloquial conversation.
Almost all systems will be automated.
Many problems will have found a solution.
No one will need to learn computer languages any more; instead, people can interact with the computer in their own natural (regional) languages.
It would be a matter of jubilance for the world as a whole…


So let's await that wonderful day and work in this direction…
