Natural Language Processing

Upload: chrystal-robertson

Post on 17-Dec-2015


Natural Language Processing


According to research at an Elingsh uinervtisy, it deosn’t mttaer in what oredr the ltteers in a wrod are, the olny iprmoetnt tihng is that the frist and lsat ltteer are in the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae we do not raed ervey lteter by islelf but the wrod as a wlohe. Ceehiro. That’ll sutff the splelchekcer.

(Excerpt from Sideswipe, NZ Herald, 22/09/03)
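
The scrambling described in the excerpt can be sketched in a few lines of Python. This is only an illustration, not the method used to produce the quoted text: the function names `scramble_word` and `scramble_text` and the `seed` parameter are assumptions, and the sketch treats every whitespace-separated token as a plain word (it does not handle punctuation).

```python
import random


def scramble_word(word: str, rng: random.Random) -> str:
    """Shuffle the interior letters of a word, keeping first and last in place."""
    if len(word) <= 3:
        return word  # at most one interior letter, so nothing can move
    interior = list(word[1:-1])
    rng.shuffle(interior)
    return word[0] + "".join(interior) + word[-1]


def scramble_text(text: str, seed: int = 0) -> str:
    """Scramble every whitespace-separated word (assumes no punctuation)."""
    rng = random.Random(seed)
    return " ".join(scramble_word(word, rng) for word in text.split())


print(scramble_text("the only important thing is the first and last letter"))
```

Each output word keeps its length, its letters, and its first and last character; only the interior order changes.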


1970s Example: SHRDLU

Person: PICK UP A BIG RED BLOCK.

Computer: OK. (does it)

Person: GRASP THE PYRAMID.

Computer: I DON'T UNDERSTAND WHICH PYRAMID YOU MEAN.

Person: FIND A BLOCK WHICH IS TALLER THAN THE ONE YOU ARE HOLDING AND PUT IT INTO THE BOX.

Computer: BY "IT", I ASSUME YOU MEAN THE BLOCK WHICH IS TALLER THAN THE ONE I AM HOLDING.

Computer: OK. (does it)

Person: WHAT DOES THE BOX CONTAIN?

Computer: THE BLUE PYRAMID AND THE BLUE BLOCK.

Person: WHAT IS THE PYRAMID SUPPORTED BY?

Computer: THE BOX.

Person: HOW MANY BLOCKS ARE NOT IN THE BOX?

Computer: FOUR OF THEM.

Person: IS AT LEAST ONE OF THEM NARROWER THAN THE ONE WHICH I TOLD YOU TO PICK UP?

Computer: YES, THE RED CUBE.

Terry Winograd. 1971. MIT Ph.D. Thesis.


Pomegranate


Natural Language Processing (NLP), Human Language Technology (HLT), Natural Language Engineering (NLE)

• NLP is considered a sub-field of artificial intelligence and has significant overlap with the field of computational linguistics. It is concerned with the interactions between computers and human (natural) languages.

• Natural language generation systems convert information from computer databases into readable human language.

• Natural language understanding systems convert human language into representations that are easier for computer programs to manipulate.

• The term natural language is used to distinguish human languages (e.g. English, Persian, Swedish) from formal or computer languages (e.g. C++, Prolog).

• NLP encompasses both text and speech, but work on speech processing has evolved into a separate field.


Where does it fit in the CS taxonomy?

• Computers: Artificial Intelligence, Algorithms, Databases, Networking

• Artificial Intelligence: Robotics, Search, Natural Language Processing

• Natural Language Processing: Information Retrieval, Machine Translation, Language Analysis

• Language Analysis: Semantics, Parsing


Applications

• Yahoo, Google, Microsoft: Information Retrieval

• Monster.com, HotJobs.com (job finders): Information Extraction & Information Retrieval

• Systran powers Babelfish, Google: Machine Translation

• Ask Jeeves: Question Answering

• Myspace, Facebook, Blogspot: Processing of User-Generated Content

• Tools for “business intelligence”

• All “Big Companies” have (several) strong NLP research labs: IBM, Microsoft, AT&T, Xerox, Sun, etc.

• Academia: research in a university environment


What is NLP?

• Combination of computational linguistics, artificial intelligence & cognitive science.

• Concentrates on interpreting text using a combination of lexical, syntactic, semantic and real-world knowledge.

• Applications include intelligent translators, speech recognition software, information management tools and other types of communication software.


Grammar

• The grammar of a language is a description of the structure of that language.

• Grammars provide a scheme for specifying the structure of sentences and rules for combining words into correct phrases and clauses.


English Grammar

• English word order follows a Subject-Verb-Object (SVO) linguistic typology.

• The subject of a verb is the “doer” of the action, and the object is the “doee”, the thing the action is done to.

The cat is drinking the milk.

Subject: The cat | Verb: is drinking | Object: the milk


Syntax

• Syntax is the study of the rules, or patterns, that govern the way the words in a sentence come together.

• Syntax deals with how words are categorised into “parts of speech” (nouns, adjectives, verbs, etc.) and how they are combined into phrases and clauses, which in turn combine into sentences.


Syntactic Analysis

• Syntactic analysis involves breaking sentences and phrases down into a hierarchical structure so that their constituents can be studied.

• For example, the sentence “the big cat is drinking milk” can be broken up into the following constituents:

The big cat is drinking milk

Noun Phrase: The (Determiner), big (Adjective Phrase), cat (Noun)

Verb Phrase: is (Auxiliary), drinking (Verb), milk (Noun Phrase)


A Grammar for a very small fragment of English

Implementation: Prolog

sentence --> noun_phrase, verb_phrase.

noun_phrase --> determiner, noun.
noun_phrase --> proper_noun.

determiner --> [the].
determiner --> [a].

proper_noun --> [pedro].

noun --> [man].
noun --> [apple].

verb_phrase --> verb, noun_phrase.
verb_phrase --> verb.

verb --> [eats].
verb --> [sings].


?- phrase(sentence, [the, man, eats]).
yes

?- phrase(sentence, [the, man, eats, the, apple]).
yes

?- phrase(sentence, [the, apple, eats, a, man]).
yes

?- phrase(sentence, [pedro, sings, the, pedro]).
no

?- phrase(sentence, [eats, apple, man]).
no

?- phrase(sentence, L).
L = [the, man, eats, the, man] ;
L = [the, man, eats, the, apple] ;
L = [the, man, eats, a, man] ;
L = [the, man, eats, a, apple] ;
L = [the, man, eats, pedro] ;
L = [the, man, sings, the, man] ;
L = [the, man, sings, the, apple] ;
L = [the, man, sings, a, man] ;
L = [the, man, sings, a, apple] ;
L = [the, man, sings, pedro] ;
L = [the, man, eats] ;
L = [the, man, sings] ;
L = [the, apple, eats, the, man] ;
L = [the, apple, eats, the, apple] ;
L = [the, apple, eats, a, man] ;
L = [the, apple, eats, a, apple] ;
L = [the, apple, eats, pedro] ;
L = [the, apple, sings, the, man] ;
L = [the, apple, sings, the, apple] ;
L = [the, apple, sings, a, man] ;
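
For readers without a Prolog system, the same grammar can be sketched in Python. This is a rough port under stated assumptions, not how Prolog's `phrase/2` works internally: the names `GRAMMAR`, `expand` and `accepts` are invented for this illustration, and recognition is done by brute-force enumeration, which is only workable because this grammar has no recursive rules and therefore generates a finite set of sentences.

```python
from itertools import islice

# Each non-terminal maps to its alternatives, listed in the same order as the
# Prolog rules. Any symbol that is not a key of GRAMMAR is treated as a word.
GRAMMAR = {
    "sentence": [["noun_phrase", "verb_phrase"]],
    "noun_phrase": [["determiner", "noun"], ["proper_noun"]],
    "determiner": [["the"], ["a"]],
    "proper_noun": [["pedro"]],
    "noun": [["man"], ["apple"]],
    "verb_phrase": [["verb", "noun_phrase"], ["verb"]],
    "verb": [["eats"], ["sings"]],
}


def expand(symbols):
    """Yield every word list derivable from a sequence of symbols, in rule order."""
    if not symbols:
        yield []
        return
    first, rest = symbols[0], symbols[1:]
    if first not in GRAMMAR:  # terminal: the word itself
        for tail in expand(rest):
            yield [first] + tail
        return
    for alternative in GRAMMAR[first]:
        for head in expand(alternative):
            for tail in expand(rest):
                yield head + tail


def accepts(words):
    """Brute-force recognition: is the word list among the generated sentences?"""
    return any(list(words) == sentence for sentence in expand(["sentence"]))


print(accepts(["the", "man", "eats"]))
print(accepts(["pedro", "sings", "the", "pedro"]))
for sentence in islice(expand(["sentence"]), 5):
    print(sentence)
```

Because the alternatives are tried in the order they are written, the enumeration follows the same order as the `phrase(sentence, L)` query. A grammar with recursive rules would need a proper parser instead of enumeration.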


Issues in Syntax

• “the dog ate my homework” - Who did what?

• Identify the part of speech (POS)

– Dog = noun ; ate = verb ; homework = noun

– English POS tagging

• Identify collocations

– mother in law, hot dog
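
A toy version of these two steps might look like the sketch below. Everything in it is an assumption made for illustration: the tiny hand-written `WORD_TAGS` and `COLLOCATIONS` tables and the `tag` function are hypothetical, and a realistic tagger would learn its tags from annotated text rather than look them up in a fixed dictionary.

```python
# Hypothetical hand-written lexicon; unknown words are tagged "unknown".
WORD_TAGS = {
    "the": "determiner", "dog": "noun", "ate": "verb",
    "my": "determiner", "homework": "noun", "hot": "adjective",
}
# Multi-word units that should be tagged as a single item.
COLLOCATIONS = {("hot", "dog"): "noun", ("mother", "in", "law"): "noun"}


def tag(words):
    """Greedy left-to-right tagging; the longest known collocation wins."""
    tagged, i = [], 0
    longest = max(len(collocation) for collocation in COLLOCATIONS)
    while i < len(words):
        for size in range(longest, 1, -1):
            chunk = tuple(words[i:i + size])
            if chunk in COLLOCATIONS:
                tagged.append((" ".join(chunk), COLLOCATIONS[chunk]))
                i += len(chunk)
                break
        else:  # no collocation starts here: tag the single word
            tagged.append((words[i], WORD_TAGS.get(words[i], "unknown")))
            i += 1
    return tagged


print(tag("the dog ate my homework".split()))
print(tag("the dog ate my hot dog".split()))
```

In the second call, "hot dog" is kept together and tagged as one noun instead of an adjective followed by a noun, which is the point of identifying collocations before interpreting the sentence.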


Chomsky’s Grammars

• Chomsky introduced transformational grammars (also called transformational generative grammars or generative grammars).

• He introduced the idea of “deep structures” which provide a syntactic base of language and consist of:


– a series of phrase-structure (rewrite) rules

– a series of (possibly universal) rules that generates the underlying phrase-structure of a sentence

– a series of transformations that act upon the phrase-structure, producing more complex sentences

– a series of morphophonemic rules controlling pronunciation.


Chomsky’s Lexicon

• The lexicon, which can be thought of as a dictionary of the language in a particular form, lists all of the vocabulary words in the language and associates them with their syntactic, semantic and phonological information.

• This information is represented in terms of “features”.


Chomsky’s Feature Terms

• For example, the entry for “cat” might have the following syntactic features:

Cat: [+ Noun], [+ Count], [+ Common], [+ Animate]

• These features are used to fill “slots” in a set of phrase markers. For example, a phrase marker requiring an animate noun ([+ Animate]) would find “cat” eligible for lexical substitution into that slot, as it fulfils the requirement of being an animate noun.
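
The slot-filling idea can be sketched with plain dictionaries. This is a simplified illustration, not Chomsky's formalism: `FEATURE_LEXICON`, `fits_slot` and `candidates` are hypothetical names, the entry for “cat” follows the features listed above, and the entry for “milk” is an assumed example added for contrast.

```python
# True stands for [+ Feature], False for [- Feature].
FEATURE_LEXICON = {
    "cat": {"Noun": True, "Count": True, "Common": True, "Animate": True},
    # Assumed entry, not from the slides: a non-count, inanimate noun.
    "milk": {"Noun": True, "Count": False, "Common": True, "Animate": False},
}


def fits_slot(word, required):
    """True if the word's lexical entry carries every feature the slot requires."""
    features = FEATURE_LEXICON[word]
    return all(features.get(name) == value for name, value in required.items())


def candidates(required):
    """All words in the lexicon that are eligible for the slot."""
    return [word for word in FEATURE_LEXICON if fits_slot(word, required)]


animate_noun_slot = {"Noun": True, "Animate": True}
print(fits_slot("cat", animate_noun_slot))
print(fits_slot("milk", animate_noun_slot))
print(candidates(animate_noun_slot))
```

A slot that asks only for `{"Noun": True}` accepts both words, while the animate-noun slot accepts only “cat”.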


Syntactics vs Semantics

• One of the most controversial topics in the development of transformational grammar is the relationship between syntax and semantics.

• There is a considerable degree of interdependence between the two, and the problem is how to formalise this relationship.


Phrase Structure Grammars

• Phrase-structure rules are used to describe a given language's syntax by attempting to break language down into its constituent parts (also known as syntactic categories), namely phrasal categories and lexical categories (parts of speech).

• There are many kinds of phrase-structure rules, which themselves can be combined to generate additional phrase-structure rules.


• In particular, phrase-structure rules must account for the following characteristics:

1. All languages combine nouns (N) and verbs (V) to express ideas about the universe.

2. All languages have rules determining how these are combined into meaningful units.


3. All languages have recursion, i.e. at least one rule that can be repeated ad infinitum:

– An example of this is the English use of "and", which can link any series of two or more nouns or two or more verbs:

• "His and hers and theirs and Mary's and John's ... etc."

• "He ran and jumped and played and skipped and danced and ... etc."

– This would be described in Transformational Grammar as:

• A noun phrase (NP) consists of an N or NP, the word ‘and’, and another N or NP.

• A verb phrase (VP) consists of a V or VP, the word ‘and’, and another V or VP.
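
The recursive verb-phrase rule can be written almost word for word as a recursive function. The sketch below is an illustration under assumptions: the `VERBS` set is taken from the example sentence, and `is_coordinated_vp` is a hypothetical name. It checks the rule "VP consists of a V or VP, the word ‘and’, and another V or VP" by trying every position of ‘and’ as the split point.

```python
# Verbs taken from the example "He ran and jumped and played and skipped and danced".
VERBS = {"ran", "jumped", "played", "skipped", "danced"}


def is_coordinated_vp(words):
    """True if words match VP -> (V | VP) 'and' (V | VP), applied recursively."""

    def is_v_or_vp(part):
        is_single_verb = len(part) == 1 and part[0] in VERBS
        return is_single_verb or is_coordinated_vp(part)

    # Try each 'and' as the top-level split; both sides must be a V or a VP.
    return any(
        words[i] == "and" and is_v_or_vp(words[:i]) and is_v_or_vp(words[i + 1:])
        for i in range(1, len(words) - 1)
    )


print(is_coordinated_vp("ran and jumped".split()))
print(is_coordinated_vp("ran and jumped and played and skipped".split()))
print(is_coordinated_vp("ran jumped".split()))
```

Because the function calls itself on ever shorter parts of the input, one rule covers chains of any length, which is what "recursion" means here. The same shape works for the noun-phrase rule with a set of nouns in place of `VERBS`.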


Phrase Structure Tree

Sentence
  Noun Phrase
    Determiner: A
    Noun: monkey
  Verb Phrase
    Verb: climbs
    Noun Phrase
      Determiner: the
      Noun: trees
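
One simple way to hold such a tree in a program is as nested tuples. The sketch below is an assumed representation chosen for brevity, not a standard one: each phrase is `(label, child, child, ...)`, each leaf is `(label, word)`, and the helper names `leaves` and `show` are invented for this example.

```python
# Phrase nodes: (label, child, ...). Leaf nodes: (label, word).
TREE = (
    "Sentence",
    ("Noun Phrase", ("Determiner", "A"), ("Noun", "monkey")),
    (
        "Verb Phrase",
        ("Verb", "climbs"),
        ("Noun Phrase", ("Determiner", "the"), ("Noun", "trees")),
    ),
)


def is_leaf(node):
    """A leaf holds a label followed by a single word."""
    return len(node) == 2 and isinstance(node[1], str)


def leaves(node):
    """Read the words back off the tree, left to right."""
    if is_leaf(node):
        return [node[1]]
    return [word for child in node[1:] for word in leaves(child)]


def show(node, depth=0):
    """Render the tree as indented text, one node per line."""
    indent = "  " * depth
    if is_leaf(node):
        return f"{indent}{node[0]}: {node[1]}"
    lines = [indent + node[0]]
    lines += [show(child, depth + 1) for child in node[1:]]
    return "\n".join(lines)


print(" ".join(leaves(TREE)))
print(show(TREE))
```

Reading the leaves from left to right gives back the original sentence, and the indentation in `show` mirrors the levels of the tree.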


Problems with Traditional Grammars

• They are grammar-based, whereas natural language is not strictly grammar-based.

• Most don’t take into account language variations and dialects.

• Humans have a built-in natural language processor that can handle things machine natural language processors cannot.


Yoda

• “When 900 years old you reach, look as good you will not.”

• “With you the force is.”

• “A brave man your father was.”

• Yoda (typically) uses the Object-Subject-Verb (OSV) linguistic typology, which is characteristic of some of the languages of Brazil.
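
The difference between word orders can be made concrete with a small sketch. It assumes a clause has already been split into subject, verb and object (which is the hard part in practice); the names `ROLES` and `linearise` are hypothetical, and the sketch ignores capitalisation and punctuation.

```python
ROLES = {"S": "subject", "V": "verb", "O": "object"}


def linearise(clause, order):
    """Join the parts of a clause in the given order, e.g. 'SVO' or 'OSV'."""
    if sorted(order) != sorted(ROLES):
        raise ValueError(f"order must be a permutation of S, V, O: {order!r}")
    return " ".join(clause[ROLES[letter]] for letter in order)


cat = {"subject": "the cat", "verb": "is drinking", "object": "the milk"}
father = {"subject": "your father", "verb": "was", "object": "a brave man"}

print(linearise(cat, "SVO"))     # the cat is drinking the milk
print(linearise(cat, "OSV"))     # the milk the cat is drinking
print(linearise(father, "OSV"))  # a brave man your father was
```

The same three parts produce ordinary English in SVO order and a Yoda-style sentence in OSV order.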


Inherent Complexity

• To understand a sentence you must do more than combine the dictionary meanings of its constituents.

• A large amount of human knowledge is assumed and communication takes place between complex agents in complex environments.


Statistical approach

• Statistical Machine Translation