natural language processing (nlp)
Natural Language Processing (NLP)
Overview and history of the field
Knowledge of language
The role of ambiguity
Models and Algorithms
Eliza, Turing, and conversational agents
History of speech and language processing
Regular expressions
Computer Speech and Language Processing
What is it?
Getting computers to perform useful tasks involving human languages, whether for:
- Enabling human-machine communication
- Improving human-human communication
- Doing stuff with language objects
Examples:
- Question Answering
- Machine Translation
- Spoken Conversational Agents
Knowledge needed to build a speaking computer
Speech recognition and synthesis
- Dictionaries (how words are pronounced)
- Phonetics (how to recognize/produce each sound of English)
Natural language understanding
- Knowledge of the English words involved
  - What they mean
  - How they combine (what is a `pod bay door’?)
- Knowledge of syntactic structure
  - I’m I do, Sorry that afraid Dave I’m can’t (the right words in a scrambled order do not make a sentence)
Eliza: Weizenbaum (1966)
User: You are like my father in some ways.
ELIZA: What resemblance do you see
User: You are not very aggressive but I think you don’t want me to notice that.
ELIZA: What makes you think I am not aggressive
User: You don’t argue with me
ELIZA: Why do you think I don’t argue with you
User: You are afraid of me
ELIZA: Does it please you to believe I am afraid of you
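Exchanges like the ones above can be approximated with pattern matching and substitution. The sketch below is only an illustration, not Weizenbaum's program: the three rules, the `respond` function, and the fallback reply "Please go on" are hypothetical names and choices made for this example.

```python
import re

# Hypothetical rules: each pairs a pattern with a response template.
# Rules are tried in order and the first one that matches wins.
RULES = [
    (re.compile(r"\byou are like\b", re.IGNORECASE),
     "What resemblance do you see"),
    (re.compile(r"\byou don['’]t (\w+) with me\b", re.IGNORECASE),
     r"Why do you think I don't \1 with you"),
    (re.compile(r"\byou are (\w+) of me\b", re.IGNORECASE),
     r"Does it please you to believe I am \1 of you"),
]


def respond(utterance: str) -> str:
    """Return the response of the first matching rule, echoing captured words."""
    for pattern, template in RULES:
        match = pattern.search(utterance)
        if match:
            # expand() fills \1 in the template with the captured word.
            return match.expand(template)
    return "Please go on"  # assumed fallback when no rule applies


print(respond("You are afraid of me"))
```

The program has no understanding of the input; it only reuses the user's own words inside a canned frame, which is why the replies can look attentive.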
Ambiguity
Computational linguists are obsessed with ambiguity
Ambiguity is a fundamental problem of computational linguistics
Resolving ambiguity is a crucial goal
Ambiguity
Find at least 5 meanings of this sentence:
I made her duck
- I cooked waterfowl for her benefit (to eat)
- I cooked waterfowl belonging to her
- I created the (plaster?) duck she owns
- I caused her to quickly lower her head or body
- I waved my magic wand and turned her into undifferentiated waterfowl
Ambiguity is Pervasive
I caused her to quickly lower her head or body
- Lexical category: “duck” can be a N or V
I cooked waterfowl belonging to her.
- Lexical category: “her” can be a possessive (“of her”) or dative (“for her”) pronoun
I made the (plaster) duck statue she owns
- Lexical semantics: “make” can mean “create” or “cook”
Ambiguity is Pervasive
Grammar: “make” can be:
Transitive (verb has a noun direct object)
- I cooked [waterfowl belonging to her]
Ditransitive (verb has 2 noun objects)
- I made [her] (into) [undifferentiated waterfowl]
Action-transitive (verb has a direct object and another verb)
- I caused [her] [to move her body]
Ambiguity is Pervasive
Phonetics!
- I mate or duck
- I’m eight or duck
- Eye maid; her duck
- Aye mate, her duck
- I maid her duck
- I’m aid her duck
- I mate her duck
- I’m ate her duck
- I’m ate or duck
Models and Algorithms
Models: formalisms used to capture the various kinds of linguistic structure.
- State machines (FSAs, transducers, Markov models)
- Formal rule systems (context-free grammars, feature systems)
- Logic (predicate calculus, inference)
- Probabilistic versions of all of these, plus others (Gaussian mixture models, probabilistic relational models, etc.)
Algorithms: used to manipulate representations to create structure.
- Search (A*, dynamic programming)
- Supervised learning, etc.
Language, Thought, Understanding
A Gedanken Experiment: Turing Test
The question “Can a machine think?” is not operational.
Operational version:
- 2 people and a computer
- Interrogator talks to contestant and computer via teletype
- Task of machine is to convince interrogator it is human
- Task of contestant is to convince interrogator that she, and not the machine, is human
History: foundational insights 1940s-1950s
Automaton:
- Turing 1936
- McCulloch-Pitts neuron (1943)
  - http://diwww.epfl.ch/mantra/tutorial/english/mcpits/html/
- Kleene (1951/1956)
- Shannon (1948): link between automata and Markov models
- Chomsky (1956)/Backus (1959)/Naur (1960): CFG
Probabilistic/Information-theoretic models
- Shannon (1948)
- Bell Labs speech recognition (1952)
History: the two camps: 1957-1970
Symbolic
- Zellig Harris 1958 TDAP first parser
  - Cascade of finite-state transducers
- Chomsky
- AI workshop at Dartmouth (McCarthy, Minsky, Shannon, Rochester)
- Newell and Simon: Logic Theorist, General Problem Solver
Statistical
- Bledsoe and Browning (1959): Bayesian OCR
- Mosteller and Wallace (1964): Bayesian authorship attribution
- Denes (1959): ASR combining grammar and acoustic probability
Four paradigms: 1970-1983
Stochastic
- Hidden Markov Model 1972
  - Independent application of Baker (CMU) and Jelinek/Bahl/Mercer lab (IBM) following work of Baum and colleagues at IDA
Logic-based
- Colmerauer (1970, 1975) Q-systems
- Definite Clause Grammars (Pereira and Warren 1980)
- Kay (1979) functional grammar, Bresnan and Kaplan (1982) unification
Natural language understanding
- Winograd (1972) SHRDLU
- Schank and Abelson (1977) scripts, story understanding
- Influence of case-role work of Fillmore (1968) via Simmons (1973), Schank
Discourse Modeling
- Grosz and colleagues: discourse structure and focus
- Perrault and Allen (1980) BDI model
Finite State Approach: 1983-1993
Finite State Models
- Kaplan and Kay (1981): Phonology/Morphology
- Church (1980): Syntax
Return of Probabilistic Models:
- Corpora created for language tasks
- Early statistical versions of NLP applications (parsing, tagging, machine translation)
- Increased focus on methodological rigor:
  - Can’t test your hypothesis on the data you used to build it!
  - Training sets and test sets
The field comes together: 1994-2007
NLP has borrowed statistical modeling from speech recognition; it is now standard:
- ACL conference:
  - 1990: 39 articles, 1 statistical
  - 2003: 62 articles, 48 statistical
- Machine learning techniques key
NLP has borrowed its focus on web and search, and “bag of words” models, from information retrieval
Unified field:
- NLP, MT, ASR, TTS, Dialog, IR
Regular expressions
A formal language for specifying text strings
How can we search for any of these?
- woodchuck
- woodchucks
- Woodchuck
- Woodchucks
Regular Expressions
Basic regular expression patterns
Perl-based syntax (slightly different from other notations for regular expressions)
- Disjunctions /[wW]oodchuck/
- Ranges [A-Z]
- Negations [^Ss]
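The three bracket patterns above can be tried out directly. This is a small sketch using Python's `re` module as a stand-in for the Perl-style notation on the slides (an assumption; the slides do not name a tool), and the two sample strings are made up for the example.

```python
import re

candidates = ["woodchuck", "woodchucks", "Woodchuck", "Woodchucks"]

# Disjunction: [wW] matches either a lowercase or an uppercase "w".
woodchuck = re.compile(r"[wW]oodchuck")
found = [w for w in candidates if woodchuck.search(w)]

# Range: [A-Z] matches any single uppercase ASCII letter.
first_capital = re.search(r"[A-Z]", "we met Dave").group()

# Negation: [^Ss] matches any single character that is neither "S" nor "s".
first_non_s = re.search(r"[^Ss]", "Ssssnake").group()

print(found, first_capital, first_non_s)
```

Because `search` looks for the pattern anywhere in the string, the single pattern /[wW]oodchuck/ finds all four forms, plural or not.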
Regular Expressions
Optional characters: ?, * and +
- ? (0 or 1)
  - /colou?r/ color or colour
- * (0 or more)
  - /oo*h!/ oh! or ooh! or ooooh!
- + (1 or more)
  - /o+h!/ oh! or ooh! or ooooh!
Wild cards: .
- /beg.n/ begin or began or begun
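A quick way to see what each counter allows is to test whole strings against the pattern. The sketch below uses Python's `re.fullmatch` for that; the helper name `matches` and the extra test strings ("colouur", "h!", "begn") are assumptions added for contrast.

```python
import re


def matches(pattern, strings):
    """Return the strings that the pattern matches in their entirety."""
    return [s for s in strings if re.fullmatch(pattern, s)]


sighs = ["h!", "oh!", "ooh!", "ooooh!"]

optional = matches(r"colou?r", ["color", "colour", "colouur"])  # ? = 0 or 1 "u"
star = matches(r"oo*h!", sighs)    # one "o", then 0 or more further "o"s
plus = matches(r"o+h!", sighs)     # 1 or more "o"s: same strings as /oo*h!/
bare_star = matches(r"o*h!", sighs)  # 0 or more "o"s: also accepts "h!"
wildcard = matches(r"beg.n", ["begin", "began", "begun", "begn"])  # . = any one char

print(optional, star, plus, bare_star, wildcard)
```

The comparison between /oo*h!/ and /o*h!/ shows why the slide doubles the "o": a lone star would also accept a string with no "o" at all.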
Regular Expressions
Anchors ^ and $
- /^[A-Z]/ “Ramallah, Palestine”
- /^[^A-Z]/ “¿verdad?” “really?”
- /\.$/ “It is over.”
- /.$/ ?
Boundaries \b and \B
- /\bon\b/ matches “on my way” but not “Monday”
- /\Bon\b/ matches “automaton”
Disjunction |
- /yours|mine/ “it is either yours or mine”
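The anchor, boundary, and disjunction examples above can be checked with a few lines of Python. This is a sketch using the `re` module as an assumed stand-in for the Perl-style notation; the variable names are made up for the example.

```python
import re

# Anchors: ^ ties the match to the start of the string, $ to the end.
starts_with_capital = re.search(r"^[A-Z]", "Ramallah, Palestine") is not None
starts_without_capital = [
    s for s in ["¿verdad?", "really?", "Ramallah"] if re.search(r"^[^A-Z]", s)
]
ends_with_period = re.search(r"\.$", "It is over.") is not None
# Unescaped, "." is a wildcard, so /.$/ matches whatever character comes last.
last_char = re.search(r".$", "really?").group()

# Boundaries: \b is a word boundary, \B is a position that is not one.
on_as_word = [s for s in ["on my way", "Monday"] if re.search(r"\bon\b", s)]
on_as_suffix = [s for s in ["automaton", "on my way"] if re.search(r"\Bon\b", s)]

# Disjunction: either alternative may match.
either = re.findall(r"yours|mine", "it is either yours or mine")

print(starts_without_capital, last_char, on_as_word, on_as_suffix, either)
```

This also answers the open question on the slide: /.$/ matches any single final character, whereas /\.$/ insists on a literal period.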
Disjunction, Grouping, Precedence
Column 1 Column 2 Column 3 …
How do we express this?
- /Column [0-9]+ */
- /(Column [0-9]+ +)*/
Precedence (highest to lowest)
- Parenthesis ()
- Counters * + ? {}
- Sequences and anchors the ^my end$
- Disjunction |
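The effect of grouping and precedence is easiest to see by running the patterns. The sketch below uses Python's `re` module (an assumption, as before). The third pattern, with optional trailing spaces, is a variant added here for comparison and does not appear on the slide.

```python
import re

line = "Column 1 Column 2 Column 3"

# Without parentheses, * applies only to the space before it: one column matches.
one_column = re.match(r"Column [0-9]+ *", line).group()

# With parentheses, * repeats the whole group. This pattern requires at least
# one space after every number, so it stops before the final "Column 3".
needs_trailing_space = re.match(r"(Column [0-9]+ +)*", line).group()

# Variant (not from the slide): optional trailing spaces cover the whole line.
whole_line = re.match(r"(Column [0-9]+ *)*", line).group()

# Counters bind tighter than sequences: /the*/ repeats only the "e".
star_on_e = re.fullmatch(r"the*", "theee") is not None
star_on_word = re.fullmatch(r"the*", "thethe") is not None
star_on_group = re.fullmatch(r"(the)*", "thethe") is not None

# Disjunction binds loosest: /yours|mine/ means "yours" or "mine",
# not "your" followed by "s" or "m" followed by "ine".
accepts_mine = re.fullmatch(r"yours|mine", "mine") is not None
accepts_yourmine = re.fullmatch(r"yours|mine", "yourmine") is not None

print(repr(one_column), repr(needs_trailing_space), repr(whole_line))
```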
Example
Find me all instances of the word “the” in a text.
- /the/
  - Misses capitalized examples
- /[tT]he/
  - Returns other or theology
- /\b[tT]he\b/
- /[^a-zA-Z][tT]he[^a-zA-Z]/
- /(^|[^a-zA-Z])[tT]he[^a-zA-Z]/
Errors
The process we just went through was based on fixing two kinds of errors
Matching strings that we should not have matched (there, then, other)
- False positives
Not matching things that we should have matched (The)
- False negatives
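The two error types can be counted for each successive pattern for “the”. The sketch below is an illustration under stated assumptions: the sample sentence, the tokenization used as a gold standard, and the named group `w` (added so every pattern reports where its "the" starts) are not from the slides.

```python
import re

TEXT = "The cat saw the other dog. Then the theology student left."

# Gold standard: start offsets of the standalone word "the"/"The", found by
# simple tokenization rather than by any of the patterns under test.
GOLD = {m.start() for m in re.finditer(r"[A-Za-z]+", TEXT)
        if m.group().lower() == "the"}

# Each slide pattern, with the "the" part wrapped in a named group "w".
PATTERNS = {
    "the": r"(?P<w>the)",
    "[tT]he": r"(?P<w>[tT]he)",
    r"\b[tT]he\b": r"\b(?P<w>[tT]he)\b",
    "[^a-zA-Z][tT]he[^a-zA-Z]": r"[^a-zA-Z](?P<w>[tT]he)[^a-zA-Z]",
    "(^|[^a-zA-Z])[tT]he[^a-zA-Z]": r"(^|[^a-zA-Z])(?P<w>[tT]he)[^a-zA-Z]",
}


def errors(pattern):
    """Return (false positives, false negatives) for one pattern on TEXT."""
    found = {m.start("w") for m in re.finditer(pattern, TEXT)}
    return len(found - GOLD), len(GOLD - found)


report = {name: errors(pattern) for name, pattern in PATTERNS.items()}
for name, (fp, fn) in report.items():
    print(f"/{name}/  false positives: {fp}  false negatives: {fn}")
```

On this sentence, widening the pattern to /[tT]he/ removes the false negative but adds a false positive ("Then"), which is the trade-off the word-boundary versions then resolve.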
More complex RE example
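Regular expressions for prices
- /\$[0-9]+/
  - Doesn’t deal with fractions of dollars
- /\$[0-9]+\.[0-9][0-9]/
  - Doesn’t allow $199, not word-aligned
- /\$[0-9]+(\.[0-9][0-9])?\b/
  - The dollar sign is escaped because a bare $ is the end-of-line anchor
  - No leading \b: “$” is not a word character, so \b\$ would match only when a letter, digit, or underscore comes right before the dollar sign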
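The price patterns can be compared on a sample sentence. In this sketch (Python's `re` module, with a made-up sentence) the dollar sign is written as `\$`, since an unescaped `$` is the end-of-line anchor, and the optional-cents group is written as a non-capturing `(?:...)` so that `findall` returns whole matches. Both choices are adjustments made for the example.

```python
import re

text = "Tickets cost $5, $199 or $24.99, not 24.99 dollars."

# An unescaped $ is an anchor, so this pattern can never match anything.
unescaped = re.findall(r"$[0-9]+", text)

# Whole dollars only: the cents of $24.99 are cut off.
dollars_only = re.findall(r"\$[0-9]+", text)

# Mandatory cents: $5 and $199 are no longer found.
with_cents = re.findall(r"\$[0-9]+\.[0-9][0-9]", text)

# Optional cents plus a trailing word boundary: all three prices are found.
optional_cents = re.findall(r"\$[0-9]+(?:\.[0-9][0-9])?\b", text)

# A leading \b needs a word character right before "$", so here it finds nothing.
leading_boundary = re.findall(r"\b\$[0-9]+", text)

print(dollars_only, with_cents, optional_cents)
```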
Advanced operators