IAAA / PSTALN
Introduction to natural language processing

Benoit Favre <[email protected]>

Aix-Marseille Université, LIF/CNRS

last generated on December 1, 2019

Benoit Favre (AMU) PSTALN: NLP intro December 1, 2019 1 / 29


PSTALN: class outline

2019/12/02
▶ Class: intro on tasks and problems
▶ Practical: intro to pytorch

2019/12/09
▶ Word representations
▶ Word embedding evaluation

2019/12/16
▶ Named entities and other tagging tasks
▶ Noun phrase tagging

2020/01/06
▶ Language modeling
▶ Generation

2020/01/13
▶ Machine translation
▶ Grapheme to phoneme conversion

2020/01/20
▶ Dialog systems
▶ Document classification

2020/01/27
▶ Adaptation
▶ Cross-lingual classification

2020/01/31
▶ Exam


What is Natural Language Processing (NLP)?
▶ Allow computers to communicate with humans using everyday language
▶ Teach computers to reproduce human behavior regarding language manipulation
▶ Linked to the study of human language through computers (Computational Linguistics)

Why is it difficult?
▶ People do not follow rules strictly when they talk or write: “r u ready?”
▶ Language is ambiguous: “time flies like an arrow”
▶ Input can be noisy: speech recognition in the subway


Example: generate annotations

My kids have lost a card, the TV doesn't work

If I understood, you have lost your decoder card

Yes, error code S03

We will send you another one by mail


Example: generate annotations

Customer: My kids have lost a card, the TV doesn't work
Agent: If I understood, you have lost your decoder card
Customer: Yes, error code S03
Agent: I will send you another one by mail

[Figure: the dialog above annotated with semantic frames and discourse labels]
Frames and roles shown in the figure:
▶ Losing: Owner (kids), Possession (card)
▶ Entity (TV), Prop(Function), Negation
▶ Awareness: Cognizer (Agent), Content (Losing)
▶ Losing: Owner (Customer), Possession (decoder card)
▶ Means (error code S03)
▶ Sending: Sender (Agent), Recipient (Customer), Theme (decoder card)
Discourse labels shown in the figure: Problem statement, Restatement, Resolution, Acknowledge, Agree, Complete, Result


NLP is everywhere

▶ Spell checker / grammar correction (Word)
▶ Information retrieval / search (Google)
▶ Machine translation (Google)
▶ Information extraction (Ask.com)
▶ Question answering (Jeopardy)
▶ Automatic summarization (Google news)
▶ Call routing (Telcos)
▶ Sentiment analysis (Amazon)
▶ Spam filtering (Email)
▶ Writing recognition (Cheque processing)
▶ Voice dictation (Dragon, Nuance)
▶ Speech synthesis (In-car GPS)
▶ Dialog systems (Siri/OK Google/Alexa...)


Domains related to NLP

▶ Artificial intelligence
▶ Formal language theory
▶ Machine learning
▶ Linguistics
▶ Psycholinguistics
▶ Cognitive sciences
▶ Philosophy of language


Communication channel

From the point of view of the source (the speaker)
1 Intent: the message we want to communicate
2 Generation: the message in linguistic form
3 Production: the muscular action which leads to sound production

From the point of view of the receiver (the listener)
1 Perception: how the sound is transmitted to neurons
2 Analysis: interpretation of the linguistic message (syntactic, semantic...)
3 Integration: believe or not the information, reply...


Processing levels

“John loves Mary”
1 Lexical: segment the character stream into words, identify linguistic units
▶ John/firstname-male loves/verb-love Mary/firstname-female
2 Syntax: identify grammatical structures
▶ (S (NP (NNP John)) (VP (VBZ loves) (NP (NNP Mary))) (. .))
3 Semantic: represent meaning
▶ love(person(John), person(Mary))
4 Pragmatic: what is the function of that sentence in context?
▶ Is it reciprocal?
▶ Since when?
▶ What does it entail? know(John, Mary)
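To make the syntactic level concrete, the bracketed parse of “John loves Mary” can be read into a nested structure. The sketch below is a minimal reader written for this one notation; the names parse_brackets and leaves are hypothetical helpers, not part of any library.

```python
def parse_brackets(text):
    """Parse a bracketed tree such as "(S (NP (NNP John)))" into nested lists."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "(":
            node = []
            while tokens[pos] != ")":
                node.append(read())
            pos += 1  # skip the closing ")"
            return node
        return token

    return read()


def leaves(tree):
    """Return the words of the tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1:]:  # tree[0] is the label (S, NP, VBZ...)
        words.extend(leaves(child))
    return words


tree = parse_brackets("(S (NP (NNP John)) (VP (VBZ loves) (NP (NNP Mary))) (. .))")
words = leaves(tree)
print(tree[0], words)
```

Each node is a list whose first element is the category label, so going from the syntactic level back to the lexical level is just a traversal of the leaves.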


Modular approach

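[Figure: modular pipeline from question to answer]
Modules: Automatic transcription → Syntactic analysis → Semantic analysis → Dialog manager → Syntactic generation → Lexical generation → Speech synthesis
Intermediate representations labelled in the figure: words; syntactic tree; concepts, relations; logical representation; primitive syntax; words, prosody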


Language ambiguity

Phonetic
▶ I don’t know! – I don’t - no!

Graphical

Phonetic and graphical
▶ I live by the bank (river bank or financial institution)

Etymology
▶ I met an Indian (from India or native American)
▶ I love American wine (from USA or from the Americas)

Syntactic
▶ He looks at the man with a telescope
▶ He gave her cat food

Referential
▶ She is gone. Who?

Notational conventions
▶ Birth date: 08/01/05

(wikipedia)


Basic NLP tasks

Meta
▶ Author/speaker traits recognition
▶ Genre classification

Syntax
▶ Word / sentence segmentation
▶ Morphological analysis
▶ Part-of-speech tagging
▶ Syntactic chunking
▶ Syntactic parsing

Semantic
▶ Word sense disambiguation
▶ Semantic role labeling
▶ Logical form creation

Pragmatic
▶ Coreference resolution
▶ Discourse parsing


Syntax: Word segmentation

Character sequence → word sequence (tokenization)
Split according to delimiters [ :,.!?’]
What about compounds? Multiword expressions?
URLs (http://www.google.com), variable names (theMaximumInTheTable)
In Chinese, no spaces between words:
▶ 男孩喜歡冰淇淋。→ 男孩 (the boy) 喜歡 (likes) 冰淇淋 (ice cream) 。
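Delimiter-based splitting can be sketched in a few lines. The example below uses Python's re.split with the delimiter set listed above; the function name tokenize and the sample sentences are assumptions made for illustration. It also shows why URLs are a problem for this naive strategy.

```python
import re

# Delimiter set from the slide: space, colon, comma, period, !, ? and apostrophe.
DELIMITERS = r"[ :,.!?’]+"


def tokenize(text):
    """Split text on delimiters and drop the empty strings left at the edges."""
    return [token for token in re.split(DELIMITERS, text) if token]


simple_tokens = tokenize("John loves Mary.")
url_tokens = tokenize("See http://www.google.com now!")
print(simple_tokens)  # ['John', 'loves', 'Mary']
print(url_tokens)     # ['See', 'http', '//www', 'google', 'com', 'now']
```

The URL is broken into four meaningless pieces, which is why practical tokenizers add special cases (URLs, numbers, abbreviations) before falling back to delimiters.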


Syntax: Morphological analysis

Split words into relevant factors

Gender and number
▶ flower, flower+s, floppy, flopp+ies

Verb tense
▶ parse, pars+ing, pars+ed

Prefixes, roots and suffixes
▶ geo+caching
▶ re+do, un+do, over+do
▶ pre+fix, suf+fix
▶ geo+local+ization

Agglutinative languages
▶ pronouns are glued to the verb (Arabic, Spanish...)

Rich morphology
▶ Turkish, Finnish

→ Lemmatization task: find canonical word form
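A toy version of suffix splitting gives a feel for the task. The sketch below is a deliberately naive rule-based splitter, not a real morphological analyzer: the suffix list, the min_stem guard and the name split_suffix are all assumptions chosen to reproduce the examples above.

```python
# Hypothetical suffix list, longest suffixes first so "ies" wins over "s".
SUFFIXES = ("ing", "ies", "ed", "s")


def split_suffix(word, min_stem=3):
    """Return (stem, suffix); suffix is "" when no rule applies."""
    for suffix in SUFFIXES:
        stem = word[: len(word) - len(suffix)]
        # The minimum stem length avoids over-splitting short words like "red".
        if word.endswith(suffix) and len(stem) >= min_stem:
            return stem, suffix
    return word, ""


for word in ["flowers", "floppies", "parsing", "parsed", "parse", "red"]:
    stem, suffix = split_suffix(word)
    print(word, "->", stem + ("+" + suffix if suffix else ""))
```

Note that the stem "pars" is not the canonical form "parse": going from a segmentation to a lemma needs extra knowledge, which is what makes lemmatization a task of its own.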


Syntax: Part-of-speech tagging

Syntactic categories

Noun          Adverb        Discourse marker
Proper name   Determiner    Foreign words
Verb          Preposition   Punctuation
Adjective     Conjunctions  Pronouns

Each word can have multiple categories
Example: time flies like an arrow
▶ flies: verb or noun?
▶ like: preposition or verb?
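The effect of lexical ambiguity on tagging can be made visible by enumerating every tag sequence a lexicon allows. In the sketch below, only the ambiguity of "flies" and "like" comes from the example above; the single tags given to the other words, the LEXICON table and the name tag_sequences are simplifying assumptions.

```python
from itertools import product

# Possible categories per word (toy lexicon, see the assumptions above).
LEXICON = {
    "time": ["noun"],
    "flies": ["verb", "noun"],
    "like": ["preposition", "verb"],
    "an": ["determiner"],
    "arrow": ["noun"],
}


def tag_sequences(sentence):
    """Enumerate every tag sequence the lexicon allows for a sentence."""
    words = sentence.lower().split()
    options = [LEXICON[word] for word in words]
    return [list(zip(words, tags)) for tags in product(*options)]


sequences = tag_sequences("time flies like an arrow")
for sequence in sequences:
    print(" ".join(f"{word}/{tag}" for word, tag in sequence))
```

Two ambiguous words already give 2 × 2 = 4 candidate sequences; the number grows multiplicatively with sentence length, so a tagger has to use context to pick one.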


Syntax: Syntactic analysis

Constituency parsing

Dependency parsing


Semantics: Word sense disambiguation (WSD)

What is the sense of each word in its context?
▶ red: color? wine? communist?
▶ fly: what birds do? insect?
▶ bank: river? financial institution?
▶ book: made of paper? make a reservation?

Word meaning highly depends on domain
▶ apple: fruit? company?
▶ to pitch: a ball? a product? a note?


Semantics: Semantic parsing

Syntax is ambiguous
▶ The man opens the door
▶ The door opens
▶ The key opens the door

Semantic roles
▶ Who performed the action? the agent
▶ Who receives the action? the patient
▶ Who helps perform the action? the instrument
▶ When, where, why?

John    sold       his car     to his brother  this morning
agent   predicate  instrument  patient         time


Pragmatics: Reference resolution

Link all references to the same entity
▶ “Alexander Graham Bell (March 3, 1847 – August 2, 1922) was a Scottish-born scientist, inventor, engineer, and innovator who is credited with patenting the first practical telephone.” (Wikipedia)

Ambiguity
▶ Pronouns (it, she, he, we, you, who, whose, both...)
▶ Noun phrases (the young man, the former president, the company...)
▶ Proper names (“Victoria”: South-African city, Canadian region, Queen, model...)


Pragmatics: Discourse analysis

Relationship between sentences of a text, argument structure.

“Fully Automated Generation of Question-Answer Pairs for Scripted Virtual Instruction”, Kuyten et al., 2012

Relation type (Rhetorical Structure Theory)
▶ Background
▶ Elaboration
▶ Preparation
▶ Contrast
▶ Objective
▶ Cause
▶ Circumstances
▶ Interpretation
▶ Justification
▶ Reformulation


Pragmatics: Create a logical form

Predicate representation
▶ Can be used to infer new

John loves Mary but it is not reciprocal.

∃x, y: name(x, “John”) ∧ name(y, “Mary”) ∧ loves(x, y) ∧ not(loves(y, x))

John sold his car this morning to his brother.

∃x, y, z: name(x, “John”) ∧ brother(x, y) ∧ car(z) ∧ owns(x, z) ∧ sell(x, y, z) ∧ time(“morning”)
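One way to see what such a logical form buys us is to check it against a small set of facts. The sketch below evaluates the first formula (John loves Mary, not reciprocal) by brute force over a tiny hand-made world; the entity identifiers "e1" and "e2", the fact tables and the function name are hypothetical, chosen only for illustration.

```python
from itertools import product


def loves_not_reciprocal(entities, names, loves):
    """Exists x, y: name(x, John) and name(y, Mary) and loves(x, y) and not loves(y, x)."""
    return any(
        names.get(x) == "John"
        and names.get(y) == "Mary"
        and (x, y) in loves
        and (y, x) not in loves
        for x, y in product(entities, repeat=2)
    )


entities = ["e1", "e2"]
names = {"e1": "John", "e2": "Mary"}

one_way = {("e1", "e2")}                   # John loves Mary only
mutual = {("e1", "e2"), ("e2", "e1")}      # they love each other

holds_one_way = loves_not_reciprocal(entities, names, one_way)
holds_mutual = loves_not_reciprocal(entities, names, mutual)
print(holds_one_way, holds_mutual)  # True False
```

The existential quantifier becomes a search over all pairs of entities, which is fine for a toy world but is exactly what theorem provers and logic languages are designed to do efficiently.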


History of natural language processing

1950: Theory (Turing test, Chomsky grammars)
▶ Automatic translation during the cold war

1960: Toy systems
▶ SHRDLU “place the red box next to the blue circle”, ELIZA “the therapist”

1970
▶ Prolog (logic-based language for NLP), Dictionaries of semantic frames

1980: Dictation, Development of grammars

1990
▶ Transition “introspection” → “corpus”
▶ Evaluation campaigns
▶ Neural networks are “forgotten”

2000
▶ Machine learning
▶ Applications: speech recognition, machine translation

2010...
▶ Deep learning


Notion of corpus

Language in the wild
▶ Email
▶ Forums
▶ Chats
▶ Speech recordings
▶ Video

Manual annotation of all elements we want to predict
▶ Text → topic
▶ Sentence → parse tree
▶ Review → sentiment


General approach

How to approach natural language processing?
▶ Introspective approach
  ⋆ Study what you know about language
  ⋆ Originated from linguistics
  ⋆ Formal models, grammar engineering
▶ Corpus-based approach
  ⋆ Study how people produce language in the wild
  ⋆ The de facto modern approach

An empirical approach
1 Task
2 Corpus
3 Guide
4 Annotation
5 System
6 Evaluation


NLP Systems

Input
▶ Raw text, audio...
▶ Sentences, contextualized words
▶ Output of another system

Output
▶ n classes (e.g. topics)
▶ Structure (e.g. syntactic parse)
▶ Novel text (e.g. translation, summary)
▶ Commands for a system (e.g. chatbots)

Process
▶ output = f(input)
▶ Deterministic vs random (evaluations need to be repeatable)
▶ Parametrisable: output = f(input, parameters)
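The view of a system as output = f(input, parameters) can be sketched with a tiny keyword-based classifier. Everything here is an assumption made for illustration: the function name classify, the two-word parameter table and the positive/negative output classes.

```python
def classify(text, parameters):
    """output = f(input, parameters): a deterministic keyword-based classifier."""
    # The score is the sum of the weights of known words; unknown words count as 0.
    score = sum(parameters.get(word, 0.0) for word in text.lower().split())
    return "positive" if score > 0 else "negative"


# Hand-set parameters; a learning algorithm would estimate them from a corpus.
parameters = {"great": 1.0, "terrible": -1.0}

first = classify("A great movie", parameters)
second = classify("A great movie", parameters)
print(first, first == second)
```

Because f involves no randomness, the same input and parameters always give the same output, which is what makes an evaluation repeatable. Changing the behavior of the system only requires changing the parameters, not the code.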


NLP as machine learning

How to design a system from examples of annotated data?
▶ For some problems, it’s difficult to engineer an algorithm
▶ Rely on statistical regularities for performing the task

Machine learning (ML)
▶ find parameters of a function to minimize errors on a training corpus
▶ loss/cost function minimization

NLP as ML tasks
▶ Labeling (e.g. word senses)
▶ Segmentation (e.g. words in Chinese)
▶ Regression (e.g. sentiment intensity)
▶ Linking (e.g. dependency parsing)
▶ Generation (e.g. machine translation)
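Finding parameters that minimize errors on a training corpus can be illustrated with the perceptron update rule, one simple mistake-driven learner among many. The four-example corpus, the labels +1/-1 and the function names below are hypothetical; this is a sketch of the idea, not the method used later in the class.

```python
from collections import defaultdict

# Tiny hypothetical training corpus: (text, label) with label +1 or -1.
TRAIN = [
    ("great movie", 1),
    ("terrible movie", -1),
    ("great acting", 1),
    ("terrible plot", -1),
]


def predict(text, weights):
    """Sum the word weights and threshold at zero."""
    score = sum(weights[word] for word in text.split())
    return 1 if score > 0 else -1


def count_errors(corpus, weights):
    """Number of training examples the current parameters get wrong."""
    return sum(predict(text, weights) != label for text, label in corpus)


def train_perceptron(corpus, epochs=10):
    """On each mistake, move the weights of the words towards the correct label."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for text, label in corpus:
            if predict(text, weights) != label:
                for word in text.split():
                    weights[word] += label
    return weights


weights = train_perceptron(TRAIN)
print(count_errors(TRAIN, weights), dict(weights))
```

After training, "great" ends up with a positive weight and "terrible" with a negative one, while "movie", which appears with both labels, carries no information and stays at zero.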


Deep learning

Non-linear compositions of differentiable functions
▶ Also called neural networks for historical reasons

Minimize loss by gradient descent
▶ Chain rule: derivative of composition as product of local derivatives
▶ Iteratively improve parameters by taking small steps in the direction opposite to the gradient

[Figure: computation graph of a function f(x1, x2) with inputs x1, x2 and output y, annotated with fx1(y) and fx2(y)]

→ Build complex functions like lego
▶ Much more expressive than before
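A worked example of the chain rule and gradient descent, in plain Python so that every step is visible. The model h = w * x, the squared loss, the single training pair (x = 2, y = 6) and the learning rate 0.1 are all assumptions chosen to keep the arithmetic simple.

```python
def loss(w, x, y):
    """Composition of two functions: h = w * x, then (h - y) ** 2."""
    h = w * x
    return (h - y) ** 2


def gradient(w, x, y):
    """Chain rule: d loss / d w = (d loss / d h) * (d h / d w)."""
    h = w * x
    dloss_dh = 2 * (h - y)  # local derivative of the squared error
    dh_dw = x               # local derivative of the product w * x
    return dloss_dh * dh_dw


x, y = 2.0, 6.0      # one training example; the loss is zero when w = 3
w = 0.0              # initial parameter
learning_rate = 0.1

initial_loss = loss(w, x, y)
for step in range(100):
    # Small step in the direction opposite to the gradient.
    w -= learning_rate * gradient(w, x, y)
final_loss = loss(w, x, y)

print(w, initial_loss, final_loss)
```

With these values each update is w ← 0.2 w + 2.4, so the distance to the solution w = 3 shrinks by a factor of five at every step. Frameworks such as PyTorch compute the local derivatives automatically, but the update loop has the same shape.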


New playground for NLP

Fundamental paradigms emerging with deep learning
▶ Pre-training on very large corpora (word, sentence, text embeddings)
▶ End-to-end learning (propagate supervision to all stages)
▶ Sequence-to-sequence learning (complex structures as input and output)

Advanced approaches interesting in NLP
▶ Multitask learning (reuse models for multiple tasks)
▶ Zero-shot learning (process never-seen conditions)
▶ Adversarial learning (a game between a learner and an anti-learner)
▶ Reinforcement learning (supervision in non-differentiable settings)


ML taking over NLP

[Figure: percent of articles in ACL Anthology with titles matching ML words, by publication year (1965-2009, then yearly from 2010 to mid-2019); vertical axis from 0 to 40 percent]
