![Page 1: Crash Course in Natural Language Processing (2016)](https://reader034.vdocuments.site/reader034/viewer/2022042706/587983961a28ab6c358b5f79/html5/thumbnails/1.jpg)
Crash Course in Natural Language Processing
Vsevolod Dyomkin, 04/2016
A Bit about Me
* Lisp programmer
* 5+ years of NLP work at Grammarly
* Occasional lecturer
https://vseloved.github.io
A Bit about Grammarly
The best English language writing app:
* Spellcheck
* Grammar check
* Style improvement
* Synonyms and word choice
* Plagiarism check
Plan
* Overview of NLP
* Where to get data
* Common NLP problems and approaches
* How to develop an NLP system
What Is NLP?
Transforming free-form text into structured data and back.
Intersection of:
* Computational Linguistics
* CompSci & AI
* Stats & Information Theory
Linguistic Basis
* Syntax (form)
* Semantics (meaning)
* Pragmatics (intent/logic)
Natural Language
* ambiguous
* noisy
* evolving
Time flies like an arrow.
Fruit flies like a banana.

I read a story about evolution in ten minutes.
I read a story about evolution in the last million years.
NLP & Data
Types of text data:
* structured
* semi-structured
* unstructured
“Data is ten times more powerful than algorithms.”
-- Peter Norvig, The Unreasonable Effectiveness of Data
http://youtu.be/yvDCzhbjYWs
Kinds of Data
* Dictionaries
* Databases/Ontologies
* Corpora
* User Data
Where to Get Data?
* Linguistic Data Consortium: http://www.ldc.upenn.edu/
* Common Crawl
* Wikimedia
* Wordnet
* APIs: Twitter, Wordnik, ...
* University sites & the academic community: Stanford, Oxford, CMU, ...
Create Your Own!
* Linguists
* Crowdsourcing
* By-product

-- Jonathan Zittrain
http://goo.gl/hs4qB
Classic NLP Problems
* Linguistically motivated: segmentation, tagging, parsing
* Analytical: classification, sentiment analysis
* Transformation: translation, correction, generation
* Conversation: question answering, dialog
Tokenization
Example:
This is a test that isn't so simple: 1.23.
"This" "is" "a" "test" "that" "is" "n't" "so" "simple" ":" "1.23" "."
Issues:
* Finland’s capital - Finland? Finlands? Finland’s?
* what’re, I’m, isn’t - what ’re, I ’m, is n’t
* Hewlett-Packard or Hewlett Packard?
* San Francisco - one token or two?
* m.p.h., PhD.
Regular Expressions
Simplest regex: [^\s]+

More advanced regex:

\w+|[!"#$%&'*+,\./:;<=>?@^`~…\(\){}\[\|\]⟨⟩‒–—―«»“”‘’-]

Even more advanced regex:

[+-]?[0-9](?:[0-9,\.]*[0-9])?|[\w@](?:[\w'’`@-][\w']|[\w'][\w@'’`-])*[\w']?|["#$%&*+,/:;<=>@^`~…\(\){}\[\|\]⟨⟩‒–—―«»“”‘’']|[\.!?]+|-+
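As a rough illustration of regex tokenization, the simplest pattern style above can be driven from Python's `re` module (the `tokenize` helper and test sentence here are ours, not from the slides); the more advanced patterns plug in the same way:

```python
import re

# A minimal sketch: \w+ grabs runs of word characters,
# [^\w\s] emits each punctuation mark as its own token.
# Note it splits "isn't" and "1.23" more crudely than the
# advanced patterns above would.
def tokenize(text):
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("This is a test that isn't so simple: 1.23."))
```

The post-processing step described on the next slide then repairs cases like contractions and decimals that a single pattern handles poorly.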
Post-processing
* concatenate abbreviations and decimals
* split contractions with regexes

2-character: i['‘’`]m|(?:s?he|it)['‘’`]s|(?:i|you|s?he|we|they)['‘’`]d$

3-character: (?:i|you|s?he|we|they)['‘’`](?:ll|[vr]e)|n['‘’`]t$
Rule-based Approach
* easy to understand and reason about
* can be arbitrarily precise
* iterative, can be used to gather more data

Limitations:
* recall problems
* poor adaptability
Rule-based NLP tools
* SpamAssassin
* LanguageTool
* ELIZA
* GATE
Statistical Approach
“Probability theory is nothing but common sense reduced to calculation.”
-- Pierre-Simon Laplace
Language Models
Question: what is the probability of a sequence of words/sentence?
Answer: Apply the chain rule
P(S) = P(w0) * P(w1|w0) * P(w2|w0 w1) * P(w3|w0 w1 w2) * …
where S = w0 w1 w2 …
Ngrams
Apply the Markov assumption: each word depends only on the N previous words (in practice N=1..4, which gives bigram to fivegram models, since the current word is also included).
If n=2: P(S) = P(w0) * P(w1|w0) * P(w2|w0 w1) * P(w3|w1 w2) * …
By the definition of conditional probability:
P(w2|w0 w1) = P(w0 w1 w2) / P(w0 w1)
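A bigram model (each word conditioned on one previous word) can be estimated directly from such counts. A minimal sketch, with maximum-likelihood estimates, a sentence-start marker, and no smoothing (all names here are ours):

```python
from collections import Counter

# MLE bigram model: P(w|prev) = count(prev w) / count(prev).
# No smoothing, so unseen bigrams get probability 0 and unseen
# context words would divide by zero; real systems smooth the counts.
def bigram_model(sentences):
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + words
        unigrams.update(padded[:-1])  # count context positions only
        bigrams.update(zip(padded, padded[1:]))

    def prob(sentence):
        p = 1.0
        padded = ["<s>"] + sentence
        for prev, cur in zip(padded, padded[1:]):
            p *= bigrams[(prev, cur)] / unigrams[prev]
        return p

    return prob
```

For example, training on two sentences, `bigram_model([["time", "flies"], ["fruit", "flies"]])` gives each of them probability 0.5: the only uncertainty is the choice of the first word.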
Spelling Correction
Problem: given an out-of-dictionary word, return a list of the most probable in-dictionary corrections.
http://norvig.com/spell-correct.html
Edit Distance
Minimum edit (Levenshtein) distance - the minimum number of insertions/deletions/substitutions needed to transform string A into B.
Other distance metrics:
* the Damerau-Levenshtein distance adds another operation: transposition
* the longest common subsequence (LCS) metric allows only insertion and deletion, not substitution
* the Hamming distance allows only substitution, hence it only applies to strings of the same length
Dynamic Programming
Initialization:
D(i,0) = i
D(0,j) = j

Recurrence relation:
for each i = 1..M
  for each j = 1..N
    D(i,j) = D(i-1,j-1), if X(i) = Y(j)
    otherwise D(i,j) = min of:
      D(i-1,j) + w_del(X(i))
      D(i,j-1) + w_ins(Y(j))
      D(i-1,j-1) + w_subst(X(i),Y(j))
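The recurrence translates almost line for line into code; this sketch fixes all edit weights at 1:

```python
# Unit-weight edit distance, computed exactly as the recurrence above.
def levenshtein(x, y):
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # D(i,0) = i: delete all of x
    for j in range(n + 1):
        d[0][j] = j  # D(0,j) = j: insert all of y
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                d[i][j] = min(d[i - 1][j] + 1,      # deletion
                              d[i][j - 1] + 1,      # insertion
                              d[i - 1][j - 1] + 1)  # substitution
    return d[m][n]
```

For instance, `levenshtein("kitten", "sitting")` is 3 (two substitutions and one insertion).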
Noisy Channel Model
Given an alphabet A, let A* be the set of all finite strings over A. Let the dictionary D of valid words be some subset of A*.

The noisy channel is the matrix G = P(s|w), where w in D is the intended word and s in A* is the scrambled word that was actually received.
P(s|w) = prod(P(x(i)|y(i))) for x(i) in s* (s aligned with w), y(i) in w* (w aligned with s)
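A full channel matrix needs aligned error data. As a rough stand-in, the well-known corrector at norvig.com/spell-correct.html (linked on the previous slide) approximates the channel by preferring candidates one edit away and ranks them by the language-model prior P(w). A sketch in that spirit; the tiny `WORDS` dictionary below is a placeholder we made up:

```python
from collections import Counter

# Stand-in unigram model; a real corrector counts words in a large
# corpus so that P(w) is proportional to WORDS[w].
WORDS = Counter({"the": 100, "they": 40, "then": 30, "than": 20,
                 "quick": 10, "fox": 5})
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    # all strings one deletion/transposition/substitution/insertion away
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    # prefer the word itself, then known words one edit away,
    # ranked by the unigram prior; fall back to the input unchanged
    candidates = ({word} if word in WORDS else
                  {w for w in edits1(word) if w in WORDS} or {word})
    return max(candidates, key=lambda w: WORDS[w])
```

Here `correct("thw")` returns "the": the only in-dictionary candidate one edit away.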
Machine Learning Approach
Spam Filtering
A 2-class classification problem with a bias towards minimizing false positives.

Default approach: rule-based (SpamAssassin)

Problems:
* scales poorly
* hard to reach arbitrary precision
* hard to rank the importance of complex features
Bag-of-words Models
* each word is a feature
* each word is independent of others
* position of the word in a sentence is irrelevant

Pros:
* simple
* fast
* scalable

Limitations:
* independence assumption doesn't hold

Initial results: recall 92%, precision 98.84%
Improved results: recall 99.5%, precision 99.97%
http://www.paulgraham.com/spam.html
Naive Bayes Classifier
Bayes' rule: P(Y|X) = P(Y) * P(X|Y) / P(X)
Select Y = argmax P(Y|X)

Naive step - assume the features are conditionally independent:
P(Y|X) ∝ P(Y) * prod(P(x|Y)) for all x in X
(P(X) is dropped because it's the same for all Y)
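A minimal multinomial Naive Bayes over bag-of-words features might look like this (class and method names are ours; add-one smoothing and log-probabilities keep the products numerically stable):

```python
import math
from collections import Counter

# Multinomial Naive Bayes sketch for bag-of-words classification.
# Priors stay as raw document counts: the shared normalizers
# (P(X) and the total document count) cancel in the argmax.
class NaiveBayes:
    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.priors = Counter(labels)
        self.counts = {c: Counter() for c in self.classes}
        for words, label in zip(docs, labels):
            self.counts[label].update(words)
        self.vocab = {w for c in self.classes for w in self.counts[c]}
        return self

    def predict(self, words):
        def log_posterior(c):
            # add-one smoothing over the shared vocabulary
            total = sum(self.counts[c].values()) + len(self.vocab)
            return (math.log(self.priors[c]) +
                    sum(math.log((self.counts[c][w] + 1) / total)
                        for w in words if w in self.vocab))
        return max(self.classes, key=log_posterior)
```

Trained on even a couple of toy documents, it routes unseen word mixes to the class whose word distribution fits best.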
Dependency Parsing
nsubj(ate-2, They-1)
root(ROOT-0, ate-2)
det(pizza-4, the-3)
dobj(ate-2, pizza-4)
prep(ate-2, with-5)
pobj(with-5, anchovies-6)
https://honnibal.wordpress.com/2013/12/18/a-simple-fast-algorithm-for-natural-language-dependency-parsing/
Shift-reduce Parsing
ML-based Parsing
The parser starts with an empty stack and a buffer index at 0, with no dependencies recorded. It chooses one of the valid actions and applies it to the state, and it keeps choosing and applying actions until the stack is empty and the buffer index is at the end of the input.
    SHIFT = 0; RIGHT = 1; LEFT = 2
    MOVES = [SHIFT, RIGHT, LEFT]

    def parse(words, tags):
        n = len(words)
        deps = init_deps(n)
        idx = 1
        stack = [0]
        while stack or idx < n:
            features = extract_features(words, tags, idx, n, stack, deps)
            scores = score(features)
            valid_moves = get_valid_moves(idx, n, len(stack))
            next_move = max(valid_moves, key=lambda move: scores[move])
            idx = transition(next_move, idx, stack, deps)
        return tags, deps
Averaged Perceptron
    def train(model, number_iter, examples):
        for i in range(number_iter):
            for features, true_tag in examples:
                guess = model.predict(features)
                if guess != true_tag:
                    for f in features:
                        model.weights[f][true_tag] += 1
                        model.weights[f][guess] -= 1
            random.shuffle(examples)
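The loop above is a plain perceptron; the "averaged" variant returns, for each feature, the mean weight over all training steps, which damps the noise from the last few updates. One common lazy-averaging sketch (all names here are ours, not from the talk):

```python
from collections import defaultdict

# Lazy averaging: alongside the live weights, keep the running sum of
# each weight over all update steps ("totals") and the step at which it
# last changed ("stamps"); average() divides the totals by the step count.
class AveragedPerceptron:
    def __init__(self, classes):
        self.classes = classes
        self.weights = defaultdict(lambda: defaultdict(float))
        self._totals = defaultdict(lambda: defaultdict(float))
        self._stamps = defaultdict(lambda: defaultdict(int))
        self.i = 0  # update steps seen so far

    def predict(self, features):
        scores = {c: sum(self.weights[f][c] for f in features)
                  for c in self.classes}
        return max(self.classes, key=lambda c: scores[c])

    def update(self, truth, guess, features):
        self.i += 1
        if truth == guess:
            return
        for f in features:
            for c, delta in ((truth, 1.0), (guess, -1.0)):
                # bank the old weight for the steps it stayed unchanged
                self._totals[f][c] += (self.i - self._stamps[f][c]) * self.weights[f][c]
                self._stamps[f][c] = self.i
                self.weights[f][c] += delta

    def average(self):
        for f, class_weights in self.weights.items():
            for c in class_weights:
                self._totals[f][c] += (self.i - self._stamps[f][c]) * self.weights[f][c]
                class_weights[c] = self._totals[f][c] / max(self.i, 1)
```

Calling `average()` once, after the final iteration, replaces the live weights with their averaged values before the model is used for prediction.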
Features
* Word and tag unigrams, bigrams, trigrams
* The first three words of the buffer
* The top three words of the stack
* The two leftmost children of the top of the stack
* The two rightmost children of the top of the stack
* The two leftmost children of the first word in the buffer
* Distance between the top of the buffer and the stack
Discriminative ML Models
Linear:
* (Averaged) Perceptron
* Maximum Entropy / Log-linear / Logistic Regression; Conditional Random Field
* SVM

Non-linear:
* Decision Trees, Random Forests
* Other ensemble classifiers
* Neural networks
Semantics
Question: how to model relationships between words?
Answer: build a graph

Examples: Wordnet, Freebase, DBPedia
Word Similarity

Next question: how do we measure those relations?
* different Wordnet similarity measures
* PMI(x,y) = log(p(x,y) / (p(x) * p(y)))
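On a toy scale, PMI can be estimated from windowed co-occurrence counts; everything below (the corpus, the window size, the helper name) is illustrative, not from the slides:

```python
import math
from collections import Counter

# Toy PMI from windowed co-occurrence counts. It is undefined (log of
# zero) for pairs that never co-occur; practical systems often clamp
# negatives and zeros to 0 ("positive PMI").
def pmi(corpus, x, y, window=2):
    word_counts, pair_counts = Counter(), Counter()
    total_words = 0
    for words in corpus:
        word_counts.update(words)
        total_words += len(words)
        for i, w in enumerate(words):
            for v in words[i + 1:i + 1 + window]:
                pair_counts[tuple(sorted((w, v)))] += 1
    p_xy = pair_counts[tuple(sorted((x, y)))] / sum(pair_counts.values())
    p_x = word_counts[x] / total_words
    p_y = word_counts[y] / total_words
    return math.log(p_xy / (p_x * p_y))
```

Words that co-occur more often than their individual frequencies predict ("new" and "york", say) get positive PMI; independent words hover near zero.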
Distributional Semantics
Distributional hypothesis: “You shall know a word by the company it keeps.”
-- John Rupert Firth

Word representations:
* Explicit representation (number of nonzero dimensions: max 474234, min 3, mean 1595, median 415)
* Dense representation (word2vec, GloVe)
* Hierarchical representation (Brown clustering)
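With dense representations, word similarity is typically the cosine of the angle between the two vectors. A sketch; the tiny 3-d vectors here are invented for illustration (real word2vec/GloVe vectors have 100-300 dimensions):

```python
import math

# Cosine similarity between two dense vectors; 1.0 means same direction.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors: "king" and "queen" point roughly the same way.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.85, 0.75, 0.2],
    "apple": [0.1, 0.2, 0.9],
}
```

On these toy vectors, `cosine(vectors["king"], vectors["queen"])` is close to 1, while `cosine(vectors["king"], vectors["apple"])` is much lower.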
Steps to Develop an NLP System
* Translate real-world requirements into a measurable goal
* Find a suitable level and representation
* Find initial data for experiments
* Find and utilize existing tools and frameworks where possible
* Don't trust research results
* Set up and perform a proper experiment (or a series of experiments)
Going into Prod
* NLP tasks are usually CPU-intensive but stateless
* General-purpose NLP frameworks are (mostly) not production-ready
* Value pre- and post-processing
* Gather user feedback
Final Words
We have discussed:
* the linguistic basis of NLP - although some people manage to do NLP without it: http://arxiv.org/pdf/1103.0398.pdf
* rule-based & statistical/ML approaches
* different concrete tasks

We haven't covered:
* all the different tasks, such as MT, question answering, etc. (but they use the same techniques)
* deep learning for NLP
* natural language understanding (which remains an unsolved problem)