Lecture 6: Logistic Regression for Text Classification
TRANSCRIPT
[Slide 1]
Lecture 6: Logistic Regression for Text Classification
CS 490A, Fall 2021
https://people.cs.umass.edu/~brenocon/cs490a_f21/
Laure Thompson and Brendan O'Connor
College of Information and Computer Sciences
University of Massachusetts Amherst
[Many slides from Ari Kobren]
[Slide 2]
• HW1!
• Zoom OH demand?
• Accept exercises late? Yes, for check-minus: see the "Grading and Policies" page. Please let us know if you can't upload to Gradescope.
[Slide 3]
BOW linear model for text classification
• Problem: classify doc d into one of k = 1..K classes
• Parameters: for each class k and word type w, there is a word weight β_{k,w}
• Representation: bag-of-words vector of doc d's word counts
• Prediction rule: choose the class y with the highest score (see the sketch below)
Have we seen a text classification model that can be seen as an instance of this?
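As a rough sketch of that prediction rule (my own illustration, not from the slides; the per-class weight dictionaries are placeholders):

```python
from collections import Counter

def predict_class(doc_tokens, weights):
    """weights: dict mapping class k -> {word: beta_kw}. Returns the highest-scoring class."""
    counts = Counter(doc_tokens)  # bag-of-words representation of doc d
    scores = {
        k: sum(beta_k.get(w, 0.0) * n for w, n in counts.items())  # score_k = sum_w beta_kw * count_w
        for k, beta_k in weights.items()
    }
    return max(scores, key=scores.get)  # class y with the highest score
```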
[Slide 4]
Linear classification models
• The foundational model for machine learning-based NLP!
• Examples
• The humble "keyword count" classifier (no ML)
• Naive Bayes ("generative" ML)
• Today: Logistic Regression
  • "discriminative" ML
  • allows for features
  • used within more complex models (neural networks)
[Slide 5]
• Input document d (a string...)
• Engineer a feature function, f(d), to generate feature vector x
[Diagram: d → f(d) → x, labeled "Features!"]
• Not just word counts. Anything that might be useful!
• Feature engineering: when you spend a lot of time trying and testing new features. Very important!! This is a place to put linguistics in, or just common sense about your data.
f(d) = (
  count of "happy",
  (count of "happy") / (length of doc),
  log(1 + count of "happy"),
  count of "not happy",
  count of words in my pre-specified word list ("positive words according to my favorite psychological theory"),
  count of several different "happy" emoticons,
  count of "of the",
  length of document,
  ... )
Typically these use feature templates: generate many features at once.
For each word w:
• ${w}_count
• ${w}_is_count_nonzero
• NOT_${w}_bigram_count
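To make the idea concrete, here is a rough sketch of such a feature function (my own illustration; the word list and feature names are made up):

```python
import math
import re

POSITIVE_WORDS = {"happy", "great", "love"}  # hypothetical pre-specified word list

def f(d):
    """Map a raw document string d to a dict of feature name -> value (the vector x)."""
    tokens = re.findall(r"\w+", d.lower())
    x = {}
    x["count_happy"] = tokens.count("happy")
    x["rate_happy"] = x["count_happy"] / max(len(tokens), 1)
    x["log_count_happy"] = math.log(1 + x["count_happy"])
    x["count_positive_list"] = sum(1 for t in tokens if t in POSITIVE_WORDS)
    x["doc_length"] = len(tokens)
    # feature template: generate one group of features per word type
    for w in set(tokens):
        x[w + "_count"] = tokens.count(w)
        x[w + "_is_count_nonzero"] = 1
    return x
```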
[Slide 6]
Negation
Add NOT_ to every word between negation and following punctuation:
didn’t like this movie , but I
didn’t NOT_like NOT_this NOT_movie but I
Das, Sanjiv and Mike Chen. 2001. Yahoo! for Amazon: Extracting market sentiment from stock message boards. In Proceedings of the Asia Pacific Finance Association Annual Conference (APFA).
Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment Classification using Machine Learning Techniques. EMNLP-2002, 79-86.
[Slide: SLP3]
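A small sketch of this negation transform (my own illustration, with a simplified list of negation cues and punctuation):

```python
NEGATIONS = {"not", "no", "never", "didn't", "don't", "isn't"}  # simplified cue list
PUNCT = {".", ",", ";", "!", "?"}

def mark_negation(tokens):
    """Prefix NOT_ to every token between a negation word and the next punctuation."""
    out, negating = [], False
    for tok in tokens:
        if tok in PUNCT:
            negating = False
            out.append(tok)
        elif negating:
            out.append("NOT_" + tok)
        else:
            out.append(tok)
            if tok.lower() in NEGATIONS:
                negating = True
    return out

# e.g. mark_negation("didn't like this movie , but I".split())
# -> ["didn't", "NOT_like", "NOT_this", "NOT_movie", ",", "but", "I"]
```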
[Slide 7]
Classification: LogReg (I)
First, we'll discuss how LogReg works.
Then, why it's set up the way that it is.
Application: spam filtering
[Slide 8]
Classification: LogReg (I)
● compute features (xs)
● given weights (betas)
x = (count “nigerian”, count “prince”, count “nigerian prince”)
β = (-1.0, -1.0, 4.0)
[Slide 9]
Classification: LogReg (II)
● compute the dot product
● compute the logistic function for the label probability
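A compact sketch of these two steps (illustrative only):

```python
import math

def predict_prob(x, beta):
    """P(y = 1 | x) for binary logistic regression."""
    z = sum(b * xj for b, xj in zip(beta, x))  # 1. dot product
    return 1.0 / (1.0 + math.exp(-z))          # 2. logistic function
```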
[Slide 10]
LogReg Exercise
features: (count “nigerian”, count “prince”, count “nigerian prince”)
β = (-1.0, -1.0, 4.0)
x = (1, 1, 1)
P(y=1 | x) = ?
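For reference, working through the arithmetic (not shown in the extracted slide text):

z = \beta^T x = (-1.0)(1) + (-1.0)(1) + (4.0)(1) = 2

P(y = 1 \mid x) = \frac{1}{1 + e^{-2}} \approx 0.88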
[Slide 11]
Classification: Dot Product
• Intuition: weighted sum of features
• All linear models have this form!
z = \sum_{j=1}^{N_\text{feat}} \beta_j x_{ij}
[Slide 12]
Logistic Function
• Why the logistic function?
• What does this function look like?
• What properties does it have?
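As a quick illustration (mine, not from the slides), evaluating g(z) = 1/(1 + e^{-z}) at a few points shows the answers: outputs always lie in (0, 1), g(0) = 0.5, it is monotonically increasing, and g(-z) = 1 - g(z):

```python
import math

def g(z):
    """Logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-z))

for z in (-5, -2, 0, 2, 5):
    print(z, round(g(z), 3))  # -5 -> 0.007, -2 -> 0.119, 0 -> 0.5, 2 -> 0.881, 5 -> 0.993
```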
[Slide 13]
Multiclass Logistic Regression
• Binary logreg: let x be a feature vector for the doc, and y either 0 or 1. β is a weight vector across the x features.
  p(y = 1 \mid x, \beta) = \frac{\exp(\beta^T x)}{1 + \exp(\beta^T x)}
• Multiclass logistic regression: weight vector for each class
• Prediction: dot product for each class
• Predicted probabilities: apply the softmax function to normalize
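A small sketch of those steps for the multiclass case (my own illustration; each class k gets its own weight vector β_k):

```python
import math

def softmax_probs(x, betas):
    """betas: dict mapping class k -> weight vector. Returns P(y = k | x) for every class."""
    scores = {k: sum(b * xj for b, xj in zip(beta_k, x)) for k, beta_k in betas.items()}  # dot product per class
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {k: math.exp(s - m) for k, s in scores.items()}
    Z = sum(exps.values())
    return {k: e / Z for k, e in exps.items()}  # softmax normalization
```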
[Slide 14]
NB as Log-Linear Model
● What are the features in Naive Bayes?
● What are the weights in Naive Bayes?
[Slide 15]
NB as Log-Linear Model
In both NB and LogReg, we compute the dot product!
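One way to see this (a standard derivation, not verbatim from the slides): the Naive Bayes log score for a class y is linear in the word counts x_w,

\log\Big( P(y) \prod_w P(w \mid y)^{x_w} \Big) = \log P(y) + \sum_w x_w \log P(w \mid y)

so the features are the word counts and the weights are the log conditional probabilities \log P(w \mid y), plus a bias term \log P(y).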
[Slide 16]
NB vs. LogReg
● Both compute the dot product
● NB: sum of log probs; LogReg: logistic fun.
[Slide 17]
Learning Weights
● NB: learn conditional probabilities separately via counting
● LogReg: learn weights jointly
[Slide 18]
Learning Weights
● given: a set of feature vectors and labels
● goal: learn the weights
[Slide 19]
Learning Weights
n examples; the x's are feature vectors; the y's are class labels
[Slide 20]
Learning Weights
We know:
g(z) = \frac{1}{1 + e^{-z}}, \qquad P(y = 1 \mid x) = g\Big( \sum_{j=1}^{N_\text{feat}} \beta_j x_{ij} \Big)
So let's try to maximize the probability of the entire dataset: maximum likelihood estimation.
[Slide 21]
Learning Weights
So let's try to maximize the probability of the entire dataset: maximum likelihood estimation.
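That is (writing out the objective, which appears in log form on the next slide):

\hat{\beta} = \arg\max_\beta \prod_{i=1}^{|X|} P(y_i \mid x_i; \beta) = \arg\max_\beta \sum_{i=1}^{|X|} \log P(y_i \mid x_i; \beta)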
[Slide 22]
Gradient ascent learning
• Follow the direction of steepest ascent. Iterate:
\beta^{(\text{new})} = \beta^{(\text{old})} + \eta \frac{\partial \ell}{\partial \beta}, \qquad \ell(\beta) = \sum_{i=1}^{|X|} \log P(y_i \mid x_i; \beta)
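A bare-bones sketch of this update for binary logistic regression (my own illustration; full-batch, fixed learning rate η, no regularization):

```python
import math

def train_logreg(X, y, n_iter=100, eta=0.1):
    """Gradient ascent on the log-likelihood. X: list of feature vectors, y: list of 0/1 labels."""
    n_feat = len(X[0])
    beta = [0.0] * n_feat
    for _ in range(n_iter):
        grad = [0.0] * n_feat
        for xi, yi in zip(X, y):
            z = sum(b * xj for b, xj in zip(beta, xi))
            p = 1.0 / (1.0 + math.exp(-z))      # P(y=1 | xi)
            for j in range(n_feat):
                grad[j] += (yi - p) * xi[j]     # d log P(yi|xi) / d beta_j
        beta = [b + eta * g for b, g in zip(beta, grad)]  # step in direction of steepest ascent
    return beta
```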
[Slide 23]
Pros & Cons
● LogReg doesn't assume independence
  ○ better calibrated probabilities
● NB is faster to train; less likely to overfit
[Slide 24]
NB & LogReg
● Both are linear models:
  z = \sum_{j=1}^{N_\text{feat}} \beta_j x_{ij}
● Training is different:
  ○ NB: weights trained independently
  ○ LogReg: weights trained jointly
[Slide 25]
LogReg: Important Details!
● Overfitting / regularization
● Visualizing decision boundary / bias term
● Multiclass LogReg
You can use scikit-learn (python) to test it out!
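For instance, a minimal scikit-learn sketch (my own toy documents and labels) that fits a bag-of-words logistic regression classifier:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["nigerian prince needs your help", "meeting notes attached", "claim your prize now"]
labels = [1, 0, 1]  # 1 = spam, 0 = not spam (made-up toy data)

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)   # bag-of-words count features

clf = LogisticRegression(C=1.0)      # C is the inverse regularization strength
clf.fit(X, labels)

print(clf.predict_proba(vectorizer.transform(["dear prince"]))[:, 1])  # P(spam)
```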
[Slide 26]
Overfitting and generalization
• Overfitting: your model performs overly optimistically on the training set, but generalizes poorly to other data (even from the same distribution)
• To diagnose: separate training set vs. test set.
• Why might logreg (or NB) overfit to the training set?
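A quick way to set up that diagnosis (a hedged sketch, assuming X and labels are a bag-of-words feature matrix and label list as in the scikit-learn example above):

```python
from sklearn.model_selection import train_test_split

# hold out 20% of the data as a test set; evaluate only on the held-out portion
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=0)
clf.fit(X_train, y_train)
print("train accuracy:", clf.score(X_train, y_train))
print("test accuracy: ", clf.score(X_test, y_test))  # a large gap suggests overfitting
```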
[Slide 27]
Regularization to address overfitting
• How did we regularize Naive Bayes and language modeling?
• For logistic regression: L2 regularization for training
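In its standard form (not spelled out in the extracted text), L2 regularization subtracts a penalty on large weights from the training objective:

\ell_\text{reg}(\beta) = \sum_{i=1}^{|X|} \log P(y_i \mid x_i; \beta) - \lambda \sum_j \beta_j^2

Larger λ pulls the weights toward zero; in scikit-learn's LogisticRegression this is controlled by C, the inverse of the regularization strength.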
[Slide 28]
Regularization tradeoffs
• No regularization <--------------> Very strong regularization
[Slide 29]
[Slide 30]
Bias Term
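Only the title of this slide survives in the extracted text; for reference, the bias (intercept) term is the standard additive constant in the score:

z = \beta_0 + \sum_{j=1}^{N_\text{feat}} \beta_j x_{ij}

Equivalently, it is the weight on an always-on feature x_0 = 1; it shifts the decision boundary away from the origin.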