Text Classification
+
Machine Learning Review
CS 287
Contents
Text Classification
Preliminaries: Machine Learning for NLP
Features and Preprocessing
Output
Classification
Linear Models
Linear Model 1: Naive Bayes
Earn a Degree based on your Life Experience
Obtain a Bachelor’s, Master’s, MBA, or PhD based on your
present knowledge and life experience.
No required tests, classes, or books. Confidentiality assured.
Join our fully recognized Degree Program.
Are you a truly qualified professional in your field but lack the
appropriate, recognized documentation to achieve your goals?
Or are you venturing into a new field and need a boost to get
your foot in the door so you can prove your capabilities?
Call us for information that can change your life and help you
to achieve your goals!!!
CALL NOW TO RECEIVE YOUR DIPLOMA WITHIN 30 DAYS
Sentiment

Good Sentences
- A thoughtful, provocative, insistently humanizing film.
- Occasionally melodramatic, it's also extremely effective.
- Guaranteed to move anyone who ever shook, rattled, or rolled.

Bad Sentences
- A sentimental mess that never rings true.
- This 100-minute movie only has about 25 minutes of decent material.
- Here, common sense flies out the window, along with the hail of bullets, none of which ever seem to hit Sascha.
Multiclass Sentiment

- ★★★★
  I visited The Abbey on several occasions on a visit to Cambridge and found it to be a solid, reliable and friendly place for a meal.
- ★★
  However, the food leaves something to be desired. A very obvious menu and average execution.
- ★★★★★
  Fun, friendly neighborhood bar. Good drinks, good food, not too pricey. Great atmosphere!
Text Categorization

- Straightforward setup.
- Lots of practical applications:
  - Spam Filtering
  - Sentiment
  - Text Categorization
  - e-discovery
  - Twitter Mining
  - Author Identification
  - ...
- Introduces machine learning notation.

However, a relatively solved problem these days.
Contents
Text Classification
Preliminaries: Machine Learning for NLP
Features and Preprocessing
Output
Classification
Linear Models
Linear Model 1: Naive Bayes
Preliminary Notation

- b, m; bold letters for vectors.
- B, M; bold capital letters for matrices.
- B, M; script-case for sets.
- B, M; capital letters for random variables.
- b_i, x_i; lower case for scalars or indexing into vectors.
- δ(i); one-hot vector at position i, e.g. δ(2) = [0, 1, 0, ...]
- 1(x = y); indicator: 1 if x = y, otherwise 0
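As a quick sketch of this notation in code (a minimal NumPy version; the helper names `delta` and `indicator` are our own, not from the lecture):

```python
import numpy as np

def delta(i, size):
    """One-hot vector with a 1 at (1-indexed) position i."""
    v = np.zeros(size)
    v[i - 1] = 1.0
    return v

def indicator(x, y):
    """1(x = y): 1 if x equals y, otherwise 0."""
    return 1 if x == y else 0

print(delta(2, 4))      # the delta(2) example above, with size 4
print(indicator(3, 3))  # 1
```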
Text Classification

1. Extract pertinent information from the sentence.
2. Use this to construct an input representation.
3. Classify this vector into an output class.

Input Representation:
- How do we convert text into a mathematical representation?
- Main focus of this class: the representation of language.
- A recurring point in coming lectures: sparse vs. dense representations.
Sparse Features (Notation from YG)

- F; a discrete set of feature values.
- f_1 ∈ F, ..., f_k ∈ F; active features for the input.
  For a given sentence, let f_1, ..., f_k be the relevant features.
  Typically k ≪ |F|.
- Sparse representation of the input defined as,

      x = ∑_{i=1}^{k} δ(f_i)

- x ∈ R^{1×d_in}; input representation
Features 1: Sparse Bag-of-Words Features

Representation is counts of input words,
- F; the vocabulary of the language.
- x = ∑_i δ(f_i)

Example: Movie review input,

    A sentimental mess

x = δ(word:A) + δ(word:sentimental) + δ(word:mess)

    x⊤ = [1, ..., 0, 0]⊤ + [0, ..., 0, 1]⊤ + [0, ..., 1, 0]⊤ = [1, ..., 1, 1]⊤

with coordinates indexed word:A, ..., word:mess, word:sentimental.
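The bag-of-words construction above can be sketched in a few lines of NumPy (the tiny vocabulary and its ordering here are illustrative assumptions):

```python
import numpy as np

# Illustrative three-word vocabulary, mapping feature -> coordinate.
vocab = {"word:A": 0, "word:mess": 1, "word:sentimental": 2}

def delta(i, size):
    """One-hot vector at position i (0-indexed in code)."""
    v = np.zeros(size)
    v[i] = 1.0
    return v

def bow(words):
    """x = sum_i delta(f_i) over the active word features."""
    return sum(delta(vocab["word:" + w], len(vocab)) for w in words)

x = bow(["A", "sentimental", "mess"])
print(x)  # counts over the vocabulary coordinates
```

Because the representation sums one-hots, repeated words accumulate counts rather than staying binary.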
Features 2: Sparse Word Properties

Representation can use specific aspects of the text.
- F; spelling, all-capitals, trigger words, etc.
- x = ∑_i δ(f_i)

Example: Spam email,

    Your diploma puts a UUNIVERSITY JOB PLACEMENT COUNSELOR at your disposal.

x = δ(misspelling) + δ(allcapital) + δ(trigger:diploma) + ...

    x⊤ = [1, ..., 0, 0]⊤ + [0, ..., 1, 0]⊤ + [0, ..., 0, 1]⊤ = [1, ..., 1, 1]⊤

with coordinates indexed misspelling, ..., allcapital, trigger:diploma.
Text Classification: Output Representation

1. Extract pertinent information from the sentence.
2. Use this to construct an input representation.
3. Classify this vector into an output class.

Output Representation:
- How do we encode the output classes?
- We will use a one-hot output encoding.
- In future lectures: efficiency of the output encoding.
Output Class Notation

- C = {1, ..., d_out}; possible output classes
- c ∈ C; always one true output class
- y = δ(c) ∈ R^{1×d_out}; true one-hot output representation
Output Form: Binary Classification

Examples: spam/not-spam, good review/bad review, relevant/irrelevant document, many others.

- d_out = 2; two possible classes
- In our notation,

      bad:  c = 1, y = [1 0]   vs.   good: c = 2, y = [0 1]

- Can also use a single output sign representation with d_out = 1.
Output Form: Multiclass Classification

Examples: Yelp stars, etc.
- d_out = 5; for example
- In our notation, one star, two stars, ...

      ★:  c = 1, y = [1 0 0 0 0]   vs.   ★★: c = 2, y = [0 1 0 0 0]
      ...

Examples: Word Prediction (Unit 3)
- d_out > 100,000
- In our notation, C is the vocabulary and each c is a word.

      the: c = 1, y = [1 0 0 0 ... 0]   vs.   dog: c = 2, y = [0 1 0 0 ... 0]
      ...
Evaluation

- Consider evaluating accuracy on outputs y_1, ..., y_n.
- Given predictions c_1, ..., c_n we measure accuracy as,

      (1/n) ∑_{i=1}^{n} 1(δ(c_i) = y_i)

- Simplest of several different metrics we will explore in the class.
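The accuracy metric above is a one-liner in code; a minimal sketch (the function name is our own):

```python
def accuracy(preds, golds):
    """(1/n) * sum_i 1(c_i matches the true class).
    Comparing predicted class indices is equivalent to
    comparing delta(c_i) against the one-hot y_i."""
    n = len(golds)
    return sum(1 for c, y in zip(preds, golds) if c == y) / n

print(accuracy([1, 2, 2, 1], [1, 2, 1, 1]))  # 0.75
```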
Contents
Text Classification
Preliminaries: Machine Learning for NLP
Features and Preprocessing
Output
Classification
Linear Models
Linear Model 1: Naive Bayes
Supervised Machine Learning

Let,
- (x_1, y_1), ..., (x_n, y_n); training data
- x_i ∈ R^{1×d_in}; input representations
- y_i ∈ R^{1×d_out}; true output representations (one-hot vectors)

Goal: Learn a classifier from inputs to output classes.

Note:
- x_i is an input vector; x_{i,j} is an element of that vector, or just x_j when there is a clear single input.
- Practically, store the design matrix X ∈ R^{n×d_in} and the output classes.
Experimental Setup

- Data is split into three parts: training, validation, and test.
- Experiments are all run on training and validation; test is for the final output.
- For assignments: full training and validation data, and only the inputs for test.

For very small text classification data sets,
- Use K-fold cross-validation:
  1. Split into K folds (equal splits).
  2. For each fold, train on the other K−1 folds and test on the current fold.
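The two cross-validation steps above can be sketched as follows (a minimal version using contiguous folds; the function name is our own):

```python
def kfold_splits(n, k):
    """Yield (train_idx, test_idx) pairs for K-fold cross-validation.
    Step 1: split indices into k roughly equal contiguous folds.
    Step 2: for each fold, train on the other k-1, test on it."""
    fold = n // k
    idx = list(range(n))
    for j in range(k):
        lo = j * fold
        hi = n if j == k - 1 else lo + fold
        test = idx[lo:hi]
        train = idx[:lo] + idx[hi:]
        yield train, test

for train, test in kfold_splits(6, 3):
    print(test)  # each example appears in exactly one test fold
```

In practice the data is shuffled before splitting; that step is omitted here to keep the sketch deterministic.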
Linear Models for Classification

Linear model,

    y = f(xW + b)

- W ∈ R^{d_in×d_out}, b ∈ R^{1×d_out}; model parameters
- f : R^{d_out} → R^{d_out}; activation function
- Sometimes z = xW + b, informally the "score" vector.
- Note z and y are not one-hot.

Class prediction,

    c = argmax_{i∈C} y_i = argmax_{i∈C} (xW + b)_i
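The prediction rule above is a single matrix-vector product plus an argmax. A NumPy sketch (the weight values are illustrative toy numbers, and classes are 0-indexed in code):

```python
import numpy as np

# Toy shapes: d_in = 3 features, d_out = 2 classes.
W = np.array([[ 1.0, -1.0],
              [-2.0,  0.5],
              [ 0.0,  1.0]])   # d_in x d_out
b = np.array([0.1, -0.1])      # 1 x d_out

def predict(x):
    z = x @ W + b              # score vector z = xW + b
    return int(np.argmax(z))   # highest-scoring class

x = np.array([1.0, 0.0, 1.0])  # sparse input with two active features
print(predict(x))
```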
Interpreting Linear Models

Parameters give scores to possible outputs,
- W_{f,i} is the score for sparse feature f under class i
- b_i is a prior score for class i
- y_i is the total score for class i
- c is the highest-scoring class under the linear model.

Example:
- For a single feature's scores,

      [β_1, β_2] = δ(word:dreadful) W

  Expect β_1 > β_2 (assuming 2 is the class "good").
Probabilistic Linear Models

Can estimate a linear model probabilistically,
- Let the output be a random variable Y, with sample space C.
- Let the representation be a random vector X.
- (Simplified frequentist representation.)
- Interested in estimating parameters θ,

      P(Y | X; θ)

Informally we use p(y = δ(c) | x) for P(Y = c | X = x).
Generative Model: Joint Log-Likelihood as Loss

- (x_1, y_1), ..., (x_n, y_n); supervised data
- Select parameters to maximize the likelihood of the training data, i.e. minimize the negative log-likelihood (NLL),

      L(θ) = −∑_{i=1}^{n} log p(x_i, y_i; θ)

  For linear models, θ = (W, b).
- That is, solve

      argmin_θ L(θ)
Multinomial Naive Bayes

Reminder, joint probability chain rule,

    p(x, y) = p(x | y) p(y)

For sparse features with observed classes we can write this as,

    p(x, y) = p(x_{f_1} = 1, ..., x_{f_k} = 1 | y = δ(c)) · p(y = δ(c))
            = ∏_{i=1}^{k} p(x_{f_i} = 1 | x_{f_1} = 1, ..., x_{f_{i−1}} = 1, y = δ(c)) · p(y = δ(c))
            ≈ ∏_{i=1}^{k} p(x_{f_i} = 1 | y) · p(y)

The first equality is by the chain rule; the approximation is the naive independence assumption.
Estimating Multinomial Distributions

Let S be a random variable with sample space S, and suppose we have observations s_1, ..., s_n.

- P(S = s; θ) = Cat(s; θ); parameterized as a multinomial distribution.
- Minimizing the NLL for a multinomial (the MLE) has a closed form,

      P(S = s; θ) = Cat(s; θ) = (1/n) ∑_{i=1}^{n} 1(s_i = s)

- Exercise: derive this by minimizing L.
- Also called categorical or multinoulli (in Murphy).
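The closed-form MLE above is just "count and normalize"; a minimal sketch (the function name is our own):

```python
from collections import Counter

def fit_multinomial(samples):
    """Closed-form MLE for a categorical/multinomial:
    theta_s = P(S = s) = count(s) / n."""
    n = len(samples)
    counts = Counter(samples)
    return {s: c / n for s, c in counts.items()}

theta = fit_multinomial(["a", "b", "a", "a"])
print(theta)  # {'a': 0.75, 'b': 0.25}
```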
Multinomial Naive Bayes

- Both p(y) and p(x | y) are parameterized as multinomials.
- Fit the first as,

      p(y = δ(c)) = (1/n) ∑_{i=1}^{n} 1(y_i = c)

- Fit the second using a count matrix F. Let

      F_{f,c} = ∑_{i=1}^{n} 1(y_i = c) 1(x_{i,f} = 1)   for all c ∈ C, f ∈ F

  Then,

      p(x_f = 1 | y = δ(c)) = F_{f,c} / ∑_{f′∈F} F_{f′,c}
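The fitting steps above can be sketched in NumPy on toy data (the data values here are illustrative, and classes are 0-indexed in code):

```python
import numpy as np

# X holds 0/1 feature indicators, y the class index of each example.
X = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 1, 1]])      # n x |F|
y = np.array([0, 0, 1])        # class per example
n_classes = 2

# Class prior: p(y = c) = sum_i 1(y_i = c) / n
prior = np.array([(y == c).mean() for c in range(n_classes)])

# Count matrix: F[f, c] = sum_i 1(y_i = c) * 1(x_{i,f} = 1)
F = np.stack([X[y == c].sum(axis=0) for c in range(n_classes)], axis=1)

# Conditional: p(x_f = 1 | y = c) = F[f, c] / sum_{f'} F[f', c]
cond = F / F.sum(axis=0, keepdims=True)
print(prior)
print(cond)
```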
Alternative: Multivariate Bernoulli Naive Bayes

- p(y) is multinomial as above, and p(x_f | y) is a Bernoulli over each feature.
- Fit the class as categorical,

      p(y = δ(c)) = (1/n) ∑_{i=1}^{n} 1(y_i = c)

- Fit the features using the count matrix F. Let

      F_{f,c} = ∑_{i=1}^{n} 1(y_i = c) 1(x_{i,f} = 1)   for all c ∈ C, f ∈ F

  Then,

      p(x_f = 1 | y = δ(c)) = F_{f,c} / ∑_{i=1}^{n} 1(y_i = c)
Getting a Conditional Distribution

- Generative models estimate P(X, Y); we want P(Y | X).
- Bayes' rule,

      p(y | x) = p(x | y) p(y) / p(x)

- In log-space,

      log p(y | x) = log p(x | y) + log p(y) − log p(x)
Prediction with Naive Bayes

- For prediction the last term is constant, so

      argmax_c log p(y = δ(c) | x) = argmax_c [ log p(x | y = δ(c)) + log p(y = δ(c)) ]

- Can write this as a linear model,

      W_{f,c} = log p(x_f = 1 | y = δ(c))   for all c ∈ C, f ∈ F
      b_c = log p(y = δ(c))                 for all c ∈ C
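Packing the naive Bayes estimates into linear-model parameters, as above, can be sketched like this (the probability values are illustrative toy numbers, and classes are 0-indexed in code):

```python
import numpy as np

# cond[f, c] = p(x_f = 1 | y = c), prior[c] = p(y = c) (toy numbers).
cond = np.array([[0.50, 0.10],
                 [0.25, 0.40],
                 [0.25, 0.50]])
prior = np.array([0.6, 0.4])

W = np.log(cond)   # W[f, c] = log p(x_f = 1 | y = c)
b = np.log(prior)  # b[c]    = log p(y = c)

x = np.array([1.0, 0.0, 1.0])   # active features f1 and f3
c_hat = int(np.argmax(x @ W + b))
print(c_hat)
```

Summing log-probabilities via xW + b multiplies the per-feature conditionals, exactly matching the naive Bayes scoring rule.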
Getting a Conditional Distribution

- What if we want conditional probabilities?

      p(y | x) = p(x | y) p(y) / p(x) = p(x | y) p(y) / ∑_{c′∈C} p(x, y = δ(c′))

- The denominator is obtained by renormalizing,

      p(y | x) ∝ p(x | y) p(y)
Practical Aspects: Calculating Log-Sum-Exp

- Because of numerical issues, calculate in log-space,

      f(z) = log p(y = δ(c) | x) = z_c − log ∑_{c′∈C} exp(z_{c′})

  where for naive Bayes

      z = xW + b

- However, it is hard to calculate

      log ∑_{c′∈C} exp(z_{c′})

  directly.
- Instead compute

      log ∑_{c′∈C} exp(z_{c′} − M) + M,   where M = max_{c′∈C} z_{c′}
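The stabilized computation above is a few lines of code; a minimal sketch:

```python
import math

def log_sum_exp(z):
    """Numerically stable log sum_{c'} exp(z_{c'}):
    subtract M = max(z) before exponentiating, add it back after."""
    M = max(z)
    return M + math.log(sum(math.exp(v - M) for v in z))

# Naively computing exp(1000.0) overflows a float,
# but the shifted version never exponentiates anything above 0.
print(log_sum_exp([1000.0, 999.0]))
```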
Digression: Zipf’s Law
Laplace Smoothing

Method for handling the long tail of words by redistributing mass,
- Add a value of α to each element in the sample space before normalizing,

      θ_s = (α + ∑_{i=1}^{n} 1(s_i = s)) / (α|S| + n)

- (Similar to a Dirichlet prior in a Bayesian interpretation.)

For naive Bayes: F ← α + F.
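The smoothed estimate above amounts to adding α pseudo-counts before normalizing; a minimal sketch (the function name is our own):

```python
from collections import Counter

def fit_smoothed(samples, space, alpha=1.0):
    """Laplace-smoothed categorical estimate:
    theta_s = (alpha + count(s)) / (alpha * |S| + n)."""
    n = len(samples)
    counts = Counter(samples)
    denom = alpha * len(space) + n
    return {s: (alpha + counts[s]) / denom for s in space}

theta = fit_smoothed(["a", "a", "b"], space=["a", "b", "c"], alpha=1.0)
print(theta)  # the unseen "c" still gets nonzero mass
```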
Naive Bayes In Practice

- Very fast to train.
- Relatively interpretable.
- Performs quite well on small datasets (RT-S [movie reviews], CR [customer reports], MPQA [opinion polarity], SUBJ [subjectivity]).