
Text Classification + Machine Learning Review

CS 287

Contents

Text Classification

Preliminaries: Machine Learning for NLP

Features and Preprocessing

Output

Classification

Linear Models

Linear Model 1: Naive Bayes

Earn a Degree based on your Life Experience

Obtain a Bachelor’s, Master’s, MBA, or PhD based on your present knowledge and life experience.

No required tests, classes, or books. Confidentiality assured.

Join our fully recognized Degree Program.

Are you a truly qualified professional in your field but lack the appropriate, recognized documentation to achieve your goals?

Or are you venturing into a new field and need a boost to get your foot in the door so you can prove your capabilities?

Call us for information that can change your life and help you to achieve your goals!!!

CALL NOW TO RECEIVE YOUR DIPLOMA WITHIN 30 DAYS

Sentiment

Good Sentences

- A thoughtful, provocative, insistently humanizing film.
- Occasionally melodramatic, it’s also extremely effective.
- Guaranteed to move anyone who ever shook, rattled, or rolled.

Bad Sentences

- A sentimental mess that never rings true.
- This 100-minute movie only has about 25 minutes of decent material.
- Here, common sense flies out the window, along with the hail of bullets, none of which ever seem to hit Sascha.

Multiclass Sentiment

- ★★★★
  I visited The Abbey on several occasions on a visit to Cambridge and found it to be a solid, reliable and friendly place for a meal.

- ★★
  However, the food leaves something to be desired. A very obvious menu and average execution.

- ★★★★★
  Fun, friendly neighborhood bar. Good drinks, good food, not too pricey. Great atmosphere!

Text Categorization

- Straightforward setup.
- Lots of practical applications:
  - Spam Filtering
  - Sentiment
  - Text Categorization
  - e-discovery
  - Twitter Mining
  - Author Identification
  - . . .
- Introduces machine learning notation.

However, a relatively solved problem these days.

Contents

Text Classification

Preliminaries: Machine Learning for NLP

Features and Preprocessing

Output

Classification

Linear Models

Linear Model 1: Naive Bayes

Preliminary Notation

- b, m; bold letters for vectors.
- B, M; bold capital letters for matrices.
- B, M; script-case for sets.
- B, M; capital letters for random variables.
- b_i, x_i; lower case for scalars or indexing into vectors.
- δ(i); one-hot vector at position i,

  δ(2) = [0; 1; 0; . . .]

- 1(x = y); indicator: 1 if x = y, otherwise 0.

Text Classification

1. Extract pertinent information from the sentence.
2. Use this to construct an input representation.
3. Classify this vector into an output class.

Input Representation:

- Conversion from text into a mathematical representation?
- Main focus of this class: representation of language.
- Point in coming lectures: sparse vs. dense representations.


Sparse Features (Notation from YG)

- F; a discrete set of feature values.
- f_1 ∈ F, . . . , f_k ∈ F; active features for the input.

  For a given sentence, let f_1, . . . , f_k be the relevant features. Typically k ≪ |F|.

- Sparse representation of the input defined as,

  x = ∑_{i=1}^{k} δ(f_i)

- x ∈ R^{1×d_in}; input representation

Features 1: Sparse Bag-of-Words Features

Representation is counts of input words,

- F; the vocabulary of the language.
- x = ∑_i δ(f_i)

Example: Movie review input (a small sketch follows below),

A sentimental mess

x = δ(word:A) + δ(word:sentimental) + δ(word:mess)

x^T = [1, . . . , 0, 0]^T + [0, . . . , 0, 1]^T + [0, . . . , 1, 0]^T = [1, . . . , 1, 1]^T

where the rows (features) are word:A, . . . , word:mess, word:sentimental.
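As a concrete illustration (not from the slides), here is a minimal NumPy sketch of this construction; the tiny vocabulary and the helper names are hypothetical.

```python
import numpy as np

# Hypothetical toy vocabulary F (in practice, built from the training corpus).
vocab = {"word:A": 0, "word:sentimental": 1, "word:mess": 2}

def one_hot(index, size):
    """delta(i): one-hot row vector with a 1 at `index`."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

def bag_of_words(tokens, vocab):
    """x = sum_i delta(f_i) over the active word features."""
    x = np.zeros(len(vocab))
    for tok in tokens:
        f = "word:" + tok
        if f in vocab:                # unseen words are simply dropped here
            x += one_hot(vocab[f], len(vocab))
    return x

x = bag_of_words(["A", "sentimental", "mess"], vocab)
print(x)   # [1. 1. 1.]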

Features 2: Sparse Word Properties

Representation can use specific aspects of the text.

- F; spelling, all-capitals, trigger words, etc.
- x = ∑_i δ(f_i)

Example: Spam Email (a small sketch follows below),

Your diploma puts a UUNIVERSITY JOB PLACEMENT COUNSELOR at your disposal.

x = δ(misspelling) + δ(allcapital) + δ(trigger:diploma) + . . .

x^T = [1, . . . , 0, 0]^T + [0, . . . , 1, 0]^T + [0, . . . , 0, 1]^T + . . . = [1, . . . , 1, 1]^T

where the rows (features) are misspelling, . . . , allcapital, trigger:diploma.
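Likewise, a hedged sketch of property-style features in NumPy; the detectors and the tiny stand-in lexicon below are illustrative assumptions, not the lecture's actual feature set.

```python
import numpy as np

# Illustrative feature set F: hand-written property detectors rather than raw words.
FEATURES = ["misspelling", "allcapital", "trigger:diploma"]
FEATURE_INDEX = {f: i for i, f in enumerate(FEATURES)}
TRIGGERS = {"diploma"}
LEXICON = {"your", "diploma", "puts", "a", "university", "job",
           "placement", "counselor", "at", "disposal"}   # tiny stand-in dictionary

def active_features(tokens):
    """Return the active property features f_1, ..., f_k for one email."""
    feats = []
    if any(t.isupper() and len(t) > 1 for t in tokens):
        feats.append("allcapital")
    if any(t.isalpha() and t.lower() not in LEXICON for t in tokens):
        feats.append("misspelling")
    feats += ["trigger:" + t.lower() for t in tokens if t.lower() in TRIGGERS]
    return feats

def featurize(tokens):
    """x = sum_i delta(f_i) over the active property features."""
    x = np.zeros(len(FEATURES))
    for f in active_features(tokens):
        if f in FEATURE_INDEX:
            x[FEATURE_INDEX[f]] += 1.0
    return x

tokens = "Your diploma puts a UUNIVERSITY JOB PLACEMENT COUNSELOR at your disposal".split()
print(featurize(tokens))   # counts for [misspelling, allcapital, trigger:diploma]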

Text Classification: Output Representation

1. Extract pertinent information from the sentence.
2. Use this to construct an input representation.
3. Classify this vector into an output class.

Output Representation:

- How do we encode the output classes?
- We will use a one-hot output encoding.
- In future lectures: efficiency of output encoding.

Output Class Notation

- C = {1, . . . , d_out}; possible output classes
- c ∈ C; always one true output class
- y = δ(c) ∈ R^{1×d_out}; true one-hot output representation

Output Form: Binary Classification

Examples: spam/not-spam, good review/bad review, relevant/irrelevant document, many others.

- d_out = 2; two possible classes
- In our notation,

  bad: c = 1, y = [1 0]   vs.   good: c = 2, y = [0 1]

- Can also use a single output sign representation with d_out = 1

Output Form: Multiclass Classification

Examples: Yelp stars, etc.

- d_out = 5; for example
- In our notation, one star, two stars, . . .

  ★: c = 1, y = [1 0 0 0 0]   vs.   ★★: c = 2, y = [0 1 0 0 0]

  . . .

Examples: Word Prediction (Unit 3)

- d_out > 100,000
- In our notation, C is the vocabulary and each c is a word.

  the: c = 1, y = [1 0 0 0 . . . 0]   vs.   dog: c = 2, y = [0 1 0 0 . . . 0]

  . . .

Evaluation

- Consider evaluating accuracy on outputs y_1, . . . , y_n.
- Given predictions c_1, . . . , c_n we measure accuracy as (a short sketch follows this list),

  (1/n) ∑_{i=1}^{n} 1(δ(c_i) = y_i)

- Simplest of several different metrics we will explore in the class.
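A minimal NumPy sketch of this accuracy metric, assuming predicted class indices and one-hot gold rows; the function names are illustrative.

```python
import numpy as np

def accuracy(predictions, Y):
    """Fraction of examples where delta(c_i) matches the one-hot row y_i."""
    # Comparing the predicted class index to the argmax of the one-hot row
    # is equivalent to checking 1(delta(c_i) = y_i).
    gold = Y.argmax(axis=1)
    return float(np.mean(np.asarray(predictions) == gold))

# Toy check: three examples, two correct.
Y = np.array([[1, 0], [0, 1], [0, 1]])
print(accuracy([0, 1, 0], Y))   # 0.666...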

Contents

Text Classification

Preliminaries: Machine Learning for NLP

Features and Preprocessing

Output

Classification

Linear Models

Linear Model 1: Naive Bayes

Supervised Machine Learning

Let,

- (x_1, y_1), . . . , (x_n, y_n); training data
- x_i ∈ R^{1×d_in}; input representations
- y_i ∈ R^{1×d_out}; true output representations (one-hot vectors)

Goal: Learn a classifier from input to output classes.

Note:

- x_i is an input vector; x_{i,j} is an element of that vector (or just x_j when there is a clear single input).
- Practically, store the design matrix X ∈ R^{n×d_in} and the output classes.

Experimental Setup

- Data is split into three parts: training, validation, and test.
- Experiments are all run on training and validation; test is the final output.
- For assignments, you get the full training and validation data, and only the inputs for test.

For very small text classification data sets,

- Use K-fold cross-validation (sketched below).

  1. Split into K folds (equal splits).
  2. For each fold, train on the other K−1 folds, test on the current fold.
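A minimal sketch of this K-fold loop, assuming NumPy arrays and a user-supplied train_and_eval(X_train, y_train, X_test, y_test) callback that returns a score; both names are hypothetical.

```python
import numpy as np

def k_fold_indices(n, K, seed=0):
    """Split example indices 0..n-1 into K roughly equal folds."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    return np.array_split(perm, K)

def cross_validate(X, y, K, train_and_eval):
    """For each fold: train on the other K-1 folds, evaluate on the held-out fold."""
    folds = k_fold_indices(len(y), K)
    scores = []
    for k in range(K):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
        scores.append(train_and_eval(X[train_idx], y[train_idx],
                                     X[test_idx], y[test_idx]))
    return float(np.mean(scores))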

Linear Models for Classification

Linear model,

y = f(xW + b)

- W ∈ R^{d_in×d_out}, b ∈ R^{1×d_out}; model parameters
- f: R^{d_out} → R^{d_out}; activation function
- Sometimes z = xW + b is informally called the "score" vector.
- Note z and y are not one-hot.

Class prediction,

c = argmax_{i∈C} y_i = argmax_{i∈C} (xW + b)_i
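A minimal prediction sketch for this linear model in NumPy; the toy parameter values below are illustrative, not fitted.

```python
import numpy as np

def predict(x, W, b):
    """c = argmax_i (xW + b)_i for a single input row x."""
    z = x @ W + b          # the "score" vector z
    return int(np.argmax(z))

# Toy dimensions: d_in = 3 features, d_out = 2 classes.
W = np.array([[ 0.5, -0.5],
              [-1.0,  1.0],
              [ 0.2,  0.1]])
b = np.array([0.0, 0.1])
x = np.array([1.0, 1.0, 0.0])      # sparse feature counts
print(predict(x, W, b))            # index of the highest-scoring class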

Interpreting Linear Models

Parameters give scores to possible outputs,

- W_{f,i} is the score for sparse feature f under class i
- b_i is a prior score for class i
- y_i is the total score for class i
- c is the highest scoring class under the linear model.

Example:

- For a single feature score,

  [β_1, β_2] = δ(word:dreadful) W,

  we expect β_1 > β_2 (assuming class 2 is "good").

Probabilistic Linear Models

Can estimate a linear model probabilistically,

- Let the output be a random variable Y, with sample space C.
- Let the representation be a random vector X.
- (Simplified frequentist representation)
- Interested in estimating parameters θ,

  P(Y | X; θ)

Informally we write p(y = δ(c) | x) for P(Y = c | X = x).

Generative Model: Joint Log-Likelihood as Loss

- (x_1, y_1), . . . , (x_n, y_n); supervised data
- Select parameters to maximize the likelihood of the training data,

  L(θ) = −∑_{i=1}^{n} log p(x_i, y_i; θ)

  For linear models, θ = (W, b).

- Do this by minimizing the negative log-likelihood (NLL),

  argmin_θ L(θ)

Multinomial Naive Bayes

Reminder, joint probability chain rule,

p(x, y) = p(x|y) p(y)

For sparse features with observed classes, we can write this as,

p(x, y) = p(x_{f_1} = 1, . . . , x_{f_k} = 1 | y = δ(c)) p(y = δ(c))
        = ∏_{i=1}^{k} p(x_{f_i} = 1 | x_{f_1} = 1, . . . , x_{f_{i−1}} = 1, y = δ(c)) · p(y = δ(c))
        ≈ ∏_{i=1}^{k} p(x_{f_i} = 1 | y) · p(y)

The first step is the chain rule; the second is the (naive) independence assumption.


Estimating Multinomial Distributions

Let S be a random variable with sample space S, and suppose we have observations s_1, . . . , s_n,

- P(S = s; θ) = Cat(s; θ); parameterized as a multinomial distribution.
- Minimizing the NLL for a multinomial (the MLE) has a closed form,

  P(S = s; θ) = Cat(s; θ) = (1/n) ∑_{i=1}^{n} 1(s_i = s)

- Exercise: Derive this by minimizing L.
- Also called categorical or multinoulli (in Murphy).


Multinomial Naive Bayes

- Both p(y) and p(x|y) are parameterized as multinomials (a counting sketch follows this slide).
- Fit the first as,

  p(y = δ(c)) = (1/n) ∑_{i=1}^{n} 1(y_i = c)

- Fit the second using a count matrix F. Let

  F_{f,c} = ∑_{i=1}^{n} 1(y_i = c) 1(x_{i,f} = 1)   for all c ∈ C, f ∈ F

- Then,

  p(x_f = 1 | y = δ(c)) = F_{f,c} / ∑_{f′∈F} F_{f′,c}
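A hedged counting sketch of these two estimates in NumPy; the function name is illustrative, and smoothing is deferred to the Laplace-smoothing slide below.

```python
import numpy as np

def fit_multinomial_nb(X, y, n_classes):
    """X: (n, |F|) feature matrix; y: (n,) NumPy array of class indices 0..n_classes-1."""
    n, n_feats = X.shape
    # Class prior: p(y = delta(c)) = count(c) / n.
    class_counts = np.bincount(y, minlength=n_classes).astype(float)
    prior = class_counts / n
    # Count matrix F_{f,c} = sum_i 1(y_i = c) 1(x_{i,f} = 1).
    F = np.zeros((n_feats, n_classes))
    for c in range(n_classes):
        F[:, c] = (X[y == c] > 0).sum(axis=0)
    # p(x_f = 1 | y = delta(c)) = F_{f,c} / sum_{f'} F_{f',c}.
    # Assumes every class has at least one active feature; smoothing comes later.
    likelihood = F / F.sum(axis=0, keepdims=True)
    return prior, likelihood
```

Note that using raw counts X[y == c].sum(axis=0) instead of the indicator would give the more common multinomial convention that counts repeated word occurrences; the sketch follows the slide's indicator form.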


Alternative: Multivariate Bernoulli Naive Bayes

- p(y) is multinomial as above, and p(x_f | y) is Bernoulli over each feature (sketched below).
- Fit the class as categorical,

  p(y = δ(c)) = (1/n) ∑_{i=1}^{n} 1(y_i = c)

- Fit the features using the count matrix F. Let

  F_{f,c} = ∑_{i=1}^{n} 1(y_i = c) 1(x_{i,f} = 1)   for all c ∈ C, f ∈ F

- Then,

  p(x_f | y = δ(c)) = F_{f,c} / ∑_{i=1}^{n} 1(y_i = c)
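The Bernoulli variant only changes the normalizer: divide each count by the class count rather than by the total feature count. A sketch under the same illustrative assumptions as above:

```python
import numpy as np

def fit_bernoulli_nb(X, y, n_classes):
    """X: (n, |F|) binary feature indicators; y: (n,) NumPy array of class indices."""
    n, n_feats = X.shape
    class_counts = np.bincount(y, minlength=n_classes).astype(float)
    prior = class_counts / n
    # F_{f,c} = sum_i 1(y_i = c) 1(x_{i,f} = 1), then divide by the class count,
    # giving a per-feature Bernoulli parameter rather than a distribution over F.
    F = np.zeros((n_feats, n_classes))
    for c in range(n_classes):
        F[:, c] = (X[y == c] > 0).sum(axis=0)
    likelihood = F / class_counts            # p(x_f = 1 | y = delta(c))
    return prior, likelihood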


Getting a Conditional Distribution

- Generative models give estimates of P(X, Y), but we want P(Y | X).
- Bayes' rule,

  p(y|x) = p(x|y) p(y) / p(x)

- In log-space,

  log p(y|x) = log p(x|y) + log p(y) − log p(x)

Prediction with Naive Bayes

- For prediction, the last term is constant, so

  argmax_c log p(y = δ(c) | x) = argmax_c [ log p(x | y = δ(c)) + log p(y = δ(c)) ]

- Can write this as a linear model (see the sketch below),

  W_{f,c} = log p(x_f = 1 | y = δ(c))   for all c ∈ C, f ∈ F
  b_c = log p(y = δ(c))   for all c ∈ C
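A minimal sketch of this conversion, building on the hypothetical fit_multinomial_nb sketch above; it assumes smoothed, strictly positive probabilities (see the Laplace-smoothing slide).

```python
import numpy as np

def nb_to_linear(prior, likelihood):
    """W_{f,c} = log p(x_f = 1 | y = delta(c)), b_c = log p(y = delta(c)).
    Assumes strictly positive (smoothed) probabilities so the logs are finite."""
    return np.log(likelihood), np.log(prior)

def nb_predict(x, W, b):
    """argmax_c of the naive Bayes score, written as the linear model xW + b."""
    return int(np.argmax(x @ W + b))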

Getting a Conditional Distribution

- What if we want conditional probabilities?

  p(y|x) = p(x|y) p(y) / p(x) = p(x|y) p(y) / ∑_{c′∈C} p(x, y = δ(c′))

- The denominator is obtained by renormalizing,

  p(y|x) ∝ p(x|y) p(y)

Practical Aspects: Calculating Log-Sum-Exp

- Because of numerical issues, calculate in log-space (sketched below),

  f(z) = log p(y = δ(c) | x) = z_c − log ∑_{c′∈C} exp(z_{c′})

  where for naive Bayes

  z = xW + b

- However, it is hard to calculate

  log ∑_{c′∈C} exp(z_{c′})

  directly.

- Instead compute

  log ∑_{c′∈C} exp(z_{c′} − M) + M

  where M = max_{c′∈C} z_{c′}
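A minimal NumPy sketch of the stabilized computation; the function name is illustrative.

```python
import numpy as np

def log_softmax(z):
    """log p(y = delta(c) | x) for every c, computed stably in log-space."""
    M = z.max()                                   # M = max_c' z_{c'}
    log_norm = np.log(np.exp(z - M).sum()) + M    # log sum_c' exp(z_{c'})
    return z - log_norm

z = np.array([1000.0, 1001.0, 999.0])
print(log_softmax(z))        # finite values; a naive exp(z) would overflow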


Digression: Zipf’s Law

Laplace Smoothing

Method for handling the long tail of words by distributing mass,

- Add a value of α to each element in the sample space before normalization,

  θ_s = (α + ∑_{i=1}^{n} 1(s_i = s)) / (α|S| + n)

- (Similar to a Dirichlet prior in a Bayesian interpretation.)

For naive Bayes:

F = α + F
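A minimal sketch of this smoothed estimate; the function name and toy counts are illustrative.

```python
import numpy as np

def smooth_multinomial(counts, alpha=1.0):
    """theta_s = (alpha + count(s)) / (alpha * |S| + n)."""
    counts = np.asarray(counts, dtype=float)
    return (alpha + counts) / (alpha * len(counts) + counts.sum())

# For naive Bayes, add alpha to the count matrix F before normalizing:
# F = alpha + F, then p(x_f = 1 | y = delta(c)) = F[f, c] / F[:, c].sum().
print(smooth_multinomial([0, 3, 1], alpha=1.0))   # no zero probabilities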


Naive Bayes In Practice

- Very fast to train.
- Relatively interpretable.
- Performs quite well on small datasets
  (RT-S [movie reviews], CR [customer reviews], MPQA [opinion polarity], SUBJ [subjectivity]).
