
Page 1: Text classification - LiU

This work is licensed under a Creative Commons Attribution 4.0 International License.

Text classification

Marco Kuhlmann Department of Computer and Information Science

732A92/TDDE16 Text Mining (2020)

Page 2: Text classification - LiU

Text classification

• Text classification is the task of categorising text documents into predefined classes.

• The term ‘document’ is applied to everything from tweets and press releases to complete books.

Page 3: Text classification - LiU

Topic classification

UK: congestion, London, Parliament, Big Ben, Windsor, The Queen

China: Olympics, Beijing, tourism, Great Wall, Mao, Communist

Elections: recount, votes, seat, run-off, TV-ads, campaign

Sports: diamond, baseball, forward, soccer, team, captain

Adapted from Manning et al. (2008)

Page 4: Text classification - LiU

Forensic linguistics

‘I realized the faxed copy I just received was an outline of the manifesto, using much of the same wording, definitely the same topics and themes. … I invented [the language analysis] for this case and really, forensics linguistics took off after that.’

James Fitzgerald, profiler

Sources: Wikipedia, Newsweek

Page 5: Text classification - LiU

Sentiment analysis

The gorgeously elaborate continuation of “The Lord of the Rings” trilogy is so huge that a column of words cannot adequately describe co-writer/director Peter Jackson’s expanded vision of J.R.R. Tolkien’s Middle-earth.

… is a sour little movie at its core; an exploration of the emptiness that underlay the relentless gaiety of the 1920’s, as if to stop would hasten the economic and global political turmoil that was to come.

positive negative

Page 6: Text classification - LiU

The standard text classification pipeline

document collection → vectorizer (example: CountVectorizer) → matrix of document vectors → predictor (example: MultinomialNB) → vector of class labels
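A minimal sketch of this pipeline in scikit-learn; the toy documents and labels are placeholders, borrowed from the worked example that appears later in the deck:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy document collection with class labels (placeholder data).
train_docs = ["Chinese Beijing Chinese", "Chinese Chinese Shanghai",
              "Chinese Macao", "Tokyo Japan Chinese"]
train_labels = ["China", "China", "China", "Japan"]

vectorizer = CountVectorizer()                    # document collection -> matrix of count vectors
X_train = vectorizer.fit_transform(train_docs)

predictor = MultinomialNB()                       # count vectors + class labels -> trained classifier
predictor.fit(X_train, train_labels)

X_test = vectorizer.transform(["Chinese Chinese Tokyo"])   # reuse the same vocabulary
print(predictor.predict(X_test))                  # -> an array with one predicted class label
```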

Page 7: Text classification - LiU

The standard text classification pipeline

document collection → vectorizer (CountVectorizer) → matrix of document vectors → predictor (MultinomialNB) → vector of class labels

Page 8: Text classification - LiU

Reminder: Documents as tf–idf vectors

            Scandal in Bohemia   Final problem   Empty house   Norwood builder   Dancing men   Retired colourman
Adair       0.0000               0.0000          0.0692        0.0000            0.0000        0.0000
Adler       0.0531               0.0000          0.0000        0.0000            0.0000        0.0000
Lestrade    0.0000               0.0000          0.0291        0.1424            0.0000        0.0000
Moriarty    0.0000               0.0845          0.0528        0.0034            0.0000        0.0000

Page 9: Text classification - LiU

Documents as count vectors – the bag of words

the gorgeously elaborate continuation of the lord of the rings trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson’s expanded vision of j.r.r. tolkien’s middle-earth.

… is a sour little movie at its core an exploration of the emptiness that underlay the relentless gaiety of the 1920’s as if to stop would hasten the economic and global political turmoil that was to come

positive negative

Page 10: Text classification - LiU

Documents as count vectors – the bag of words

a adequately cannot co-writer/director column continuation describe elaborate expanded gorgeously huge is j.r.r. jackson lord middle-earth of of of of peter rings so that the the the tolkien trilogy vision words

… 1920’s a an and as at come core economic emptiness exploration gaiety global hasten if is its little movie of of political relentless sour stop that that the the the the to to turmoil underlay was would

positive negative

Page 11: Text classification - LiU

Documents as count vectors – the bag of words

positive review:

Word      Count
of        4
the       3
words     1
vision    1
trilogy   1
…         …

negative review:

Word      Count
the       4
to        2
that      2
of        2
would     1
…         …

Page 12: Text classification - LiU

The standard text classification pipeline

document collection → vectorizer (CountVectorizer) → matrix of document vectors → predictor (MultinomialNB) → vector of class labels

Page 13: Text classification - LiU

Text classification as supervised machine learning

[Figure: nine documents from the topic-classification example, each labelled with one of the classes A, B, and C.]

Page 14: Text classification - LiU

Text classification as supervised machine learning

[Figure: the same labelled documents, split into a training set (used for learning) and a test set (used for evaluation).]

Page 15: Text classification - LiU

Training and testing

• Training

When we train a classifier, we present it with a document 𝑥 and its gold-standard class 𝑦 and apply some learning algorithm (maximise the likelihood / minimise the cross-entropy of the training data).

• Testing

When we evaluate a classifier, we present it with 𝑥 and compare the predicted class for this input with the gold-standard class 𝑦.

Page 16: Text classification - LiU

Two challenges in text classification

• Standard document representations such as the bag of words easily yield tens of thousands of features (computational challenge, data sparsity).

• Many document collections are highly imbalanced with respect to the represented classes (frequency bias, problems for evaluation).

Page 17: Text classification - LiU

This lecture

• Introduction to text classification

• Evaluation of text classifiers

• The Naive Bayes classifier

• The Logistic Regression classifier

• Support Vector Machines

Page 18: Text classification - LiU

Evaluation of text classifiers

Page 19: Text classification - LiU

Evaluation of text classifiers

• We require a test set consisting of a number of documents, each of which has been tagged with its correct class (typically part of a larger gold-standard data set).

• To evaluate a classifier, we apply it to the test set and compare the predicted classes with the gold-standard classes.

• We do so in order to estimate how well the classifier will perform on new, previously unseen documents (this assumes that all samples are drawn from the same distribution).

Page 20: Text classification - LiU

Evaluation of text classifiers

[Figure: three test documents with gold-standard classes A, B, C and predicted classes A, B, A.]

Page 21: Text classification - LiU

Accuracy

The accuracy of a classifier is the proportion of documents for which the classifier predicts the gold-standard class:

accuracy = number of correctly classified documents / number of all documents

Page 22: Text classification - LiU

Accuracy

Document                    Gold-standard class   Predicted class
Chinese Beijing Chinese     China                 China
Chinese Chinese Shanghai    China                 China
Chinese Macao               China                 China
Tokyo Japan Chinese         Japan                 China

accuracy for this example: 3/4 = 75%
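The same computation with scikit-learn, using the two columns of the table above as label vectors:

```python
from sklearn.metrics import accuracy_score

gold      = ["China", "China", "China", "Japan"]   # gold-standard classes
predicted = ["China", "China", "China", "China"]   # predicted classes

print(accuracy_score(gold, predicted))             # 0.75, i.e. 3 out of 4 correct
```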

Page 23: Text classification - LiU

Accuracy and imbalanced data sets

[Figure: two class distributions, one balanced (50% / 50%) and one imbalanced (80% / 20%).]

Is 80% accuracy good or bad?

Page 24: Text classification - LiU

The role of baselines

• Evaluation measures are not absolute measures of performance. Whether ‘80% accuracy’ is good or not depends on the task at hand.

• Instead, we should ask for a classifier’s performance relative to other classifiers or other points of comparison (‘Logistic Regression has a higher accuracy than Naive Bayes.’).

• When other classifiers are not available, a simple baseline is to always predict the most frequent class in the training data.

Page 25: Text classification - LiU

Grading criteria

When grading your report, I will assess it with respect to the criteria defined below. For each criterion, I will assign either the score 0 (does not meet expectations) or a score between 1 (meets expectations) and 9 (exceeds expectations). To get a passing grade, your score for each criterion must be at least 1. Your grade is calculated from two component scores as specified in the following table. Together with your grade, you will receive a short written motivation based on the descriptors for each criterion.

Component 1 | Component 2 | Grade 732A92 | Grade TDDE16

Component 1

This component is the score for a single criterion.

Soundness and correctness: Is the technical approach sound and well-chosen? Are the claims made in the report supported by proper experiments, and are the results of these experiments correctly interpreted?

• Troublesome. There are some ideas worth salvaging here, but the work should really have been done or evaluated differently.

• Fairly reasonable work. The approach is not bad, the evaluation is appropriate, and at least the main claims are probably correct.

• Generally solid work, although there are some aspects of the approach or the evaluation that I am not sure about.

• The approach is fully apt and all claims are convincingly supported. The report discusses the limitations of the work.

Page 26: Text classification - LiU

The role of baselines

• Evaluation measures are not absolute measures of performance. Whether ‘80% accuracy’ is good or not depends on the task at hand.

• Instead, we should ask for a classifier’s performance relative to other classifiers or other points of comparison (‘Logistic Regression has a higher accuracy than Naive Bayes.’).

• When other classifiers are not available, a simple baseline is to always predict the most frequent class in the training data.

Page 27: Text classification - LiU

Confusion matrix

                            classifier ‘positive’    classifier ‘negative’
gold standard ‘positive’    true positives           false negatives
gold standard ‘negative’    false positives          true negatives

Page 28: Text classification - LiU

Accuracy

                            classifier ‘positive’    classifier ‘negative’
gold standard ‘positive’    true positives           false negatives
gold standard ‘negative’    false positives          true negatives

Page 29: Text classification - LiU

Precision and recall

• Precision and recall ‘zoom in’ on how good a system is at identifying documents of a specific class 𝑐.

• Precision is the proportion of correctly classified documents among all documents for which the system predicts class 𝑐. When the system predicts class 𝑐, how often is it correct?

• Recall is the proportion of correctly classified documents among all documents with gold-standard class 𝑐. When the document has class 𝑐, how often does the system predict it?

Page 30: Text classification - LiU

Precision with respect to the positive class

                            classifier ‘positive’    classifier ‘negative’
gold standard ‘positive’    true positives           false negatives
gold standard ‘negative’    false positives          true negatives

Page 31: Text classification - LiU

Recall with respect to the positive class

                            classifier ‘positive’    classifier ‘negative’
gold standard ‘positive’    true positives           false negatives
gold standard ‘negative’    false positives          true negatives

Page 32: Text classification - LiU

Precision and recall with respect to the positive class

precision = # true positives / (# true positives + # false positives)

recall = # true positives / (# true positives + # false negatives)

Page 33: Text classification - LiU

F1-measure

A good classifier should balance between precision and recall. The F1-measure is the harmonic mean of the two values:

F1 = (2 · precision · recall) / (precision + recall)
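A small sketch that computes all three measures from the cells of a confusion matrix. The example counts are for class B in the three-class matrix shown a few slides below, read under the assumption that rows are gold-standard classes and columns are predicted classes (the slides do not state the orientation):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Class B: 11 true positives, 6 + 7 false positives, 5 + 2 false negatives (assumed orientation).
print(precision_recall_f1(tp=11, fp=13, fn=7))
```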

Page 34: Text classification - LiU

F1-measure

[Figure: the F1-measure as a function of precision and recall.]

Page 35: Text classification - LiU

Accuracy with three classes

       A     B     C
A     58     6     1
B      5    11     2
C      0     7    43

Page 36: Text classification - LiU

Precision with respect to class B

       A     B     C
A     58     6     1
B      5    11     2
C      0     7    43

Page 37: Text classification - LiU

Recall with respect to class B

       A     B     C
A     58     6     1
B      5    11     2
C      0     7    43

Page 38: Text classification - LiU

Classification report in scikit-learn

              precision    recall  f1-score   support

           C       0.63      0.04      0.07       671
          KD       0.70      0.02      0.03       821
           L       0.92      0.02      0.04       560
           M       0.36      0.68      0.47      1644
          MP       0.36      0.25      0.29       809
           S       0.46      0.84      0.59      2773
          SD       0.57      0.12      0.20      1060
           V       0.59      0.15      0.24       950

    accuracy                           0.43      9288
   macro avg       0.57      0.26      0.24      9288
weighted avg       0.52      0.43      0.34      9288
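Such a report is produced with scikit-learn's classification_report; a sketch with small placeholder label vectors (the report above comes from a much larger data set):

```python
from sklearn.metrics import classification_report

# y_test: gold-standard labels, y_pred: labels predicted by a classifier (placeholders).
y_test = ["S", "M", "S", "SD", "M", "S"]
y_pred = ["S", "S", "S", "M", "M", "S"]

print(classification_report(y_test, y_pred, zero_division=0))
```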

Page 39: Text classification - LiU

This lecture

• Introduction to text classification

• Evaluation of text classifiers

• The Naive Bayes classifier

• The Logistic Regression classifier

• Support Vector Machines

Page 40: Text classification - LiU

The Naive Bayes classifier

Page 41: Text classification - LiU

Naive Bayes

• The Naive Bayes classifier is a simple but surprisingly effective probabilistic text classifier that builds on Bayes’ rule.

• It is called ‘naive’ because it makes strong (unrealistic) independence assumptions about probabilities.

Page 42: Text classification - LiU

Naive Bayes decision rule, informally

[Figure: a document containing the words ‘elaborate’ and ‘gorgeously’ is to be classified as positive or negative.]

score(pos) = 𝑃(pos) · 𝑃(elaborate | pos) · 𝑃(gorgeously | pos)

score(neg) = 𝑃(neg) · 𝑃(elaborate | neg) · 𝑃(gorgeously | neg)

70% 30%

Page 43: Text classification - LiU

Naive Bayes decision rule, informally

[Figure: a document containing the words ‘elaborate’ and ‘gorgeously’ is to be classified as positive or negative.]

score(pos) = 𝑃(pos) · 𝑃(elaborate | pos) · 𝑃(gorgeously | pos)

score(neg) = 𝑃(neg) · 𝑃(elaborate | neg) · 𝑃(gorgeously | neg)

70% 30%

Page 44: Text classification - LiU

The role of Bayes’ rule

• For classification, we would like to know 𝑃(class | document) (compare: 𝑃(disease | symptom)).

• But a Naive Bayes classifier contains 𝑃(document | class) (compare: 𝑃(symptom | disease)).

• The classifier uses Bayes’ rule to convert between the two: 𝑃(class | document) ∝ 𝑃(class) 𝑃(document | class).

Page 45: Text classification - LiU

Formal definition of a Naive Bayes classifier

𝐶 a set of possible classes

𝑉 a set of possible words; the model’s vocabulary

𝑃(𝑐) probabilities that specify how likely it is for a document to belong to class 𝑐 (one probability for each class)

𝑃(𝑤 |𝑐) probabilities that specify how likely it is for a document to contain the word 𝑤, given that the document belongs to class 𝑐 (one probability for each class–word pair)

Page 46: Text classification - LiU

Naive Bayes decision rule

choose that class 𝑐 which maximises the term to the right of the ‘arg max’

predicted class for the document

count of the word 𝑤 in the document

ĉ = argmax_{𝑐 ∈ 𝐶} 𝑃(𝑐) · ∏_{𝑤 ∈ 𝑉} 𝑃(𝑤 | 𝑐)^{#(𝑤)}

Page 47: Text classification - LiU

What does the Naive Bayes classifier do if all words in the document to be classified are out-of-vocabulary?

Page 48: Text classification - LiU

Log probabilities

• In order to avoid underflow, implementations use the logarithms of probabilities instead of the probabilities themselves. 𝑃(𝑤 |𝑐) becomes log 𝑃(𝑤 |𝑐)

• Note that in this case, instead of multiplying probabilities, we have to add their logarithms.
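A sketch of the decision rule computed with sums of log probabilities; log_prior and log_likelihood are assumed to be dictionaries of already-estimated log probabilities, and out-of-vocabulary words are simply skipped here:

```python
from collections import Counter

def predict(document_tokens, classes, log_prior, log_likelihood):
    """Naive Bayes decision rule using sums of log probabilities.

    log_prior[c] = log P(c); log_likelihood[c][w] = log P(w | c).
    """
    counts = Counter(document_tokens)
    scores = {}
    for c in classes:
        score = log_prior[c]
        for word, count in counts.items():
            if word in log_likelihood[c]:      # ignore out-of-vocabulary words
                score += count * log_likelihood[c][word]
        scores[c] = score
    return max(scores, key=scores.get)         # class with the highest score
```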

Page 49: Text classification - LiU

Log probabilities

[Figure: plot of the logarithm function.]

Page 50: Text classification - LiU

Learning a Naive Bayes classifier

[Figure: nine documents from the topic-classification example, each labelled with one of the classes A, B, and C.]

Page 51: Text classification - LiU

Learning a Naive Bayes classifier

[Figure: the same labelled documents, split into a training set (used for learning) and a test set (used for evaluation).]

Page 52: Text classification - LiU

Learning a Naive Bayes classifier

[Figure: from the training set, the classifier estimates the class probabilities 𝑃(𝑐) and the word probabilities 𝑃(𝑤 | 𝑐).]

Page 53: Text classification - LiU

Maximum Likelihood Estimation (MLE)

• One standard technique for estimating the probabilities of a model is Maximum Likelihood Estimation (MLE). Find probabilities that maximise the probability of the training data.

• Depending on the type of model under consideration, MLE can be computationally challenging.

• For the special case of the Naive Bayes classifier, MLE amounts to equating probabilities with relative frequencies.

Page 54: Text classification - LiU

MLE for the Naive Bayes classifier

• To estimate the class probabilities 𝑃(𝑐):

Compute the percentage of documents with class 𝑐 among all documents in the training set.

• To estimate the word probabilities 𝑃(𝑤 |𝑐):

Compute the percentage of occurrences of the word 𝑤 among all word occurrences in documents with class 𝑐.

Page 55: Text classification - LiU

MLE for the Naive Bayes classifier

#(𝑐) number of documents with gold-standard class 𝑐

#(𝑤, 𝑐) number of occurrences of 𝑤 in documents with class 𝑐

𝑃(𝑐) = #(𝑐) / Σ_{𝑐′ ∈ 𝐶} #(𝑐′)        𝑃(𝑤 | 𝑐) = #(𝑤, 𝑐) / Σ_{𝑤′ ∈ 𝑉} #(𝑤′, 𝑐)
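A sketch of these two estimates computed from a list of (tokens, class) training pairs; plain relative frequencies, no smoothing yet:

```python
from collections import Counter, defaultdict

def mle_estimates(training_data):
    """training_data: list of (tokens, class) pairs; returns P(c) and P(w | c)."""
    class_counts = Counter()                    # #(c)
    word_counts = defaultdict(Counter)          # #(w, c)
    for tokens, c in training_data:
        class_counts[c] += 1
        word_counts[c].update(tokens)

    n_docs = sum(class_counts.values())
    priors = {c: n / n_docs for c, n in class_counts.items()}

    likelihoods = {}
    for c, counts in word_counts.items():
        total = sum(counts.values())            # total word occurrences in class c
        likelihoods[c] = {w: n / total for w, n in counts.items()}
    return priors, likelihoods
```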

Page 56: Text classification - LiU

MLE of word probabilities

31 tokens (positive review), 37 tokens (negative review)

The gorgeously elaborate continuation of “The Lord of the Rings” trilogy is so huge that a column of words cannot adequately describe co-writer/director Peter Jackson’s expanded vision of J.R.R. Tolkien’s Middle-earth.

… is a sour little movie at its core; an exploration of the emptiness that underlay the relentless gaiety of the 1920’s, as if to stop would hasten the economic and global political turmoil that was to come.

positive negative

Page 57: Text classification - LiU

MLE of word probabilities

positive review:

Word      Count
of        4
The       2
words     1
vision    1
trilogy   1
…         …

negative review:

Word      Count
the       4
to        2
that      2
of        2
would     1
…         …

Page 58: Text classification - LiU

MLE of word probabilities

positive:

Probability          Estimated value
𝑃(of | pos)          4/31
𝑃(The | pos)         2/31
𝑃(words | pos)       1/31
𝑃(vision | pos)      1/31
𝑃(trilogy | pos)     1/31
…                    …

negative:

Probability          Estimated value
𝑃(the | neg)         4/37
𝑃(to | neg)          2/37
𝑃(that | neg)        2/37
𝑃(of | neg)          2/37
𝑃(would | neg)       1/37
…                    …

Page 59: Text classification - LiU

Smoothing

• If we equate word probabilities with relative frequencies, some probabilities may be zero. For example, ‘gorgeously’ may only occur in positive documents.

• This is a problem because we multiply probabilities in the decision rule for Naive Bayes. Slogan: Zero probabilities destroy information.

• To avoid zero probabilities, we can use techniques for smoothing the probability distribution.

Page 60: Text classification - LiU

Additive smoothing

• In additive smoothing, we add a constant value 𝛼 ≥ 0 to all counts before computing relative frequencies. For 𝛼 = 1, this technique is also known as Laplace smoothing.

• Effectively, we hallucinate 𝛼 extra occurrences of each word, including words that never occurred.

• The smoothing constant 𝛼 is a hyperparameter of the model that needs to be tuned on validation data.

Page 61: Text classification - LiU

Hyperparameters and cross-validation

• A hyperparameter is a parameter of a machine learning model that is not set by the learning algorithm itself.

• To tune hyperparameters, we need a separate validation set, or need to use cross-validation:

• Split the data into 𝑘 parts.

• In each fold, use 𝑘 − 1 parts for training, 1 part for validation.

• Take the mean performance over the 𝑘 folds.
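A sketch of tuning the smoothing constant of MultinomialNB with five-fold cross-validation in scikit-learn; the documents and labels below are toy placeholders:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Toy placeholder data: five positive and five negative documents.
docs = ["good movie", "great film", "nice plot", "fine acting", "lovely story",
        "bad movie", "awful film", "poor plot", "weak acting", "dull story"]
labels = ["pos"] * 5 + ["neg"] * 5
X = CountVectorizer().fit_transform(docs)        # matrix of document vectors

search = GridSearchCV(MultinomialNB(),
                      param_grid={"alpha": [0.01, 0.1, 0.5, 1.0, 2.0]},
                      cv=5, scoring="accuracy")  # five-fold cross-validation
search.fit(X, labels)
print(search.best_params_)                       # the best-performing value of alpha
```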

Page 62: Text classification - LiU

Five-fold cross validation

Fold 1

Fold 2

Fold 3

Fold 4

Fold 5

80% training, 20% validation

Page 63: Text classification - LiU

Additive smoothing

• In additive smoothing, we add a constant value 𝛼 ≥ 0 to all counts before computing relative frequencies. For 𝛼 = 1, this technique is also known as Laplace smoothing.

• Effectively, we hallucinate 𝛼 extra occurrences of each word, including words that never occurred.

• The smoothing constant 𝛼 is a hyperparameter of the model that needs to be tuned on validation data.

Page 64: Text classification - LiU

MLE with additive smoothing

#(𝑐) number of documents with gold-standard class 𝑐

#(𝑤, 𝑐) number of occurrences of 𝑤 in documents with class 𝑐

𝑃(𝑐) = #(𝑐) / Σ_{𝑐′ ∈ 𝐶} #(𝑐′)   (no smoothing here!)

𝑃(𝑤 | 𝑐) = (#(𝑤, 𝑐) + 𝛼) / Σ_{𝑤′ ∈ 𝑉} (#(𝑤′, 𝑐) + 𝛼)

Page 65: Text classification - LiU

MLE with additive smoothing

#(𝑐) number of documents with gold-standard class 𝑐

#(𝑤, 𝑐) number of occurrences of 𝑤 in documents with class 𝑐

𝑃(𝑐) = #(𝑐) / Σ_{𝑐′ ∈ 𝐶} #(𝑐′)

𝑃(𝑤 | 𝑐) = (#(𝑤, 𝑐) + 𝛼) / (Σ_{𝑤′ ∈ 𝑉} #(𝑤′, 𝑐) + 𝛼|𝑉|)

Page 66: Text classification - LiU

Estimating word probabilities

31 tokens (positive review), 37 tokens (negative review)

The gorgeously elaborate continuation of “The Lord of the Rings” trilogy is so huge that a column of words cannot adequately describe co-writer/director Peter Jackson’s expanded vision of J.R.R. Tolkien’s Middle-earth.

… is a sour little movie at its core; an exploration of the emptiness that underlay the relentless gaiety of the 1920’s, as if to stop would hasten the economic and global political turmoil that was to come.

positive negative

Page 67: Text classification - LiU

Vocabulary

1920’s J.R.R. Jackson’s Lord Middle-earth Peter Rings The Tolkien’s a adequately an and as at cannot co-writer/director column come continuation core describe economic elaborate emptiness expanded exploration gaiety global gorgeously hasten huge if is its little movie of political relentless so sour stop that the to trilogy turmoil underlay vision was words would

53 unique words

Page 68: Text classification - LiU

MLE with add-one smoothing

positive:

Word       Modified count
of         4 + 1
The        2 + 1
words      1 + 1
vision     1 + 1
trilogy    1 + 1
…          …

negative:

Word       Modified count
of         2 + 1
The        0 + 1
words      0 + 1
vision     0 + 1
trilogy    0 + 1
…          …

Page 69: Text classification - LiU

MLE with add-one smoothing

positive:

Probability          Estimated value
𝑃(of | pos)          (4 + 1)/(31 + 53)
𝑃(The | pos)         (2 + 1)/(31 + 53)
𝑃(words | pos)       (1 + 1)/(31 + 53)
𝑃(vision | pos)      (1 + 1)/(31 + 53)
𝑃(trilogy | pos)     (1 + 1)/(31 + 53)
…                    …

negative:

Probability          Estimated value
𝑃(of | neg)          (2 + 1)/(37 + 53)
𝑃(The | neg)         (0 + 1)/(37 + 53)
𝑃(words | neg)       (0 + 1)/(37 + 53)
𝑃(vision | neg)      (0 + 1)/(37 + 53)
𝑃(trilogy | neg)     (0 + 1)/(37 + 53)
…                    …

Page 70: Text classification - LiU

Additive smoothing is assuming a uniform prior

Let 𝑁_𝑐 be the total number of word occurrences in class 𝑐. Then

𝑃(𝑤 | 𝑐) = (#(𝑤, 𝑐) + 𝛼) / (𝑁_𝑐 + 𝛼|𝑉|)

can be written as a linear interpolation of the maximum-likelihood estimate and the uniform distribution over word types:

𝑃(𝑤 | 𝑐) = 𝜆 · #(𝑤, 𝑐)/𝑁_𝑐 + (1 − 𝜆) · 1/|𝑉|,   where   𝜆 = 𝑁_𝑐 / (𝑁_𝑐 + 𝛼|𝑉|)

Page 71: Text classification - LiU

This lecture

• Introduction to text classification

• Evaluation of text classifiers

• The Naive Bayes classifier

• The Logistic Regression classifier

• Support Vector Machines

Page 72: Text classification - LiU

The Logistic Regression classifier

Page 73: Text classification - LiU

The multi-class linear model

input vector (1-by-F)

weight matrix (F-by-K)

output vector (1-by-K)

bias vector (1-by-K)

F = number of features (input variables), K = number of classes (output variables)

𝒚 = 𝒙𝑾 + 𝒃

Page 74: Text classification - LiU

The logistic model

• The logistic model extends the linear model by adding a pointwise logistic function 𝑓:

• The term logistic regression refers to the procedure of learning the parameters of the logistic model. That model is then used for classification. Confusing terminology!

𝒚 = 𝑓(𝒛), where 𝒛 = 𝒙𝑾 + 𝒃

Page 75: Text classification - LiU

The standard logistic function for binary classification

[Figure: the standard logistic function.]

ŷ = 1 / (1 + exp(−𝑧))

Page 76: Text classification - LiU

The softmax function for k-ary classification

• The softmax function converts a vector of activations into a vector of probabilities (a finite probability distribution).

• The softmax function is the natural generalisation of the standard logistic function to 𝑘-ary classification problems.

softmax(𝒛)[𝑖] = exp(𝒛[𝑖]) / Σ_𝑘 exp(𝒛[𝑘])
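A numerically stable NumPy version; subtracting the maximum activation does not change the result (softmax is invariant to shifts) but avoids overflow in exp:

```python
import numpy as np

def softmax(z):
    """Map a vector of activations z to a finite probability distribution."""
    z = np.asarray(z, dtype=float)
    z = z - z.max()                  # shift for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

print(softmax([1.0, 2.0, 3.0]))      # approx. [0.09, 0.24, 0.67], sums to 1
```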

Page 77: Text classification - LiU

Training a logistic regression classifier

• We present the model with training samples of the form (𝒙, 𝑦) where 𝒙 is a feature vector and 𝑦 is the gold-standard class.

• The output of the model is a vector of conditional probabilities 𝑃(𝑘 | 𝒙) where 𝑘 ranges over the possible classes.

• We want to train the model so as to maximise the likelihood of the training data under this probability distribution. Same training principle as for Naive Bayes – but no closed-form solution.

Page 78: Text classification - LiU

Gradient descent

• Step 0: Start with an arbitrary value for the parameters 𝜽.

• Step 1: Compute the gradient of the loss function, ∇𝐿(𝜽).

• Step 2: Update the parameters 𝜽 as follows: 𝜽 ≔ 𝜽 − 𝜂 ∇𝐿(𝜽) The hyperparameter 𝜂 is the learning rate.

• Repeat step 1–2 until the error is sufficiently low.

‘Follow the gradient into the valley of error.’
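A sketch of these steps for a binary logistic regression model in NumPy, using the cross-entropy gradient presented on the following slides; the data below is a toy placeholder:

```python
import numpy as np

# Toy data: X is an n-by-F feature matrix, y a vector of binary labels (placeholders).
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])

w = np.zeros(X.shape[1])                          # step 0: arbitrary initial parameters
b = 0.0
eta = 0.1                                         # learning rate (hyperparameter)

for _ in range(1000):
    y_hat = 1.0 / (1.0 + np.exp(-(X @ w + b)))    # predictions of the logistic model
    grad_w = X.T @ (y_hat - y) / len(y)           # step 1: gradient of the cross-entropy loss
    grad_b = np.mean(y_hat - y)
    w -= eta * grad_w                             # step 2: parameter update
    b -= eta * grad_b
```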

Page 79: Text classification - LiU

Cross-entropy loss function

𝐿(𝜽) = −[𝑦 log ŷ + (1 − 𝑦) log(1 − ŷ)]

[Figure: the loss as a function of ŷ, for 𝑦 = 1 and 𝑦 = 0.]

Page 80: Text classification - LiU

Cross-entropy loss for logistic units

∂𝐿(𝜽)/∂𝑧 = ŷ − 𝑦

[Figure: the loss composed with the logistic function, as a function of 𝑧, for 𝑦 = 1 and 𝑦 = 0.]

Page 81: Text classification - LiU

Cross-entropy loss function for the softmax function

For a logistic model with parameters 𝜽 = (𝑾, 𝒃), the cross-entropy loss for a sample (𝒙, 𝑦) is defined as

where we write [𝑘 = 𝑦] for the function that returns 1 if 𝑘 = 𝑦, and 0 otherwise (Iverson bracket).

𝐿(𝜽) = −Σ_𝑘 [𝑘 = 𝑦] log softmax(𝒙𝑾 + 𝒃)[𝑘]

Page 82: Text classification - LiU

Regularisation

• The term regularisation refers to all changes to a model that are intended to decrease generalisation error but not training error.

• A standard choice for logistic regression is L2-regularisation, where we penalise parameter vectors based on their lengths:

𝜽 ≔ 𝜽 − 𝜂 ∇[𝐿(𝜽) + 𝜆 ∥𝜽∥²]

where 𝜆 is the regularisation strength.
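In scikit-learn's LogisticRegression, L2-regularisation is the default; its strength is controlled by the parameter C, which is the inverse of the regularisation strength (smaller C means a stronger penalty):

```python
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(penalty="l2", C=0.1)   # C = 1 / regularisation strength
```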

Page 83: Text classification - LiU

Logistic Regression and Naive Bayes

• Learning a Naive Bayes classifier involves fitting a probability model that optimises the joint likelihood 𝑃(class, document); 𝑃(class | document) is obtained via Bayes’ rule.

• Logistic regression fits the same probability model so as to directly optimise the conditional 𝑃(class | document), which gives a lower asymptotic error.

• Naive Bayes and logistic regression form a generative–discriminative pair.

Page 84: Text classification - LiU

Support Vector Machines

Page 85: Text classification - LiU

Linear model

input vector (1-by-F)

weight vector (F-by-1)

predicted output

F = number of features (independent variables, covariates)

ŷ = 𝒙𝒘 + 𝑏   (𝑏 is the bias term)

Page 86: Text classification - LiU

Linearly separable problems

[Figure: a data set that is linearly separable and one that is not linearly separable.]

Page 87: Text classification - LiU

Linearly separable problems

• A classification problem is called linearly separable if there exists a linear model that can learn the problem with zero error.

• Any line in feature space that separates the classes from each other is called a decision boundary.

For simplicity, we restrict ourselves to binary classification problems here.

Page 88: Text classification - LiU

Different decision boundaries

[Figure: several possible decision boundaries for the same data; the margin of one boundary is indicated.]

Page 89: Text classification - LiU

Different decision boundaries

We would like to choose a decision boundary that maximises the margin to the closest training samples.

Page 90: Text classification - LiU

Computing the margin

• Mathematically, the decision boundary is a plane that is defined by the parameters (weights and bias) of the linear classifier. More specifically, this plane is the set of all points 𝒙 such that 𝒙𝒘 + 𝑏 = 0.

• The margin between a data point 𝒙𝑖 and the decision boundary is the distance between 𝒙𝑖 and the plane along the direction of 𝒘.

• Subject to some convenient normalisations, for the closest data points this yields a margin of 1/∥𝒘∥.

Page 91: Text classification - LiU

Support vectors

[Figure: the support vectors are the training samples closest to the decision boundary.]

Page 92: Text classification - LiU

Maximising the margin

• If the training data is linearly separable, then the optimal margin can be found using quadratic programming (more recent approaches: sub-gradient descent, coordinate descent).

• To generalise this to training sets which are not linearly separable, one can use gradient-based optimisation.

Page 94: Text classification - LiU

Practical concerns

• More than two classes can be handled according to a one-versus-one or one-versus-all scheme.

• For a general SVM, training time scales quadratically with the number of samples, which quickly becomes impractical.

• An alternative is to use a special-purpose implementation of a linear SVM, which can have significantly shorter training times.
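A sketch of such a special-purpose linear SVM in scikit-learn (LinearSVC), combined with a tf–idf vectorizer; the two training documents are toy placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_docs = ["a gorgeous and moving film", "a dull and predictable plot"]
train_labels = ["positive", "negative"]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())   # LinearSVC handles more than two classes one-versus-all
clf.fit(train_docs, train_labels)
print(clf.predict(["a moving plot"]))
```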

Page 95: Text classification - LiU

This lecture

• Introduction to text classification

• Evaluation of text classifiers

• The Naive Bayes classifier

• The Logistic Regression classifier

• Support Vector Machines

Page 96: Text classification - LiU

Project idea

• There are many text classification methods other than Naive Bayes, Logistic Regression, and Support Vector Machines: decision trees, many different types of neural networks, …

• Pick a method that you find interesting and compare its performance to that of other methods on a standard data set.

• In the report, make it clear why you chose this method, and that you have understood how it works.

Page 97: Text classification - LiU

Beyond the bag-of-words assumption

• The linear sequence of the words in a text carries meaning. The brown dog on the mat saw the striped cat through the window.

The brown cat saw the striped dog through the window on the mat.

• We can try to capture at least parts of this sequence by representing texts as 𝑛-grams, contiguous sequences of 𝑛 words, for example the unigram ‘not’ versus the bigram ‘not bad’ (see the code sketch at the end of this section).

• Recurrent neural networks have been designed to explicitly capture long-range dependencies between words.
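One way to capture parts of the word order within the standard pipeline is to let the vectorizer extract n-grams; a sketch with scikit-learn's CountVectorizer and the example sentence from this slide:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Extract unigrams and bigrams instead of single words only.
vectorizer = CountVectorizer(ngram_range=(1, 2))
vectorizer.fit(["the brown dog on the mat saw the striped cat through the window"])
print(vectorizer.get_feature_names_out()[:5])   # e.g. 'brown', 'brown dog', 'cat', 'cat through', 'dog'
```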