This work is licensed under a Creative Commons Attribution 4.0 International License.
Text classification
Marco Kuhlmann Department of Computer and Information Science
732A92/TDDE16 Text Mining (2020)
Text classification
• Text classification is the task of categorising text documents into predefined classes.
• The term ‘document’ covers everything from tweets and press releases to complete books.
Topic classification

UK: congestion, London, Parliament, Big Ben, Windsor, The Queen
China: Olympics, Beijing, tourism, Great Wall, Mao, Communist
Elections: recount, votes, seat, run-off, TV-ads, campaign
Sports: diamond, baseball, forward, soccer, team, captain

Adapted from Manning et al. (2008)
Forensic linguistics
‘I realized the faxed copy I just received was an outline of the manifesto, using much of the same wording, definitely the same topics and themes. … I invented [the language analysis] for this case and really, forensics linguistics took off after that.’
James Fitzgerald, profiler
Sources: Wikipedia, Newsweek
Sentiment analysis
The gorgeously elaborate continuation of “The Lord of the Rings” trilogy is so huge that a column of words cannot adequately describe co-writer/director Peter Jackson’s expanded vision of J.R.R. Tolkien’s Middle-earth.
… is a sour little movie at its core; an exploration of the emptiness that underlay the relentless gaiety of the 1920’s, as if to stop would hasten the economic and global political turmoil that was to come.
positive negative
The standard text classification pipeline

document collection → vectorizer (example: CountVectorizer) → matrix of document vectors → predictor (example: MultinomialNB) → vector of class labels
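A minimal sketch of this pipeline in scikit-learn, assuming two hypothetical lists train_docs (document strings) and train_labels (class labels):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),  # document collection -> matrix of document vectors
    ('predictor', MultinomialNB()),     # document vectors -> vector of class labels
])

pipeline.fit(train_docs, train_labels)
predicted = pipeline.predict(['a new, unseen document'])
```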
Reminder: Documents as tf–idf vectors

Word       Scandal in Bohemia   Final problem   Empty house   Norwood builder   Dancing men   Retired colourman
Adair      0.0000               0.0000          0.0692        0.0000            0.0000        0.0000
Adler      0.0531               0.0000          0.0000        0.0000            0.0000        0.0000
Lestrade   0.0000               0.0000          0.0291        0.1424            0.0000        0.0000
Moriarty   0.0000               0.0845          0.0528        0.0034            0.0000        0.0000
Documents as count vectors – the bag of words
the gorgeously elaborate continuation of the lord of the rings trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson’s expanded vision of j.r.r. tolkien’s middle-earth.
… is a sour little movie at its core an exploration of the emptiness that underlay the relentless gaiety of the 1920’s as if to stop would hasten the economic and global political turmoil that was to come
positive negative
Documents as count vectors – the bag of words
a adequately cannot co-writer/director column continuation describe elaborate expanded gorgeously huge is j.r.r. jackson lord middle-earth of of of of peter rings so that the the the tolkien trilogy vision words
… 1920’s a an and as at come core economic emptiness exploration gaiety global hasten if is its little movie of of political relentless sour stop that that the the the the to to turmoil underlay was would
positive negative
Documents as count vectors – the bag of words
positive:
Word      Count
of        4
the       3
words     1
vision    1
trilogy   1
…

negative:
Word      Count
the       4
to        2
that      2
of        2
would     1
…
Text classification as supervised machine learning
[Figure: a collection of documents, each represented by its characteristic words and labelled with one of the classes A, B, C]
Text classification as supervised machine learning
[Figure: the labelled documents are split into a training set, used for learning, and a test set, used for evaluation]
Training and testing
• Training
When we train a classifier, we present it with a document 𝑥 and its gold-standard class 𝑦 and apply some learning algorithm (maximise the likelihood / minimise the cross-entropy of the training data).
• Testing
When we evaluate a classifier, we present it with 𝑥 and compare the predicted class for this input with the gold-standard class 𝑦.
Two challenges in text classification
• Standard document representations such as the bag of words easily yield tens of thousands of features (computational challenge, data sparsity).
• Many document collections are highly imbalanced with respect to the represented classes (frequency bias, problems for evaluation).
This lecture
• Introduction to text classification
• Evaluation of text classifiers
• The Naive Bayes classifier
• The Logistic Regression classifier
• Support Vector Machines
Evaluation of text classifiers
• We require a test set consisting of a number of documents, each of which has been tagged with its correct class (typically part of a larger gold-standard data set).
• To evaluate a classifier, we apply it to the test set and compare the predicted classes with the gold-standard classes.
• We do so in order to estimate how well the classifier will perform on new, previously unseen documents (assuming that all samples are drawn from the same distribution).
Evaluation of text classifiers
[Figure: three test documents with gold-standard classes A, B, C; the classifier predicts the classes A, B, A]
Accuracy
The accuracy of a classifier is the proportion of documents for which the classifier predicts the gold-standard class:
accuracy = (# correctly classified documents) / (# all documents)
Accuracy
Document                    Gold-standard class   Predicted class
Chinese Beijing Chinese     China                 China
Chinese Chinese Shanghai    China                 China
Chinese Macao               China                 China
Tokyo Japan Chinese         Japan                 China

accuracy for this example: 3/4 = 75%
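The same accuracy can be recomputed with scikit-learn; this sketch just reuses the gold-standard and predicted classes from the table above:

```python
from sklearn.metrics import accuracy_score

gold = ['China', 'China', 'China', 'Japan']
pred = ['China', 'China', 'China', 'China']
print(accuracy_score(gold, pred))  # 0.75
```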
Accuracy and imbalanced data sets
[Figure: two pie charts, a balanced data set (50% / 50%) and an imbalanced one (80% / 20%)]
Is 80% accuracy good or bad?
The role of baselines
• Evaluation measures are no absolute measures of performance. Whether ‘80% accuracy’ is good or not depends on the task at hand.
• Instead, we should ask for a classifier’s performance relative to other classifiers, or other points of comparison. ‘Logistic Regression has a higher accuracy than Naive Bayes.’
• When other classifiers are not available, a simple baseline is to always predict the most frequent class in the training data.
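A most-frequent-class baseline can be sketched with scikit-learn's DummyClassifier; X_train, y_train, X_test, y_test are assumed to exist:

```python
from sklearn.dummy import DummyClassifier

baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train, y_train)         # only inspects the class distribution of y_train
print(baseline.score(X_test, y_test))  # accuracy of always predicting the majority class
```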
Grading criteria

Assessment criteria

When grading your report, I will assess it with respect to the criteria defined below. For each criterion, I will assign either the score 0 (does not meet expectations) or a score between 1 (meets expectations) and 9 (exceeds expectations). To get a passing grade, your score for each criterion must be at least 1. Your grade is calculated from two component scores as specified in the following table. Together with your grade, you will receive a short written motivation based on the descriptors for each criterion.

[Table mapping the two component scores to the final grade; the component scores did not survive extraction. The letter grades for 732A92 in the nine rows read E, D, C, D, C, B, C, B, A; the corresponding numeric grades for TDDE16 are lost.]

Component 1

This component is the score for a single criterion:

Soundness and correctness. Is the technical approach sound and well-chosen? Are the claims made in the report supported by proper experiments, and are the results of these experiments correctly interpreted?

• Troublesome: there are some ideas worth salvaging here, but the work should really have been done or evaluated differently.
• Fairly reasonable work: the approach is not bad, the evaluation is appropriate, and at least the main claims are probably correct.
• Generally solid work, although there are some aspects of the approach or the evaluation that I am not sure about.
• The approach is fully apt and all claims are convincingly supported; the report discusses the limitations of the work.
Confusion matrix
                            classifier ‘positive’   classifier ‘negative’
gold standard ‘positive’    true positives          false negatives
gold standard ‘negative’    false positives         true negatives
Accuracy

In terms of the confusion matrix, accuracy counts the cells on the diagonal: the true positives and the true negatives.
Precision and recall
• Precision and recall ‘zoom in’ on how good a system is at identifying documents of a specific class 𝑐.
• Precision is the proportion of correctly classified documents among all documents for which the system predicts class 𝑐. When the system predicts class 𝑐, how often is it correct?
• Recall is the proportion of correctly classified documents among all documents with gold-standard class 𝑐. When the document has class 𝑐, how often does the system predict it?
Precision with respect to the positive class

In the confusion matrix, precision considers the column of documents that the classifier labels ‘positive’: the true positives and the false positives.

Recall with respect to the positive class

In the confusion matrix, recall considers the row of documents whose gold-standard class is ‘positive’: the true positives and the false negatives.
Precision and recall with respect to the positive class
precision = # true positives / (# true positives + # false positives)

recall = # true positives / (# true positives + # false negatives)
F1-measure
A good classifier should balance between precision and recall. The F1-measure is the harmonic mean of the two values:
F1 = (2 · precision · recall) / (precision + recall)
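The three measures are easy to compute from raw confusion counts; a small sketch (the counts used in the example call are the class-B counts from the three-class confusion matrix shown further below):

```python
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(precision_recall_f1(tp=11, fp=13, fn=7))  # (0.458..., 0.611..., 0.523...)
```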
F1-measure
[Figure: contour plot of the F1-measure as a function of precision and recall]
Accuracy with three classes

         predicted A   predicted B   predicted C
gold A   58            6             1
gold B   5             11            2
gold C   0             7             43

accuracy = (58 + 11 + 43) / 133 ≈ 0.84

Precision with respect to class B

precision(B) = 11 / (6 + 11 + 7) = 11/24 ≈ 0.46

Recall with respect to class B

recall(B) = 11 / (5 + 11 + 2) = 11/18 ≈ 0.61
Classification report in scikit-learn
              precision    recall  f1-score   support

           C       0.63      0.04      0.07       671
          KD       0.70      0.02      0.03       821
           L       0.92      0.02      0.04       560
           M       0.36      0.68      0.47      1644
          MP       0.36      0.25      0.29       809
           S       0.46      0.84      0.59      2773
          SD       0.57      0.12      0.20      1060
           V       0.59      0.15      0.24       950

    accuracy                           0.43      9288
   macro avg       0.57      0.26      0.24      9288
weighted avg       0.52      0.43      0.34      9288
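A report like this is produced directly from the gold-standard and predicted labels; y_test and y_pred are assumed to exist:

```python
from sklearn.metrics import classification_report

# per-class precision, recall, F1, and support, plus macro and weighted averages
print(classification_report(y_test, y_pred))
```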
This lecture
• Introduction to text classification
• Evaluation of text classifiers
• The Naive Bayes classifier
• The Logistic Regression classifier
• Support Vector Machines
The Naive Bayes classifier
Naive Bayes
• The Naive Bayes classifier is a simple but surprisingly effective probabilistic text classifier that builds on Bayes’ rule.
• It is called ‘naive’ because it makes strong (unrealistic) independence assumptions about probabilities.
Naive Bayes decision rule, informally

score(pos) = 𝑃(pos) · 𝑃(elaborate |pos) · 𝑃(gorgeously |pos)

score(neg) = 𝑃(neg) · 𝑃(elaborate |neg) · 𝑃(gorgeously |neg)

[Figure: a document containing the words ‘elaborate’ and ‘gorgeously’ is assigned to whichever class, positive or negative (with class priors 70% / 30%), has the higher score]
The role of Bayes’ rule
• For classification, we would like to know 𝑃(class |document) (compare: 𝑃(disease |symptom)).
• But a Naive Bayes classifier contains 𝑃(document |class) (compare: 𝑃(symptom |disease)).
• The classifier uses Bayes’ rule to convert between the two: 𝑃(class |document) ∝ 𝑃(class) 𝑃(document |class)
Formal definition of a Naive Bayes classifier
𝐶 a set of possible classes
𝑉 a set of possible words; the model’s vocabulary
𝑃(𝑐) probabilities that specify how likely it is for a document to belong to class 𝑐 (one probability for each class)
𝑃(𝑤 |𝑐) probabilities that specify how likely it is for a document to contain the word 𝑤, given that the document belongs to class 𝑐 (one probability for each class–word pair)
Naive Bayes decision rule

ĉ = argmax_{𝑐 ∈ 𝐶} 𝑃(𝑐) · ∏_{𝑤 ∈ 𝑉} 𝑃(𝑤 |𝑐)^#(𝑤)

Here ĉ is the predicted class for the document, #(𝑤) is the count of the word 𝑤 in the document, and we choose that class 𝑐 which maximises the term to the right of the ‘arg max’.

What does the Naive Bayes classifier do if all words in the document to be classified are out-of-vocabulary?
Log probabilities
• In order to avoid underflow, implementations use the logarithms of probabilities instead of the probabilities themselves. 𝑃(𝑤 |𝑐) becomes log 𝑃(𝑤 |𝑐)
• Note that in this case, instead of multiplying probabilities, we have to add their logarithms.
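A sketch of the decision rule in log space; log_prior and log_likelihood are hypothetical dictionaries holding log 𝑃(𝑐) and log 𝑃(𝑤 |𝑐):

```python
from collections import Counter

def predict(document_words, classes, log_prior, log_likelihood):
    def score(c):
        s = log_prior[c]                       # log P(c)
        for w, n in Counter(document_words).items():
            if w in log_likelihood[c]:         # skip out-of-vocabulary words
                s += n * log_likelihood[c][w]  # add n * log P(w|c)
        return s
    return max(classes, key=score)
```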
Log probabilities

[Figure: the logarithm function on (0, 1]; log probabilities are negative, and smaller probabilities map to larger negative values]
Learning a Naive Bayes classifier

[Figure: as before, the labelled documents (classes A, B, C) are split into a training set and a test set; from the training set we learn the class probabilities 𝑃(𝑐) and the word probabilities 𝑃(𝑤 |𝑐), which are then evaluated on the test set]
Maximum Likelihood Estimation (MLE)
• One standard technique for estimating the probabilities of a model is Maximum Likelihood Estimation (MLE). Find probabilities that maximise the probability of the training data.
• Depending on the type of model under consideration, MLE can be computationally challenging.
• For the special case of the Naive Bayes classifier, MLE amounts to equating probabilities with relative frequencies.
MLE for the Naive Bayes classifier
• To estimate the class probabilities 𝑃(𝑐):
Compute the percentage of documents with class 𝑐 among all documents in the training set.
• To estimate the word probabilities 𝑃(𝑤 |𝑐):
Compute the percentage of occurrences of the word 𝑤 among all word occurrences in documents with class 𝑐.
MLE for the Naive Bayes classifier

𝑃(𝑐) = #(𝑐) / Σ_{𝑐′ ∈ 𝐶} #(𝑐′)

𝑃(𝑤 |𝑐) = #(𝑤, 𝑐) / Σ_{𝑤′ ∈ 𝑉} #(𝑤′, 𝑐)

#(𝑐): number of documents with gold-standard class 𝑐
#(𝑤, 𝑐): number of occurrences of 𝑤 in documents with class 𝑐
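A sketch of these two estimators with plain counters; data is a hypothetical list of (words, class) pairs:

```python
from collections import Counter, defaultdict

doc_count = Counter()              # #(c)
word_count = defaultdict(Counter)  # #(w, c)
for words, c in data:
    doc_count[c] += 1
    word_count[c].update(words)

total_docs = sum(doc_count.values())
P_class = {c: n / total_docs for c, n in doc_count.items()}
P_word = {c: {w: n / sum(counts.values()) for w, n in counts.items()}
          for c, counts in word_count.items()}
```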
MLE of word probabilities
positive (31 tokens): The gorgeously elaborate continuation of “The Lord of the Rings” trilogy is so huge that a column of words cannot adequately describe co-writer/director Peter Jackson’s expanded vision of J.R.R. Tolkien’s Middle-earth.

negative (37 tokens): … is a sour little movie at its core; an exploration of the emptiness that underlay the relentless gaiety of the 1920’s, as if to stop would hasten the economic and global political turmoil that was to come.
MLE of word probabilities
positive:
Word      Count
of        4
The       2
words     1
vision    1
trilogy   1
…

negative:
Word      Count
the       4
to        2
that      2
of        2
would     1
…
MLE of word probabilities
positive:
Probability       Estimated value
𝑃(of |pos)        4/31
𝑃(The |pos)       2/31
𝑃(words |pos)     1/31
𝑃(vision |pos)    1/31
𝑃(trilogy |pos)   1/31
…

negative:
Probability       Estimated value
𝑃(the |neg)       4/37
𝑃(to |neg)        2/37
𝑃(that |neg)      2/37
𝑃(of |neg)        2/37
𝑃(would |neg)     1/37
…
Smoothing
• If we equate word probabilities with relative frequencies, some probabilities may be zero. For example, ‘gorgeously’ may only occur in positive documents.
• This is a problem because we multiply probabilities in the decision rule for Naive Bayes. Slogan: Zero probabilities destroy information.
• To avoid zero probabilities, we can use techniques for smoothing the probability distribution.
Additive smoothing
• In additive smoothing, we add a constant value 𝛼 ≥ 0 to all counts before computing relative frequencies. For 𝛼 = 1, this technique is also known as Laplace smoothing.
• Effectively, we hallucinate 𝛼 extra occurrences of each word, including words that never occurred.
• The smoothing constant 𝛼 is a hyperparameter of the model that needs to be tuned on validation data.
Hyperparameters and cross-validation
• A hyperparameter is a parameter of a machine learning model that is not set by the learning algorithm itself.
• To tune hyperparameters, we need a separate validation set, or need to use cross-validation:
• Split the data into 𝑘 parts.
• In each fold, use 𝑘 − 1 parts for training, 1 part for validation.
• Take the mean performance over the 𝑘 folds.
Five-fold cross validation
[Figure: the data is split into five folds; in each fold, 80% is used for training and 20% for validation]
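Tuning the smoothing constant 𝛼 with five-fold cross-validation can be sketched with GridSearchCV, reusing the hypothetical pipeline from earlier (whose Naive Bayes step was named 'predictor'):

```python
from sklearn.model_selection import GridSearchCV

search = GridSearchCV(pipeline,
                      param_grid={'predictor__alpha': [0.01, 0.1, 0.5, 1.0]},
                      cv=5)  # five folds
search.fit(train_docs, train_labels)
print(search.best_params_)
```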
MLE with additive smoothing

The class probabilities are estimated as before (no smoothing here!):

𝑃(𝑐) = #(𝑐) / Σ_{𝑐′ ∈ 𝐶} #(𝑐′)

The word probabilities are smoothed:

𝑃(𝑤 |𝑐) = (#(𝑤, 𝑐) + 𝛼) / Σ_{𝑤′ ∈ 𝑉} (#(𝑤′, 𝑐) + 𝛼)

#(𝑐): number of documents with gold-standard class 𝑐
#(𝑤, 𝑐): number of occurrences of 𝑤 in documents with class 𝑐
Estimating word probabilities
positive (31 tokens): The gorgeously elaborate continuation of “The Lord of the Rings” trilogy is so huge that a column of words cannot adequately describe co-writer/director Peter Jackson’s expanded vision of J.R.R. Tolkien’s Middle-earth.

negative (37 tokens): … is a sour little movie at its core; an exploration of the emptiness that underlay the relentless gaiety of the 1920’s, as if to stop would hasten the economic and global political turmoil that was to come.
Vocabulary
1920’s J.R.R. Jackson’s Lord Middle-earth Peter Rings The Tolkien’s a adequately an and as at cannot co-writer/director column come continuation core describe economic elaborate emptiness expanded exploration gaiety global gorgeously hasten huge if is its little movie of political relentless so sour stop that the to trilogy turmoil underlay vision was words would
53 unique words
MLE with add-one smoothing
positive:
Word      Modified count
of        4 + 1
The       2 + 1
words     1 + 1
vision    1 + 1
trilogy   1 + 1
…

negative:
Word      Modified count
of        2 + 1
The       0 + 1
words     0 + 1
vision    0 + 1
trilogy   0 + 1
…
MLE with add-one smoothing
positive:
Probability       Estimated value
𝑃(of |pos)        (4 + 1)/(31 + 53)
𝑃(The |pos)       (2 + 1)/(31 + 53)
𝑃(words |pos)     (1 + 1)/(31 + 53)
𝑃(vision |pos)    (1 + 1)/(31 + 53)
𝑃(trilogy |pos)   (1 + 1)/(31 + 53)
…

negative:
Probability       Estimated value
𝑃(of |neg)        (2 + 1)/(37 + 53)
𝑃(The |neg)       (0 + 1)/(37 + 53)
𝑃(words |neg)     (0 + 1)/(37 + 53)
𝑃(vision |neg)    (0 + 1)/(37 + 53)
𝑃(trilogy |neg)   (0 + 1)/(37 + 53)
…
Additive smoothing assumes a uniform prior

Let 𝑁𝑐 be the total number of word occurrences in class 𝑐. Then

𝑃(𝑤 |𝑐) = (#(𝑤, 𝑐) + 𝛼) / (𝑁𝑐 + 𝛼·|𝑉|)

can be written as a linear interpolation of the maximum-likelihood estimate and the uniform distribution over word types:

𝑃(𝑤 |𝑐) = 𝜆 · #(𝑤, 𝑐)/𝑁𝑐 + (1 − 𝜆) · 1/|𝑉|

where

𝜆 = 𝑁𝑐 / (𝑁𝑐 + 𝛼·|𝑉|)
This lecture
• Introduction to text classification
• Evaluation of text classifiers
• The Naive Bayes classifier
• The Logistic Regression classifier
• Support Vector Machines
The Logistic Regression classifier
The multi-class linear model
𝒚 = 𝒙𝑾 + 𝒃

𝒙: input vector (1-by-F)
𝑾: weight matrix (F-by-K)
𝒃: bias vector (1-by-K)
𝒚: output vector (1-by-K)

F = number of features (input variables), K = number of classes (output variables)
The logistic model
• The logistic model extends the linear model by adding a pointwise logistic function 𝑓:
• The term logistic regression refers to the procedure of learning the parameters of the logistic model. That model is then used for classification. Confusing terminology!
𝒚 = 𝑓(𝒙𝑾 + 𝒃)
The standard logistic function for binary classification
𝑓(𝑧) = 1 / (1 + exp(−𝑧))

[Figure: the S-shaped curve of the standard logistic function]
The softmax function for k-ary classification
• The softmax function converts a vector of activations into a vector of probabilities (a finite probability distribution).
• The softmax function is the natural generalisation of the standard logistic function to 𝑘-ary classification problems.
softmax(𝒛)[𝑖] = exp(𝒛[𝑖]) / Σ𝑗 exp(𝒛[𝑗])
Training a logistic regression classifier
• We present the model with training samples of the form (𝒙, 𝑦) where 𝒙 is a feature vector and 𝑦 is the gold-standard class.
• The output of the model is a vector of conditional probabilities 𝑃(𝑘 | 𝒙) where 𝑘 ranges over the possible classes.
• We want to train the model so as to maximise the likelihood of the training data under this probability distribution. Same training principle as for Naive Bayes – but no closed-form solution.
Gradient descent
• Step 0: Start with an arbitrary value for the parameters 𝜽.
• Step 1: Compute the gradient of the loss function, ∇𝐿(𝜽).
• Step 2: Update the parameters 𝜽 as follows: 𝜽 ≔ 𝜽 − 𝜂 ∇𝐿(𝜽) The hyperparameter 𝜂 is the learning rate.
• Repeat steps 1–2 until the error is sufficiently low.
‘Follow the gradient into the valley of error.’
Cross-entropy loss function

𝐿(𝜽) = −[𝑦 log ŷ + (1 − 𝑦) log(1 − ŷ)]

[Figure: the loss as a function of ŷ, for 𝑦 = 1 and for 𝑦 = 0]
Cross-entropy loss for logistic units

∂𝐿(𝜽)/∂𝑧 = ŷ − 𝑦

[Figure: loss of the logistic unit plotted against 𝑧, for 𝑦 = 1 and for 𝑦 = 0]
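A sketch of batch gradient descent for binary logistic regression built on this gradient; X is an N-by-F NumPy matrix, y a vector of 0/1 labels:

```python
import numpy as np

def train_logistic(X, y, eta=0.1, epochs=100):
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        y_hat = 1 / (1 + np.exp(-(X @ w + b)))  # standard logistic function
        error = y_hat - y                       # dL/dz for each sample
        w -= eta * (X.T @ error) / len(y)       # chain rule: dL/dw = x * (y_hat - y)
        b -= eta * error.mean()
    return w, b
```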
Cross-entropy loss function for the softmax function
For a logistic model with parameters 𝜽 = (𝑾, 𝒃), the cross-entropy loss for a sample (𝒙, 𝑦) is defined as
where we write [𝑘 = 𝑦] for the function that returns 1 if 𝑘 = 𝑦, and 0 otherwise (Iverson bracket).
𝐿(𝜽) = − Σ𝑘 [𝑘 = 𝑦] · log softmax(𝒙𝑾 + 𝒃)[𝑘]
Regularisation
• The term regularisation refers to all changes to a model that are intended to decrease generalisation error but not training error.
• A standard choice for logistic regression is L2-regularisation, where we penalise parameter vectors based on their lengths:

𝜽 ≔ 𝜽 − 𝜂 ∇[𝐿(𝜽) + 𝜆·∥𝜽∥²]

where 𝜆 is the regularisation strength.
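In scikit-learn, LogisticRegression uses L2-regularisation by default; note that its parameter C is the inverse regularisation strength, so C = 1/𝜆:

```python
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(penalty='l2', C=1.0)  # smaller C = stronger regularisation
```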
Logistic Regression and Naive Bayes
• Learning a Naive Bayes classifier involves fitting a probability model that optimises the joint likelihood 𝑃(class, document); 𝑃(class | document) is obtained via Bayes’ rule.
• Logistic regression fits the same probability model to directly optimise the conditional 𝑃(class | document) (lower asymptotic error).
• Naive Bayes and logistic regression form a generative–discriminative pair.
Support Vector Machines
Linear model
ŷ = 𝒙𝒘 + 𝑏

𝒙: input vector (1-by-F)
𝒘: weight vector (F-by-1)
𝑏: bias
ŷ: predicted output

F = number of features (independent variables, covariates)
Linearly separable problems
[Figure: two scatter plots, one linearly separable and one not linearly separable]
Linearly separable problems
• A classification problem is called linearly separable, if there exists a linear model that can learn the problem with zero error.
• Any line in feature space that separates the classes from each other is called a decision boundary.
For simplicity, we restrict ourselves to binary classification problems here.
Different decision boundaries

[Figure: several decision boundaries that all separate the two classes; the margin is the distance from a boundary to the closest training samples]

We would like to choose a decision boundary that maximises the margin to the closest training samples.
Computing the margin
• Mathematically, the decision boundary is a plane that is defined by the parameters (weights and bias) of the linear classifier. More specifically, this plane is the set of all points 𝒙 such that 𝒙𝒘 + 𝑏 = 0.
• The margin between a data point 𝒙𝑖 and the decision boundary is the distance between 𝒙𝑖 and the plane along the direction of 𝒘.
• Subject to some convenient normalisations, for the closest data points this yields a margin of 1/∥𝒘∥.
Support vectors
[Figure: the training samples that lie exactly on the margin are the support vectors]
Maximising the margin
• If the training data is linearly separable, then the optimal margin can be found using quadratic programming (more recent approaches: sub-gradient descent, coordinate descent).
• To generalise this to training sets which are not linearly separable, one can use gradient-based optimisation.
Practical concerns
• More than two classes can be handled according to a one-versus-one or one-versus-all scheme.
• For a general SVM, training time scales quadratically with the number of samples, which quickly becomes impractical.
• An alternative is to use a special-purpose implementation of a linear SVM, which can have significantly shorter training times.
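A sketch of such a linear SVM in scikit-learn, using the special-purpose LinearSVC rather than the general SVC; train_docs and train_labels are assumed as before:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

svm_pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('predictor', LinearSVC()),  # trains much faster than the general SVC on large collections
])
svm_pipeline.fit(train_docs, train_labels)
```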
This lecture
• Introduction to text classification
• Evaluation of text classifiers
• The Naive Bayes classifier
• The Logistic Regression classifier
• Support Vector Machines
Project idea
• There are many text classification methods other than Naive Bayes, Logistic Regression, and Support Vector Machines (decision trees, many different types of neural networks, …).
• Pick a method that you find interesting and compare its performance to that of other methods on a standard data set.
• In the report, make it clear why you chose this method, and that you have understood how it works.
Beyond the bag-of-words assumption
• The linear sequence of the words in a text carries meaning. The brown dog on the mat saw the striped cat through the window.
The brown cat saw the striped dog through the window on the mat.
• We can try to capture at least parts of this sequence by representing texts as 𝒏-grams, contiguous sequences of 𝑛 words: unigram (not), bigram (not bad); see the sketch after this list.
• Recurrent neural networks have been designed to explicitly capture long-range dependencies between words.
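A sketch of n-gram features with CountVectorizer, here extracting both unigrams and bigrams:

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(['not bad', 'not good'])
print(vectorizer.get_feature_names_out())
# ['bad' 'good' 'not' 'not bad' 'not good']
```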