This work is licensed under a Creative Commons Attribution 4.0 International License.
Text classification
Marco Kuhlmann Department of Computer and Information Science
732A92/TDDE16 Text Mining (2020)
Text classification
• Text classification is the task of categorising text documents into predefined classes.
• The term ‘document’ covers everything from tweets and press releases to complete books.
Topic classification

UK: congestion, London, Parliament, Big Ben, Windsor, The Queen
China: Olympics, Beijing, tourism, Great Wall, Mao, Communist
Elections: recount, votes, seat, run-off, TV-ads, campaign
Sports: diamond, baseball, forward, soccer, team, captain

Adapted from Manning et al. (2008)
Forensic linguistics
‘I realized the faxed copy I just received was an outline of the manifesto, using much of the same wording, definitely the same topics and themes. … I invented [the language analysis] for this case and really, forensics linguistics took off after that.’
James Fitzgerald, profiler
Sources: Wikipedia, Newsweek
Sentiment analysis
The gorgeously elaborate continuation of “The Lord of the Rings” trilogy is so huge that a column of words cannot adequately describe co-writer/director Peter Jackson’s expanded vision of J.R.R. Tolkien’s Middle-earth.
… is a sour little movie at its core; an exploration of the emptiness that underlay the relentless gaiety of the 1920’s, as if to stop would hasten the economic and global political turmoil that was to come.
positive negative
The standard text classification pipeline

document collection → vectorizer (example: CountVectorizer) → matrix of document vectors → predictor (example: MultinomialNB) → vector of class labels
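A minimal sketch of this pipeline in scikit-learn, assuming two hypothetical lists train_docs (document strings) and train_labels (class labels):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),  # document collection -> matrix of document vectors
    ('predictor', MultinomialNB()),     # document vectors -> vector of class labels
])

pipeline.fit(train_docs, train_labels)
predicted = pipeline.predict(['a new, unseen document'])
```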
Reminder: Documents as tf–idf vectors

Word       Scandal in Bohemia   Final problem   Empty house   Norwood builder   Dancing men   Retired colourman
Adair      0.0000               0.0000          0.0692        0.0000            0.0000        0.0000
Adler      0.0531               0.0000          0.0000        0.0000            0.0000        0.0000
Lestrade   0.0000               0.0000          0.0291        0.1424            0.0000        0.0000
Moriarty   0.0000               0.0845          0.0528        0.0034            0.0000        0.0000
Documents as count vectors – the bag of words
the gorgeously elaborate continuation of the lord of the rings trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson’s expanded vision of j.r.r. tolkien’s middle-earth.
… is a sour little movie at its core an exploration of the emptiness that underlay the relentless gaiety of the 1920’s as if to stop would hasten the economic and global political turmoil that was to come
positive negative
Documents as count vectors – the bag of words
a adequately cannot co-writer/director column continuation describe elaborate expanded gorgeously huge is j.r.r. jackson lord middle-earth of of of of peter rings so that the the the tolkien trilogy vision words
… 1920’s a an and as at come core economic emptiness exploration gaiety global hasten if is its little movie of of political relentless sour stop that that the the the the to to turmoil underlay was would
positive negative
Documents as count vectors – the bag of words
positive:
Word      Count
of        4
the       3
words     1
vision    1
trilogy   1
…

negative:
Word      Count
the       4
to        2
that      2
of        2
would     1
…
Text classification as supervised machine learning
[Figure: a collection of documents, each represented by its characteristic words and labelled with one of the classes A, B, C]
Text classification as supervised machine learning
[Figure: the labelled documents are split into a training set, used for learning, and a test set, used for evaluation]
Training and testing
• Training
When we train a classifier, we present it with a document 𝑥 and its gold-standard class 𝑦 and apply some learning algorithm (maximise the likelihood / minimise the cross-entropy of the training data).
• Testing
When we evaluate a classifier, we present it with 𝑥 and compare the predicted class for this input with the gold-standard class 𝑦.
Two challenges in text classification
• Standard document representations such as the bag of words easily yield tens of thousands of features (computational challenge, data sparsity).
• Many document collections are highly imbalanced with respect to the represented classes (frequency bias, problems for evaluation).
This lecture
• Introduction to text classification
• Evaluation of text classifiers
• The Naive Bayes classifier
• The Logistic Regression classifier
• Support Vector Machines
Evaluation of text classifiers
• We require a test set consisting of a number of documents, each of which has been tagged with its correct class (typically part of a larger gold-standard data set).
• To evaluate a classifier, we apply it to the test set and compare the predicted classes with the gold-standard classes.
• We do so in order to estimate how well the classifier will perform on new, previously unseen documents (assuming that all samples are drawn from the same distribution).
Evaluation of text classifiers
[Figure: three test documents with gold-standard classes A, B, C; the classifier predicts the classes A, B, A]
Accuracy
The accuracy of a classifier is the proportion of documents for which the classifier predicts the gold-standard class:
accuracy = (# correctly classified documents) / (# all documents)
Accuracy
Document                    Gold-standard class   Predicted class
Chinese Beijing Chinese     China                 China
Chinese Chinese Shanghai    China                 China
Chinese Macao               China                 China
Tokyo Japan Chinese         Japan                 China

accuracy for this example: 3/4 = 75%
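The same accuracy can be recomputed with scikit-learn; this sketch just reuses the gold-standard and predicted classes from the table above:

```python
from sklearn.metrics import accuracy_score

gold = ['China', 'China', 'China', 'Japan']
pred = ['China', 'China', 'China', 'China']
print(accuracy_score(gold, pred))  # 0.75
```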
Accuracy and imbalanced data sets
[Figure: two pie charts, a balanced data set (50% / 50%) and an imbalanced one (80% / 20%)]
Is 80% accuracy good or bad?
The role of baselines
• Evaluation measures are no absolute measures of performance. Whether ‘80% accuracy’ is good or not depends on the task at hand.
• Instead, we should ask for a classifier’s performance relative to other classifiers, or other points of comparison. ‘Logistic Regression has a higher accuracy than Naive Bayes.’
• When other classifiers are not available, a simple baseline is to always predict the most frequent class in the training data.
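A most-frequent-class baseline can be sketched with scikit-learn's DummyClassifier; X_train, y_train, X_test, y_test are assumed to exist:

```python
from sklearn.dummy import DummyClassifier

baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train, y_train)         # only inspects the class distribution of y_train
print(baseline.score(X_test, y_test))  # accuracy of always predicting the majority class
```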
Grading criteria

Assessment criteria

When grading your report, I will assess it with respect to the criteria defined below. For each criterion, I will assign either the score 0 (does not meet expectations) or a score between 1 (meets expectations) and 9 (exceeds expectations). To get a passing grade, your score for each criterion must be at least 1. Your grade is calculated from two component scores as specified in the following table. Together with your grade, you will receive a short written motivation based on the descriptors for each criterion.

[Table mapping the two component scores to the final grade; the component scores did not survive extraction. The letter grades for 732A92 in the nine rows read E, D, C, D, C, B, C, B, A; the corresponding numeric grades for TDDE16 are lost.]

Component 1

This component is the score for a single criterion:

Soundness and correctness. Is the technical approach sound and well-chosen? Are the claims made in the report supported by proper experiments, and are the results of these experiments correctly interpreted?

• Troublesome: there are some ideas worth salvaging here, but the work should really have been done or evaluated differently.
• Fairly reasonable work: the approach is not bad, the evaluation is appropriate, and at least the main claims are probably correct.
• Generally solid work, although there are some aspects of the approach or the evaluation that I am not sure about.
• The approach is fully apt and all claims are convincingly supported; the report discusses the limitations of the work.
Confusion matrix
                            classifier ‘positive’   classifier ‘negative’
gold standard ‘positive’    true positives          false negatives
gold standard ‘negative’    false positives         true negatives
Accuracy

In terms of the confusion matrix, accuracy counts the cells on the diagonal: the true positives and the true negatives.
Precision and recall
• Precision and recall ‘zoom in’ on how good a system is at identifying documents of a specific class 𝑐.
• Precision is the proportion of correctly classified documents among all documents for which the system predicts class 𝑐. When the system predicts class 𝑐, how often is it correct?
• Recall is the proportion of correctly classified documents among all documents with gold-standard class 𝑐. When the document has class 𝑐, how often does the system predict it?
Precision with respect to the positive class

In the confusion matrix, precision considers the column of documents that the classifier labels ‘positive’: the true positives and the false positives.

Recall with respect to the positive class

In the confusion matrix, recall considers the row of documents whose gold-standard class is ‘positive’: the true positives and the false negatives.
Precision and recall with respect to the positive class
precision = # true positives / (# true positives + # false positives)

recall = # true positives / (# true positives + # false negatives)
F1-measure
A good classifier should balance between precision and recall. The F1-measure is the harmonic mean of the two values:
F1 = (2 · precision · recall) / (precision + recall)
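The three measures are easy to compute from raw confusion counts; a small sketch (the counts used in the example call are the class-B counts from the three-class confusion matrix shown further below):

```python
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(precision_recall_f1(tp=11, fp=13, fn=7))  # (0.458..., 0.611..., 0.523...)
```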
F1-measure
[Figure: contour plot of the F1-measure as a function of precision and recall]
Accuracy with three classes

         predicted A   predicted B   predicted C
gold A   58            6             1
gold B   5             11            2
gold C   0             7             43

accuracy = (58 + 11 + 43) / 133 ≈ 0.84

Precision with respect to class B

precision(B) = 11 / (6 + 11 + 7) = 11/24 ≈ 0.46

Recall with respect to class B

recall(B) = 11 / (5 + 11 + 2) = 11/18 ≈ 0.61
Classification report in scikit-learn
              precision    recall  f1-score   support

           C       0.63      0.04      0.07       671
          KD       0.70      0.02      0.03       821
           L       0.92      0.02      0.04       560
           M       0.36      0.68      0.47      1644
          MP       0.36      0.25      0.29       809
           S       0.46      0.84      0.59      2773
          SD       0.57      0.12      0.20      1060
           V       0.59      0.15      0.24       950

    accuracy                           0.43      9288
   macro avg       0.57      0.26      0.24      9288
weighted avg       0.52      0.43      0.34      9288
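A report like this is produced directly from the gold-standard and predicted labels; y_test and y_pred are assumed to exist:

```python
from sklearn.metrics import classification_report

# per-class precision, recall, F1, and support, plus macro and weighted averages
print(classification_report(y_test, y_pred))
```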
This lecture
• Introduction to text classification
• Evaluation of text classifiers
• The Naive Bayes classifier
• The Logistic Regression classifier
• Support Vector Machines
The Naive Bayes classifier
Naive Bayes
• The Naive Bayes classifier is a simple but surprisingly effective probabilistic text classifier that builds on Bayes’ rule.
• It is called ‘naive’ because it makes strong (unrealistic) independence assumptions about probabilities.
Naive Bayes decision rule, informally

score(pos) = 𝑃(pos) · 𝑃(elaborate |pos) · 𝑃(gorgeously |pos)

score(neg) = 𝑃(neg) · 𝑃(elaborate |neg) · 𝑃(gorgeously |neg)

[Figure: a document containing the words ‘elaborate’ and ‘gorgeously’ is assigned to whichever class, positive or negative (with class priors 70% / 30%), has the higher score]
The role of Bayes’ rule
• For classification, we would like to know 𝑃(class |document) (compare: 𝑃(disease |symptom)).
• But a Naive Bayes classifier contains 𝑃(document |class) (compare: 𝑃(symptom |disease)).
• The classifier uses Bayes’ rule to convert between the two: 𝑃(class |document) ∝ 𝑃(class) 𝑃(document |class)
Formal definition of a Naive Bayes classifier
𝐶 a set of possible classes
𝑉 a set of possible words; the model’s vocabulary
𝑃(𝑐) probabilities that specify how likely it is for a document to belong to class 𝑐 (one probability for each class)
𝑃(𝑤 |𝑐) probabilities that specify how likely it is for a document to contain the word 𝑤, given that the document belongs to class 𝑐 (one probability for each class–word pair)
Naive Bayes decision rule

ĉ = argmax_{𝑐 ∈ 𝐶} 𝑃(𝑐) · ∏_{𝑤 ∈ 𝑉} 𝑃(𝑤 |𝑐)^#(𝑤)

Here ĉ is the predicted class for the document, #(𝑤) is the count of the word 𝑤 in the document, and we choose that class 𝑐 which maximises the term to the right of the ‘arg max’.

What does the Naive Bayes classifier do if all words in the document to be classified are out-of-vocabulary?
Log probabilities
• In order to avoid underflow, implementations use the logarithms of probabilities instead of the probabilities themselves. 𝑃(𝑤 |𝑐) becomes log 𝑃(𝑤 |𝑐)
• Note that in this case, instead of multiplying probabilities, we have to add their logarithms.
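A sketch of the decision rule in log space; log_prior and log_likelihood are hypothetical dictionaries holding log 𝑃(𝑐) and log 𝑃(𝑤 |𝑐):

```python
from collections import Counter

def predict(document_words, classes, log_prior, log_likelihood):
    def score(c):
        s = log_prior[c]                       # log P(c)
        for w, n in Counter(document_words).items():
            if w in log_likelihood[c]:         # skip out-of-vocabulary words
                s += n * log_likelihood[c][w]  # add n * log P(w|c)
        return s
    return max(classes, key=score)
```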
Log probabilities

[Figure: the logarithm function on (0, 1]; log probabilities are negative, and smaller probabilities map to larger negative values]
Learning a Naive Bayes classifier

[Figure: as before, the labelled documents (classes A, B, C) are split into a training set and a test set; from the training set we learn the class probabilities 𝑃(𝑐) and the word probabilities 𝑃(𝑤 |𝑐), which are then evaluated on the test set]
Maximum Likelihood Estimation (MLE)
• One standard technique for estimating the probabilities of a model is Maximum Likelihood Estimation (MLE). Find probabilities that maximise the probability of the training data.
• Depending on the type of model under consideration, MLE can be computationally challenging.
• For the special case of the Naive Bayes classifier, MLE amounts to equating probabilities with relative frequencies.
MLE for the Naive Bayes classifier
• To estimate the class probabilities 𝑃(𝑐):
Compute the percentage of documents with class 𝑐 among all documents in the training set.
• To estimate the word probabilities 𝑃(𝑤 |𝑐):
Compute the percentage of occurrences of the word 𝑤 among all word occurrences in documents with class 𝑐.
MLE for the Naive Bayes classifier

𝑃(𝑐) = #(𝑐) / Σ_{𝑐′ ∈ 𝐶} #(𝑐′)

𝑃(𝑤 |𝑐) = #(𝑤, 𝑐) / Σ_{𝑤′ ∈ 𝑉} #(𝑤′, 𝑐)

#(𝑐): number of documents with gold-standard class 𝑐
#(𝑤, 𝑐): number of occurrences of 𝑤 in documents with class 𝑐
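A sketch of these two estimators with plain counters; data is a hypothetical list of (words, class) pairs:

```python
from collections import Counter, defaultdict

doc_count = Counter()              # #(c)
word_count = defaultdict(Counter)  # #(w, c)
for words, c in data:
    doc_count[c] += 1
    word_count[c].update(words)

total_docs = sum(doc_count.values())
P_class = {c: n / total_docs for c, n in doc_count.items()}
P_word = {c: {w: n / sum(counts.values()) for w, n in counts.items()}
          for c, counts in word_count.items()}
```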
MLE of word probabilities
positive (31 tokens): The gorgeously elaborate continuation of “The Lord of the Rings” trilogy is so huge that a column of words cannot adequately describe co-writer/director Peter Jackson’s expanded vision of J.R.R. Tolkien’s Middle-earth.

negative (37 tokens): … is a sour little movie at its core; an exploration of the emptiness that underlay the relentless gaiety of the 1920’s, as if to stop would hasten the economic and global political turmoil that was to come.
MLE of word probabilities
positive:
Word      Count
of        4
The       2
words     1
vision    1
trilogy   1
…

negative:
Word      Count
the       4
to        2
that      2
of        2
would     1
…
MLE of word probabilities
positive:
Probability       Estimated value
𝑃(of |pos)        4/31
𝑃(The |pos)       2/31
𝑃(words |pos)     1/31
𝑃(vision |pos)    1/31
𝑃(trilogy |pos)   1/31
…

negative:
Probability       Estimated value
𝑃(the |neg)       4/37
𝑃(to |neg)        2/37
𝑃(that |neg)      2/37
𝑃(of |neg)        2/37
𝑃(would |neg)     1/37
…
Smoothing
• If we equate word probabilities with relative frequencies, some probabilities may be zero. For example, ‘gorgeously’ may only occur in positive documents.
• This is a problem because we multiply probabilities in the decision rule for Naive Bayes. Slogan: Zero probabilities destroy information.
• To avoid zero probabilities, we can use techniques for smoothing the probability distribution.
Additive smoothing
• In additive smoothing, we add a constant value 𝛼 ≥ 0 to all counts before computing relative frequencies. For 𝛼 = 1, this technique is also known as Laplace smoothing.
• Effectively, we hallucinate 𝛼 extra occurrences of each word, including words that never occurred.
• The smoothing constant 𝛼 is a hyperparameter of the model that needs to be tuned on validation data.
Hyperparameters and cross-validation
• A hyperparameter is a parameter of a machine learning model that is not set by the learning algorithm itself.
• To tune hyperparameters, we need a separate validation set, or need to use cross-validation:
• Split the data into 𝑘 parts.
• In each fold, use 𝑘 − 1 parts for training, 1 part for validation.
• Take the mean performance over the 𝑘 folds.
Five-fold cross validation
[Figure: the data is split into five folds; in each fold, 80% is used for training and 20% for validation]
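Tuning the smoothing constant 𝛼 with five-fold cross-validation can be sketched with GridSearchCV, reusing the hypothetical pipeline from earlier (whose Naive Bayes step was named 'predictor'):

```python
from sklearn.model_selection import GridSearchCV

search = GridSearchCV(pipeline,
                      param_grid={'predictor__alpha': [0.01, 0.1, 0.5, 1.0]},
                      cv=5)  # five folds
search.fit(train_docs, train_labels)
print(search.best_params_)
```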
MLE with additive smoothing

The class probabilities are estimated as before (no smoothing here!):

𝑃(𝑐) = #(𝑐) / Σ_{𝑐′ ∈ 𝐶} #(𝑐′)

The word probabilities are smoothed:

𝑃(𝑤 |𝑐) = (#(𝑤, 𝑐) + 𝛼) / Σ_{𝑤′ ∈ 𝑉} (#(𝑤′, 𝑐) + 𝛼)

#(𝑐): number of documents with gold-standard class 𝑐
#(𝑤, 𝑐): number of occurrences of 𝑤 in documents with class 𝑐
Estimating word probabilities
positive (31 tokens): The gorgeously elaborate continuation of “The Lord of the Rings” trilogy is so huge that a column of words cannot adequately describe co-writer/director Peter Jackson’s expanded vision of J.R.R. Tolkien’s Middle-earth.

negative (37 tokens): … is a sour little movie at its core; an exploration of the emptiness that underlay the relentless gaiety of the 1920’s, as if to stop would hasten the economic and global political turmoil that was to come.
Vocabulary
1920’s J.R.R. Jackson’s Lord Middle-earth Peter Rings The Tolkien’s a adequately an and as at cannot co-writer/director column come continuation core describe economic elaborate emptiness expanded exploration gaiety global gorgeously hasten huge if is its little movie of political relentless so sour stop that the to trilogy turmoil underlay vision was words would
53 unique words
MLE with add-one smoothing
positive:
Word      Modified count
of        4 + 1
The       2 + 1
words     1 + 1
vision    1 + 1
trilogy   1 + 1
…

negative:
Word      Modified count
of        2 + 1
The       0 + 1
words     0 + 1
vision    0 + 1
trilogy   0 + 1
…
MLE with add-one smoothing
positive:
Probability       Estimated value
𝑃(of |pos)        (4 + 1)/(31 + 53)
𝑃(The |pos)       (2 + 1)/(31 + 53)
𝑃(words |pos)     (1 + 1)/(31 + 53)
𝑃(vision |pos)    (1 + 1)/(31 + 53)
𝑃(trilogy |pos)   (1 + 1)/(31 + 53)
…

negative:
Probability       Estimated value
𝑃(of |neg)        (2 + 1)/(37 + 53)
𝑃(The |neg)       (0 + 1)/(37 + 53)
𝑃(words |neg)     (0 + 1)/(37 + 53)
𝑃(vision |neg)    (0 + 1)/(37 + 53)
𝑃(trilogy |neg)   (0 + 1)/(37 + 53)
…
Additive smoothing assumes a uniform prior

Let 𝑁𝑐 be the total number of word occurrences in class 𝑐. Then

𝑃(𝑤 |𝑐) = (#(𝑤, 𝑐) + 𝛼) / (𝑁𝑐 + 𝛼·|𝑉|)

can be written as a linear interpolation of the maximum-likelihood estimate and the uniform distribution over word types:

𝑃(𝑤 |𝑐) = 𝜆 · #(𝑤, 𝑐)/𝑁𝑐 + (1 − 𝜆) · 1/|𝑉|

where

𝜆 = 𝑁𝑐 / (𝑁𝑐 + 𝛼·|𝑉|)
This lecture
• Introduction to text classification
• Evaluation of text classifiers
• The Naive Bayes classifier
• The Logistic Regression classifier
• Support Vector Machines
The Logistic Regression classifier
The multi-class linear model
𝒚 = 𝒙𝑾 + 𝒃

𝒙: input vector (1-by-F)
𝑾: weight matrix (F-by-K)
𝒃: bias vector (1-by-K)
𝒚: output vector (1-by-K)

F = number of features (input variables), K = number of classes (output variables)
The logistic model
• The logistic model extends the linear model by adding a pointwise logistic function 𝑓:
• The term logistic regression refers to the procedure of learning the parameters of the logistic model. That model is then used for classification. Confusing terminology!
𝒚 = 𝑓(𝒙𝑾 + 𝒃)
The standard logistic function for binary classification
𝑓(𝑧) = 1 / (1 + exp(−𝑧))

[Figure: the S-shaped curve of the standard logistic function]
The softmax function for k-ary classification
• The softmax function converts a vector of activations into a vector of probabilities (a finite probability distribution).
• The softmax function is the natural generalisation of the standard logistic function to 𝑘-ary classification problems.
softmax(𝒛)[𝑖] = exp(𝒛[𝑖]) / Σ𝑗 exp(𝒛[𝑗])
Training a logistic regression classifier
• We present the model with training samples of the form (𝒙, 𝑦) where 𝒙 is a feature vector and 𝑦 is the gold-standard class.
• The output of the model is a vector of conditional probabilities 𝑃(𝑘 | 𝒙) where 𝑘 ranges over the possible classes.
• We want to train the model so as to maximise the likelihood of the training data under this probability distribution. Same training principle as for Naive Bayes – but no closed-form solution.
Gradient descent
• Step 0: Start with an arbitrary value for the parameters 𝜽.
• Step 1: Compute the gradient of the loss function, ∇𝐿(𝜽).
• Step 2: Update the parameters 𝜽 as follows: 𝜽 ≔ 𝜽 − 𝜂 ∇𝐿(𝜽) The hyperparameter 𝜂 is the learning rate.
• Repeat steps 1–2 until the error is sufficiently low.
‘Follow the gradient into the valley of error.’
Cross-entropy loss function

𝐿(𝜽) = −[𝑦 log ŷ + (1 − 𝑦) log(1 − ŷ)]

[Figure: the loss as a function of ŷ, for 𝑦 = 1 and for 𝑦 = 0]
Cross-entropy loss for logistic units

∂𝐿(𝜽)/∂𝑧 = ŷ − 𝑦

[Figure: loss of the logistic unit plotted against 𝑧, for 𝑦 = 1 and for 𝑦 = 0]
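A sketch of batch gradient descent for binary logistic regression built on this gradient; X is an N-by-F NumPy matrix, y a vector of 0/1 labels:

```python
import numpy as np

def train_logistic(X, y, eta=0.1, epochs=100):
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        y_hat = 1 / (1 + np.exp(-(X @ w + b)))  # standard logistic function
        error = y_hat - y                       # dL/dz for each sample
        w -= eta * (X.T @ error) / len(y)       # chain rule: dL/dw = x * (y_hat - y)
        b -= eta * error.mean()
    return w, b
```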
Cross-entropy loss function for the softmax function
For a logistic model with parameters 𝜽 = (𝑾, 𝒃), the cross-entropy loss for a sample (𝒙, 𝑦) is defined as
where we write [𝑘 = 𝑦] for the function that returns 1 if 𝑘 = 𝑦, and 0 otherwise (Iverson bracket).
𝐿(𝜽) = − Σ𝑘 [𝑘 = 𝑦] · log softmax(𝒙𝑾 + 𝒃)[𝑘]
Regularisation
• The term regularisation refers to all changes to a model that are intended to decrease generalisation error but not training error.
• A standard choice for logistic regression is L2-regularisation, where we penalise parameter vectors based on their lengths:

𝜽 ≔ 𝜽 − 𝜂 ∇[𝐿(𝜽) + 𝜆·∥𝜽∥²]

where 𝜆 is the regularisation strength.
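In scikit-learn, LogisticRegression uses L2-regularisation by default; note that its parameter C is the inverse regularisation strength, so C = 1/𝜆:

```python
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(penalty='l2', C=1.0)  # smaller C = stronger regularisation
```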
Logistic Regression and Naive Bayes
• Learning a Naive Bayes classifier involves fitting a probability model that optimises the joint likelihood 𝑃(class, document); 𝑃(class | document) is obtained via Bayes’ rule.
• Logistic regression fits the same probability model to directly optimise the conditional 𝑃(class | document) (lower asymptotic error).
• Naive Bayes and logistic regression form a generative–discriminative pair.
Support Vector Machines
Linear model
ŷ = 𝒙𝒘 + 𝑏

𝒙: input vector (1-by-F)
𝒘: weight vector (F-by-1)
𝑏: bias
ŷ: predicted output

F = number of features (independent variables, covariates)
Linearly separable problems
[Figure: two scatter plots, one linearly separable and one not linearly separable]
Linearly separable problems
• A classification problem is called linearly separable, if there exists a linear model that can learn the problem with zero error.
• Any line in feature space that separates the classes from each other is called a decision boundary.
For simplicity, we restrict ourselves to binary classification problems here.
Different decision boundaries

[Figure: several decision boundaries that all separate the two classes; the margin is the distance from a boundary to the closest training samples]

We would like to choose a decision boundary that maximises the margin to the closest training samples.
Computing the margin
• Mathematically, the decision boundary is a plane that is defined by the parameters (weights and bias) of the linear classifier. More specifically, this plane is the set of all points 𝒙 such that 𝒙𝒘 + 𝑏 = 0.
• The margin between a data point 𝒙𝑖 and the decision boundary is the distance between 𝒙𝑖 and the plane along the direction of 𝒘.
• Subject to some convenient normalisations, for the closest data points this yields a margin of 1/∥𝒘∥.
Support vectors
[Figure: the training samples that lie exactly on the margin are the support vectors]
Maximising the margin
• If the training data is linearly separable, then the optimal margin can be found using quadratic programming (more recent approaches: sub-gradient descent, coordinate descent).
• To generalise this to training sets which are not linearly separable, one can use gradient-based optimisation.
Practical concerns
• More than two classes can be handled according to a one-versus-one or one-versus-all scheme.
• For a general SVM, training time scales quadratically with the number of samples, which quickly becomes impractical.
• An alternative is to use a special-purpose implementation of a linear SVM, which can have significantly shorter training times.
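A sketch of such a linear SVM in scikit-learn, using the special-purpose LinearSVC rather than the general SVC; train_docs and train_labels are assumed as before:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

svm_pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('predictor', LinearSVC()),  # trains much faster than the general SVC on large collections
])
svm_pipeline.fit(train_docs, train_labels)
```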
This lecture
• Introduction to text classification
• Evaluation of text classifiers
• The Naive Bayes classifier
• The Logistic Regression classifier
• Support Vector Machines
Project idea
• There are many text classification methods other than Naive Bayes, Logistic Regression, and Support Vector Machines (decision trees, many different types of neural networks, …).
• Pick a method that you find interesting and compare its performance to that of other methods on a standard data set.
• In the report, make it clear why you chose this method, and that you have understood how it works.
Beyond the bag-of-words assumption
• The linear sequence of the words in a text carries meaning. The brown dog on the mat saw the striped cat through the window.
The brown cat saw the striped dog through the window on the mat.
• We can try to capture at least parts of this sequence by representing texts as 𝒏-grams, contiguous sequences of 𝑛 words: unigram (not), bigram (not bad); see the sketch after this list.
• Recurrent neural networks have been designed to explicitly capture long-range dependencies between words.
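A sketch of n-gram features with CountVectorizer, here extracting both unigrams and bigrams:

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(['not bad', 'not good'])
print(vectorizer.get_feature_names_out())
# ['bad' 'good' 'not' 'not bad' 'not good']
```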