Text Classification
Natural Language Processing: Lecture 9
02.11.2017
Kairit Sirts
Text/document classification
• Spam detection
• Topic classification (sport/finance/travel etc)
• Genre classification (news/sports/fiction/social media etc)
• Sentiment analysis
5
Authorship attribution
• Native language identification
• Clinical text classification – diagnosing psychiatric or cognitive impairments
• Identification of gender, dialect, educational background
6
Types of classification tasks
• Binary classification (true/false, 1/0, 1/-1)
  • Naïve Bayes
  • Logistic regression
• Multi-class classification (politics/finance/travel)
• Multi-label classification (image captioning)
• Clustering: mostly unsupervised
  • Topic modeling – important, but I will not talk about it today
7
Topics
• Generative vs discriminative classifiers
• Document representation
• Naïve Bayes classification
• Logistic regression
• Neural text classification
• Evaluation
• Slides mostly based on:
  • Text classification with Naïve Bayes by Sharon Goldwater
  • Text classification by Karl Moritz Hermann
8
Generative task formulation
• Given document d and a set of class labels C, assign to d the most probable label ĉ:
  ĉ = argmax_{c ∈ C} P(c|d)
9
Bayes rule
• P(c|d) = P(d|c) P(c) / P(d)
  • P(c|d) – posterior probability
  • P(d|c) – likelihood
  • P(c) – prior probability
• Dropping the denominator P(d), which is constant across classes, gives the decision rule used by Naïve Bayes:
  ĉ = argmax_{c ∈ C} P(d|c) P(c)
Discriminative task formulation
• Given document d and a set of class labels C, assign to d the most probable label ĉ by modelling the conditional probability P(c|d) directly:
  ĉ = argmax_{c ∈ C} P(c|d)
10
Logistic regression
Generative vs discriminative models
• Generative (joint) models – P(c, d)
  • Model the probability of both the input document and the output label
  • Can be used to generate a new document with a particular label
  • N-gram models, HMMs, PCFGs, Naïve Bayes
• Discriminative (conditional) models – P(c|d)
  • Learn boundaries between classes; the input data is taken as given
  • Logistic regression, maximum entropy models, CRFs
11
How to represent document d?
• Bag of Words (BOW)
  • Easy, no effort required
  • Ignores sentence structure
• Hand-crafted features
  • Can use the NLP pipeline and class-specific features
  • Incomplete; depends on the NLP pipeline
• Learned feature representation
  • Can learn to contain all relevant information
  • Needs to be learned
12
BOW representations
• Binary BOW
• Multinomial BOW
• Tf-idf
13
Binary BOW:      the your model cash Viagra class account orderz
                  1    1    0    1     1     0      1      1
Multinomial BOW: the your model cash Viagra class account orderz
                 14    2    0    1     3     0      1      1
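The two BOW variants above can be sketched in a few lines of plain Python; the vocabulary and the example document are taken from the table (the document is a hypothetical token list that reproduces the counts):

```python
from collections import Counter

def binary_bow(tokens, vocab):
    """1 if the term occurs in the document, else 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocab]

def multinomial_bow(tokens, vocab):
    """Raw count of each term in the document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]

vocab = ["the", "your", "model", "cash", "Viagra", "class", "account", "orderz"]
# a toy document with the same counts as the slide's example row
doc = ["the"] * 14 + ["your"] * 2 + ["cash"] + ["Viagra"] * 3 + ["account"] + ["orderz"]

print(binary_bow(doc, vocab))       # [1, 1, 0, 1, 1, 0, 1, 1]
print(multinomial_bow(doc, vocab))  # [14, 2, 0, 1, 3, 0, 1, 1]
```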
Tf-idf
• Tf-idf = tf (term frequency) × idf (inverse document frequency)
• Term frequency, e.g.:
  • Raw frequency
  • Binary
  • Raw frequency normalised by the document length
• Inverse document frequency: idf(t) = log(N / nt)
  • N – the number of documents
  • nt – the number of documents containing term t
14
Tf-idf example
• Tf-idf(doc4, the) =
• Tf-idf(doc4, your) =
• Tf-idf(doc4, model) =
• Tf-idf(doc4, cash) =
• Tf-idf(doc4, Viagra) =
• Tf-idf(doc4, class) =
• Tf-idf(doc4, account) =
• Tf-idf(doc4, orderz) =
15
the your model cash Viagra class account orderz
doc1 12 3 1 0 0 2 0 0
doc2 10 4 0 4 0 0 2 0
doc3 25 4 0 0 0 1 1 0
doc4 14 2 0 1 3 0 1 1
doc5 17 5 0 2 0 0 1 1
Tf-idf example
• Tf-idf(doc4, the) = 0
• Tf-idf(doc4, your) = 0
• Tf-idf(doc4, model) = 0
• Tf-idf(doc4, cash) = log(5/3) = 0.51
• Tf-idf(doc4, Viagra) = 3 · log(5) = 4.83
• Tf-idf(doc4, class) = 0
• Tf-idf(doc4, account) = log(5/4) = 0.22
• Tf-idf(doc4, orderz) = log(5/2) = 0.92
16
the your model cash Viagra class account orderz
doc1 12 3 1 0 0 2 0 0
doc2 10 4 0 4 0 0 2 0
doc3 25 4 0 0 0 1 1 0
doc4 14 2 0 1 3 0 1 1
doc5 17 5 0 2 0 0 1 1
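The example values can be reproduced with raw term frequency and the natural logarithm, which is what the slide's numbers imply. A minimal sketch over the table above:

```python
import math

# term counts from the table (zeros omitted)
docs = {
    "doc1": {"the": 12, "your": 3, "model": 1, "class": 2},
    "doc2": {"the": 10, "your": 4, "cash": 4, "account": 2},
    "doc3": {"the": 25, "your": 4, "class": 1, "account": 1},
    "doc4": {"the": 14, "your": 2, "cash": 1, "Viagra": 3, "account": 1, "orderz": 1},
    "doc5": {"the": 17, "your": 5, "cash": 2, "account": 1, "orderz": 1},
}

def tf_idf(doc_name, term):
    N = len(docs)                                                # number of documents
    n_t = sum(1 for counts in docs.values() if term in counts)   # documents containing term
    tf = docs[doc_name].get(term, 0)                             # raw term frequency
    return tf * math.log(N / n_t)

print(round(tf_idf("doc4", "Viagra"), 2))  # 4.83
print(round(tf_idf("doc4", "cash"), 2))    # 0.51
print(round(tf_idf("doc4", "the"), 2))     # 0.0 – 'the' occurs in every document
```

Note that "the" gets weight 0 despite its high frequency, because a term occurring in every document carries no discriminative information.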
Naïve Bayes
17
Naïve Bayes classifier
• Bayes rule: P(c|d) = P(d|c) P(c) / P(d)
• Ignore the denominator because it does not depend on the class label c:
  ĉ = argmax_{c ∈ C} P(d|c) P(c)
• Assume that the document is represented as a list of features: d = (f1, f2, …, fn)
• How to compute the likelihood P(f1, f2, …, fn | c)?
18
Naïve Bayes assumption
• Assume that the features are conditionally independent given the class label c:
  P(f1, f2, …, fn | c) = P(f1|c) · P(f2|c) · … · P(fn|c) = ∏i P(fi|c)
19
Full model
• Given a document with features f1, f2, …, fn and a set of labels C, choose:
  ĉ = argmax_{c ∈ C} P(c) ∏i P(fi|c)
20
Naïve Bayes as generative model
• Naïve Bayes model describes a generative process
• Assumes the data (features in each document) were generated as follows:
1. For each document, sample its class c from the prior P(c)
2. For each document, sample the value of each feature from P(fi|c)
21
Learning the class priors
• P(c) is usually estimated with MLE: P(c) = Nc / N
• Nc- the number of training documents with class c
• N – the total number of training documents
22
Learning the class priors: example
• Given training documents with class labels:
• P(spam) =
• P(not spam) =
23
the your model cash Viagra class account orderz Spam?
doc1 12 3 1 0 0 2 0 0 -
doc2 10 4 0 4 0 0 2 0 +
doc3 25 4 0 0 0 1 1 0 -
doc4 14 2 0 1 3 0 1 1 +
doc5 17 5 0 2 0 0 1 1 +
Learning the class priors: example
• Given training documents with class labels:
• P(spam) = 3/5
• P(not spam) = 2/5
24
the your model cash Viagra class account orderz Spam?
doc1 12 3 1 0 0 2 0 0 -
doc2 10 4 0 4 0 0 2 0 +
doc3 25 4 0 0 0 1 1 0 -
doc4 14 2 0 1 3 0 1 1 +
doc5 17 5 0 2 0 0 1 1 +
Learning the feature probabilities
• P(fi|c) are normally estimated using smoothing:
  P(fi|c) = (count(fi, c) + α) / (Σf∈F count(f, c) + α|F|)
• count(fi, c) – the number of times feature fi occurs with class c
• F – the set of possible features
• α - smoothing parameter, optimised on development set
25
Learning the feature probabilities: example
26
the your model cash Viagra class account orderz Spam?
doc1 12 3 1 0 0 2 0 0 -
doc2 10 4 0 4 0 0 2 0 +
doc3 25 4 0 0 0 1 1 0 -
doc4 14 2 0 1 3 0 1 1 +
doc5 17 5 0 2 0 0 1 1 +
Classifying a new document: example
• Test document d: get your cash and your orderz
• Keeping only the words that are features: [your, cash, your, orderz]
• Suppose that there are no features other than those given in the previous table
• Compute P(spam) · P(your|spam)² · P(cash|spam) · P(orderz|spam)
• Compute P(not spam) · P(your|not spam)² · P(cash|not spam) · P(orderz|not spam)
• Choose the class with the larger value
28
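The whole procedure – priors, smoothed feature probabilities, and classification in log space (to avoid underflow) – can be sketched on the toy spam table. The data and the test document are from the slides; the implementation details are a minimal illustration, not the only way to structure it:

```python
import math

# token counts per training document, with its label (+ = spam, - = not spam)
train = [
    ({"the": 12, "your": 3, "model": 1, "class": 2}, "-"),
    ({"the": 10, "your": 4, "cash": 4, "account": 2}, "+"),
    ({"the": 25, "your": 4, "class": 1, "account": 1}, "-"),
    ({"the": 14, "your": 2, "cash": 1, "Viagra": 3, "account": 1, "orderz": 1}, "+"),
    ({"the": 17, "your": 5, "cash": 2, "account": 1, "orderz": 1}, "+"),
]
vocab = ["the", "your", "model", "cash", "Viagra", "class", "account", "orderz"]
alpha = 1.0  # smoothing parameter

def train_nb(train):
    priors, feat_counts = {}, {}
    for counts, c in train:
        priors[c] = priors.get(c, 0) + 1
        fc = feat_counts.setdefault(c, {})
        for f, n in counts.items():
            fc[f] = fc.get(f, 0) + n
    N = len(train)
    priors = {c: n / N for c, n in priors.items()}  # MLE: P(c) = Nc / N
    return priors, feat_counts

def log_score(tokens, c, priors, feat_counts):
    fc = feat_counts[c]
    total = sum(fc.values())
    score = math.log(priors[c])
    for f in tokens:  # add-alpha smoothed log-likelihoods
        score += math.log((fc.get(f, 0) + alpha) / (total + alpha * len(vocab)))
    return score

priors, feat_counts = train_nb(train)
test_doc = ["your", "cash", "your", "orderz"]
pred = max(priors, key=lambda c: log_score(test_doc, c, priors, feat_counts))
print(pred)  # '+' – the test document is classified as spam
```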
Advantages of Naïve Bayes
• Very easy to implement (scikit-learn also has it)
• Very fast to train and test
• Might work reasonably well even when there is not that much training data
• One of the default baselines for any text classification task
29
Problems with Naïve Bayes
• It is naïve! The features usually are not really conditionally independent
• Consider the categories travel, finance, sport
• Are the following features independent given the category?
  • beach, sun, ski, snow, pitch, palm, football, relax, ocean
• Many feature types are themselves correlated (e.g. words and morphemes)
• The accuracy can be OK, but the classifier may become over-confident because it treats the information given by correlated features as independent sources of evidence
30
Logistic regression
31
Logistic regression
• Binary case: P(c|d) = σ(w · f(d)) = 1 / (1 + exp(−w · f(d)))
• w – parameter vector
• f(d) – document representation
32
https://en.wikipedia.org/wiki/Logistic_function
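A minimal sketch of binary logistic regression prediction: the logistic (sigmoid) function squashes the score w · f(d) into a probability. The weight vector below is hypothetical, chosen so that spam-indicative words get positive weights; f(d) is the multinomial BOW of the earlier example document:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, f_d):
    """P(spam | d) = sigmoid(w . f(d)) for a binary classifier."""
    z = sum(wi * fi for wi, fi in zip(w, f_d))
    return sigmoid(z)

# hypothetical weights over [the, your, model, cash, Viagra, class, account, orderz]
w = [0.0, 0.1, -0.5, 0.8, 1.5, -0.7, 0.2, 1.2]
f_d = [14, 2, 0, 1, 3, 0, 1, 1]   # multinomial BOW of the spam example document
print(predict(w, f_d))            # close to 1 -> classified as spam
```

In practice w is learned by maximising the conditional log-likelihood of the training data, typically with gradient-based optimisation and regularization.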
Advantages of logistic regression
• Still reasonably simple
• Results are interpretable (regularization makes it harder!)
• Features can be arbitrarily complex
• Does not assume statistical independence between features
33
Drawbacks of logistic regression
• Harder to learn than Naïve Bayes
• Feature engineering is difficult
• Extracting complex features can be expensive
34
Neural text classification
35
Recurrent neural language model
36
• Agnostic to actual recurrent function (vanilla RNN, LSTM, GRU, ..)
• Reads inputs x_i to accumulate state h_i and predicts output y_i
Text representation with RNN
37
• h_i is a function of x_{0:i} and h_{0:i-1}
• It contains information about all text read up to point i
• h_i is basically a representation of the text x_{0:i}
• Given input text x_1, …, x_n, h_n is the representation of the whole text
• This representation can be viewed as an alternative to other representations (BOW, tf-idf, …)
Text classification with an RNN
• h_n is given as input to a logistic regression (softmax) classifier
38
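The pipeline – run a vanilla RNN over the token vectors, take the final state h_n, and feed it to a softmax classifier – can be sketched as follows. All matrices and input vectors are hypothetical toy values; in a real model they would be learned:

```python
import math

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def rnn_encode(xs, Wx, Wh):
    """Vanilla RNN: h_i = tanh(Wx x_i + Wh h_{i-1}); returns the final state h_n."""
    h = [0.0] * len(Wh)
    for x in xs:
        h = [math.tanh(v) for v in add(matvec(Wx, x), matvec(Wh, h))]
    return h

def softmax(z):
    m = max(z)                                # subtract max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

# toy dimensions: 2-d word vectors, 3-d hidden state, 2 classes
Wx = [[0.5, -0.3], [0.1, 0.8], [-0.6, 0.2]]
Wh = [[0.1, 0.0, 0.2], [0.0, 0.1, -0.1], [0.3, -0.2, 0.1]]
Wc = [[1.0, -1.0, 0.5], [-0.5, 0.8, -0.2]]   # softmax classifier applied to h_n

xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # a 3-token "document"
h_n = rnn_encode(xs, Wx, Wh)
probs = softmax(matvec(Wc, h_n))
print(probs)  # a probability distribution over the 2 classes
```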
Training with cross-entropy loss
• L(d) = −Σ_c′ y_c′ log P(c′|d)
• y_c′ = 1 if c′ is equal to the correct class label c, and 0 otherwise
• The loss is 0 when the model predicts correctly: P(c|d) = 1
• The loss is large when P(c|d) is small, i.e. the model predicts incorrectly
40
Cross-entropy loss: example
• L(doc1) =
• L(doc2) =
41
Sports Travel Finance Politics Health
Doc1 0 0 1 0 0
Doc2 0 1 0 0 0
Predicted 0.1 0.02 0.87 0.006 0.004
Cross-entropy loss: example
• L(doc1) = -log(0.87) = 0.14
• L(doc2) = -log(0.02) = 3.91
42
Sports Travel Finance Politics Health
Doc1 0 0 1 0 0
Doc2 0 1 0 0 0
Predicted 0.1 0.02 0.87 0.006 0.004
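With one-hot gold labels, cross-entropy reduces to the negative log-probability of the correct class. The slide's numbers can be checked directly:

```python
import math

def cross_entropy(gold, predicted):
    """L = -sum_c' y_c' * log P(c'|d); with one-hot gold this is -log P(c|d)."""
    return -sum(y * math.log(p) for y, p in zip(gold, predicted) if y > 0)

# classes: [Sports, Travel, Finance, Politics, Health]
predicted = [0.1, 0.02, 0.87, 0.006, 0.004]

print(round(cross_entropy([0, 0, 1, 0, 0], predicted), 2))  # 0.14 – doc1, correct class gets high probability
print(round(cross_entropy([0, 1, 0, 0, 0], predicted), 2))  # 3.91 – doc2, correct class gets low probability
```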
Multi-label classification
• What if each document can have multiple labels?
  • Turn the multi-label classification problem into several binary classification problems
    • Use binary cross-entropy
  • Form “super-classes” from label combinations and perform multi-class classification
43
Dual Objective RNN
• In practice it may make sense to combine an LM objective with classifier training and to optimise the two losses jointly
44
Bidirectional RNN classifier
45
Non-sequential neural networks
• Recursive neural networks model the intrinsic hierarchical structure of language
• Convolutional neural networks were designed for image classification but can be adapted to sequential language data too
46
Recursive neural networks
• Composition follows syntax
• Composition function: xp = g(Wl xl + Wr xr)
  • xl – representation of the left branch
  • xr – representation of the right branch
  • Wl, Wr – parameter matrices
47
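A minimal sketch of the composition function with g = tanh, applied bottom-up over a small binary tree. The 2-d vectors and parameter matrices are hypothetical toy values:

```python
import math

def compose(xl, xr, Wl, Wr):
    """Parent representation: x_p = tanh(Wl x_l + Wr x_r)."""
    def matvec(M, v):
        return [sum(m * x for m, x in zip(row, v)) for row in M]
    return [math.tanh(a + b) for a, b in zip(matvec(Wl, xl), matvec(Wr, xr))]

# hypothetical 2x2 parameter matrices and 2-d leaf vectors
Wl = [[0.5, 0.1], [-0.2, 0.4]]
Wr = [[0.3, -0.1], [0.6, 0.2]]
leaf1, leaf2, leaf3 = [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]

# composition follows the (binary) syntax tree: ((leaf1 leaf2) leaf3)
node = compose(leaf1, leaf2, Wl, Wr)
root = compose(node, leaf3, Wl, Wr)
print(root)  # fixed-size representation of the whole tree
```

The same two matrices are reused at every node, so a tree of any shape yields a fixed-size vector that can be fed to a classifier, just like h_n in the RNN case.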
Convolutional neural networks
48
Evaluation
We have talked about it before
49
Evaluation methods and measures
• Extrinsic evaluation: measure effects on a downstream task
  • Difficult
• Intrinsic evaluation: measure the accuracy on a test set
  • Simple, but doesn’t always correlate with the tasks we are interested in
• Accuracy: not suitable when the classes are unbalanced
• Precision: the proportion of correctly predicted among all predicted
• Recall: the proportion of correctly predicted among all that could have been predicted
• F1-score: combines precision and recall
50
Precision, recall and F1-score
• TP – true positives – correctly predicted positives
• FP – false positives – negatives that were predicted as positives
• FN – false negatives – positives that were predicted as negatives
• Precision = TP / (TP + FP)
• Recall = TP / (TP + FN)
• F1 = 2 · Precision · Recall / (Precision + Recall)
51
Precision and recall: example
• True Positives =
• False Positives =
• False Negatives =
• Precision =
• Recall =
• F1-score =
52
Document about Sports?
Gold Y Y N N Y N N N Y N N
Predicted N Y N Y N N N N Y N N
Precision and recall: example
• True Positives = 2
• False Positives = 1
• False Negatives = 2
• Precision = 2/3
• Recall = 2/4
• F1-score = 4/7
53
Document about Sports?
Gold Y Y N N Y N N N Y N N
Predicted N Y N Y N N N N Y N N
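The counts and scores above follow directly from the gold/predicted rows:

```python
gold = ["Y", "Y", "N", "N", "Y", "N", "N", "N", "Y", "N", "N"]
pred = ["N", "Y", "N", "Y", "N", "N", "N", "N", "Y", "N", "N"]

tp = sum(g == "Y" and p == "Y" for g, p in zip(gold, pred))  # true positives
fp = sum(g == "N" and p == "Y" for g, p in zip(gold, pred))  # false positives
fn = sum(g == "Y" and p == "N" for g, p in zip(gold, pred))  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(tp, fp, fn)                       # 2 1 2
print(precision, recall, round(f1, 3))  # 0.666... 0.5 0.571 (= 4/7)
```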
Tuning precision vs recall
• The default classification threshold is 0.5
  • P(spam|doc) > 0.5 --> email is spam
  • P(spam|doc) <= 0.5 --> email is not spam
• The threshold can be tuned to trade off precision and recall:
  • Raise the threshold --> higher precision, lower recall
  • Lower the threshold --> lower precision, higher recall
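The effect of the threshold is easy to see on a handful of hypothetical classifier scores: raising it makes the classifier flag fewer emails, which tends to raise precision and lower recall.

```python
def classify(p_spam, threshold):
    return "spam" if p_spam > threshold else "not spam"

scores = [0.9, 0.6, 0.55, 0.4, 0.1]   # hypothetical P(spam|doc) for 5 emails

# raising the threshold makes the classifier more conservative: fewer flagged emails
print(sum(classify(p, 0.5) == "spam" for p in scores))  # 3
print(sum(classify(p, 0.7) == "spam" for p in scores))  # 1
```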
54
Precision-recall curve
55
https://stackoverflow.com/questions/33294574/good-roc-curve-but-poor-precision-recall-curve
Conclusion
• For solving a text classification task, several decisions have to be made:
  • Representation: BOW, linguistic features, distributed representations
  • Classifier: Naïve Bayes, logistic regression, SVM, decision tree
  • In case of a neural classifier, the architecture: sequential, recursive, CNN
  • Evaluation measure: accuracy, precision, recall, F1-score
    • Preferably compute all of them
56