Text Classification
Natural Language Processing: Lecture 9
02.11.2017
Kairit Sirts
Text/document classification
• Spam detection
• Topic classification (sport/finance/travel etc)
• Genre classification (news/sports/fiction/social media etc)
• Sentiment analysis
5
Authorship attribution
• Native language identification
• Clinical text classification – diagnosing psychiatric or cognitive impairments
• Identification of gender, dialect, educational background
6
Types of classification tasks
• Binary classification (true/false, 1/0, 1/-1)
  • Naïve Bayes
  • Logistic regression
• Multi-class classification (politics/finance/travel)
• Multi-label classification (image captioning)
• Clustering: mostly unsupervised
  • Topic modeling – important, but I will not talk about it today
7
Topics
• Generative vs discriminative classifiers
• Document representation
• Naïve Bayes classification
• Logistic regression
• Neural text classification
• Evaluation
• Slides mostly based on:
  • Text classification with Naïve Bayes by Sharon Goldwater
  • Text classification by Karl Moritz Hermann
8
Generative task formulation
• Given document d and a set of class labels C, assign to d the most probable label ĉ:
  ĉ = argmax_{c ∈ C} P(c|d)
9
Bayes rule
• P(c|d) = P(d|c) P(c) / P(d)
  • P(c|d) – posterior probability
  • P(d|c) – likelihood
  • P(c) – prior probability
• Dropping the denominator P(d), which is constant across classes, gives the decision rule used by Naïve Bayes:
  ĉ = argmax_{c ∈ C} P(d|c) P(c)
Discriminative task formulation
• Given document d and a set of class labels C, assign to d the most probable label ĉ by modelling the conditional probability P(c|d) directly:
  ĉ = argmax_{c ∈ C} P(c|d)
10
Logistic regression
Generative vs discriminative models
• Generative (joint) models – P(c, d)
  • Model the probability of both the input document and the output label
  • Can be used to generate a new document with a particular label
  • N-gram models, HMMs, PCFGs, Naïve Bayes
• Discriminative (conditional) models – P(c|d)
  • Learn boundaries between classes; the input data is taken as given
  • Logistic regression, maximum entropy models, CRFs
11
How to represent document d?
• Bag of Words (BOW)
  • Easy, no effort required
  • Ignores sentence structure
• Hand-crafted features
  • Can use the NLP pipeline and class-specific features
  • Incomplete; depends on the NLP pipeline
• Learned feature representation
  • Can learn to contain all relevant information
  • Needs to be learned
12
BOW representations
• Binary BOW
• Multinomial BOW
• Tf-idf
13
Binary BOW:      the your model cash Viagra class account orderz
                  1    1    0    1     1     0      1      1
Multinomial BOW: the your model cash Viagra class account orderz
                 14    2    0    1     3     0      1      1
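The two BOW variants above can be sketched in a few lines of plain Python; the vocabulary and the example document are taken from the table (the document is a hypothetical token list that reproduces the counts):

```python
from collections import Counter

def binary_bow(tokens, vocab):
    """1 if the term occurs in the document, else 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocab]

def multinomial_bow(tokens, vocab):
    """Raw count of each term in the document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]

vocab = ["the", "your", "model", "cash", "Viagra", "class", "account", "orderz"]
# a toy document with the same counts as the slide's example row
doc = ["the"] * 14 + ["your"] * 2 + ["cash"] + ["Viagra"] * 3 + ["account"] + ["orderz"]

print(binary_bow(doc, vocab))       # [1, 1, 0, 1, 1, 0, 1, 1]
print(multinomial_bow(doc, vocab))  # [14, 2, 0, 1, 3, 0, 1, 1]
```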
Tf-idf
• Tf-idf = tf (term frequency) × idf (inverse document frequency)
• Term frequency, e.g.:
  • Raw frequency
  • Binary
  • Raw frequency normalised by the document length
• Inverse document frequency: idf(t) = log(N / nt)
  • N – the number of documents
  • nt – the number of documents containing term t
14
Tf-idf example
• Tf-idf(doc4, the) =
• Tf-idf(doc4, your) =
• Tf-idf(doc4, model) =
• Tf-idf(doc4, cash) =
• Tf-idf(doc4, Viagra) =
• Tf-idf(doc4, class) =
• Tf-idf(doc4, account) =
• Tf-idf(doc4, orderz) =
15
the your model cash Viagra class account orderz
doc1 12 3 1 0 0 2 0 0
doc2 10 4 0 4 0 0 2 0
doc3 25 4 0 0 0 1 1 0
doc4 14 2 0 1 3 0 1 1
doc5 17 5 0 2 0 0 1 1
Tf-idf example
• Tf-idf(doc4, the) = 0
• Tf-idf(doc4, your) = 0
• Tf-idf(doc4, model) = 0
• Tf-idf(doc4, cash) = log(5/3) = 0.51
• Tf-idf(doc4, Viagra) = 3 · log(5) = 4.83
• Tf-idf(doc4, class) = 0
• Tf-idf(doc4, account) = log(5/4) = 0.22
• Tf-idf(doc4, orderz) = log(5/2) = 0.92
16
the your model cash Viagra class account orderz
doc1 12 3 1 0 0 2 0 0
doc2 10 4 0 4 0 0 2 0
doc3 25 4 0 0 0 1 1 0
doc4 14 2 0 1 3 0 1 1
doc5 17 5 0 2 0 0 1 1
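The example values can be reproduced with raw term frequency and the natural logarithm, which is what the slide's numbers imply. A minimal sketch over the table above:

```python
import math

# term counts from the table (zeros omitted)
docs = {
    "doc1": {"the": 12, "your": 3, "model": 1, "class": 2},
    "doc2": {"the": 10, "your": 4, "cash": 4, "account": 2},
    "doc3": {"the": 25, "your": 4, "class": 1, "account": 1},
    "doc4": {"the": 14, "your": 2, "cash": 1, "Viagra": 3, "account": 1, "orderz": 1},
    "doc5": {"the": 17, "your": 5, "cash": 2, "account": 1, "orderz": 1},
}

def tf_idf(doc_name, term):
    N = len(docs)                                                # number of documents
    n_t = sum(1 for counts in docs.values() if term in counts)   # documents containing term
    tf = docs[doc_name].get(term, 0)                             # raw term frequency
    return tf * math.log(N / n_t)

print(round(tf_idf("doc4", "Viagra"), 2))  # 4.83
print(round(tf_idf("doc4", "cash"), 2))    # 0.51
print(round(tf_idf("doc4", "the"), 2))     # 0.0 – 'the' occurs in every document
```

Note that "the" gets weight 0 despite its high frequency, because a term occurring in every document carries no discriminative information.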
Naïve Bayes
17
Naïve Bayes classifier
• Bayes rule: P(c|d) = P(d|c) P(c) / P(d)
• Ignore the denominator because it does not depend on the class label c:
  ĉ = argmax_{c ∈ C} P(d|c) P(c)
• Assume that the document is represented as a list of features: d = (f1, f2, …, fn)
• How to compute the likelihood P(f1, f2, …, fn | c)?
18
Naïve Bayes assumption
• Assume that the features are conditionally independent given the class label c:
  P(f1, f2, …, fn | c) = P(f1|c) · P(f2|c) · … · P(fn|c) = ∏i P(fi|c)
19
Full model
• Given a document with features f1, f2, …, fn and a set of labels C, choose:
  ĉ = argmax_{c ∈ C} P(c) ∏i P(fi|c)
20
Naïve Bayes as generative model
• Naïve Bayes model describes a generative process
• Assumes the data (features in each document) were generated as follows:
1. For each document, sample its class c from the prior P(c)
2. For each document, sample the value of each feature from P(fi|c)
21
Learning the class priors
• P(c) is usually estimated with MLE: P(c) = Nc / N
• Nc- the number of training documents with class c
• N – the total number of training documents
22
Learning the class priors: example
• Given training documents with class labels:
• P(spam) =
• P(not spam) =
23
the your model cash Viagra class account orderz Spam?
doc1 12 3 1 0 0 2 0 0 -
doc2 10 4 0 4 0 0 2 0 +
doc3 25 4 0 0 0 1 1 0 -
doc4 14 2 0 1 3 0 1 1 +
doc5 17 5 0 2 0 0 1 1 +
Learning the class priors: example
• Given training documents with class labels:
• P(spam) = 3/5
• P(not spam) = 2/5
24
the your model cash Viagra class account orderz Spam?
doc1 12 3 1 0 0 2 0 0 -
doc2 10 4 0 4 0 0 2 0 +
doc3 25 4 0 0 0 1 1 0 -
doc4 14 2 0 1 3 0 1 1 +
doc5 17 5 0 2 0 0 1 1 +
Learning the feature probabilities
• P(fi|c) are normally estimated using smoothing:
  P(fi|c) = (count(fi, c) + α) / (Σf∈F count(f, c) + α|F|)
• count(fi, c) – the number of times feature fi occurs with class c
• F – the set of possible features
• α - smoothing parameter, optimised on development set
25
Learning the feature probabilities: example
26
the your model cash Viagra class account orderz Spam?
doc1 12 3 1 0 0 2 0 0 -
doc2 10 4 0 4 0 0 2 0 +
doc3 25 4 0 0 0 1 1 0 -
doc4 14 2 0 1 3 0 1 1 +
doc5 17 5 0 2 0 0 1 1 +
Classifying a new document: example
• Test document d: get your cash and your orderz
• Keeping only the words that are features: [your, cash, your, orderz]
• Suppose that there are no features other than those given in the previous table
• Compute P(spam) · P(your|spam)² · P(cash|spam) · P(orderz|spam)
• Compute P(not spam) · P(your|not spam)² · P(cash|not spam) · P(orderz|not spam)
• Choose the class with the larger value
28
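The whole procedure – priors, smoothed feature probabilities, and classification in log space (to avoid underflow) – can be sketched on the toy spam table. The data and the test document are from the slides; the implementation details are a minimal illustration, not the only way to structure it:

```python
import math

# token counts per training document, with its label (+ = spam, - = not spam)
train = [
    ({"the": 12, "your": 3, "model": 1, "class": 2}, "-"),
    ({"the": 10, "your": 4, "cash": 4, "account": 2}, "+"),
    ({"the": 25, "your": 4, "class": 1, "account": 1}, "-"),
    ({"the": 14, "your": 2, "cash": 1, "Viagra": 3, "account": 1, "orderz": 1}, "+"),
    ({"the": 17, "your": 5, "cash": 2, "account": 1, "orderz": 1}, "+"),
]
vocab = ["the", "your", "model", "cash", "Viagra", "class", "account", "orderz"]
alpha = 1.0  # smoothing parameter

def train_nb(train):
    priors, feat_counts = {}, {}
    for counts, c in train:
        priors[c] = priors.get(c, 0) + 1
        fc = feat_counts.setdefault(c, {})
        for f, n in counts.items():
            fc[f] = fc.get(f, 0) + n
    N = len(train)
    priors = {c: n / N for c, n in priors.items()}  # MLE: P(c) = Nc / N
    return priors, feat_counts

def log_score(tokens, c, priors, feat_counts):
    fc = feat_counts[c]
    total = sum(fc.values())
    score = math.log(priors[c])
    for f in tokens:  # add-alpha smoothed log-likelihoods
        score += math.log((fc.get(f, 0) + alpha) / (total + alpha * len(vocab)))
    return score

priors, feat_counts = train_nb(train)
test_doc = ["your", "cash", "your", "orderz"]
pred = max(priors, key=lambda c: log_score(test_doc, c, priors, feat_counts))
print(pred)  # '+' – the test document is classified as spam
```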
Advantages of Naïve Bayes
• Very easy to implement (scikit-learn also has it)
• Very fast to train and test
• Might work reasonably well even when there is not that much training data
• One of the default baselines for any text classification task
29
Problems with Naïve Bayes
• It is naïve! The features usually are not really conditionally independent
• Consider the categories travel, finance, sport
• Are the following features independent given the category?
  • beach, sun, ski, snow, pitch, palm, football, relax, ocean
• Many feature types are themselves correlated (e.g. words and morphemes)
• The accuracy can be OK, but the classifier may become over-confident because it treats the information given by correlated features as independent sources of evidence
30
Logistic regression
31
Logistic regression
• Binary case: P(c|d) = σ(w · f(d)) = 1 / (1 + exp(−w · f(d)))
• w – parameter vector
• f(d) – document representation
32
https://en.wikipedia.org/wiki/Logistic_function
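A minimal sketch of binary logistic regression prediction: the logistic (sigmoid) function squashes the score w · f(d) into a probability. The weight vector below is hypothetical, chosen so that spam-indicative words get positive weights; f(d) is the multinomial BOW of the earlier example document:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, f_d):
    """P(spam | d) = sigmoid(w . f(d)) for a binary classifier."""
    z = sum(wi * fi for wi, fi in zip(w, f_d))
    return sigmoid(z)

# hypothetical weights over [the, your, model, cash, Viagra, class, account, orderz]
w = [0.0, 0.1, -0.5, 0.8, 1.5, -0.7, 0.2, 1.2]
f_d = [14, 2, 0, 1, 3, 0, 1, 1]   # multinomial BOW of the spam example document
print(predict(w, f_d))            # close to 1 -> classified as spam
```

In practice w is learned by maximising the conditional log-likelihood of the training data, typically with gradient-based optimisation and regularization.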
Advantages of logistic regression
• Still reasonably simple
• Results are interpretable (regularization makes it harder!)
• Features can be arbitrarily complex
• Does not assume statistical independence between features
33
Drawbacks of logistic regression
• Harder to learn than Naïve Bayes
• Feature engineering is difficult
• Extracting complex features can be expensive
34
Neural text classification
35
Recurrent neural language model
36
• Agnostic to actual recurrent function (vanilla RNN, LSTM, GRU, ..)
• Reads inputs x_i to accumulate state h_i and predicts output y_i
Text representation with RNN
37
• h_i is a function of x_{0:i} and h_{0:i-1}
• It contains information about all text read up to point i
• h_i is basically a representation of the text x_{0:i}
• Given input text x_1, …, x_n, h_n is the representation of the whole text
• This representation can be viewed as an alternative to other representations (BOW, tf-idf, …)
Text classification with an RNN
• h_n is given as input to a logistic regression (softmax) classifier
38
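The pipeline – run a vanilla RNN over the token vectors, take the final state h_n, and feed it to a softmax classifier – can be sketched as follows. All matrices and input vectors are hypothetical toy values; in a real model they would be learned:

```python
import math

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def rnn_encode(xs, Wx, Wh):
    """Vanilla RNN: h_i = tanh(Wx x_i + Wh h_{i-1}); returns the final state h_n."""
    h = [0.0] * len(Wh)
    for x in xs:
        h = [math.tanh(v) for v in add(matvec(Wx, x), matvec(Wh, h))]
    return h

def softmax(z):
    m = max(z)                                # subtract max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

# toy dimensions: 2-d word vectors, 3-d hidden state, 2 classes
Wx = [[0.5, -0.3], [0.1, 0.8], [-0.6, 0.2]]
Wh = [[0.1, 0.0, 0.2], [0.0, 0.1, -0.1], [0.3, -0.2, 0.1]]
Wc = [[1.0, -1.0, 0.5], [-0.5, 0.8, -0.2]]   # softmax classifier applied to h_n

xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # a 3-token "document"
h_n = rnn_encode(xs, Wx, Wh)
probs = softmax(matvec(Wc, h_n))
print(probs)  # a probability distribution over the 2 classes
```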
Training with cross-entropy loss
• L(d) = −Σ_c′ y_c′ log P(c′|d)
• y_c′ = 1 if c′ is equal to the correct class label c, and 0 otherwise
• The loss is 0 when the model predicts correctly: P(c|d) = 1
• The loss is large when P(c|d) is small, i.e. the model predicts incorrectly
40
Cross-entropy loss: example
• L(doc1) =
• L(doc2) =
41
Sports Travel Finance Politics Health
Doc1 0 0 1 0 0
Doc2 0 1 0 0 0
Predicted 0.1 0.02 0.87 0.006 0.004
Cross-entropy loss: example
• L(doc1) = -log(0.87) = 0.14
• L(doc2) = -log(0.02) = 3.91
42
Sports Travel Finance Politics Health
Doc1 0 0 1 0 0
Doc2 0 1 0 0 0
Predicted 0.1 0.02 0.87 0.006 0.004
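With one-hot gold labels, cross-entropy reduces to the negative log-probability of the correct class. The slide's numbers can be checked directly:

```python
import math

def cross_entropy(gold, predicted):
    """L = -sum_c' y_c' * log P(c'|d); with one-hot gold this is -log P(c|d)."""
    return -sum(y * math.log(p) for y, p in zip(gold, predicted) if y > 0)

# classes: [Sports, Travel, Finance, Politics, Health]
predicted = [0.1, 0.02, 0.87, 0.006, 0.004]

print(round(cross_entropy([0, 0, 1, 0, 0], predicted), 2))  # 0.14 – doc1, correct class gets high probability
print(round(cross_entropy([0, 1, 0, 0, 0], predicted), 2))  # 3.91 – doc2, correct class gets low probability
```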
Multi-label classification
• What if each document can have multiple labels?
  • Turn the multi-label classification problem into several binary classification problems
    • Use binary cross-entropy
  • Form “super-classes” from label combinations and perform multi-class classification
43
Dual Objective RNN
• In practice it may make sense to combine an LM objective with classifier training and to optimise the two losses jointly
44
Bidirectional RNN classifier
45
Non-sequential neural networks
• Recursive neural networks model the intrinsic hierarchical structure of language
• Convolutional neural networks were designed for image classification but can be adapted to sequential language data too
46
Recursive neural networks
• Composition follows syntax
• Composition function: xp = g(Wl xl + Wr xr)
  • xl – representation of the left branch
  • xr – representation of the right branch
  • Wl, Wr – parameter matrices
47
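A minimal sketch of the composition function with g = tanh, applied bottom-up over a small binary tree. The 2-d vectors and parameter matrices are hypothetical toy values:

```python
import math

def compose(xl, xr, Wl, Wr):
    """Parent representation: x_p = tanh(Wl x_l + Wr x_r)."""
    def matvec(M, v):
        return [sum(m * x for m, x in zip(row, v)) for row in M]
    return [math.tanh(a + b) for a, b in zip(matvec(Wl, xl), matvec(Wr, xr))]

# hypothetical 2x2 parameter matrices and 2-d leaf vectors
Wl = [[0.5, 0.1], [-0.2, 0.4]]
Wr = [[0.3, -0.1], [0.6, 0.2]]
leaf1, leaf2, leaf3 = [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]

# composition follows the (binary) syntax tree: ((leaf1 leaf2) leaf3)
node = compose(leaf1, leaf2, Wl, Wr)
root = compose(node, leaf3, Wl, Wr)
print(root)  # fixed-size representation of the whole tree
```

The same two matrices are reused at every node, so a tree of any shape yields a fixed-size vector that can be fed to a classifier, just like h_n in the RNN case.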
Convolutional neural networks
48
Evaluation
We have talked about it before
49
Evaluation methods and measures
• Extrinsic evaluation: measure effects on a downstream task
  • Difficult
• Intrinsic evaluation: measure the accuracy on a test set
  • Simple, but doesn’t always correlate with the tasks we are interested in
• Accuracy: not suitable when the classes are unbalanced
• Precision: the proportion of correctly predicted among all predicted
• Recall: the proportion of correctly predicted among all that could have been predicted
• F1-score: combines precision and recall
50
Precision, recall and F1-score
• TP – true positives – correctly predicted positives
• FP – false positives – negatives that were predicted as positives
• FN – false negatives – positives that were predicted as negatives
• Precision = TP / (TP + FP)
• Recall = TP / (TP + FN)
• F1 = 2 · Precision · Recall / (Precision + Recall)
51
Precision and recall: example
• True Positives =
• False Positives =
• False Negatives =
• Precision =
• Recall =
• F1-score =
52
Document about Sports?
Gold Y Y N N Y N N N Y N N
Predicted N Y N Y N N N N Y N N
Precision and recall: example
• True Positives = 2
• False Positives = 1
• False Negatives = 2
• Precision = 2/3
• Recall = 2/4
• F1-score = 4/7
53
Document about Sports?
Gold Y Y N N Y N N N Y N N
Predicted N Y N Y N N N N Y N N
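The counts and scores above follow directly from the gold/predicted rows:

```python
gold = ["Y", "Y", "N", "N", "Y", "N", "N", "N", "Y", "N", "N"]
pred = ["N", "Y", "N", "Y", "N", "N", "N", "N", "Y", "N", "N"]

tp = sum(g == "Y" and p == "Y" for g, p in zip(gold, pred))  # true positives
fp = sum(g == "N" and p == "Y" for g, p in zip(gold, pred))  # false positives
fn = sum(g == "Y" and p == "N" for g, p in zip(gold, pred))  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(tp, fp, fn)                       # 2 1 2
print(precision, recall, round(f1, 3))  # 0.666... 0.5 0.571 (= 4/7)
```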
Tuning precision vs recall
• The default classification threshold is 0.5
  • P(spam|doc) > 0.5 --> email is spam
  • P(spam|doc) <= 0.5 --> email is not spam
• The threshold can be tuned to trade off precision and recall:
  • Raise the threshold --> higher precision, lower recall
  • Lower the threshold --> lower precision, higher recall
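The effect of the threshold is easy to see on a handful of hypothetical classifier scores: raising it makes the classifier flag fewer emails, which tends to raise precision and lower recall.

```python
def classify(p_spam, threshold):
    return "spam" if p_spam > threshold else "not spam"

scores = [0.9, 0.6, 0.55, 0.4, 0.1]   # hypothetical P(spam|doc) for 5 emails

# raising the threshold makes the classifier more conservative: fewer flagged emails
print(sum(classify(p, 0.5) == "spam" for p in scores))  # 3
print(sum(classify(p, 0.7) == "spam" for p in scores))  # 1
```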
54
Precision-recall curve
55
https://stackoverflow.com/questions/33294574/good-roc-curve-but-poor-precision-recall-curve
Conclusion
• For solving a text classification task, several decisions have to be made:
  • Representation: BOW, linguistic features, distributed representations
  • Classifier: Naïve Bayes, logistic regression, SVM, decision tree
  • In case of a neural classifier, the architecture: sequential, recursive, CNN
  • Evaluation measure: accuracy, precision, recall, F1-score
    • Preferably compute all of them
56