Precision & Recall Sewoong Oh CSE/STAT 416 University of Washington


Page 1: Precision & Recall - University of Washington

Precision & Recall

Sewoong Oh

CSE/STAT 416, University of Washington

Page 2

Evaluating a classifier


Page 3

Example
• Consider a restaurant owner who wants to use ML to promote his business
• Crawl reviews of his restaurant from Yelp
• Classify each review as {positive, negative} sentiment
• Post the positive reviews on the web

• Need a sentiment classifier:

[Figure: a sentence sentiment classifier; input xi: "Easily best sushi in Seattle." → output ŷi: predicted sentiment]

Page 4

Classifier MODEL: sentences from all reviews for my restaurant are fed to the classifier.

Sentences predicted to be positive, ŷ = +1:
• Easily best sushi in Seattle.
• I like the interior decoration and the blackboard menu on the wall.
• All the sushi was delicious.
• The sushi was amazing, and the rice is just outstanding.

Sentences predicted to be negative, ŷ = -1:
• The seaweed salad was just OK, vegetable salad was just ordinary.
• My wife tried their ramen and it was pretty forgettable.
• The service is somewhat hectic.

• Among many choices of classifiers, logistic regression, decision trees, boosting, etc., which one should we use?

Page 5

Accuracy

Accuracy = (Ctp + Ctn) / N

• where True Positive Ctp is the number of examples in the (test or train) data with yi = +1 and ŷi = +1
• True Negative Ctn is the number of examples with yi = -1 and ŷi = -1
• we can use a random guess as a baseline, to get a sense of what counts as good accuracy
• for binary classification, a good classifier should at the minimum beat accuracy 1/2 (as that is what a random guess with no training gets)
• for k-ary classification, the baseline is accuracy 1/k (or equivalently error 1 - 1/k)
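As a quick sketch, the accuracy formula can be computed directly from label lists in the slides' +1/-1 convention; the labels below are made-up toy data, not the lecture's dataset.

```python
# Toy sketch of the accuracy formula; y and y_hat are made-up labels.
y     = [+1, +1, -1, -1, +1, -1]   # true labels y_i
y_hat = [+1, -1, -1, +1, +1, -1]   # predictions  ŷ_i

# C_tp counts y_i = +1 and ŷ_i = +1; C_tn counts y_i = -1 and ŷ_i = -1
c_tp = sum(1 for yi, yh in zip(y, y_hat) if yi == +1 and yh == +1)
c_tn = sum(1 for yi, yh in zip(y, y_hat) if yi == -1 and yh == -1)

accuracy = (c_tp + c_tn) / len(y)
```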

Page 6

But high accuracy does not always mean a good classifier

• If 99% of people do not have cancer, then predicting “no cancer” is 99% accurate, but meaningless

• In the restaurant example, the owner will choose 10 reviews to show on the website

• A widely used pair of performance measures is precision and recall
• Precision: did I (mistakenly) show a negative sentence?
• Recall: did I miss any (potentially great) positive sentences?

[Figure: all reviews broken into sentences, e.g. "Easily best sushi in Seattle.", "All the sushi was delicious.", "The seaweed salad was just OK, vegetable salad was just ordinary.", "The service is somewhat hectic."]

Confusion matrix, true label (rows) vs. predicted label (columns):

               ŷi = +1                 ŷi = -1
  yi = +1   True Positive (Ctp)     False Negative (Cfn)
  yi = -1   False Positive (Cfp)    True Negative (Ctn)
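The four cells of the table can be tallied in one pass; a minimal sketch with illustrative labels:

```python
# Tally the four confusion-matrix cells; labels are illustrative.
y     = [+1, +1, +1, +1, -1, -1, -1]   # true labels y_i
y_hat = [+1, +1, +1, -1, +1, -1, -1]   # predictions ŷ_i

counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
for yi, yh in zip(y, y_hat):
    if yi == +1 and yh == +1:
        counts["tp"] += 1          # true positive
    elif yi == -1 and yh == +1:
        counts["fp"] += 1          # false positive
    elif yi == +1 and yh == -1:
        counts["fn"] += 1          # false negative
    else:
        counts["tn"] += 1          # true negative
```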

Page 7

• We care about "how many negative reviews did I include in my list of n positively predicted examples?"
• precision is designed to answer this question: among the n predicted positives, n × precision are actually positive

[Figure: sentences predicted to be positive, ŷi = +1; ✘ only 4 out of 6 sentences predicted to be positive are actually positive ("The seaweed salad was just OK, vegetable salad was just ordinary." and "The service is somewhat hectic." are mistakes)]

Precision

precision = Ctp / (Ctp + Cfp)
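A one-line sketch of the formula, using the counts from the example above (4 true positives among 6 predicted positives):

```python
# Precision: of the sentences predicted positive, what fraction is
# actually positive? Counts mirror the slide's 4-out-of-6 example.
c_tp, c_fp = 4, 2
precision = c_tp / (c_tp + c_fp)
```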

Page 8

Recall

• We also care about "how many of the truly positive reviews did I miss?"
• Recall is designed to answer this question

recall = Ctp / (Ctp + Cfn)

[Figure: among the truly positive sentences (yi = +1) from all reviews for my restaurant, the classifier predicts some positive (ŷi = +1) and misses others (ŷi = -1); e.g. "My wife tried their ramen and it was delicious." and "The service was perfect." are positives the classifier labels negative]
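A matching sketch for recall, with illustrative counts (the number of missed positives is hypothetical):

```python
# Recall: of the truly positive sentences, how many did the
# classifier find? Counts here are illustrative.
c_tp, c_fn = 4, 2   # found 4 positives, missed 2
recall = c_tp / (c_tp + c_fn)
```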

Page 9

precision-recall curve
• given training data, each trained classifier achieves a certain (recall, precision) on the test data
• which we can plot in 2-d as below
• for both precision and recall, higher is better
• it is easy to achieve certain points in the graph
• most reasonable models will be incomparable

[Figure: 2-d plot with recall on the x-axis (0.0 to 1.0) and precision on the y-axis (0.0 to 1.0)]

precision = Ctp / (Ctp + Cfp), recall = Ctp / (Ctp + Cfn)

Page 10

Precision-Recall tradeoff

Want to find many positive sentences, but minimize the risk of incorrect predictions!

• OPTIMISTIC MODEL: finds all positive sentences, but includes many false positives (high recall, low precision)
• PESSIMISTIC MODEL: finds few positive sentences, but includes no false positives (high precision, low recall)

[Figure: the optimistic model predicts nearly every sentence positive (ŷi = +1), while the pessimistic model predicts positive only the sentences it is most sure about ("Easily best sushi in Seattle." and "The sushi was amazing, and the rice is just outstanding."), labeling everything else ŷi = -1; the two models sit at opposite ends of the precision-recall plot]

Page 11

How do we trade off precision and recall?
• In the case of logistic regression, we can choose how many positive predictions we want, and include the most confident ones first

• Definite: P(y = +1 | x = "The sushi & everything else were awesome!") = 0.99
• Not sure: P(y = +1 | x = "The sushi was good, the service was OK") = 0.55

These predicted probabilities can be used to trade off precision and recall.

Page 12

logistic regression trade-off

[Figure: the logistic function P(y = +1 | x) = 1 / (1 + e^(-w^T h(x))), plotted against the score w^T h(x); three choices of threshold on the probability (a pessimistic high threshold, the standard threshold, and an optimistic low threshold) give three different points on the precision-recall plot]
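The thresholding idea can be sketched as follows; the scores are hypothetical values of w^T h(x), and `predict` is an illustrative helper, not part of any library:

```python
import math

def sigmoid(score):
    """P(y = +1 | x) for a logistic-regression score w^T h(x)."""
    return 1.0 / (1.0 + math.exp(-score))

def predict(scores, t):
    """Label positive when the predicted probability clears threshold t."""
    return [+1 if sigmoid(s) >= t else -1 for s in scores]

scores = [3.2, 0.4, -0.1, -2.5]      # hypothetical w^T h(x) values

standard    = predict(scores, 0.5)   # the standard threshold
pessimistic = predict(scores, 0.9)   # only very confident positives
optimistic  = predict(scores, 0.1)   # almost everything positive
```

Raising t trades recall for precision; lowering it does the opposite.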

Page 13

precision-recall curve of a classifier
• given a classifier, we can traverse the thresholds to get a full curve

[Figure: the precision-recall curve of Classifier A, traced by varying the threshold t (t = 0.8, 0.6, 0.4, 0.2, 0.01) from the pessimistic end (high t) to the optimistic end (low t); the best classifier would sit at the top-right corner, precision = recall = 1]
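The traversal can be sketched by sweeping t over a grid; the probabilities and labels below are toy values, and `pr_at` is a hypothetical helper:

```python
# Sweep the threshold t to trace one classifier's precision-recall curve.
probs  = [0.95, 0.85, 0.70, 0.55, 0.40, 0.20]   # P(y = +1 | x)
labels = [+1,   +1,   -1,   +1,   -1,   -1]     # true labels

def pr_at(t):
    """(precision, recall) when predicting positive for probs >= t."""
    y_hat = [+1 if p >= t else -1 for p in probs]
    tp = sum(1 for yh, y in zip(y_hat, labels) if yh == +1 and y == +1)
    fp = sum(1 for yh, y in zip(y_hat, labels) if yh == +1 and y == -1)
    fn = sum(1 for yh, y in zip(y_hat, labels) if yh == -1 and y == +1)
    precision = tp / (tp + fp) if tp + fp else 1.0  # pessimistic limit
    recall = tp / (tp + fn)
    return precision, recall

# From pessimistic (high t) to optimistic (low t):
curve = [pr_at(t) for t in (0.9, 0.6, 0.3, 0.01)]
```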

Page 14

• Sometimes a classifier B is strictly better than classifier A

[Figure: Classifier B's precision-recall curve lies above Classifier A's at every recall, from the pessimistic end to the optimistic end]

Page 15

• but most of the time, two curves are not directly comparable

[Figure: the precision-recall curves of Classifier A and Classifier C cross; Classifier C is better in one recall range, Classifier A in the other]

Page 16

• So for comparisons, people use a single number:
  • F1-score
  • AUC
• Or use precision at k:
  • precision when the top k examples are chosen to be predicted as positive

Showing k = 5 sentences on the website, the sentences the model is most sure are positive:
• Easily best sushi in Seattle.
• All the sushi was delicious.
• The sushi was amazing, and the rice is just outstanding.
• The service was perfect.
• My wife tried their ramen and it was pretty forgettable.

One of the five is actually negative, so precision at k = 4/5 = 0.8.

FPR = Cfp / (Cfp + Ctn)

TPR = Ctp / (Ctp + Cfn) = recall

F1-score = 2 × (precision × recall) / (precision + recall)
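Both single-number summaries are easy to compute; the precision/recall values and the ranked labels below are illustrative, chosen so that precision at k matches the 0.8 above:

```python
# F1-score from the formula above; inputs are illustrative.
precision, recall = 0.75, 0.6
f1 = 2 * precision * recall / (precision + recall)

# Precision at k: predict positive for the k examples the model is
# most confident about (labels already sorted by model confidence).
ranked_labels = [+1, +1, +1, +1, -1, -1, +1]
k = 5
precision_at_k = sum(1 for y in ranked_labels[:k] if y == +1) / k
```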