Precision & Recall Sewoong Oh CSE/STAT 416 University of Washington


Page 1: Precision & Recall - University of Washington

Precision & Recall

Sewoong Oh

CSE/STAT 416, University of Washington

Page 2

Evaluating a classifier


Page 3

Example
• Consider a restaurant owner who wants to use ML to promote his business
• Crawl reviews of his restaurant from Yelp
• Classify each review as {positive, negative} sentiment
• Post the positive reviews on the web

• Need a sentiment classifier:

[Figure: a sentence sentiment classifier; input xi: "Easily best sushi in Seattle." → output ŷi: predicted sentiment]

Page 4

Classifier MODEL: sentences from all reviews for my restaurant are fed to the classifier.

Sentences predicted to be positive, ŷ = +1:
• Easily best sushi in Seattle.
• I like the interior decoration and the blackboard menu on the wall.
• All the sushi was delicious.
• The sushi was amazing, and the rice is just outstanding.

Sentences predicted to be negative, ŷ = -1:
• The seaweed salad was just OK, vegetable salad was just ordinary.
• My wife tried their ramen and it was pretty forgettable.
• The service is somewhat hectic.

• Among many choices of classifiers, logistic regression, decision trees, boosting, etc., which one should we use?

Page 5

Accuracy

Accuracy = (Ctp + Ctn) / N

• where True Positive Ctp is the number of examples in the (test or train) data with yi = +1 and ŷi = +1
• True Negative Ctn is the number of examples with yi = -1 and ŷi = -1
• we can use a random guess as a baseline, to get a sense of what counts as good accuracy
• for binary classification, a good classifier should at the minimum beat accuracy 1/2 (as that is what a random guess with no training gets)
• for k-ary classification, the baseline is accuracy 1/k (or equivalently error 1 - 1/k)
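As a quick sketch, the accuracy formula can be computed directly from label lists in the slides' +1/-1 convention; the labels below are made-up toy data, not the lecture's dataset.

```python
# Toy sketch of the accuracy formula; y and y_hat are made-up labels.
y     = [+1, +1, -1, -1, +1, -1]   # true labels y_i
y_hat = [+1, -1, -1, +1, +1, -1]   # predictions  ŷ_i

# C_tp counts y_i = +1 and ŷ_i = +1; C_tn counts y_i = -1 and ŷ_i = -1
c_tp = sum(1 for yi, yh in zip(y, y_hat) if yi == +1 and yh == +1)
c_tn = sum(1 for yi, yh in zip(y, y_hat) if yi == -1 and yh == -1)

accuracy = (c_tp + c_tn) / len(y)
```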

Page 6

But high accuracy does not always mean a good classifier

• If 99% of people do not have cancer, then predicting “no cancer” is 99% accurate, but meaningless

• In the restaurant example, the owner will choose 10 reviews to show on the website

• A widely used pair of performance measures is precision and recall
• Precision: did I (mistakenly) show a negative sentence?
• Recall: did I miss any (potentially great) positive sentences?

[Figure: all reviews broken into sentences, e.g. "Easily best sushi in Seattle.", "All the sushi was delicious.", "The seaweed salad was just OK, vegetable salad was just ordinary.", "The service is somewhat hectic."]

Confusion matrix, true label (rows) vs. predicted label (columns):

               ŷi = +1                 ŷi = -1
  yi = +1   True Positive (Ctp)     False Negative (Cfn)
  yi = -1   False Positive (Cfp)    True Negative (Ctn)
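The four cells of the table can be tallied in one pass; a minimal sketch with illustrative labels:

```python
# Tally the four confusion-matrix cells; labels are illustrative.
y     = [+1, +1, +1, +1, -1, -1, -1]   # true labels y_i
y_hat = [+1, +1, +1, -1, +1, -1, -1]   # predictions ŷ_i

counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
for yi, yh in zip(y, y_hat):
    if yi == +1 and yh == +1:
        counts["tp"] += 1          # true positive
    elif yi == -1 and yh == +1:
        counts["fp"] += 1          # false positive
    elif yi == +1 and yh == -1:
        counts["fn"] += 1          # false negative
    else:
        counts["tn"] += 1          # true negative
```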

Page 7

• We care about "how many negative reviews did I include in my list of n positively predicted examples?"
• precision is designed to answer this question: among the n predicted positives, n × precision are actually positive

[Figure: sentences predicted to be positive, ŷi = +1; ✘ only 4 out of 6 sentences predicted to be positive are actually positive ("The seaweed salad was just OK, vegetable salad was just ordinary." and "The service is somewhat hectic." are mistakes)]

Precision

precision = Ctp / (Ctp + Cfp)
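A one-line sketch of the formula, using the counts from the example above (4 true positives among 6 predicted positives):

```python
# Precision: of the sentences predicted positive, what fraction is
# actually positive? Counts mirror the slide's 4-out-of-6 example.
c_tp, c_fp = 4, 2
precision = c_tp / (c_tp + c_fp)
```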

Page 8

Recall

• We also care about "how many of the truly positive reviews did I miss?"
• Recall is designed to answer this question

recall = Ctp / (Ctp + Cfn)

[Figure: among the truly positive sentences (yi = +1) from all reviews for my restaurant, the classifier predicts some positive (ŷi = +1) and misses others (ŷi = -1); e.g. "My wife tried their ramen and it was delicious." and "The service was perfect." are positives the classifier labels negative]
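A matching sketch for recall, with illustrative counts (the number of missed positives is hypothetical):

```python
# Recall: of the truly positive sentences, how many did the
# classifier find? Counts here are illustrative.
c_tp, c_fn = 4, 2   # found 4 positives, missed 2
recall = c_tp / (c_tp + c_fn)
```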

Page 9

precision-recall curve
• given training data, each trained classifier achieves a certain (recall, precision) on the test data
• which we can plot in 2-d as below
• for both precision and recall, higher is better
• it is easy to achieve certain points in the graph
• most reasonable models will be incomparable

[Figure: 2-d plot with recall on the x-axis (0.0 to 1.0) and precision on the y-axis (0.0 to 1.0)]

precision = Ctp / (Ctp + Cfp), recall = Ctp / (Ctp + Cfn)

Page 10

Precision-Recall tradeoff

Want to find many positive sentences, but minimize the risk of incorrect predictions!

• OPTIMISTIC MODEL: finds all positive sentences, but includes many false positives (high recall, low precision)
• PESSIMISTIC MODEL: finds few positive sentences, but includes no false positives (high precision, low recall)

[Figure: the optimistic model predicts nearly every sentence positive (ŷi = +1), while the pessimistic model predicts positive only the sentences it is most sure about ("Easily best sushi in Seattle." and "The sushi was amazing, and the rice is just outstanding."), labeling everything else ŷi = -1; the two models sit at opposite ends of the precision-recall plot]

Page 11

How do we trade off precision and recall?
• In the case of logistic regression, we can choose how many positive predictions we want, and include the most confident ones first

• Definite: P(y = +1 | x = "The sushi & everything else were awesome!") = 0.99
• Not sure: P(y = +1 | x = "The sushi was good, the service was OK") = 0.55

These predicted probabilities can be used to trade off precision and recall.

Page 12

logistic regression trade-off

[Figure: the logistic function P(y = +1 | x) = 1 / (1 + e^(-w^T h(x))), plotted against the score w^T h(x); three choices of threshold on the probability (a pessimistic high threshold, the standard threshold, and an optimistic low threshold) give three different points on the precision-recall plot]
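The thresholding idea can be sketched as follows; the scores are hypothetical values of w^T h(x), and `predict` is an illustrative helper, not part of any library:

```python
import math

def sigmoid(score):
    """P(y = +1 | x) for a logistic-regression score w^T h(x)."""
    return 1.0 / (1.0 + math.exp(-score))

def predict(scores, t):
    """Label positive when the predicted probability clears threshold t."""
    return [+1 if sigmoid(s) >= t else -1 for s in scores]

scores = [3.2, 0.4, -0.1, -2.5]      # hypothetical w^T h(x) values

standard    = predict(scores, 0.5)   # the standard threshold
pessimistic = predict(scores, 0.9)   # only very confident positives
optimistic  = predict(scores, 0.1)   # almost everything positive
```

Raising t trades recall for precision; lowering it does the opposite.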

Page 13

precision-recall curve of a classifier
• given a classifier, we can traverse the thresholds to get a full curve

[Figure: the precision-recall curve of Classifier A, traced by varying the threshold t (t = 0.8, 0.6, 0.4, 0.2, 0.01) from the pessimistic end (high t) to the optimistic end (low t); the best classifier would sit at the top-right corner, precision = recall = 1]
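The traversal can be sketched by sweeping t over a grid; the probabilities and labels below are toy values, and `pr_at` is a hypothetical helper:

```python
# Sweep the threshold t to trace one classifier's precision-recall curve.
probs  = [0.95, 0.85, 0.70, 0.55, 0.40, 0.20]   # P(y = +1 | x)
labels = [+1,   +1,   -1,   +1,   -1,   -1]     # true labels

def pr_at(t):
    """(precision, recall) when predicting positive for probs >= t."""
    y_hat = [+1 if p >= t else -1 for p in probs]
    tp = sum(1 for yh, y in zip(y_hat, labels) if yh == +1 and y == +1)
    fp = sum(1 for yh, y in zip(y_hat, labels) if yh == +1 and y == -1)
    fn = sum(1 for yh, y in zip(y_hat, labels) if yh == -1 and y == +1)
    precision = tp / (tp + fp) if tp + fp else 1.0  # pessimistic limit
    recall = tp / (tp + fn)
    return precision, recall

# From pessimistic (high t) to optimistic (low t):
curve = [pr_at(t) for t in (0.9, 0.6, 0.3, 0.01)]
```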

Page 14

• Sometimes a classifier B is strictly better than classifier A

[Figure: Classifier B's precision-recall curve lies above Classifier A's at every recall, from the pessimistic end to the optimistic end]

Page 15

• but most of the time, two curves are not directly comparable

[Figure: the precision-recall curves of Classifier A and Classifier C cross; Classifier C is better in one recall range, Classifier A in the other]

Page 16

• So for comparisons, people use a single number:
  • F1-score
  • AUC
• Or use precision at k:
  • precision when the top k examples are chosen to be predicted as positive

Showing k = 5 sentences on the website, the sentences the model is most sure are positive:
• Easily best sushi in Seattle.
• All the sushi was delicious.
• The sushi was amazing, and the rice is just outstanding.
• The service was perfect.
• My wife tried their ramen and it was pretty forgettable.

One of the five is actually negative, so precision at k = 4/5 = 0.8.

FPR = Cfp / (Cfp + Ctn)

TPR = Ctp / (Ctp + Cfn) = recall

F1-score = 2 × (precision × recall) / (precision + recall)
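Both single-number summaries are easy to compute; the precision/recall values and the ranked labels below are illustrative, chosen so that precision at k matches the 0.8 above:

```python
# F1-score from the formula above; inputs are illustrative.
precision, recall = 0.75, 0.6
f1 = 2 * precision * recall / (precision + recall)

# Precision at k: predict positive for the k examples the model is
# most confident about (labels already sorted by model confidence).
ranked_labels = [+1, +1, +1, +1, -1, -1, +1]
k = 5
precision_at_k = sum(1 for y in ranked_labels[:k] if y == +1) / k
```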