TRANSCRIPT
Precision & Recall
Sewoong Oh
CSE/STAT 416, University of Washington
Evaluating a classifier
Example
• Consider a restaurant owner who wants to use ML to promote his business
• Crawl reviews of his restaurant from Yelp
• Classify each review as {positive, negative} sentiment
• Post the positive reviews on the web
• Need a sentiment classifier:
[Figure: Input xi: "Easily best sushi in Seattle." → Sentence Sentiment Classifier → Output ŷi: predicted sentiment]
Classifier (MODEL)
Sentences predicted to be positive ŷ = +1
Easily best sushi in Seattle.
I like the interior decoration and the blackboard menu on the wall.
All the sushi was delicious.
The sushi was amazing, and the rice is just outstanding.
Sentences predicted to be negative ŷ = -1
The seaweed salad was just OK, vegetable salad was just ordinary.
My wife tried their ramen and it was pretty forgettable.
The service is somewhat hectic.
The seaweed salad was just OK, vegetable salad was just ordinary.
I like the interior decoration and the blackboard menu on the wall.
My wife tried their ramen and it was pretty forgettable.
The service is somewhat hectic.
Easily best sushi in Seattle.
All the sushi was delicious.
The sushi was amazing, and the rice is just outstanding.
Sentences from all reviews for my restaurant
• Among many choices of classifiers, logistic regression, decision trees, boosting, etc., which one should we use?
Accuracy
Accuracy = (Ctp + Ctn) / N
• where the True Positive count Ctp is the number of examples in the (test or train) data with ŷi = +1 and yi = +1
• the True Negative count Ctn is the number of examples with ŷi = -1 and yi = -1
• we can use a random guess as a baseline, to get a sense of what is a good accuracy
• for binary classification, a good classifier should at the minimum do better than accuracy 1/2 (as that is what a random guess with no training gets)
• for k-ary classification, the baseline is accuracy 1/k (or equivalently error 1 - 1/k)
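The accuracy computation above can be sketched in a few lines; the label lists below are made-up illustrative data, not from the lecture:

```python
# Minimal sketch: accuracy from true and predicted labels (+1 / -1 encoding).
# y_true and y_pred are hypothetical examples.
y_true = [+1, +1, -1, -1, +1, -1]
y_pred = [+1, -1, -1, -1, +1, +1]

# Count true positives (y=+1 predicted +1) and true negatives (y=-1 predicted -1).
c_tp = sum(1 for y, yh in zip(y_true, y_pred) if y == +1 and yh == +1)
c_tn = sum(1 for y, yh in zip(y_true, y_pred) if y == -1 and yh == -1)

accuracy = (c_tp + c_tn) / len(y_true)  # 4 correct out of 6
```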
But high accuracy does not always mean a good classifier
• If 99% of people do not have cancer, then predicting "no cancer" is 99% accurate, but meaningless
• In the restaurant example, the owner will choose 10 reviews to show on the website
• Two widely used measures of performance are precision and recall
• Precision: did I (mistakenly) show a negative sentence?
• Recall: did I miss any (potentially great) positive sentences?
The seaweed salad was just OK, vegetable salad was just ordinary.
I like the interior decoration and the blackboard menu on the wall.
My wife tried their ramen and it was pretty forgettable.
The service is somewhat hectic.
Easily best sushi in Seattle.
All the sushi was delicious.
The sushi was amazing, and the rice is just outstanding.
Break all reviews into sentences
Confusion matrix (rows: true label, columns: predicted label):

              ŷi=+1                  ŷi=-1
yi=+1   True Positive (Ctp)    False Negative (Cfn)
yi=-1   False Positive (Cfp)   True Negative (Ctn)
• We care about: "how many negative reviews did I include in my list of n positively predicted examples?"
• precision is designed to answer this question: of the n examples predicted positive, n × precision are actually positive
Easily best sushi in Seattle.
I like the interior decoration and the blackboard menu on the wall.
All the sushi was delicious.
The sushi was amazing, and the rice is just outstanding.
The seaweed salad was just OK, vegetable salad was just ordinary.
The service is somewhat hectic.
✓ ✓ ✓ ✓ ✘ ✘ (only 4 out of 6 sentences predicted to be positive are actually positive)
Sentences predicted to be positive: ŷi=+1
Precision
precision = Ctp / (Ctp + Cfp)
Recall
• We also care about: "how many of the truly positive reviews did I miss (or include)?"
• Recall is designed to answer this question
recall = Ctp / (Ctp + Cfn)
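Both definitions can be sketched together from the confusion-matrix counts; the labels below are hypothetical, using the same +1/-1 encoding as the slides:

```python
# Minimal sketch: precision and recall from confusion-matrix counts.
# y_true and y_pred are made-up illustrative data.
y_true = [+1, +1, +1, -1, -1, -1]
y_pred = [+1, +1, -1, +1, -1, -1]

c_tp = sum(1 for y, yh in zip(y_true, y_pred) if y == +1 and yh == +1)  # true positives
c_fp = sum(1 for y, yh in zip(y_true, y_pred) if y == -1 and yh == +1)  # false positives
c_fn = sum(1 for y, yh in zip(y_true, y_pred) if y == +1 and yh == -1)  # false negatives

precision = c_tp / (c_tp + c_fp)  # fraction of predicted positives that are truly positive
recall = c_tp / (c_tp + c_fn)     # fraction of true positives that were found
```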
Classifier (MODEL)
True positive sentences: yi=+1
Predicted positive ŷi=+1
Easily best sushi in Seattle.
I like the interior decoration and the blackboard menu on the wall.
All the sushi was delicious.
The sushi was amazing, and the rice is just outstanding.
The seaweed salad was just OK, vegetable salad was just ordinary.
The service is somewhat hectic.
Predicted negative ŷi=-1
The seaweed salad was just OK, vegetable salad was just ordinary.
My wife tried their ramen and it was delicious.
The service is somewhat hectic.
My wife tried their ramen and it was pretty forgettable.
The service was perfect.
Sentences from all reviews for my restaurant
precision-recall curve
• given training data, each trained classifier achieves a certain (recall, precision) on the test data
• which we can plot in 2-d as below
• for both precision and recall, higher is better
• it is easy to achieve certain points in the graph
• most reasonable models will be incomparable
[Figure: 2-d plot with recall on the horizontal axis and precision on the vertical axis, both from 0.0 to 1.0]
Recap, confusion matrix (rows: true label, columns: predicted label):

              ŷi=+1                  ŷi=-1
yi=+1   True Positive (Ctp)    False Negative (Cfn)
yi=-1   False Positive (Cfp)   True Negative (Ctn)

precision = Ctp / (Ctp + Cfp)
recall = Ctp / (Ctp + Cfn)
Precision-Recall tradeoff
Want to find many positive sentences, but minimize the risk of incorrect predictions!

OPTIMISTIC MODEL: finds all positive sentences, but includes many false positives
PESSIMISTIC MODEL: finds few positive sentences, but includes no false positives
Predicted negative ŷi=-1
The service is somewhat hectic.
Predicted positive ŷi=+1
Easily best sushi in Seattle.
I like the interior decoration and the blackboard menu on the wall.
All the sushi was delicious.
The sushi was amazing, and the rice is just outstanding.
The seaweed salad was just OK, vegetable salad was just ordinary.
The service is somewhat hectic.
The seaweed salad was just OK, vegetable salad was just ordinary.
My wife tried their ramen and it was delicious.
The service was perfect.
My wife tried their ramen and it was pretty forgettable.
Predicted positive ŷi=+1
Easily best sushi in Seattle.
The sushi was amazing, and the rice is just outstanding.
Predicted negative ŷi=-1
The service is somewhat hectic.
I like the interior decoration and the blackboard menu on the wall.
All the sushi was delicious.
The service is somewhat hectic.
The seaweed salad was just OK, vegetable salad was just ordinary.
My wife tried their ramen and it was delicious.
The service was perfect.
My wife tried their ramen and it was pretty forgettable.
The seaweed salad was just OK, vegetable salad was just ordinary.
[Figure: the two models plotted in the (recall, precision) plane, both axes from 0.0 to 1.0]
How do we trade off precision and recall?
• In the case of logistic regression, we can choose how many positive predictions we want, and then include the most confident ones first
Definite: P(y=+1 | x = "The sushi & everything else were awesome!") = 0.99
Not sure: P(y=+1 | x = "The sushi was good, the service was OK") = 0.55

Can be used to trade off precision and recall
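This confidence-thresholding idea can be sketched as follows; the probabilities and the `predict` helper are hypothetical illustrations, not the course code:

```python
# Sketch: trading off precision and recall by thresholding P(y=+1|x).
# `probs` are made-up model confidences for six sentences.
probs = [0.99, 0.85, 0.60, 0.55, 0.30, 0.10]

def predict(probs, threshold):
    """Predict +1 when the model's confidence clears the threshold.

    A high threshold gives a pessimistic model (few positives, high precision);
    a low threshold gives an optimistic model (many positives, high recall).
    """
    return [+1 if p >= threshold else -1 for p in probs]

pessimistic = predict(probs, 0.9)   # only the most confident sentence is positive
standard = predict(probs, 0.5)      # the usual 0.5 cutoff
optimistic = predict(probs, 0.05)   # everything is predicted positive
```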
logistic regression trade-off
P(y=+1 | x) = 1 / (1 + e^{-wᵀh(x)})

[Figure: the sigmoid plotted against wᵀh(x), and the (recall, precision) plane marking operating points for a pessimistic threshold, the standard threshold, and an optimistic threshold]
precision-recall curve of a classifier
• given a classifier, we can traverse the thresholds to get a full curve
[Figure: precision-recall curve of Classifier A, traced by varying the threshold t from 0.8 (pessimistic end) through 0.6, 0.4, 0.2 down to 0.01 (optimistic end); the best classifier sits at precision = recall = 1]
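Tracing the curve by sweeping the threshold can be sketched as follows; the labels and confidences are made-up illustrative data:

```python
# Sketch: trace a precision-recall curve by sweeping the threshold t.
# y_true are hypothetical true labels, probs hypothetical model confidences.
y_true = [+1, +1, -1, +1, -1, -1]
probs = [0.95, 0.80, 0.70, 0.55, 0.40, 0.10]

curve = []  # list of (recall, precision) points, one per threshold
for t in [0.01, 0.2, 0.4, 0.6, 0.8]:
    y_pred = [+1 if p >= t else -1 for p in probs]
    tp = sum(1 for y, yh in zip(y_true, y_pred) if y == +1 and yh == +1)
    fp = sum(1 for y, yh in zip(y_true, y_pred) if y == -1 and yh == +1)
    fn = sum(1 for y, yh in zip(y_true, y_pred) if y == +1 and yh == -1)
    if tp + fp > 0:  # precision is undefined when nothing is predicted positive
        curve.append((tp / (tp + fn), tp / (tp + fp)))
```

At the optimistic end (t=0.01) recall is 1 but precision suffers; at the pessimistic end (t=0.8) precision is 1 but recall drops.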
• Sometimes a classifier B is strictly better than classifier A
[Figure: precision-recall curves where Classifier B dominates Classifier A everywhere, from the pessimistic end to the optimistic end]
• but most of the time, two curves are not directly comparable
[Figure: precision-recall curves of Classifier A and Classifier C crossing: Classifier C is better in one region, Classifier A better in the other]
• So for comparisons, people use a single number:
• F1-score
• AUC
• Or use precision at k:
• precision when the top k examples are chosen to be predicted as positive
Showing k=5 sentences on the website, the sentences the model is most sure are positive:
Easily best sushi in Seattle.
All the sushi was delicious.
The sushi was amazing, and the rice is just outstanding.
The service was perfect.
My wife tried their ramen and it was pretty forgettable.
precision at k = 0.8 (4 of the 5 shown sentences are actually positive)
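Precision at k can be sketched as follows; the labels and confidences below are made up for illustration:

```python
# Sketch: precision at k. Rank examples by model confidence, take the top k,
# and measure the fraction of those that are actually positive.
y_true = [+1, +1, +1, -1, +1, -1, -1]                 # hypothetical true sentiments
probs = [0.99, 0.95, 0.90, 0.85, 0.80, 0.30, 0.10]    # hypothetical model confidences

k = 5
# Pair each confidence with its true label, sort by confidence, keep the top k.
top_k = sorted(zip(probs, y_true), reverse=True)[:k]
precision_at_k = sum(1 for _, y in top_k if y == +1) / k
```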
FPR = Cfp / (Cfp + Ctn)
TPR = Ctp / (Ctp + Cfn) = Recall
F1-score = 2 × (precision × recall) / (precision + recall)
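The F1-score, the harmonic mean of precision and recall, can be sketched directly from the formula (the input values are illustrative):

```python
# Sketch: F1-score as the harmonic mean of precision and recall.
def f1_score(precision, recall):
    return 2 * precision * recall / (precision + recall)

# A classifier with high precision but mediocre recall is penalized:
score = f1_score(0.8, 0.5)  # closer to the smaller of the two than their average
```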