Machine Learning (CSE 446): Beyond Binary Classification

Noah Smith © 2017 University of Washington
nasmith@cs.washington.edu

November 20, 2017
Unbalanced Data (Binary Classification)

Balanced data: p(Y = +1) ≈ p(Y = −1) ≈ 1/2.

Examples where the fraction of positive examples is tiny: fraud detection, web page relevance.

Some solutions:

1. Throw out negative examples until you achieve balance.

2. Down-weight negative examples until you achieve balance. For example,

   L^(new)(x, y, parameters) ← α^⟦y = +1⟧ · L^(old)(x, y, parameters)

   A similar effect can be achieved in SGD by sampling non-uniformly: assign probability 1/(2N+) to each positive example and 1/(2N−) to each negative example.

3. Modify the hinge loss, replacing the constant 1 with a label-dependent cost:

   L_n^(hinge)(w, b) = max{0, cost(y_n) − y_n · (w · x_n + b)}

   cost(y_n) = α if y_n = −1 (false positive), β if y_n = +1 (false negative)
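The cost-sensitive hinge loss in item 3 can be sketched in a few lines of Python (a minimal illustration, not from the slides; the helper names and the particular values of α and β are invented for the example):

```python
def cost(y, alpha=1.0, beta=5.0):
    # alpha: cost of a false positive (a negative example misclassified),
    # beta: cost of a false negative (a positive example misclassified).
    return alpha if y == -1 else beta

def hinge_loss(w, b, x, y, alpha=1.0, beta=5.0):
    # max{0, cost(y_n) - y_n * (w . x_n + b)}
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    return max(0.0, cost(y, alpha, beta) - margin)

def sgd_step(w, b, x, y, lr=0.1, alpha=1.0, beta=5.0):
    # Subgradient step: update only when the loss is active.
    if hinge_loss(w, b, x, y, alpha, beta) > 0.0:
        w = [wi + lr * y * xi for wi, xi in zip(w, x)]
        b = b + lr * y
    return w, b
```

With β > α, positive examples keep incurring loss until they clear a wider margin, which pushes the learner to avoid false negatives.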
Multiclass Classification

Suppose you have a set of classes, Y, such that |Y| > 2.

1. See A5 for generalizations of familiar loss functions.

2. One-versus-all training: train |Y| binary classifiers, letting each y ∈ Y take a turn as the positive class. Let a^(y) be the activation function of the classifier trained with {y → +1, Y \ {y} → −1}. Then define the classifier f : X → Y as:

   f(x) = argmax_{y ∈ Y} a^(y)(x)

   Theorem: the error rate is at most (|Y| − 1) · ε̄, where ε̄ is the average error rate among the binary classifiers. One bad classifier can ruin f; in particular, watch out for the rarer labels, and be sure to tune hyperparameters separately.

3. All-versus-all ("tournament"): build (|Y| choose 2) classifiers, pairing every y, y′ ∈ Y. Theorem: the error rate is at most 2(|Y| − 1) · ε̄.

4. Tree-structured tournament. Theorem: the error rate is at most ⌈log₂ |Y|⌉ · ε̄. Challenge: you must choose the tree.
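The one-versus-all recipe in item 2 can be sketched with a simple perceptron as the underlying binary learner (an illustrative choice; the slides do not fix one):

```python
def train_perceptron(examples, epochs=20):
    # examples: list of (x, y) with x a list of floats, y in {+1, -1}.
    dim = len(examples[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in examples:
            a = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * a <= 0:  # mistake (or zero margin): update
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b

def one_vs_all(data, labels):
    # Train |Y| binary classifiers, each y taking a turn as the positive class.
    classifiers = {}
    for y in labels:
        binary = [(x, +1 if yy == y else -1) for x, yy in data]
        classifiers[y] = train_perceptron(binary)
    return classifiers

def predict(classifiers, x):
    # f(x) = argmax_y a^(y)(x): the label whose classifier fires hardest.
    def activation(wb):
        w, b = wb
        return sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(classifiers, key=lambda y: activation(classifiers[y]))
```

Note that the argmax compares raw activations across independently trained classifiers, which is exactly why one badly calibrated classifier can ruin f.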
Tree-Structured Tournament for Multiclass Classification

Round 1: f1: {y1, y2, y3, y4} vs. {y5, y6, y7, y8}
Round 2: f2: {y1, y2} vs. {y3, y4};  f3: {y5, y6} vs. {y7, y8}
Round 3: f4: y1 vs. y2;  f5: y3 vs. y4;  f6: y5 vs. y6;  f7: y7 vs. y8
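Prediction in such a bracket takes only ⌈log₂ |Y|⌉ binary decisions. A minimal sketch (the `classify` callable stands in for a trained binary classifier at each internal node, and a balanced bracket over a power-of-two label set is assumed):

```python
def tree_tournament(x, labels, classify):
    # labels: list of candidate labels (length a power of two for a
    # balanced bracket, as in the figure above).
    # classify(x, left, right): returns whichever half of the label set
    # wins this round -- a stand-in for a trained binary classifier.
    while len(labels) > 1:
        mid = len(labels) // 2
        labels = classify(x, labels[:mid], labels[mid:])
    return labels[0]
```

For eight labels this makes exactly three calls to `classify`, matching the three rounds in the figure.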
Ranking

Most common setup: x_n = ⟨q_n, d⟩, where q_n is a query and d is a (fixed, universal) set of documents {d_1, . . . , d_M}. Output y_n is a ranking of d.

Pairwise encoding: let x_{n,i,j} be the features encoding the comparison of d_i with d_j under query q_n.

Output: y_{n,i,j} is +1 if d_i is more relevant to q_n than d_j; −1 otherwise.

Training on the binary problem ⟨(x_{n,i,j}, y_{n,i,j})⟩ for n ∈ {1, . . . , N} and i, j ∈ {1, . . . , M} makes sense when the ranking is meant to separate relevant d_i from irrelevant d_i, known as "bipartite" ranking.
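For the bipartite case, generating the pairwise binary examples can be sketched as follows (the difference-of-features encoding for x_{n,i,j} is one common choice, not specified on the slides; all names are illustrative):

```python
def pairwise_examples(doc_features, relevance):
    # doc_features: list of feature vectors, one per document d_i,
    #   for a single query q_n.
    # relevance: parallel list of binary judgments (True = relevant).
    # Emits one binary example per ordered pair (i, j) with differing
    # relevance: x_{i,j} = phi(d_i) - phi(d_j), and y_{i,j} = +1 iff
    # d_i is the more relevant of the pair.
    examples = []
    for i, (fi, ri) in enumerate(zip(doc_features, relevance)):
        for j, (fj, rj) in enumerate(zip(doc_features, relevance)):
            if i == j or ri == rj:
                continue  # same doc, or no preference between the pair
            x = [a - b for a, b in zip(fi, fj)]
            y = +1 if ri else -1
            examples.append((x, y))
    return examples
```

Pairs with equal relevance are skipped, since bipartite ranking only needs to separate relevant documents from irrelevant ones.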
![Page 23: Machine Learning (CSE 446): Beyond Binary Classification · 2017-11-20 · Machine Learning (CSE 446): Beyond Binary Classi cation Noah Smith c 2017 University of Washington nasmith@cs.washington.edu](https://reader033.vdocuments.site/reader033/viewer/2022050603/5faaf1b2cfb28c7de16702a4/html5/thumbnails/23.jpg)
Nuanced Ranking Problems
Intuitively, we want a scoring function, specific to query q, that is highest for the mostrelevant documents.
Let σ : {1, . . . ,M} → R denote a document-scoring function; the true scoring functionfor training example n is σn, and σ̂n is what we’ve estimated.
Let ω(i, j) be the nonnegative cost of putting something in position j when it belongsin position i.One example:
ω(i, j) =
{1 if min(i, j) ≤ 10 and i 6= j0 otherwise
(More in the book.)
23 / 25
Loss:

E_{(q,σ)∼D} [ Σ_{i,j : i≠j} ⟦σ(i) < σ(j)⟧ · ⟦σ̂(i) > σ̂(j)⟧ · ω(i, j) ]

That is, a cost of ω(i, j) is incurred for each pair of positions whose true order the estimated scores reverse.

Deriving a learning algorithm is left as an exercise. (See the book for an example.)
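Evaluating this loss for a single query follows directly from the definition (a sketch under the assumption that rankings are given as dicts from document id to position, with 1 the top slot; all names are hypothetical). It charges ω for every pair whose true order the predicted ranking reverses:

```python
def omega_top_k(i, j, k=10):
    # The example omega above: unit cost whenever a disagreement
    # touches one of the top k positions.
    return 1.0 if min(i, j) <= k and i != j else 0.0

def ranking_loss(sigma, sigma_hat, omega):
    # sigma, sigma_hat: dicts mapping document id -> position (1 = top).
    # Sum omega(sigma[i], sigma[j]) over pairs the predicted ranking
    # puts in the opposite order from the true one.
    loss = 0.0
    docs = list(sigma)
    for i in docs:
        for j in docs:
            if i != j and sigma[i] < sigma[j] and sigma_hat[i] > sigma_hat[j]:
                loss += omega(sigma[i], sigma[j])
    return loss
```

A perfect ranking incurs zero loss; swapping two documents inside the top k incurs ω for that pair, while swaps entirely below position k cost nothing.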