10-605 (2016), wcohen: Mistake Bounds + Structured Perceptrons
TRANSCRIPT
Announcements
• Guest lectures schedule:
  – D. Sculley, Google Pgh, 3/26
  – Alex Beutel, SGD for tensors, 4/7
  – Alex Smola, something cool, 4/9
Projects
• Students in 805:
  – First draft of project proposal due 2/17.
  – Some more detail on projects is on the wiki.
Quiz
• https://qna-app.appspot.com/view.html?aglzfnFuYS1hcHByGQsSDFF1ZXN0aW9uTGlzdBiAgICAg-n-Cww
How do you debug a learning algorithm?
• Unit tests • Simple artificial problems
How do you debug a learning algorithm?
• Unit tests • Simple artificial problems
[rain|sleet|snow|showers| [snow flurries|snow showers|light snow|…]] [Monday|Tuesday|…] and overcast
Beyond Naïve Bayes: Other Efficient Learning Methods
William W. Cohen
Two fast algorithms
• Naïve Bayes: one pass
• Rocchio: two passes
  – if vocabulary fits in memory
• Both methods are algorithmically similar
  – count and combine
• Thought experiment: what if we duplicated some features in our dataset many times?
  – e.g., repeat all words that start with “t” 10 times.
Limitations of Naïve Bayes/Rocchio
• Naïve Bayes: one pass
• Rocchio: two passes
  – if vocabulary fits in memory
• Both methods are algorithmically similar
  – count and combine
• Thought thought thought thought thought thought thought thought thought thought experiment: what if we duplicated some features in our dataset many times times times times times times times times times times?
  – e.g., Repeat all words that start with “t” “t” “t” “t” “t” “t” “t” “t” “t” “t” ten ten ten ten ten ten ten ten ten ten times times times times times times times times times times.
  – Result: those features will be over-weighted in the classifier by a factor of 10
This isn’t silly – often there are features that are “noisy” duplicates, or important phrases of different length
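This is not the lecture's code; a tiny Python sketch of the slide's point, with made-up weights: in a linear scorer like Naïve Bayes, duplicating a feature k times multiplies its contribution to the score by k.

```python
# Linear score: sum of TF(word) * weight(word).
def score(doc_counts, weights):
    return sum(tf * weights.get(w, 0.0) for w, tf in doc_counts.items())

weights = {"the": 0.5, "tensor": 2.0, "cat": -1.0}   # made-up weights
doc = {"the": 1, "tensor": 1, "cat": 1}

# Repeat every word that starts with "t" ten times:
doc10 = {w: tf * (10 if w.startswith("t") else 1) for w, tf in doc.items()}

s1, s10 = score(doc, weights), score(doc10, weights)
# The "t" words now contribute 10x; "cat" is unchanged.
```

The duplicated features dominate the score even though no new information was added, which is exactly the over-weighting the slide describes.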
Limitations of Naïve Bayes/Rocchio
• Naïve Bayes: one pass
• Rocchio: two passes
  – if vocabulary fits in memory
• Both methods are algorithmically similar
  – count and combine
• Thought oughthay experiment experiment-day: what if we add a Pig Latin version of each word starting with “t”?
  – Result: those features will be over-weighted
  – You need to look at interactions between features somehow
This isn’t silly – often there are features that are “noisy” duplicates, or important phrases of different length
Naïve Bayes is a linear algorithm
Naïve Bayes:

$$\log P(y, x_1, \ldots, x_n) = \sum_j \log\frac{C(X = x_j \wedge Y = y) + m q_x}{C(X = \mathrm{ANY} \wedge Y = y) + m} + \log\frac{C(Y = y) + m q_y}{C(Y = \mathrm{ANY}) + m}$$
$$= \sum_j g(x_j, y) + f(y) = \sum_{x \in V} f(x, d)\, g(x, y) + f(y) = v(y, d) \cdot w(y)$$

where $g(x_j, y) = \log\dfrac{C(X = x_j \wedge Y = y) + m q_x}{C(X = \mathrm{ANY} \wedge Y = y) + m}$ and $f(x, d) = TF(x, d)$.

v(y,d): sparse vector of TF values for each word in the document… plus a “bias” term for f(y)
w(y): dense vector of g(x,y) scores for each word in the vocabulary… plus f(y) to match the bias term
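A minimal Python sketch of the scores above on a made-up toy corpus (not the course's code): the smoothed log count ratios g(x, y) become a per-class weight vector, and a document is scored by the TF-weighted dot product; the f(y) bias term is omitted for brevity.

```python
import math
from collections import Counter

def nb_weights(docs, labels, m=1.0, qx=None):
    """g(x, y) = log[(C(X=x ∧ Y=y) + m*qx) / (C(X=ANY ∧ Y=y) + m)]."""
    vocab = {w for d in docs for w in d}
    qx = qx if qx is not None else 1.0 / len(vocab)  # uniform word prior
    g = {}
    for y in set(labels):
        cnt = Counter(w for d, l in zip(docs, labels) if l == y for w in d)
        total = sum(cnt.values())                    # C(X=ANY ∧ Y=y)
        g[y] = {x: math.log((cnt[x] + m * qx) / (total + m)) for x in vocab}
    return g

def nb_score(doc, gy):
    """v(y, d) · w(y): sum over words of TF(x, d) * g(x, y)."""
    return sum(n * gy.get(x, 0.0) for x, n in Counter(doc).items())

docs = [["rain", "monday"], ["snow", "tuesday"], ["sunny", "monday"]]
labels = ["wet", "wet", "dry"]
g = nb_weights(docs, labels)
```

Because the score is linear in the TF values, repeating a word in a document scales that word's contribution, which is the duplication problem discussed above.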
One way to look for interactions: on-line, incremental learning
Naïve Bayes:

$$\log P(y, x_1, \ldots, x_n) = \sum_j \log\frac{C(X = x_j \wedge Y = y) + m q_x}{C(X = \mathrm{ANY} \wedge Y = y) + m} + \log\frac{C(Y = y) + m q_y}{C(Y = \mathrm{ANY}) + m}$$
$$= \sum_j g(x_j, y) + f(y) = \sum_{x \in V} f(x, d)\, g(x, y) + f(y) = v(y, d) \cdot w(y)$$

where $g(x_j, y) = \log\dfrac{C(X = x_j \wedge Y = y) + m q_x}{C(X = \mathrm{ANY} \wedge Y = y) + m}$

w(y): dense vector of g(x,y) scores for each word in the vocabulary
Scan thru data:
• whenever we see x with y we increase g(x,y)
• whenever we see x with ~y we increase g(x,~y)
One simple way to look for interactions
Naïve Bayes – two-class version:

$$\text{prediction} = \sum_j \log\frac{C(X = x_j \wedge Y = y) + m q_x}{C(X = \mathrm{ANY} \wedge Y = y) + m} + \log\frac{C(Y = y) + m q_y}{C(Y = \mathrm{ANY}) + m}$$
$$= \sum_j g(x_j, y) + f(y) = \sum_{x \in V} f(x, d)\, g(x, y) + f(y) = v(y, d) \cdot \left[w(y) - w(\sim y)\right]$$

where $g(x_j, y) = \log\dfrac{C(X = x_j \wedge Y = y) + m q_x}{C(X = \mathrm{ANY} \wedge Y = y) + m}$

w(y): dense vector of g(x,y) scores for each word in the vocabulary
Scan thru data:
• whenever we see x with y we increase g(x,y)−g(x,~y)
• whenever we see x with ~y we decrease g(x,y)−g(x,~y)
We do this regardless of whether it seems to help or not on the data… if there are duplications, the weights will become arbitrarily large.
To detect interactions:
• increase/decrease g(x,y)−g(x,~y) only if we need to (for that example)
• otherwise, leave it unchanged
One simple way to look for interactions
Train Data → B. For each instance xi, B computes ŷi = vk · xi, then sees the label yi ∈ {+1, −1}. If mistake: vk+1 = vk + correction.
To detect interactions:
• increase/decrease vk only if we need to (for that example)
• otherwise, leave it unchanged
• We can be sensitive to duplication by stopping updates when we get better performance
Theory: the prediction game
• Player A:
  – picks a “target concept” c
    • for now, from a finite set of possibilities C (e.g., all decision trees of size m)
  – for t = 1, ….:
    • Player A picks x = (x1,…,xn) and sends it to B
      – for now, from a finite set of possibilities (e.g., all binary vectors of length n)
    • B predicts a label, ŷ, and sends it to A
    • A sends B the true label y = c(x)
    • we record if B made a mistake or not
• We care about the worst-case number of mistakes B will make over all possible concepts & training sequences of any length
• The “mistake bound” for B, MB(C), is this bound
The prediction game
• Are there practical algorithms where we can compute the mistake bound?
The voted perceptron
A → B: instance xi. B computes ŷi = vk · xi and sends ŷi to A; A sends back the true yi. If mistake: vk+1 = vk + yi xi
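The loop above can be sketched in a few lines of Python (not the lecture's code; the toy data is made up and linearly separable, so the mistake bound guarantees convergence):

```python
import numpy as np

def perceptron(X, y, epochs=10):
    v = np.zeros(X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if np.sign(v @ xi) != yi:   # sign(0) = 0 also counts as a mistake
                v = v + yi * xi         # v_{k+1} = v_k + y_i x_i
                mistakes += 1
        if mistakes == 0:               # no mistakes in a full pass: done
            break
    return v

# Tiny separable toy data:
X = np.array([[1.0, 1.0], [2.0, 1.0], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
v = perceptron(X, y)
```

Note that the update only fires on mistakes, which is the "change only if we need to" behavior the previous slides motivated.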
[Figure: the target u, its negation −u, and the margin band of width 2γ]
(1) A target u. (2) The guess v1 after one positive example.
(3a) The guess v2 after two positive examples: v2 = v1 + x2.
(3b) The guess v2 after one positive and one negative example: v2 = v1 − x2.
If mistake: vk+1 = vk + yi xi
[Figure repeated, annotated: each example’s margin with respect to u is > γ]
[Proof slide: the perceptron mistake bound]

$$k \le \left(\frac{R}{\gamma}\right)^2$$
Summary
• We have shown that:
  – If: there exists a u with unit norm that has margin γ on the examples in the seq (x1,y1),(x2,y2),…
  – Then: the perceptron algorithm makes < R²/γ² mistakes on the sequence (where R ≥ ||xi||)
  – Independent of dimension of the data or classifier (!)
  – This doesn’t follow from M(C) ≤ VCDim(C)
• We don’t know if this algorithm could be better
  – There are many variants that rely on similar analysis (ROMMA, Passive-Aggressive, MIRA, …)
• We don’t know what happens if the data’s not separable
  – Unless I explain the “Δ trick” to you
• We don’t know what classifier to use “after” training
The Δ Trick
• The proof assumes the data is separable by a wide margin
• We can make that true by adding an “id” feature to each example – sort of like we added a constant feature:

$$x^1 = (x^1_1, x^1_2, \ldots, x^1_m) \to (x^1_1, x^1_2, \ldots, x^1_m, \Delta, 0, \ldots, 0)$$
$$x^2 = (x^2_1, x^2_2, \ldots, x^2_m) \to (x^2_1, x^2_2, \ldots, x^2_m, 0, \Delta, \ldots, 0)$$
$$\vdots$$
$$x^n = (x^n_1, x^n_2, \ldots, x^n_m) \to (x^n_1, x^n_2, \ldots, x^n_m, 0, 0, \ldots, \Delta)$$

(n new features)
The Δ Trick
• Replace xi with x′i so X becomes [X | IΔ]
• Replace R² in our bounds with R² + Δ²
• Let di = max(0, γ − yi xi · u)
• Let u′ = (u1,…,um, y1d1/Δ, …, yndn/Δ) · (1/Z)
  – So Z = sqrt(1 + D²/Δ²), for D = sqrt(d1² + … + dn²)
  – Now [X | IΔ] is separable by u′ with margin γ
• Mistake bound is (R² + Δ²)Z² / γ²
• Let Δ = sqrt(RD) ⇒ k ≤ ((R + D)/γ)²
• Conclusion: a little noise is ok
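A sketch of the augmentation step on made-up data (the separating u′ below uses only the new id features; the slide's u′ also mixes in the original u):

```python
import numpy as np

# The Δ trick: append n "id" features so X becomes [X | Δ·I]; then any
# labeling is separable, even a contradictory one.
def delta_augment(X, delta):
    n = X.shape[0]
    return np.hstack([X, delta * np.eye(n)])

X = np.array([[1.0, 0.0], [1.0, 0.0]])   # the same point twice...
y = np.array([1, -1])                     # ...with opposite labels: not separable
Xp = delta_augment(X, delta=2.0)

# A u' built from the id features alone separates the augmented data:
u_prime = np.array([0.0, 0.0, 1.0, -1.0])
```

The cost of this freedom shows up in the bound: R² grows to R² + Δ², which is why the slide tunes Δ to balance the two terms.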
Summary
• We have shown that:
  – If: there exists a u with unit norm that has margin γ on the examples in the seq (x1,y1),(x2,y2),…
  – Then: the perceptron algorithm makes < R²/γ² mistakes on the sequence (where R ≥ ||xi||)
  – Independent of dimension of the data or classifier (!)
• We don’t know what happens if the data’s not separable
  – Unless I explain the “Δ trick” to you
• We don’t know what classifier to use “after” training
On-line to batch learning
1. Pick a vk at random according to mk/m, the fraction of examples it was used for.
2. Predict using the vk you just picked.
3. (Actually, use some sort of deterministic approximation to this).
Complexity of perceptron learning
• Algorithm:
  • v = 0
  • for each example x, y:   … O(n) over the examples
    – if sign(v.x) != y:
      • v = v + yx   … O(|x|) = O(|d|) per update
        (sparse implementation: init hashtable; for xi != 0, vi += y·xi)
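The hashtable version of the update can be sketched as follows (not the course's code; the feature names are made up). Only the nonzero coordinates of x are touched, so each update costs O(|x|) rather than O(n):

```python
def sparse_predict(v, x):
    return sum(v.get(f, 0.0) * xf for f, xf in x.items())

def sparse_update(v, x, y):
    for f, xf in x.items():          # for xi != 0: vi += y * xi
        v[f] = v.get(f, 0.0) + y * xf

v = {}                               # init hashtable
x = {"rain": 1.0, "monday": 1.0}     # sparse feature vector
y = +1
if sparse_predict(v, x) * y <= 0:    # mistake
    sparse_update(v, x, y)
```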
Complexity of averaged perceptron
• Algorithm:
  • vk = 0; va = 0
  • for each example x, y:   … O(n) over the examples
    – if sign(vk.x) != y:
      • va = va + vk   … O(|V|) per mistake, O(n|V|) total (init hashtables; for vki != 0, vai += vki)
      • vk = vk + yx   … O(|x|) = O(|d|)
      • mk = 1
    – else:
      • nk++
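A hedged sketch of the averaging idea on made-up data (not the slide's exact bookkeeping): here va simply accumulates v after every example, which gives the same result as the slide's scheme of adding each vk weighted by how long it survived.

```python
import numpy as np

def averaged_perceptron(X, y, epochs=5):
    v = np.zeros(X.shape[1])
    va = np.zeros(X.shape[1])
    seen = 0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (v @ xi) <= 0:   # mistake: standard perceptron update
                v = v + yi * xi
            va += v                   # accumulate the current vector
            seen += 1
    return va / seen                  # average of all intermediate vectors

X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1, -1])
w = averaged_perceptron(X, y)
```

The dense `va += v` step is exactly the O(|V|) cost the slide calls out; production implementations usually use a lazy-update trick to avoid it.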
The kernel trick
You can think of a perceptron as a weighted nearest-neighbor classifier….
where K(v,x) = dot product of v and x (a similarity function)
The kernel trick
Here’s yet another similarity function: K(v,x) is [given by a formula shown on the slide].
Here’s another similarity function: K′(v,x) = dot product of H′(v) and H′(x), where [H′ is shown on the slide].
The kernel trick

$$H(\mathbf{x}) \equiv \left(x_1^2, \ldots, x_n^2,\; \sqrt{2}\,x_1 x_2, \ldots, \sqrt{2}\,x_{n-1} x_n,\; \sqrt{2}\,x_1, \ldots, \sqrt{2}\,x_n,\; 1\right)$$

Claim: K(v,x) = dot product of H(x), H(v) for this H:
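A numeric spot-check of the claim for n = 2, assuming H is the standard quadratic-kernel map (an assumption; the slide's H is partly garbled in this transcript), for which H(v)·H(x) = (v·x + 1)²:

```python
import math

# H(x) = (x1^2, x2^2, sqrt(2)*x1*x2, sqrt(2)*x1, sqrt(2)*x2, 1)
def H(x1, x2):
    r2 = math.sqrt(2.0)
    return [x1 * x1, x2 * x2, r2 * x1 * x2, r2 * x1, r2 * x2, 1.0]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

v, x = (0.5, -1.0), (2.0, 3.0)
lhs = (dot(v, x) + 1.0) ** 2   # kernel evaluated directly, O(n)
rhs = dot(H(*v), H(*x))        # dot product in the expanded feature space
```

The point of the trick is that the left-hand side costs O(n) while the explicit feature space on the right has O(n²) coordinates.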
Parallelizing perceptrons
Instances/labels → split into example subsets → Instances/labels – 1, 2, 3
Compute vk’s on subsets → vk/va – 1, vk/va – 2, vk/va – 3
Combine somehow? → vk
Parallelizing perceptrons
Instances/labels → split into example subsets → Instances/labels – 1, 2, 3
Compute vk’s on subsets → vk/va – 1, vk/va – 2, vk/va – 3
Combine somehow → vk/va
Synchronization cost vs. inference (classification) cost
Review/outline
• How to implement Naïve Bayes
  – Time is linear in size of data (one scan!)
  – We need to count C(X=word ∧ Y=label)
• Can you parallelize Naïve Bayes?
  – Trivial solution 1:
    1. Split the data up into multiple subsets
    2. Count and total each subset independently
    3. Add up the counts
  – Result should be the same
• This is unusual for streaming learning algorithms
  – Why? no interaction between feature weight updates
  – For perceptron that’s not the case
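The trivial solution can be sketched in a few lines (not the course's code; the toy shards are made up): because counting events never interacts across examples, per-shard counters added together match a single-pass count exactly.

```python
from collections import Counter

def count_shard(examples):
    """Count C(X=word ∧ Y=label) events for one shard of the data."""
    c = Counter()
    for words, label in examples:
        for w in words:
            c[(w, label)] += 1
    return c

data = [(["rain", "monday"], "wet"), (["sunny"], "dry"), (["rain"], "wet")]
shards = [data[:2], data[2:]]                               # 1. split the data
merged = sum((count_shard(s) for s in shards), Counter())   # 2.-3. count, add
```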
A hidden agenda
• Part of machine learning is a good grasp of theory
• Part of ML is a good grasp of what hacks tend to work
• These are not always the same
  – Especially in big-data situations
• Catalog of useful tricks so far:
  – Brute-force estimation of a joint distribution
  – Naive Bayes
  – Stream-and-sort, request-and-answer patterns
  – BLRT and KL-divergence (and when to use them)
  – TF-IDF weighting – especially IDF
    • it’s often useful even when we don’t understand why
  – Perceptron/mistake bound model
    • often leads to fast, competitive, easy-to-implement methods
    • parallel versions are non-trivial to implement/understand
The Voted Perceptron for Ranking and Structured Classification
William Cohen
The voted perceptron for ranking
A → B: instances x1 x2 x3 x4 …. B computes yi = vk · xi for each, and returns the index b* of the “best” xi; A returns the true best index b. If mistake: vk+1 = vk + xb − xb*
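One round of this protocol can be sketched as follows (not the lecture's code; the candidate instances are made up):

```python
import numpy as np

def rank_step(v, X, b):
    """X: candidate instances as rows; b: index of the true best."""
    b_star = int(np.argmax(X @ v))   # B's guess: highest-scoring instance
    if b_star != b:                   # mistake:
        v = v + X[b] - X[b_star]      #   v_{k+1} = v_k + x_b - x_{b*}
    return v, b_star

X = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
v = np.zeros(2)
v, guess1 = rank_step(v, X, b=1)     # first guess is wrong, v is corrected
v, guess2 = rank_step(v, X, b=1)     # now B ranks the right one highest
```

The update moves v toward the correct candidate and away from the guessed one, which is the same margin argument as in the classification case.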
[Figure: several x’s, the target vector u and −u, with margin γ]
Ranking some x’s with the target vector u
[Figure: the same x’s with a guess vector v]
Ranking some x’s with some guess vector v – part 1
[Figure: the same x’s with the guess vector v]
Ranking some x’s with some guess vector v – part 2.
The purple-circled x is xb* – the one the learner has chosen to rank highest. The green-circled x is xb, the right answer.
[Figure: v being updated]
Correcting v by adding xb − xb*
[Figure: vk and vk+1 after the update]
Correcting v by adding xb − xb* (part 2)
[Figure: the classification-margin picture (3a), with margin > γ, shown beside the ranking picture with guess vector v]
[Figure: the classification-margin and ranking pictures, repeated]
Notice this doesn’t depend at all on the number of x’s being ranked.
Neither proof depends on the dimension of the x’s.
Ranking perceptrons ⇒ structured perceptrons
• The API:
  – A sends B a (maybe huge) set of items to rank
  – B finds the single best one according to the current weight vector
  – A tells B which one was actually best
• Structured classification on a sequence:
  – Input: list of words x = (w1,…,wn)
  – Output: list of labels y = (y1,…,yn)
  – If there are K classes, there are K^n labels possible for x
Borkar et al.’s HMMs for segmentation
– Example: addresses, bib records
– Problem: some DBs may split records up differently (e.g., no “mail stop” field, combine address and apt #, …) or not at all
– Solution: learn to segment the textual form of records
P.P. Wangikar, T.P. Graycar, D.A. Estell, D.S. Clark, J.S. Dordick (1993) Protein and Solvent Engineering of Subtilisin BPN' in Nearly Anhydrous Organic Media. J. Amer. Chem. Soc. 115, 12231-12237.
Author Year Title Journal Volume Page
IE with Hidden Markov Models
[Figure: an HMM with states Author, Year, Title, Journal; transition probabilities on the arcs (0.9, 0.5, 0.5, 0.8, 0.2, 0.1) and per-state emission probabilities over words, e.g. Title emits “Learning” 0.06, “Convex” 0.03, …; Journal emits “Comm.” 0.04, “Trans.” 0.02, “Chemical” 0.004; Author emits “Smith” 0.01, “Cohen” 0.05, “Jordan” 0.3, …; Year emits “dddd” 0.8, “dd” 0.2]
Inference for linear-chain MRFs
When will prof Cohen post the notes …
Idea 1: features are properties of two adjacent tokens, and the pair of labels assigned to them.
• (y(i)==B or y(i)==I) and (token(i) is capitalized)
• (y(i)==I and y(i-1)==B) and (token(i) is hyphenated)
• (y(i)==B and y(i-1)==B)
• e.g., “tell Ziv William is on the way”
Idea 2: construct a graph where each path is a possible sequence labeling.
Inference for a linear-chain MRF
[Trellis figure: one column of states {B, I, O} per token of “When will prof Cohen post the notes …”]
• Inference: find the highest-weight path
• This can be done efficiently using dynamic programming (Viterbi)
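A minimal Viterbi sketch over a B/I/O trellis (not the course's code; the node and edge scoring functions below are made-up toy features, e.g. capitalized tokens favor B and the edge score forbids O → I):

```python
def viterbi(tokens, labels, node_score, edge_score):
    """Highest-weight path through the label trellis, via dynamic programming."""
    best = {l: node_score(tokens[0], l) for l in labels}
    back = []                                  # back-pointers, one dict per step
    for t in tokens[1:]:
        nxt, ptr = {}, {}
        for l in labels:
            p = max(labels, key=lambda pl: best[pl] + edge_score(pl, l))
            ptr[l] = p
            nxt[l] = best[p] + edge_score(p, l) + node_score(t, l)
        best, back = nxt, back + [ptr]
    last = max(labels, key=lambda l: best[l])  # best final label
    path = [last]
    for ptr in reversed(back):                 # follow back-pointers
        path.append(ptr[path[-1]])
    return list(reversed(path))

labels = ["B", "I", "O"]
node = lambda tok, l: 1.0 if (tok[0].isupper() and l == "B") or \
                             (tok[0].islower() and l == "O") else 0.0
edge = lambda a, b: -2.0 if (a == "O" and b == "I") else 0.0
path = viterbi(["When", "will", "prof", "Cohen"], labels, node, edge)
```

Each step keeps only the best score per state, so the cost is O(n · K²) instead of enumerating all K^n labelings.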
Ranking perceptrons ⇒ structured perceptrons
• New API:
  – A sends B the word sequence x
  – B finds the single best y according to the current weight vector using Viterbi
  – A tells B which y was actually best
  – This is equivalent to ranking pairs g = (x, y′)
• Structured classification on a sequence:
  – Input: list of words x = (w1,…,wn)
  – Output: list of labels y = (y1,…,yn)
  – If there are K classes, there are K^n labels possible for x
The voted perceptron for ranking
A → B: instances x1 x2 x3 x4 …. B computes yi = vk · xi and returns the index b* of the “best” xi; A returns the true best index b. If mistake: vk+1 = vk + xb − xb*
Change number one is notation: replace x with g
The voted perceptron for NER
• A sends B the instances g1, g2, g3, g4, …
• B computes yi = vk · gi and returns b*, the index of the highest-scoring ("best") gi
• A replies with b, the index of the instance that was actually best
• If mistake: vk+1 = vk + gb − gb*
1. A sends B feature functions, and instructions for creating the instances g:
   • A sends a word sequence xi. B could then create the instances g1 = F(xi, y1), g2 = F(xi, y2), …
   • but instead B just returns the y* that maximizes the score vk · F(xi, y*), found with Viterbi.
2. A sends B the correct label sequence yi.
3. On errors, B sets vk+1 = vk + gb − gb* = vk + F(xi, yi) − F(xi, y*).
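A sketch of one structured-perceptron update: a toy feature map F (emission plus transition counts) and, for brevity, a brute-force argmax over label sequences standing in for Viterbi, which computes the same argmax efficiently. The feature map, the vocabulary size, and all names are illustrative assumptions:

```python
import itertools
import numpy as np

V = 3  # assumed toy vocabulary size; words and tags are small integer ids

def F(x, y, K):
    """Hypothetical global feature map F(x, y): emission counts (word, tag)
    plus transition counts (previous tag, tag)."""
    f = np.zeros(V * K + K * K)
    for i, (w, t) in enumerate(zip(x, y)):
        f[w * K + t] += 1                       # emission feature
        if i > 0:
            f[V * K + y[i - 1] * K + t] += 1    # transition feature
    return f

def structured_perceptron_step(v, x, y_true, K):
    """One update: B finds the best y under v (brute force over the K^n
    sequences here, where the slides use Viterbi); A reveals y_true."""
    best = max(itertools.product(range(K), repeat=len(x)),
               key=lambda y: v @ F(x, y, K))
    if list(best) != list(y_true):
        v = v + F(x, y_true, K) - F(x, best, K)
    return v
```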
EMNLP 2002
Some background…
• Collins' parser: a generative model…
• "New Ranking Algorithms for Parsing and Tagging: Kernels over Discrete Structures, and the Voted Perceptron," Collins and Duffy, ACL 2002.
• "Ranking Algorithms for Named-Entity Extraction: Boosting and the Voted Perceptron," Collins, ACL 2002.
  – Propose entities using a MaxEnt tagger (as in MXPOST)
  – Use beam search to get multiple taggings for each document (20)
  – Learn to rerank the candidates to push correct ones to the top, using some new candidate-specific features:
    • Value of the "whole entity" (e.g., "Professor_Cohen")
    • Capitalization features for the whole entity (e.g., "Xx+_Xx+")
    • Last word in the entity, and capitalization features of the last word
    • Bigrams/trigrams of words and capitalization features before and after the entity
Some background…
EMNLP 2002, Best paper
And back to the paper…
Collins’ Experiments
• POS tagging • NP Chunking (words and POS tags from Brill’s
tagger as features) and BIO output tags • Compared Maxent Tagging/MEMM’s (with
iterative scaling) and “Voted Perceptron trained HMM’s” – With and w/o averaging – With and w/o feature selection (count>5)
Collins’ results
Parallelizing perceptrons
• Split into example subsets: instances/labels – 1, instances/labels – 2, instances/labels – 3
• Compute vk's on the subsets, giving vk/va – 1, vk/va – 2, vk/va – 3
• Combine somehow? into a single vk
Parallelizing perceptrons
• Split into example subsets: instances/labels – 1, instances/labels – 2, instances/labels – 3
• Compute vk's on the subsets, giving vk/va – 1, vk/va – 2, vk/va – 3
• Combine somehow into a single vk/va
• Tradeoff: synchronization cost vs. inference (classification) cost
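One concrete reading of "combine somehow" is uniform parameter mixing: train a perceptron on each subset starting from the shared vector, then average the results. A minimal sketch for binary classification (the averaging choice and all names are assumptions, not the slides'):

```python
import numpy as np

def parallel_perceptron_epoch(shards, v, lr=1.0):
    """One round of the split / train / combine scheme: each shard trains a
    perceptron from the shared v, then the local vectors are averaged."""
    local = []
    for shard in shards:                    # in practice these run in parallel
        w = v.copy()
        for x, y in shard:                  # y in {+1, -1}
            if y * (w @ x) <= 0:            # mistake-driven update
                w = w + lr * y * x
        local.append(w)
    return np.mean(local, axis=0)           # the synchronization step
```

Calling this in a loop gives iterative parameter mixing; the `np.mean` at the end is the synchronization cost that trades off against per-example inference cost.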