Agnostic Learning of Conjunctions by Halfspaces is
Hard
Yi Wu (CMU)
Joint work with
Vitaly Feldman (IBM)
Venkat Guruswami (CMU)
Prasad Raghavendra (MSR)
Conjunctions (Monomials)
10 Million | Lottery | Cheap | Pharmacy | Junk | Is Spam
-----------|---------|-------|----------|------|---------
YES        | YES     | NO    | YES      | NO   | SPAM
NO         | YES     | YES   | NO       | YES  | NOT SPAM
YES        | YES     | YES   | YES      | YES  | SPAM
NO         | NO      | NO    | NO       | YES  | NOT SPAM
YES        | NO      | YES   | NO       | YES  | NOT SPAM
YES        | YES     | NO    | YES      | NO   | SPAM
“10 Million = yes” and “Lottery = yes” and “Pharmacy = yes”
The Spam Problem
Decision Lists
10 Million | Lottery | Cheap | Pharmacy | Junk | Is Spam
-----------|---------|-------|----------|------|---------
YES        | YES     | NO    | YES      | NO   | SPAM
NO         | YES     | YES   | NO       | YES  | NOT SPAM
YES        | YES     | YES   | YES      | YES  | SPAM
NO         | NO      | NO    | YES      | YES  | NOT SPAM
YES        | NO      | YES   | NO       | YES  | NOT SPAM
YES        | YES     | NO    | NO       | NO   | SPAM
If “10 Million = No” then Not SPAM Else
If “Lottery = No” then Not Spam Else
If “Pharmacy = No” then Not Spam Else SPAM
The Spam Problem
Halfspaces
10 Million | Lottery | Cheap | Pharmacy | Junk | Is Spam
-----------|---------|-------|----------|------|---------
YES        | YES     | NO    | YES      | NO   | SPAM
NO         | YES     | YES   | NO       | YES  | NOT SPAM
YES        | YES     | YES   | YES      | YES  | SPAM
NO         | NO      | NO    | YES      | YES  | NOT SPAM
YES        | NO      | YES   | NO       | YES  | NOT SPAM
YES        | YES     | NO    | NO       | NO   | SPAM
“Million = YES” + 2 · “Lottery = YES” + “Pharmacy = YES” ≥ 4
The Spam Problem
Unknown distribution D over R^n; examples labeled by an unknown function f.
PAC learning Model
[Figure: sample points labeled + and − drawn from D]
After receiving the examples, the algorithm does its computation and outputs a hypothesis h.
Accuracy of hypothesis h: Pr_{x~D}[h(x) = f(x)].
Unknown distribution D over {0,1}^n; examples labeled by an unknown conjunction.
Learning Conjunctions from random examples is easy!
[Figure: perfectly labeled sample of + and − points]
Since a conjunction is a special case of a halfspace, we can use polynomial-time linear programming to find a halfspace hypothesis consistent with all the examples:
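A minimal sketch of this LP approach (using scipy's linprog; the function name, the 0/1 encoding, and the unit margin are illustrative assumptions, not from the talk):

```python
# Find a halfspace consistent with perfectly labeled examples via an LP.
# Variables: weights w (n of them) and a threshold theta.
# Positive example x: require w.x >= theta; negative: w.x <= theta - 1.
# (A margin of 1 is harmless here: a conjunction on k variables is realized
# by w = indicator vector of its variables and theta = k, which has margin 1.)
import numpy as np
from scipy.optimize import linprog

def consistent_halfspace(X, y):
    """X: (m, n) 0/1 array; y: length-m array of 0/1 labels."""
    m, n = X.shape
    rows, rhs = [], []
    for x, label in zip(X, y):
        if label == 1:                       # want w.x - theta >= 0,
            rows.append(np.append(-x, 1.0))  # i.e. theta - w.x <= 0
            rhs.append(0.0)
        else:                                # want w.x - theta <= -1
            rows.append(np.append(x, -1.0))
            rhs.append(-1.0)
    res = linprog(c=np.zeros(n + 1),         # feasibility only: zero objective
                  A_ub=np.array(rows), b_ub=np.array(rhs),
                  bounds=[(None, None)] * (n + 1))
    if not res.success:
        return None                          # no consistent halfspace exists
    w, theta = res.x[:n], res.x[n]
    return lambda x: int(np.dot(w, x) >= theta)
```

The failure branch is exactly the brittleness discussed below: a single mislabeled example can make the LP infeasible.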
Well-known theory (VC dimension): for any D, a random sample of poly(n, 1/ε) many examples yields an ε-accurate hypothesis w.h.p.
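For reference, the standard realizable-case VC bound behind this claim (textbook form, not from the slides; halfspaces over R^n have VC dimension n + 1):

```latex
m = O\!\left(\frac{1}{\epsilon}\Big(n\log\frac{1}{\epsilon} + \log\frac{1}{\delta}\Big)\right)
\;\Longrightarrow\;
\Pr_{x\sim D}\big[h(x)\neq f(x)\big]\le \epsilon \text{ with probability } \ge 1-\delta.
```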
Real-world data probably doesn’t come with a guarantee that examples are labeled perfectly according to a conjunction.
Linear programming is brittle: noisy examples can easily result in no consistent hypothesis.
Motivates study of noisy variants of PAC learning for conjunctions.
Learning Conjunctions from perfectly labeled random examples is easy! …but not very realistic…
[Figure: sample of + and − points with some mislabeled]
This Talk: Learning Conjunctions with Agnostic Noise
Unknown distribution D over {0,1}^n; examples labeled by an unknown conjunction f.
All the random examples given to the learner:
◦ a 1 − ε fraction of the examples are perfectly labeled, i.e. x ~ D, y = f(x);
◦ an ε fraction of the examples are mislabeled.
Goal: find a hypothesis with good accuracy (close to 1 − ε? Or just better than 50%?).
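A small sketch of this noise model (my own illustration: random label flips stand in for the ε mislabeled fraction, although in the agnostic setting those labels may be adversarial rather than random):

```python
import numpy as np

rng = np.random.default_rng(0)

def agnostic_sample(m, n, relevant, eps):
    """m examples over {0,1}^n labeled by the conjunction of the coordinates
    in `relevant`, with an eps fraction of the labels corrupted."""
    X = rng.integers(0, 2, size=(m, n))
    y = X[:, relevant].all(axis=1).astype(int)  # y = f(x) for the true conjunction
    flip = rng.random(m) < eps                  # ~eps fraction mislabeled
    y[flip] ^= 1
    return X, y

# e.g. a spam-like conjunction over features (10 Million, Lottery, Pharmacy):
X, y = agnostic_sample(1000, 5, relevant=[0, 1, 3], eps=0.01)
```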
No noise: [Val84, Lit88, Hau88]: PAC-learnable.
Random noise: [Kea98]: PAC-learnable under the random noise model.
Related Work (Positive)
[FGKP06]: For any ε, δ > 0, it is NP-hard to tell whether
◦ some conjunction is consistent with a 1 − ε fraction of the data, or
◦ no conjunction is ½ + δ consistent with the data.
It is NP-hard to find a 51%-accurate conjunction even knowing that some conjunction is consistent with 99% of the data.
Related Work (Negative)
Proper: given that f is in function class C (e.g. conjunctions), the learner outputs a function in class C.
Non-proper: given that f is in class C (e.g. conjunctions), the learner may output a function in another class D (e.g. halfspaces).
Proper vs. Non-Proper Learning
We might still be able to learn conjunctions by outputting a larger class of functions (say, by linear programming?).
◦ E.g., [Lit88] uses the Winnow algorithm, which outputs a halfspace.
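A minimal sketch of Winnow for this purpose (the classic update rule for monotone disjunctions; the reduction from conjunctions via De Morgan is my paraphrase of the standard trick, not spelled out in the talk):

```python
import numpy as np

def winnow(stream, n):
    """Online Winnow [Lit88] for a monotone disjunction over n bits;
    its hypothesis is always the halfspace sgn(w.x - n).
    `stream` yields (x, label) with x a {0,1}^n numpy array, label in {0,1}.
    To learn a conjunction AND_{i in S} x_i, feed it (1 - x, 1 - label):
    by De Morgan the complement is the disjunction OR_{i in S} (1 - x_i)."""
    w = np.ones(n)
    theta = float(n)
    for x, label in stream:
        pred = int(np.dot(w, x) >= theta)
        if pred == 0 and label == 1:
            w[x == 1] *= 2.0        # promotion: double weights of active bits
        elif pred == 1 and label == 0:
            w[x == 1] /= 2.0        # demotion: halve weights of active bits
    return lambda x: int(np.dot(w, x) >= theta)
```

The point relevant here is visible in the return value: even though the target is a conjunction, the hypothesis Winnow outputs is a halfspace.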
Weakness of Previous Result
[FGKP, GR]: For any ε, δ > 0, it is NP-hard to tell whether
◦ some halfspace is consistent with a 1 − ε fraction of the data, or
◦ no halfspace is ½ + δ consistent with the data.
It is NP-hard to find a 51%-accurate halfspace even knowing that some halfspace is consistent with 99% of the data.
Other Related Work
For any ε, δ > 0, it is NP-hard to tell whether
◦ some conjunction is consistent with a 1 − ε fraction of the data, or
◦ no function in any hypothesis class is ½ + δ consistent with the data.
Ideally, we want to show:
[ABX08]: Showing NP-hardness via black-box reductions for improper learning with an unrestricted hypothesis class is itself hard:
◦ it would otherwise break long-standing cryptographic assumptions (it would give a transformation from any average-case-hard problem in NP into a one-way function).
Negative Negative Result
For any ε, δ > 0, it is NP-hard to tell whether
◦ some conjunction is consistent with a 1 − ε fraction of the data, or
◦ no halfspace is ½ + δ consistent with the data.
Main Result
It is NP-hard to find a 51%-accurate halfspace even knowing that some conjunction is consistent with 99% of the data.
Why halfspaces?
In practice, halfspaces are at the heart of many learning algorithms:
◦ Perceptron
◦ Winnow
◦ SVM
◦ Logistic Regression
◦ Linear Discriminant Analysis
We cannot agnostically learn conjunctions using any of the above-mentioned algorithms!
Corollary
[Diagram: Conjunctions ⊆ Decision Lists ⊆ Halfspaces]
Weakly agnostically learning Conjunctions/Decision Lists/Halfspaces by Halfspaces is hard!
◦ “Dictator”: a halfspace depending on very few variables, e.g. f(x) = sgn(x_1).
◦ “Majority”: no variable has too much weight, e.g. f(x) = sgn(x_1 + x_2 + … + x_n).
Dictator Test
Dictator Testing for halfspaces: the tester chooses x ∈ {0,1}^n and b ∈ {0,1} from some distribution, queries the halfspace f : {0,1}^n → {0,1} on x, and accepts if f(x) = b.
Completeness ≥ c ⇔ every dictator (monomial) f(x) = x_i is accepted with prob. ≥ c.
Soundness ≤ s ⇔ every “Majority-like” function is accepted with prob. ≤ s.
With such a test, we can show it is NP-hard to tell whether i) some monomial satisfies a c fraction of the data, or ii) no halfspace satisfies more than an s fraction of the data.
1) Generate z by setting each z_i independently to a random bit.
2) Generate y by resetting each z_i to 0 with probability 0.99 (y_i = z_i otherwise).
3) Generate a random bit b and set x_i = y_i + b/2n.
4) Output (x, b); accept if f(x) = sgn(b).
How to generate (x,b)
If f(x) = x_i (a dictator):
◦ then Pr(f(x) = b) ≥ Pr(y_i = 0) = 0.99, since y_i = 0 makes x_i = b/2n, whose sign is b.
If f(x) = sgn(x_1 + x_2 + … + x_n) (Majority):
◦ then Pr(f(x) = b) = Pr(sgn(N(0, 0.1) + b/2n) = b) < 0.51, since the tiny b/2n shift barely moves the Gaussian.
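A simulation sketch of the tester and of this analysis (my reading of the slides: bits are taken as ±1 and b ∈ {−1, +1}, so that “accept if f(x) = sgn(b)” type-checks; n and the trial count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)

def acceptance_prob(f, n, trials=2000):
    """Estimate Pr[f(x) = b] for the four-step tester above."""
    accepted = 0
    for _ in range(trials):
        z = rng.choice([-1, 1], size=n)            # 1) independent random bits
        y = np.where(rng.random(n) < 0.99, 0, z)   # 2) reset each bit to 0 w.p. 0.99
        b = rng.choice([-1, 1])                    # 3) random bit b,
        x = y + b / (2 * n)                        #    shifted into every coordinate
        accepted += (np.sign(f(x)) == b)           # 4) accept iff f(x) = sgn(b)
    return accepted / trials

n = 10_000
dictator = lambda x: x[0]                          # f(x) = x_1
majority = lambda x: np.sum(x)                     # f(x) = sgn(x_1 + ... + x_n)
print(acceptance_prob(dictator, n))                # ~0.99: completeness
print(acceptance_prob(majority, n))                # ~0.5 + o(1): soundness
```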
Analysis of the Test
We prove that even weakly agnostically learning conjunctions by halfspaces is NP-hard.
To get an efficient halfspace-learning algorithm for conjunctions/decision lists/halfspaces, we need to model either the distribution of the examples or the noise.
Conclusion
Prove: for any ε, δ > 0, given a set of training examples such that some conjunction is consistent with a 1 − ε fraction of the data, it is NP-hard to find a degree-d polynomial threshold function that is ½ + δ consistent with the data.
Why low-degree PTFs? Because such a hypothesis class can agnostically learn conjunctions/halfspaces under the uniform distribution.
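For concreteness, the standard definition of a degree-d polynomial threshold function (halfspaces are the d = 1 case):

```latex
f(x) = \operatorname{sgn}\big(p(x)\big), \qquad p \in \mathbb{R}[x_1,\dots,x_n],\ \deg(p) \le d.
```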
Future Work