
The Computational Complexity of Searching for Predictive Hypotheses Shai Ben-David Computer Science Dept. Technion


Page 1:

The Computational Complexity of Searching for Predictive Hypotheses

Shai Ben-David

Computer Science Dept.

Technion

Page 2:

Introduction

The complexity of learning is measured mainly along two axes: information and computation.

Information complexity enjoys a rich theory that yields rather crisp sample-size and convergence-rate guarantees.

The focus of this talk is the computational complexity of learning.

While it plays a critical role in any application, its theoretical understanding is far less satisfactory.

Page 3:

Outline of this Talk

1. Some background.

2. Survey of recent pessimistic hardness results.

3. New efficient learning algorithms for some basic learning architectures.

Page 4:

The Label Prediction Problem

Given some domain set X.

A sample S of labeled members of X is generated by some (unknown) distribution.

For a new point x, predict its label.

Running example: the domain points are data files of drivers, the drivers in the sample are labeled according to whether they filed an insurance claim, and the prediction question is: will the current customer file a claim?

Page 5:

The Agnostic Learning Paradigm

Choose a hypothesis class H of subsets of X.

For an input sample S, find some h in H that fits S well.

For a new point x, predict a label according to its membership in h.
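The paradigm can be sketched concretely. A minimal example (not from the talk), taking H to be the class of threshold functions on the real line and reading "fits S well" as maximizing agreement with the sample:

```python
# Hypothetical illustration: H = { x -> [x >= t] : t real }, and we
# return the h in H with the highest agreement on the labeled sample S.

def erm_threshold(sample):
    """Return the threshold t whose rule 'label 1 iff x >= t'
    agrees with the most labeled points in `sample`."""
    candidates = [x for x, _ in sample]   # a best threshold sits at a data point
    best_t, best_agree = None, -1
    for t in candidates:
        agree = sum(1 for x, y in sample if (x >= t) == (y == 1))
        if agree > best_agree:
            best_t, best_agree = t, agree
    return best_t, best_agree

S = [(0.1, 0), (0.4, 0), (0.5, 1), (0.9, 1), (0.3, 1)]
t, agree = erm_threshold(S)
# For a new point x, predict by membership: label 1 iff x >= t.
```

Note that even in this noisy sample no threshold fits perfectly; the agnostic learner simply returns the best available h in H.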

Page 6:

The Mathematical Justification

If H is not too rich (has small VC-dimension), then, for every h in H, the agreement ratio of h on the sample S is a good estimate of its probability of success on a new x.

Page 7:

The Mathematical Justification - Formally

If S is sampled i.i.d. by some D over X × {0, 1}, then with probability > 1 − δ, for all h in H:

| Pr_{(x,y)~D}[h(x) = y]  −  |{(x, y) ∈ S : h(x) = y}| / |S| |  ≤  c √( (VCdim(H) + ln(1/δ)) / |S| )

where the first term is the probability of success on a new example and the second is the agreement ratio of h on S.
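To get a feel for the numbers, one can plug values into the bound; an illustrative computation (the unspecified constant c is taken to be 1 here; its exact value varies across statements of the theorem):

```python
import math

def vc_bound(vc_dim, sample_size, delta, c=1.0):
    """Worst-case gap between the agreement ratio on S and the true
    probability of success, per the bound above (with constant c)."""
    return c * math.sqrt((vc_dim + math.log(1.0 / delta)) / sample_size)

# Half-spaces in R^n have VC-dimension n + 1; e.g. n = 10, |S| = 10,000:
eps = vc_bound(vc_dim=11, sample_size=10000, delta=0.05)   # about 0.037
```

The sample size enters under a square root, which is the "crisp convergence rate" mentioned earlier: quadrupling |S| halves the gap.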

Page 8:

The Model Selection Issue

The output of the learning algorithm is separated from the best regressor for P by three gaps:

- Approximation error: from the best regressor for P to the best hypothesis in the class H, Argmin{ Er(h) : h ∈ H }.
- Estimation error: from the best hypothesis in H to the empirical minimizer Argmin{ Er_S(h) : h ∈ H }.
- Computational error: from the empirical minimizer to the hypothesis the algorithm actually outputs.

Page 9:

The Computational Problem

Input: a finite set S of {0, 1}-labeled points in R^n.

Output: some 'hypothesis' h in H that maximizes the number of correctly classified points of S.

Page 10:

We shall focus on the class of linear half-spaces.

- Find the best hyperplane for arbitrary samples S: NP-hard.
- Find a hyperplane approximating the optimal for arbitrary S: ?
- Find the best hyperplane for separable S: feasible (Perceptron algorithms).
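The feasible, separable case is handled by the classic Perceptron algorithm; a minimal sketch (illustrative, with the bias folded into the weight vector as an extra coordinate):

```python
# Classic Perceptron: on linearly separable data, repeatedly correct
# mistakes until no point of the sample is misclassified.

def perceptron(sample, max_epochs=1000):
    """sample: list of (x, y) with x a tuple of floats and y in {-1, +1}.
    Returns a weight vector w (bias folded in as w[-1]) or None."""
    dim = len(sample[0][0]) + 1            # +1 for the bias coordinate
    w = [0.0] * dim
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in sample:
            z = tuple(x) + (1.0,)          # homogeneous coordinates
            if y * sum(wi * zi for wi, zi in zip(w, z)) <= 0:
                w = [wi + y * zi for wi, zi in zip(w, z)]
                mistakes += 1
        if mistakes == 0:
            return w                       # every point correctly classified
    return None                            # did not converge (non-separable?)

S = [((0.0, 0.0), -1), ((0.0, 1.0), -1), ((2.0, 2.0), 1), ((3.0, 1.0), 1)]
w = perceptron(S)
```

On separable data the number of mistakes is bounded (by the classical Perceptron convergence theorem), which is what makes this case feasible.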

Page 11:

Hardness-of-Approximation Results

For each of the following classes, approximating the best agreement rate for h in H (on a given input sample S) up to some constant ratio is NP-hard:

- Monomials
- Constant-width monotone monomials
- Half-spaces
- Balls
- Axis-aligned rectangles
- Threshold NNs with constant 1st-layer width

[BD-Eiron-Long], [Bartlett-BD]

Page 12:

The SVM Solution

Rather than bothering with non-separable data, make the data separable by embedding it into some high-dimensional R^n.
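The embedding idea in miniature (a hypothetical example, not from the talk): labels that depend on x**2 cannot be separated by any threshold on the line, but the map x -> (x, x**2) makes the same data linearly separable in R^2:

```python
# One-dimensional data whose label depends on the magnitude of x:
# no single threshold on the line classifies it correctly.
S = [(-2.0, 1), (-1.5, 1), (-0.5, -1), (0.3, -1), (1.8, 1)]

def embed(x):
    """Feature map into R^2: x -> (x, x**2)."""
    return (x, x * x)

# After embedding, the hyperplane x2 = 1 (weights w = (0, 1), bias -1)
# separates the two classes: y * (x**2 - 1) > 0 for every point.
separable = all(y * (embed(x)[1] - 1.0) > 0 for x, y in S)
```

This is exactly the kernel trick in spirit; the next slide's caveat is about how large the target dimension must be in general.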

Page 13:

A Problem with the SVM Method

In "most" cases the data cannot be made separable unless the mapping is into dimension Ω(|X|).

This happens even for classes of small VC-dimension.

For "most" classes, no mapping for which concept-classified data becomes separable has large margins.

In all of these cases generalization is lost!

Page 14:

Data-Dependent Success

Note that the definition of success for agnostic learning is data-dependent;

The success rate of the learner on S is compared to that of the best h in H.

We extend this approach to a data-dependent success definition for approximations: the required success rate is a function of the input data.

Page 15:

A New Success Criterion

A learning algorithm A is μ-margin successful if, for every input S ⊆ R^n × {0, 1},

|{(x, y) ∈ S : A(S)(x) = y}|  ≥  |{(x, y) ∈ S : h(x) = y and d(h, x) ≥ μ}|

for every half-space h (where d(h, x) is the distance of x from the boundary of h).

Page 16:

Some Intuition

If there exists some optimal h which separates with generous margins, then a μ-margin algorithm must produce an optimal separator.

On the other hand,

If every good separator can be degraded by small perturbations, then a μ-margin algorithm can settle for a hypothesis that is far from optimal.

Page 17:

A New Positive Result

For every positive μ, there is an efficient μ-margin successful algorithm.

That is, an algorithm that classifies correctly as many input points as any half-space can classify correctly with margin μ.

Page 18:

The positive result

For every positive μ, there is a μ-margin algorithm whose running time is polynomial in |S| and n.

A Complementing Hardness Result

Unless P = NP, no algorithm can do this in time polynomial in 1/μ (as well as in |S| and n).

Page 19:

A μ-margin Perceptron Algorithm

On input S, consider all k-size sub-samples.

For each such sub-sample, find its largest-margin separating hyperplane.

Among all the (~|S|^k) resulting hyperplanes, choose the one with the best performance on S.

(The choice of k is a function of the desired margin μ.)
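The sub-sample scheme can be sketched as follows (illustrative only; the largest-margin step is replaced here by a plain Perceptron run on each sub-sample, a substitution the next slide notes is permissible):

```python
import itertools

def perceptron(points, epochs=200):
    """Mistake-driven Perceptron; bias folded in as the last coordinate."""
    w = [0.0] * (len(points[0][0]) + 1)
    for _ in range(epochs):
        clean = True
        for x, y in points:
            z = tuple(x) + (1.0,)
            if y * sum(a * b for a, b in zip(w, z)) <= 0:
                w = [a + y * b for a, b in zip(w, z)]
                clean = False
        if clean:
            break
    return w

def agreement(w, sample):
    """Number of points of the sample that w classifies correctly."""
    return sum(1 for x, y in sample
               if y * sum(a * b for a, b in zip(w, tuple(x) + (1.0,))) > 0)

def best_subsample_halfspace(sample, k):
    """Fit a separator to every k-point sub-sample; keep the hyperplane
    that performs best on the whole sample."""
    best_w, best_agree = None, -1
    for sub in itertools.combinations(sample, k):
        w = perceptron(list(sub))
        a = agreement(w, sample)
        if a > best_agree:
            best_w, best_agree = w, a
    return best_w, best_agree

# Five points in R^2; the last one makes the sample non-separable,
# so the best any half-space can do is 4 of the 5 points.
S = [((0.0, 0.0), -1), ((1.0, 0.0), -1), ((0.0, 3.0), 1),
     ((1.0, 3.0), 1), ((0.5, -1.0), 1)]
w, agree = best_subsample_halfspace(S, k=3)
```

The ~|S|^k enumeration is what makes the running time polynomial for each fixed margin, yet exponential in the margin parameter, matching the complementing hardness result on the previous slide.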

Page 20:

Other μ-margin Algorithms

Each of the following algorithms can replace the "find the largest-margin separating hyperplane" step:

The usual “Perceptron Algorithm”.

"Find a point of equal distance from x1, …, xk".

Phil Long’s ROMMA algorithm.

These are all very fast online algorithms.

Page 21:

Directions for Further Research

Can similar efficient algorithms be derived for more complex NN architectures?

How well do the new algorithms perform on real data sets?

Can the ‘local approximation’ results be extended to more geometric functions?