
Page 1: Computational Learning Theory

Lehrstuhl für Informatik 2
Gabriella Kókai: Machine Learning

Page 2: Content

➔ Introduction
Probably Learning an Approximately Correct Hypothesis
Sample Complexity for Finite Hypothesis Spaces
Sample Complexity for the Infinite Hypothesis Space
The Mistake Bound Model of Learning
Summary

Page 3: Introduction

Goal: theoretical characterisation of
    the difficulty of several types of ML problems
    the capabilities of several types of ML algorithms
Answers to the questions:
    Under what conditions is successful learning possible or impossible?
    Under what conditions is a particular ML algorithm assured to learn successfully?
PAC:
    Identify classes of hypotheses that can or cannot be learned given a polynomial number of training examples
    Define a natural complexity measure for hypothesis spaces that makes it possible to bound the number of training examples required for inductive learning

Page 4: Introduction 2 – Task

Given: training examples ⟨x_i, c(x_i)⟩ with x_i ∈ X, and a space of candidate hypotheses H
Goal: inductive learning of an estimate of c(x)
Questions:
    Sample complexity: How many training examples are needed for a learner to converge (with high probability) to a successful hypothesis?
    Computational complexity: How much computational effort is needed for a learner to converge (with high probability) to a successful hypothesis?
    Mistake bound: How many training examples will the learner misclassify before converging to a successful hypothesis?

Page 5: Introduction 3

Quantitative bounds can be set on these measures, depending on attributes of the learning problem such as:
    the size or complexity of the hypothesis space considered by the learner
    the accuracy to which the target concept must be approximated
    the probability that the learner will output a successful hypothesis
    the manner in which training examples are presented to the learner

Page 6: Content

Introduction
➔ Probably Learning an Approximately Correct Hypothesis
    The Problem Setting
    Error of the Hypothesis
    PAC Learnability
Sample Complexity for Finite Hypothesis Spaces
Sample Complexity for the Infinite Hypothesis Space
The Mistake Bound Model of Learning
Summary

Page 7: Probably Learning an Approximately Correct Hypothesis

PAC (probably approximately correct): probably learning an approximately correct solution
Restriction: we only consider the case of learning boolean-valued concepts from noise-free training data
The results can be extended to the more general scenario of learning real-valued target functions (Natarajan 1991)
The results can also be extended to learning from certain types of noisy data (Laird 1988; Kearns and Vazirani 1994)

Page 8: The Problem Setting – Names

X – set of all possible instances over which target functions may be defined
C – set of target concepts that our learner might be called upon to learn
D – probability distribution according to which instances are drawn; generally not known to the learner, assumed stationary: the distribution does not change over time
T – set of training examples {⟨x, c(x)⟩ | x ∈ X}
H – space of candidate hypotheses

Each target concept c in C corresponds to some subset of X, or equivalently to a boolean-valued function c: X → {0, 1}
Searched: after observing a sequence of training examples of c, L must output some h from H that estimates c
Evaluation of the success of L: performance of h over new instances drawn randomly from X according to D

Page 9: Error of the Hypothesis

True error: error_D(h) ≡ Pr_{x∈D}[c(x) ≠ h(x)]
    the error of h with respect to c is not directly observable; L can only observe the performance of h over the training examples
Training error: fraction of training examples misclassified by h
Analysis: how probable is it that the observed training error of h gives a misleading estimate of the true error_D(h)?
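
As a concrete illustration of the two error notions, the following sketch (purely illustrative concepts c and h and a uniform distribution D over 5-bit instances; none of this is prescribed by the slide) estimates both quantities:

```python
import random

# Minimal sketch: compare the true error error_D(h) with the training error
# observed on a small sample T. Instances are 5-bit vectors, D is uniform.
random.seed(0)

def c(x):                      # illustrative target concept: x1 AND x2
    return x[0] == 1 and x[1] == 1

def h(x):                      # illustrative (imperfect) hypothesis: x1 only
    return x[0] == 1

def draw(m):                   # draw m instances according to the uniform D
    return [tuple(random.randint(0, 1) for _ in range(5)) for _ in range(m)]

true_err = sum(h(x) != c(x) for x in draw(100_000)) / 100_000   # ~0.25 under D
T = draw(20)
train_err = sum(h(x) != c(x) for x in T) / len(T)               # noisy estimate
print(true_err, train_err)
```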

Page 10: PAC Learnability

Goal: characterise classes of target concepts that can be reliably learned from a reasonable number of randomly drawn training examples and a reasonable amount of computation
Possible definition of the success of training: for every c ∈ C, search an h ∈ H with error_D(h) = 0
Problems:
    multiple hypotheses may be consistent with the training examples
    the training set may be non-representative

Definition of PAC-learnability: Consider a concept class C defined over a set of instances X of length n and a learner L using a hypothesis space H. C is PAC-learnable by L using H if for all c ∈ C, all distributions D over X, all ε with 0 < ε < 1/2 and all δ with 0 < δ < 1/2, the learner L will with probability at least (1 − δ) output a hypothesis h ∈ H such that error_D(h) ≤ ε, in time that is polynomial in 1/ε, 1/δ, n and size(c)

Page 11: Content

Introduction
Probably Learning an Approximately Correct Hypothesis
➔ Sample Complexity for Finite Hypothesis Spaces
    Agnostic Learning and Inconsistent Hypotheses
    Conjunctions of Boolean Literals Are PAC-Learnable
Sample Complexity for the Infinite Hypothesis Space
The Mistake Bound Model of Learning
Summary

Page 12: Sample Complexity for Finite Hypothesis Spaces

Definition: the sample complexity of a learning problem is the number of training examples required for successful learning; it depends on the constraints of the learning problem
Consistent learner: a learner that outputs a hypothesis that perfectly fits the training data whenever possible
Question: can a bound be derived on the number of training examples required by any consistent learner, independent of the specific algorithm it uses to derive a consistent hypothesis? -> YES
Significance of the version space VS_{H,T} = {h ∈ H | ∀⟨x, c(x)⟩ ∈ T : h(x) = c(x)}: every consistent learner outputs a hypothesis belonging to the version space
Therefore, to bound the number of examples needed by any consistent learner, we need only bound the number of examples needed to assure that the version space contains no unacceptable hypotheses

Page 13: Sample Complexity for Finite Hypothesis Spaces 2

Definition of ε-exhausted (Haussler 1988): Consider a hypothesis space H, a target concept c, an instance distribution D and a set of training examples T of c. The version space VS_{H,T} is said to be ε-exhausted with respect to c and D if every hypothesis h in VS_{H,T} has error less than ε with respect to c and D:
∀h ∈ VS_{H,T} : error_D(h) < ε

Picture: the version space shown is 0.3-exhausted but not 0.1-exhausted

Page 14: Sample Complexity for Finite Hypothesis Spaces 3

Theorem: ε-exhausting the version space (Haussler 1988). If the hypothesis space H is finite (|H| < ∞) and T is a sequence of m ≥ 1 independent randomly drawn examples of some target concept c, then for any 0 ≤ ε ≤ 1 the probability that the version space VS_{H,T} is not ε-exhausted (with respect to c) is less than or equal to |H| e^{-εm}

Important consequence: given an upper bound δ on this probability of failure, using |H| e^{-εm} ≤ δ choose
m ≥ (1/ε) (ln|H| + ln(1/δ))

Hint 1: m grows linearly in 1/ε, logarithmically in 1/δ, and logarithmically in the size of H
Hint 2: the bound can substantially overestimate the required number of examples; for small m it can even exceed 1 (|H| e^{-εm} > 1) and is then vacuous
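
The bound above translates directly into a small helper; a sketch, with the example numbers chosen only for illustration:

```python
import math

# Sample-size bound for a consistent learner over a finite hypothesis space:
# m >= (1/eps) * (ln|H| + ln(1/delta))   (theorem above)
def sample_complexity_finite(h_size: int, eps: float, delta: float) -> int:
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / eps)

# e.g. |H| = 973, eps = 0.1, delta = 0.05  (illustrative numbers)
print(sample_complexity_finite(973, 0.1, 0.05))   # -> 99
```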

Page 15: Agnostic Learning and Inconsistent Hypotheses

Problem: a consistent hypothesis is not always possible (H does not contain c)
Agnostic learning: choose the hypothesis with the smallest training error,
h_best := argmin_{h∈H} error_T(h), where for example
error_T(h) = (number of examples in T misclassified by h) / (number of examples in T)
Searched: an m_0 such that if m = |T| > m_0, then with high probability
error_D(h_best) ≤ error_T(h_best) + ε

Page 16: Agnostic Learning and Inconsistent Hypotheses 2

Analogy: m independent coin flips showing "heads" with some probability (m distinct trials of a Bernoulli experiment)
Hoeffding bound: characterises the deviation between the true probability of some event and its observed frequency over m independent trials =>
Pr[error_D(h_best) > error_T(h_best) + ε] ≤ e^{-2mε²}
Requirement: the error of h_best must be bounded uniformly over H =>
Pr[∃h ∈ H : error_D(h) > error_T(h) + ε] ≤ |H| e^{-2mε²} ≤ δ
Interpretation: given δ, choose
m ≥ (1/(2ε²)) (ln|H| + ln(1/δ))
m still depends logarithmically on |H| and on 1/δ, but now grows as 1/ε²
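
The same calculation with the agnostic (Hoeffding-based) bound shows the 1/ε² growth; again a sketch with illustrative numbers:

```python
import math

# Agnostic bound: m >= 1/(2*eps^2) * (ln|H| + ln(1/delta))
def sample_complexity_agnostic(h_size: int, eps: float, delta: float) -> int:
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / (2 * eps ** 2))

# Same illustrative numbers as before (|H| = 973, eps = 0.1, delta = 0.05):
# the requirement grows from ~99 to ~494 examples, a factor of 1/(2*eps) = 5.
print(sample_complexity_agnostic(973, 0.1, 0.05))   # -> 494
```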

Page 17: Conjunctions of Boolean Literals Are PAC-Learnable

Example: C is the class of target concepts described by a conjunction of boolean literals (a literal is any boolean variable or its negation)
Is C PAC-learnable? -> YES
Any consistent learner requires only a polynomial number of training examples to learn any c in C
There is a specific algorithm that uses polynomial time per training example
Assumption H = C, so |H| = 3^n (each variable occurs positively, negatively, or not at all); from the theorem of Haussler, m ≥ (1/ε) (ln|H| + ln(1/δ)), it follows that
m ≥ (1/ε) (n ln 3 + ln(1/δ))
m grows linearly in the number of literals n, linearly in 1/ε and logarithmically in 1/δ

Page 18: Conjunctions of Boolean Literals Are PAC-Learnable 2

Example with numbers: 10 boolean variables; wanted: 95% confidence (δ = 0.05) that the error of the hypothesis is at most ε = 0.1 =>
m ≥ (1/0.1) (10 ln 3 + ln(1/(1 − 0.95))) ≈ 140

An algorithm with polynomial computing time: the Find-S algorithm computes for each new positive training example x ∈ X the intersection of the literals shared by the current hypothesis and the new training example, using time linear in n
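
A two-line check of this arithmetic, using the formula from the previous slide:

```python
import math

# n = 10 literals, eps = 0.1, delta = 1 - 0.95
m = (10 * math.log(3) + math.log(1 / (1 - 0.95))) / 0.1
print(math.ceil(m))   # -> 140
```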

Page 19: Find-S: Finding a Maximally Specific Hypothesis

Use the more_general_than partial ordering:
    Begin with the most specific possible hypothesis in H
    Generalise this hypothesis each time it fails to cover an observed positive example

1. Initialise h to the most specific hypothesis in H
2. For each positive training instance x
       For each attribute constraint a_i in h
           If the constraint a_i is satisfied by x
           Then do nothing
           Else replace a_i in h by the next more general constraint that is satisfied by x
3. Output hypothesis h

Page 20: Find-S: Finding a Maximally Specific Hypothesis (Example)

1. Step: h ← ⟨∅, ∅, ∅, ∅, ∅, ∅⟩
2. Step (1st positive example): h ← ⟨Sunny, Warm, Normal, Strong, Warm, Same⟩
3. Step: substitute a '?' in place of any attribute value in h that is not satisfied by the new example: h ← ⟨Sunny, Warm, ?, Strong, Warm, Same⟩
3rd (negative) example: the FIND-S algorithm simply ignores every negative example
4. Step: h ← ⟨Sunny, Warm, ?, Strong, ?, ?⟩
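
A runnable sketch of Find-S on this example. The full training set is not shown on the slide; the four examples below are assumed so that they reproduce the trace above:

```python
# Minimal Find-S sketch (attribute-constraint version).
def find_s(examples):
    """examples: list of (attribute_tuple, label) with boolean labels."""
    h = None                                  # most specific hypothesis: covers nothing
    for x, positive in examples:
        if not positive:                      # Find-S ignores negative examples
            continue
        if h is None:
            h = list(x)                       # first positive example: copy it verbatim
        else:
            # Generalise each constraint that the new positive example violates.
            h = [hi if hi == xi else '?' for hi, xi in zip(h, x)]
    return h

training = [                                  # assumed training data, matching the trace
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'), True),
    (('Sunny', 'Warm', 'High',   'Strong', 'Warm', 'Same'), True),
    (('Rainy', 'Cold', 'High',   'Strong', 'Warm', 'Change'), False),
    (('Sunny', 'Warm', 'High',   'Strong', 'Cool', 'Change'), True),
]

print(find_s(training))   # ['Sunny', 'Warm', '?', 'Strong', '?', '?']
```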

Page 21: Content

Introduction
Probably Learning an Approximately Correct Hypothesis
Sample Complexity for Finite Hypothesis Spaces
➔ Sample Complexity for the Infinite Hypothesis Space
    Shattering a Set of Instances
    The Vapnik-Chervonenkis Dimension
    Sample Complexity and the VC Dimension
The Mistake Bound Model of Learning
Summary

Page 22: Sample Complexity for Infinite Hypothesis Spaces

Disadvantage of the previous estimate: the bound is weak, and in the case of an infinite hypothesis space it cannot be used at all
Definition: Shattering a set of instances. A set of instances S is shattered by a hypothesis space H if and only if for every dichotomy of S there exists some hypothesis in H consistent with this dichotomy:
∀S_1, S_2 with S_1 ∪ S_2 = S and S_1 ∩ S_2 = ∅ : ∃h ∈ H : (∀x ∈ S_1 : h(x) = 1) ∧ (∀x ∈ S_2 : h(x) = 0)
Here the measure is based not on the number of distinct hypotheses |H| but on the number of distinct instances from X that can be completely discriminated using H

Page 23: Shattering a Set of Instances

It follows from the definition: S ⊆ X is not shattered by H if there exist s_1, s_2 ∈ S with s_1 ≠ s_2 that are indistinguishable from the aspect of all hypotheses, i.e. no h ∈ H separates s_1 from s_2

Page 24: The Vapnik-Chervonenkis Dimension

A set S of instances admits 2^|S| different dichotomies
Definition (Vapnik-Chervonenkis dimension): the Vapnik-Chervonenkis dimension VC(H) of a hypothesis space H defined over the instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite sets of X can be shattered by H, then VC(H) ≡ ∞

Example: let H be the set of intervals on the real numbers; VC(H) = ?
|S| = 2: S = {3.1, 5.7} is shattered, e.g. by Ĥ = {(1,2), (1,4), (4,7), (1,7)} ⊆ H
|S| = 3: S = {3.1, 5.7, 9.3} cannot be shattered by any Ĥ ⊆ H: problem separating 5.7 from the two outer points
=> VC(H) = 2
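
The interval example can be verified mechanically; a brute-force sketch over all dichotomies (the helper names are ours, not from the slide):

```python
from itertools import combinations, chain

# Check whether a finite point set S is shattered by the hypothesis space of
# real intervals, mirroring the VC(H) = 2 example above.

def interval_consistent(pos, neg):
    """Is there an interval [a, b] containing all of pos and none of neg?"""
    if not pos:
        return True                       # any interval disjoint from S realises the all-negative dichotomy
    lo, hi = min(pos), max(pos)           # the minimal interval covering pos
    return all(not (lo <= x <= hi) for x in neg)

def shattered_by_intervals(S):
    """Every dichotomy of S must be realisable by some interval."""
    S = set(S)
    subsets = chain.from_iterable(combinations(S, r) for r in range(len(S) + 1))
    return all(interval_consistent(set(p), S - set(p)) for p in subsets)

print(shattered_by_intervals({3.1, 5.7}))        # True  -> sets of size 2 can be shattered
print(shattered_by_intervals({3.1, 5.7, 9.3}))   # False -> {3.1, 9.3} positive, 5.7 negative fails
```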

Page 25: The Vapnik-Chervonenkis Dimension 2

Example: let X = R² and H be the set of linear decision surfaces in the x, y plane; VC(H) = 3
|S| = 2: shattering is obviously possible
|S| = 3: in the general case shattering is possible; in the irregular special case (collinear points) there is no shattering, but one shatterable set of size 3 suffices
|S| = 4: no shattering is possible
Note: VC(H) ≥ n holds as soon as there exists one S ⊆ X with |S| = n and an Ĥ ⊆ H such that Ĥ shatters S

Page 26: Sample Complexity and the VC Dimension

Earlier: m ≥ (1/ε) (ln|H| + ln(1/δ)) randomly drawn examples suffice to probably approximately learn any c in C
Sufficient number of examples, expressed via VC(H) (upper bound):
m ≥ (1/ε) (4 log₂(2/δ) + 8 VC(H) log₂(13/ε))

Theorem: lower bound on sample complexity (necessary). Consider any concept class C such that VC(C) ≥ 2, any learner L, and any 0 < ε < 1/8 and 0 < δ < 1/100. Then there exists a distribution D and a target concept in C such that if L observes fewer examples than
m = max[ (1/ε) log(1/δ), (VC(C) − 1) / (32ε) ]
then with probability at least δ, L outputs a hypothesis h having error_D(h) > ε
Hint: both bounds are logarithmic in 1/δ and linear in VC(H)
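
Both expressions, written out as helpers (a sketch; the VC(H) = 3 and ε, δ values are only illustrative):

```python
import math

def vc_upper_bound(vc: int, eps: float, delta: float) -> int:
    """Number of examples sufficient under the PAC model (upper bound above)."""
    return math.ceil((4 * math.log2(2 / delta) + 8 * vc * math.log2(13 / eps)) / eps)

def vc_lower_bound(vc: int, eps: float, delta: float) -> int:
    """Number of examples below which some D and c defeat every learner."""
    return math.ceil(max(math.log(1 / delta) / eps, (vc - 1) / (32 * eps)))

# Illustrative: VC(H) = 3 (linear separators in the plane), eps = 0.1, delta = 0.05
print(vc_lower_bound(3, 0.1, 0.05), vc_upper_bound(3, 0.1, 0.05))
```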

Page 27: Content

Introduction
Probably Learning an Approximately Correct Hypothesis
Sample Complexity for Finite Hypothesis Spaces
Sample Complexity for the Infinite Hypothesis Space
➔ The Mistake Bound Model of Learning
    The Mistake Bound for the FIND-S Algorithm
    The Mistake Bound for the HALVING Algorithm
    Optimal Mistake Bounds
    WEIGHTED-MAJORITY Algorithm
Summary

Page 28: The Mistake Bound Model of Learning

Mistake bound model: the learner is evaluated by the total number of mistakes it makes before it converges to the correct hypothesis
Problem setting: inductive learning; the learner receives a sequence of training examples, but after each x it must predict the target value c(x) before it is shown the correct target value by the trainer
Success criterion: exact learning / PAC learning
Question: how many mistakes will the learner make in its predictions before it learns the target concept? This is significant in practical applications where learning must be done while the system is in actual use
Exact learning: ∀x ∈ X : h(x) = c(x)

Page 29: The Mistake Bound for the Find-S Algorithm

Assumption: c ∈ C ⊆ H, H = conjunctions of up to n boolean literals l_1, ..., l_n and their negations; learning without noise
Find-S algorithm (literal version):
    Initialise h to the most specific hypothesis h ← l_1 ∧ ¬l_1 ∧ l_2 ∧ ¬l_2 ∧ ... ∧ l_n ∧ ¬l_n
    For each positive training instance x: remove from h any literal that is not satisfied by x
    Output hypothesis h
Can we bound the total number of mistakes that Find-S will make before exactly learning c? -> YES
Note: no mistakes are made on negative instances
Step 1 (the first mistake, on the first positive example) removes n of the 2n literals; every additional mistake removes at least one more literal
=> at most n + 1 mistakes (worst case: the target concept with c(x) = 1 for every x)

Page 30: The Mistake Bound for the HALVING Algorithm

Halving algorithm = refining the version space (candidate elimination) + classifying new instances by a majority vote over the version space
Every mistake at least halves the version space: |VS_new| ≤ |VS_old| / 2
=> at most ⌊log₂|H|⌋ mistakes before the target concept is learned exactly
Note: the version space may also be reduced in the case of a correct prediction (but a reduction is not guaranteed then)
Extension: WEIGHTED-MAJORITY algorithm (weighted vote)
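
A compact sketch of the Halving algorithm on an explicit finite hypothesis space (conjunctions over 3 boolean variables; this choice of H and target is only illustrative):

```python
from itertools import product

def make_conjunctions(n):
    """Each variable is required true (1), required false (0), or ignored (None)."""
    hs = []
    for spec in product((0, 1, None), repeat=n):
        hs.append(lambda x, spec=spec: all(s is None or x[i] == s for i, s in enumerate(spec)))
    return hs

def halving_run(H, stream, target):
    vs = list(H)                                   # current version space
    mistakes = 0
    for x in stream:
        votes = sum(h(x) for h in vs)
        prediction = votes * 2 >= len(vs)          # majority vote (ties -> positive)
        label = target(x)
        if prediction != label:
            mistakes += 1                          # a mistake means the majority is eliminated
        vs = [h for h in vs if h(x) == label]      # keep only consistent hypotheses
    return mistakes, len(vs)

H = make_conjunctions(3)                           # |H| = 27, so at most floor(log2 27) = 4 mistakes
target = lambda x: x[0] == 1 and x[2] == 0         # a concept contained in H
stream = list(product((0, 1), repeat=3))
print(halving_run(H, stream, target))
```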

Page 31: Optimal Mistake Bounds

Question: what is the optimal mistake bound for an arbitrary concept class C, i.e. the lowest worst-case mistake bound over all possible learning algorithms?
Let H = C; for algorithm A:
M_A(c) := maximum number of mistakes made by A before c is learned exactly (over all possible training sequences)
M_A(C) := max_{c ∈ C} M_A(c)
For example: M_{Find-S}(C) = n + 1,  M_{Halving}(C) ≤ log₂|C|
Opt(C) := min over all learning algorithms A of M_A(C)
Littlestone (1987): VC(C) ≤ Opt(C) ≤ M_{Halving}(C) ≤ log₂|C|

Page 32: WEIGHTED-MAJORITY Algorithm

Generalisation of the Halving algorithm:
    weighted vote among a pool of prediction algorithms
    learns by altering the weight associated with each prediction algorithm
Advantage: accommodates inconsistent training data
Note: β = 0 => Halving algorithm
Theorem: relative mistake bound for WEIGHTED-MAJORITY. Let T be any sequence of training examples, let A be any set of n prediction algorithms, and let k be the minimum number of mistakes made by any algorithm in A for the training sequence T. Then the number of mistakes over T made by the WEIGHTED-MAJORITY algorithm using β = 1/2 is at most 2.4 (k + log₂ n)

Page 33: WEIGHTED-MAJORITY Algorithm 2

a_i denotes the i-th prediction algorithm in the pool A of algorithms; w_i denotes the weight associated with a_i

For each i initialise w_i ← 1
For each training example ⟨x, c(x)⟩
    Initialise q_0 and q_1 to 0
    For each prediction algorithm a_i
        If a_i(x) = 0 then q_0 ← q_0 + w_i
        If a_i(x) = 1 then q_1 ← q_1 + w_i
    If q_1 > q_0 then predict c(x) = 1
    If q_0 > q_1 then predict c(x) = 0
    If q_1 = q_0 then predict 0 or 1 at random for c(x)
    For each prediction algorithm a_i in A do
        If a_i(x) ≠ c(x) then w_i ← β w_i
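
A direct Python transcription of this pseudocode; the pool of prediction algorithms is purely illustrative (one of them happens to equal the target, so k = 0 and the theorem bounds the mistakes by 2.4 log₂ 3 ≈ 3.8):

```python
import random

random.seed(1)

def weighted_majority(predictors, stream, target, beta=0.5):
    weights = [1.0] * len(predictors)              # for each i, w_i <- 1
    mistakes = 0
    for x in stream:
        q0 = sum(w for p, w in zip(predictors, weights) if p(x) == 0)
        q1 = sum(w for p, w in zip(predictors, weights) if p(x) == 1)
        prediction = 1 if q1 > q0 else 0 if q0 > q1 else random.randint(0, 1)
        label = target(x)
        if prediction != label:
            mistakes += 1
        # Down-weight every predictor that misclassified this example.
        weights = [w * beta if p(x) != label else w
                   for p, w in zip(predictors, weights)]
    return mistakes, weights

predictors = [lambda x: x[0], lambda x: 1 - x[1], lambda x: x[2]]   # illustrative pool
target = lambda x: x[2]
stream = [tuple(random.randint(0, 1) for _ in range(3)) for _ in range(50)]
print(weighted_majority(predictors, stream, target))
```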

Page 34: Content

Introduction
Probably Learning an Approximately Correct Hypothesis
Sample Complexity for Finite Hypothesis Spaces
Sample Complexity for the Infinite Hypothesis Space
The Mistake Bound Model of Learning
➔ Summary

Page 35: Summary

PAC learning versus exact learning
Consistent and inconsistent hypotheses, agnostic learning
VC dimension: complexity of the hypothesis space measured as the size of the largest subset of instances that can be shattered
Bounds on the number of training examples sufficient for successful learning under the PAC model
Mistake bound model: analyses the number of training examples a learner will misclassify before it exactly learns the target concept
WEIGHTED-MAJORITY algorithm: combines the weighted votes of multiple prediction algorithms to classify new instances