
Page 1:

Introduction to Machine Learning

CMU-10701 Stochastic Convergence and Tail Bounds

Barnabás Póczos

Page 2:

Basic Estimation Theory

Page 3:

Rolling a Die:

Estimation of the parameters θ1, θ2, …, θ6 (the probabilities of the six faces)

[Figure: histograms of the MLE estimates after 12, 24, 60, and 120 rolls]

Does the MLE converge to the right value?

How fast does it converge?

Page 4:

Rolling a Die:

Calculating the Empirical Average

Does the empirical average converge to the true mean?

How fast does it converge?

Page 5:

Rolling a Die:

Calculating the Empirical Average

[Figure: 5 sample traces of the empirical average as a function of the number of rolls]

How fast do they converge? A simulation sketch follows.
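A minimal Python sketch (assuming NumPy and Matplotlib, which are not part of the original slides) that reproduces this experiment numerically:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n_rolls, n_traces = 1000, 5

for _ in range(n_traces):
    rolls = rng.integers(1, 7, size=n_rolls)              # fair die: faces 1..6
    running_mean = np.cumsum(rolls) / np.arange(1, n_rolls + 1)
    plt.plot(running_mean)

plt.axhline(3.5, linestyle="--", color="k", label="true mean = 3.5")
plt.xlabel("number of rolls")
plt.ylabel("empirical average")
plt.legend()
plt.show()

Each trace wanders early on but settles near the true mean 3.5, which is exactly what the law of large numbers below formalizes.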

Page 6:

Key Questions

I want to know the coin parameter θ ∈ [0,1] within ε = 0.1 error, with probability at least 1-δ = 0.95. How many flips do I need?

• Do empirical averages converge?

• Does the MLE converge in the die-rolling problem?

• What do we mean by convergence?

• What is the rate of convergence?

Page 7:

Outline

Theory:

• Stochastic Convergences:

– Weak convergence = Convergence in distribution

– Convergence in probability

– Strong convergence (almost surely)

– Convergence in Lp norm

• Limit theorems:

– Law of large numbers

– Central limit theorem

• Tail bounds:

– Markov, Chebyshev

Page 8:

Stochastic Convergence: Definitions and Properties

Page 9:

Convergence of Vectors

Page 10:

Convergence in Distribution = Weak Convergence = Convergence in Law

Let {Z, Z1, Z2, …} be a sequence of random variables.

Notation: $Z_n \xrightarrow{d} Z$

Definition: $Z_n \xrightarrow{d} Z$ if and only if $\lim_{n\to\infty} F_{Z_n}(z) = F_Z(z)$ at every point $z$ where the distribution function $F_Z$ is continuous.

This is the "weakest" type of convergence.

Page 11:

Convergence in Distribution = Weak Convergence = Convergence in Law

Only the distribution functions converge!

(NOT the values of the random variables)

[Figure: a step-function CDF jumping from 0 to 1 at the point a]

Page 12:

Convergence in Distribution = Weak Convergence = Convergence in Law

Continuity is important!

Example: Let $Z_n$ be uniform on $[0, 1/n]$. The limit random variable is the constant 0:

$Z_n \xrightarrow{d} Z$, where $P(Z = 0) = 1$.

Proof: $F_{Z_n}(z) = 0$ for $z < 0$, $F_{Z_n}(z) = nz$ for $0 \le z \le 1/n$, and $F_{Z_n}(z) = 1$ for $z > 1/n$. Hence $F_{Z_n}(z) \to F_Z(z)$ at every $z \neq 0$. At $z = 0$ we have $F_{Z_n}(0) = 0$ for all $n$ but $F_Z(0) = 1$; this is allowed, because $z = 0$ is a discontinuity point of $F_Z$, and the definition only requires convergence at continuity points.

In this example the limit Z is discrete, not random (the constant 0), although each Zn is a continuous random variable.

[Figure: the densities and CDFs of $Z_n \sim$ Uniform$[0, 1/n]$ collapsing toward the point mass at 0]

Page 13:

Convergence in Distribution = Weak Convergence = Convergence in Law

Properties

Scheffé's theorem: convergence of the probability density functions ⇒ convergence in distribution.

Example: the Central Limit Theorem (Page 45).

Zn and Z can still be independent even if their distributions are the same!

Page 14:

Convergence in Probability

Notation: $Z_n \xrightarrow{P} Z$

Definition: $Z_n \xrightarrow{P} Z$ if and only if $\lim_{n\to\infty} P(|Z_n - Z| > \epsilon) = 0$ for every $\epsilon > 0$.

This indeed measures how far the values of $Z_n(\omega)$ and $Z(\omega)$ are from each other.
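As an illustration, a small simulation (a sketch assuming NumPy; the Uniform[0, 1/n] sequence is the example from Page 12) that estimates $P(|Z_n - Z| > \epsilon)$ and watches it shrink:

import numpy as np

rng = np.random.default_rng(0)
eps, n_samples = 0.05, 100_000

# Z_n ~ Uniform[0, 1/n] converges in probability to Z = 0:
# P(|Z_n - 0| > eps) = max(0, 1 - n*eps) -> 0 as n grows.
for n in [1, 5, 10, 20, 50]:
    z_n = rng.uniform(0, 1 / n, size=n_samples)
    p_hat = np.mean(np.abs(z_n) > eps)
    print(f"n={n:3d}  P(|Z_n| > {eps}) ~ {p_hat:.3f}")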

Page 15:

Almost Sure Convergence

Notation: $Z_n \xrightarrow{a.s.} Z$

Definition: $Z_n \xrightarrow{a.s.} Z$ if and only if $P(\{\omega : \lim_{n\to\infty} Z_n(\omega) = Z(\omega)\}) = 1$.

Page 16:

Convergence in p-th Mean (Lp norm)

Notation: $Z_n \xrightarrow{L_p} Z$

Definition: for $p \ge 1$, $Z_n \xrightarrow{L_p} Z$ if and only if $\lim_{n\to\infty} E\left[|Z_n - Z|^p\right] = 0$.

Properties: convergence in $L_q$ implies convergence in $L_p$ for $q \ge p \ge 1$, and convergence in $L_p$ implies convergence in probability.

Page 17:

Counter Examples

Page 18:

Further Readings on Stochastic Convergence

• http://en.wikipedia.org/wiki/Convergence_of_random_variables

• Patrick Billingsley: Probability and Measure

• Patrick Billingsley: Convergence of Probability Measures

Page 19:

Finite Sample Tail Bounds

Useful tools!

Page 20:

Markov's Inequality

If X is a nonnegative random variable and a > 0, then

$P(X \ge a) \le \frac{E[X]}{a}.$

Proof: decompose the expectation:

$E[X] = E[X \mathbf{1}\{X \ge a\}] + E[X \mathbf{1}\{X < a\}] \ge E[a \mathbf{1}\{X \ge a\}] = a\, P(X \ge a).$

Corollary: Chebyshev's inequality.
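A quick Monte Carlo sanity check of Markov's inequality (a sketch assuming NumPy; the exponential distribution is just an arbitrary nonnegative example, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1_000_000)  # nonnegative, E[X] = 2

for a in [1.0, 2.0, 4.0, 8.0]:
    empirical = np.mean(x >= a)
    markov = x.mean() / a                       # Markov bound E[X]/a
    print(f"a={a}: P(X>=a) ~ {empirical:.4f}  <=  bound {markov:.4f}")

The bound holds but is loose; the tail bounds later in the lecture tighten it by using more information about X.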

Page 21:

Chebyshev's Inequality

If X is any random variable with finite variance and a > 0, then

$P(|X - E[X]| \ge a) \le \frac{Var(X)}{a^2}.$

Proof: apply Markov's inequality to the nonnegative random variable $(X - E[X])^2$ with threshold $a^2$:

$P(|X - E[X]| \ge a) = P\left((X - E[X])^2 \ge a^2\right) \le \frac{E[(X - E[X])^2]}{a^2}.$

Here Var(X) is the variance of X, defined as $Var(X) = E\left[(X - E[X])^2\right]$.

Page 22:

Generalizations of Chebyshev's Inequality

Chebyshev:

Asymmetric two-sided case (X has an asymmetric distribution)

Symmetric two-sided case (X has a symmetric distribution)

This is equivalent to:

There are lots of other generalizations, for example to multivariate X.

Page 23:

Higher Moments?

Markov: $P(X \ge a) \le \frac{E[X]}{a}$

Chebyshev: $P(|X - E[X]| \ge a) \le \frac{Var(X)}{a^2}$

Higher moments: $P(|X - E[X]| \ge a) \le \frac{E[|X - E[X]|^n]}{a^n}$, where n ≥ 1.

Other functions instead of polynomials?

Exp function: for any $t > 0$, $P(X \ge a) = P(e^{tX} \ge e^{ta}) \le \frac{E[e^{tX}]}{e^{ta}}.$

Proof: Markov's inequality applied to the nonnegative random variable $e^{tX}$, using that $x \mapsto e^{tx}$ is monotone increasing.

Page 24:

Law of Large Numbers

Page 25:

Do Empirical Averages Converge?

Chebyshev's inequality is good enough to study the question:

Do the empirical averages converge to the true mean?

Answer: Yes, they do. (Law of large numbers)

Page 26:

Law of Large Numbers

Let $X_1, X_2, \ldots$ be i.i.d. random variables with mean $E[X_i] = \mu$, and let $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$ be the empirical average.

Strong Law of Large Numbers: $\bar{X}_n \xrightarrow{a.s.} \mu$.

Weak Law of Large Numbers: $\bar{X}_n \xrightarrow{P} \mu$.

Page 27:

Weak Law of Large Numbers

Proof I:

Assume finite variance $\sigma^2 = Var(X_i)$. (Not very important: the theorem holds without it, but the proof is simpler with it.)

Then $E[\bar{X}_n] = \mu$ and $Var(\bar{X}_n) = \sigma^2/n$, so by Chebyshev's inequality, for any fixed $\epsilon > 0$,

$P(|\bar{X}_n - \mu| < \epsilon) \ge 1 - \frac{Var(\bar{X}_n)}{\epsilon^2} = 1 - \frac{\sigma^2}{n\epsilon^2}.$

Therefore, as n approaches infinity, this expression approaches 1, i.e. $\bar{X}_n \xrightarrow{P} \mu$.

Page 28:

Fourier Transform and Characteristic Function

Page 29:

Fourier Transform

Fourier transform: $\hat{f}(\omega) = \int_{-\infty}^{\infty} f(x)\, e^{-2\pi i \omega x}\, dx$

Inverse Fourier transform: $f(x) = \int_{-\infty}^{\infty} \hat{f}(\omega)\, e^{2\pi i \omega x}\, d\omega$

Other conventions: where to put the 2π?

• $\hat{f}(\omega) = \int f(x)\, e^{-i\omega x}\, dx$ with the factor $\frac{1}{2\pi}$ only in the inverse: not preferred, since it is not a unitary transform (it doesn't preserve the inner product).

• $\hat{f}(\omega) = \frac{1}{\sqrt{2\pi}} \int f(x)\, e^{-i\omega x}\, dx$ with $\frac{1}{\sqrt{2\pi}}$ in the inverse as well: unitary transform.

• The 2π in the exponent (as above): unitary transform.

Page 30:

Fourier Transform

The inverse is really an inverse:

$f(x) = \int_{-\infty}^{\infty} \left[\int_{-\infty}^{\infty} f(y)\, e^{-2\pi i \omega y}\, dy\right] e^{2\pi i \omega x}\, d\omega.$

Properties: linearity, the transform of a convolution is the product of the transforms, and lots of other important ones…

The Fourier transform will be used to define the characteristic function and to represent distributions in an alternative way.

Page 31:

Characteristic Function

How can we describe a random variable?

• cumulative distribution function (cdf)

• probability density function (pdf)

The characteristic function provides an alternative way to describe a random variable.

Definition: the Fourier transform of the density:

$\varphi_X(t) = E\left[e^{itX}\right] = \int_{-\infty}^{\infty} e^{itx}\, f_X(x)\, dx.$

Page 32:

Characteristic Function

Properties:

• It always exists, since $|e^{itX}| \le 1$. For example, the Cauchy distribution doesn't have a mean but still has a characteristic function.

• Continuous on the entire space, even if X is not continuous.

• Bounded ($|\varphi_X(t)| \le 1$), even if X is not bounded.

• Lévy's continuity theorem: $Z_n \xrightarrow{d} Z$ if and only if $\varphi_{Z_n}(t) \to \varphi_Z(t)$ pointwise (with the limit continuous at 0).

• Characteristic function of the constant a: $\varphi(t) = E[e^{ita}] = e^{ita}$.
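A small sketch (assuming NumPy, not part of the slides) that estimates the characteristic function from samples and compares it with the closed form $\varphi(t) = e^{-t^2/2}$ of the standard normal:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=100_000)    # X ~ N(0, 1)

# Empirical characteristic function: (1/n) * sum_j exp(i t x_j)
for t in [0.0, 0.5, 1.0, 2.0]:
    phi_hat = np.mean(np.exp(1j * t * x))
    phi_exact = np.exp(-t**2 / 2)                   # N(0,1): phi(t) = e^{-t^2/2}
    print(f"t={t}: empirical {phi_hat.real:+.4f}{phi_hat.imag:+.4f}i, exact {phi_exact:.4f}")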

Page 33:

Weak Law of Large Numbers

Proof II:

Taylor's theorem for complex functions: the characteristic function satisfies $\varphi_X(t) = E[e^{itX}] = 1 + it\mu + o(t)$ as $t \to 0$.

Properties of characteristic functions: $\varphi_{\bar{X}_n}(t) = \left[\varphi_X\!\left(\frac{t}{n}\right)\right]^n = \left[1 + \frac{it\mu}{n} + o\!\left(\frac{t}{n}\right)\right]^n \to e^{it\mu}.$

Lévy's continuity theorem ⇒ the limit is the constant distribution with mean μ, so $\bar{X}_n \xrightarrow{d} \mu$, and hence (since the limit is a constant) $\bar{X}_n \xrightarrow{P} \mu$.

Page 34:

"Convergence Rate" for LLN

Markov: doesn't give a rate (it doesn't use the variance).

Chebyshev: $P(|\bar{X}_n - \mu| \ge \epsilon) \le \frac{\sigma^2}{n\epsilon^2}$, i.e., with probability 1-δ,

$|\bar{X}_n - \mu| \le \frac{\sigma}{\sqrt{n\delta}}.$

Can we get a smaller, logarithmic error in δ???

Page 35:

Further Readings on LLN, Characteristic Functions, etc.

• http://en.wikipedia.org/wiki/Levy_continuity_theorem

• http://en.wikipedia.org/wiki/Law_of_large_numbers

• http://en.wikipedia.org/wiki/Characteristic_function_(probability_theory)

• http://en.wikipedia.org/wiki/Fourier_transform

Page 36:

More Tail Bounds

More useful tools!

Page 37:

Hoeffding's Inequality (1963)

Let $X_1, \ldots, X_n$ be independent random variables with $X_i \in [a_i, b_i]$ almost surely, and let $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$. Then for any t > 0,

$P\left(|\bar{X} - E[\bar{X}]| \ge t\right) \le 2\exp\left(-\frac{2n^2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right).$

It only contains the range of the variables, but not the variances.

Page 38:

"Convergence Rate" for LLN from Hoeffding

Hoeffding: if each $X_i \in [a, b]$ with range $c = b - a$, then setting the tail probability to δ and solving for t gives, with probability at least 1-δ,

$|\bar{X} - E[\bar{X}]| \le c\,\sqrt{\frac{\ln(2/\delta)}{2n}}.$

The dependence on δ is logarithmic, as hoped for on Page 34.
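This also answers the coin question from Page 6. A minimal sketch (standard library only; the helper name hoeffding_n is ours, not from the slides) that solves the bound for n:

import math

def hoeffding_n(eps: float, delta: float, c: float = 1.0) -> int:
    """Smallest n with 2*exp(-2*n*eps^2/c^2) <= delta (range-c variables)."""
    return math.ceil(c**2 * math.log(2 / delta) / (2 * eps**2))

# Coin question from Page 6: theta within eps = 0.1, confidence 1 - delta = 0.95.
print(hoeffding_n(eps=0.1, delta=0.05))   # -> 185 flips suffice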

Page 39:

Proof of Hoeffding's Inequality

Apply the exponential Markov (Chernoff) bound from Page 23 to $\sum_i (X_i - E[X_i])$, bound each moment generating function with Hoeffding's lemma, $E[e^{s(X_i - E[X_i])}] \le e^{s^2(b_i - a_i)^2/8}$, and optimize over s > 0.

A few minutes of calculations.

Page 40:

Bernstein's Inequality (1946)

Let $X_1, \ldots, X_n$ be independent zero-mean random variables with $|X_i| \le M$ almost surely. Then for any t > 0,

$P\left(\sum_{i=1}^n X_i > t\right) \le \exp\left(-\frac{t^2/2}{\sum_{i=1}^n E[X_i^2] + Mt/3}\right).$

It contains the variances, too, and can give tighter bounds than Hoeffding.
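To see when the variance term helps, here is a sketch (standard library only; the rare-click numbers are illustrative, not from the slides) comparing the two bounds for bounded, low-variance variables:

import math

def hoeffding_tail(n, t, c):
    """Hoeffding: P(sum of n centered range-c variables > t) <= exp(-2 t^2 / (n c^2))."""
    return math.exp(-2 * t**2 / (n * c**2))

def bernstein_tail(n, t, var, M):
    """Bernstein: P(sum > t) for zero-mean |X_i| <= M with Var(X_i) = var."""
    return math.exp(-(t**2 / 2) / (n * var + M * t / 3))

# Rare-click setting: X_i = click - p with p = 0.01, so var = p(1-p) ~ 0.0099.
n, t, p = 10_000, 100, 0.01
print("Hoeffding:", hoeffding_tail(n, t, c=1.0))            # ~ e^{-2}
print("Bernstein:", bernstein_tail(n, t, var=p * (1 - p), M=1.0))  # astronomically smaller

Hoeffding only sees the range [0, 1], while Bernstein exploits the tiny variance, giving a far smaller tail probability here.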

Page 41:

Bennett's Inequality (1962)

Let $X_1, \ldots, X_n$ be independent zero-mean random variables with $|X_i| \le M$ almost surely, and let $\sigma^2 = \frac{1}{n}\sum_{i=1}^n E[X_i^2]$. Then for any t > 0,

$P\left(\sum_{i=1}^n X_i > t\right) \le \exp\left(-\frac{n\sigma^2}{M^2}\, h\!\left(\frac{Mt}{n\sigma^2}\right)\right), \quad \text{where } h(u) = (1+u)\log(1+u) - u.$

Bennett's inequality ⇒ Bernstein's inequality.

Proof: use the bound $h(u) \ge \frac{u^2/2}{1 + u/3}$ and substitute.

Page 42:

McDiarmid's Bounded Difference Inequality

Let $X_1, \ldots, X_n$ be independent, and let f satisfy the bounded difference property: changing the i-th coordinate alone changes f by at most $c_i$,

$\sup_{x_1,\ldots,x_n,\,x_i'} \left|f(x_1,\ldots,x_i,\ldots,x_n) - f(x_1,\ldots,x_i',\ldots,x_n)\right| \le c_i.$

Then for any t > 0,

$P\left(\left|f(X_1,\ldots,X_n) - E[f(X_1,\ldots,X_n)]\right| \ge t\right) \le 2\exp\left(-\frac{2t^2}{\sum_{i=1}^n c_i^2}\right).$

It follows that Hoeffding's inequality is the special case $f(x_1,\ldots,x_n) = \frac{1}{n}\sum_i x_i$ with $c_i = (b_i - a_i)/n$.

Page 43:

Further Readings on Tail Bounds

• http://en.wikipedia.org/wiki/Hoeffding's_inequality

• http://en.wikipedia.org/wiki/Doob_martingale (McDiarmid)

• http://en.wikipedia.org/wiki/Bennett%27s_inequality

• http://en.wikipedia.org/wiki/Markov%27s_inequality

• http://en.wikipedia.org/wiki/Chebyshev%27s_inequality

• http://en.wikipedia.org/wiki/Bernstein_inequalities_(probability_theory)

Page 44:

Limit Distribution?

Page 45:

Central Limit Theorem

Lindeberg–Lévy CLT: if $X_1, X_2, \ldots$ are i.i.d. with mean μ and finite variance $\sigma^2$, then

$\sqrt{n}\,(\bar{X}_n - \mu) \xrightarrow{d} N(0, \sigma^2).$

Lyapunov CLT: the $X_i$ need only be independent (not identically distributed), with means $\mu_i$, variances $\sigma_i^2$, and $s_n^2 = \sum_{i=1}^n \sigma_i^2$, + some other conditions (Lyapunov's moment condition); then

$\frac{1}{s_n}\sum_{i=1}^n (X_i - \mu_i) \xrightarrow{d} N(0, 1).$

Generalizations: multiple dimensions, time processes.

Page 46:

Central Limit Theorem in Practice

[Figure: histograms of empirical averages, unscaled and scaled, approaching the Gaussian shape]
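A sketch (assuming NumPy and Matplotlib, not part of the slides) that reproduces the scaled version of this picture with die rolls:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n, trials = 100, 20_000

# Die rolls: mean mu = 3.5, variance sigma^2 = 35/12.
rolls = rng.integers(1, 7, size=(trials, n))
mu, sigma = 3.5, np.sqrt(35 / 12)

# Standardized averages: sqrt(n) * (empirical average - mu) / sigma -> N(0, 1).
z = np.sqrt(n) * (rolls.mean(axis=1) - mu) / sigma

plt.hist(z, bins=60, density=True, alpha=0.6, label="standardized averages")
x = np.linspace(-4, 4, 200)
plt.plot(x, np.exp(-x**2 / 2) / np.sqrt(2 * np.pi), "k--", label="N(0,1) pdf")
plt.legend()
plt.show()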

Page 47:

Proof of CLT

Let $Y_i = (X_i - \mu)/\sigma$. From the Taylor series of the characteristic function around 0:

$\varphi_{Y_i}(t) = 1 - \frac{t^2}{2} + o(t^2).$

Properties of characteristic functions: the standardized sum $Z_n = \frac{1}{\sqrt{n}}\sum_{i=1}^n Y_i$ has

$\varphi_{Z_n}(t) = \left[\varphi_{Y_1}\!\left(\frac{t}{\sqrt{n}}\right)\right]^n = \left[1 - \frac{t^2}{2n} + o\!\left(\frac{t^2}{n}\right)\right]^n \to e^{-t^2/2},$

which is the characteristic function of the Gauss distribution N(0, 1).

Lévy's continuity theorem + uniqueness ⇒ CLT.

Page 48:

How Fast Do We Converge to the Gauss Distribution?

CLT: $\sqrt{n}\,(\bar{X}_n - \mu)/\sigma \xrightarrow{d} N(0, 1)$. It doesn't tell us anything about the convergence rate.

Berry–Esseen Theorem

Independently discovered by A. C. Berry (in 1941) and C.-G. Esseen (1942): if in addition $\rho = E[|X_i - \mu|^3] < \infty$, then for all n,

$\sup_x \left|F_n(x) - \Phi(x)\right| \le \frac{C\rho}{\sigma^3\sqrt{n}},$

where $F_n$ is the cdf of the standardized average, $\Phi$ is the standard normal cdf, and C is a universal constant.

Page 49:

Did We Answer the Questions We Asked?

• Do empirical averages converge?

• What do we mean by convergence?

• What is the rate of convergence?

• What is the limit distribution of "standardized" averages?

Next time we will continue with these questions:

How good are the ML algorithms on unknown test sets?

How many training samples do we need to achieve small error?

What is the smallest possible error we can achieve?

Page 50:

Further Readings on CLT

• http://en.wikipedia.org/wiki/Central_limit_theorem

• http://en.wikipedia.org/wiki/Law_of_the_iterated_logarithm

Page 51:

Tail Bounds in Practice

Page 52:

A/B Testing

• Two possible webpage layouts

• Which layout is better?

Experiment:

• Some users see design A

• The others see design B

How many trials do we need to decide which page attracts more clicks?

Page 53:

A/B Testing

Let us simplify this question a bit:

Assume that in group A

p(click|A) = 0.10 and p(noclick|A) = 0.90.

Assume that in group B

p(click|B) = 0.11 and p(noclick|B) = 0.89.

Assume also that we know these probabilities in group A, but we don't yet know them in group B.

We want to estimate p(click|B) with less than 0.01 error.

Page 54:

Chebyshev Inequality

• In group B the click probability is θ = 0.11 (we don't know this yet).

• We want error ε = 0.01 with failure probability δ = 5%.

Chebyshev: $P(|\hat{p} - p| \ge \epsilon) \le \frac{\sigma^2}{n\epsilon^2} \le \delta \quad \Rightarrow \quad n \ge \frac{\sigma^2}{\delta\epsilon^2}.$

• If we have no prior knowledge, we can only bound the variance by σ² = 0.25 (the uniform Bernoulli, p = 1/2, has the largest variance p(1-p) = 0.25). This requires n ≥ 0.25/(0.05 · 0.0001) = 50,000 users.

• If we know that the click probability is < 0.15, then we can bound σ² by 0.15 · 0.85 = 0.1275. This requires at least 25,500 users.
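The same arithmetic as a tiny script (standard library only; chebyshev_n is a hypothetical helper name, not from the slides):

import math

def chebyshev_n(eps: float, delta: float, var: float) -> int:
    """Smallest n with var / (n * eps^2) <= delta."""
    return math.ceil(var / (delta * eps**2))

eps, delta = 0.01, 0.05
print(chebyshev_n(eps, delta, var=0.25))           # no prior knowledge -> 50,000
print(chebyshev_n(eps, delta, var=0.15 * 0.85))    # click prob < 0.15  -> 25,500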

Page 55:

Hoeffding's Bound

• The random variable has bounded range [0, 1] (click or no click), hence c = 1.

• Hoeffding: $P(|\hat{p} - p| \ge \epsilon) \le 2\exp(-2n\epsilon^2) \le \delta.$

• Solve Hoeffding's inequality for n:

$n \ge \frac{\ln(2/\delta)}{2\epsilon^2} = \frac{\ln(40)}{2 \cdot 0.0001} \approx 18{,}445.$

This is better than Chebyshev.
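A simulation sketch (assuming NumPy, not part of the slides) checking the guarantee empirically; since Hoeffding is conservative, the observed coverage lands far above the promised 95%:

import numpy as np

rng = np.random.default_rng(0)
p_b, eps, n = 0.11, 0.01, 18_445   # true click rate in B, target error, Hoeffding n

# Repeat the experiment many times; count how often |p_hat - p| < eps.
trials = 2_000
clicks = rng.binomial(n, p_b, size=trials)
p_hat = clicks / n
coverage = np.mean(np.abs(p_hat - p_b) < eps)
print(f"empirical coverage: {coverage:.3f}  (guaranteed >= 0.95)")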

Page 56:

What We Have Learned Today

Theory:

• Stochastic Convergences:

– Weak convergence = Convergence in distribution

– Convergence in probability

– Strong convergence (almost surely)

– Convergence in Lp norm

• Limit theorems:

– Law of large numbers

– Central limit theorem

• Tail bounds:

– Markov, Chebyshev

– Hoeffding, Bernstein, Bennett, McDiarmid

Page 57:

Thanks for your attention