
DS-GA 1002 Lecture notes 5 October 19, 2015

Statistics: Learning models from data

Learning models from data that are assumed to be generated probabilistically from a certain unknown distribution is a crucial step in statistical analysis. The model may be of interest in itself, as a description of the data used to build it, or it may be used to perform inference, i.e. to extract conclusions about new data.

1 Parametric models

In parametric modeling we make the assumption that the data are sampled from a known family of distributions (Gaussian, Bernoulli, exponential, . . . ) with a small number of unknown parameters that must be learnt from the data. This assumption may be motivated by theoretical insights such as the Central Limit Theorem, which explains why additive disturbances are often well modeled as Gaussian, or by empirical evidence.

1.1 Frequentist parameter selection

Frequentist statistical analysis is based on treating parameters as unknown deterministic quantities. These parameters are fit to produce a model that is well adapted to the available data. In order to quantify the extent to which a model explains the data, we define the likelihood function. This function is the pmf or pdf of the distribution that generates the data, interpreted as a function of the unknown parameters.

Definition 1.1 (Likelihood function). Given a realization x of an iid vector of random variables X of dimension n with a distribution that depends on a vector of parameters θ ∈ R^m, the likelihood function is defined as

L_x(θ) := ∏_{i=1}^n p_{X_i}(x_i, θ)    (1)

if X is discrete and

L_x(θ) := ∏_{i=1}^n f_{X_i}(x_i, θ)    (2)

if X is continuous. The log-likelihood function is equal to the logarithm of the likelihood function, log L_x(θ).
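For concreteness, the following Python sketch evaluates the log-likelihood of an iid sample under a Gaussian model; the observations and the candidate parameter values are hypothetical, chosen only for illustration.

```python
import numpy as np
from scipy.stats import norm

def log_likelihood(x, mu, sigma):
    """Log-likelihood of iid data x under a Gaussian model: sum_i log f(x_i; mu, sigma)."""
    return np.sum(norm.logpdf(x, loc=mu, scale=sigma))

x = np.array([1.2, 0.7, 1.9, 1.4])           # hypothetical observations
print(log_likelihood(x, mu=1.0, sigma=0.5))  # a well-matched candidate model
print(log_likelihood(x, mu=5.0, sigma=0.5))  # a poorly matched model gives a much lower value
```

Comparing the two printed values shows how the log-likelihood ranks candidate parameter settings.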


The likelihood function represents the probability or the probability density of the parametric distribution at the observed data, i.e. it quantifies how likely the data are according to the model. Therefore, higher likelihood values indicate that the model is better adapted to the samples. The maximum likelihood (ML) estimator is a hugely popular parameter estimator based on maximizing the likelihood (or equivalently the log-likelihood).

Definition 1.2 (Maximum-likelihood estimator). The maximum-likelihood (ML) estimator for the vector of parameters θ ∈ R^m is

θ̂_ML(x) := arg max_θ L_x(θ)    (3)
          = arg max_θ log L_x(θ).    (4)

The maximum of the likelihood function and that of the log-likelihood function are at the same location because the logarithm is monotone.
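When no closed-form maximizer is available, the ML estimate can be approximated by numerically minimizing the negative log-likelihood. The sketch below does this for a Gaussian model using scipy.optimize.minimize; the synthetic data, the starting point, and the log-sigma reparametrization are assumptions made for the example, not part of the original notes.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def neg_log_likelihood(params, x):
    mu, log_sigma = params              # optimize log(sigma) so that sigma stays positive
    sigma = np.exp(log_sigma)
    return -np.sum(norm.logpdf(x, loc=mu, scale=sigma))

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)   # synthetic data for illustration

result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]), args=(x,))
mu_ml, sigma_ml = result.x[0], np.exp(result.x[1])
print(mu_ml, sigma_ml)   # should be close to the closed-form estimates in Example 1.4
```

Optimizing log σ instead of σ keeps the scale parameter positive without imposing explicit constraints.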

Under certain conditions, one can show that the maximum-likelihood estimator is consistent: it converges in probability to the true parameter as the number of data points increases. One can even show that its distribution converges to that of a Gaussian random variable (or vector), just like the distribution of the sample mean. These results are beyond the scope of the course. Bear in mind, however, that they obviously only hold if the data are indeed generated by the type of distribution that we are considering.

Example 1.3 (ML estimator of the parameter of a Bernoulli distribution). Let x_1, . . . , x_n be data that we wish to model as iid samples from a Bernoulli distribution with parameter p. The likelihood function is equal to

L_x(p) = ∏_{i=1}^n p_{X_i}(x_i, p)    (5)
       = ∏_{i=1}^n (1_{x_i=1} p + 1_{x_i=0} (1 − p))    (6)
       = p^{n_1} (1 − p)^{n_0}    (7)

and the log-likelihood function to

log L_x(p) = ∑_{i=1}^n log p_{X_i}(x_i, p)    (8)
           = n_1 log p + n_0 log (1 − p),    (9)

where n_1 is the number of samples equal to one and n_0 the number of samples equal to zero. The ML estimator of the parameter p is

p̂_ML = arg max_p log L_x(p)    (10)
     = arg max_p n_1 log p + n_0 log (1 − p).    (11)


We compute the first and second derivatives of the log-likelihood function,

d log L_x(p) / dp = n_1 / p − n_0 / (1 − p),    (12)
d² log L_x(p) / dp² = −n_1 / p² − n_0 / (1 − p)² < 0.    (13)

The function is concave, as the second derivative is negative. The maximum is consequently at the point where the first derivative equals zero, namely

p̂_ML = n_1 / (n_0 + n_1),    (14)

the fraction of samples that are equal to one, which is a very reasonable estimate.
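A minimal sketch of this estimator applied to synthetic coin flips; the true parameter 0.3 and the sample size are made up for illustration.

```python
import numpy as np

def bernoulli_ml(x):
    """ML estimate of the Bernoulli parameter from a 0/1 array: n1 / (n0 + n1)."""
    x = np.asarray(x)
    return x.sum() / x.size   # fraction of ones

rng = np.random.default_rng(0)
flips = rng.binomial(1, p=0.3, size=500)   # synthetic coin flips with true p = 0.3
print(bernoulli_ml(flips))                 # typically close to 0.3
```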

Example 1.4 (ML estimator of the parameters of a Gaussian distribution). Let x_1, . . . , x_n be data that we wish to model as iid samples from a Gaussian distribution with mean µ and standard deviation σ. The likelihood function is equal to

L_x(µ, σ) = ∏_{i=1}^n f_{X_i}(x_i)    (15)
          = ∏_{i=1}^n (1 / (√(2π) σ)) e^{−(x_i − µ)² / (2σ²)}    (16)

and the log-likelihood function to

log L_x(µ, σ) = ∑_{i=1}^n log f_{X_i}(x_i)    (17)
              = −(n log(2π)) / 2 − n log σ − ∑_{i=1}^n (x_i − µ)² / (2σ²).    (18)

The ML estimator of the parameters µ and σ is

{µ̂_ML, σ̂_ML} = arg max_{µ,σ} log L_x(µ, σ)    (19)
              = arg max_{µ,σ} −n log σ − ∑_{i=1}^n (x_i − µ)² / (2σ²).    (20)


We compute the partial derivatives of the log-likelihood function,

∂ log L_x(µ, σ) / ∂µ = ∑_{i=1}^n (x_i − µ) / σ²,    (21)
∂ log L_x(µ, σ) / ∂σ = −n / σ + ∑_{i=1}^n (x_i − µ)² / σ³.    (22)

The function we are trying to maximize is strictly concave in {µ, σ}. To prove this, we would have to show that the Hessian of the function is negative definite. We omit the calculations that show that this is the case. Setting the partial derivatives to zero we obtain

µ̂_ML = (1/n) ∑_{i=1}^n x_i,    (23)
σ̂²_ML = (1/n) ∑_{i=1}^n (x_i − µ̂_ML)².    (24)

The estimator for the mean is just the sample mean. The estimator for the variance is a rescaled sample variance.
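The closed-form estimators (23) and (24) can be computed directly; the sketch below uses synthetic data with hypothetical parameters.

```python
import numpy as np

def gaussian_ml(x):
    """Closed-form ML estimates for a Gaussian: sample mean and 1/n-normalized variance."""
    x = np.asarray(x, dtype=float)
    mu_hat = x.mean()
    sigma2_hat = np.mean((x - mu_hat) ** 2)   # note the 1/n normalization, not 1/(n-1)
    return mu_hat, np.sqrt(sigma2_hat)

rng = np.random.default_rng(1)
x = rng.normal(loc=67.0, scale=3.0, size=1000)   # synthetic height-like data in inches
print(gaussian_ml(x))
```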

Figure 1 shows the result of fitting a Gaussian to the height data in Figure 1 of Lecture Notes 4 by applying the ML estimators for the mean and the variance derived in Example 1.4. For a small number of samples the estimate can be quite unstable, but for a large number of samples it provides a good description of the data.

1.2 Bayesian parameter selection

Up to now we have focused on estimating parameters that are modeled as deterministic and fixed. This is the viewpoint of frequentist statistics. Bayesian statistics provide an alternative perspective in which the parameters are considered random. This allows for greater flexibility in both building the model and in quantifying our uncertainty about it, but it also assumes that we have access to an estimate of the distribution of the parameters.

In a frequentist framework we assumed only that the likelihood function of the data was known. Bayesian inference relies on two modeling choices:

1. The prior distribution of the parameters encodes our uncertainty about the model before seeing the data.


Figure 1: Result of fitting a Gaussian distribution to the data shown in Figure 1 of Lecture Notes 4, three times each using 20 and 1000 random samples. (Horizontal axis: height in inches.)

2. The conditional distribution of the data given the parameters specifies how the data are generated. The pmf or pdf that characterizes this distribution is equal to the likelihood function L_x(θ) from Definition 1.1. In a Bayesian framework the likelihood is no longer just a function of the parameters; it has a probabilistic interpretation in its own right.

Once the prior and the likelihood are fixed, we apply Bayes' theorem to obtain the posterior distribution of the parameters given the data. This distribution quantifies our uncertainty about the value of the parameters after processing the data.

Theorem 1.5 (Posterior distribution). The posterior distribution of the parameters given the data equals

p_{Θ|X}(θ|x) = p_Θ(θ) p_{X|Θ}(x|θ) / ∑_u p_Θ(u) p_{X|Θ}(x|u)    (25)

if the data and parameters are discrete,

f_{Θ|X}(θ|x) = f_Θ(θ) f_{X|Θ}(x|θ) / ∫_u f_Θ(u) f_{X|Θ}(x|u) du    (26)

if the data and parameters are continuous,

p_{Θ|X}(θ|x) = p_Θ(θ) f_{X|Θ}(x|θ) / ∑_u p_Θ(u) f_{X|Θ}(x|u)    (27)

if the data are continuous and the parameters discrete, and

f_{Θ|X}(θ|x) = f_Θ(θ) p_{X|Θ}(x|θ) / ∫_u f_Θ(u) p_{X|Θ}(x|u) du    (28)

if the data are discrete and the parameters continuous.

Proof. The expressions follow from a direct application of Bayes' theorem.
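When the normalization integral is inconvenient to compute in closed form, the posterior of a scalar parameter can be approximated numerically on a grid. The following sketch does this for the parameter of a Bernoulli under a uniform prior; the grid resolution and the counts n1, n0 are arbitrary choices for illustration, and Example 1.7 below derives the same posterior in closed form.

```python
import numpy as np

# Hypothetical setting: posterior of a Bernoulli parameter theta with a uniform prior,
# evaluated on a grid (the denominator integral is approximated by a Riemann sum).
theta = np.linspace(1e-4, 1 - 1e-4, 1000)   # grid over the parameter
prior = np.ones_like(theta)                  # uniform prior density on [0, 1]

n1, n0 = 3, 1                                # ones and zeros observed (made-up data)
likelihood = theta**n1 * (1 - theta)**n0     # L_x(theta) from Example 1.3

unnormalized = prior * likelihood
posterior = unnormalized / np.trapz(unnormalized, theta)   # normalize to integrate to 1
print(np.trapz(theta * posterior, theta))    # posterior mean of theta
```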

The posterior distribution of the parameter given the data allows us to compute the probability that the parameter lies in a certain interval. Such intervals are called credible intervals, as opposed to frequentist confidence intervals. Recall that once we have computed a 1 − α confidence interval from the data, it makes no sense to state that it contains the true parameter with probability 1 − α; the realization of the interval and the parameter are both deterministic. In contrast, once we have computed the posterior distribution of a parameter given the data within a Bayesian framework, it is completely correct to state that the true parameter belongs to the fixed 1 − α credible interval with probability 1 − α (if the prior and likelihood are assumed to be correct).


A question that remains is how to produce a point estimate of the parameters from their posterior distribution. A reasonable choice is the mean of the posterior distribution, which corresponds to the conditional expectation of the parameters given the data. This has a strong theoretical justification: the posterior mean minimizes the mean square error with respect to the true value of the parameters over all possible estimators.

Theorem 1.6 (The posterior mean minimizes the MSE). The posterior mean is the minimum mean-squared-error (MMSE) estimate of the parameter given the data. More precisely, if we represent the parameters and the data as the random vectors Θ and X,

E(Θ|X) = arg min_{θ̂(X)} E((θ̂(X) − Θ)²).    (29)

Proof. Let θ̂(X) denote an arbitrary estimator. We will show that the MSE incurred by θ̂(X) is always greater than or equal to the MSE incurred by E(Θ|X). We begin by computing the MSE conditioned on X = x,

E((θ̂(X) − Θ)² | X = x) = E((θ̂(x) − E(Θ|X = x) + E(Θ|X = x) − Θ)² | X = x)    (30)
= (θ̂(x) − E(Θ|X = x))² + E((E(Θ|X = x) − Θ)² | X = x)    (31)
  + 2 (θ̂(x) − E(Θ|X = x)) E(E(Θ|X = x) − Θ | X = x)
= (θ̂(x) − E(Θ|X = x))² + E((E(Θ|X = x) − Θ)² | X = x),    (32)

where the cross term vanishes because E(E(Θ|X = x) − Θ | X = x) = E(Θ|X = x) − E(Θ|X = x) = 0. By iterated expectation,

E((θ̂(X) − Θ)²) = E(E((θ̂(X) − Θ)² | X))    (33)
= E((θ̂(X) − E(Θ|X))²) + E((E(Θ|X) − Θ)²)    (34)
≥ E((E(Θ|X) − Θ)²),    (35)

since the expectation of a nonnegative quantity is nonnegative. This establishes that E(Θ|X) achieves the minimum MSE.
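A Monte Carlo check of this result for a Bernoulli parameter with a uniform prior: the posterior mean under that prior is (n_1 + 1)/(n + 2), which is the closed-form expression derived in Example 1.7 simplified using properties of the beta function, and its MSE, averaged over draws of the parameter and the data, should not exceed that of the ML estimate. The simulation parameters below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_flips = 100_000, 10

theta = rng.uniform(size=n_trials)             # parameter drawn from the uniform prior
n1 = rng.binomial(n_flips, theta)              # number of ones among n_flips Bernoulli draws

posterior_mean = (n1 + 1) / (n_flips + 2)      # E(Theta | X) under the uniform prior
ml_estimate = n1 / n_flips                     # ML estimate n1 / n

print(np.mean((posterior_mean - theta) ** 2))  # MSE of the posterior mean (smaller)
print(np.mean((ml_estimate - theta) ** 2))     # MSE of the ML estimate
```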

The following example illustrates Bayesian inference applied to the problem of determining the bias of a coin by flipping it several times, or equivalently of fitting the parameter of a Bernoulli random variable from iid realizations.

Example 1.7 (Bayesian analysis of the parameter of a Bernoulli distribution). Let x_1, x_2, . . . be data that we wish to model as iid samples from a Bernoulli distribution. Since we are taking a Bayesian approach we are forced to choose a prior distribution for the parameter of the Bernoulli. We will consider two different priors, which we represent by the random variables Θ_1 and Θ_2:


1. Θ_1 represents a conservative choice in terms of prior information. We assign a uniform pdf to the parameter, so any value in the unit interval has the same probability density:

   f_{Θ_1}(θ) = 1 for 0 ≤ θ ≤ 1, and 0 otherwise.    (36)

2. Θ_2 encodes the assumption that the parameter is closer to 1 than to 0. We could use it for instance to capture the suspicion that a coin is biased towards heads. We choose a skewed pdf that increases linearly from zero to one,

   f_{Θ_2}(θ) = 2θ for 0 ≤ θ ≤ 1, and 0 otherwise.    (37)

Recall from the likelihood computation (7) in Example 1.3 that the conditional pmf of the data given the parameter of the Bernoulli Θ equals

p_{X|Θ}(x|θ) = θ^{n_1} (1 − θ)^{n_0},    (38)

where n_1 is the number of ones in the data and n_0 the number of zeros. The posterior pdfs under the two priors are consequently equal to

f_{Θ_1|X}(θ|x) = f_{Θ_1}(θ) p_{X|Θ_1}(x|θ) / ∫_u f_{Θ_1}(u) p_{X|Θ_1}(x|u) du    (39)
= θ^{n_1} (1 − θ)^{n_0} / ∫_0^1 u^{n_1} (1 − u)^{n_0} du    (40)
= θ^{n_1} (1 − θ)^{n_0} / β(n_1 + 1, n_0 + 1),    (41)

f_{Θ_2|X}(θ|x) = θ^{n_1+1} (1 − θ)^{n_0} / ∫_0^1 u^{n_1+1} (1 − u)^{n_0} du    (42)
= θ^{n_1+1} (1 − θ)^{n_0} / β(n_1 + 2, n_0 + 1),    (43)

where

β(x, y) := ∫_0^1 u^{x−1} (1 − u)^{y−1} du    (45)

is a special tabulated function (the beta function).


In order to obtain point estimates for the parameter we compute the posterior means:

E(Θ_1|X = x) = ∫_0^1 θ f_{Θ_1|X}(θ|x) dθ    (46)
= ∫_0^1 θ^{n_1+1} (1 − θ)^{n_0} dθ / β(n_1 + 1, n_0 + 1)    (47)
= β(n_1 + 2, n_0 + 1) / β(n_1 + 1, n_0 + 1),    (48)

E(Θ_2|X = x) = ∫_0^1 θ f_{Θ_2|X}(θ|x) dθ    (49)
= β(n_1 + 3, n_0 + 1) / β(n_1 + 2, n_0 + 1).    (50)

Figure 2 shows the plot of the posterior distribution for different values of n_1 and n_0. It also plots the posterior mean and the ML estimate. For a small number of flips, the posterior pdf of Θ_2 is skewed to the right with respect to that of Θ_1, reflecting the prior belief that the parameter is closer to 1. However, for a large number of flips both posterior densities are very close, and so are the posterior means and the ML estimates; the likelihood term dominates as the amount of data grows.
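The posterior means (48) and (50) can be evaluated numerically; the sketch below uses scipy.special.betaln (the logarithm of the beta function) for numerical stability and compares them with the ML estimate for the data realizations shown in Figure 2.

```python
import numpy as np
from scipy.special import betaln   # log of the beta function, for numerical stability

def posterior_means(n1, n0):
    """Posterior means (48) and (50) for the uniform and skewed priors of Example 1.7."""
    mean_uniform = np.exp(betaln(n1 + 2, n0 + 1) - betaln(n1 + 1, n0 + 1))
    mean_skewed = np.exp(betaln(n1 + 3, n0 + 1) - betaln(n1 + 2, n0 + 1))
    return mean_uniform, mean_skewed

for n1, n0 in [(3, 1), (1, 3), (9, 91)]:                     # data realizations from Figure 2
    print(n1, n0, posterior_means(n1, n0), n1 / (n1 + n0))   # compare with the ML estimate
```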

In Figure 2 we can see that the ML estimate is the mode (maximum value) of the posterior distribution when the prior is uniform. This is no accident.

Lemma 1.8. The ML estimate is the mode (maximum value) of the posterior distribution if the prior distribution is uniform.

Proof. We prove the result when the model for the data and the parameters is continuous; if either or both of them are discrete the proof is identical. If the prior distribution of the parameters is uniform then

arg max_θ f_{Θ|X}(θ|x) = arg max_θ f_Θ(θ) f_{X|Θ}(x|θ) / ∫_u f_Θ(u) f_{X|Θ}(x|u) du    (51)
= arg max_θ f_{X|Θ}(x|θ)    (52)
= arg max_θ L_x(θ),    (53)

where the second equality holds because f_Θ(θ) is constant on its support and the denominator does not depend on θ; this is exactly the ML estimator.

Note that uniform priors are only well defined in situations where the parameter is restricted to a bounded set.


Figure 2: Posterior distributions of the parameter of a Bernoulli for two different priors and for different data realizations. The panels show the prior distributions and the posteriors for n_0 = 1, n_1 = 3, for n_0 = 3, n_1 = 1, and for n_0 = 91, n_1 = 9; the posterior means (uniform and skewed priors) and the ML estimator are also marked.


2 Nonparametric models

In situations where a parametric model is not available or does not fit the data adequately, we resort to nonparametric methods in order to characterize the unknown distribution that is supposed to generate the data. Learning a model that does not rely on a small number of parameters is challenging: we need to estimate the whole distribution just from the available samples. Without further assumptions this problem is ill posed; infinitely many different distributions could have generated the data. However, as we show below, with enough samples it is possible to obtain models that characterize the underlying distribution quite accurately.

2.1 Estimating the cdf

A way of characterizing the distribution that generates the data is to approximate its cdf. An intuitive estimate is obtained by computing the fraction of samples that are smaller than a certain value. This produces a piecewise constant estimator known as the empirical cdf.

Definition 2.1 (Empirical cdf). Let X_1, X_2, . . . be a sequence of random variables belonging to the same probability space. The value of the empirical cdf at any x ∈ R is

F̂_n(x) := (1/n) ∑_{i=1}^n 1_{X_i ≤ x}.    (54)
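A direct implementation of (54); the synthetic data below are hypothetical.

```python
import numpy as np

def empirical_cdf(samples, x):
    """Value of the empirical cdf (54) at each point in x: fraction of samples <= x."""
    samples = np.asarray(samples)
    x = np.atleast_1d(x)
    return np.array([np.mean(samples <= xi) for xi in x])

rng = np.random.default_rng(0)
data = rng.normal(loc=67.0, scale=3.0, size=100)   # synthetic height-like data
print(empirical_cdf(data, [60.0, 67.0, 74.0]))
```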

The empirical cdf is an unbiased and consistent estimator of the true cdf. This is established rigorously in Theorem 2.2 below, but is also illustrated empirically in Figure 3. The cdf of the height data in Figure 1 of Lecture Notes 4 is compared to three realizations of the empirical cdf computed from iid samples. As the number of available samples grows, the approximation becomes very accurate.

Theorem 2.2. Let X_1, X_2, . . . be an iid sequence with cdf F_X. For any fixed x ∈ R, F̂_n(x) is an unbiased and consistent estimator of F_X(x). In fact, F̂_n(x) converges in mean square to F_X(x).

Proof. First, we verify that

E(F̂_n(x)) = E((1/n) ∑_{i=1}^n 1_{X_i ≤ x})    (55)
= (1/n) ∑_{i=1}^n P(X_i ≤ x)   by linearity of expectation    (56)
= F_X(x),    (57)


so the estimator is unbiased. We now compute its mean square,

E(F̂_n(x)²) = E((1/n²) ∑_{i=1}^n ∑_{j=1}^n 1_{X_i ≤ x} 1_{X_j ≤ x})    (58)
= (1/n²) ∑_{i=1}^n P(X_i ≤ x) + (1/n²) ∑_{i=1}^n ∑_{j≠i} P(X_i ≤ x, X_j ≤ x)   by linearity of expectation
= F_X(x)/n + (1/n²) ∑_{i=1}^n ∑_{j≠i} F_{X_i}(x) F_{X_j}(x)   by independence    (59)
= F_X(x)/n + ((n − 1)/n) F_X(x)².    (60)

The variance is consequently equal to

Var(F̂_n(x)) = E(F̂_n(x)²) − E(F̂_n(x))²    (61)
= F_X(x)(1 − F_X(x))/n.    (62)

Since the estimator is unbiased, its mean square error equals its variance, so we conclude that

lim_{n→∞} E((F_X(x) − F̂_n(x))²) = lim_{n→∞} Var(F̂_n(x)) = 0.    (63)
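A quick simulation is consistent with these formulas; the standard Gaussian data, the sample size, and the evaluation point are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, n_trials, x0 = 50, 20_000, 0.5

# Draw n_trials independent samples of size n and evaluate the empirical cdf at x0.
samples = rng.normal(size=(n_trials, n))
F_hat = np.mean(samples <= x0, axis=1)

F_true = norm.cdf(x0)
print(F_hat.mean(), F_true)                    # unbiasedness: the means should match
print(F_hat.var(), F_true * (1 - F_true) / n)  # variance formula (62)
```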

2.2 Estimating the pdf

Estimating the pdf of a continuous quantity is much more challenging than estimating the cdf. If we have sufficient data, the fraction of samples that are smaller than a certain x provides a good estimate of the cdf at that point. However, no matter how much data we have, there is negligible probability that we will see any samples exactly at x: a pointwise empirical density estimator would equal zero almost everywhere (except at the available samples). How should we estimate f(x) then?

Intuitively, an estimator for f(x) should take into account the presence of samples at neighboring locations. If there are many samples close to x then we should estimate a higher probability density at x, whereas if all the samples are far away, then the estimate for f(x) should be small. The kernel density estimator implements these ideas by computing a local weighted average at each point x such that the contribution of each sample depends on its distance to x.


Figure 3: Cdf of the height data in Figure 1 of Lecture Notes 4 along with three realizations of the empirical cdf computed with n iid samples for n = 10, 100, 1000. (Each panel shows the true cdf and the empirical cdf as a function of height in inches.)


Definition 2.3 (Kernel density estimator). Let X_1, X_2, . . . be a sequence of random variables belonging to the same probability space. The value of the kernel density estimator with bandwidth h at x ∈ R is

f̂_{h,n}(x) := (1/(nh)) ∑_{i=1}^n k((x − X_i)/h),    (64)

where k is a kernel function that has a maximum at the origin, decreases away from the origin, and satisfies

k(x) ≥ 0 for all x ∈ R,    (65)
∫_R k(x) dx = 1.    (66)
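A direct implementation of (64) with a Gaussian kernel; the synthetic data and the evaluation grid below are made up, and the bandwidths mirror those used in Figure 4 below.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard Gaussian kernel, nonnegative and integrating to one."""
    return np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)

def kde(samples, x, h):
    """Kernel density estimate (64) with bandwidth h, evaluated at the points in x."""
    samples = np.asarray(samples)
    x = np.atleast_1d(x)
    # For each evaluation point, average the kernels centered at the samples.
    return np.array([np.mean(gaussian_kernel((xi - samples) / h)) / h for xi in x])

rng = np.random.default_rng(0)
data = rng.gamma(shape=2.0, scale=0.5, size=200)   # synthetic, skewed weight-like data
grid = np.linspace(-1, 4, 500)
for h in (0.05, 0.25, 0.5):                        # bandwidths as in Figure 4
    density = kde(data, grid, h)
    print(h, np.trapz(density, grid))              # each estimate integrates to roughly 1
```

Each estimate integrates to approximately one because the kernel does.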

Choosing a rectangular kernel yields an empirical density estimate that looks like a histogram. A more popular kernel is the Gaussian kernel k(x) = (1/√(2π)) e^{−x²/2}, which produces a smooth density estimate. Figure 4 shows the result of using a Gaussian kernel to estimate the probability density of the weight of a population of sea snails.¹ The whole population consists of 4177 individuals. Our task is to estimate this distribution from just 200 iid samples. The plots show the enormous influence that the bandwidth parameter, which determines the width of the kernel, can have on the result. If the bandwidth is very small, individual samples have a large influence on the density estimate. This allows the estimator to reproduce irregular shapes more easily, but also yields spurious fluctuations that are not present in the true curve. Increasing the bandwidth smooths out such fluctuations. However, increasing the bandwidth too much smooths out structure that may actually be present in the true pdf. A good tradeoff is difficult to achieve. In practice this parameter must be calibrated from the data.

¹ The data are available at archive.ics.uci.edu/ml/datasets/Abalone


Figure 4: Kernel density estimate for the weight of a population of abalone, a species of sea snail, as a function of weight in grams. In the plot above the density is estimated from 200 iid samples using a Gaussian kernel with three different bandwidths (0.05, 0.25, 0.5). Black crosses representing the individual samples are shown underneath. In the plot below we see the result of repeating the procedure three times using a fixed bandwidth equal to 0.25.
