Local regression


  • Advanced data analysis

    M. Gerolimetto, Dip. di Statistica

    Università Ca' Foscari Venezia

    [email protected]

    www.dst.unive.it/margherita

  • PART 4: LOCAL REGRESSION


  • Definition

    Local regression is an approach to fitting curves

    and surfaces to data by smoothing. It is called

    LOCAL since the fit at a generic point x0 is the

    value of a parametric function fitted only to those

    observations that are close to x0.

    In this sense it can be thought of as a natural extension of
    parametric fitting. Until now we have considered models like

    $y_i = \alpha + \beta x_i + \varepsilon_i, \quad i = 1, \ldots, N$

    that can be seen as

    $y_i = m(x_i) + \varepsilon_i, \quad i = 1, \ldots, N$

    where m is linear.

    When we assume that m(x) is an element of a

    specific parametric class of functions (for example

    linear) we are forcing the relationship to have a

    certain shape.


  • However it is possible that these models cannot be

    applied because of nonlinearity (especially of un-

    known form) in the data.

    In this sense nonparametric modelling is a good response because
    it is like placing a flexible curve on the (x, y) scatterplot
    with no parametric restrictions on the form of the curve.

    Moreover, nonparametric methods can help to see in the
    scatterplot the underlying structure of the data (smoothing).


  • Parametric localization

    The underlying model for local regression is:

    $y_i = m(x_i) + u_i, \quad i = 1, \ldots, N$

    The distribution of the $y_i$ is unknown.

    The means $m(x_i)$ are unknown.

    In practice we must model the data, which means making certain
    assumptions on $m$ and on other aspects of the distribution of
    the $y_i$.

    One common assumption is that the $y_i$ are homoskedastic.

    As for $m$, it is supposed that the function can be locally
    approximated by a member of a parametric class, usually chosen
    to be a polynomial of a certain degree.

    This is the parametric localization: in carrying out the local
    regression we use a parametric family as in global parametric
    fitting, but we ask only that the family fit locally and not
    globally.


    Suppose $x_0$ is a generic point in the support of the $x$
    variable. Suppose we do not know the function $m(x)$, but we can
    assume it is differentiable.

    To estimate $m(x)$ at $x_0$, we can think of using the Taylor
    expansion

    $m(x) = m(x_0) + m'(x_0)(x - x_0) + r$

    where $r$ is a quantity of order smaller than $(x - x_0)$.
    Any function (under certain regularity conditions) can be
    locally approximated by a line.

    It is possible to estimate $m(x)$ in a neighborhood of $x_0$ by
    minimizing weighted squared errors over the pairs $(x_i, y_i)$,
    $i = 1, \ldots, N$:

    $\min_{\alpha,\beta} \sum_{i=1}^{N} \{y_i - \alpha - \beta(x_i - x_0)\}^2 w_i$

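    As a rough illustration of the minimization above, here is a
    minimal Python sketch. The function name local_linear_fit, the
    Gaussian weights and the fixed bandwidth h are my assumptions,
    not the author's; the choice of weights is exactly the question
    raised on the next slide.

    import numpy as np

    def local_linear_fit(x, y, x0, h):
        """Weighted least squares fit of y ~ alpha + beta*(x - x0).

        Gaussian weights w_i = exp(-((x_i - x0)/h)^2 / 2) are used
        purely for illustration; h is a fixed bandwidth.
        Returns the local fit at x0, i.e. the estimated alpha.
        """
        w = np.exp(-0.5 * ((x - x0) / h) ** 2)          # larger when x_i is close to x0
        X = np.column_stack([np.ones_like(x), x - x0])  # design matrix [1, x_i - x0]
        XtW = X.T * w                                   # same as X.T @ diag(w)
        coef = np.linalg.solve(XtW @ X, XtW @ y)        # weighted normal equations
        return coef[0]                                  # m_hat(x0) = alpha_hat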

    The weights $w_i$ in the previous formula are often chosen so
    that they are larger when $(x_i - x_0)$ is smaller. This means
    that the closer $x_i$ is to the point $x_0$, the larger the
    weight.

    This minimization can be thought of as taking a local view
    around $x_0$: it is a weighted least squares problem.

    BIG ISSUES:

    1. How can the weights be chosen?

    2. How large should the neighborhood be?


    The estimate of $m$ that comes from the above definition is
    obtained with the following steps:

    1. for each fitting point $x_0$ define a neighborhood based on
    some metric in the space of the $x$ variable

    2. within this neighborhood assume that $m$ is approximated by
    some member of the chosen parametric family

    3. estimate the parameters from the observations in the
    neighborhood; the local fit at $x_0$ is the fitted function
    evaluated at $x_0$.

    Very often a weight function $w(u)$ is incorporated that gives
    greater weight to the $x_i$ that are closer to $x_0$ and smaller
    weight to the $x_i$ that are further from $x_0$.

    The estimation method used depends on the assumptions on the
    $y_i$. If the $y_i$ are assumed to be Gaussian with constant
    variance, then it makes sense to base estimation on least
    squares.


    Once $w_i$ and $h$ have been chosen, one is not interested in
    calculating $m$ only at a single point $x_0$, but typically on a
    set of values (usually uniformly spaced along the interval
    between $x_1$ and $x_N$).

    Practically, one creates a grid between $x_1$ and $x_N$
    consisting of $M$ points (uniformly spaced) and then computes
    the minimization over all points of the grid.

    This corresponds to having $M$ locally weighted least squares
    problems, one for each of the $M$ points of the grid, which in
    turn become the centers of the neighborhoods.

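    Continuing the sketch above (still assuming the hypothetical
    local_linear_fit with Gaussian weights), evaluating the fit over
    a uniform grid might look like this; the data are simulated
    purely for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.sort(rng.uniform(0, 10, 200))
    y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

    # Uniform grid of M points between x_1 and x_N; one weighted
    # least squares problem is solved per grid point.
    M, h = 100, 0.8
    grid = np.linspace(x.min(), x.max(), M)
    m_hat = np.array([local_linear_fit(x, y, x0, h) for x0 in grid])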

  • Modeling the data

    When using local regression the following are the

    choices to be made:

    1. Assumptions about the behaviour of m

    Weight function

    Bandwidth

    Parametric family

    2. Assumptions about the $y_i$

    Fitting criterion

    Unlike in parametric fitting, we do not rely on a priori
    knowledge. To make the choices listed above we use either
    i) graphical analysis of the data or ii) automatic methods to
    carry out model selection.


  • Trade-off... again!

    Modeling $m$ nonparametrically requires a trade-off between bias
    and variance, starting from the choice of the bandwidth (but not
    only!).

    In some applications there is a strong preference toward rougher
    estimates (smaller bias); in others there is a preference toward
    smoother estimates (bigger bias).

    Using model selection criteria, like cross-validation, has the
    advantage of an automatic choice (less subjectivity), but at the
    same time the disadvantage that it may give a poor answer in a
    particular application.

    With graphical criteria, the advantage is great power, but the
    disadvantage is that they are labor-intensive. They are good for
    picking a small number of parameters, but in the case of
    adaptive fitting the process becomes extremely time-consuming.


  • Selecting the weight function

    Supposing that $m$ is continuous, we will use weight functions
    that are peaked around 0 and decay smoothly as the distance from
    $x_0$ (let us call the distance $u$) increases.

    A smooth weight function results in a smoother

    estimate than, for example, using a rectangular

    weight function.

    A natural choice is to use Gaussian kernels. The tricube kernel
    is also often used, because of the computational advantage of a
    weight function that beyond a certain point (but smoothly) gives
    zero weight, compared to one that only approaches zero as $u$
    gets larger:

    $w(u) = \begin{cases} (1 - |u|^3)^3 & |u| < 1 \\ 0 & |u| \geq 1 \end{cases}$

    In case a Gaussian kernel is used, local regression takes the
    name of kernel regression. In case a tricube kernel is used
    (plus a nearest neighbors bandwidth), local regression takes the
    name of LOESS estimator, as we will see later on.

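    A minimal sketch of the tricube weight function above, assuming
    $u$ is the distance from $x_0$ already scaled by the bandwidth
    (the function name is mine):

    import numpy as np

    def tricube(u):
        """Tricube weight: (1 - |u|^3)^3 for |u| < 1, and 0 otherwise."""
        u = np.abs(u)
        return np.where(u < 1, (1 - u ** 3) ** 3, 0.0)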

  • Selecting the fitting criterion

    Virtually any global fitting procedure can be localized. So
    local regression can work with the same range of distributional
    assumptions as global parametric fitting.

    The simplest case is that of Gaussian $y_i$. Least squares
    approaches can be used. An objection to least squares is that
    those estimators are not robust to heavy-tailed residual
    distributions. Under these circumstances, ad hoc robustified
    fitting procedures are available (LOWESS).

    In case other distributions are hypothesized for the $y_i$, the
    locally weighted likelihood can be used. For example, in the
    case of binary data the nonparametric estimate is obtained by
    local likelihood.

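    As a rough sketch of local likelihood for binary data: one
    possibility is to fit, at each $x_0$, a logistic regression with
    kernel weights and read off the fitted probability at $x_0$. The
    Gaussian weights, the function name and the use of scikit-learn
    are my assumptions, not the author's method.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def local_logistic(x, y, x0, h):
        """Local likelihood estimate of P(y = 1 | x = x0) for binary y:
        a logistic regression on (x - x0) fitted with Gaussian kernel
        weights centred at x0, evaluated at x0."""
        w = np.exp(-0.5 * ((x - x0) / h) ** 2)   # kernel weights
        X = (x - x0).reshape(-1, 1)
        # Large C makes the default ridge penalty negligible, so this is
        # approximately an unpenalized locally weighted likelihood fit.
        model = LogisticRegression(C=1e6).fit(X, y, sample_weight=w)
        return model.predict_proba(np.array([[0.0]]))[0, 1]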

  • Selecting the bandwidth and local family

    These issues will be discussed together since they are strongly
    connected.

    Both the choice of the bandwidth parameter and that of the
    parametric family are related to the goal of producing an
    estimate that is as smooth as possible without distorting the
    underlying pattern of dependence of the response on the
    independent variables.

    As for kernel estimates of density functions, a bal-

    ance between bias and variance must be found.

    As for bandwidth selection, we will consider fixed and nearest
    neighbors bandwidths. As for the parametric family, the choice
    will be made among polynomial forms with the degree ranging from
    0 to 3.


  • Nearest neighbor bandwidths vs fixed bandwidth

    The problem with a fixed bandwidth is that it produces strong
    swings in variance when there are large changes in the density
    of the data.

    The boundary issue plays a major role in the bandwidth choice.
    The issue is that using the same bandwidth at the boundary
    (where observations can be more sparse) as in the interior can
    produce estimates with a large variability. Think of Gaussian
    data!

    A variable bandwidth (such as nearest neighbors) appears to
    perform better overall in applications because of this variance
    issue.

    Of course nearest neighbors can fail for some specific examples,
    but the remedy is not the fixed bandwidth; it is rather adaptive
    methods.

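    A minimal sketch of a nearest neighbors bandwidth, assuming the
    common convention of taking the local bandwidth at $x_0$ as the
    distance to the k-th closest observation (the name nn_bandwidth
    is mine):

    import numpy as np

    def nn_bandwidth(x, x0, k):
        """Nearest neighbors bandwidth at x0: the distance to the k-th
        closest observation, so the local neighborhood always contains
        k points regardless of how dense the data are around x0."""
        d = np.sort(np.abs(np.asarray(x) - x0))
        return d[k - 1]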

  • Polynomial degree

    The choice of the polynomial degree is also a bias-

    variance trade-off: a higher degree will produce a

    less biased, but more variable estimate.

    In case the degree is 0, the local regression estimate is:

    $\hat{m}(x) = \frac{\sum_{i=1}^{N} K\left(\frac{x - x_i}{h}\right) y_i}{\sum_{i=1}^{N} K\left(\frac{x - x_i}{h}\right)}$

    This choice p = 0 is quite well-known in nonpara-

    metric literature (it is called local constant regres-

    sion), because it is the one for which the asymp-

    totic theory has been derived. However this case

    is, at the same time, the one that in practice has

    less frequently shown good performance.

    The problem with local constant regression is that

    it cannot reproduce a line even in the very special

    case of equally spaced data away from boundaries.

    Reducing the lack of fit to a tolerable level requires

    very small bandwidths that end up in a very rough

    estimate.

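    A minimal sketch of the local constant (p = 0) estimate above,
    assuming a Gaussian kernel; the function name is mine:

    import numpy as np

    def nadaraya_watson(x, y, x0, h):
        """Local constant (p = 0) estimate at x0: a kernel-weighted
        average of the responses, here with a Gaussian kernel K."""
        k = np.exp(-0.5 * ((x0 - x) / h) ** 2)   # K((x - x_i)/h), up to a constant
        return np.sum(k * y) / np.sum(k)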

  • So, by using a polynomial degree greater than zero

    it is possible to increase the bandwidth (so reducing

    the roughness) without introducing an intolerable

    bias.

    In case the degree is 1, the local regression estimate is:

    $\hat{m}(x) = \frac{\sum_{i=1}^{N} K\left(\frac{x - x_i}{h}\right) y_i}{\sum_{i=1}^{N} K\left(\frac{x - x_i}{h}\right)} + (x - \bar{X}_w)\, \frac{\sum_{i=1}^{N} K\left(\frac{x - x_i}{h}\right)(x_i - \bar{X}_w)\, y_i}{\sum_{i=1}^{N} K\left(\frac{x - x_i}{h}\right)(x_i - \bar{X}_w)^2}$

    where

    $\bar{X}_w = \frac{\sum_{i=1}^{N} K\left(\frac{x - x_i}{h}\right) x_i}{\sum_{i=1}^{N} K\left(\frac{x - x_i}{h}\right)}$

    This choice p = 1 is called local linear regression.

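    A minimal sketch of the p = 1 formula above, again assuming a
    Gaussian kernel (the function name is mine); it is the closed
    form of the weighted least squares sketch shown earlier.

    import numpy as np

    def local_linear_closed_form(x, y, x0, h):
        """Local linear (p = 1) estimate at x0, computed directly from
        the explicit formula: a kernel-weighted mean plus a correction
        proportional to (x0 - Xw_bar)."""
        k = np.exp(-0.5 * ((x0 - x) / h) ** 2)
        xw = np.sum(k * x) / np.sum(k)                 # weighted mean of the x_i
        level = np.sum(k * y) / np.sum(k)              # local constant part
        slope = np.sum(k * (x - xw) * y) / np.sum(k * (x - xw) ** 2)
        return level + (x0 - xw) * slope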

  • Notable cases

    1. Kernel regression is a local constant regression (p = 0)
    where the weighting mechanism is based on typical kernel
    functions (in particular the Gaussian). It is also called
    Nadaraya-Watson regression.

    2. The LOESS estimator for local regression is characterized by
    a tricube weighting mechanism and a nearest neighbours
    bandwidth.


  • Kernel regression theory

    For kernel regression much theory has been developed, even
    though it is not the best option in practice.

    The model is

    $y = m(x) + u$

    For a given choice of $K$ and $h$ (fixed), we suppose that the
    data are i.i.d. and that the $x$'s are not stochastic.

    BIAS

    Similarly to kernel density estimators, the kernel regression
    estimator has a bias of size $O(h^2)$:

    $b(x_0) = h^2 \left[ \frac{m'(x_0) f'(x_0)}{f(x_0)} + \frac{1}{2} m''(x_0) \right] \int z^2 K(z)\, dz$

    Given a value for $h$, the bias varies with the kernel function
    that we use, but most of all it depends on the slope and the
    curvature of the function $m$ at $x_0$ and on the slope of
    $f(x_0)$, the density of the regressors. In kernel density
    estimation, instead, the bias depends only on $f(x)$.


    LIMIT DISTRIBUTION

    The kernel regression estimator has a limit distribution which
    is normal:

    $\sqrt{Nh}\,\left(\hat{m}(x_0) - m(x_0) - b(x_0)\right) \to N\!\left(0,\ \frac{\sigma^2}{f(x_0)} \int K(z)^2\, dz\right)$

    Note that the variance of the estimator $\hat{m}(x_0)$ is
    inversely related to $f(x_0)$, which means that the variance of
    $\hat{m}(x_0)$ is bigger in regions where $x$ is sparse.

    BANDWIDTH

    The choice of the bandwidth is once more connected to the
    bias-variance trade-off.

    As in kernel density estimation, the bandwidth can be determined
    using different methods; we will see them in the next slides.


  • Choosing the bandwidth: Optimal rule

    A value of h that minimizes MISE in an asymptotic

    sense would be an optimal bandwidth.

    Remember that MSE (mean squared error) measures the local
    performance of $\hat{m}$ at $x_0$; in this case it takes the
    form:

    $MSE[\hat{m}(x_0)] = E\left[(\hat{m}(x_0) - m(x_0))^2\right]$

    The MISE (mean integrated squared error) is a global measure of
    performance:

    $MISE(h) = \int MSE[\hat{m}(x_0)]\, f(x_0)\, dx_0$

    where $f$ is the density of the regressors.

    The optimal bandwidth is obtained by minimizing the MISE, and
    this yields $h = O(N^{-1/5})$.

    It has been shown that the kernel estimate con-

    verges with a rate that is slower than the paramet-

    ric estimate.


  • Choosing the bandwidth: Cross-validation

    An empirical estimate of the optimal $h$ can be obtained using
    the leave-one-out cross-validation procedure, thus minimizing:

    $CV(h) = \sum_{i=1}^{N} \left(y_i - \hat{m}_{-i}(x_i)\right)^2$

    where $\hat{m}_{-i}$ is the estimate computed leaving out the
    $i$-th observation.

    The optimality properties derive from the asymptotic equivalence
    between minimizing $CV(h)$ and minimizing $MISE(h)$ or $ISE(h)$,
    recalling that, similarly to what was presented in the previous
    section:

    $ISE(h) = \int \left(\hat{m}(x_0) - m(x_0)\right)^2 f(x_0)\, dx_0$

    Plug-in

    In the kernel regression context plug-in rules are usually not
    used; CV is preferred.

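    A minimal sketch of leave-one-out cross-validation for $h$,
    reusing the hypothetical nadaraya_watson function sketched
    earlier; the candidate grid of bandwidths is arbitrary and x, y
    stand for the observed data.

    import numpy as np

    def loo_cv_score(x, y, h):
        """CV(h): sum of squared leave-one-out prediction errors for the
        Nadaraya-Watson estimator with bandwidth h."""
        n = x.size
        errors = np.empty(n)
        for i in range(n):
            xi = np.delete(x, i)
            yi = np.delete(y, i)
            errors[i] = y[i] - nadaraya_watson(xi, yi, x[i], h)
        return np.sum(errors ** 2)

    candidate_h = np.linspace(0.2, 2.0, 19)
    # best_h = min(candidate_h, key=lambda h: loo_cv_score(x, y, h))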

  • LOESS estimator

    The LOESS estimator is a local regression estima-

    tor where:

    1. the weight function used for LOESS is the tri-

    cube weight function

    2. the local polynomials are almost always of first or second
    degree (that is, either locally linear or locally quadratic)

    3. the subsets of data used for each weighted least

    squares fit in LOESS are determined by a near-

    est neighbors algorithm


  • About the third characteristic, usually the smooth-

    ing parameter, q, is a number between (p+1)/N

    and 1, with p denoting the degree of the local poly-

    nomial.

    Large values of $q$ produce the smoothest functions, which do
    not react much to fluctuations in the data. Smaller values of
    $q$ make the regression function follow the data more closely.

    Note that using too small a value of the smoothing parameter is
    not desirable, however, since the regression function will
    eventually start to capture the random error in the data (too
    rough!). Good values of the smoothing parameter typically lie in
    the range 0.25 to 0.5 for most LOESS applications.

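    As a usage note, a LOWESS-type fit (the robustified variant
    mentioned earlier, with tricube weights and a nearest neighbors
    fraction) is available in statsmodels. A minimal sketch,
    assuming the fraction frac plays the role of $q$ above; the data
    are simulated purely for illustration.

    import numpy as np
    from statsmodels.nonparametric.smoothers_lowess import lowess

    rng = np.random.default_rng(1)
    x = np.sort(rng.uniform(0, 10, 300))
    y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

    # frac is the nearest-neighbour smoothing fraction; values around
    # 0.25-0.5 are a common starting point, as suggested above.
    fitted = lowess(y, x, frac=0.3)   # column 0: sorted x, column 1: fitted values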