Local regression


  • Advanced data analysis

    M. Gerolimetto, Dip. di Statistica

    Università Ca' Foscari Venezia

    [email protected]

    www.dst.unive.it/margherita

  • PART 4: LOCAL REGRESSION


  • Definition

    Local regression is an approach to fitting curves

    and surfaces to data by smoothing. It is called

    LOCAL since the fit at a generic point x0 is the

    value of a parametric function fitted only to those

    observations that are close to x0.

    In this sense it can be thought of as a natural extension of
    parametric fitting. Until now we have considered models like

    $y_i = \alpha + \beta x_i + \varepsilon_i, \quad i = 1, \ldots, N$

    that can be seen as

    $y_i = m(x_i) + \varepsilon_i, \quad i = 1, \ldots, N$

    where m is linear.

    When we assume that m(x) is an element of a

    specific parametric class of functions (for example

    linear) we are forcing the relationship to have a

    certain shape.


  • However it is possible that these models cannot be

    applied because of nonlinearity (especially of un-

    known form) in the data.

    In this sense nonparametric modelling is a good response because
    it is like placing a flexible curve on the (x, y) scatterplot
    with no parametric restrictions on the form of the curve.

    Moreover, nonparametric methods can help to see in the
    scatterplot the underlying structure of the data (smoothing).


  • Parametric localization

    The underlying model for local regression is:

    $y_i = m(x_i) + u_i, \quad i = 1, \ldots, N$

    The distribution of the $y_i$ is unknown.

    The means $m(x_i)$ are unknown.

    In practice we must model the data, which means making certain
    assumptions on $m$ and on other aspects of the distribution of
    the $y_i$.

    One common assumption is that the $y_i$ are homoskedastic.

    As for $m$, it is supposed that the function can be locally
    approximated by a member of a parametric class, usually chosen
    to be a polynomial of a certain degree.

    This is the parametric localization: in carrying out the local
    regression we use a parametric family as in global parametric
    fitting, but we ask only that the family fit locally and not
    globally.


    Suppose $x_0$ is a generic point in the support of the $x$
    variable. Suppose we do not know the function $m(x)$, but we can
    assume it is differentiable.

    To estimate $m(x)$ at $x_0$, we can think of using the Taylor
    expansion

    $m(x) = m(x_0) + m'(x_0)(x - x_0) + r$

    where $r$ is a quantity of order smaller than $(x - x_0)$.
    Any function (under certain regularity conditions) can be
    locally approximated by a line.

    It is possible to estimate $m(x)$ in a neighborhood of $x_0$ by
    minimizing weighted squared errors over the pairs $(x_i, y_i)$,
    $i = 1, \ldots, N$:

    $\min_{\alpha,\beta} \sum_{i=1}^{N} \{y_i - \alpha - \beta(x_i - x_0)\}^2 w_i$

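    As a rough illustration of the minimization above, here is a
    minimal Python sketch. The function name local_linear_fit, the
    Gaussian weights and the fixed bandwidth h are my assumptions,
    not the author's; the choice of weights is exactly the question
    raised on the next slide.

    import numpy as np

    def local_linear_fit(x, y, x0, h):
        """Weighted least squares fit of y ~ alpha + beta*(x - x0).

        Gaussian weights w_i = exp(-((x_i - x0)/h)^2 / 2) are used
        purely for illustration; h is a fixed bandwidth.
        Returns the local fit at x0, i.e. the estimated alpha.
        """
        w = np.exp(-0.5 * ((x - x0) / h) ** 2)          # larger when x_i is close to x0
        X = np.column_stack([np.ones_like(x), x - x0])  # design matrix [1, x_i - x0]
        XtW = X.T * w                                   # same as X.T @ diag(w)
        coef = np.linalg.solve(XtW @ X, XtW @ y)        # weighted normal equations
        return coef[0]                                  # m_hat(x0) = alpha_hat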

    The weights $w_i$ in the previous formula are often chosen so
    that they are larger when $(x_i - x_0)$ is smaller. This means
    that the closer $x_i$ is to the point $x_0$, the larger the
    weight.

    This minimization can be thought of as taking a local view
    around $x_0$: it is a weighted least squares problem.

    BIG ISSUES:

    1. How can the weights be chosen?

    2. How large should the neighborhood be?


    The estimate of $m$ that comes from the above definition is
    obtained with the following steps:

    1. for each fitting point $x_0$ define a neighborhood based on
    some metric in the space of the $x$ variable

    2. within this neighborhood assume that $m$ is approximated by
    some member of the chosen parametric family

    3. estimate the parameters from the observations in the
    neighborhood; the local fit at $x_0$ is the fitted function
    evaluated at $x_0$.

    Very often a weight function $w(u)$ is incorporated that gives
    greater weight to the $x_i$ that are closer to $x_0$ and smaller
    weight to the $x_i$ that are further from $x_0$.

    The estimation method used depends on the assumptions on the
    $y_i$. If the $y_i$ are assumed to be Gaussian with constant
    variance, then it makes sense to base estimation on least
    squares.


    Once $w_i$ and $h$ have been chosen, one is not interested in
    calculating $m$ only at a single point $x_0$, but typically on a
    set of values (usually uniformly spaced along the interval
    between $x_1$ and $x_N$).

    Practically, one creates a grid between $x_1$ and $x_N$
    consisting of $M$ points (uniformly spaced) and then computes
    the minimization over all points of the grid.

    This corresponds to having $M$ locally weighted least squares
    problems, one for each of the $M$ points of the grid, which in
    turn become the centers of the neighborhoods.

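    Continuing the sketch above (still assuming the hypothetical
    local_linear_fit with Gaussian weights), evaluating the fit over
    a uniform grid might look like this; the data are simulated
    purely for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.sort(rng.uniform(0, 10, 200))
    y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

    # Uniform grid of M points between x_1 and x_N; one weighted
    # least squares problem is solved per grid point.
    M, h = 100, 0.8
    grid = np.linspace(x.min(), x.max(), M)
    m_hat = np.array([local_linear_fit(x, y, x0, h) for x0 in grid])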

  • Modeling the data

    When using local regression the following are the

    choices to be made:

    1. Assumptions about the behaviour of m

    Weight function

    Bandwidth

    Parametric family

    2. Assumptions about the $y_i$

    Fitting criterion

    Unlike in parametric fitting, we do not rely on a priori
    knowledge. To make the choices listed above we use either
    i) graphical analysis of the data or ii) automatic methods to
    carry out model selection.


  • Trade-off... again!

    Modeling $m$ nonparametrically requires a trade-off between bias
    and variance, starting from the choice of the bandwidth (but not
    only!).

    In some applications there is a strong preference toward rougher
    estimates (smaller bias); in others there is a preference toward
    smoother estimates (bigger bias).

    Using model selection criteria, like cross-validation, has the
    advantage of an automatic choice (less subjectivity), but at the
    same time the disadvantage that it may give a poor answer in a
    particular application.

    With graphical criteria, the advantage is great power, but the
    disadvantage is that they are labor-intensive. They are good for
    picking a small number of parameters, but in the case of
    adaptive fitting the process becomes extremely time-consuming.


  • Selecting the weight function

    Supposing that $m$ is continuous, we will use weight functions
    that are peaked around 0 and decay smoothly as the distance from
    $x_0$ (let us call the distance $u$) increases.

    A smooth weight function results in a smoother

    estimate than, for example, using a rectangular

    weight function.

    A natural choice is to use Gaussian kernels. The tricube kernel
    is also often used, because of the computational advantage of a
    weight function that beyond a certain point (but smoothly) gives
    zero weight, compared to one that only approaches zero as $u$
    gets larger:

    $w(u) = \begin{cases} (1 - |u|^3)^3 & |u| < 1 \\ 0 & |u| \geq 1 \end{cases}$

    In case a Gaussian kernel is used, local regression takes the
    name of kernel regression. In case a tricube kernel is used
    (plus a nearest neighbors bandwidth), local regression takes the
    name of LOESS estimator, as we will see later on.

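    A minimal sketch of the tricube weight function above, assuming
    $u$ is the distance from $x_0$ already scaled by the bandwidth
    (the function name is mine):

    import numpy as np

    def tricube(u):
        """Tricube weight: (1 - |u|^3)^3 for |u| < 1, and 0 otherwise."""
        u = np.abs(u)
        return np.where(u < 1, (1 - u ** 3) ** 3, 0.0)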

  • Selecting the fitting criterion

    Virtually any global fitting procedure can be localized. So
    local regression can work with the same range of distributional
    assumptions as global parametric fitting.

    The simplest case is that of Gaussian $y_i$. Least squares
    approaches can be used. An objection to least squares is that
    those estimators are not robust to heavy-tailed residual
    distributions. Under these circumstances, ad hoc robustified
    fitting procedures are available (LOWESS).

    In case other distributions are hypothesized for the $y_i$, the
    locally weighted likelihood can be used. For example, in the
    case of binary data the nonparametric estimate is obtained by
    local likelihood.

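    As a rough sketch of local likelihood for binary data: one
    possibility is to fit, at each $x_0$, a logistic regression with
    kernel weights and read off the fitted probability at $x_0$. The
    Gaussian weights, the function name and the use of scikit-learn
    are my assumptions, not the author's method.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def local_logistic(x, y, x0, h):
        """Local likelihood estimate of P(y = 1 | x = x0) for binary y:
        a logistic regression on (x - x0) fitted with Gaussian kernel
        weights centred at x0, evaluated at x0."""
        w = np.exp(-0.5 * ((x - x0) / h) ** 2)   # kernel weights
        X = (x - x0).reshape(-1, 1)
        # Large C makes the default ridge penalty negligible, so this is
        # approximately an unpenalized locally weighted likelihood fit.
        model = LogisticRegression(C=1e6).fit(X, y, sample_weight=w)
        return model.predict_proba(np.array([[0.0]]))[0, 1]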

  • Selecting the bandwidth and local family

    These issues will be discussed together since they are strongly
    connected.

    Both the choice of the bandwidth parameter and that of the
    parametric family are related to the goal of producing an
    estimate that is as smooth as possible without distorting the
    underlying pattern of dependence of the response on the
    independent variables.

    As for kernel estimates of density functions, a bal-

    ance between bias and variance must be found.

    As for bandwidth selection, we will consider fixed and nearest
    neighbors bandwidths. As for the parametric family, the choice
    will be made among polynomial forms with the degree ranging from
    0 to 3.


  • Nearest neighbor bandwidths vs fixed bandwidth

    The problem with a fixed bandwidth is that it produces strong
    swings in variance when there are large changes in the density
    of the data.

    The boundary issue plays a major role in the bandwidth choice.
    The issue is that using the same bandwidth at the boundary
    (where observations can be more sparse) as in the interior can
    produce estimates with a large variability. Think of Gaussian
    data!

    A variable bandwidth (such as nearest neighbors) appears to
    perform better overall in applications because of this variance
    issue.

    Of course nearest neighbors can fail for some specific examples,
    but the remedy is not the fixed bandwidth; it is rather adaptive
    methods.

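    A minimal sketch of a nearest neighbors bandwidth, assuming the
    common convention of taking the local bandwidth at $x_0$ as the
    distance to the k-th closest observation (the name nn_bandwidth
    is mine):

    import numpy as np

    def nn_bandwidth(x, x0, k):
        """Nearest neighbors bandwidth at x0: the distance to the k-th
        closest observation, so the local neighborhood always contains
        k points regardless of how dense the data are around x0."""
        d = np.sort(np.abs(np.asarray(x) - x0))
        return d[k - 1]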

  • Polynomial degree

    The choice of the polynomial degree is also a bias-

    variance trade-off: a higher degree will produce a

    less biased, but more variable estimate.

    In case the degree is 0, the local regression estimate is:

    $\hat{m}(x) = \frac{\sum_{i=1}^{N} K\left(\frac{x - x_i}{h}\right) y_i}{\sum_{i=1}^{N} K\left(\frac{x - x_i}{h}\right)}$

    This choice p = 0 is quite well-known in nonpara-

    metric literature (it is called local constant regres-

    sion), because it is the one for which the asymp-

    totic theory has been derived. However this case

    is, at the same time, the one that in practice has

    less frequently shown good performance.

    The problem with local constant regression is that

    it cannot reproduce a line even in the very special

    case of equally spaced data away from boundaries.

    Reducing the lack of fit to a tolerable level requires

    very small bandwidths that end up in a very rough

    estimate.

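    A minimal sketch of the local constant (p = 0) estimate above,
    assuming a Gaussian kernel; the function name is mine:

    import numpy as np

    def nadaraya_watson(x, y, x0, h):
        """Local constant (p = 0) estimate at x0: a kernel-weighted
        average of the responses, here with a Gaussian kernel K."""
        k = np.exp(-0.5 * ((x0 - x) / h) ** 2)   # K((x - x_i)/h), up to a constant
        return np.sum(k * y) / np.sum(k)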

  • So, by using a polynomial degree greater than zero

    it is possible to increase the bandwidth (so reducing

    the roughness) without introducing an intolerable

    bias.

    In case the degree is 1, the local regression estimate is:

    $\hat{m}(x) = \frac{\sum_{i=1}^{N} K\left(\frac{x - x_i}{h}\right) y_i}{\sum_{i=1}^{N} K\left(\frac{x - x_i}{h}\right)} + (x - \bar{X}_w)\, \frac{\sum_{i=1}^{N} K\left(\frac{x - x_i}{h}\right)(x_i - \bar{X}_w)\, y_i}{\sum_{i=1}^{N} K\left(\frac{x - x_i}{h}\right)(x_i - \bar{X}_w)^2}$

    where

    $\bar{X}_w = \frac{\sum_{i=1}^{N} K\left(\frac{x - x_i}{h}\right) x_i}{\sum_{i=1}^{N} K\left(\frac{x - x_i}{h}\right)}$

    This choice p = 1 is called local linear regression.

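    A minimal sketch of the p = 1 formula above, again assuming a
    Gaussian kernel (the function name is mine); it is the closed
    form of the weighted least squares sketch shown earlier.

    import numpy as np

    def local_linear_closed_form(x, y, x0, h):
        """Local linear (p = 1) estimate at x0, computed directly from
        the explicit formula: a kernel-weighted mean plus a correction
        proportional to (x0 - Xw_bar)."""
        k = np.exp(-0.5 * ((x0 - x) / h) ** 2)
        xw = np.sum(k * x) / np.sum(k)                 # weighted mean of the x_i
        level = np.sum(k * y) / np.sum(k)              # local constant part
        slope = np.sum(k * (x - xw) * y) / np.sum(k * (x - xw) ** 2)
        return level + (x0 - xw) * slope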

  • Notable cases

    1. Kernel regression is a local constant regression (p = 0)
    where the weighting mechanism is based on typical kernel
    functions (in particular the Gaussian). It is also called
    Nadaraya-Watson regression.

    2. The LOESS estimator for local regression is characterized by
    a tricube weighting mechanism and a nearest neighbours
    bandwidth.


  • Kernel regression theory

    For kernel regression much theory has been developed, even
    though it is not the best option in practice.

    The model is

    $y = m(x) + u$

    For a given choice of $K$ and $h$ (fixed), we suppose that the
    data are i.i.d. and that the $x$'s are not stochastic.

    BIAS

    Similarly to kernel density estimators, the kernel regression
    estimator has a bias of size $O(h^2)$:

    $b(x_0) = h^2 \left[ \frac{m'(x_0) f'(x_0)}{f(x_0)} + \frac{1}{2} m''(x_0) \right] \int z^2 K(z)\, dz$

    Given a value for $h$, the bias varies with the kernel function
    that we use, but most of all it depends on the slope and the
    curvature of the function $m$ at $x_0$ and on the slope of
    $f(x_0)$, the density of the regressors. In kernel density
    estimation, instead, the bias depends only on $f(x)$.


    LIMIT DISTRIBUTION

    The kernel regression estimator has a limit distribution which
    is normal:

    $\sqrt{Nh}\,\left(\hat{m}(x_0) - m(x_0) - b(x_0)\right) \to N\!\left(0,\ \frac{\sigma^2}{f(x_0)} \int K(z)^2\, dz\right)$

    Note that the variance of the estimator $\hat{m}(x_0)$ is
    inversely related to $f(x_0)$, which means that the variance of
    $\hat{m}(x_0)$ is bigger in regions where $x$ is sparse.

    BANDWIDTH

    The choice of the bandwidth is once more connected to the
    bias-variance trade-off.

    As in kernel density estimation, the bandwidth can be determined
    using different methods; we will see them in the next slides.


  • Choosing the bandwidth: Optimal rule

    A value of h that minimizes MISE in an asymptotic

    sense would be an optimal bandwidth.

    Remember that MSE (mean squared error) measures the local
    performance of $\hat{m}$ at $x_0$; in this case it takes the
    form:

    $MSE[\hat{m}(x_0)] = E\left[(\hat{m}(x_0) - m(x_0))^2\right]$

    The MISE (mean integrated squared error) is a global measure of
    performance:

    $MISE(h) = \int MSE[\hat{m}(x_0)]\, f(x_0)\, dx_0$

    where $f$ is the density of the regressors.

    The optimal bandwidth is obtained by minimizing the MISE, and
    this yields $h = O(N^{-1/5})$.

    It has been shown that the kernel estimate con-

    verges with a rate that is slower than the paramet-

    ric estimate.


  • Choosing the bandwidth: Cross-validation

    An empirical estimate of the optimal $h$ can be obtained using
    the leave-one-out cross-validation procedure, thus minimizing:

    $CV(h) = \sum_{i=1}^{N} \left(y_i - \hat{m}_{-i}(x_i)\right)^2$

    where $\hat{m}_{-i}$ is the estimate computed leaving out the
    $i$-th observation.

    The optimality properties derive from the asymptotic equivalence
    between minimizing $CV(h)$ and minimizing $MISE(h)$ or $ISE(h)$,
    recalling that, similarly to what was presented in the previous
    section:

    $ISE(h) = \int \left(\hat{m}(x_0) - m(x_0)\right)^2 f(x_0)\, dx_0$

    Plug-in

    In the kernel regression context plug-in rules are usually not
    used; CV is preferred.

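    A minimal sketch of leave-one-out cross-validation for $h$,
    reusing the hypothetical nadaraya_watson function sketched
    earlier; the candidate grid of bandwidths is arbitrary and x, y
    stand for the observed data.

    import numpy as np

    def loo_cv_score(x, y, h):
        """CV(h): sum of squared leave-one-out prediction errors for the
        Nadaraya-Watson estimator with bandwidth h."""
        n = x.size
        errors = np.empty(n)
        for i in range(n):
            xi = np.delete(x, i)
            yi = np.delete(y, i)
            errors[i] = y[i] - nadaraya_watson(xi, yi, x[i], h)
        return np.sum(errors ** 2)

    candidate_h = np.linspace(0.2, 2.0, 19)
    # best_h = min(candidate_h, key=lambda h: loo_cv_score(x, y, h))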

  • LOESS estimator

    The LOESS estimator is a local regression estima-

    tor where:

    1. the weight function used for LOESS is the tri-

    cube weight function

    2. the local polynomials are almost always of first or second
    degree (that is, either locally linear or locally quadratic)

    3. the subsets of data used for each weighted least

    squares fit in LOESS are determined by a near-

    est neighbors algorithm


  • About the third characteristic, usually the smooth-

    ing parameter, q, is a number between (p+1)/N

    and 1, with p denoting the degree of the local poly-

    nomial.

    Large values of $q$ produce the smoothest functions, which do
    not react much to fluctuations in the data. Smaller values of
    $q$ make the regression function follow the data more closely.

    Note that using too small a value of the smoothing parameter is
    not desirable, however, since the regression function will
    eventually start to capture the random error in the data (too
    rough!). Good values of the smoothing parameter typically lie in
    the range 0.25 to 0.5 for most LOESS applications.

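    As a usage note, a LOWESS-type fit (the robustified variant
    mentioned earlier, with tricube weights and a nearest neighbors
    fraction) is available in statsmodels. A minimal sketch,
    assuming the fraction frac plays the role of $q$ above; the data
    are simulated purely for illustration.

    import numpy as np
    from statsmodels.nonparametric.smoothers_lowess import lowess

    rng = np.random.default_rng(1)
    x = np.sort(rng.uniform(0, 10, 300))
    y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

    # frac is the nearest-neighbour smoothing fraction; values around
    # 0.25-0.5 are a common starting point, as suggested above.
    fitted = lowess(y, x, frac=0.3)   # column 0: sorted x, column 1: fitted values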