
  • 8/6/2019 Fast Algorithms for Robust Regression

    1/18

    Fast Algorithms for Robust Regression

    Thorsten Bernholt Robin Nunkesser

    Department of Computer Science, University of Dortmund

    Statistical Computing 2006

    Thorsten Bernholt, Robin Nunkesser Fast Algorithms for Robust Regression Statistical Computing 1 / 17


    Outline

    1 Introduction
        Robust Regression
        Time Series Analysis in Intensive Care

    2 Statistics and Computer Science
        Using the Toolkits of Computer Science
        Problem Transformation

    3 Example


    Robust Regression

    Definition (Donoho and Huber (1983))

    The (finite sample) breakdown point is the smallest fraction of data points that need to be changed to have an unbounded effect on the estimate.
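
    The definition can be made concrete with a small sketch. The mean and the median are used here only as familiar stand-ins for a non-robust and a robust estimator; they are not the regression estimators discussed in the talk, and the helper name `contaminate` and the sample values are made up for illustration.

```python
import statistics


def contaminate(sample, k, value):
    """Return a copy of `sample` whose first k entries are replaced by `value`."""
    return [value] * k + list(sample[k:])


# Hypothetical sample of 7 observations.
sample = [9.0, 10.0, 11.0, 10.5, 9.5, 10.2, 9.8]

for k in (0, 1, 3, 4):
    bad = contaminate(sample, k, 1e9)
    # The mean is carried away by a single changed point (k = 1),
    # the median only once more than half of the points are changed (k = 4).
    print(k, statistics.mean(bad), statistics.median(bad))
```

    In this sample the median stays within the original data range as long as at most 3 of the 7 points are replaced, whereas the mean is already dominated by one replaced point.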

    [Figure: Number of international phone calls originated in Belgium. x-axis: Year (1950 to 1970); y-axis: Number of Calls (in millions), 0 to 200; fitted lines: LQD and LS.]


    Time Series Analysis

    [Figure: Heart rate of a patient in intensive care. x-axis: time (0 to 5000); y-axis: heart rate (0 to 200).]

    Time series data is monitored online, e.g. in intensive care

    Regression techniques have to be applied to a moving time window

    Robust regression may be used to reduce the effect of outliers

    Need for fast offline and online algorithms
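
    The moving-window idea can be sketched as follows. The sketch uses a naive quadratic-time repeated median slope (the Repeated Median is one of the estimators named later in the talk) and recomputes every window from scratch, which is exactly the cost that fast online algorithms try to avoid. Equally spaced observation times and the function names are assumptions of this example.

```python
from statistics import median


def repeated_median_slope(ys):
    """Naive O(w^2) repeated median slope for observations at times 0..w-1."""
    w = len(ys)
    return median(
        median((ys[j] - ys[i]) / (j - i) for j in range(w) if j != i)
        for i in range(w)
    )


def moving_window_slopes(series, width):
    """Slope estimate for every window of `width` consecutive observations."""
    return [
        repeated_median_slope(series[start:start + width])
        for start in range(len(series) - width + 1)
    ]


# Hypothetical series with trend 2 per time step and one gross outlier.
series = [2.0 * t for t in range(10)]
series[4] = 1000.0
print(moving_window_slopes(series, 5))
```

    With one outlier per window of five points, every window still reports the underlying slope in this example.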


    Using the Toolkits of Computer Science

    Problems from Statistics often need to be reformulated or transformed.

    Definition

    An algorithmic problem consists of a description of the set of allowable inputs and a description of a function that maps each allowable input to a non-empty set of correct outputs (answers, results).

    Computational Geometry is a related field of research

    The geometric flavour of statistics becomes apparent when a sample is regarded as a set of points in Euclidean space.

    Searching nearest neighbours may be reformulated to compute the Hodges-Lehmann estimator and an estimator of scale.

    It is often useful to consider underlying decision problems
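
    For reference, a naive baseline for the Hodges-Lehmann estimator is sketched below. It simply takes the median of all pairwise (Walsh) averages in quadratic time and does not use the nearest-neighbour reformulation mentioned above. Including the pairs with i = j is one common convention and is an assumption here.

```python
from itertools import combinations_with_replacement
from statistics import median


def hodges_lehmann(sample):
    """Median of the Walsh averages (a + b) / 2 over all pairs with i <= j."""
    return median(
        (a + b) / 2 for a, b in combinations_with_replacement(sample, 2)
    )


print(hodges_lehmann([1, 2, 3]))
print(hodges_lehmann([1, 2, 3, 1000]))
```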


    Decision Problems

    Definition

    A decision problem is an algorithmic problem where the set of outputs is restricted to Yes and No.

    Obvious decision problems are:

    Is the optimal value of the objective function better than x?

    Is the local solution y the global solution?


    Problem Transformation: The Power of Geometric Duality

    Search for a point in an arrangement of lines instead of a line through a set of points.

    Map a point (a, b) to the line y = ax + b and the line y = ax + b to the point (a, b).

    Distances and relations are preserved
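
    A small numerical check of what "preserved" means is sketched below. It assumes the common textbook sign convention, point (a, b) to line y = ax - b and line y = mx + c to point (m, -c), which may differ in sign from the convention used on the slide. Under that assumed convention the signed vertical distance between a point and a line equals the signed vertical distance between their duals, so incidence and above/below relations carry over.

```python
def dual_line_of_point(point):
    """Point (a, b) -> dual line y = a*x - b, returned as (slope, intercept)."""
    a, b = point
    return (a, -b)


def dual_point_of_line(line):
    """Line y = m*x + c, given as (m, c) -> dual point (m, -c)."""
    m, c = line
    return (m, -c)


def signed_vertical_distance(point, line):
    """Height of `point` above `line`; negative if the point lies below."""
    px, py = point
    m, c = line
    return py - (m * px + c)


p = (1.0, 3.0)       # hypothetical primal point
ell = (2.0, -1.0)    # hypothetical primal line y = 2x - 1

primal = signed_vertical_distance(p, ell)
dual = signed_vertical_distance(dual_point_of_line(ell), dual_line_of_point(p))
print(primal, dual)
```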

    [Figure: two panels, "primal space" and "dual space", each with axes x and y.]


    Modifications to Geometric Duality

    Adding or subtracting a dimension may lead to a known problem

    Searching for nearest neighbours in the plane reduces to querying extreme points of convex hulls in $\mathbb{R}^3$

    Adding or deleting points may help

    We apply this to a robust regression estimator later on

    Other duality concepts besides point/line duality may be used


    Overview and Newest Result

    Improved static or dynamic algorithms for Repeated Median, Median Absolute Deviation, Least Median of Squares, Least Quartile Difference

    Definition (Croux et al. (1994))

    Consider $n$ points $p_i$ in the plane and let $h = \lfloor (n + 3)/2 \rfloor$. The LQD solution to the regression problem is given by the slope of the line $L$ which minimises the $\binom{h}{2}$-th order statistic of $\{\, |r_i(L) - r_j(L)| \;:\; 1 \le i < j \le n \,\}$.

    [Figure: Example for $r_i(L)$, showing the line $L$, a point $p_i$ and $r_i(L)$.]

    The problem has $O(n^4)$ possible solutions

    Original running time $O(n^5 \log n)$

    We achieve $O(n^2 \log n)$
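
    To make the objective concrete, the following sketch evaluates the LQD criterion for a given slope by brute force. It is a direct transcription of the definition, not one of the fast algorithms from the talk. Two assumptions are made: h is taken as the integer part of (n + 3)/2, and r_i(L) is the usual residual, so the intercept cancels in every difference r_i - r_j and only the slope matters. The candidate slopes in the usage lines are made up.

```python
from itertools import combinations
from math import comb


def lqd_objective(points, slope):
    """C(h, 2)-th smallest value of |r_i - r_j| over all pairs i < j."""
    n = len(points)
    h = (n + 3) // 2          # assumption: integer part of (n + 3) / 2
    k = comb(h, 2)
    # Residuals up to a common intercept, which cancels in the differences.
    residuals = [y - slope * x for x, y in points]
    diffs = sorted(abs(ri - rj) for ri, rj in combinations(residuals, 2))
    return diffs[k - 1]


def lqd_slope_bruteforce(points, candidate_slopes):
    """Pick the candidate slope with the smallest LQD objective."""
    return min(candidate_slopes, key=lambda s: lqd_objective(points, s))


# Four points on y = 2x + 1 plus one outlier.
pts = [(0, 1), (1, 3), (2, 5), (3, 7), (4, 100)]
print(lqd_objective(pts, 2.0), lqd_objective(pts, 0.0))
print(lqd_slope_bruteforce(pts, [0.0, 1.0, 2.0, 3.0]))
```

    Each evaluation costs O(n^2 log n) because all pairwise differences are sorted, so trying every candidate slope this way is far slower than the running times quoted above.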


    Application of Geometric Duality

    We map data values consisting of $n$ points $(x_i, y_i)$ to $2\binom{n}{2}$ lines

    $L^{+}_{i,j} : v = +(x_i - x_j)\,u - (y_i - y_j)$

    $L^{-}_{i,j} : v = -(x_i - x_j)\,u + (y_i - y_j)$.

    [Figure: Example of the modified dual space. Horizontal axis: u (0.6 to 1.4); vertical axis: v.]

    In this arrangement, we search the lowest point with $\binom{n}{2} + \binom{h}{2}$ subjacent or intersecting lines.

    The $u$-coordinate of this point equals the slope of the LQD fit, and its $v$-coordinate $r$ equals the minimised order statistic.
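
    The construction can be checked on a toy data set. The sketch below builds the two lines per pair of points and counts how many lines pass through or below a point (u, v). For v >= 0 the lower line of each pair is never above v, which accounts for the C(n, 2) term; the remaining lines counted are exactly the pairs whose residual difference at slope u is at most v. Function names and data are illustrative assumptions.

```python
from itertools import combinations


def dual_lines(points):
    """Return the 2 * C(n, 2) lines as (slope, intercept) in the (u, v) plane."""
    lines = []
    for (xi, yi), (xj, yj) in combinations(points, 2):
        dx, dy = xi - xj, yi - yj
        lines.append((dx, -dy))    # L+_{i,j}: v = +(x_i - x_j) u - (y_i - y_j)
        lines.append((-dx, dy))    # L-_{i,j}: v = -(x_i - x_j) u + (y_i - y_j)
    return lines


def lines_at_or_below(lines, u, v):
    """Number of lines that pass through or below the point (u, v)."""
    return sum(1 for m, c in lines if m * u + c <= v)


# Four points on y = 2x + 1 plus one outlier: n = 5, h = 4.
pts = [(0, 1), (1, 3), (2, 5), (3, 7), (4, 100)]
lines = dual_lines(pts)
print(len(lines))                          # 2 * C(5, 2) lines
print(lines_at_or_below(lines, 2.0, 0.0))  # count at slope 2, height 0
print(lines_at_or_below(lines, 0.0, 0.0))  # count at slope 0, height 0
```

    At (2, 0) the count reaches C(5, 2) + C(4, 2) = 16, so height 0 is attainable at slope 2; at (0, 0) only the 10 lower lines are counted.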


    Results from Computational Geometry

    The dual problem is equivalent to two problems from Computational Geometry:

    Minimum k-level point

    k-violation linear programming

    Corollary (Cole et al. (1987), Roos and Widmayer (1994), Chan (1999))

    It is possible to compute the LQD estimator for $n$ data points in the plane in expected running time $O(n^2 \log n)$ or deterministic running time $O(n^2 \log^2 n)$.


    One of our Algorithms

    Theoretically superior algorithms are often hard to implement or even impractical

    Our own algorithms achieve slightly inferior theoretical running times

    The framework of the algorithms:

    1 Map the input consisting of $n$ data values to $2\binom{n}{2}$ lines using time $O(n^2)$

    2 Search for the optimal solution with the help of the underlying decision problem

    3 Output the solution


    The Underlying Decision Problem

    We need to decide, for a given height, whether a local solution exists at this height or below.

    [Figure: Example for the Decision Problem. Horizontal axis: u (0.6 to 1.4); vertical axis: v (0 to 0.8).]

    1 Compute all intersections of the lines with this height

    2 Sift through the sorted intersections and update the number of subjacent lines accordingly

    3 If the number equals $\binom{n}{2} + \binom{h}{2}$, decide YES

    Running time: $O(n^2 \log n)$
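
    One way to realise these three steps is sketched below; it is an illustration under stated assumptions, not the authors' implementation. For a height v >= 0, each pair of points contributes the interval of slopes u on which both of its lines are at or below v. Sorting the interval endpoints (the intersections with the horizontal line at height v) and sweeping from left to right keeps track of how many pairs are covered. The sketch answers YES as soon as at least C(h, 2) pairs are covered, takes h as the integer part of (n + 3)/2, and treats pairs with equal x-coordinates separately.

```python
from itertools import combinations
from math import comb


def exists_solution_at_height(points, v):
    """Decide whether some slope u has >= C(h, 2) pairs with |r_i - r_j| <= v."""
    n = len(points)
    h = (n + 3) // 2                  # assumption: integer part of (n + 3) / 2
    needed = comb(h, 2)
    active = 0                        # pairs covering the current slope
    events = []
    for (xi, yi), (xj, yj) in combinations(points, 2):
        dx, dy = xi - xj, yi - yj
        if dx == 0:
            # Horizontal lines: the pair is covered for every u or for none.
            if abs(dy) <= v:
                active += 1
            continue
        left, right = sorted(((dy - v) / dx, (dy + v) / dx))
        events.append((left, 0))      # 0 = interval opens (sorts before closes)
        events.append((right, 1))     # 1 = interval closes
    if active >= needed:
        return True
    events.sort()                     # O(n^2 log n) for O(n^2) intersections
    for _, kind in events:
        if kind == 0:
            active += 1
            if active >= needed:
                return True
        else:
            active -= 1
    return False


collinear_plus_outlier = [(0, 1), (1, 3), (2, 5), (3, 7), (4, 100)]
zigzag = [(0, 0), (1, 1), (2, 0), (3, 1), (4, 0)]
print(exists_solution_at_height(collinear_plus_outlier, 0.0))
print(exists_solution_at_height(zigzag, 0.0))
print(exists_solution_at_height(zigzag, 1.0))
```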


    Randomised Search for the Global Solution

    A lower and an upper bound for the height of the optimal solution are stored while the algorithm runs.

    1 Initialise the search:

    Initialise 0 as the lower bound and find a trivial local solution to initialise the upper bound.

    2 Search for the global solution:

    Calculate the number of intersections that lie between the lower and the upper bound.

    Choose one of these intersections uniformly at random.

    Decide if the height of this intersection becomes the new lower or the new upper bound.

    3 Stopping Criterion:

    Search until no intersection remains between the lower and the upper bound.

    Expected number of times the decision problem has to be solved: $O(\log n)$.
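
    The search scheme can be illustrated in a simplified, generic form. The sketch below assumes that the candidate heights are available as an explicit list and that a monotone decision oracle `decide` is given; both are assumptions of this example. The algorithm in the talk does not materialise the candidates but counts and samples intersections between the two bounds.

```python
import random


def randomised_search(candidates, decide, rng=random):
    """Return (smallest candidate c with decide(c) true, number of oracle calls).

    Assumes decide is monotone (false up to some threshold, true from there on)
    and that the largest candidate is a feasible, trivial solution.
    """
    pool = sorted(set(candidates))
    upper = pool[-1]        # trivial feasible solution
    pool = pool[:-1]        # candidates strictly between the two bounds
    calls = 0
    while pool:
        pivot = rng.choice(pool)          # uniformly at random between bounds
        calls += 1
        if decide(pivot):
            upper = pivot                 # pivot becomes the new upper bound
            pool = [c for c in pool if c < pivot]
        else:
            pool = [c for c in pool if c > pivot]   # new lower bound
    return upper, calls


best, calls = randomised_search(range(1000), lambda c: c >= 137,
                                rng=random.Random(0))
print(best, calls)
```

    Each random pivot discards a constant fraction of the remaining candidates in expectation, which is where the logarithmic number of oracle calls comes from.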


    Calculating the number of intersections efficiently

    [Figure: Example for Intersections, showing lines labelled 1 to 11 crossing between an upper and a lower horizontal line.]

    Calculate the number of intersections:

    1 Label the lines according to their intersection with the upper horizontal line.

    2 Interpret the intersections with the lower horizontal line as a permutation of these labels (e.g. (8, 1, 5, 2, 10, 3, 7, 4, 9, 6, 11)).

    3 Calculate the number of inversions of the permutation, e.g. with merge sort.

    Running time: $O(n^2 \log n)$
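
    Step 3 can be sketched with a standard merge sort that counts inversions while merging. Two lines cross between the horizontal lines exactly when their order differs at the two heights, so each inversion corresponds to one intersection. The function name is an assumption of this example; the permutation is the one given in step 2.

```python
def count_inversions(perm):
    """Return (sorted copy, number of pairs i < j with perm[i] > perm[j])."""
    if len(perm) <= 1:
        return list(perm), 0
    mid = len(perm) // 2
    left, inv_left = count_inversions(perm[:mid])
    right, inv_right = count_inversions(perm[mid:])
    merged = []
    inversions = inv_left + inv_right
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            # right[j] jumps ahead of every element still waiting in `left`.
            merged.append(right[j])
            j += 1
            inversions += len(left) - i
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, inversions


labels = (8, 1, 5, 2, 10, 3, 7, 4, 9, 6, 11)
print(count_inversions(labels)[1])
```

    For this permutation the count is 18, obtained in O(m log m) time for m lines instead of checking all pairs.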


    Summary

    Results from Computational Geometry are applicable to problems from Statistics.

    Solving dual or equivalent problems may lead to superior running times.

    Results from Statistics may also help computer scientists, e.g. in the analysis of running times.


    Thank you!


    Bibliography

    Chan, T. M., 1999. Geometric applications of a randomized optimization technique. Discrete and Computational Geometry 22 (4), 547-567.

    Cole, R., Sharir, M., Yap, C. K., 1987. On k-hulls and related problems. SIAM J. Comput. 16 (1), 61-77.

    Croux, C., Rousseeuw, P. J., Hössjer, O., 1994. Generalized S-estimators. J. Amer. Statist. Assoc. 89, 1271-1281.

    Donoho, D., Huber, P., 1983. The notion of breakdown point. In: Bickel, P., Doksum, K., Hodges, J. J. (Eds.), A Festschrift for Erich L. Lehmann. Wadsworth, pp. 157-184.

    Roos, T., Widmayer, P., 1994. k-violation linear programming. Inf. Process. Lett. 52 (2), 109-114.
