-
8/6/2019 Fast Algorithms for Robust Regression
1/18
Fast Algorithms for Robust Regression
Thorsten Bernholt Robin Nunkesser
Department of Computer Science, University of Dortmund
Statistical Computing 2006
Thorsten Bernholt, Robin Nunkesser Fast Algorithms for Robust Regression Statistical Computing 1 / 17
-
Outline
1 Introduction
  Robust Regression
  Time Series Analysis in Intensive Care
2 Statistics and Computer Science
  Using the Toolkits of Computer Science
  Problem transformation
3 Example
-
Robust Regression
Definition (Donoho and Huber (1983))
The (finite sample) breakdown point is the smallest fraction of data points that need to be changed to have an unbounded effect on the estimate.
[Figure: Number of international phone calls originated in Belgium. x-axis: Year (1950 to 1970); y-axis: Number of calls (in millions); legend: LQD, LS.]
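The definition can be tried out on a toy case. The following is a minimal sketch on location estimators (mean and median) rather than on the LS and LQD regression lines of the figure; the sample values and the helper `contaminate` are made up for illustration.

```python
from statistics import mean, median

def contaminate(sample, k, value):
    """Return a copy of sample with its first k entries replaced by value."""
    return [value] * k + list(sample[k:])

# Hypothetical data, chosen only for illustration.
sample = [9.0, 10.0, 10.5, 11.0, 12.0, 9.5, 10.2]

# Changing one point out of seven is enough to move the mean without bound,
# while the median stays inside the range of the clean data.
for bad in (1e3, 1e6, 1e9):
    corrupted = contaminate(sample, 1, bad)
    print(bad, mean(corrupted), median(corrupted))

# The median only follows the outliers once more than half the points change.
print(median(contaminate(sample, 3, 1e9)), median(contaminate(sample, 4, 1e9)))
```

In this toy setting the mean breaks down at a fraction of 1/n, the median only at roughly one half.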
-
Time Series Analysis
[Figure: Heart rate of a patient in intensive care. x-axis: time; y-axis: heart rate.]
Time series data is monitored online, e.g. in intensive care
Regression techniques have to be applied to a moving time window
Robust regression may be used to reduce the effect of outliers
Need for fast offline and online algorithms
-
Using the Toolkits of Computer Science
Problems from Statistics often need to be reformulated or transformed.
Definition
An algorithmic problem consists of a description of the set of allowable inputs and a description of a function that maps each allowable input to a non-empty set of correct outputs (answers, results).
Computational Geometry is a related field of research
The geometric flavour of statistics becomes apparent when
a sample is regarded as a set of points in Euclidean space.
Searching nearest neighbours may be reformulated to compute the Hodges-Lehmann estimator and an estimator of scale.
It is often useful to consider underlying decision problems
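For reference, the Hodges-Lehmann estimator mentioned above can be written down directly from its usual definition as the median of all pairwise averages. This is a brute-force sketch with quadratic memory, not the geometric reformulation the slide alludes to; including the pairs with i = j is one common convention and is an assumption here.

```python
from itertools import combinations_with_replacement
from statistics import median

def hodges_lehmann(sample):
    """Median of the Walsh averages (x_i + x_j) / 2 over all pairs i <= j."""
    walsh = [(a + b) / 2 for a, b in combinations_with_replacement(sample, 2)]
    return median(walsh)

print(hodges_lehmann([1, 2, 3]))        # clean sample
print(hodges_lehmann([1, 2, 3, 100]))   # one gross outlier added
```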
-
Decision Problems
Definition
A decision problem is an algorithmic problem where the set of outputs is restricted to Yes and No.
Obvious decision problems are:
Is the optimal value of the objective function better than x?
Is the local solution y the global solution?
-
Problem Transformation: The Power of Geometric Duality
Search for a point in an arrangement of lines instead of a line through a set
of points.
Map a point (a, b) to the line y = ax + b and the line y = ax + b to the point (a, b).
Distances and relations are preserved
[Figure: two panels, "primal space" and "dual space", each with axes x and y.]
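The preservation property can be checked numerically. The sketch below assumes the common textbook sign convention (point (a, b) to line y = ax - b, line y = mx + c to point (m, -c)), which may differ in sign from the convention used on the slide; the function names are hypothetical.

```python
def dual_line_of_point(point):
    """Point (a, b) -> line y = a*x - b, returned as (slope, intercept)."""
    a, b = point
    return (a, -b)

def dual_point_of_line(line):
    """Line y = m*x + c, given as (m, c) -> point (m, -c)."""
    m, c = line
    return (m, -c)

def height_above(point, line):
    """Signed vertical distance of point above line (positive: point is above)."""
    x, y = point
    m, c = line
    return y - (m * x + c)

p = (2, 5)       # a point in primal space
line = (1, 1)    # the line y = x + 1 in primal space

# The point lies 2 above the line; in dual space the dual point of the line
# lies 2 above the dual line of the point.
print(height_above(p, line))
print(height_above(dual_point_of_line(line), dual_line_of_point(p)))
```

Under this convention, incidence (distance 0) and the above/below relation carry over to dual space with the roles of points and lines exchanged.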
-
Modifications to Geometric Duality
Adding or subtracting a dimension may lead to a known problem
Searching for nearest neighbours in the plane reduces to querying
extreme points of convex hulls in $\mathbb{R}^3$
Adding or deleting points may help
We apply this to a robust regression estimator later on
Other duality concepts besides point/line duality may be used
-
Overview and Newest Result
Improved static or dynamic algorithms for Repeated Median, Median
Absolute Deviation, Least Median of Squares, Least Quartile Difference
Definition (Croux et al. (1994))
Consider $n$ points $p_i$ in the plane and let $h = \lfloor (n+3)/2 \rfloor$. The LQD solution to the regression problem is given by the slope of the line $L$ which minimises the $\binom{h}{2}$-th order statistic of $\{\, |r_i(L) - r_j(L)| \mid 1 \le i < j \le n \,\}$.
[Figure: Example for $r_i(L)$, showing the line $L$, a point $p_i$ and $r_i(L)$.]
The problem has $O(n^4)$ possible solutions
Original running time $O(n^5 \log n)$
We achieve $O(n^2 \log n)$
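To make the objective concrete, here is a brute-force sketch that evaluates the LQD criterion for one given slope. It assumes that $h$ is rounded down to an integer and that $r_i(L)$ is the vertical residual, so the intercept cancels in $r_i - r_j$; `lqd_objective` is a hypothetical name, and this is not the fast algorithm of the talk.

```python
from itertools import combinations
from math import comb

def lqd_objective(points, slope):
    """C(h, 2)-th smallest |r_i - r_j| for a line with the given slope."""
    n = len(points)
    h = (n + 3) // 2          # assumption: h is rounded down
    k = comb(h, 2)
    # r_i - r_j = (y_i - y_j) - slope * (x_i - x_j); the intercept drops out.
    diffs = sorted(abs((yi - yj) - slope * (xi - xj))
                   for (xi, yi), (xj, yj) in combinations(points, 2))
    return diffs[k - 1]

# Three collinear points on y = x plus one outlier (illustrative data).
points = [(0, 0), (1, 1), (2, 2), (3, 0)]
print(lqd_objective(points, 1))   # slope of the collinear majority
print(lqd_objective(points, 0))   # a competing slope
```

Evaluating one slope this way already costs $O(n^2 \log n)$, which shows why trying all candidate solutions one by one is expensive.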
-
Application of Geometric Duality
We map data values consisting of $n$ points $(x_i, y_i)$ to $2\binom{n}{2}$ lines
$L^+_{i,j}: v = +(x_i - x_j)u - (y_i - y_j)$
$L^-_{i,j}: v = -(x_i - x_j)u + (y_i - y_j)$.
[Figure: Example of the modified dual space. Axes: u and v.]
In this arrangement, we search the lowest point $(\beta, r)$ with $\binom{n}{2} + \binom{h}{2}$ subjacent or intersecting lines.
$\beta$ equals the slope of the LQD fit,
$r$ equals the minimised order statistic.
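The mapping and the counting condition can be sketched as follows. For each pair of points the two lines are mirror images, so at abscissa u the upper one has height |r_i - r_j| for slope u and the lower one is never above 0. The helper names and the data are assumptions for illustration, and $h$ is assumed to be rounded down.

```python
from itertools import combinations
from math import comb

def dual_lines(points):
    """Return the 2 * C(n, 2) lines as (slope, intercept) pairs in the (u, v) plane."""
    lines = []
    for (xi, yi), (xj, yj) in combinations(points, 2):
        dx, dy = xi - xj, yi - yj
        lines.append((dx, -dy))    # L+: v = +(x_i - x_j) u - (y_i - y_j)
        lines.append((-dx, dy))    # L-: v = -(x_i - x_j) u + (y_i - y_j)
    return lines

def lines_at_or_below(lines, u, v):
    """Number of lines passing through or below the point (u, v)."""
    return sum(1 for a, c in lines if a * u + c <= v)

points = [(0, 0), (1, 1), (2, 2), (3, 0)]   # illustrative data
lines = dual_lines(points)
n = len(points)
h = (n + 3) // 2                            # assumption: h is rounded down
target = comb(n, 2) + comb(h, 2)

# Three of the points lie on y = x, so slope 1 reaches the target at height 0.
print(len(lines), target, lines_at_or_below(lines, 1, 0))
```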
-
Results from Computational Geometry
The dual problem is equivalent to two problems from Computational Geometry:
Minimum k-level point
k-violation linear programming
Corollary (Cole et al. (1987), Roos and Widmayer (1994), Chan (1999))
It is possible to compute the LQD estimator for $n$ data points in the plane in expected running time $O(n^2 \log n)$ or deterministic running time $O(n^2 \log^2 n)$.
-
One of our Algorithms
Theoretically superior algorithms are often hard to implement or even impractical
Our own algorithms achieve slightly inferior theoretical running times
The framework of the algorithms:
1 Map the input consisting of $n$ data values to $2\binom{n}{2}$ lines using time $O(n^2)$
2 Search for the optimal solution with the help of the underlying decision problem
3 Output the solution
-
The Underlying Decision Problem
For a given height, we need to decide whether a local solution exists at or below this height.
[Figure: Example for the Decision Problem. Axes: u and v.]
1 Compute all intersections of the lines with this height
2 Sift through the sorted intersections and update the number of subjacent lines accordingly
3 If the number equals $\binom{n}{2} + \binom{h}{2}$, decide YES
Running time: $O(n^2 \log n)$
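The three steps above can be sketched as a sweep along the horizontal line at the given height. This is a minimal sketch, not the authors' implementation: lines are assumed to be given as (slope, intercept) pairs, `exists_point_at_height` is a hypothetical name, and the test "at least k" is used instead of "equals" so that several lines meeting in one point are handled.

```python
from fractions import Fraction

def exists_point_at_height(lines, height, k):
    """Decide whether some point at the given height has at least k lines
    passing through or below it."""
    # Far to the left, lines with positive slope lie below the height;
    # horizontal lines lie below it iff their intercept does.
    count = sum(1 for a, c in lines if a > 0 or (a == 0 and c <= height))
    if count >= k:
        return True
    events = []
    for a, c in lines:
        if a != 0:
            u = Fraction(height - c) / Fraction(a)   # intersection with the height
            # kind 0: a falling line arrives at the height (starts counting);
            # kind 1: a rising line leaves it (stops counting afterwards).
            events.append((u, 0 if a < 0 else 1))
    events.sort()   # arrivals sort before departures at the same abscissa
    for _, kind in events:
        if kind == 0:
            count += 1
            if count >= k:
                return True
        else:
            count -= 1
    return False

# Two lines v = u and v = -u: the lowest point with both at or below is (0, 0).
cross = [(1, 0), (-1, 0)]
print(exists_point_at_height(cross, 0, 2), exists_point_at_height(cross, -1, 2))
```

Sorting dominates the cost, so for $2\binom{n}{2}$ lines the sketch runs in $O(n^2 \log n)$, matching the bound on the slide.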
-
Randomised Search for the Global Solution
A lower and an upper bound for the height of the optimal solution are stored while the algorithm runs.
1 Initialise the search:
Initialise 0 as the lower bound and find a trivial local solution to
initialise the upper bound.
2 Search for the global solution:
Calculate the number of intersections that lie between the lower and
the upper bound.
Choose one of these intersections uniformly at random.
Decide if the height of this intersection becomes the new lower or the
new upper bound.
3 Stopping Criterion:
Search until no intersection remains between the lower and the upper
bound.
Expected number of times the decision problem has to be solved: $O(\log n)$.
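The search loop can be sketched generically. This simplified version lists the candidate heights explicitly and takes the decision procedure as a callback, whereas the algorithm on the slide counts and samples the intersections between the bounds without enumerating them; `randomized_search`, its parameters and the toy predicate are assumptions for illustration.

```python
import random

def randomized_search(candidates, decide, lower, upper, rng=random):
    """Shrink (lower, upper) until no candidate lies strictly in between.

    Assumes decide is monotone (False below the optimum, True from it on),
    decide(upper) is True and decide(lower) is False.
    Returns the final upper bound and the number of calls to decide.
    """
    calls = 0
    while True:
        between = [c for c in candidates if lower < c < upper]
        if not between:
            return upper, calls
        pivot = rng.choice(between)   # uniformly at random
        calls += 1
        if decide(pivot):
            upper = pivot             # a solution exists at or below pivot
        else:
            lower = pivot             # the optimum lies above pivot
        # each round discards, in expectation, a constant fraction of candidates

# Toy instance: the optimum is the smallest candidate h with h >= 37.
best, calls = randomized_search(range(1, 1001), lambda h: h >= 37, 0, 1000,
                                rng=random.Random(0))
print(best, calls)
```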
-
Calculating the number of intersections efficiently
[Figure: Example for Intersections (lines labelled 1 to 11).]
Calculate the number of intersections
1 Label the lines according to their intersection with the upper horizontal line.
2 Interpret the intersections with the lower horizontal line as a permutation of these labels (e.g. (8, 1, 5, 2, 10, 3, 7, 4, 9, 6, 11)).
3 Calculate the number of inversions of the permutation, e.g. with merge sort.
Running time: $O(n^2 \log n)$
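Step 3 is the standard inversion count by merge sort. The sketch below applies it to the example permutation from the slide; `count_inversions` is a hypothetical name.

```python
def count_inversions(perm):
    """Return (sorted copy, number of pairs i < j with perm[i] > perm[j])."""
    if len(perm) <= 1:
        return list(perm), 0
    mid = len(perm) // 2
    left, inv_left = count_inversions(perm[:mid])
    right, inv_right = count_inversions(perm[mid:])
    merged, inversions = [], inv_left + inv_right
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            # right[j] jumps ahead of every element still waiting in left
            merged.append(right[j])
            j += 1
            inversions += len(left) - i
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, inversions

labels = (8, 1, 5, 2, 10, 3, 7, 4, 9, 6, 11)
_, crossings = count_inversions(labels)
print(crossings)
```

Two lines swap their order between the two horizontal lines exactly when they cross in between, so the inversion count equals the number of intersections in the strip. For N lines this takes $O(N \log N)$ time.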
-
Summary
Results from Computational Geometry are applicable to problems from Statistics.
Solving dual or equivalent problems may lead to superior running times.
Results from Statistics may also help computer scientists, e.g. in the analysis of running times.
-
Thank you!
-
Bibliography
Chan, T. M., 1999. Geometric applications of a randomized optimization technique. Discrete and Computational Geometry 22 (4), 547-567.
Cole, R., Sharir, M., Yap, C. K., 1987. On k-hulls and related problems. SIAM J. Comput. 16 (1), 61-77.
Croux, C., Rousseeuw, P. J., Hössjer, O., 1994. Generalized S-estimators. J. Amer. Statist. Assoc. 89, 1271-1281.
Donoho, D., Huber, P., 1983. The notion of breakdown point. In: Bickel, P., Doksum, K., Hodges, J. J. (Eds.), A Festschrift for Erich L. Lehmann. Wadsworth, pp. 157-184.
Roos, T., Widmayer, P., 1994. k-violation linear programming. Inf. Process. Lett. 52 (2), 109-114.