STA141C: Big Data & High Performance Statistical Computing
Lecture 7: Linear Regression, Linear System Solvers
Cho-Jui Hsieh, UC Davis
May 9, 2017
Linear Regression
Regression
Input: training data $x_1, x_2, \dots, x_n \in \mathbb{R}^d$ and corresponding outputs $y_1, y_2, \dots, y_n \in \mathbb{R}$
Training: compute a function $f$ such that $f(x_i) \approx y_i$ for all $i$
Prediction: given a testing sample $\tilde{x}$, predict the output as $f(\tilde{x})$
Examples:
Income, number of children ⇒ consumer spending
Processes, memory ⇒ power consumption
Financial reports ⇒ risk
Atmospheric conditions ⇒ precipitation
Linear Regression
Assume $f(\cdot)$ is a linear function parameterized by $w \in \mathbb{R}^d$:
$$f(x) = w^\top x$$
Training: compute the model $w \in \mathbb{R}^d$ such that $w^\top x_i \approx y_i$ for all $i$
Prediction: given a testing sample $\tilde{x}$, the predicted value is $w^\top \tilde{x}$
How to find $w$?
$$w^* = \arg\min_{w \in \mathbb{R}^d} \sum_{i=1}^n (w^\top x_i - y_i)^2$$
Linear Regression: probability interpretation
Assume the data is generated from the probability model:
$$y_i = w^\top x_i + \varepsilon_i, \quad \varepsilon_i \sim \mathcal{N}(0, 1)$$
Maximum likelihood estimator:
$$\begin{aligned}
w^* &= \arg\max_w \; \log P(y_1, \dots, y_n \mid x_1, \dots, x_n, w) \\
    &= \arg\max_w \; \sum_{i=1}^n \log P(y_i \mid x_i, w) \\
    &= \arg\max_w \; \sum_{i=1}^n \log\Big( \frac{1}{\sqrt{2\pi}} e^{-(w^\top x_i - y_i)^2 / 2} \Big) \\
    &= \arg\max_w \; \sum_{i=1}^n -\frac{1}{2} (w^\top x_i - y_i)^2 + \text{constant} \\
    &= \arg\min_w \; \sum_{i=1}^n (w^\top x_i - y_i)^2
\end{aligned}$$
Linear Regression: written as a matrix form
Linear regression: $w^* = \arg\min_{w \in \mathbb{R}^d} \sum_{i=1}^n (w^\top x_i - y_i)^2$
Matrix form: let $X \in \mathbb{R}^{n \times d}$ be the matrix whose $i$-th row is $x_i^\top$, and let $y = [y_1, \dots, y_n]^\top$; then linear regression can be written as
$$w^* = \arg\min_{w \in \mathbb{R}^d} \|Xw - y\|_2^2$$
Solving Linear Regression
Minimize the sum of squared error J(w)
$$\begin{aligned}
J(w) &= \frac{1}{2} \|Xw - y\|^2 \\
     &= \frac{1}{2} (Xw - y)^\top (Xw - y) \\
     &= \frac{1}{2} w^\top X^\top X w - y^\top X w + \frac{1}{2} y^\top y
\end{aligned}$$
Derivative: $\frac{\partial}{\partial w} J(w) = X^\top X w - X^\top y$
Setting the derivative equal to zero gives the normal equation
$$X^\top X w^* = X^\top y$$
Therefore, $w^* = (X^\top X)^{-1} X^\top y$,
but $X^\top X$ may be non-invertible...
Linear System Solver
Linear System: given $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$, solve the system (find an $x \in \mathbb{R}^n$ such that)
$$Ax = b$$
Three cases:
$A$ is invertible ($m = n$ and full rank) ⇒ unique solution $x = A^{-1} b$
Under-determined system: $\operatorname{rank}(A) = m$ but $n > m$ ⇒ multiple solutions; we usually want the "least norm" solution
$$x^* = \arg\min_x \|x\|_2^2 \quad \text{s.t.} \quad Ax = b$$
Over-determined system: $\operatorname{rank}(A) < m$ ⇒ (usually) no solution ⇒ output $x^* = \arg\min_x \|Ax - b\|_2$
Do not compute the inverse of a matrix! It causes numerical problems and is time consuming.
Linear System Solver
First case: $A \in \mathbb{R}^{m \times m}$ is invertible (full rank)
Use “Gaussian elimination” or equivalently “LU factorization”
Call “numpy.linalg.solve”
>>> import numpy as np
>>> a = np.array([[3, 1], [1, 2]])
>>> b = np.array([9,8])
>>> x = np.linalg.solve(a, b)
>>> x
array([ 2., 3.])
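As an aside, a quick hedged illustration (not from the slides) of the earlier warning against matrix inverses: a factorization-based solve typically gives a smaller residual than forming the explicit inverse, and avoids the extra work of computing it.

# Sketch: compare an LU-based solve against an explicit inverse.
import numpy as np

A = np.random.rand(1000, 1000)
b = np.random.rand(1000)

x_solve = np.linalg.solve(A, b)   # LU factorization under the hood
x_inv = np.linalg.inv(A) @ b      # explicit inverse -- avoid this

print(np.linalg.norm(A @ x_solve - b))  # residual of the solve
print(np.linalg.norm(A @ x_inv - b))    # usually (slightly) larger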
Linear System Solver
Under-determined and over-determined systems can be solved by SVD:
$$Ax = b$$
$$U \Sigma V^\top x = b$$
$$x = V \Sigma^\dagger U^\top b \quad \text{(pseudo-inverse)}$$
$$\Sigma^\dagger = \operatorname{diag}(\sigma_1^{-1}, \sigma_2^{-1}, \dots, \sigma_k^{-1}, 0, \dots, 0)$$
(assuming $A$ has $k$ nonzero singular values)
Call “numpy.linalg.lstsq”
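A minimal sketch of this SVD route in code, assuming a helper name svd_solve and a tolerance of our own choosing (numpy.linalg.lstsq does essentially this internally):

import numpy as np

def svd_solve(A, b, tol=1e-12):
    # Thin SVD: A = U @ diag(s) @ Vt
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Invert only the (numerically) nonzero singular values.
    s_inv = np.where(s > tol * s.max(), 1.0 / s, 0.0)
    # x = V @ Sigma^dagger @ U^T @ b
    return Vt.T @ (s_inv * (U.T @ b))

A = np.random.rand(5, 10)   # under-determined: 5 equations, 10 unknowns
b = np.random.rand(5)
x = svd_solve(A, b)
print(np.linalg.norm(A @ x - b))   # essentially zero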
Linear System Solver
>>> import numpy
>>> a = numpy.random.rand(5,10)
>>> b = numpy.random.rand(5,1)
>>> x = numpy.linalg.lstsq(a,b)
>>> numpy.linalg.norm(a.dot(x[0]) - b)
7.4320704251928296e-16
>>> a = numpy.random.rand(5,3)
>>> b = numpy.random.rand(5,1)
>>> x = numpy.linalg.lstsq(a,b)
>>> numpy.linalg.norm(a.dot(x[0]) - b)
0.35374284817556079
Solve multiple linear systems
Often we need to solve many systems with the same matrix:
$$A x_i = b_i \quad \text{for } i = 1, \dots, N$$
They can all be solved together, so only one SVD (or other decomposition) is needed
>>> a = numpy.random.rand(5,3)
>>> b = numpy.random.rand(5,4)
>>> x = numpy.linalg.lstsq(a,b)
>>> x[0] ## solutions for the 4 linear systems (one per column of b)
array([[ 0.15914526, 0.44365737, 0.31351924, 0.3476335 ],
[ 0.30223114, 0.54325633, 0.22719821, 1.05852352],
[ 0.07991735, 0.09856708, 0.08663738, -0.29111466]])
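The same factor-once idea applies in the square invertible case; here is a hedged sketch using scipy.linalg.lu_factor (assumptions: A is square and well conditioned):

import numpy as np
from scipy.linalg import lu_factor, lu_solve

A = np.random.rand(5, 5)    # square, (almost surely) invertible
B = np.random.rand(5, 4)    # 4 right-hand sides as columns

lu, piv = lu_factor(A)      # one O(m^3) factorization
X = lu_solve((lu, piv), B)  # each subsequent solve is only O(m^2)
print(np.linalg.norm(A @ X - B))   # essentially zero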
Solving Linear Regression
Normal equation: $X^\top X w^* = X^\top y$
If $X^\top X$ is invertible (typically when # samples > # features):
$$w^* = (X^\top X)^{-1} X^\top y$$
If $X^\top X$ is low-rank (typically when # features > # samples): infinitely many solutions
In general, just use “numpy.linalg.lstsq”
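A small sanity check (names here are illustrative, not from the slides): on a tall dense problem, the normal-equation solution and numpy.linalg.lstsq agree up to roundoff.

import numpy as np

n, d = 100, 5
X = np.random.rand(n, d)
y = np.random.rand(n)

# Normal equation: solve (X^T X) w = X^T y, assuming X^T X is invertible.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w_normal, w_lstsq))   # True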
Regularized Linear Regression
Overfitting
Overfitting: the model has low training error but high prediction error.
Using too many features can lead to overfitting
Regularization to Avoid Overfitting
Enforce the solution to have low L2-norm:
$$\arg\min_w \sum_{i=1}^n (w^\top x_i - y_i)^2 \quad \text{s.t.} \quad \|w\|^2 \le K$$
Equivalent to the following problem for some $\lambda$:
$$\arg\min_w \sum_{i=1}^n (w^\top x_i - y_i)^2 + \lambda \|w\|^2$$
Regularized Linear Regression
Regularized Linear Regression:
$$\arg\min_w \; \|Xw - y\|^2 + R(w)$$
$R(w)$: the regularization term
Ridge Regression ($\ell_2$ regularization):
$$\arg\min_w \; \|Xw - y\|^2 + \lambda \|w\|^2$$
Lasso ($\ell_1$ regularization):
$$\arg\min_w \; \|Xw - y\|^2 + \lambda \|w\|_1$$
Note that $\|w\|_1 = \sum_{i=1}^d |w_i|$
Regularization
Lasso: the solution is sparse, but it has no closed-form solution (a brief illustration follows)
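A hedged illustration of the sparsity claim, using scikit-learn's Lasso (a library not otherwise used in these slides; the data is synthetic):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.random((100, 20))
# Only the first two features actually matter.
y = X[:, 0] - 2 * X[:, 1] + 0.01 * rng.standard_normal(100)

w = Lasso(alpha=0.1).fit(X, y).coef_
print(np.sum(w != 0), "nonzero coefficients out of", w.size)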
Ridge Regression
Ridge regression:
$$\arg\min_{w \in \mathbb{R}^d} \underbrace{\frac{1}{2} \|Xw - y\|^2 + \frac{\lambda}{2} \|w\|^2}_{J(w)}$$
Closed form solution: the optimal solution $w^*$ satisfies $\nabla J(w^*) = 0$:
$$X^\top X w^* - X^\top y + \lambda w^* = 0$$
$$(X^\top X + \lambda I) w^* = X^\top y$$
Optimal solution: $w^* = (X^\top X + \lambda I)^{-1} X^\top y$
Inverse always exists because $X^\top X + \lambda I$ is positive definite
What’s the computational complexity?
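A minimal sketch of this closed form (the helper name ridge_closed_form and the λ value are our own), solving the linear system instead of forming an explicit inverse, per the earlier warning:

import numpy as np

def ridge_closed_form(X, y, lam):
    d = X.shape[1]
    # X^T X + lam*I is positive definite for lam > 0, so this solve is safe.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

X = np.random.rand(50, 10)
y = np.random.rand(50)
w = ridge_closed_form(X, y, lam=0.1)
print(w.shape)   # (10,)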
Time Complexity
When X is dense:
Closed form solution requires $O(nd^2 + d^3)$ time if X is dense
Efficient if d is very small
Runs forever when d > 100,000
Typical case for big data applications:
$X \in \mathbb{R}^{n \times d}$ is sparse with large n and large d
How can we solve the problem?
Coming up
Optimization
Questions?