EE270 Large Scale Matrix Computation, Optimization and Learning
TRANSCRIPT
EE270: Large Scale Matrix Computation, Optimization and Learning
Instructor: Mert Pilanci
Stanford University
Thursday, Jan 14, 2020
Randomized Linear Algebra
Lecture 3: Applications of AMM, Error Analysis, Trace Estimation and Bootstrap
Approximate Matrix Multiplication

Algorithm 1: Approximate Matrix Multiplication via Sampling
Input: an $n \times d$ matrix $A$, a $d \times p$ matrix $B$, an integer $m$, and probabilities $\{p_k\}_{k=1}^d$
Output: matrices $C$, $R$ such that $CR \approx AB$
1: for $t = 1$ to $m$ do
2:   pick $i_t \in \{1, \dots, d\}$ with probability $P[i_t = k] = p_k$, i.i.d. with replacement
3:   set $C^{(t)} = \frac{1}{\sqrt{m p_{i_t}}} A^{(i_t)}$ and $R_{(t)} = \frac{1}{\sqrt{m p_{i_t}}} B_{(i_t)}$
4: end for

- We can multiply $C$ and $R$ using the classical algorithm
- Complexity: $O(nmp)$
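Algorithm 1 can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the lecture; the helper name `approx_matmul` and the small test matrices are hypothetical.

```python
import numpy as np

def approx_matmul(A, B, m, p, rng=None):
    """Approximate A @ B by sampling m column/row pairs (Algorithm 1).

    p[k] is the probability of picking column A[:, k] and row B[k, :].
    """
    rng = np.random.default_rng(rng)
    d = A.shape[1]
    # Step 2: i.i.d. sampling with replacement, P[i_t = k] = p[k]
    idx = rng.choice(d, size=m, p=p)
    # Step 3: scale each sampled column/row by 1 / sqrt(m * p_{i_t})
    scale = 1.0 / np.sqrt(m * p[idx])
    C = A[:, idx] * scale            # n x m
    R = B[idx, :] * scale[:, None]   # m x p
    return C, R

A = np.arange(12.0).reshape(3, 4)
B = np.arange(8.0).reshape(4, 2)
p = np.full(4, 0.25)                 # uniform sampling for illustration
C, R = approx_matmul(A, B, m=200, p=p, rng=0)
print(np.linalg.norm(A @ B - C @ R))
```

The error shrinks as $m$ grows, at the $O(1/\sqrt{m})$ rate quantified on the next slides.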
AMM mean and variance

$$AB \approx CR = \frac{1}{m} \sum_{t=1}^{m} \frac{1}{p_{i_t}} A^{(i_t)} B_{(i_t)}$$

- Mean and variance of the matrix multiplication estimator

Lemma
- $E[(CR)_{ij}] = (AB)_{ij}$
- $\mathrm{Var}[(CR)_{ij}] = \frac{1}{m} \sum_{k=1}^{d} \frac{A_{ik}^2 B_{kj}^2}{p_k} - \frac{1}{m} (AB)_{ij}^2$
- $E\|AB - CR\|_F^2 = \sum_{ij} E(AB - CR)_{ij}^2 = \sum_{ij} \mathrm{Var}[(CR)_{ij}]$
  $= \frac{1}{m} \sum_{k=1}^{d} \frac{\left(\sum_i A_{ik}^2\right)\left(\sum_j B_{kj}^2\right)}{p_k} - \frac{1}{m}\|AB\|_F^2$
  $= \frac{1}{m} \sum_{k=1}^{d} \frac{\|A^{(k)}\|_2^2 \|B_{(k)}\|_2^2}{p_k} - \frac{1}{m}\|AB\|_F^2$
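The lemma can be checked numerically: averaging many independent $CR$ estimates should recover $AB$, and the empirical Frobenius error should match the formula above. A hypothetical sketch with illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 6))
B = rng.standard_normal((6, 3))
d, m, trials = 6, 4, 20000
p = np.full(d, 1.0 / d)              # uniform sampling probabilities

errs = []
acc = np.zeros((5, 3))
for _ in range(trials):
    idx = rng.choice(d, size=m, p=p)
    scale = 1.0 / np.sqrt(m * p[idx])
    CR = (A[:, idx] * scale) @ (B[idx, :] * scale[:, None])
    acc += CR
    errs.append(np.linalg.norm(A @ B - CR) ** 2)

# Predicted E||AB - CR||_F^2 from the lemma
colA = np.sum(A**2, axis=0)          # ||A^(k)||_2^2 for each column k
rowB = np.sum(B**2, axis=1)          # ||B_(k)||_2^2 for each row k
pred = (np.sum(colA * rowB / p) - np.linalg.norm(A @ B) ** 2) / m

print("deviation of mean from AB:", np.linalg.norm(acc / trials - A @ B))
print("empirical vs predicted error:", np.mean(errs), pred)
```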
Optimal sampling probabilities

- Nonuniform sampling
  $$p_k = \frac{\|A^{(k)}\|_2 \|B_{(k)}\|_2}{\sum_{k'=1}^{d} \|A^{(k')}\|_2 \|B_{(k')}\|_2}$$
- minimizes $E\|AB - CR\|_F$
- $E\|AB - CR\|_F^2 = \frac{1}{m} \sum_{k=1}^{d} \frac{\|A^{(k)}\|_2^2 \|B_{(k)}\|_2^2}{p_k} - \frac{1}{m}\|AB\|_F^2$
  $= \frac{1}{m} \left( \sum_{k=1}^{d} \|A^{(k)}\|_2 \|B_{(k)}\|_2 \right)^2 - \frac{1}{m}\|AB\|_F^2$
  is the optimal error
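The gap between uniform and optimal sampling is easy to see by plugging both choices of $p$ into the error formula. A hypothetical sketch (matrices with deliberately uneven column norms):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20)) * np.logspace(0, 2, 20)  # uneven column norms
B = rng.standard_normal((20, 30))

colA = np.linalg.norm(A, axis=0)   # ||A^(k)||_2
rowB = np.linalg.norm(B, axis=1)   # ||B_(k)||_2
m = 10

def expected_err(p):
    # E||AB - CR||_F^2 from the lemma, for sampling probabilities p
    return (np.sum((colA * rowB) ** 2 / p) - np.linalg.norm(A @ B) ** 2) / m

p_uniform = np.full(20, 1.0 / 20)
p_opt = colA * rowB / np.sum(colA * rowB)

# Closed form for the optimal error agrees with the general formula at p_opt
opt_closed = (np.sum(colA * rowB) ** 2 - np.linalg.norm(A @ B) ** 2) / m
print(expected_err(p_uniform), expected_err(p_opt), opt_closed)
```

With these skewed column norms the $\ell_2$-norm probabilities give a much smaller predicted error than uniform sampling.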
Final Probability Bound for $\ell_2$-norm sampling

- For any $\delta > 0$, set $m = \frac{1}{\delta \varepsilon^2}$ to obtain
  $$P\left[\|AB - CR\|_F > \varepsilon \|A\|_F \|B\|_F\right] \le \delta \qquad (1)$$
- i.e., $\|AB - CR\|_F \le \varepsilon \|A\|_F \|B\|_F$ with probability at least $1 - \delta$
- note that $m$ is independent of the dimensions $n$, $d$, $p$
Numerical simulations for AMM

- Approximating $A^T A$ on a subset of the CIFAR dataset

[Figure: error in spectral norm, normalized by $\|A\|_F$, versus $q$, the number of row samples; curves for uniform and $\ell_2$-norm sampling]
Numerical simulations for AMM

- Approximating $A^T A$ on a sparse matrix from a computational fluid dynamics model

[Figure: error in spectral norm, normalized by $\|A\|_F$, versus $q$, the number of row samples; curves for uniform and $\ell_2$-norm sampling]
SuiteSparse Matrix Collection: https://sparse.tamu.edu
Sampling with replacement vs without replacement
SuiteSparse Matrix Collection: https://sparse.tamu.edu
Plancher et al., Application of Approximate Matrix Multiplication to Neural Networks and Distributed SLAM, 2019.
Applications of Approximate Matrix Multiplication
- Simultaneous Localization and Mapping (SLAM)
Plancher et al., Application of Approximate Matrix Multiplication to Neural Networks and Distributed SLAM, 2019.
Neural Networks

- Given an image $x$
- Classify into $M$ classes
- Neural network $f(x) = W_L\, s(\cdots s(W_2\, s(W_1 x)))$, where $s$ is an elementwise nonlinearity
- $W_1, \dots, W_L$ are trained weight matrices
LeCun et al. (1998)
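The composition $f(x) = W_L s(\cdots s(W_2 s(W_1 x)))$ is just a chain of matrix-vector products interleaved with a nonlinearity, which is what makes AMM relevant here. A minimal forward-pass sketch (random, untrained weights; ReLU chosen as $s$ for illustration):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, weights):
    h = x
    for W in weights[:-1]:
        h = relu(W @ h)        # hidden layers: matvec followed by nonlinearity
    return weights[-1] @ h     # final layer: linear scores for M classes

rng = np.random.default_rng(0)
# Layer shapes are illustrative: a flattened 28x28 image down to M = 10 classes
weights = [rng.standard_normal(s) for s in [(64, 784), (32, 64), (10, 32)]]
x = rng.standard_normal(784)
scores = forward(x, weights)
print(scores.shape)            # one score per class
```

Each `W @ h` is a candidate for randomized approximation when the layers are large.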
AMM for neural networks
Plancher et al., Application of Approximate Matrix Multiplication to Neural Networks and Distributed SLAM, 2019.
Probing the actual error

- $AB \approx CR$
- $\Delta \triangleq AB - CR$
- How large is the error $\|\Delta\|_F$?
- $\|\Delta\|_F^2 = \mathrm{tr}\left(\Delta^T \Delta\right)$
- trace of a matrix $B$: $\mathrm{tr}(B) \triangleq \sum_i B_{ii}$
- trace estimation
Trace estimation

- Let $B$ be an $n \times n$ symmetric matrix
- Let $u_1, \dots, u_n$ be $n$ i.i.d. samples of a random variable $U$ with mean zero and variance $\sigma^2$, and let $u = [u_1, \dots, u_n]^T$
- Lemma
  $$E[u^T B u] = \sigma^2 \mathrm{tr}(B)$$
  $$\mathrm{Var}[u^T B u] = 2\sigma^4 \sum_{i \ne j} B_{ij}^2 + \left(E[U^4] - \sigma^4\right) \sum_i B_{ii}^2$$
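The mean identity is easy to verify by Monte Carlo for any zero-mean $U$; here is a hypothetical check with Gaussian entries of variance $\sigma^2$ (dimensions and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 30, 2.0
M = rng.standard_normal((n, n))
B = (M + M.T) / 2                      # symmetric test matrix

trials = 20000
vals = []
for _ in range(trials):
    u = sigma * rng.standard_normal(n)  # i.i.d. entries, mean 0, variance sigma^2
    vals.append(u @ B @ u)

# Average of u^T B u should approach sigma^2 * tr(B)
print(np.mean(vals), sigma**2 * np.trace(B))
```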
Trace estimation: optimal sampling distribution

- Let $B$ be an $n \times n$ symmetric matrix
- Let $u_1, \dots, u_n$ be $n$ i.i.d. samples of a random variable $U$ with mean zero and variance $\sigma^2$
  $$E[u^T B u] = \sigma^2 \mathrm{tr}(B)$$
  $$\mathrm{Var}[u^T B u] = 2\sigma^4 \sum_{i \ne j} B_{ij}^2 + \left(E[U^4] - \sigma^4\right) \sum_i B_{ii}^2$$
- minimum variance unbiased estimator:
  $$\min_{p(U)} \mathrm{Var}[u^T B u] \quad \text{subject to} \quad E[u^T B u] = \mathrm{tr}(B)$$
- $\mathrm{Var}(U^2) = E[U^4] - \sigma^4 \ge 0$
- minimized when $\mathrm{Var}(U^2) = 0$
- i.e., $U^2 = 1$ with probability one
Optimal trace estimation

- Let $B$ be an $n \times n$ symmetric matrix with non-zero trace
- Let $U$ be the discrete random variable taking values $1, -1$ each with probability $\frac{1}{2}$ (Rademacher distribution), and let $u = [u_1, \dots, u_n]^T$ have i.i.d. entries $\sim U$
- $u^T B u$ is an unbiased estimator of $\mathrm{tr}(B)$ and
  $$\mathrm{Var}[u^T B u] = 2 \sum_{i \ne j} B_{ij}^2.$$
- $U$ is the unique variable among zero-mean random variables for which $u^T B u$ is a minimum variance, unbiased estimator of $\mathrm{tr}(B)$.
Hutchinson (1990)
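Hutchinson's estimator is a few lines of code when $B$ is only available through matrix-vector products. A sketch (the helper name `hutchinson_trace` and the test matrix are illustrative):

```python
import numpy as np

def hutchinson_trace(matvec, n, num_trials, rng=None):
    """Average u^T B u over Rademacher vectors u, given v -> B v."""
    rng = np.random.default_rng(rng)
    total = 0.0
    for _ in range(num_trials):
        u = rng.choice([-1.0, 1.0], size=n)   # Rademacher entries
        total += u @ matvec(u)
    return total / num_trials

rng = np.random.default_rng(0)
M = rng.standard_normal((100, 100))
B = M.T @ M                                    # symmetric test matrix
est = hutchinson_trace(lambda v: B @ v, n=100, num_trials=2000, rng=1)
print(est, np.trace(B))
```

Averaging `num_trials` independent quadratic forms shrinks the variance by that factor, per the lemma above.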
Application to Approximate Matrix Multiplication

- $\|AB - CR\|_F^2 = \mathrm{tr}\left((AB - CR)^T (AB - CR)\right)$
- can be estimated via
  $$u^T (AB - CR)^T (AB - CR) u = \|(AB - CR) u\|_2^2$$
  where $u = [u_1, \dots, u_n]^T$ has i.i.d. $\pm 1$ entries, each with probability $\frac{1}{2}$
- only requires matrix-vector products
- variance can be reduced by averaging independent trials
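The point of the identity is that $\|AB - CR\|_F^2$ can be probed without ever forming $AB - CR$: each trial costs one matvec with $A$, $B$, $C$, $R$. A hypothetical sketch (uniform sampling, illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 60))
B = rng.standard_normal((60, 25))

# A crude CR approximation for illustration: uniform sampling, m = 15
idx = rng.choice(60, size=15)
scale = 1.0 / np.sqrt(15 * (1.0 / 60))   # scalar, since p_k = 1/60 for all k
C = A[:, idx] * scale
R = B[idx, :] * scale

trials = 5000
acc = 0.0
for _ in range(trials):
    u = rng.choice([-1.0, 1.0], size=25)
    Du = A @ (B @ u) - C @ (R @ u)       # (AB - CR) u via matvecs only
    acc += Du @ Du                       # ||(AB - CR) u||_2^2

print(acc / trials, np.linalg.norm(A @ B - C @ R) ** 2)
```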
Sampling/Sketching Matrix Formalism

- Define the sampling matrix
  $$S_{ij} = \begin{cases} 1 & \text{if the } i\text{-th column of } A \text{ is chosen in the } j\text{-th trial} \\ 0 & \text{otherwise} \end{cases}$$
- diagonal re-weighting matrix
  $$D_{tt} = \frac{1}{\sqrt{m p_{i_t}}}$$
- $AB \approx CR$ with
  $$C = ASD \quad \text{and} \quad R = D S^T B$$
- redefining $S := D S^T$ (the combined sketch matrix),
  $$CR = ASDDS^T B = A S^T S B$$
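The identity $CR = A S^T S B$ can be checked by building $S$ and $D$ explicitly. A sketch with illustrative sizes (in practice one never forms $S$ and $D$; sampling is done by indexing):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 8, 5
A = rng.standard_normal((6, d))
B = rng.standard_normal((d, 4))
p = np.full(d, 1.0 / d)
idx = rng.choice(d, size=m, p=p)

S = np.zeros((d, m))
S[idx, np.arange(m)] = 1.0              # S_ij = 1 if column idx[j] chosen in trial j
D = np.diag(1.0 / np.sqrt(m * p[idx]))  # D_tt = 1 / sqrt(m p_{i_t})

C = A @ S @ D
R = D @ S.T @ B
Sk = D @ S.T                            # the combined (redefined) sketch matrix
print(np.allclose(C @ R, A @ Sk.T @ Sk @ B))
```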
Estimating the entry-wise error

- infinity norm error
  $$\varepsilon(S) \triangleq \|A S^T S B - AB\|_\infty = \max_{ij} \left|(A S^T S B)_{ij} - (AB)_{ij}\right|$$
- the 0.99-quantile of $\varepsilon(S)$ is the tightest upper bound that holds with probability at least 0.99
- Bootstrap procedure:
  For $b = 1, \dots, B$ do
    sample $m$ numbers with replacement from $\{1, \dots, m\}$
    form $S_b$ by selecting the respective rows of $S$
    compute $\varepsilon_b = \|A S_b^T S_b B - A S^T S B\|_\infty$
  return the 0.99-quantile of the values $\varepsilon_1, \dots, \varepsilon_B$
  (e.g., sort in increasing order and return the $\lfloor 0.99 B \rfloor$-th value)
- imitates the random mechanism that originally generated $A S^T S B$
Lopes et al., A Bootstrap Method for Error Estimation in Randomized Matrix Multiplication.
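The bootstrap loop above translates directly into code: resample the rows of the sketch, recompute the product, and take the quantile of the deviations from the computed estimate (not from the unknown $AB$). A hypothetical sketch with illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_boot, m, d = 200, 50, 30
A = rng.standard_normal((20, d))
B = rng.standard_normal((d, 10))
p = np.full(d, 1.0 / d)

idx = rng.choice(d, size=m, p=p)
scale = 1.0 / np.sqrt(m * p[idx])
Sk = np.zeros((m, d))
Sk[np.arange(m), idx] = scale              # rows of the combined sketch matrix
ASSB = A @ Sk.T @ Sk @ B                   # the computed estimate of AB

eps = []
for _ in range(n_boot):
    rows = rng.choice(m, size=m)           # resample m rows with replacement
    Sb = Sk[rows, :]
    eps.append(np.max(np.abs(A @ Sb.T @ Sb @ B - ASSB)))

q99 = np.sort(eps)[int(0.99 * n_boot) - 1]  # 0.99-quantile of eps_1..eps_B
print(q99, np.max(np.abs(ASSB - A @ B)))    # bootstrap estimate vs actual error
```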
Extrapolating the error

- $\varepsilon(S) \triangleq \|A S^T S B - AB\|_\infty$
- for sufficiently large $m$, the 0.99-quantile of $\varepsilon(S) \approx \frac{\kappa}{\sqrt{m}}$, where $\kappa$ is an unknown number
- given an initial sketch of size $m_0$, we can extrapolate the error for $m > m_0$ via the bootstrap estimate as
  $$\frac{\sqrt{m_0}}{\sqrt{m}}\, \varepsilon(S)$$
Extrapolation: Numerical example

- Protein dataset ($n = 17766$, $d = 356$)

[Figure: the black line is the 0.99-quantile as a function of $m$; the blue star is the average bootstrap estimate at the initial sketch size $m_0 = 500$, and the blue line is the average extrapolated estimate derived from $m_0$.]

Lopes et al., A Bootstrap Method for Error Estimation in Randomized Matrix Multiplication.
Questions?