QR Factorizations and SVDs for Tall-and-Skinny Matrices in MapReduce Architectures
Austin Benson
ICME, Stanford University
Work completed in part at UC-Berkeley
SIAM CSE 2013
February 27, 2013
Collaborators
James Demmel, UC-Berkeley
David Gleich, Purdue
Paul Constantine, Stanford
Thanks!
Quick QR and SVD review
A = Q R (QR factorization) and A = U Σ V^T (SVD), with all matrices n × n.

Figure: Q, U, and V are orthogonal matrices. R is upper triangular and Σ is diagonal with positive entries.
Tall-and-skinny QR
A = Q R, with A of size m × n, Q of size m × n, and R of size n × n.

Tall-and-skinny (TS): m >> n
TS-QR → TS-SVD
R is small, so computing its SVD is cheap.
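This step can be sketched with numpy (a minimal illustration, not the MapReduce implementation): compute a thin QR of A, take the SVD of the small n × n factor R, and combine.

```python
import numpy as np

# Tall-and-skinny example: m >> n
m, n = 1000, 10
A = np.random.rand(m, n)

# Thin QR: Q is m x n, R is n x n
Q, R = np.linalg.qr(A)

# SVD of the small n x n factor R is cheap
Ur, S, Vt = np.linalg.svd(R)

# A = Q R = (Q Ur) diag(S) Vt, so U = Q @ Ur gives the left singular vectors
U = Q @ Ur
err = np.linalg.norm(A - U @ np.diag(S) @ Vt)
```

The singular values of A are exactly the singular values of R, since Q has orthonormal columns.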
Why Tall-and-skinny QR and SVD?
1. Regression with many samples
2. Principal Component Analysis (PCA)
3. Model Reduction
Figure: Dynamic mode decomposition of a rectangular supersonic screeching jet (panels: Pressure, Dilation, Jet Engine). Joe Nichols, Stanford University.
MapReduce overview
Two functions that operate on key-value pairs:

map: (key, value) → (key, value)
reduce: (key, 〈value1, . . ., valuen〉) → (key, value)

A shuffle stage runs between map and reduce to sort the values by key.
MapReduce overview
Scalability: many map tasks and many reduce tasks are used
Figure: map and shuffle stages (image: https://developers.google.com/appengine/docs/python/images/mapreduce_mapshuffle.png)
MapReduce overview
The idea is data-local computation. The programmer implements:

- map(key, value)
- reduce(key, 〈value1, . . ., valuen〉)

The shuffle and data I/O are implemented by the MapReduce framework, e.g., Hadoop.

This is a very restrictive programming environment! We sacrifice program control for structure, scalability, fault tolerance, etc.
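The programming model can be illustrated with a tiny in-memory simulation (my own sketch; in practice Hadoop handles the shuffle, fault tolerance, and I/O):

```python
from collections import defaultdict

def run_mapreduce(pairs, mapper, reducer):
    """Toy in-memory MapReduce: map, shuffle (group by key), then reduce."""
    shuffled = defaultdict(list)
    for key, value in pairs:
        for mk, mv in mapper(key, value):    # map emits (key, value) pairs
            shuffled[mk].append(mv)          # shuffle groups values by key
    out = {}
    for key, values in sorted(shuffled.items()):
        for rk, rv in reducer(key, values):  # reduce sees (key, [values])
            out[rk] = rv
    return out

# The classic word-count example
def wc_map(key, line):
    for word in line.split():
        yield word, 1

def wc_reduce(key, values):
    yield key, sum(values)

counts = run_mapreduce([(0, "a b a"), (1, "b c")], wc_map, wc_reduce)
```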
Matrix representation
We have matrices, so what are the key-value pairs? We can just have unique row identifiers:

A =
[ 1.0 0.0 ]
[ 2.4 3.7 ]
[ 0.8 4.2 ]
[ 9.0 9.0 ]

→

(a3320354, [1.0, 0.0])
(5da011e2, [2.4, 3.7])
(94d80025, [0.8, 4.2])
(90764917, [9.0, 9.0])
Each value in a key-value pair is a row of the matrix.
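A minimal sketch of this encoding (the random hex identifiers here are my stand-ins for the unique keys above):

```python
import uuid
import numpy as np

A = np.array([[1.0, 0.0],
              [2.4, 3.7],
              [0.8, 4.2],
              [9.0, 9.0]])

# One key-value pair per row; the keys only need to be unique
pairs = [(uuid.uuid4().hex[:8], list(row)) for row in A]
```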
Matrix representation: an example
We can encode more information in the keys, e.g., (x, y, z) coordinates and model number:

((47570, 103.429767811242, 0, -16.525510963787, iDV7),
 [0.00019924, -4.706066e-05, 2.875293979e-05, 2.456653e-05, -8.436627e-06,
  -1.508808e-05, 3.731976e-06, -1.048795e-05, 5.229153e-06, 6.323812e-06])
Figure: Aircraft simulation data. Paul Constantine, Stanford University
Why MapReduce?
1. Fault tolerance
2. Structured data I/O
3. Cheaper clusters with more data storage
4. Easy to program
Generate lots of data on supercomputer
Post-process and analyze on MapReduce cluster
http://www.nersc.gov/assets/About-Us/hopper1.jpg
Fault tolerance
Figure: Performance with Injected Faults. Job time (secs.) versus 1/(probability of task fault), comparing the running time with no faults to running times with injected faults.
≈ 25% performance penalty with 1 out of 5 tasks failing
Why MapReduce?
Not convinced MapReduce is good for computational science?
MS218: Is MapReduce Good for Science and Simulation Data?, Thursday, 4:30-6:30pm.
Communication-avoiding TSQR
A = [A1; A2; A3; A4] = diag(Q1, Q2, Q3, Q4) · [R1; R2; R3; R4] = diag(Q1, Q2, Q3, Q4) · Q̃ · R = Q R

Here A is 8n × n, the block-diagonal factor diag(Q1, Q2, Q3, Q4) is 8n × 4n, the stacked [R1; R2; R3; R4] is 4n × n, Q̃ is 4n × n, and R is n × n.
(Demmel et al. 2008)
Communication-avoiding TSQR
A = [A1; A2; A3; A4] = diag(Q1, Q2, Q3, Q4) · [R1; R2; R3; R4]

with A of size 8n × n, diag(Q1, Q2, Q3, Q4) of size 8n × 4n, and the stacked R factors of size 4n × n.
Ai = QiRi can be computed in parallel. If we only need R, then we can throw out the Q factors.
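The two-stage factorization can be sketched in numpy (4 row blocks, keeping only R factors): local QRs on the blocks, then a QR of the stacked R factors, which recovers the R of A.

```python
import numpy as np

np.random.seed(0)
A = np.random.rand(32, 4)                  # 8n x n with n = 4
blocks = np.split(A, 4)                    # A1, ..., A4, handled independently

# First stage: local QR on each block, keep only the R factors
Rs = [np.linalg.qr(Ai, mode='r') for Ai in blocks]

# Second stage: QR of the stacked R factors gives the R of A
R_tsqr = np.linalg.qr(np.vstack(Rs), mode='r')
R_direct = np.linalg.qr(A, mode='r')

# R is unique only up to the signs of its rows, so compare magnitudes
err = np.linalg.norm(np.abs(R_tsqr) - np.abs(R_direct))
```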
MapReduce TSQR
Diagram: map tasks compute a local TSQR of each block A1, …, A4 and emit R1, …, R4; after a shuffle, reduce tasks compute a local TSQR of the stacked factors and emit R2,1, R2,2, R2,3; after an identity map and a second shuffle, a final reduce computes a local TSQR and emits R.
Figure: S(1) is the matrix consisting of the rows of all of the Ri factors. Similarly, S(2) consists of all of the rows of the R2,j factors.
MapReduce TSQR: the implementation
import random, numpy, hadoopy

class SerialTSQR:
    def __init__(self, blocksize, isreducer):
        self.bsize, self.data = blocksize, []
        self.__call__ = self.reducer if isreducer else self.mapper

    def compress(self):
        R = numpy.linalg.qr(numpy.array(self.data), 'r')
        self.data = []
        for row in R:
            self.data.append([float(v) for v in row])

    def collect(self, key, value):
        self.data.append(value)
        if len(self.data) > self.bsize * len(self.data[0]):
            self.compress()

    def close(self):
        self.compress()
        for row in self.data:
            yield random.randint(0, 2000000000), row

    def mapper(self, key, value):
        self.collect(key, value)

    def reducer(self, key, values):
        for value in values:
            self.mapper(key, value)

if __name__ == '__main__':
    mapper = SerialTSQR(blocksize=3, isreducer=False)
    reducer = SerialTSQR(blocksize=3, isreducer=True)
    hadoopy.run(mapper, reducer)
MapReduce TSQR: Getting Q
We have an efficient way of getting R in QR (and Σ in the SVD). What if I want Q (or my left singular vectors)?

A = QR → Q = AR−1

This is a numerically unstable computation, and Q can be far from orthogonal. (This may be OK in some cases.)
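The instability is easy to demonstrate (a sketch of mine; the matrix is constructed via its SVD to have condition number about 1e10):

```python
import numpy as np

np.random.seed(1)
# Build an ill-conditioned tall-and-skinny matrix from a prescribed SVD
m, n = 500, 5
U, _ = np.linalg.qr(np.random.randn(m, n))
V, _ = np.linalg.qr(np.random.randn(n, n))
A = U @ np.diag(np.logspace(0, -10, n)) @ V.T   # cond(A) ~ 1e10

R = np.linalg.qr(A, mode='r')
Q_ind = A @ np.linalg.inv(R)                    # indirect: Q = A R^{-1}
Q_dir, _ = np.linalg.qr(A)                      # Householder QR for reference

# Orthogonality loss ||Q^T Q - I||
loss_ind = np.linalg.norm(Q_ind.T @ Q_ind - np.eye(n))
loss_dir = np.linalg.norm(Q_dir.T @ Q_dir - np.eye(n))
```

The indirect Q loses roughly cond(A) · machine-epsilon digits of orthogonality, while Householder QR stays orthogonal to machine precision.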
Indirect TSQR
We call this method Indirect TSQR.
Indirect TSQR: Iterative Refinement
We can repeat the TSQR computation on Q to get a more orthogonal matrix. This is called iterative refinement.
Diagram: a TSQR pass computes R, which is distributed to the map tasks; each task forms Qi = Ai R−1 by a local matrix multiply (Local MatMul) and emits it. The iterative refinement step repeats the process: TSQR of Q computes R1, and each task forms Qi R1−1.
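One refinement step can be sketched as follows (my own illustration): factor the computed Q again and correct it by the new, nearly identity R factor.

```python
import numpy as np

np.random.seed(2)
# Ill-conditioned tall-and-skinny matrix built from a prescribed SVD
m, n = 500, 5
U, _ = np.linalg.qr(np.random.randn(m, n))
V, _ = np.linalg.qr(np.random.randn(n, n))
A = U @ np.diag(np.logspace(0, -8, n)) @ V.T

# Indirect TSQR: Q = A R^{-1} (may be far from orthogonal)
R = np.linalg.qr(A, mode='r')
Q = A @ np.linalg.inv(R)

# One refinement step: factor Q itself and correct by its R factor
R1 = np.linalg.qr(Q, mode='r')
Q_refined = Q @ np.linalg.inv(R1)

before = np.linalg.norm(Q.T @ Q - np.eye(n))
after = np.linalg.norm(Q_refined.T @ Q_refined - np.eye(n))
```

Because Q is already close to orthogonal, R1 is well conditioned, so the second pass is numerically benign.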
Direct TSQR
Why is it hard to reconstruct Q?
- The only maintained “state” is the data written to disk at the end of each map/reduce iteration
- No controlled communication between processors → we can only control data movement via the shuffle
- File names and keys are the only methods of labeling data components
Direct TSQR: Steps 1 and 2
Diagram: in the first step, map tasks compute local factorizations Ai = Qi1 Ri, store the Qi1 factors, and emit R1, …, R4. In the second step, after a shuffle, a single reduce task stacks R1, …, R4, computes their QR factorization, and emits R together with the blocks Q12, Q22, Q32, Q42 of the small Q factor.
Direct TSQR: Step 3
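The three steps can be sketched in numpy. The third step shown here is my reading of the method: each map task multiplies its stored first-step factor Qi1 by the corresponding block Qi2 of the second-step Q (the file-name bookkeeping that routes Qi2 back to task i is omitted).

```python
import numpy as np

np.random.seed(3)
A = np.random.rand(400, 5)
blocks = np.split(A, 4)                 # A1, ..., A4, each 100 x 5

# Step 1 (map): local QR on each block; store Qi1, emit Ri
Q1s, Rs = zip(*[np.linalg.qr(Ai) for Ai in blocks])

# Step 2 (single reduce): QR of the stacked Ri factors
Q2, R = np.linalg.qr(np.vstack(Rs))
Q2_blocks = np.split(Q2, 4)             # Qi2, one 5 x 5 block per task

# Step 3 (map): Qi = Qi1 @ Qi2 reconstructs the blocks of Q
Q = np.vstack([Q1 @ Q2b for Q1, Q2b in zip(Q1s, Q2_blocks)])

recon_err = np.linalg.norm(A - Q @ R)
orth_err = np.linalg.norm(Q.T @ Q - np.eye(5))
```

Unlike the indirect approach, no inverse of R is applied, so Q is orthogonal to machine precision regardless of cond(A).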
Cholesky QR
We also have a Cholesky QR:
A^T A = (QR)^T (QR) = R^T Q^T Q R = R^T R

Computing A^T A translates nicely to MapReduce. This method has worse stability issues but requires fewer flops.
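A sketch of the computation: the map stage computes local Gram matrices Ai^T Ai, the reduce stage sums them into A^T A and takes its Cholesky factorization, whose upper-triangular factor is R.

```python
import numpy as np

np.random.seed(4)
A = np.random.rand(1000, 8)

# Map/reduce: the sum of local Gram matrices gives A^T A
G = sum(Ai.T @ Ai for Ai in np.split(A, 4))

# A^T A = R^T R, so R is the upper Cholesky factor
R = np.linalg.cholesky(G).T

# Getting Q still requires the (unstable) Q = A R^{-1} step
Q = A @ np.linalg.inv(R)
orth_err = np.linalg.norm(Q.T @ Q - np.eye(8))
```

The orthogonality loss here scales like cond(A)^2 rather than cond(A), which is why Cholesky QR is the least stable method on the stability plot.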
Performance Data
Figure: Performance of various MapReduce algorithms for computing QR. Time to completion (seconds) versus matrix size: 4Bx4 (135 GB), 2.5Bx10 (193 GB), 600Mx25 (112 GB), 500Mx50 (184 GB), and 150Mx100 (110 GB). Methods: Cholesky, Indirect TSQR, Cholesky + IR, Indirect TSQR + IR, Direct TSQR.
End
Questions?

- code: https://github.com/arbenson/mrtsqr
- arXiv paper: http://arxiv.org/abs/1301.1071
- my email: [email protected]
Extra slides
Cholesky QR
Diagram: map tasks compute local Gram matrices Ai^T Ai (Local A^T A); after a shuffle, reduce tasks sum the corresponding rows (Ai^T Ai)_j across tasks to form the rows (A^T A)_j (Row Sum); a final reduce assembles A^T A, computes A^T A = R^T R by Cholesky, and emits R.
Stability
Figure: Numerical stability for 10000x10 matrices. ||Q^T Q − I||_2 (from 10^-15 to 10^5) versus cond(A) (from 10^0 to 10^20) for Dir. TSQR, Indir. TSQR, Indir. TSQR + I.R., Chol., and Chol. + I.R.