QR Factorizations and SVDs for Tall-and-Skinny Matrices in MapReduce Architectures
Austin Benson
ICME, Stanford University
Work completed in part at UC-Berkeley
SIAM CSE 2013
February 27, 2013
Collaborators
James Demmel, UC-Berkeley
David Gleich, Purdue
Paul Constantine, Stanford
Thanks!
Quick QR and SVD review
A = Q R (QR factorization) and A = U Σ V^T (SVD), with all matrices n × n.

Figure: Q, U, and V are orthogonal matrices. R is upper triangular and Σ is diagonal with positive entries.
Tall-and-skinny QR
A = Q R, with A of size m × n, Q of size m × n, and R of size n × n.

Tall-and-skinny (TS): m >> n
TS-QR → TS-SVD
R is small, so computing its SVD is cheap.
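This step can be sketched with numpy (a minimal illustration, not the MapReduce implementation): compute a thin QR of A, take the SVD of the small n × n factor R, and combine.

```python
import numpy as np

# Tall-and-skinny example: m >> n
m, n = 1000, 10
A = np.random.rand(m, n)

# Thin QR: Q is m x n, R is n x n
Q, R = np.linalg.qr(A)

# SVD of the small n x n factor R is cheap
Ur, S, Vt = np.linalg.svd(R)

# A = Q R = (Q Ur) diag(S) Vt, so U = Q @ Ur gives the left singular vectors
U = Q @ Ur
err = np.linalg.norm(A - U @ np.diag(S) @ Vt)
```

The singular values of A are exactly the singular values of R, since Q has orthonormal columns.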
Why Tall-and-skinny QR and SVD?
1. Regression with many samples
2. Principal Component Analysis (PCA)
3. Model Reduction
Figure: Dynamic mode decomposition of a rectangular supersonic screeching jet (panels: Pressure, Dilation, Jet Engine). Joe Nichols, Stanford University.
MapReduce overview
Two functions that operate on key-value pairs:

map: (key, value) → (key, value)
reduce: (key, 〈value1, . . ., valuen〉) → (key, value)

A shuffle stage runs between map and reduce to sort the values by key.
MapReduce overview
Scalability: many map tasks and many reduce tasks are used
Figure: map and shuffle stages (image: https://developers.google.com/appengine/docs/python/images/mapreduce_mapshuffle.png)
MapReduce overview
The idea is data-local computation. The programmer implements:

- map(key, value)
- reduce(key, 〈value1, . . ., valuen〉)

The shuffle and data I/O are implemented by the MapReduce framework, e.g., Hadoop.

This is a very restrictive programming environment! We sacrifice program control for structure, scalability, fault tolerance, etc.
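The programming model can be illustrated with a tiny in-memory simulation (my own sketch; in practice Hadoop handles the shuffle, fault tolerance, and I/O):

```python
from collections import defaultdict

def run_mapreduce(pairs, mapper, reducer):
    """Toy in-memory MapReduce: map, shuffle (group by key), then reduce."""
    shuffled = defaultdict(list)
    for key, value in pairs:
        for mk, mv in mapper(key, value):    # map emits (key, value) pairs
            shuffled[mk].append(mv)          # shuffle groups values by key
    out = {}
    for key, values in sorted(shuffled.items()):
        for rk, rv in reducer(key, values):  # reduce sees (key, [values])
            out[rk] = rv
    return out

# The classic word-count example
def wc_map(key, line):
    for word in line.split():
        yield word, 1

def wc_reduce(key, values):
    yield key, sum(values)

counts = run_mapreduce([(0, "a b a"), (1, "b c")], wc_map, wc_reduce)
```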
Matrix representation
We have matrices, so what are the key-value pairs? We can just have unique row identifiers:

A =
[ 1.0 0.0 ]
[ 2.4 3.7 ]
[ 0.8 4.2 ]
[ 9.0 9.0 ]

→

(a3320354, [1.0, 0.0])
(5da011e2, [2.4, 3.7])
(94d80025, [0.8, 4.2])
(90764917, [9.0, 9.0])
Each value in a key-value pair is a row of the matrix.
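A minimal sketch of this encoding (the random hex identifiers here are my stand-ins for the unique keys above):

```python
import uuid
import numpy as np

A = np.array([[1.0, 0.0],
              [2.4, 3.7],
              [0.8, 4.2],
              [9.0, 9.0]])

# One key-value pair per row; the keys only need to be unique
pairs = [(uuid.uuid4().hex[:8], list(row)) for row in A]
```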
Matrix representation: an example
We can encode more information in the keys, e.g., (x, y, z) coordinates and model number:

((47570, 103.429767811242, 0, -16.525510963787, iDV7),
 [0.00019924, -4.706066e-05, 2.875293979e-05, 2.456653e-05, -8.436627e-06,
  -1.508808e-05, 3.731976e-06, -1.048795e-05, 5.229153e-06, 6.323812e-06])
Figure: Aircraft simulation data. Paul Constantine, Stanford University
Why MapReduce?
1. Fault tolerance
2. Structured data I/O
3. Cheaper clusters with more data storage
4. Easy to program
Generate lots of data on supercomputer
Post-process and analyze on MapReduce cluster
http://www.nersc.gov/assets/About-Us/hopper1.jpg
Fault tolerance
Figure: Performance with Injected Faults. Job time (secs.) versus 1/(probability of task fault), comparing the running time with no faults to running times with injected faults.
≈ 25% performance penalty with 1 out of 5 tasks failing
Why MapReduce?
Not convinced MapReduce is good for computational science?
MS218: Is MapReduce Good for Science and Simulation Data?, Thursday, 4:30-6:30pm.
Communication-avoiding TSQR
A = [A1; A2; A3; A4] = diag(Q1, Q2, Q3, Q4) · [R1; R2; R3; R4] = diag(Q1, Q2, Q3, Q4) · Q̃ · R = Q R

Here A is 8n × n, the block-diagonal factor diag(Q1, Q2, Q3, Q4) is 8n × 4n, the stacked [R1; R2; R3; R4] is 4n × n, Q̃ is 4n × n, and R is n × n.
(Demmel et al. 2008)
Communication-avoiding TSQR
A = [A1; A2; A3; A4] = diag(Q1, Q2, Q3, Q4) · [R1; R2; R3; R4]

with A of size 8n × n, diag(Q1, Q2, Q3, Q4) of size 8n × 4n, and the stacked R factors of size 4n × n.
Ai = QiRi can be computed in parallel. If we only need R, then we can throw out the Q factors.
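The two-stage factorization can be sketched in numpy (4 row blocks, keeping only R factors): local QRs on the blocks, then a QR of the stacked R factors, which recovers the R of A.

```python
import numpy as np

np.random.seed(0)
A = np.random.rand(32, 4)                  # 8n x n with n = 4
blocks = np.split(A, 4)                    # A1, ..., A4, handled independently

# First stage: local QR on each block, keep only the R factors
Rs = [np.linalg.qr(Ai, mode='r') for Ai in blocks]

# Second stage: QR of the stacked R factors gives the R of A
R_tsqr = np.linalg.qr(np.vstack(Rs), mode='r')
R_direct = np.linalg.qr(A, mode='r')

# R is unique only up to the signs of its rows, so compare magnitudes
err = np.linalg.norm(np.abs(R_tsqr) - np.abs(R_direct))
```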
MapReduce TSQR
Diagram: map tasks compute a local TSQR of each block A1, …, A4 and emit R1, …, R4; after a shuffle, reduce tasks compute a local TSQR of the stacked factors and emit R2,1, R2,2, R2,3; after an identity map and a second shuffle, a final reduce computes a local TSQR and emits R.
Figure: S(1) is the matrix consisting of the rows of all of the Ri factors. Similarly, S(2) consists of all of the rows of the R2,j factors.
MapReduce TSQR: the implementation
import random, numpy, hadoopy

class SerialTSQR:
    def __init__(self, blocksize, isreducer):
        self.bsize, self.data = blocksize, []
        self.__call__ = self.reducer if isreducer else self.mapper

    def compress(self):
        R = numpy.linalg.qr(numpy.array(self.data), 'r')
        self.data = []
        for row in R:
            self.data.append([float(v) for v in row])

    def collect(self, key, value):
        self.data.append(value)
        if len(self.data) > self.bsize * len(self.data[0]):
            self.compress()

    def close(self):
        self.compress()
        for row in self.data:
            yield random.randint(0, 2000000000), row

    def mapper(self, key, value):
        self.collect(key, value)

    def reducer(self, key, values):
        for value in values:
            self.mapper(key, value)

if __name__ == '__main__':
    mapper = SerialTSQR(blocksize=3, isreducer=False)
    reducer = SerialTSQR(blocksize=3, isreducer=True)
    hadoopy.run(mapper, reducer)
MapReduce TSQR: Getting Q
We have an efficient way of getting R in QR (and Σ in the SVD). What if I want Q (or my left singular vectors)?

A = QR → Q = AR−1

This is a numerically unstable computation, and Q can be far from orthogonal. (This may be OK in some cases.)
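The instability is easy to demonstrate (a sketch of mine; the matrix is constructed via its SVD to have condition number about 1e10):

```python
import numpy as np

np.random.seed(1)
# Build an ill-conditioned tall-and-skinny matrix from a prescribed SVD
m, n = 500, 5
U, _ = np.linalg.qr(np.random.randn(m, n))
V, _ = np.linalg.qr(np.random.randn(n, n))
A = U @ np.diag(np.logspace(0, -10, n)) @ V.T   # cond(A) ~ 1e10

R = np.linalg.qr(A, mode='r')
Q_ind = A @ np.linalg.inv(R)                    # indirect: Q = A R^{-1}
Q_dir, _ = np.linalg.qr(A)                      # Householder QR for reference

# Orthogonality loss ||Q^T Q - I||
loss_ind = np.linalg.norm(Q_ind.T @ Q_ind - np.eye(n))
loss_dir = np.linalg.norm(Q_dir.T @ Q_dir - np.eye(n))
```

The indirect Q loses roughly cond(A) · machine-epsilon digits of orthogonality, while Householder QR stays orthogonal to machine precision.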
Indirect TSQR
We call this method Indirect TSQR.
Indirect TSQR: Iterative Refinement
We can repeat the TSQR computation on Q to get a more orthogonal matrix. This is called iterative refinement.
Diagram: a TSQR pass computes R, which is distributed to the map tasks; each task forms Qi = Ai R−1 by a local matrix multiply (Local MatMul) and emits it. The iterative refinement step repeats the process: TSQR of Q computes R1, and each task forms Qi R1−1.
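One refinement step can be sketched as follows (my own illustration): factor the computed Q again and correct it by the new, nearly identity R factor.

```python
import numpy as np

np.random.seed(2)
# Ill-conditioned tall-and-skinny matrix built from a prescribed SVD
m, n = 500, 5
U, _ = np.linalg.qr(np.random.randn(m, n))
V, _ = np.linalg.qr(np.random.randn(n, n))
A = U @ np.diag(np.logspace(0, -8, n)) @ V.T

# Indirect TSQR: Q = A R^{-1} (may be far from orthogonal)
R = np.linalg.qr(A, mode='r')
Q = A @ np.linalg.inv(R)

# One refinement step: factor Q itself and correct by its R factor
R1 = np.linalg.qr(Q, mode='r')
Q_refined = Q @ np.linalg.inv(R1)

before = np.linalg.norm(Q.T @ Q - np.eye(n))
after = np.linalg.norm(Q_refined.T @ Q_refined - np.eye(n))
```

Because Q is already close to orthogonal, R1 is well conditioned, so the second pass is numerically benign.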
Direct TSQR
Why is it hard to reconstruct Q?
- The only maintained “state” is the data written to disk at the end of each map/reduce iteration
- No controlled communication between processors → we can only control data movement via the shuffle
- File names and keys are the only methods of labeling data components
Direct TSQR: Steps 1 and 2
Diagram: in the first step, map tasks compute local factorizations Ai = Qi1 Ri, store the Qi1 factors, and emit R1, …, R4. In the second step, after a shuffle, a single reduce task stacks R1, …, R4, computes their QR factorization, and emits R together with the blocks Q12, Q22, Q32, Q42 of the small Q factor.
Direct TSQR: Step 3
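The three steps can be sketched in numpy. The third step shown here is my reading of the method: each map task multiplies its stored first-step factor Qi1 by the corresponding block Qi2 of the second-step Q (the file-name bookkeeping that routes Qi2 back to task i is omitted).

```python
import numpy as np

np.random.seed(3)
A = np.random.rand(400, 5)
blocks = np.split(A, 4)                 # A1, ..., A4, each 100 x 5

# Step 1 (map): local QR on each block; store Qi1, emit Ri
Q1s, Rs = zip(*[np.linalg.qr(Ai) for Ai in blocks])

# Step 2 (single reduce): QR of the stacked Ri factors
Q2, R = np.linalg.qr(np.vstack(Rs))
Q2_blocks = np.split(Q2, 4)             # Qi2, one 5 x 5 block per task

# Step 3 (map): Qi = Qi1 @ Qi2 reconstructs the blocks of Q
Q = np.vstack([Q1 @ Q2b for Q1, Q2b in zip(Q1s, Q2_blocks)])

recon_err = np.linalg.norm(A - Q @ R)
orth_err = np.linalg.norm(Q.T @ Q - np.eye(5))
```

Unlike the indirect approach, no inverse of R is applied, so Q is orthogonal to machine precision regardless of cond(A).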
Cholesky QR
We also have a Cholesky QR:
A^T A = (QR)^T (QR) = R^T Q^T Q R = R^T R

Computing A^T A translates nicely to MapReduce. This method has worse stability issues but requires fewer flops.
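A sketch of the computation: the map stage computes local Gram matrices Ai^T Ai, the reduce stage sums them into A^T A and takes its Cholesky factorization, whose upper-triangular factor is R.

```python
import numpy as np

np.random.seed(4)
A = np.random.rand(1000, 8)

# Map/reduce: the sum of local Gram matrices gives A^T A
G = sum(Ai.T @ Ai for Ai in np.split(A, 4))

# A^T A = R^T R, so R is the upper Cholesky factor
R = np.linalg.cholesky(G).T

# Getting Q still requires the (unstable) Q = A R^{-1} step
Q = A @ np.linalg.inv(R)
orth_err = np.linalg.norm(Q.T @ Q - np.eye(8))
```

The orthogonality loss here scales like cond(A)^2 rather than cond(A), which is why Cholesky QR is the least stable method on the stability plot.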
Performance Data
Figure: Performance of various MapReduce algorithms for computing QR. Time to completion (seconds) versus matrix size: 4Bx4 (135 GB), 2.5Bx10 (193 GB), 600Mx25 (112 GB), 500Mx50 (184 GB), and 150Mx100 (110 GB). Methods: Cholesky, Indirect TSQR, Cholesky + IR, Indirect TSQR + IR, Direct TSQR.
End
Questions?

- code: https://github.com/arbenson/mrtsqr
- arXiv paper: http://arxiv.org/abs/1301.1071
- my email: [email protected]
Extra slides
Cholesky QR
Diagram: map tasks compute local Gram matrices Ai^T Ai (Local A^T A); after a shuffle, reduce tasks sum the corresponding rows (Ai^T Ai)_j across tasks to form the rows (A^T A)_j (Row Sum); a final reduce assembles A^T A, computes A^T A = R^T R by Cholesky, and emits R.
Stability
Figure: Numerical stability for 10000x10 matrices. ||Q^T Q − I||_2 (from 10^-15 to 10^5) versus cond(A) (from 10^0 to 10^20) for Dir. TSQR, Indir. TSQR, Indir. TSQR + I.R., Chol., and Chol. + I.R.