
Low Rank Matrix Approximation

John T. Svadlenka
Ph.D. Program in Computer Science
The Graduate Center of the City University of New York, New York, NY 10036 USA
[email protected]

    Abstract

Low Rank Approximation is a fundamental computation across a broad range of applications where matrix dimension reduction is required. This survey paper on the Low Rank Approximation (LRA) of matrices provides a broad overview of recent progress in the field with perspectives from both the Theoretical Computer Science (TCS) and Numerical Linear Algebra (NLA) points of view. While traditional application areas of LRA have come from scientific, engineering, and statistical disciplines, a plethora of recent activity has been seen in image processing, machine learning, and climate informatics, to name just a few of the emerging modern technologies. All of these disciplines and technologies are increasingly challenged by the scale of modern massive data sets (MMDS) collected from a variety of sources including sensor measurement, computational models, and the Internet.

At the same time, applied mathematicians and computer scientists have been seeking alternatives to the standard numerical linear algebra (NLA) algorithms that are fundamentally incapable of handling the sheer size of these MMDS's. Against the backdrop of the disparity between increasing computer processing power and disk storage capabilities on the one hand versus memory bandwidth limitations on the other, a further research goal is to find flexible new LRA approaches that leverage the strengths of modern hardware architectures while limiting exposure to data communication bottlenecks.

Central to new approaches with regard to MMDS's is the approximation of a matrix by one of much smaller dimension, facilitated by randomization techniques. The term "low rank" in the context of LRA refers to the inherent "dimension" of the matrix, which may be much smaller than both the actual number of rows and columns in the matrix. Thus, a matrix may be approximated, if not exactly represented, by another matrix containing significantly fewer rows and columns while still preserving the salient characteristics of the original matrix. Recent results have shown that it is possible to attain this dimension reduction using alternative approximation and probabilistic techniques where randomization plays a key role. This paper surveys classical and recent results in LRA and presents practical algorithms representative of the recent research progress. A broad understanding of potential future research directions in LRA should also be evident to readers from theoretical, algorithmic, and computational backgrounds.

Keywords: Low rank approximation, Modern Massive Data Sets, random sketches, random projections, dimension reduction, QR factorization, Singular Value Decomposition, CUR, parallelism


1 Introduction

Numerical Linear Algebra (NLA) provides the theoretical underpinnings and formal framework for analyzing matrices and the various operations on matrices. The results in NLA have typically been shared across a diverse spectrum of areas ranging from the physical and life sciences and engineering to data analysis. More recently, new computational disciplines and application areas have begun to test the limits of traditional NLA algorithms. Although existing algorithms have been rigorously developed, analyzed, and refined over many years, certain inherent limitations of these algorithms have been increasingly exposed in both new and existing areas of application.

A confluence of diverse factors, as well as increased interest in NLA by researchers in Theoretical Computer Science, has contributed to recent developments in NLA. Perhaps more importantly, new perspectives on approximation and randomization as applied to NLA have generated new opportunities to further progress in the field. A prominent factor has been the sheer size of today's MMDS's, such as those encountered in data mining [1], that challenge conventional NLA algorithms. It can be shown that if one is willing to relax the requirement of high-precision results in exchange for faster algorithms, then alternative algorithmic approaches with improved arithmetic computational complexity are available [2].

These alternative approaches create an approximation of the input matrix that is much smaller in size than the original. The approximate matrix is commonly known as a sketch, due to the LRA randomization strategies employed to obtain the approximation. This random sketch of much smaller size replaces the original matrix in the application of interest in order to realize the computational savings. The question that naturally arises from this methodology is: how expensive is the LRA algorithm? While traditional LRA algorithms can yield a rank-k approximation of an m × n matrix in O(mnk) time, a randomized algorithm can reduce the asymptotic complexity to O(mn log k) [1], [2]. Another aspect of the approximation trade-off that justifies the approach lies within the context of data sets with inexact content [1]. Moreover, it can be argued that all numerical computation is itself approximate up to machine precision considerations, so an introduction of similarly inexact algorithms (up to a user-specified tolerance) is not an unreasonable proposition.

If one also recognizes that other deficiencies in classical algorithms besides computational complexity may be mitigated with alternative methods, then opportunities to realize further computational gains are possible. As an example, arithmetic computational complexity does not consider the effects of data movement to and from a computer's random access memory (RAM). Floating point operation speed continues to exceed that of memory bandwidth [31], so that data transfer is a major performance bottleneck with regard to processing MMDS's. The problem is clearly magnified for out-of-core data sets. Algorithms that can significantly reduce the number of passes over the data set (pass efficiency) will reduce the wall-clock time of an algorithm.

Similarly, good data and temporal locality contribute to the likelihood of a more favorable performance profile. Data locality refers to the characteristic of an algorithm whereby a segment of data in memory is likely to be processed next if it is located close to the data currently being processed. Likewise, temporal locality is the algorithmic property such that a set of operations on adjacent sets of the data occurs in close time-wise succession. The implication is that there is a lower likelihood of having to swap components of the program data set among the various levels of the memory hierarchy multiple times.

As a concrete example, consider the QR factorization, which is a standard algorithm for LRA. Matrix-matrix multiplication itself may be executed faster than a QR algorithm [2] due to block operations. Due to better use of memory hierarchies, matrix-matrix kernels can in general perform better than the matrix-vector kernels [33, 29] encountered in QR processing. Therefore, an LRA algorithmic alternative that is based on the former kernels rather than the latter may be more favorably disposed to computational gains. Additionally, matrix-matrix multiplication is embarrassingly parallel [1], and it more generally motivates the search for new LRA strategies that can more fully leverage the parallelism widely supported by modern hardware architectures. It should be noted from the above discussion that scalability improvements in NLA can be addressed in a multifaceted manner, that is, from the theoretical, algorithmic, and computational perspectives. This survey paper examines such hybrid algorithms, which combine existing elements of conventional deterministic NLA algorithms with randomization and approximation schemes to offer new algorithms that reduce asymptotic complexity. The result is the capacity to process much larger data sets than is possible with conventional algorithms.

The outline of this survey paper after the current introductory section is as follows. We review classical results from the literature in Section 2 to provide some broad background on standard factorizations and algorithms for LRA. Subsequently, the more recent research concerning approximation and probabilistic results is given in Section 3. The results of these two sections form the theoretical basis for the presentation of the Randomized Hybrid LRA algorithms of Section 4, where we also discuss the strategies and benefits associated with them. Open problems are discussed in Section 5 and concluding remarks follow in Section 6.

    2 Classical Results

    2.1 Rank k Factorization

In the previous section we mentioned that LRA is concerned with the study of matrix sketches that significantly reduce the number of rows and columns of a matrix. LRA yields two significant benefits: a savings in memory storage and a decrease in the number of arithmetic computations. To see how this is possible, consider that the storage space requirement of a dense m × n matrix A is mn memory cells. An LRA of A of rank k, for k ≪ min(m, n), is defined by the factorization:

    A ≈ B · C (2.1)

such that B ∈ R^{m×k} and C ∈ R^{k×n}. Therefore, the memory storage cost of the LRA of A is O((m + n)k), and we have that (m + n)k ≪ mn. In terms of a matrix-vector product involving the original matrix A of the form:

    x = Av (2.2)

where x and v are m- and n-dimensional vectors, respectively, this operation requires mn multiplications and m(n − 1) additions. Therefore, the number of arithmetic operations is roughly 2mn. A matrix-vector product formulation with the LRA of A is:

    x ≈ BCv (2.3)


We can see that the product y = Cv requires kn + k(n − 1) operations, while the product x = By uses mk + m(k − 1) operations. The overall arithmetic operation complexity is O(k(m + n)) and offers a significant savings when k ≪ min(m, n). If the matrix-vector product can be left in the rank factorization form:

    x ≈ By (2.4)

then the number of operations is further reduced.

The metric by which we prefer to measure the accuracy of a rank-k LRA, Â_k, of A is the (1 + ε) relative-error bound, for small positive ε, of the following form in both the spectral and Frobenius norms:

‖A − Â_k‖ ≤ (1 + ε)‖A − A_k‖    (2.5)

The relative-error norm bound is a particular example of a multiplicative error bound. A_k is the theoretically best rank-k approximation of A, which is given by the Singular Value Decomposition that we discuss next.
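To make the arithmetic savings concrete before turning to the SVD, here is a minimal sketch in Python, assuming NumPy; the matrix sizes and random test data are illustrative choices, not part of the survey.

import numpy as np

rng = np.random.default_rng(0)
m, n, k = 2000, 1500, 20                 # illustrative sizes with k << min(m, n)
B = rng.standard_normal((m, k))
C = rng.standard_normal((k, n))
A = B @ C                                # a rank-k matrix stored densely
v = rng.standard_normal(n)

x_dense = A @ v                          # roughly 2*m*n operations
y = C @ v                                # roughly 2*k*n operations
x_factored = B @ y                       # roughly 2*m*k operations, so about 2k(m+n) in total

print(np.allclose(x_dense, x_factored))  # True: same product, far fewer operations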

    2.2 Singular Value Decomposition (SVD)

Though the origins of the Singular Value Decomposition (SVD) can be traced back to the late 1800's, it was not until the 20th century that it eventually evolved into its current and most general form. The SVD exists for any m × n matrix A regardless of whether its entries are real or complex.

Let A be an m × n matrix with r = rank(A) whose elements may be complex. Then there exist two unitary matrices U and V such that

    A = UΣV ∗ (2.6)

where U and V are m × m and n × n, respectively. Σ is an m × n diagonal matrix with nonnegative elements σ_i such that σ_1 ≥ σ_2 ≥ · · · ≥ σ_r > 0 and σ_j = 0 for j > r.

We may also write a truncated form of the SVD in which U_r consists of the r left-most columns of U, V_r^* is similarly r × n, and Σ_r is diag(σ_1, . . . , σ_r). We write this truncated form as follows:

A = U_r Σ_r V_r^*    (2.7)

In this truncated form the columns of U_r and V_r form orthonormal bases for the column spaces of A and A^*, respectively. The σ_1, . . . , σ_r are commonly referred to as the singular values of A. More importantly, these singular values give the lower bounds on the error of any rank-k approximation of A in the spectral and Frobenius norms. For A_k = U_k Σ_k V_k^* we have that:

‖A − A_k‖_2 = σ_{k+1}    (2.8)


‖A − A_k‖_F = √( Σ_{j=k+1}^{min(m,n)} σ_j² )    (2.9)

The SVD plays a dual role with regard to matrix approximation. Firstly, algorithms exist to compute the decomposition with asymptotic cost O((m + n)mn) [6], from which we may obtain a rank-k approximation for k = 1, . . . , r − 1. For the special case of k = 0 we have that ‖A‖_2 = σ_1. Secondly, the rank-k approximation from the SVD serves as the optimal rank-k approximation against which other decompositions, and the approximation algorithms that produce them, are evaluated and compared. It is indeed the high cost of producing the SVD factorization for MMDS's that has motivated the search for new LRA techniques. We also note that from an SVD representation of A we may write its low-rank format by first forming the product (Σ_k V_k^*):

A_k = U_k · (Σ_k V_k^*)    (2.10)

Alternatively, we may also write:

A_k = (U_k Σ_k) · V_k^*    (2.11)
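A minimal NumPy sketch (an illustration under assumed sizes, not code from the survey) of forming the rank-k approximation A_k from a computed SVD in the factored form of eq. (2.10), and of checking eq. (2.8):

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((300, 200))
k = 10

U, s, Vt = np.linalg.svd(A, full_matrices=False)  # A = U diag(s) V^*
G = U[:, :k]                                      # U_k
H = s[:k, None] * Vt[:k, :]                       # Sigma_k V_k^*, the second factor in eq. (2.10)
A_k = G @ H                                       # rank-k approximation

# eq. (2.8): the spectral-norm error equals the (k+1)-st singular value
print(np.linalg.norm(A - A_k, 2), s[k])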

Starting with the QR factorization, we present other important classical decompositions in subsequent sections, along with algorithms that generate them. The SVD may also be obtained from the QR factorization with an additional post-processing step appended to the QR algorithm.

    2.3 QR Decomposition

The QR decomposition is a factorization of a matrix A into the product of two matrices: a matrix Q with orthonormal columns providing a basis for the column space of A, and an upper triangular matrix R. The significance of this decomposition is evident from its usage as a preliminary step in determining a rank-k SVD decomposition of an m × n matrix A. The QR decomposition itself can be obtained faster than the SVD, in O(mn min(m, n)) time.

More formally, let A be an m × n matrix with m ≥ n whose elements may be complex. Then there exist an m × n matrix Q and an n × n matrix R such that

    A = QR (2.12)

where the columns of Q are orthonormal and R is upper triangular. Column i of A is a linear combination of columns of Q with the coefficients given by column i of R. In particular, by the upper triangular form of R, it is clear that column i of A is determined from the first i columns of Q. The existence of the QR factorization can be proven in a variety of ways. We present here a proof using the Gram-Schmidt procedure:

Suppose (a_1, a_2, . . . , a_n) is a linearly independent list of vectors in an inner product space V. Then there is an orthonormal list of vectors (q_1, q_2, . . . , q_n) such that:

    span(a1, a2, . . . , an) = span(q1, q2, . . . , qn). (2.13)


Proof: Let proj(r, s) := (⟨r, s⟩ / ⟨s, s⟩) s denote the projection of r onto s and apply the following steps to obtain the orthonormal list of vectors:

w_1 := a_1
w_2 := a_2 − proj(a_2, w_1)
...
w_n := a_n − proj(a_n, w_1) − proj(a_n, w_2) − · · · − proj(a_n, w_{n−1})
q_1 := w_1/‖w_1‖, q_2 := w_2/‖w_2‖, . . . , q_n := w_n/‖w_n‖

Rearranging the equations for w_1, w_2, . . . , w_n to be equations with a_1, a_2, . . . , a_n on the left-hand side and replacing each w_i with q_i gives A = QR where

    A = [a1, a2, . . . , an]

    Q = [q1, q2, . . . , qn]

R =
⎡ ⟨q_1, a_1⟩   ⟨q_1, a_2⟩   ⟨q_1, a_3⟩   . . .   ⟨q_1, a_n⟩ ⎤
⎢     0        ⟨q_2, a_2⟩   ⟨q_2, a_3⟩   . . .   ⟨q_2, a_n⟩ ⎥
⎢     0            0        ⟨q_3, a_3⟩   . . .   ⟨q_3, a_n⟩ ⎥
⎢     ⋮            ⋮            ⋮         ⋱          ⋮     ⎥
⎣     0            0            0        . . .   ⟨q_n, a_n⟩ ⎦

A problem with the practical application of the Gram-Schmidt procedure occurs in the case that rank(A) < n. It is necessary in this case to determine a permutation of the columns of A such that the first rank(A) columns of Q are orthonormal. Let P denote the n × n matrix representing this column permutation. We have the QRP formulation:

    A = QRP (2.14)

Enhancements to the basic QR algorithm to obtain both the QR factorization and the permutation matrix are known as QR with column pivoting. A further improvement discerns the rank of the matrix A, though at a higher cost. The goal is to find the permutation matrix P for the construction of R such that:

R = ( R_11  R_12
       0    R_22 )    (2.15)

If ‖R_22‖ is small and R_11 is r × r, it can be shown that σ_{r+1}(A) ≤ ‖R_22‖, so that A has numerical rank r. These constructions are known as rank-revealing QR (RRQR) factorizations and are the most commonly used forms of the QR algorithm today. A deterministic RRQR algorithm was given by Gu and Eisenstat in their seminal paper [5] that finds a k-column subset C of the input matrix A such that the projection of A onto C has error relative to the best rank-k approximation of A as follows:

‖A − CC†A‖_2 ≤ √(1 + k(n − k)) ‖A − A_k‖_2    (2.16)

The above result matches the classical existence result of Ruston [7]. The reader is referred to [5] for more information on RRQR factorizations.
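As a hedged illustration, the following Python sketch uses SciPy's column-pivoted QR as a practical stand-in for an RRQR factorization; the Gu-Eisenstat algorithm uses stronger pivoting, so this greedy variant need not attain the bound (2.16), but it shows the mechanics of selecting a k-column subset and measuring its projection error. The test matrix and sizes are assumptions.

import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(2)
m, n, k = 400, 300, 20
# test matrix with rapidly decaying singular values
U, _ = np.linalg.qr(rng.standard_normal((m, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
s = 2.0 ** -np.arange(n)
A = (U * s) @ V.T

# column-pivoted QR: A[:, piv] = Q R, pivots order columns by a greedy importance criterion
Q, R, piv = qr(A, mode='economic', pivoting=True)
C = A[:, piv[:k]]                              # k-column subset

proj_err = np.linalg.norm(A - C @ (np.linalg.pinv(C) @ A), 2)
best_err = s[k]                                # ||A - A_k||_2 = sigma_{k+1}
print(proj_err, best_err, np.sqrt(1 + k * (n - k)) * best_err)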


2.4 Skeleton (CUR) Decomposition

A different approach to LRA is one that determines a subset of actual rows and columns of the input matrix as factors in an approximation, instead of an orthogonal matrix. The CUR decomposition consists of a matrix C of a subset of columns of the original matrix A and a matrix R containing a subset of rows of A. U is a suitably chosen matrix to complete the decomposition. The problem described in this section is the submatrix selection problem.

Let A be an m × n matrix of real elements with rank(A) = r. Then there exists a nonsingular r × r submatrix Â of A. Moreover, let I and J be the sets of row and column indices of A, respectively, that define Â, and set C = A(1..m, J) and R = A(I, 1..n). For U = Â^{−1}, we have that:

    A = CUR (2.17)

Therefore, it is clear that a subset of r columns and a subset of r rows capture A's column and row spaces, respectively. This skeleton is in contrast to the SVD's left and right singular vectors, which form unitary matrices. While it is NP-hard to find optimal row and column subsets, an advantage of this representation is that its content is conducive to being understood in application terms and with domain knowledge. Moreover, the CUR decomposition may preserve structural properties of the original matrix that would otherwise be lost in the admittedly somewhat abstract decompositional reduction to unitary matrices. On the other hand, it is not guaranteed that Â is well-conditioned.

The CUR decomposition requires O((m + n + r)r) memory space and may be simplified to a rank factorization format by writing GH = CUR, where G = CU and H = R, or G = C and H = UR. We shall see in a later section that researchers from both Numerical Linear Algebra (NLA) and Theoretical Computer Science (TCS) have provided different algorithmic approaches to LRA employing the CUR decomposition. NLA algorithms have focused on the particular choice of an Â that maximizes the absolute value of its determinant, while TCS favors column and row sampling strategies based on sampling probabilities derived from Euclidean norms of either the matrix's singular vectors or of its actual rows and columns.

It is the construction of the sampling probabilities from singular vectors, commonly known as leverage scores, which is responsible for the computational complexity bound in the TCS approach. In the NLA algorithms the absolute value of the determinant is a proxy for quantifying the orthogonality of the columns in a matrix. This topic will be discussed in more detail in a later section.

A variation of the above CUR strategy is to obtain a rank-k matrix C from columns of the original matrix A and project A onto C. This approximate decomposition is given by A ≈ CC†A and is known as a CX decomposition, where X := C†A. The key idea is to project the matrix A onto a rank-k subspace of A as given by C. Thus, a rank-k factorization may be given by GH = CC†A where G = C and H = C†A. In the next section we present a related form of decomposition known as the Interpolative Decomposition.
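A small sketch of the exact skeleton of eq. (2.17) for a matrix of exact rank r; here the index sets I and J are found heuristically with pivoted QR (an assumption made for illustration, since the decomposition only asserts that suitable index sets exist):

import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(3)
m, n, r = 200, 150, 8
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))   # rank(A) = r exactly

# pick r column and r row indices; pivoted QR is one heuristic way to find a nonsingular A_hat
_, _, colpiv = qr(A, mode='economic', pivoting=True)
J = colpiv[:r]
_, _, rowpiv = qr(A[:, J].T, mode='economic', pivoting=True)
I = rowpiv[:r]

C = A[:, J]                          # m x r columns of A
R = A[I, :]                          # r x n rows of A
U = np.linalg.inv(A[np.ix_(I, J)])   # U = A_hat^{-1}

print(np.linalg.norm(A - C @ U @ R)) # ~0 up to roundoff: the skeleton is exact at rank r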

    2.5 Interpolative Decomposition (ID)

The intuition motivating the ID is that if an m × n matrix A has rank k, then it is reasonable to expect to be able to use some representative subset of k columns of A (let's call this column subset B) to represent all n columns of A. In effect, the columns of B serve as a basis of A. Consequently, we only need to construct a k × n matrix P to express each column i of A for i = 1 . . . n as a linear combination of the columns of B. This intuition leads us to the Interpolative Decomposition Lemma:

Suppose A is an m × n matrix of rank k whose elements may be complex. Then there exist an m × k matrix B consisting of a subset of columns of A and a k × n matrix P such that:

    1. A = B · P

2. The k × k identity matrix I_k appears as some column subset of P

    3. |pij | ≤ 1 for all i and j

Finding the subset of k columns from a choice of n columns is NP-hard, and algorithms based on the above conditions can be expensive. But the computation of the ID is made easier [1] by relaxing the requirement of |p_ij| ≤ 1 to |p_ij| ≤ 2. The B factor of the ID, as with the C and R matrices of the CUR decomposition, facilitates data analysis, and it inherits properties of the matrix A. We may ask if the ID can be extended into the form of a two-sided ID decomposition where the rows of B are a basis for the row space of A. The existence of such a decomposition is given in [8] with the Two-sided Interpolative Decomposition Theorem: Let A be an m × n matrix and k ≤ min(m, n). Then there exists:

A = P_L ( I_k ; S ) A_S ( I_k | T ) P_R^* + X    (2.18)

such that P_L and P_R are permutation matrices, and S ∈ C^{(m−k)×k}, T ∈ C^{k×(n−k)}, and X satisfy:

‖S‖_F ≤ √(k(m − k))    (2.19)

‖T‖_F ≤ √(k(n − k))    (2.20)

‖X‖_2 ≤ σ_{k+1}(A) √(1 + k(min(m, n) − k))    (2.21)

In the above formulation A_S is a k × k submatrix of A. Though we will not investigate this decomposition any further in this survey, we mention it here to point out that this CUR-like decomposition includes a residual term X that is bounded by the (k + 1)-st singular value of A. To some extent we may infer an increased difficulty of the submatrix selection problem versus that of just column subset selection in the one-sided Interpolative Decomposition.
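A sketch of the one-sided ID computed from a column-pivoted QR factorization, a standard construction that in practice satisfies the relaxed entry bound rather than the strict |p_ij| ≤ 1 condition; the test matrix and sizes are assumptions for illustration.

import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(4)
m, n, k = 300, 200, 15
A = rng.standard_normal((m, k)) @ rng.standard_normal((k, n)) + 1e-8 * rng.standard_normal((m, n))

# column-pivoted QR: A[:, piv] = Q R, with R = [R11 R12]
Q, R, piv = qr(A, mode='economic', pivoting=True)
B = A[:, piv[:k]]                                   # k selected columns of A
T = np.linalg.solve(R[:k, :k], R[:k, k:])           # interpolation coefficients R11^{-1} R12

P = np.empty((k, n))
P[:, piv[:k]] = np.eye(k)                           # identity block sits on the chosen columns
P[:, piv[k:]] = T
print(np.linalg.norm(A - B @ P) / np.linalg.norm(A))  # small: A is numerically rank k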

    2.6 QR Conventional Algorithm and Complexity Cost

After having described some of the more prominent decompositions in the prior sections, we turn our attention to the conventional algorithms commonly utilized to produce them. We will limit our presentation to the QR and SVD factorizations for a couple of reasons. First, these are two of the most ubiquitous factorizations in practice, and much effort has been invested over the years to enhance the performance and functionality of their algorithms. Moreover, an analysis of the techniques used in their production is sufficient to convey the limitations and inflexibilities that motivate the search for more robust algorithmic approaches.

Let's first present a method for generating the Q and R factors of A which addresses computational issues arising from the Gram-Schmidt procedure. As an alternative to Gram-Schmidt, consider an orthogonal matrix product Q_1 Q_2 . . . Q_n that transforms A to upper triangular form R:

    (Qn . . . Q2Q1)A = R

Multiplying both sides by (Q_n . . . Q_2 Q_1)^{−1} yields:

(Q_n . . . Q_2 Q_1)^{−1}(Q_n . . . Q_2 Q_1) A = (Q_n . . . Q_2 Q_1)^{−1} R

A = Q_1 Q_2 . . . Q_n R

    Note that a product of orthogonal matrices is also orthogonal so allowing for column-pivoting we have that:

    AΠ = Q1Q2 . . . QnR (2.22)

A Householder reflection matrix is used for each Q_i, i = 1, 2, . . . , n, to transform A to R column-wise. More formally, the Householder matrix-vector multiplication Hx = (I − 2vv^T)x reflects a vector x across the hyperplane normal to v. The unit vector v is constructed for each Q_i Householder matrix so that the entries of column i below the diagonal of A vanish.

1. x = (a_{ii}, a_{i+1,i}, . . . , a_{mi}), the entries of column i from the diagonal down

    2. v depends upon x and the standard basis vector ei

    3. The matrix product Qi ·A is applied

    4. The above items are repeated for each column of A

The impact on the QR algorithm is that Householder matrices improve numerical stability through multiplication by orthogonal matrices. The chain of Q_i's is not collapsed entirely together in a manner that would result in just one matrix-matrix multiplication operation. It is also possible that the matrix product Q_i · A is implemented as a series of matrix-vector multiplications instead.

While parallelized deployments of the QR algorithm are utilized, the parallelization on a massive scale that we seek for MMDS's is not suited to the above-described algorithmic enhancements. We mention here that there also exists a variation of the QR algorithm that uses another type of orthogonal transformation, the Givens rotation, though the Householder version is more commonly used. A more detailed discussion of these QR algorithms may be found in one of the popular linear algebra textbooks such as [6]. We next turn our attention to algorithms for computing an SVD decomposition.
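For concreteness, here is a compact, unblocked Householder QR in the spirit of the description above (a textbook-style sketch for real matrices with an assumed helper name; production codes instead use blocked LAPACK kernels):

import numpy as np

def householder_qr(A):
    """Factor A = Q R using Householder reflections (dense, unblocked, real A)."""
    m, n = A.shape
    R = A.astype(float)
    Q = np.eye(m)
    for i in range(min(m, n)):
        x = R[i:, i]
        normx = np.linalg.norm(x)
        if normx == 0.0:
            continue
        v = x.copy()
        v[0] += np.copysign(normx, x[0])                 # reflector vector for column i
        v /= np.linalg.norm(v)
        R[i:, i:] -= 2.0 * np.outer(v, v @ R[i:, i:])    # apply H_i = I - 2 v v^T on the left
        Q[:, i:] -= 2.0 * np.outer(Q[:, i:] @ v, v)      # accumulate Q = H_1 H_2 ... H_n
    return Q, R

A = np.random.default_rng(5).standard_normal((6, 4))
Q, R = householder_qr(A)
print(np.allclose(Q @ R, A), np.allclose(np.tril(R[:4, :4], -1), 0))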


2.7 SVD Deterministic Algorithm and Complexity Cost

The standard SVD algorithm is based on the work of Golub and Reinsch [3], though we will also review an alternative algorithm that uses a QR algorithm with post-processing. The SVD decomposition A = UΣV^* is computed in two distinct steps:

1st Step: Use two sequences of Householder transformations to reduce A to upper bidiagonal form:

    1. B = Qn . . . Q2Q1AP1P2 . . . Pn−2

    2. Therefore, we have that: A = Q1Q2 . . . QnBPn−2 . . . P2P1

2nd Step: Use two sequences of Givens rotations (orthogonal transformations) to reduce B to diagonal form Σ:

    1. Σ = Gn−1 . . . G2G1BF1F2 . . . Fn−1

    2. Likewise, we have that: B = G1G2 . . . Gn−1ΣFn−1 . . . F2F1

    3. Set U := Q1Q2 . . . QnG1G2 . . . Gn−1

    4. Set V := (F1F2 . . . Fn−1)∗(P1P2 . . . Pn−2)

A truncated SVD for a rank-k approximation may be obtained from running this algorithm, though there is no saving available in the arithmetic complexity. According to [28], the first step of bidiagonal reduction (BRD) can consume at least seventy percent of the time for the SVD algorithm. In practice, BRD consists of the repeated construction of Householder reflectors and updates of the matrix using the reflectors. Two matrix-vector multiplications during each reflector construction involve the remaining subdiagonal portion of the matrix being reduced. Depending upon the implementation, if the sequence of matrix updates and matrix-vector multiplications results in frequent data transfers across the memory hierarchy, a memory bottleneck may result. The situation may be exacerbated for matrices that are larger than the available cache. It should be clear that this has implications for processing MMDS's and motivates, in part, the search for new LRA strategies.

There exists an alternative to the above standard SVD algorithm which relies on first obtaining a rank-k QR factorization. In this case we have that:

    A = Q1Q2 . . . QnRΠ∗ + E (2.23)

where E is a residual error term. The SVD algorithm described above may be applied to the product RΠ^* with the result:

    RΠ∗ = XΣV ∗ (2.24)

Letting Q = Q_1 Q_2 . . . Q_n and noting that the product U = QX is also orthogonal, we have the following rank-k SVD decomposition of A:

    A = UΣV ∗ + E (2.25)

While we have used the SVD algorithm, it is only applied to a matrix of typically much smaller dimension than A.
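A sketch of this QR-then-SVD route, using SciPy's column-pivoted QR as the rank-revealing step and truncating at an assumed rank k (illustrative; an adaptive implementation would choose k from the decay of R's diagonal):

import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(6)
m, n, k = 500, 400, 25
A = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))   # numerically rank k

# rank-k pivoted QR: A Pi ~ Q_k R_k, i.e. A ~ Q_k (R_k Pi^T) + E
Q, R, piv = qr(A, mode='economic', pivoting=True)
Qk, Rk = Q[:, :k], R[:k, :]
RkPiT = np.empty_like(Rk)
RkPiT[:, piv] = Rk                 # undo the column permutation

# SVD of the small k x n factor, then lift back with Q_k
X, s, Vt = np.linalg.svd(RkPiT, full_matrices=False)
U = Qk @ X                         # orthonormal since Q_k and X are

A_k = (U * s) @ Vt
print(np.linalg.norm(A - A_k) / np.linalg.norm(A))   # small residual ||E||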


3 Approximation and Probabilistic Results

In recent years a number of theoretical results have appeared in the literature concerning the approximation of a matrix by one of a smaller size in terms of rows and/or columns. Three broad strategies have been identified by which we may seek to reduce matrix size. The first of these is dimension reduction, in which the goal is to approximate a matrix by one of much smaller rank than the original matrix. A second approximation strategy is to choose a subset of columns (or rows) from the original matrix which are most representative of the original matrix, thereby preserving the salient characteristics of the original matrix in the approximation. In the third strategy a submatrix consisting of a subset of both rows and columns of the original matrix is chosen to formulate an approximate matrix of smaller size. Each of these strategies has received considerable attention by researchers and, more often than not, from both Theoretical Computer Science and Numerical Linear Algebra perspectives. An informative and in-depth comparison of these two points of view and their cultural differences may be found in [2]. This section covers the three broad strategies and, in particular, the random multiplier matrices utilized in dimension reduction.

    3.1 Dimension Reduction

It is instructive to begin the discussion of dimension reduction, as it applies to matrices, with a related topic concerning points in Euclidean space. A seminal paper by Johnson and Lindenstrauss [9] proved that, given a set of n points of dimension d, it is possible to approximately preserve distances between any two points in O(log n)-dimensional space.

Let X_1, X_2, . . . , X_n ∈ R^d. Then for ε ∈ (0, 1) there exists Φ ∈ R^{k×d} with k = O(ε^{−2} log n) such that:

(1 − ε)‖X_i − X_j‖_2 ≤ ‖ΦX_i − ΦX_j‖_2 ≤ (1 + ε)‖X_i − X_j‖_2    (3.1)

That is, the target dimension depends only upon the number of points. The immediate consequences of this result were obvious and significant for the Nearest Neighbor problem, though not readily apparent to the NLA community. The mapping matrix Φ (the J-L transform) can be constructed as a random Gaussian k × d matrix. Alternatively, Achlioptas [10] demonstrated that a matrix obtained from Bernoulli random {+1, −1} entries could also be used. Perhaps of even more importance was the finding that sparsity could be introduced into the random matrix, thereby achieving reduced matrix-vector multiplication complexity. In this case {+1, −1} values are each chosen with probability p = 1/6 and zero otherwise.
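A minimal sketch of a Gaussian J-L transform applied to a point set, assuming NumPy; the dimensions are arbitrary illustrative values, and the 1/√k scaling is the usual normalization so that squared lengths are preserved in expectation.

import numpy as np

rng = np.random.default_rng(0)
n, d, k = 500, 10_000, 400             # points, ambient dimension, target dimension (illustrative)
X = rng.standard_normal((n, d))

# Gaussian J-L transform: k x d with N(0, 1/k) entries
Phi = rng.standard_normal((k, d)) / np.sqrt(k)
Y = X @ Phi.T                          # projected points, n x k

# check distortion of one pairwise distance
i, j = 0, 1
orig = np.linalg.norm(X[i] - X[j])
proj = np.linalg.norm(Y[i] - Y[j])
print(orig, proj, proj / orig)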

Sarlos [11] utilized the Johnson-Lindenstrauss lemma, building on Achlioptas' result, to provide the first relative-error approximation in terms of the Frobenius norm for LRA in a constant number of passes over the matrix. If A ∈ R^{m×n} and B is an r × n J-L transform with i.i.d. zero-mean entries {−1, +1} for r = Θ(k/ε + k log k) and ε ∈ (0, 1), then with probability p ≥ 0.5 we have that:

‖A − Proj_{AB^T,k}(A)‖_F ≤ (1 + ε)‖A − A_k‖_F    (3.2)

where Proj_{AB^T,k}(A) is the best rank-k approximation of the projection of A onto the column space of AB^T. This result extended the preservation of distance metrics for vectors to that of the actual matrix subspace structure utilizing J-L transforms. Moreover, it suggests a general two-step strategy for dimension reduction. In the first step a random subspace is created from the application of the J-L transform to the matrix A. A rank-k approximation of A is then obtained in the second step after projecting A onto the subspace generated in the first step. Thus, the randomization in the construction of the J-L transform of step 1 enables a new approach to SVD approximation with constant probability. We cover the important topic of random multiplier matrices as given by J-L transforms in more detail in the next section. Another important implication is that to arrive at a rank-k approximation of A, we must formulate r > k random linear combinations of columns of A.

Sarlos' result was actually preceded in the NLA literature by Papadimitriou et al. [12], which initially proposed using random projections in Latent Semantic Indexing (LSI) applications. Briefly, LSI is concerned with information retrieval and the evaluation of the spectral properties of term-document matrices, which capture documents along one matrix dimension and the terms found in those documents along the other dimension. Each entry in the matrix contains a count of the occurrences of the particular term for a given document. Papadimitriou's result provides a weaker additive error bound than Sarlos in terms of a rank-2k approximation Â_{2k} of a matrix A:

‖A − Â_{2k}‖_F² ≤ ‖A − A_k‖_F² + 2ε‖A‖_F²    (3.3)

The weakness in the additive error bound is due to the second term on the right-hand side of the above result, because ‖A‖_F can be arbitrarily large. Nonetheless, Papadimitriou provided mathematical rigor in explaining why LRA can be used to capture the salient features of term-document matrices. Eventually, a relative-error bound in the spectral norm was given by Halko et al. [1], which relies on a power iteration to obtain the following result:

Let A ∈ R^{m×n}. If B is an n × 2k Gaussian matrix and Y = (AA^*)^q AB, where q is a small non-negative integer and 2k is the target rank of the approximation with 2 ≤ k ≤ 0.5 min{m, n}, then:

E ‖A − Proj_{Y,2k}(A)‖_2 ≤ [ 1 + 4 √( 2 min(m, n) / (k − 1) ) ]^{1/(2q+1)} ‖A − A_k‖_2    (3.4)

A power iteration factor (AA^*)^q appears in Y to address any case of slow decay in the singular values of A that might otherwise negatively affect the LRA accuracy. Thus, the accuracy of the approximation can be refined by a larger choice of q. It can be shown that the SVD of (AA^*)^q A preserves the left and right singular vectors of A. The singular value matrix of (AA^*)^q A is Σ^{2q+1}, where Σ is the singular value matrix of A. In practice the approach employed in [1] for most input matrices does not utilize a power iteration (i.e., q = 0). Instead, an oversampling parameter p, a small positive integer, is added to the desired rank-k value to specify the size of an n × (k + p) random multiplier matrix. The choice of value for p involves a number of factors; please see [1] for more details. An improvement to their proof was given by Woodruff [13] that realizes an actual rank-k approximation. Woodruff also refined the proof using results for bounds on the maximum and minimum singular values of Gaussian random matrices [14].
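The two-step strategy with a Gaussian multiplier, oversampling parameter p, and optional power iterations q can be sketched as follows. This is a hedged illustration in the spirit of [1]: the helper name, parameter values, and test matrix are assumptions, and a careful implementation would re-orthonormalize between power iterations.

import numpy as np

def randomized_svd(A, k, p=10, q=1, rng=None):
    """Rank-k SVD approximation via a Gaussian random sketch (illustrative sketch)."""
    rng = rng or np.random.default_rng()
    m, n = A.shape
    Omega = rng.standard_normal((n, k + p))      # random multiplier, n x (k + p)
    Y = A @ Omega                                # sample the range of A
    for _ in range(q):                           # power iterations sharpen slow spectral decay
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)                       # orthonormal basis for the sketched range
    B = Q.T @ A                                  # small (k + p) x n projected matrix
    Uhat, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Uhat
    return U[:, :k], s[:k], Vt[:k, :]

rng = np.random.default_rng(7)
A = (rng.standard_normal((1000, 40)) * 2.0 ** -np.arange(40)) @ rng.standard_normal((40, 800))
U, s, Vt = randomized_svd(A, k=10, rng=rng)
print(np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A))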


From the relative-error bound for a matrix A we have seen that, given a sample of r random linear combinations of columns of A, we may obtain a rank k < r approximation of A. Perhaps a more insightful explanation is to recognize the following:

    1. Multiplying A by a random vector x yields a vector y ∈ colspace(A).

2. With high probability a set of r such y's is linearly independent.

3. A new approximate basis Â for the column space of A, consisting of the y's, has dimension r.

4. If A is projected onto Â, a rank-k decomposition of this projection approximates the truncated rank-k SVD decomposition of A.

The most expensive aspect of the random projection approach is the multiplication of A by a random multiplier matrix. While matrix multiplication is an embarrassingly parallel operation, memory bottlenecks become a concern with MMDS's, and the type of random multiplier involved can have an adverse impact on wall-clock performance. We shall see in the next section that we may use structured random multipliers besides Gaussian matrices. Structured matrices are beneficial in reducing the number of floating point operations (FLOPS) as compared to Gaussian matrices and in the amount of storage that they require.

    3.2 Subspace Projections with Random Multiplier Matrices

In the prior subsection the discussion of J-L transforms described Gaussian random matrices or matrices of Bernoulli random {+1, −1} entries. Unfortunately, matrix-vector and matrix-matrix multiplication using such dense matrices is expensive. In the case of matrix-vector multiplication with a dense R^{k×d} matrix, O(kd) arithmetic operations are required for each vector X ∈ R^d. One alternative involves the sparsification of the J-L transform, first proposed by Achlioptas [10]. In this sparse variant of the J-L transform, each element is chosen from a probability distribution where {+1, −1} are each chosen with probability 1/6 and zero is selected with probability 2/3. A scaling constant √(3/k) completes the definition of this J-L transform. While this approach can be effective for dense vectors, it is problematic when the vector itself is also sparse. While researchers have focused attention in recent years on J-L transforms for the specific case of sparse vectors, we concern ourselves for the remainder of this section with the general case.
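A sketch of this sparse transform, with entries ±√(3/k) each with probability 1/6 and zero with probability 2/3 (NumPy assumed; sizes are illustrative):

import numpy as np

rng = np.random.default_rng(8)
d, k, n = 5000, 300, 200
X = rng.standard_normal((n, d))                      # n points in R^d

# entries of Phi: +1, -1 each with probability 1/6, 0 with probability 2/3, scaled by sqrt(3/k)
vals = rng.choice([1.0, -1.0, 0.0], size=(k, d), p=[1/6, 1/6, 2/3])
Phi = np.sqrt(3.0 / k) * vals                        # roughly two thirds of the entries are zero

Y = X @ Phi.T
i, j = 0, 1
print(np.linalg.norm(X[i] - X[j]), np.linalg.norm(Y[i] - Y[j]))   # comparable distances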

The next significant result was the introduction of the Fast Johnson-Lindenstrauss Transform (FJLT) by Ailon and Chazelle [15]. Their efforts addressed the limitations of processing sparse vectors while reducing the complexity of dense matrix-vector multiplication. They introduced a transform, a random structured matrix, defined as the product of three matrix factors in which two of the matrices are randomized and the third is the Hadamard matrix.

Let the FJLT Φ = PHD with d = 2^l, such that:

• P ∈ R^{k×d}

• H, D ∈ R^{d×d}

• P_ij ∼ N(0, q^{−1}) with probability q and P_ij = 0 with probability 1 − q, for q = min(Θ(log² n / d), 1)

• H is the d × d normalized Hadamard matrix with entries ±d^{−1/2}, built recursively from the 2 × 2 block ( d^{−1/2}  d^{−1/2} ; d^{−1/2}  −d^{−1/2} ) for d = 2^h, h = 0, 1, . . . , l

• D is diagonal with D_ii drawn independently from {1, −1} with probability 1/2.

Then we have, with probability 2/3, for each X_i ∈ R^d:

(1 − ε)k‖X_i‖_2 ≤ ‖ΦX_i‖_2 ≤ (1 + ε)k‖X_i‖_2    (3.5)

Building Φ requires O(d log d + min(dε^{−2} log n, ε^{−2} log³ n)) operations. Moreover, the complexity of matrix-vector multiplication is O(d log d + |P|). The motivation for the FJLT is to have a matrix with sparsity as proposed by Achlioptas that also negates the sparse-vector scenario while reducing the arithmetic complexity of multiplication. The H and D matrices are orthogonal matrices; therefore, matrix-vector multiplication with them preserves vector norms and distances between vectors. According to [15], H densifies sparse vectors while D provides enough randomization to prevent dense vectors from becoming sparse. The P matrix provides sparsity to the transform, similarly to [16]. The structure inherent in the FJLT as given by the recursive definition of H, rather than the sparsity given in P, provides for the improved matrix-vector multiplication complexity.

Woolfe et al. [16] subsequently applied the FJLT to LRA by formulating the subsampled random Fourier transform (SRFT) in the complex case. Let the SRFT Φ = √(n/l) DFS such that:

• D ∈ C^{n×n} is a diagonal matrix whose entries are i.i.d. random variables distributed uniformly on the unit circle.

• F is the n × n Discrete Fourier Transform (DFT) matrix.

• S is an n × l matrix whose columns are sampled uniformly from the n × n identity matrix.

Therefore, a random subspace may be created from an m × n matrix using the SRFT as a random multiplier. In comparison to a Gaussian multiplier that requires nl random entries, the SRFT requires only (n + l) random entries: n entries for the D matrix and l for the matrix S. The SRFT matrix-matrix multiplication is performed using O(mn log l) flops. In their algorithm, for real α, β greater than 1 such that m > l ≥ α²β(α − 1)^{−2}(2k)², the accuracy of a rank-k approximation Â_k of A ∈ C^{m×n} satisfies, with probability p ≥ 1 − 3/β:

‖A − Â_k‖ ≤ 2(√(2α − 1) + 1)(√(α max(m, n) + 1) + √(α max(m, n))) ‖A − A_k‖_2    (3.6)
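A sketch of applying an SRFT multiplier with the FFT, following the definition above; the DFT normalization and the test matrix are assumptions of this illustration, and only the l sampled columns of the transform are ever formed.

import numpy as np

rng = np.random.default_rng(9)
m, n, l = 400, 1024, 60
A = rng.standard_normal((m, 30)) @ rng.standard_normal((30, n))   # numerically low rank

d = np.exp(2j * np.pi * rng.random(n))          # D: i.i.d. entries uniform on the unit circle
cols = rng.choice(n, size=l, replace=False)     # S: uniform column sampling

# Y = A Phi with Phi = sqrt(n/l) D F S; (A D) F is an FFT along each row of A D
Y = np.sqrt(n / l) * np.fft.fft(A * d, axis=1)[:, cols]

Q, _ = np.linalg.qr(Y)                          # orthonormal basis for the sampled range
err = np.linalg.norm(A - Q @ (Q.conj().T @ A)) / np.linalg.norm(A)
print(err)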

A similar random structured matrix may be formed with the DFT matrix replaced by a Hadamard matrix and with the diagonal matrix D consisting of randomly chosen {1, −1} entries on the diagonal, as in the SRFT case. The primary drawback of SRFT's compared to Gaussian matrices is a theoretically higher probability of failure [1]. For Gaussian matrices the probability of failure with oversampling parameter p is e^{−p}, while that of SRFT's for a rank-k approximation increases to 1/k.

We may ask what other types of random structured matrices can be used that are perhaps faster than the SRFT for dimension reduction. The key to answering this question lies in a change in the definitions of the input matrix A and the random Gaussian multiplier B used to generate a random subspace in the first step of dimension reduction. According to the Dual Theorem in Pan et al. [17], assume that A ∈ R^{m×n} is an average input matrix with numerical rank at most r under the Gaussian probability distribution and that B ∈ R^{n×l} has numerical rank l. If l ≥ r, then dimension reduction using B succeeds in outputting a rank-l approximation to A. This implies that a unitary matrix B, or a matrix that is both full-rank and well-conditioned (reasonably bounded condition number), is sufficient. Recall that the condition number of a unitary matrix is equal to one. This result suggests that we can expect to have success with random multipliers that are formed using structured and sparse orthogonal matrices. Moreover, as it concerns MMDS's, we also benefit from the use of structured multipliers by reducing the memory space needed for their storage.

Therefore, the possibility exists to find more efficient random multipliers, that is, those having a lower complexity bound than the O(mn log l) flops for matrix-matrix multiplication with an SRFT. One such possibility is to employ Hadamard and Fourier matrices that are defined up to only a few recursive levels, thus inducing a sparse and orthogonal matrix. Indeed, the numerical experiments in [17] show that random multipliers using these abridged and orthogonal Hadamard matrices in place of a full Hadamard matrix are promising. Currently, there is no formal support for this specific type of matrix, though it should not be surprising that orthogonal matrices are effective multipliers, given that matrix-vector multiplication with them preserves vector norms as well as the distances and angles between vectors.

3.3 Approximations with Column (or Row) Subsets

We now turn our attention to the problem of identifying a suitable subset of columns (or rows) of a matrix A ∈ R^{m×n} that may also be optionally processed further in some manner in order to obtain an approximation of A. We previously saw a classical result in an earlier section concerning the existence of a k-column subset C of A such that:

‖A − CC†A‖_2 ≤ √(1 + k(n − k)) ‖A − A_k‖_2    (3.7)

The CC†A term represents a CX approximation to A for X := C†A, in which A is projected onto the column space of C by the projection matrix CC†.

An influential paper by Frieze et al. [19] introduced a strategy of creating sampling probabilities from the Euclidean norms of the rows and columns of a matrix, from which to subsequently sample a subset of columns and rows. Their key theoretical finding assumes probabilities P_i for rows A_{(i)}, i = 1 . . . m, and a constant c ≤ 1 such that:

P_i ≥ c |A_{(i)}|² / ‖A‖_F²    (3.8)

Theorem 1. [19]: Let R be a sample of r rows of A chosen from the above distribution and let W be the vector space spanned by R. There exists with probability p ≥ 0.9 an orthonormal set of vectors w_{(1)}, w_{(2)}, . . . , w_{(k)} in W such that:


‖ A − A Σ_{i=1}^{k} w_{(i)} w_{(i)}^T ‖_F² ≤ ‖A − A_k‖_F² + (10k / (cr)) ‖A‖_F²    (3.9)

The authors applied Theorem 1 to provide an algorithm that samples a subset of columns and rows of A to form a rank-k approximation Â_k of A with additive error bound:

‖A − Â_k‖_F² ≤ ‖A − A_k‖_F² + ε‖A‖_F²    (3.10)

The additive error bound is weak in the sense that ‖A‖_F² in the second term on the right-hand side may be arbitrarily large. While the algorithm has polynomial time complexity in k and 1/ε, the complexity bound does not include the computation of the sampling probabilities for the rows and columns of A, and the sample complexity (number of rows) is O(k⁴). Otherwise, it is interesting to note that the running time is independent of the matrix size.

Subsequently, Deshpande et al. [20] improved upon the above result of [19], utilizing a volume-sampling technique to generate a multiplicative error bound that is more refined than its additive counterpart. They showed that there exists a set of k rows of A whose span contains a rank-k matrix Ã_k that is a multiplicative approximation to the best rank-k matrix approximation A_k:

‖A − Ã_k‖_F ≤ √(k + 1) ‖A − A_k‖_F    (3.11)

Moreover, they extended this result into a stronger relative error approximation in the following theorem.

Theorem 2. [20]: In any m × n matrix A there exist O(k²/ε) rows in whose span are rows that form a rank-k matrix Ã_k, for an error parameter ε, such that:

‖A − Ã_k‖_F² ≤ (1 + ε)‖A − A_k‖_F²    (3.12)

The volume-sampling method utilized in this paper relies on volume distributions constructed for each k-subset of rows of A. The volume of a matrix B containing k rows is:

vol(B) = (1/k!) √(det(BB^T))    (3.13)

Thus, a k-row subset is chosen with probability proportional to the square of its volume. The improved error-bound approximation of Deshpande over that of [19] can at least in part be attributed to a volume metric that captures information about a matrix as a whole, as opposed to the Euclidean vector norms associated with individual rows and columns of a matrix. However, Frieze's algorithm involves only two passes, while Deshpande's requires multiple passes to obtain the relative error approximation. Both the algorithms of Deshpande [20] and Frieze [19] have sample complexity that is at least quadratic for CX approximation. This complexity bound was subsequently reduced by Rudelson and Vershynin [21] in an approach using the Law of Large Numbers adapted to matrices. The rationale underlying their work is that if a matrix has small numerical rank, then a low rank approximation should be available from a random submatrix. Though they only obtain an additive error approximation, it is done with an algorithm using at most two passes over the data and with O(k log k) sample complexity. Their additive error is given in the spectral norm.

Theorem 3. [21]: Suppose A is an m × n matrix of numerical rank r = ‖A‖_F² / ‖A‖_2², with ε, δ ∈ (0, 1), c ≥ 0, and d an integer such that:

m ≥ d ≥ c (r / (ε⁴δ)) log (r / (ε⁴δ))    (3.14)

Let a random submatrix Ã_k of d rows of A be sampled according to their squared Euclidean norms and let U_k be the k top left singular vectors of Ã_k. We have that with probability p ≥ 1 − 2 exp(−c/δ):

‖A − A U_k U_k^*‖_2 ≤ ‖A − A_k‖_2 + ε‖A‖_2    (3.15)

Another point of interest is that the sample size d depends upon the numerical rank and not upon the desired rank-k value for the approximation as in [19, 20].

Finally, an algorithm that realizes the CX relative-error existential result of Deshpande et al. [20] was given by Drineas et al. [22]. However, they take a different approach from previous papers concerning the construction of sampling probabilities. Recall that the previous sampling probabilities given in the literature utilized the squared Euclidean norms of the rows (or columns) of a matrix A. Drineas introduced the idea of subspace sampling according to the squared norms of the right singular vectors of A. Their argument is that this is an improvement over the previous column (row) sampling of the matrix due to linear span considerations. Suppose that the i-th column of a matrix A is given by:

A^{(i)} = UΣ(V^T)^{(i)}    (3.16)

Therefore, V^{(i)} is in some sense a measure of the extent to which A^{(i)} is contained in the span of U. The effect of Σ is eliminated as compared to the previous probability distribution approach because it does not affect U's span. Consequently, the probabilities p_i for i = 1 . . . n, otherwise known as leverage scores, associated with each column i of the best rank-k approximation A_k are given by:

p_i ≥ |(V_{A,k}^T)^{(i)}|_2² / k    (3.17)

The theorem for the LRA algorithm follows:

Theorem 4. [22]: Suppose A is an m × n matrix containing real entries and k ≪ min(m, n) is an integer. Then there exist randomized algorithms that choose a c-column subset, C, of A, where c = O(k² log(1/δ) / ε²), such that for ε, δ ∈ (0, 1] we have, with probability 1 − δ:

‖A − CC⁺A‖_F ≤ (1 + ε)‖A − A_k‖_F    (3.18)


A similar result holds if at most c = O(k log k log(1/δ) / ε²) columns are chosen in expectation. The complexity bound to build the CX in this algorithm is dominated by the cost required to derive the right singular vectors of the rank-k approximation. A less expensive alternative is to obtain approximate leverage scores. These may be derived according to a relative-error bound approximation in [23] with cost O(mn log k) to obtain the sampling probabilities corresponding to the top k singular vectors. An area of further research concerns the preconditioning of the matrix A by a suitable multiplier matrix such that the leverage scores of the product matrix are approximately uniform. In this case, the sampling algorithm includes a post-processing step to recover the original matrix. This strategy is justified if the two matrix multiplications can be done inexpensively, which implies using orthogonal multipliers, as their inverses are readily available. A randomized algorithm that satisfies Theorem 4 is presented in Section 4.
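A sketch of leverage-score column sampling for a CX approximation, computing exact leverage scores from the top-k right singular vectors (an approximate-score variant as in [23] would replace the SVD step; the rescaling of the sampled columns is customary and does not change their span). The sizes, test matrix, and sample count c are assumptions.

import numpy as np

rng = np.random.default_rng(10)
m, n, k, c = 500, 400, 10, 60
A = rng.standard_normal((m, k)) @ rng.standard_normal((k, n)) + 0.01 * rng.standard_normal((m, n))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
lev = np.sum(Vt[:k, :] ** 2, axis=0)            # leverage scores of the top-k right singular space
p = lev / k                                     # sampling probabilities, they sum to 1

idx = rng.choice(n, size=c, replace=True, p=p)  # sample c columns with replacement
C = A[:, idx] / np.sqrt(c * p[idx])             # rescaled sampled columns

err_cx = np.linalg.norm(A - C @ (np.linalg.pinv(C) @ A))
err_best = np.sqrt(np.sum(s[k:] ** 2))          # ||A - A_k||_F
print(err_cx, err_best)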

    3.4 Approximations with Column-Row Subset Combinations

We now extend the discussion of the column (row) subset approximation results of the prior section to that of approximations using column and row subsets simultaneously. Such approximations in the TCS community are commonly referred to as CUR decompositions, while among NLA researchers the terms CGR and matrix skeleton are also used.

An investigation concerning the adaptation of the sampling approach of the last section to CUR can be found in Drineas et al. [24]. Their algorithms are devised with the goal of accommodating out-of-core massive data sets, so that only O(cm + nr) RAM is required to obtain a CUR for A ∈ R^{m×n}, C ∈ R^{m×c}, and R ∈ R^{r×n}. Moreover, at most three passes over the matrix A are required.

Their linear-time algorithm, as with the column subset algorithms of the prior section, relies on sampling probabilities. These probabilities, p_i and q_j, are proportional to the squared Euclidean norms of the rows and columns, respectively, of A. Thus, C is computed by sampling c columns of A according to the column probabilities given by q_j. Likewise, R is constructed from r sampled rows of A according to the row probabilities p_i. The column and row subsets are scaled, and the matrix U is derived from additional processing of C and R. More formally, we have the following sampling probabilities:

p_i = |A_{(i)}|² / ‖A‖_F²,   i = 1 . . . m    (3.19)

q_j = |A^{(j)}|² / ‖A‖_F²,   j = 1 . . . n    (3.20)

The algorithm has cost complexity O(max(m, n)) and requires one pass over A to compute the probabilities and a second pass in which the matrices C and R are simultaneously obtained. An additive error in expectation is given as follows for 1 ≤ k ≤ min(c, r), provided that c ≥ 64k/ε⁴ and r ≥ k/ε²:

E[‖A − Â_k‖_F] ≤ ‖A − A_k‖_F + ε‖A‖_F    (3.21)

E[‖A − Â_k‖_2] ≤ ‖A − A_k‖_2 + ε‖A‖_F    (3.22)


Though the algorithm is polynomial in k and 1/ε, it is linear in the input matrix dimensions. It is assumed that the desired rank k is much smaller than min(m, n) and that c and r are sufficiently small to be considered constants. The additive error bounds for CUR were subsequently improved to relative error bounds by Drineas et al. [22] by extending the relative error bound result for CX decompositions of the same paper. Once again they use a modified sampling probability strategy based on the singular vectors of the input matrix. The complexity of the algorithm is bounded by the time required to compute the squared Euclidean norms (or approximations thereof) of the singular vectors.
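A hedged sketch of a norm-sampling CUR in the spirit of the linear-time algorithm above; for simplicity the middle factor is taken as the Frobenius-optimal U = C⁺AR⁺, which touches A again, rather than the paper's pass-efficient construction of U from C and R alone.

import numpy as np

rng = np.random.default_rng(11)
m, n, c, r = 600, 500, 40, 40
A = rng.standard_normal((m, 15)) @ rng.standard_normal((15, n))  # numerically low rank

q = np.sum(A ** 2, axis=0) / np.sum(A ** 2)      # column probabilities, eq. (3.20)
p = np.sum(A ** 2, axis=1) / np.sum(A ** 2)      # row probabilities, eq. (3.19)

jc = rng.choice(n, size=c, replace=True, p=q)
ir = rng.choice(m, size=r, replace=True, p=p)
C = A[:, jc] / np.sqrt(c * q[jc])                # scaled sampled columns
R = A[ir, :] / np.sqrt(r * p[ir])[:, None]       # scaled sampled rows

U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)    # simplification: Frobenius-optimal middle factor
print(np.linalg.norm(A - C @ U @ R) / np.linalg.norm(A))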

A rough sketch of the algorithm is presented here, and it is discussed in more detail in Section 4. The matrix C, a column subset of A, is generated by the algorithm of Theorem 4. The left singular vectors U_C of C are then used to compute probabilities q_i for the sampling of r rows from A to form R. The U matrix is the pseudoinverse of the matrix W ∈ R^{r×c} that is the intersection of R with C. Both R and W are scaled similarly to C.

q_i = |(U_C^T)^{(i)}|_2² / c,   i = 1 . . . m    (3.23)

The key theorem that enables the relative error bound for CUR follows:

Theorem 5. [22]: Suppose A is an m × n matrix containing real entries and C is an m × c matrix containing c columns of A obtained with the algorithm of Theorem 4. Set r = 3200 c²/ε² for ε ∈ (0, 1], and choose r rows of A (and the corresponding ones in C) as in the algorithm described immediately above. Then we have with probability p ≥ 0.7 that:

‖A − CUR‖_F ≤ (1 + ε)‖A − CC⁺A‖_F    (3.24)

A similar result holds if at most r = O(c log c / ε²) rows are chosen in expectation. Please see [22] for further details. Theorems 4 and 5 can now be combined to get the final CUR relative error bound as given in Section 5.2 of [22], where ε_p = 3ε:

‖A − CUR‖_F ≤ (1 + ε)‖A − CC⁺A‖_F ≤ (1 + ε)²‖A − A_k‖_F    (3.25)

(1 + ε)²‖A − A_k‖_F ≤ (1 + ε_p)‖A − A_k‖_F    (3.26)

According to [22], sampling according to the probabilities q_i as defined above is done so that R may contain those rows that capture a similar subspace as the first c right singular vectors of C. The running time of the algorithm, omitting the CX decomposition algorithm of Theorem 4 to obtain C, is O(mn). The requirements of O(k log k / ε²) columns and O(c log c / ε²) rows for a rank-k CUR approximation were subsequently lowered in a paper by Boutsidis and Woodruff [25]. Though [25] is conceptually similar to [22] in the overall approach, Boutsidis and Woodruff employ approximation and sparsification results from the literature to obtain an algorithm with running time O(nnz(A)), where nnz(A) is the number of nonzero elements of A.

Furthermore, their randomized algorithm requires only c = O(k/ε) columns and r = O(k/ε) rows for a rank-k CUR decomposition. The U matrix is constructed such that rank(U) = k.


To obtain the column subset C, an O(k log k) subset is sampled using sampling probabilities from an approximate SVD of A. This subset is then downsampled to O(k) columns with a spectral sparsification algorithm. An additional O(k/ε) columns are then sampled using an adaptive sampling technique of [20] to obtain a C matrix for the CUR decomposition that itself has a relative error bound with respect to the best rank-k approximation, A_k, of A. Likewise, the row subset component of the algorithm starts with an approximation algorithm that is used to find a rank-k orthonormal matrix in colspace(C) that yields a relative error bound with respect to A_k. Afterwards, sampling steps similar to those in the derivation of C follow to obtain R. Finally, the U matrix is assembled using matrices from the derivation steps for C and R.

This algorithm illustrates the variety of approximation, subset selection, and sparsification tools that can be combined to realize efficient hybrid LRA algorithms. However, this efficiency still requires the examination of all nonzero elements of A and the use of sophisticated machinery to generate column and row subsets with O(k/ε) elements. We may ask whether there exist CUR strategies that do not necessarily access all nnz(A) elements. Or is it possible to do so if we relax the constraint of referencing all mn elements of A? Indeed, the classical SVD and QR algorithms as well as dimension reduction all presume access to each entry of the matrix of interest.

    3.5 Pseudo-Skeleton CUR Approximation

With the above questions in mind, we now turn our attention to the review of results from the literature that enable LRA algorithms using fewer than the mn entries of a dense m × n matrix A. The reduction in the number of matrix entries enables a commensurate reduction in the number of arithmetic operations. These approximations are known as pseudo-skeleton approximations. An early and significant result in this area with respect to the spectral norm was obtained by Goreinov et al. [26]. Suppose A ∈ Rm×n. Then there exist subsets of k columns and rows of A, C and R, respectively, and a matrix U ∈ Rk×k such that:

‖A − CUR‖2 ≤ √k(√m + √n)‖A − Ak‖2    (3.27)

This paper further elaborates on the choice of the matrix U. Let I and J be the index sets of the rows and columns of A contained in R and C, respectively, and denote G = A(I, J). If G is nonsingular and we set U := G−1, it can be shown that the accuracy of the CUR approximation to A improves with decreasing ‖G−1‖. That is,

‖A − CUR‖2 = O(‖A‖2² ‖G−1‖2²)    (3.28)

Given that we would likely want to utilize the entries of C and R to form U, the problem to be solved is that of locating a submatrix of A with a reasonably small inverse. This paper, however, did not offer a methodology for finding such a submatrix. In a subsequent paper by Goreinov and Tyrtyshnikov [27], the key insight for identifying this submatrix was provided.

Suppose Â is a CUR approximation of the form given above and U = A(I, J)−1. If A(I, J) has maximal determinant modulus among all k × k submatrices of A, then it holds that:

‖A − Â‖C ≤ (k + 1)‖A − Ak‖2    (3.29)


Note that ‖ · ‖C is the Chebyshev norm, the maximum absolute-valued entry of a matrix. While neither this result nor the prior one yields a relative error bound, they do suggest the condition according to which we should search for a suitable submatrix. Unfortunately, the problem of finding a submatrix of maximal determinant modulus (also known as matrix volume) is NP-hard. It should also be noted that the volume of a matrix is a metric that quantifies the orthogonality of its columns: geometrically, we seek the parallelepiped of maximum volume constructed from a subset of columns of the matrix. From a practical algorithmic point of view, the determinant of a triangular matrix equals the product of its diagonal entries. Thus, we may consider using existing factorization algorithms such as LU and QR as components of a larger algorithm to locate submatrices of sufficiently large volume. We shall investigate algorithms using these CUR skeleton results in section 4. The skeleton CUR approach reduces arithmetic operations in the sense that it potentially does not have to access every entry of the matrix of interest.
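As a small illustration (not taken from the cited papers), the volume of a candidate k × k submatrix can be read off the diagonal of its triangular QR factor; the helper below and its name are hypothetical.

    import numpy as np

    def submatrix_volume(A, rows, cols):
        # Volume (absolute determinant) of the square submatrix A[rows, cols],
        # using |det(G)| = prod |r_ii| for the upper triangular factor R of G = QR.
        G = A[np.ix_(rows, cols)]
        R = np.linalg.qr(G, mode='r')
        return float(np.prod(np.abs(np.diag(R))))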

    3.6 Summary

This section has examined a broad range of approximation results from the literature, including dimension reduction with random multiplier matrices, column subset sampling, column-row subset sampling, and pseudo-skeletons. We have seen that dimension reduction relies on the data-oblivious generation of random linear combinations of the columns of an input matrix A in order to eventually approximate A. Likewise, column and/or row subset sampling algorithms create sampling probabilities from A before taking column (row) samples based on these probabilities. Finally, the pseudo-skeleton approach searches A for a submatrix that approximately maximizes the volume metric, though it is not a randomized algorithm.

Algorithms based on these results are presented and analyzed in section 4. Moreover, it should be mentioned that hybrid algorithms that combine one or more of these results may also be formed. Recall that we have already proposed an extension to sampling column subsets using an approach whereby we preprocess a matrix so that its leverage scores are approximately uniform. We may similarly apply the relative-error bound CUR algorithm of [22] and then recover the original matrix by multiplying the CUR by the inverse of the random multiplier. Likewise, we have also seen in [25] the application of several approximation and sparsification techniques to further lower computational complexity bounds.

    4 Randomized LRA Algorithms

The important approximation results of the last section allow us to present some practical prototype algorithms based on those results. We start with a discussion of a typical dimension reduction algorithm and follow with one for column subset sampling. Finally, we discuss examples of CUR approximation utilizing row-column sampling and pseudo-skeleton approaches.

    4.1 The Dimension Reduction Algorithm

The random projection algorithm given below is for the fixed rank problem where we are interested in a rank-k approximation A ≈ B · C with k ≪ min(m, n); it uses a Gaussian normal matrix for the random multiplier.

Input: A ∈ Rm×n
Input: rank k, oversampling parameter p
Output: B ∈ Rm×(k+p)
Output: C ∈ R(k+p)×n

l ← k + p
Construct a random Gaussian normal matrix G ∈ Rn×l
Y ← A · G
Get an orthogonal basis matrix Q ∈ Rm×l for Y
B ← Q
C ← Q∗ · A
Output B, C

    Algorithm 1: Dimension Reduction [Halko et al 2011]

Afterwards, the following steps convert the output of the Dimension Reduction algorithm from low rank format to an SVD:

1. Run an SVD algorithm on matrix C to obtain C := ÛΣV∗

    2. Set U ← B · Û
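A minimal NumPy sketch of Algorithm 1 together with the two conversion steps above is given below; the function names and the default oversampling value are my own choices, and np.linalg.qr stands in for any routine that returns an orthonormal basis.

    import numpy as np

    def dimension_reduction(A, k, p=10, rng=None):
        # Rank-(k+p) factorization A ~ B @ C via a Gaussian random projection (Algorithm 1).
        rng = np.random.default_rng() if rng is None else rng
        n = A.shape[1]
        l = k + p
        G = rng.standard_normal((n, l))   # Gaussian normal multiplier
        Y = A @ G                         # random linear combinations of the columns of A
        Q, _ = np.linalg.qr(Y)            # orthonormal basis for the range of Y
        return Q, Q.T @ A                 # B = Q, C = Q* A

    def to_svd(B, C):
        # Steps 1-2 above: convert the low rank factors to an approximate SVD of A.
        U_hat, s, Vt = np.linalg.svd(C, full_matrices=False)
        return B @ U_hat, s, Vt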

It is informative to note that the algorithm and the subsequent SVD steps use conventional techniques on much smaller matrices than the input matrix A ∈ Rm×n. For example, the orthogonal basis matrix Q may be obtained with a standard QR algorithm, and an SVD algorithm is applied to C. The most expensive steps are Y ← A · G and C ← Q∗ · A, which are each O(mnl). However, matrix-matrix multiplication is an embarrassingly parallel operation suited to today's parallel hardware architectures. Furthermore, the Basic Linear Algebra Subprograms (BLAS) underlying LAPACK implement matrix-matrix operations in a blocked fashion that leads to better utilization of memory hierarchies. Please see [29] for an in-depth treatment of this particular topic. From a computational perspective, the most arithmetically intensive steps of this dimension reduction algorithm conveniently map to those hardware and software features that can boost the wall clock performance of the algorithm.

Another aspect of the algorithm that may potentially be tuned for further performance concerns the QR algorithm. If we consider that, as presented in the prior section, the columns of Y are likely linearly independent, then QR can potentially be applied without a column-pivoting strategy. A further consideration is the singular value spectrum of the input matrix A. If the spectrum decays slowly, it may be necessary to employ a power iteration to ensure more accurate results through acceleration of spectrum decay. According to [1], the idea is to lessen the effect of the smaller singular vectors in the algorithm by speeding up the decay of the associated singular values. The power iteration appeared in the prior section concerning a relative error bound in the spectral norm. If a power iteration is needed, a parameter q is specified for the number of iterations. The other significant parameter of the algorithm with respect to the accuracy of the result is the oversampling parameter p.
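Where a slowly decaying spectrum is expected, a hedged sketch of the power iteration variant (roughly Y = (AA^T)^q AG, with re-orthonormalization between steps for numerical stability, as advised in [1]) might look as follows; the function name and parameter defaults are illustrative only.

    import numpy as np

    def dimension_reduction_power(A, k, p=10, q=2, rng=None):
        # Same scheme as the basic algorithm, but with q power iteration steps
        # to accelerate the decay of the singular value spectrum.
        rng = np.random.default_rng() if rng is None else rng
        n = A.shape[1]
        G = rng.standard_normal((n, k + p))
        Y = A @ G
        for _ in range(q):
            Y, _ = np.linalg.qr(Y)        # stabilize before the next multiplication
            Y = A @ (A.T @ Y)
        Q, _ = np.linalg.qr(Y)
        return Q, Q.T @ A                 # B = Q, C = Q* A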

    4.2 Random Multiplier Matrix Considerations

The primary tradeoffs of using Gaussian random matrices as opposed to other multipliers concern:


• Accuracy of the numerical results (Favorable)

• Probability of failure 3e−p (Favorable)

    • O(mnl) arithmetic complexity bound of dense matrix multiplication (Expensive)

    • Memory space considerations O(nk) (Expensive)

Substantial arithmetic cost reduction in the matrix-matrix multiplication with random multipliers is possible with the Fast Johnson-Lindenstrauss Transform (FJLT) and the Subsampled Randomized Fourier and Hadamard Transforms (SRFT/SRHT). The recursive definition of Hadamard and Discrete Fourier Transform (DFT) matrices enables matrix-vector multiplication in O(n log n) time for an n × n matrix. Unfortunately, the probability of failure increases to O(1/k) for a rank-k approximation. Whereas only O(n + k) random entries are required for a multiplier of the SRFT/SRHT type, the dense Gaussian normal n × k matrix necessitates the creation of O(nk) random entries.
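As an illustration only, an SRFT-type multiplier can be applied through the FFT rather than formed explicitly; the sketch below composes a random sign diagonal, the unitary DFT, and a random column subsample, and it returns a complex-valued sketch (a real Hadamard-based variant would avoid this). The function name and scaling conventions are my own.

    import numpy as np

    def srft_sketch(A, l, rng=None):
        # Y = A @ Omega with Omega = sqrt(n/l) * D F R:
        #   D = random +/-1 diagonal, F = unitary DFT, R = restriction to l random columns.
        rng = np.random.default_rng() if rng is None else rng
        n = A.shape[1]
        d = rng.choice([-1.0, 1.0], size=n)            # D: random sign flips of the columns
        idx = rng.choice(n, size=l, replace=False)     # R: the subsampled columns
        AF = np.fft.fft(A * d, axis=1) / np.sqrt(n)    # apply D, then the unitary DFT row-wise
        return np.sqrt(n / l) * AF[:, idx]             # rescaled sample of l columns (complex)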

Alternatively, we may use sparsified versions of SRFT/SRHT matrices called Abridged Fourier and Hadamard matrices. Though there are no formal probability guarantees of success with these multipliers, they further reduce the cost of matrix-matrix multiplication. For an Abridged Hadamard n × n matrix having 2d nonzero entries in each row and column, matrix-vector multiplication incurs only 2dn arithmetic operations. Please see [17] for further treatment of these random multipliers and a review of empirical results with them. Another factor in the choice of recursively defined matrices like the Hadamard and DFT pertains to memory access patterns and the potential for memory bottlenecks with massive data sets. Though not mentioned in the LRA literature, the pitfalls of the Fast Fourier Transform have been raised in other areas.

A significant issue investigated in [30] concerns the nature of FFT processing whereby small data sets are continually transferred from external memory. Due to the data layouts typically used by FFT algorithms, non-sequential memory accesses are incurred that negatively impact performance. An article from the field of geosciences [32] cites the memory access patterns of the FFT as a factor that limits scalability. Furthermore, its authors show experimental results whereby an alternative algorithm with twice as many arithmetic operations achieves a five-fold performance improvement over that of the FFT. As concerns parallel processing with the FFT, [31] refers to experimental results which show that FFT-based numerical approaches do not compare favorably with the Fast Multipole Method beyond several thousand parallel processes.

A more detailed investigation is beyond the scope of this survey, but these findings raise serious questions as to the suitability of FFT and Hadamard matrices for MMDS's in large scale parallel processing environments. Perhaps the most intriguing issue is whether a primary focus on reducing theoretical arithmetic complexity bounds is necessarily the most productive direction for further research in this area. In any case, an analysis of the temporal and data locality characteristics associated with multiplication by random matrix multipliers also merits careful attention. Likewise, a similar argument can be made with respect to the tradeoff between arithmetic complexity reduction and any detrimental impact in a parallel processing environment. Such impacts may arise from dependencies among the arithmetic operations that limit parallelism.

In summary, the desirable properties of random multipliers include orthogonality, sparsity, and structure. Orthogonal matrices are significant for not affecting the distances between vectors, among other properties, and they provide numerical stability. We


have already encountered the important property of distance preservation in the Johnson-Lindenstrauss Lemma. Sparsity is important in that it reduces the space complexity of a matrix and also reduces the number of multiplication operations with the matrix. Finally, structured matrices require both fewer arithmetic operations and less memory space for their storage.

    4.3 A Column Subset Algorithm with Sampling

Approximation algorithms based on sampling columns (rows) from a matrix originated in the TCS strategy of employing nonuniform probability distributions constructed from the matrix. The goal is to create pass-efficient LRA algorithms to accommodate matrices that are prohibitively large to be stored in main memory; therefore, a small constant number of passes over the data set from external memory is acceptable. The sampling probabilities of the columns are formed from the Euclidean norms of either the columns (rows) of an input matrix A ∈ Rm×n or its singular vectors.

Input: A ∈ Rm×n, V(A,k) ∈ Rn×k
Input: rank k, error parameter ε
Output: C ∈ Rm×c, X ∈ Rc×n, S ∈ Rn×c, D ∈ Rc×c

c ← 3200 k²/ε²    (number of columns to sample)
S ← zeros(n, c)    (sampling matrix)
D ← zeros(c, c)    (scaling matrix)

Get the leverage score of each column i of V(A,k)^T
for i ← 1 to n do
    pi ← ‖(V(A,k)^T)^(i)‖₂² / k
end

Sample and rescale c columns of A according to the leverage scores p
for i ← 1 to c do
    j ← RandomSample({1 . . . n}, p)
    S(j, i) ← 1
    D(i, i) ← 1/√(c pj)
end

M ← ASD
Mk ← Best rank-k approximation to M
C ← Mk
X ← C†A
Output C, X

Algorithm 2: CX Algorithm with Column Sampling [Drineas et al 2008]

In the CX approximation algorithm shown in Algorithm 2 of this paper, these sampling probabilities are used to sample and rescale columns (rows) in c independent, identically distributed (i.i.d.)


trials to create a column submatrix C ∈ Rm×c of A. The column-based approach of Section 4 of [22], which provides a relative-error bound, is shown in Algorithm 2 in an adapted form that combines Algorithms 1 and 4 of [22] to return a rank-k approximation of A in low rank format. The top k right singular vectors of A, as given by V(A,k) ∈ Rn×k, are the key input to the algorithm from which the probabilities are determined. As this survey paper is concerned with new approaches to LRA, one may certainly question the value of an algorithm that seemingly presumes the underlying problem to already be solved to some extent; that is, the required input of the right singular vectors is already available from an SVD algorithm or some other method. On the other hand, a CX decomposition is desirable when actual columns of the matrix convey a clearer understanding of the matrix content to the target audience than the more abstract singular vectors and singular values provided by the SVD. Thus, the SVD algorithm may be seen as a necessary component of the CUR algorithm.

It can be argued that computing a full SVD seems profligate if the goal is merely to obtain an approximate decomposition, even if the approximation error can be made arbitrarily small. Two remedies have already been mentioned in the prior section: (i) inexpensive approximation of the sampling probabilities and (ii) preprocessing of the matrix so that the sampling probabilities become nearly uniform. In any case, the insight gained from recognizing the power of sampling is valuable in situations where the probability distribution is already known beforehand or can be readily approximated. In such a case the cost of the sampling algorithm becomes O(mn), and a complexity linear in the storage space is very attractive for MMDS's.
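A hedged NumPy sketch of leverage-score column sampling in the spirit of Algorithm 2 follows; it computes exact leverage scores from a full SVD purely for clarity (in practice they would be approximated), and it omits the rank-k restriction step involving Mk. The function name is hypothetical.

    import numpy as np

    def cx_column_sampling(A, k, c, rng=None):
        # Sample and rescale c columns of A i.i.d. according to leverage scores
        # computed from the top-k right singular vectors, then set X so that A ~ C X.
        rng = np.random.default_rng() if rng is None else rng
        n = A.shape[1]
        _, _, Vt = np.linalg.svd(A, full_matrices=False)
        Vk = Vt[:k, :]                                 # top-k right singular vectors (k x n)
        p = np.sum(Vk ** 2, axis=0) / k                # leverage scores, summing to 1
        S = np.zeros((n, c))                           # sampling matrix
        D = np.zeros((c, c))                           # rescaling matrix
        for t in range(c):
            j = rng.choice(n, p=p)
            S[j, t] = 1.0
            D[t, t] = 1.0 / np.sqrt(c * p[j])
        C = A @ S @ D                                  # sampled, rescaled columns of A
        X = np.linalg.pinv(C) @ A
        return C, X, S, D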

    4.4 A CUR Algorithm with Column and Row Sampling

The column-based sampling algorithm is now extended to the column and row sampling algorithm of [22] for the CUR approximation of a matrix A. It is important to note that the CX algorithm is executed as the first step to acquire the column subset C. The left singular vectors UC of C are also assumed to be available to the CUR algorithm given in Algorithm 3 of this paper. However, the cost of their computation is small given that C ∈ Rm×c where c ≪ min{m, n}. Algorithm 3 proceeds somewhat analogously to the CX algorithm to determine the row subset R, though with a caveat: although we are interested in a row subset of A, the choice of these rows depends on C.

To understand this better, one way of looking at the CUR decomposition is as the product C(UR). Thus, the columns of C act as a basis for the approximation to A. Moreover, column i of the approximation is a linear combination of the columns of C with the coefficients given by column i of (UR). Therefore, the choice of the row subset R relies on the particular choice of C. The dependency is manifested, at least in part, by the sampling of rows according to probability distributions derived from the singular vectors of C. Furthermore, we now use the left singular vectors in the reverse role to that of columns because the row space is associated with the right singular vectors. Please refer back to the prior section as it concerns the motivation for using sampling probabilities for column sampling. Lastly, the U matrix is obtained using the matrices obtained in the column and row subset acquisition steps. The presentation here combines Algorithms 1, 2, and 4 of [22].

If one ignores the cost of computing the right singular vectors V of A, then the complexity of this CUR sampling algorithm is O(mn), which reflects the cost of reading all entries of A. Otherwise, the cost is bounded by the time to determine the top-k singular vectors of A. Once again, we see in this algorithm the use of conventional NLA algorithms operating on much smaller matrices than the input matrix through a randomization strategy.


Input: A ∈ Rm×n, V(A,k) ∈ Rn×k
Input: number of columns c, number of rows r, error parameter ε
Output: C ∈ Rm×c, R ∈ Rr×n, U ∈ Rc×r

[C, ∼, ∼, ∼] = CX Algorithm(A, V(A,c), c, ε)
[UC, ∼, ∼] = SVD(C)
[W^T, ∼, S, D] = CX Algorithm(C^T, UC, r, ε)
R = D S^T A
U = W+
Output C, U, R

Algorithm 3: CUR Decomposition with Column and Row Sampling [Drineas et al 2008]

In particular, a matrix possibly too large to fit into main memory is also accommodated by the algorithm, whereas conventional algorithms requiring the entire contents of a dense matrix to be in main memory at the same time would otherwise fail.
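Continuing the previous sketch, a CUR approximation in the spirit of Algorithm 3 can be assembled as below; cx_column_sampling refers to the hypothetical helper above, and the scaling conventions are illustrative rather than a faithful transcription of [22].

    import numpy as np

    def cur_sampling(A, k, c, r, rng=None):
        # Column sampling (as in the previous sketch), then row sampling driven by the
        # left singular vectors of C, with U the pseudoinverse of the scaled intersection W.
        rng = np.random.default_rng() if rng is None else rng
        C, _, _, _ = cx_column_sampling(A, k, c, rng)
        U_C, _, _ = np.linalg.svd(C, full_matrices=False)
        q = np.sum(U_C ** 2, axis=1) / U_C.shape[1]    # row probabilities, as in eq. (3.23)
        m = A.shape[0]
        S = np.zeros((m, r))
        D = np.zeros((r, r))
        for t in range(r):
            i = rng.choice(m, p=q)
            S[i, t] = 1.0
            D[t, t] = 1.0 / np.sqrt(r * q[i])
        R = D @ S.T @ A                                # sampled, rescaled rows of A (r x n)
        W = D @ S.T @ C                                # scaled intersection of R with C (r x c)
        U = np.linalg.pinv(W)                          # U = W^+ (c x r), so A ~ C U R
        return C, U, R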

    4.5 A Pseudo-Skeleton Algorithm for CUR

The Pseudo-Skeleton CUR approach of this subsection is shown in Algorithm 4 of this paper, which demonstrates how an LRA may be obtained without necessarily accessing all entries of the input matrix. It does so using a technique called cross-approximation that successively alternates between the selection of columns and rows until an intersection of such rows and columns is found that satisfies a chosen criterion within some ε bound. Cross-approximation is depicted schematically in the accompanying Figure 1, where an initial random choice of a prescribed number k of rows is made at the start of the algorithm. This initial set of rows is shown by the lower block of highlighted rows in Figure 1. Let this initial row set be called R1. In the next step of the algorithm a set of columns is selected (C1), as indicated by its intersection (W1) with R1. This intersection of C1 with R1 satisfies the criterion of having the maximum volume among all column subsets that intersect R1.

Subsequently, a choice of k rows is made among all k-row subsets that yields the maximum volume k × k submatrix in an intersection with C1. Let this chosen row subset be denoted R2; in this case the intersection with C1 is W2. After the next column subset choice, we have the intersection W3. The algorithm terminates when the additional volume gained relative to the new volume is below some threshold. At loop termination, the last column and row subsets obtained form the matrices C and R, while U is set to the inverse of the intersection of C with R.

The maximum volume submatrix within a set of columns or rows may be obtained using conventional rank-revealing LU or QR algorithms in which the concept of volume is utilized. Consequently, each pass of the cross-approximation loop requires only O((m + n)k²) arithmetic operations. Recall that the volume of a matrix is the absolute value of its determinant. Moreover, the determinant of any triangular matrix is equal to the product of its diagonal entries, and the LU factorization generates an upper triangular matrix.


Figure 1: Cross-Approximation Sequence of Column and Row Selections. The first three recursive steps of a cross-approximation algorithm produce the three shaded intersection matrices W1, W2, and W3. (Adapted from Low Rank Approximation: New Insights, Accurate Superfast Algorithms, Pre-processing and Extensions, Victor Y. Pan, Qi Luan, John Svadlenka, Liang Zhao, 2017.)

    the product of it’s diagonal entries and the LU factorization generates an upper triangularmatrix. Please see [4] for an efficient RRLU algorithm that may be used here. The algorithmmay actually terminate without having referenced all of the entries of the matrix. Thetradeoff is that the reduction in arithmetic operations in obtained at the risk of a poorapproximation which did not consider all rows and columns of the matrix.

    5 Open Problems

The theoretical results reviewed in this paper have emphasized the error bounds of various approximation results together with their arithmetic and space complexities. Depending upon the particular result, the error bound has been given in either the Frobenius or the spectral norm. The question naturally arises as to where the possibilities exist for further improvements to these bounds and complexities. From a review of the literature to date, two questions that may be asked with respect to A ∈ Rm×n, where m ≤ n, specifically with regard to the spectral norm error bound, are the following:

1. Do there exist random multipliers for the Dimension Reduction algorithm of Section 4 with a relative error bound guarantee such that matrix-matrix multiplication can be done faster than O(mn log n)?

2. Is there an approximation algorithm with a relative error bound that produces factor(s) consisting of a subset of actual rows and columns of the matrix?

To give some perspective on the first question, we have already seen that the FJLT and SRFT random multipliers, by virtue of their recursive structure, provide the O(mn log n) bound. The Abridged Hadamard and Fourier multipliers provide lower arithmetic bounds according to empirical evidence, but there is no formal guarantee. It should be mentioned that there is a result [18] in the Frobenius norm that utilizes a CountSketch multiplier S, commonly seen in data stream-oriented applications. Owing to the fact that a CountSketch matrix has at most one nonzero entry in each column, the matrix-matrix multiplication AS can be performed within an O(nnz(A)) cost bound.


Input: A ∈ Rm×n
Input: rank k, tolerance ε
Output: C ∈ Rm×c, R ∈ Rr×n, U ∈ Rc×r

I ← random set of k integers ∈ [1..m]
volume ← 0
while true do
    [volumeNew, J] = GetMaxVolume(A, k, I)
    if |(volumeNew − volume)/volumeNew| ≤ ε then
        break
    end
    volume ← volumeNew
    [volumeNew, I] = GetMaxVolume(A, k, J)
    if |(volumeNew − volume)/volumeNew| ≤ ε then
        break
    end
    volume ← volumeNew
end
C ← A(:, J)
R ← A(I, :)
U ← A(I, J)−1

Algorithm 4: Pseudo-Skeleton CUR Algorithm by Cross Approximation [Goreinov et al 1997]

As noted, the relative error bound of [18] is given in the Frobenius norm. The applicability of CountSketch as a random multiplier does not extend to arbitrary vectors as addressed by the Johnson-Lindenstrauss Lemma; instead, it is assumed that the vectors belong to colspace(A). Please see [18] for more details.
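For illustration, right-multiplication by a CountSketch matrix can be carried out bucket by bucket; the sketch below (names my own) uses the convention S ∈ Rn×t with a single ±1 per row, which is the transpose of the convention in which each column holds the single nonzero. It touches each column of A once, which is O(nnz(A)) when A is stored sparsely.

    import numpy as np

    def countsketch_right(A, t, rng=None):
        # Y = A @ S for a CountSketch matrix S of size n x t: row j of S has a single
        # nonzero entry s[j] = +/-1 in column h[j], so each column of A is added
        # (with a sign) into exactly one of the t output "buckets".
        rng = np.random.default_rng() if rng is None else rng
        m, n = A.shape
        h = rng.integers(0, t, size=n)                 # bucket index for each column of A
        s = rng.choice([-1.0, 1.0], size=n)            # random sign for each column of A
        Y = np.zeros((m, t))
        for j in range(n):                             # one pass over the columns of A
            Y[:, h[j]] += s[j] * A[:, j]
        return Y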

As it pertains to the second question, we have already seen a result from the NLA community for pseudo-skeleton approximation. Recall the error bound for this CUR:

‖A − CUR‖2 ≤ √k(√m + √n)‖A − Ak‖2

Strictly speaking, the above is a multiplicative error bound but not a (1 + ε) relative error bound. Another similar multiplicative error bound exists for the two-sided Interpolative Decomposition (ID) presented previously:

‖A − Âk‖2 ≤ √(1 + k(min(m, n) − k)) ‖A − Ak‖2

It should be stated again that choosing the optimal set of columns and rows for a rank-k approximation is an NP-hard problem.


  • 6 Conclusion

In this survey paper on LRA the major highlights of the literature and recent developments in the field have been examined. The key topics of dimension reduction, random structured multipliers, column sampling, and CUR approximation have been addressed in some detail. An effort has been made to show the chronological sequence of important developments and the refinements to error bounds and to arithmetic, data pass, and space complexities with each new important result. Perspectives have been included from both the Numerical Linear Algebra and Theoretical Computer Science communities. The challenges concerning LRA for massive data sets and the suitability of randomized approximation algorithms to meet these challenges have been presented.

The limitations and the lack of flexibility of conventional algorithms with respect to MMDS's have spurred the current interest in new algorithms for LRA. These new algorithms feature capabilities that allow out-of-core data sets to be treated, increased usage of parallelism, and the production of matrix approximations in terms of subsets of actual rows and columns of the matrix to support data analysis activities. Additionally, these new strategies leverage existing conventional algorithms on much smaller data sets through the use of randomization. We have also seen the progression and refinement of theoretical results from additive error bounds to multiplicative and relative error bounds.

Randomization has been seen in a couple of flavors. For example, in the dimension reduction approach, random linear combinations of a matrix's columns form a subspace onto which an input matrix is projected to form an approximation. Column and row sampling is an alternative means to select a subset of entries of a matrix from which an approximation is eventually obtained. CUR cross-approximation chooses subsets of intersections of rows and columns to yield an approximation that may not need to reference the entire matrix. The developments in this field also hold promise for their application to related theoretical problems in NLA as well as to problems in the physical and life sciences, engineering, Internet, and data sciences.

Acknowledgements

I would like to thank my mentor, Professor Victor Pan, for his thoughtful comments,

    insight, and support throughout my doctoral education.

    References

[1] N. Halko, P. G. Martinsson, J. A. Tropp, Finding Structure with Randomness: Probabilistic Algorithms for Approximate Matrix Decompositions, SIAM Review, 53, 2, 217–288, 2011.

[2] M. W. Mahoney, Randomized Algorithms for Matrices and Data, Foundations and Trends in Machine Learning, NOW Publishers, 3, 2, 2011.

[3] G. H. Golub and C. Reinsch, Singular value decomposition and least squares solutions, Numerische Mathematik, 14, 5, 403–420, 1970.

[4] C.-T. Pan, On the existence and computation of rank-revealing LU factorizations, Linear Algebra and its Applications, 316, 199–222, 2000.


[5] M. Gu and S. C. Eisenstat, Efficient algorithms for computing a strong rank-revealing QR factorization, SIAM Journal on Scientific Computing, 17, 4, 848–869, 1996.

[6] G. H. Golub and C. F. Van Loan, Matrix Computations, Johns Hopkins University Press, 4th edition, 2013.

[7] A. Ruston, Auerbach's theorem and tensor products of Banach spaces, Mathematical Proceedings of the Cambridge Philosophical Society, 58, 3, 476–480, doi:10.1017/S0305004100036744, 1962.

[8] H. Cheng et al., On the compression of low rank matrices, SIAM Journal on Scientific Computing, 26, 4, 1389–1404, 2005.

[9] W. B. Johnson and J. Lindenstrauss, Extension of Lipschitz mapping into Hilbert spaces, Proc. of Modern Analysis and Probability, Contemporary Mathematics, 26, 189–206, 1984.

[10] D. Achlioptas, Database-friendly random projections, Proc. ACM Symp. on the Principles of Database Systems, 274–281, 2001.

[11] T. Sarlós, Improved Approximation Algorithms for Large Matrices via Random Projections, Proceedings of IEEE Symposium on Foundations of Computer Science (FOCS), 143–152, 2006.

[12] C. H. Papadimitriou, P. Raghavan, H. Tamaki, and S. Vempala, Latent Semantic Indexing: A probabilistic analysis, Journal of Computer and System Sciences, 61, 2, 217–235, 2000.

[13] D. P. Woodruff, Sketching as a tool for numerical linear algebra, Foundations and Trends in Theoretical Computer Science, 10, 1–2, 1–157, 2014.

[14] M. Rudelson and R. Vershynin, Non-asymptotic theory of random matrices: extreme singular values, arXiv preprint arXiv:1003.2990, 2010.

[15] N. Ailon and B. Chazelle, Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform, STOC 2006: Proc. 38th Ann. ACM Symposium on Theory of Computing, 557–563, 2006.

[16] F. Woolfe, E. Liberty, V. Rokhlin, and M. Tygert, A fast randomized algorithm for the approximation of matrices, Applied and Computational Harmonic Analysis, 25, 335–366, 2008.

[17] V. Y. Pan, J. Svadlenka, and L. Zhao, Fast Derandomized Low-rank Approximation and Extensions, arXiv preprint arXiv:1607.05801, 2016.

[18] K. Clarkson and D. Woodruff, Low Rank Approximation and Regression in Input Sparsity Time, Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, 81–90, 2013.

[19] A. Frieze, R. Kannan, and S. Vempala, Fast Monte-Carlo algorithms for finding low-rank approximations, Journal of the ACM, 51, 6, 1025–1041, 2004.


[20] A. Deshpande, L. Rademacher, S. Vempala, and G. Wang, Matrix approximation and projective clustering via volume sampling, Theory of Computing, 2, 225–247, 2006.

[21] M. Rudelson and R. Vershynin, Sampling from large matrices: An approach through geometric functional analysis, Journal of the ACM, 54, 2007.

[22] P. Drineas, M. W. Mahoney, and S. Muthukrishnan, Relative-error CUR matrix decompositions, SIAM Journal on Matrix Analysis and Applications, 30, 2, 844–881, 2008.

[23] P. Drineas, M. Magdon-Ismail, M. Mahoney, and D. Woodruff, Fast Approximation of Matrix Coherence and Statistical Leverage, Journal of Machine Learning Research, 13, 2012.

[24] P. Drineas, R. Kannan, and M. W. Mahoney, Fast Monte Carlo algorithms for matrices III: Computing a compressed approximate matrix decomposition, SIAM Journal on Computing, 36, 1, 184–206, 2006.

[25] C. Boutsidis and D. P. Woodruff, Optimal CUR matrix decompositions, SIAM Journal on Computing, 46, 2, 543–589, 2017.

[26] S. A. Goreinov, E. E. Tyrtyshnikov, and N. L. Zamarashkin, A theory of pseudo-skeleton approximation, Linear Algebra and its Applications, 261, 1–21, 1997.

[27] S. A. Goreinov and E. E. Tyrtyshnikov, The maximal-volume concept in approximation by low-rank matrices, Contemporary Mathematics, 208, 47–51, 2001.

[28] A. Haidar, J. Kurzak, and P. Luszczek, An improved parallel singular value algorithm and its implementation for multicore hardware, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, ACM, November 2013.

[29] E. Elmroth, F. Gustavson, I. Jonsson, and B. Kågström, Recursive Blocked Algorithms and Hybrid Data Structures for Dense Matrix Library Software, SIAM Review, 46, 1, 3–45, 2004.

[30] B. Akin, F. Franchetti, and J. Hoe, FFTs With Near-Optimal Memory Access Through Block Layouts, IEEE International Conference on Acoustics, Speech, and Signal Processing, 3898–3902, 2014.

[31] L. Barba and R. Yokota, How Will the Fast Multipole Method Fare in the Exascale Era?, SIAM News, 46, 6, 2013.

[32] O. Lindtjorn, R. Clapp, O. Pell, H. Fu, and M. Flynn, Beyond Traditional Microprocessors for Geoscience High-Performance Computing Applications, IEEE Micro, 31, 2, 41–49, 2011.

[33] G. Quintana-Ortí, X. Sun, and C. H. Bischof, A BLAS-3 version of the QR factorization with column pivoting, SIAM Journal on Scientific Computing, 19, 5, 1486–1494, 1998.
