

Low-rank Tucker decomposition of large tensors using TensorSketch

Stephen Becker∗ and Osman Asif Malik†

Department of Applied Mathematics

University of Colorado Boulder

May 18, 2018

Abstract

We propose two randomized algorithms for low-rank Tucker decompositions of tensors. The algorithms, which incorporate sketching, only require a single pass of the input tensor and can handle tensors whose elements are streamed in any order. To the best of our knowledge, ours are the only algorithms which can do this. We test our algorithms on sparse synthetic data and compare them to multiple other methods. We also apply one of our algorithms to a real dense tensor representing a video and use the resulting decomposition to correctly classify frames containing disturbances.

1 Introduction

Many real datasets have more than two dimensions and are therefore better represented using tensors, or multi-way arrays, rather than matrices. In the same way that methods such as the singular value decomposition (SVD) can help in the analysis of data in matrix form, tensor decompositions are important tools when working with tensor data. As multidimensional datasets grow larger and larger, there is an increasing need for methods that can handle them, even on modest hardware. One approach to the challenge of handling big data, which has proven to be very fruitful in the past, is the use of randomization. In this paper, we present two algorithms for computing the Tucker decomposition of a tensor which incorporate random sketching. Our algorithms, which are single pass and can handle streamed data, are suitable when the decomposition we seek is of low rank. A key challenge to incorporating sketching in the Tucker decomposition is that the relevant design matrices are Kronecker products of the factor matrices. This makes them too large to form and store in RAM, which prohibits the application of standard sketching techniques. Recent work [30, 31, 2, 10] has led to a new technique called TensorSketch which is ideally suited for sketching Kronecker products. It is based on this technique that we develop our algorithms.

In some applications, such as the compression of scientific data produced by high-fidelity simulations, the data tensors can be very large (see e.g. the recent work [1]). Since such data is frequently produced incrementally, e.g. by stepping forward in time, a compression algorithm which is one-pass and can handle the tensor elements being streamed would make it possible to compress the data without ever having to store it in full. Our algorithms have these properties.

In summary, our paper makes the following algorithmic contributions:

• We propose two algorithms for Tucker decomposition which incorporate TensorSketch. They are intended to be used for low-rank decompositions.

• We propose an idea for defining the sketch operators upfront. In addition to increasing accuracy and reducing run time, it allows us to make several other improvements. These include only requiring a single pass of the data, and being able to handle tensors whose elements are streamed in any order. To the best of our knowledge, ours are the only algorithms which can do this.

∗ [email protected]   † [email protected]


1.1 A brief introduction to tensors and the Tucker decomposition

We use the same notations and definitions as in the review paper by Kolda and Bader [21]. Due to limited space, we only explain our notation here, with definitions given in the supplementary material. A tensor X ∈ R^(I_1 × I_2 × ··· × I_N) is an array of dimension N, also called an N-way tensor. Boldface Euler script letters, e.g. X, denote tensors of dimension 3 or greater; bold capital letters, e.g. X, denote matrices; bold lowercase letters, e.g. x, denote vectors; and lower case letters, e.g. x, denote scalars. For scalars indicating dimension size, upper case letters, e.g. I, will be used. “⊗” and “⊙” denote the Kronecker and Khatri-Rao products, respectively. The mode-n matricization of a tensor X ∈ R^(I_1 × I_2 × ··· × I_N) is denoted by X_(n) ∈ R^(I_n × ∏_{i≠n} I_i). Similarly, x_(:) ∈ R^(∏_n I_n) denotes the vectorization of X into a column vector. The n-mode tensor-times-matrix (TTM) product of X and a matrix A ∈ R^(J × I_n) is denoted by X ×_n A ∈ R^(I_1 × ··· × I_{n−1} × J × I_{n+1} × ··· × I_N). The norm of X is defined as ‖X‖ = ‖x_(:)‖_2. For a positive integer n, we use the notation [n] := {1, 2, ..., n}.

There are multiple tensor decompositions. In this paper, we consider the Tucker decomposition. A Tucker decomposition of a tensor X ∈ R^(I_1 × I_2 × ··· × I_N) is

    X = G ×_1 A^(1) ×_2 A^(2) ··· ×_N A^(N) =: ⟦G; A^(1), A^(2), ..., A^(N)⟧,    (1)

where G ∈ R^(R_1 × R_2 × ··· × R_N) is called the core tensor and each A^(n) ∈ R^(I_n × R_n) is called a factor matrix. Without loss of generality, the factor matrices can be assumed to have orthonormal columns, which we will assume as well. We say that X in (1) is a rank-(R_1, R_2, ..., R_N) tensor. Two standard algorithms for Tucker decomposition are higher-order orthogonal iteration (HOOI) and higher-order SVD (HOSVD); HOSVD generally gives a less accurate approximation than HOOI.
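The following minimal numpy sketch illustrates the format in (1): it assembles a small rank-(2, 3, 2) Tucker tensor from a random core and orthonormal factor matrices. The sizes and variable names are illustrative only and are not tied to any experiment in this report.

```python
import numpy as np

rng = np.random.default_rng(0)
I, R = (5, 6, 4), (2, 3, 2)                                  # illustrative sizes and ranks
G = rng.standard_normal(R)                                   # core tensor
A1, A2, A3 = (np.linalg.qr(rng.standard_normal((i, r)))[0]   # orthonormal factor matrices
              for i, r in zip(I, R))

# X = [[G; A1, A2, A3]] = G x_1 A1 x_2 A2 x_3 A3, written as a single contraction.
X = np.einsum("abc,ia,jb,kc->ijk", G, A1, A2, A3)
print(X.shape)                                               # (5, 6, 4), a rank-(2, 3, 2) tensor
```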

1.2 A brief introduction to TensorSketch

In this paper, we apply TensorSketch to approximate the solution to large overdetermined least-squares problems, and to approximate chains of TTM products similar to those in (1). TensorSketch is a randomized method which allows us to reduce the cost and memory usage of these computations in exchange for somewhat reduced accuracy. It can be seen as a specialized version of another sketching method called CountSketch, which was introduced in [7] and further analyzed in [8]. The benefit of TensorSketch is that it can be applied to Kronecker products of matrices very efficiently.

TensorSketch was first introduced in 2013 in [30], where it is applied to compressed matrix multiplication. In [31], it is used for approximating support vector machine polynomial kernels efficiently. Avron et al. [2] show that TensorSketch provides an oblivious subspace embedding. Diao et al. [10] provide theoretical guarantees which we will rely on in this paper. Below is an informal summary of those results we will use; for further details, see the paper by Diao et al., especially Theorem 3.1 and Lemma B.1. Let A ∈ R^(L × M) be a matrix, where L ≫ M. Like other classes of sketches, an instantiation of TensorSketch is a linear map T : R^L → R^J, where J ≪ L, such that, if y ∈ R^L and x̂ := argmin_x ‖TAx − Ty‖_2, then for J sufficiently large (depending on ε > 0), with high probability ‖Ax̂ − y‖_2 ≤ (1 + ε) min_x ‖Ax − y‖_2.

The distinguishing feature of TensorSketch is that if the matrix A is of the form A = A^(N) ⊗ A^(N−1) ⊗ ··· ⊗ A^(1), where each A^(n) ∈ R^(I × R), I ≫ R, then the cost of computing TA can be shown to be O(NIR + JR^N), excluding log factors, whereas naïve matrix multiplication would cost O(J I^N R^N). Moreover, TA can be computed without ever forming the full matrix A. This is achieved by first applying an independent CountSketch operator S^(n) ∈ R^(J × I) to each factor matrix A^(n) and then computing the full TensorSketch using the fast Fourier transform (FFT). The formula for this is

    TA = T ⊗_{i=N}^{1} A^(i) = FFT^(−1)( ( ⊙_{i=N}^{1} (FFT(S^(i) A^(i)))^⊤ )^⊤ ).    (2)

These results generalize to the case when the factor matrices are of different sizes. A more thorough explanation of TensorSketch, including how to arrive at formula (2), is provided in the supplementary material.
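A minimal numpy sketch of formula (2) is given below: it CountSketches each factor matrix, takes columnwise FFTs, forms the columnwise (Khatri-Rao) product, and inverts the FFT, and then checks the result against the explicit CountSketch of the fully formed Kronecker product. The hash and sign functions here are drawn fully at random rather than from 3- and 4-wise independent families, and this is only an illustration, not the Matlab implementation used for the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def countsketch(A, h, s, J):
    """CountSketch of A: row i is multiplied by s[i] and added to bucket h[i]."""
    SA = np.zeros((J, A.shape[1]))
    np.add.at(SA, h, s[:, None] * A)
    return SA

def tensorsketch(factors, hs, ss, J):
    """TensorSketch of kron(factors[0], ..., factors[-1]) via formula (2)."""
    # FFT of each CountSketched factor, each of shape (J, R_n).
    fft_parts = [np.fft.fft(countsketch(A, h, s, J), axis=0)
                 for A, h, s in zip(factors, hs, ss)]
    # Columnwise Khatri-Rao product, enumerating columns in Kronecker order.
    prod = fft_parts[0]
    for P in fft_parts[1:]:
        prod = (prod[:, :, None] * P[:, None, :]).reshape(J, -1)
    return np.real(np.fft.ifft(prod, axis=0))

# Check against the explicit definition: apply the CountSketch built from the
# combined hash H and sign S to the explicitly formed Kronecker product.
J = 64
factors = [rng.standard_normal((6, 2)), rng.standard_normal((5, 3)), rng.standard_normal((4, 2))]
hs = [rng.integers(0, J, size=A.shape[0]) for A in factors]   # random buckets
ss = [rng.choice([-1.0, 1.0], size=A.shape[0]) for A in factors]

TA = tensorsketch(factors, hs, ss, J)

A_full = factors[0]
for A in factors[1:]:
    A_full = np.kron(A_full, A)
idx = np.stack(np.meshgrid(*[np.arange(A.shape[0]) for A in factors], indexing="ij"),
               axis=-1).reshape(-1, len(factors))             # row i <-> (i_1, ..., i_N)
H = np.sum([h[idx[:, n]] for n, h in enumerate(hs)], axis=0) % J
S = np.prod([s[idx[:, n]] for n, s in enumerate(ss)], axis=0)
print(np.allclose(TA, countsketch(A_full, H, S, J)))          # True
```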


2 Related work

Randomized algorithms have been applied to tensor decompositions before. Wang et al. [35] and Battaglino et al. [5] apply sketching techniques to the CANDECOMP/PARAFAC (CP) decomposition. Drineas and Mahoney [11], Zhou and Cichocki [36], Da Costa et al. [9] and Tsourakakis [34] propose different randomized methods for computing HOSVD. The method in [34], which is called MACH, is also extended to computing HOOI. Mahoney et al. [27] and Caiafa and Cichocki [6] present results that extend the CUR factorization for matrices to tensors. Other decomposition methods that only consider a small number of the tensor entries include those by Oseledets et al. [29] and Friedland et al. [13].

Another approach to decomposing large tensors is to use memory efficient and distributed methods. Kolda and Sun [22] introduce the Memory Efficient Tucker (MET) decomposition for sparse tensors as a solution to the so-called intermediate blow-up problem which occurs when computing the chain of TTM products in HOOI. Other papers that use memory efficient and distributed methods include [4, 23, 24, 25, 32, 18, 1, 19, 28].

Other research focuses on handling streamed tensor data. Sun et al. [33] introduce a framework for incremental tensor analysis. The basic idea of their method is to find one set of factor matrices which works well for decomposing a sequence of tensors that arrive over time. Fanaee-T and Gama [12] introduce multi-aspect-streaming tensor analysis, which is based on the histogram approximation concept rather than linear algebra techniques. Neither of these methods corresponds to Tucker decomposition of a tensor whose elements are streamed. Gujral et al. [15] present a method for incremental CP decomposition.

We compare our algorithms to TUCKER-ALS and MET in Tensor Toolbox version 2.6 [3, 22], FSTD1 with adaptive index selection from [6], as well as the HOOI version of the MACH algorithm in [34].¹ TUCKER-ALS and MET, which are mathematically equivalent, provide good accuracy, but run out of memory as tensor size increases. MACH scales somewhat better, but also runs out of memory for larger tensors. Its accuracy is also lower than that of TUCKER-ALS/MET. None of these algorithms is one-pass. FSTD1 scales well, but has accuracy issues on very sparse tensors. FSTD1 does not need to access all elements of the tensor and is one-pass, but since the entire tensor needs to be accessible, the method cannot handle streamed data.

3 Tucker decomposition using TensorSketch

The Tucker decomposition problem of decomposing a data tensor Y ∈ R^(I_1 × I_2 × ··· × I_N) can be formulated as

    argmin_{G, A^(1), ..., A^(N)} { ‖Y − ⟦G; A^(1), ..., A^(N)⟧‖ : G ∈ R^(R_1 × ··· × R_N), A^(n) ∈ R^(I_n × R_n) for n ∈ [N] }.    (3)

The standard approach to this problem is to use alternating least-squares (ALS). By rewriting the objective function appropriately (use e.g. Proposition 3.7 in [20]), we then get the following steps, which are repeated until convergence:

1. For n = 1, ..., N, update

    A^(n) = argmin_{A ∈ R^(I_n × R_n)} ‖ ( ⊗_{i=N, i≠n}^{1} A^(i) ) G_(n)^⊤ A^⊤ − Y_(n)^⊤ ‖_F^2.    (4)

2. Update

    G = argmin_{Z ∈ R^(R_1 × ··· × R_N)} ‖ ( ⊗_{i=N}^{1} A^(i) ) z_(:) − y_(:) ‖_2^2.    (5)

It turns out that the solution to (5) is given by G = Y ×_1 A^(1)⊤ ×_2 A^(2)⊤ ··· ×_N A^(N)⊤. Moreover, one can show that the solution for the nth factor matrix A^(n) in (4) is given by the R_n leading left singular vectors of the mode-n matricization of Y ×_1 A^(1)⊤ ··· ×_{n−1} A^(n−1)⊤ ×_{n+1} A^(n+1)⊤ ··· ×_N A^(N)⊤. These insights lead to Algorithm 1, which we will refer to as TUCKER-ALS. It is also frequently called higher-order orthogonal iteration (HOOI). More details can be found in [21].

¹ For FSTD1, we use the Matlab code from the website of one of the authors (http://ccaiafa.wixsite.com/cesar). For MACH, we adapted the Python code provided on the author's website (https://tsourakakis.com/mining-tensors/) to Matlab. MACH requires an algorithm for computing the HOOI decomposition of the sparsified tensor. For this, we use TUCKER-ALS and then switch to higher orders of MET as necessary when we run out of memory. As recommended in [34], we keep each nonzero entry in the original tensor with probability 0.1 when using MACH.


Algorithm 1: TUCKER-ALS (aka HOOI)

input : Y, target rank (R_1, R_2, ..., R_N)
output: Rank-(R_1, R_2, ..., R_N) Tucker decomposition ⟦G; A^(1), ..., A^(N)⟧ of Y

1   Initialize A^(2), A^(3), ..., A^(N)
2   repeat
3       for n = 1, ..., N do
4           Z = Y ×_1 A^(1)⊤ ··· ×_{n−1} A^(n−1)⊤ ×_{n+1} A^(n+1)⊤ ··· ×_N A^(N)⊤
5           A^(n) = R_n leading left singular vectors of Z_(n)        /* Solves Eq. (4) */
6       end
7   until termination criteria met
8   G = Y ×_1 A^(1)⊤ ×_2 A^(2)⊤ ··· ×_N A^(N)⊤                        /* Solves Eq. (5) */
9   return G, A^(1), ..., A^(N)
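A minimal dense numpy sketch of Algorithm 1 is shown below. It omits the convergence test and any sparse or memory-efficient handling of Y, and the helper names are ours, not part of any reference implementation.

```python
import numpy as np

def unfold(X, n):
    """Mode-n matricization: mode n becomes the rows."""
    return np.moveaxis(X, n, 0).reshape(X.shape[n], -1)

def ttm_all(X, mats, skip=None):
    """Multiply X along every mode by the transpose of the corresponding factor,
    optionally skipping one mode (as on line 4)."""
    Z = X
    for n, A in enumerate(mats):
        if n == skip or A is None:
            continue
        Z = np.moveaxis(np.tensordot(A.T, np.moveaxis(Z, n, 0), axes=1), 0, n)
    return Z

def tucker_als(Y, ranks, n_iter=50):
    """Minimal dense HOOI/TUCKER-ALS sketch (no termination criterion)."""
    rng = np.random.default_rng(0)
    A = [None] + [np.linalg.qr(rng.standard_normal((Y.shape[n], ranks[n])))[0]
                  for n in range(1, Y.ndim)]                  # initialize A^(2), ..., A^(N)
    for _ in range(n_iter):
        for n in range(Y.ndim):
            Z = ttm_all(Y, A, skip=n)                         # line 4
            U = np.linalg.svd(unfold(Z, n), full_matrices=False)[0]
            A[n] = U[:, :ranks[n]]                            # line 5: leading left singular vectors
    G = ttm_all(Y, A)                                         # line 8: core tensor
    return G, A
```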

3.1 First proposed algorithm: TUCKER-TS

For our first algorithm, we TensorSketch both the least-squares problems in (4) and (5), and then solve the smaller resulting problems. We give an algorithm for this approach in Algorithm 2. We call it TUCKER-TS, where “TS” stands for TensorSketch. The core tensor and factor matrices in line 1 are initialized randomly with each element i.i.d. Uniform(−1, 1). The factor matrices are subsequently orthogonalized. On line 2 we define TensorSketch operators of appropriate size. Specifically, we define CountSketch operators S_1^(n) ∈ R^(J_1 × I_n) and S_2^(n) ∈ R^(J_2 × I_n) for n ∈ [N], and then define each T^(n) as in (2), but based on {S_1^(n)}_{n ∈ [N]} and with the nth term excluded in the Kronecker and Khatri-Rao products. T^(N+1) is defined similarly, but based on {S_2^(n)}_{n ∈ [N]} and without excluding any terms in the Kronecker and Khatri-Rao products.

It is important to note that the guarantees for TensorSketched least-squares in [10] hold when a random sketch is applied to a fixed least-squares problem. This corresponds to defining a new TensorSketch operator each time a least-squares problem is solved in Algorithm 2. In all of our experiments, we observe that our approach of instead defining the sketch operators upfront leads to a substantial reduction in the error for the algorithm as a whole (see Figure 1). We have not yet been able to provide theoretical justification for why this is.

The following proposition shows that the normal equations formulation of the least-squares problem on line 7 in Algorithm 2 is well-conditioned with high probability, and therefore can be solved efficiently using the conjugate gradient (CG) method. This is true because the factor matrices are orthogonal; it does not hold for the smaller system on line 5, so that system we solve via direct methods. In our experiments, for an accuracy of 1e-6, CG takes about 15 iterations regardless of I.

Proposition 3.1. Assume T^(N+1) is defined as in line 2 in Algorithm 2. Let M := (T^(N+1) ⊗_{i=N}^{1} A^(i))^⊤ (T^(N+1) ⊗_{i=N}^{1} A^(i)), where all A^(n) have orthonormal columns, and suppose ε, δ ∈ (0, 1). If J_2 ≥ (∏_n R_n)^2 (2 + 3^N)/(ε^2 δ), then the 2-norm condition number of M satisfies κ(M) ≤ (1 + ε)^2/(1 − ε)^2 with probability at least 1 − δ.

Remark 3.2. Defining the sketching operators upfront allows us to make the following improvements:

(a) Since Y remains unchanged throughout the algorithm, the N + 1 sketches of Y only need to be computed once, which we do upfront in a single pass over the data (using a careful implementation). This can also be done if elements of Y are streamed.

(b) Since the same CountSketch is applied to each A^(n) when sketching the Kronecker product in the inner loop, we can compute the quantity A_s1^(n) := (FFT(S_1^(n) A^(n)))^⊤ after updating A^(n) and reuse it when computing other factor matrices until A^(n) is updated again.

(c) When I_n ≥ J_1 + J_2 for some n ∈ [N], we can reduce the size of the least-squares problem on line 5. Note that the full matrix A^(n) is not needed until the return statement—only the sketches S_1^(n) A^(n) and S_2^(n) A^(n) are necessary to compute the different TensorSketches. Replacing T^(n) Y_(n)^⊤ on line 5 with T^(n) [Y_(n)^⊤ S_1^(n)⊤, Y_(n)^⊤ S_2^(n)⊤], which also can be computed upfront, we get a smaller least-squares problem which has the solution [S_1^(n) A^(n), S_2^(n) A^(n)]. Before the return statement, we then compute the full factor matrix A^(n). With this adjustment, we cannot orthogonalize the factor matrices on each iteration, and therefore Proposition 3.1 does not apply. In this case, we therefore use a dense method instead of CG when computing G in Algorithm 2.

Algorithm 2: TUCKER-TS (proposal)

input : Y, target rank (R_1, R_2, ..., R_N), sketch dimensions (J_1, J_2)
output: Rank-(R_1, R_2, ..., R_N) Tucker decomposition ⟦G; A^(1), ..., A^(N)⟧ of Y

1   Initialize G, A^(2), A^(3), ..., A^(N)
2   Define TensorSketch operators T^(n) ∈ R^(J_1 × ∏_{i≠n} I_i), for n ∈ [N], and T^(N+1) ∈ R^(J_2 × ∏_i I_i)
3   repeat
4       for n = 1, ..., N do
5           A^(n) = argmin_A ‖ (T^(n) ⊗_{i=N, i≠n}^{1} A^(i)) G_(n)^⊤ A^⊤ − T^(n) Y_(n)^⊤ ‖_F^2
6       end
7       G = argmin_Z ‖ (T^(N+1) ⊗_{i=N}^{1} A^(i)) z_(:) − T^(N+1) y_(:) ‖_2^2
8       Orthogonalize each A^(i) and update G
9   until termination criteria met
10  return G, A^(1), ..., A^(N)

3.2 Second proposed algorithm: TUCKER-TTMTS

We can rewrite the TTM product on line 4 of Algorithm 1 as Z_(n) = Y_(n) ⊗_{i=N, i≠n}^{1} A^(i). We TensorSketch this formulation as follows: Z_(n) = (T^(n) Y_(n)^⊤)^⊤ T^(n) ⊗_{i=N, i≠n}^{1} A^(i), n ∈ [N], where each T^(n) ∈ R^(J_1 × ∏_{i≠n} I_i) is a TensorSketch operator with target dimension J_1. We can similarly sketch the computation on line 8 in Algorithm 1 using a TensorSketch operator T^(N+1) ∈ R^(J_2 × ∏_i I_i) with target dimension J_2. Replacing the computations on lines 4 and 8 in Algorithm 1 with these sketched computations, we get our second algorithm, which we call TUCKER-TTMTS, where “TTMTS” stands for “TTM TensorSketch.” The algorithm is given in Algorithm 7. On line 1b, we define the sketching operators in the same way as for Algorithm 2. Since these are defined upfront, the same caveat applies here as in Algorithm 2.

The following informal proposition shows that the error for each sketched computation in TUCKER-TTMTS is additive, rather than multiplicative as for TUCKER-TS. A formal statement and proof is given in the supplementary material.

Proposition 3.3 (TUCKER-TTMTS (informal)). Assume each TensorSketch operator is redefined prior to being used. Let OBJ denote the objective function in (4), and let A^(n) be the R_n leading left singular vectors of Z_(n) defined on line 8 in Algorithm 7. Under certain conditions, A^(n) satisfies OBJ(A^(n)) ≤ min_A OBJ(A) + εC with high probability if J_1 is sufficiently large, where C depends on Y, the target rank, and the other factor matrices. A similar result holds for the update on line 13 and the objective function in (5).

3.3 Stopping conditions and orthogonalization

Unless stated otherwise, we stop after 50 iterations or when the change in ‖G‖ is less than 1e-3. The same type of convergence criteria are used in [22]. In Algorithm 2, we orthogonalize the factor matrices and update G using the reduced QR factorization. If we use the improvement in Remark 3.2 (c), we need to approximate G. This is discussed in the supplementary material. In Algorithm 7, we compute an estimate of G using the same formula as in line 13, but using the smaller sketch dimension J_1 instead.

Algorithm 3: TUCKER-TTMTS (proposal)

/* Identical to Algorithm 1, except for the lines below */
1a  Initialize A^(2), A^(3), ..., A^(N)
1b  Define TensorSketch operators T^(n) ∈ R^(J_1 × ∏_{i≠n} I_i), for n ∈ [N], and T^(N+1) ∈ R^(J_2 × ∏_i I_i)
4   Z_(n) = (T^(n) Y_(n)^⊤)^⊤ (T^(n) ⊗_{i=N, i≠n}^{1} A^(i))
8   g_(:) = (T^(N+1) ⊗_{i=N}^{1} A^(i))^⊤ T^(N+1) y_(:)

3.4 Complexity analysis

We compare the complexity of Algorithms 1–7 for the case N = 3. We assume that I_n = I and R_n = R < I for all n ∈ [N]. Furthermore, we assume that J_1 = K R^(N−1) and J_2 = K R^N for some constant K > 1, which is a choice that works well in practice. Table 1 shows the complexity when each of the variables I and R is assumed to be large. A more detailed complexity analysis is given in the supplementary material.

                          Variable assumed to be large
Algorithm        I = size of fiber                      R = rank
1 (T.-ALS)       (#iter + 1) · R I^3                    (#iter + 1) · R I^3
2 (proposal)     nnz(Y) + I R^4                         R^3 + #iter · R^6
7 (proposal)     nnz(Y) + I R^4 + #iter · I R^4         R^6 + #iter · R^4

Table 1: Leading order computational complexity, ignoring log factors and assuming K = O(1), where #iter is the number of main loop iterations. Y is the 3-way target tensor. The main benefit of our proposed algorithms is replacing the O(I^N) complexity of Algorithm 1 with R^(O(N)) complexity due to the sketching, since typically R ≪ I.

4 Experiments

In this section we present results from experiments; code to replicate them is online at [URL redacted during review]. All synthetic results are averages over ten runs in an environment using four cores of an Intel Xeon E5-2680 v3 @2.50GHz CPU and 21 GB of RAM. For Algorithms 2 and 7, the sketch dimensions J_1 and J_2 must be chosen. We have found that the choice J_1 = K R^(N−1) and J_2 = K R^N, for a constant K > 4, works well in practice. Figure 1 shows examples of how the error of TUCKER-TS and TUCKER-TTMTS, relative to that of TUCKER-ALS, changes with K. It also shows results for variants of each algorithm in which the TensorSketch operator is redefined each time it is used (called “multi-pass” in the figure). For both algorithms, defining TensorSketch operators upfront leads to higher accuracy than redefining them before each application. In subsequent experiments, we always define the sketch operators upfront (i.e., as written in Algorithms 2 and 7) and, unless stated otherwise, always use K = 10.

[Figure 1: two panels, (a) TUCKER-TS multi-pass vs. one-pass and (b) TUCKER-TTMTS multi-pass vs. one-pass, each plotting error relative to TUCKER-ALS against the sketch dimension parameter K.]

Figure 1: Errors of TUCKER-TS and TUCKER-TTMTS, relative to that of TUCKER-ALS, for different values of the sketch dimension parameter K. For both plots, the tensor size is 500 × 500 × 500 with nnz(Y) ≈ 1e+6 and true rank (15, 15, 15). The algorithms use a target rank of (10, 10, 10).


4.1 Sparse synthetic data

In this subsection, we apply our algorithms to synthetic sparse tensors. For all synthetic data we use I_n = I and R_n = R for all n ∈ [N]. The sparse tensors are each created from a random dense core tensor and random sparse factor matrices, where the sparsity of the factor matrices is chosen to achieve the desired sparsity of the tensor. We add i.i.d. normally distributed noise with standard deviation 1e-3 to all nonzero tensor elements.

Figures 2 and 3 show how the algorithms scale with increased dimension size I. Figures 4 and 5 show how the algorithms scale with tensor density and algorithm target rank R, respectively. TUCKER-ALS/MET and MACH run out of memory when I = 1e+5. FSTD1 is fast and scalable but inaccurate for very sparse tensors. That algorithm repeatedly finds indices of Y by identifying the element of maximum magnitude in fibers of the residual tensor; when Y is very sparse, however, it frequently happens that whole fibers in the residual tensor are zero, in which case the algorithm fails to find a good set of indices. This explains its poor accuracy in our experiments. We see that TUCKER-TS performs very well when Y truly is low-rank and we use that same rank for reconstruction. TUCKER-TTMTS in general has a larger error than TUCKER-TS, but scales better with higher target rank. Moreover, when the true rank of the input tensor is greater than the target rank (Fig. 3), which is closer to what real data might look like, the error of TUCKER-TTMTS is much closer to that of TUCKER-TS.

[Figure 2: panels (a) relative error and (b) run time versus dimension size I, for TUCKER-TS, TUCKER-TTMTS, TUCKER-ALS, MET(1), MET(2), FSTD1 and MACH; TUCKER-ALS, MET and MACH run out of memory at the largest sizes.]

Figure 2: Relative error and run time for random sparse 3-way tensors with varying dimension size I and nnz(Y) ≈ 1e+6. Both the true and target ranks are (10, 10, 10).

[Figure 3: panels (a) relative error and (b) run time versus dimension size I for the case R_true > R, comparing the same algorithms as in Figure 2; several baselines run out of memory at the largest sizes.]

Figure 3: Relative error and run time for random sparse 3-way tensors with varying dimension size I and nnz(Y) ≈ 1e+6. The true rank is (15, 15, 15) and target rank is (10, 10, 10). A convergence tolerance of 1e-1 is used for these experiments.

[Figure 4: panels (a) relative error and (b) run time versus number of nonzeros, comparing TUCKER-TS, TUCKER-TTMTS, TUCKER-ALS, MET(1), MET(2), FSTD1 and MACH.]

Figure 4: Relative error and run time for random sparse 3-way tensors with dimension size I = 1e+4 and varying number of nonzeros. Both the true and target ranks are (10, 10, 10).

[Figure 5: panels (a) relative error and (b) run time versus target rank R, comparing the same algorithms as in Figure 4.]

Figure 5: Relative error and run time for random sparse 3-way tensors with dimension size I = 1e+4 and nnz(Y) ≈ 1e+7. The true and target ranks are (R, R, R), with R varying.

4.2 Dense real-world data

In this section we apply TUCKER-TTMTS to a real dense tensor representing a grayscale video. The video consists of 2,200 frames, each of size 1,080 by 1,980 pixels. The whole tensor, which requires 38 GB of RAM, is too large to load all at once. Instead, it is loaded in pieces which are sketched and then added together. The video shows a natural scene, which is disturbed by a person passing by the camera twice. Since the camera is in a fixed position, we can expect this tensor to be compressible. We compute a rank-(10, 10, 10) Tucker decomposition of the tensor using TUCKER-TTMTS with the sketch dimension parameter set to K = 100 and a maximum of 30 iterations. We then apply k-means clustering to the factor matrix A^(3) ∈ R^(2200×10) corresponding to the time dimension, classifying each frame using the corresponding row in A^(3) as a feature vector. We find that using three clusters works better than using two. We believe this is because the light intensity changes through the video due to clouds, which introduces a third time-varying factor. Figure 6 shows five sample frames with the corresponding assigned clusters. With few exceptions, the frames which contain a disturbance are correctly grouped together into class 3, with the remaining frames grouped into classes 1 and 2. The video experiment is online at [URL redacted during review].
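A sketch of this frame-classification step is given below, assuming the time-mode factor matrix A^(3) is available as a 2200 × 10 array (random data stands in for it here). Scikit-learn's KMeans is one way to carry out the clustering; the report does not specify which implementation was used.

```python
import numpy as np
from sklearn.cluster import KMeans

# A3: time-mode factor matrix from the Tucker decomposition, one row per frame.
# Here random data stands in for the 2200 x 10 matrix described in the text.
A3 = np.random.randn(2200, 10)

# Each frame is classified using its row of A3 as a feature vector.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(A3)

# Frames in the same cluster receive the same class; in the experiment, one of
# the three clusters collects the frames containing a disturbance.
frame_classes = labels + 1   # classes 1, 2, 3 as in Figure 6
```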

[Figure 6: five sample frames, (a) Frame 500, (b) Frame 1450, (c) Frame 1650, (d) Frame 1850, (e) Frame 2000, each shown above a plot of assigned class (1–3) versus frame number, with the current frame and the correct/incorrect classifications marked.]

Figure 6: Five sample frames with their assigned classes. The frames (b) and (d) contain a disturbance.


5 Conclusion

We have proposed two algorithms for low-rank Tucker decomposition which incorporate TensorSketch and can handle streamed data. Experiments corroborate our complexity analysis, which shows that the algorithms scale well both with dimension size and density. TUCKER-TS, and to a lesser extent TUCKER-TTMTS, scale poorly with target rank, so they are most useful when R ≪ I.

References

[1] Woody Austin, Grey Ballard, and Tamara G. Kolda. Parallel tensor compression for large-scale scientific data. Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016, pages 912–922, 2016.

[2] Haim Avron, Huy Nguyen, and David Woodruff. Subspace embeddings for the polynomial kernel. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2258–2266. Curran Associates, Inc., 2014.

[3] Brett W. Bader, Tamara G. Kolda, et al. Matlab Tensor Toolbox version 2.6. Available online, February 2015.

[4] Muthu Baskaran, Benoit Meister, Nicolas Vasilache, and Richard Lethin. Efficient and scalable computations with sparse tensors. 2012 IEEE Conference on High Performance Extreme Computing, HPEC 2012, 2012.

[5] Casey Battaglino, Grey Ballard, and Tamara G. Kolda. A practical randomized CP tensor decomposition. CoRR, abs/1701.06600, 2017.

[6] Cesar F. Caiafa and Andrzej Cichocki. Generalizing the column-row matrix decomposition to multi-way arrays. Linear Algebra and Its Applications, 433(3):557–573, 2010.

[7] Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. Theoretical Computer Science, 312:3–15, 2004.

[8] Kenneth L. Clarkson and David P. Woodruff. Low rank approximation and regression in input sparsity time. Journal of the ACM, 63(6):1–45, 2012.

[9] Michele N. Da Costa, Renato R. Lopes, and João Marcos T. Romano. Randomized methods for higher-order subspace separation. European Signal Processing Conference, pages 215–219, 2016.

[10] Huaian Diao, Zhao Song, Wen Sun, and David P. Woodruff. Sketching for Kronecker product regression and P-splines. In International Conference on Artificial Intelligence and Statistics, AISTATS 2018, 9-11 April 2018, Playa Blanca, Lanzarote, Canary Islands, Spain, pages 1299–1308, 2018.

[11] Petros Drineas and Michael W. Mahoney. A randomized algorithm for a tensor-based generalization of the singular value decomposition. Linear Algebra and Its Applications, 420(2-3):553–571, 2007.

[12] Hadi Fanaee-T and João Gama. Multi-aspect-streaming tensor analysis. Knowledge-Based Systems, 89:332–345, 2015.

[13] S. Friedland, V. Mehrmann, A. Miedlar, and M. Nkengla. Fast low rank approximations of matrices and tensors. Electron. J. Linear Algebra, 22:462, 2011.

[14] Gene H. Golub and Charles F. Van Loan. Matrix Computations.

[15] Ekta Gujral, Ravdeep Pasricha, and Evangelos E. Papalexakis. SamBaTen: Sampling-based batch incremental tensor decomposition. http://arxiv.org/abs/1709.00668, 2017.

[16] Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. 2009.

[17] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 40 West 20th Street, New York, NY 10011, USA, 1 edition, 1985.

[18] Inah Jeon, Evangelos E. Papalexakis, U. Kang, and Christos Faloutsos. HaTen2: Billion-scale tensor decompositions. Proceedings - International Conference on Data Engineering, 2015-May:1047–1058, 2015.

[19] Oguz Kaya and Bora Ucar. High performance parallel algorithms for the Tucker decomposition of sparse tensors. Proceedings of the International Conference on Parallel Processing, pages 103–112, September 2016.

[20] Tamara G. Kolda. Multilinear operators for higher-order decompositions. SANDIA Report SAND2006-2081, pages 1–28, April 2006.

[21] Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.

[22] Tamara G. Kolda and Jimeng Sun. Scalable tensor decompositions for multi-aspect data mining. Proceedings - IEEE International Conference on Data Mining, ICDM, pages 363–372, 2008.

[23] Jiajia Li, Casey Battaglino, Ioakeim Perros, Jimeng Sun, and Richard Vuduc. An input-adaptive and in-place approach to dense tensor-times-matrix multiply. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '15, pages 76:1–76:12, New York, NY, USA, 2015. ACM.

[24] Jiajia Li, Yuchen Ma, Chenggang Yan, and Richard Vuduc. Optimizing sparse tensor times matrix on multi-core and many-core architectures. Proceedings of IA3 2016 - 6th Workshop on Irregular Applications: Architectures and Algorithms, held in conjunction with SC 2016, pages 26–33, 2017.

[25] Bangtian Liu, Chengyao Wen, Anand D. Sarwate, and Maryam Mehri Dehnavi. A unified optimization approach for sparse tensor operations on GPUs. Proceedings - IEEE International Conference on Cluster Computing, ICCC, pages 47–57, September 2017.

[26] Charles F. Van Loan. The ubiquitous Kronecker product. Journal of Computational and Applied Mathematics, 123(1-2):85–100, 2000.

[27] Michael W. Mahoney, Mauro Maggioni, and Petros Drineas. Tensor-CUR decompositions for tensor-based data. SIAM J. Matrix Anal. Appl., 30(3):957–987, 2008.

[28] Jinoh Oh, Kijung Shin, Evangelos E. Papalexakis, Christos Faloutsos, and Hwanjo Yu. S-HOT: Scalable high-order Tucker decomposition. Proceedings of the Tenth ACM International Conference on Web Search and Data Mining - WSDM '17, pages 761–770, 2017.

[29] I. V. Oseledets, D. V. Savostianov, and E. E. Tyrtyshnikov. Tucker dimensionality reduction of three-dimensional arrays in linear time. SIAM Journal on Matrix Analysis and Applications, 30(3):939–956, 2008.

[30] Rasmus Pagh. Compressed matrix multiplication. ACM Transactions on Computation Theory, 5(3):1–17, 2013.

[31] Ninh Pham and Rasmus Pagh. Fast and scalable polynomial kernels via explicit feature maps. Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '13, page 239, 2013.

[32] Yang Shi, U. N. Niranjan, Animashree Anandkumar, and Cris Cecka. Tensor contractions with extended BLAS kernels on CPU and GPU. Proceedings - 23rd IEEE International Conference on High Performance Computing, HiPC 2016, pages 193–202, 2017.

[33] Jimeng Sun, Dacheng Tao, Spiros Papadimitriou, Philip S. Yu, and Christos Faloutsos. Incremental tensor analysis. ACM Transactions on Knowledge Discovery from Data, 2(3):1–37, 2008.

[34] Charalampos E. Tsourakakis. MACH: Fast randomized tensor decompositions. In Proceedings of the SIAM International Conference on Data Mining, SDM 2010, April 29 - May 1, 2010, Columbus, Ohio, USA, pages 689–700, 2010.

[35] Yining Wang, Hsiao-Yu Tung, Alexander J. Smola, and Anima Anandkumar. Fast and guaranteed tensor decomposition via sketching. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 991–999. Curran Associates, Inc., 2015.

[36] Guoxu Zhou, Andrzej Cichocki, and Shengli Xie. Decomposition of big tensors with low multilinear rank. CoRR, abs/1412.1885, 2014.

6 Appendix

A Definitions of tensor concepts

In this section we give definitions of the tensor concepts used in the main manuscript and in the supplementary material. We use the same notations and definitions as in [21]. A tensor X ∈ R^(I_1 × I_2 × ··· × I_N) is an array of dimension N, also called an N-way array. Boldface Euler script letters, e.g. X, will denote tensors of dimension 3 or greater; bold capital letters, e.g. X, will denote matrices; bold lowercase letters, e.g. x, will denote vectors; and lower case letters, e.g. x, will denote scalars. We use a colon to denote all elements along a certain dimension; for example, x_{n:} is the nth row of X, and X_{::n} is the nth so-called frontal slice of the 3-way tensor X. The norm of a tensor X ∈ R^(I_1 × I_2 × ··· × I_N) is defined as

    ‖X‖ = sqrt( Σ_{i_1=1}^{I_1} Σ_{i_2=1}^{I_2} ··· Σ_{i_N=1}^{I_N} x_{i_1 i_2 ··· i_N}^2 ).    (6)

The Kronecker product of two matrices A ∈ R^(I_1 × R_1) and B ∈ R^(I_2 × R_2) is denoted by A ⊗ B ∈ R^(I_1 I_2 × R_1 R_2) and is defined by

    A ⊗ B = [ a_{11} B      a_{12} B      ···   a_{1 R_1} B
              a_{21} B      a_{22} B      ···   a_{2 R_1} B
              ⋮             ⋮                   ⋮
              a_{I_1 1} B   a_{I_1 2} B   ···   a_{I_1 R_1} B ].    (7)

The Khatri-Rao product of two matrices A ∈ R^(I_1 × R) and B ∈ R^(I_2 × R) is denoted by A ⊙ B ∈ R^(I_1 I_2 × R) and is defined by

    A ⊙ B = [ a_{:1} ⊗ b_{:1}   a_{:2} ⊗ b_{:2}   ···   a_{:R} ⊗ b_{:R} ].    (8)
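The snippet below evaluates (7) and (8) on two small matrices; the Khatri-Rao product is formed columnwise, since numpy has no built-in for it.

```python
import numpy as np

A = np.arange(6.0).reshape(3, 2)
B = np.arange(8.0).reshape(4, 2)

kron = np.kron(A, B)                                   # Eq. (7): shape (12, 4)

# Eq. (8): Khatri-Rao product = columnwise Kronecker product, shape (12, 2).
khatri_rao = (A[:, None, :] * B[None, :, :]).reshape(A.shape[0] * B.shape[0], -1)
print(np.allclose(khatri_rao[:, 0], np.kron(A[:, 0], B[:, 0])))   # True
```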

We can “flatten”, or matricize, a tensor into a matrix. The mode-n matricization of a tensor X ∈ R^(I_1 × I_2 × ··· × I_N) is denoted by X_(n) ∈ R^(I_n × ∏_{i≠n} I_i) and maps the element in position (i_1, i_2, ..., i_N) in X to position (i_n, j) in X_(n), where

    j = 1 + Σ_{k=1, k≠n}^{N} (i_k − 1) J_k,   with   J_k = ∏_{m=1, m≠n}^{k−1} I_m.    (9)

Similarly, x_(:) ∈ R^(∏_n I_n) will denote the vectorization of X obtained by stacking the columns of X_(1) into a long column vector. The n-mode product of a tensor X ∈ R^(I_1 × I_2 × ··· × I_N) and a matrix A ∈ R^(J × I_n) is denoted by X ×_n A ∈ R^(I_1 × ··· × I_{n−1} × J × I_{n+1} × ··· × I_N) and defined by

    (X ×_n A)_{i_1 ··· i_{n−1} j i_{n+1} ··· i_N} = Σ_{i_n=1}^{I_n} x_{i_1 i_2 ··· i_N} a_{j i_n}.    (10)


We can express this definition more compactly using matrix notation as follows:

    Y = X ×_n A   ⇔   Y_(n) = A X_(n).    (11)
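The following numpy sketch implements the mode-n matricization of (9) and the n-mode product through the identity (11), and checks the result against a direct contraction. The helper names are ours.

```python
import numpy as np

def unfold(X, n):
    """Mode-n matricization following Eq. (9): mode n becomes the rows and the
    remaining modes are ordered with the earliest mode varying fastest."""
    return np.reshape(np.moveaxis(X, n, 0), (X.shape[n], -1), order="F")

def fold(Xn, n, shape):
    """Inverse of unfold for a tensor of the given shape."""
    full = [shape[n]] + [s for i, s in enumerate(shape) if i != n]
    return np.moveaxis(np.reshape(Xn, full, order="F"), 0, n)

def mode_n_product(X, A, n):
    """n-mode product via Eq. (11): Y = X x_n A  <=>  Y_(n) = A X_(n)."""
    shape = list(X.shape)
    shape[n] = A.shape[0]
    return fold(A @ unfold(X, n), n, shape)

# Check against a direct tensor contraction on a small random example.
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4, 5))
A = rng.standard_normal((7, 4))
Y = mode_n_product(X, A, 1)
print(np.allclose(Y, np.moveaxis(np.tensordot(A, X, axes=([1], [1])), 0, 1)))   # True
```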

B Some further information on CountSketch and TensorSketch

As mentioned in the main manuscript, TensorSketch can be seen as a restricted kind of CountSketch which can be applied very efficiently to Kronecker products. We first explain how CountSketch can be applied to least-squares problems and for matrix multiplication, and then explain how these ideas extend to TensorSketch. Nothing in this subsection is our own work; it is simply a brief introduction to the sketching techniques we use.

The basic idea of applying CountSketch to least-squares regression is the following. Suppose A ∈ R^(I × R), where I ≫ R, and y ∈ R^I, and consider solving the overdetermined least-squares problem

    x* := argmin_{x ∈ R^R} ‖Ax − y‖_2.    (12)

CountSketch allows us to reduce the size of this problem by first applying a subspace embedding matrix S : R^I → R^J to A and y, where J is much smaller than I, so that we instead solve

    x′ := argmin_{x ∈ R^R} ‖SAx − Sy‖_2.    (13)

One way to define the matrix S is as S = ΦD, where

• h : [I] → [J] is a random map such that (∀i ∈ [I]) (∀j ∈ [J]) P(h(i) = j) = 1/J;

• Φ ∈ R^(J × I) is a matrix with Φ_{h(i), i} = 1, and all other entries equal to 0; and

• D ∈ R^(I × I) is a diagonal matrix, with each diagonal entry equal to +1 or −1 with equal probability.

For a fixed ε > 0, one can ensure that the solution to the sketched problem (13) satisfies

    ‖Ax′ − y‖_2 ≤ (1 + ε) ‖Ax* − y‖_2    (14)

with high probability by choosing the sketch dimension J sufficiently large. Moreover, the matrix S can be applied in O(nnz(A)) time, where nnz(A) denotes the number of nonzero elements of A, making it an especially efficient sketch if A is sparse; see [8] for further details.
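A minimal sketch-and-solve example in numpy is shown below, assuming a dense A (for sparse A the same np.add.at trick applies to its nonzeros only); the sketch dimension J = 500 is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)

def countsketch_ls(A, y, J):
    """Sketched least squares (13): build S = Phi D implicitly via a hash h and
    signs d, then solve the small J x R problem."""
    I = A.shape[0]
    h = rng.integers(0, J, size=I)                 # bucket for each row
    d = rng.choice([-1.0, 1.0], size=I)            # random signs (diagonal of D)
    SA = np.zeros((J, A.shape[1]))
    Sy = np.zeros(J)
    np.add.at(SA, h, d[:, None] * A)               # S A, computed in one pass over A
    np.add.at(Sy, h, d * y)                        # S y
    return np.linalg.lstsq(SA, Sy, rcond=None)[0]

# Overdetermined problem: I = 20000 rows, R = 10 unknowns, sketched to J = 500 rows.
A = rng.standard_normal((20000, 10))
y = A @ rng.standard_normal(10) + 0.01 * rng.standard_normal(20000)
x_star = np.linalg.lstsq(A, y, rcond=None)[0]
x_sketch = countsketch_ls(A, y, J=500)
print(np.linalg.norm(A @ x_sketch - y) / np.linalg.norm(A @ x_star - y))   # close to 1
```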

CountSketch can also be used for approximate matrix multiplication. Let S ∈ R^(J × I) denote the CountSketch operator. Informally, for matrices A and B with I rows and a fixed ε > 0, one can ensure that

    ‖A^⊤ S^⊤ S B − A^⊤ B‖_F ≤ ε ‖A‖_F ‖B‖_F    (15)

with high probability by choosing the sketch dimension J sufficiently large. A formal statement and proof of this is given in Lemma 32 of [8].

Now suppose the matrix A is of the form A = A^(1) ⊗ A^(2) ⊗ ··· ⊗ A^(N), where each A^(n) ∈ R^(I_n × R_n), I_n > R_n. Then A is of size I × R, where I := ∏_n I_n and R := ∏_n R_n. TensorSketch is a restricted variant of CountSketch which allows us to sketch A without having to first form the matrix. This is done by instead sketching each factor matrix A^(n) individually, and then computing the corresponding sketch of A, which can be done efficiently using the fast Fourier transform (FFT).

TensorSketch was first introduced in [30], where it is applied to compressed matrix multiplication. In [31], TensorSketch is used for approximating support vector machine polynomial kernels efficiently. Avron et al. [2] show that TensorSketch provides an oblivious subspace embedding. Diao et al. [10] show that an approximate solution to the least-squares problem in the sense of (14), when A is a Kronecker product, can be obtained with high probability by solving the TensorSketched problem instead of the full problem. The remainder of this subsection briefly describes how TensorSketch works.

In the following, in order to adhere to Matlab notation with indexing starting at 1, some definitions will differ slightly from those in [10].


Definition B.1 (k-wise independent). A family H := {h : [I] → [J]} of hash functions is k-wise independent if for any distinct x_1, ..., x_k ∈ [I] and uniformly random h ∈ H, the hash codes h(x_1), ..., h(x_k) are independent random variables, and the hash code of any fixed x ∈ [I] is uniformly distributed in [J]. We also call a function h drawn randomly and uniformly from H k-wise independent.

For n ∈ [N], let

• h_n : [I_n] → [J] be 3-wise independent hash functions, and let

• s_n : [I_n] → {−1, +1} be 4-wise independent sign functions.

Furthermore, define the hash function

    H : [I_1] × [I_2] × ··· × [I_N] → [J] : (i_1, ..., i_N) ↦ ( ( Σ_{n=1}^{N} (h_n(i_n) − 1) ) mod J ) + 1    (16)

and the sign function

    S : [I_1] × [I_2] × ··· × [I_N] → {−1, 1} : (i_1, ..., i_N) ↦ ∏_{n=1}^{N} s_n(i_n).    (17)

TensorSketch is the same as applying CountSketch to A with Φ defined using H instead of h, and with the diagonal of D given by S. To understand how the maps H and S map a certain row index, note that there is a bijection between the set of rows i ∈ [∏_n I_n] of A and the N-tuples (i_1, i_2, ..., i_N) ∈ [I_1] × [I_2] × ··· × [I_N]. Specifically, for the ith row, there is a unique N-tuple (i_1, i_2, ..., i_N) such that A_{i,:} = A^(1)_{i_1,:} ⊗ A^(2)_{i_2,:} ⊗ ··· ⊗ A^(N)_{i_N,:}. Pagh [30] observed that the application of CountSketch based on H and S can be done efficiently—without ever forming A—by CountSketching each factor matrix A^(n) using h_n and s_n, and then computing the TensorSketch of A using the FFT. Let S^(n) be the CountSketch matrix corresponding to using the hash function h_n and the diagonal matrix D^(n) with diagonal D^(n)_{i,i} = s_n(i). The application of S^(n) to a^(n)_{:r_n}, the r_nth column of A^(n), can be represented by the degree J − 1 polynomial

    P_{n, r_n}(ω) = Σ_{i=1}^{I_n} s_n(i) a^(n)_{i r_n} ω^{h_n(i) − 1} = Σ_{j=1}^{J} c^(n)_{j r_n} ω^{j − 1},    (18)

where c^(n)_{r_n} = (c^(n)_{1 r_n}, ..., c^(n)_{J r_n}) are the polynomial coefficients. With this representation, the rows of a^(n)_{:r_n} are multiplied with the correct sign given by s_n, and then grouped together in different polynomial coefficients depending on which bucket h_n assigns them to. Letting T denote the TensorSketch operator, the rth column of TA can similarly be represented as the polynomial

    P_r(ω) = Σ_{i=1}^{I} S(i_1, ..., i_N) a_{ir} ω^{H(i_1, ..., i_N) − 1}    (19)
           = Σ_{i=1}^{I} s_1(i_1) ··· s_N(i_N) a^(1)_{i_1 r_1} ··· a^(N)_{i_N r_N} ω^{(h_1(i_1) + ··· + h_N(i_N) − N) mod J}    (20)
           = FFT^(−1)( FFT(c^(1)_{r_1}) ∗ ··· ∗ FFT(c^(N)_{r_N}) ),    (21)

where FFT denotes the unpadded FFT, “∗” denotes the Hadamard (element-wise) product, and where the i in (19) corresponds to the N-tuple (i_1, i_2, ..., i_N) ∈ [I_1] × [I_2] × ··· × [I_N]. Similarly, r corresponds to the N-tuple (r_1, r_2, ..., r_N) ∈ [R_1] × [R_2] × ··· × [R_N]. So each column of TA can be computed efficiently from the sketches of the corresponding columns of the factor matrices. More concisely, we can write

    TA = FFT^(−1)( ( ⊙_{n=1}^{N} (FFT(S^(n) A^(n)))^⊤ )^⊤ ),    (22)


where FFT is applied columnwise to a matrix argument. Applying T naively to A would, in the general dense case, cost O(J ∏_{n=1}^{N} I_n R_n). With the trick above, however, it is straightforward to compute the reduced cost to be O(J Σ_{n=1}^{N} I_n R_n + J log J Σ_{n=1}^{N} R_n + J log J ∏_{n=1}^{N} R_n).

Theorem 3.1 in [10] gives theoretical guarantees for the optimality of the solution to (13) when S is a TensorSketch. We have found the sketch dimension of that theorem to be pessimistic in practice, with much smaller sketch dimensions yielding satisfying results. Indeed, Example 6.1 of [10] also achieves good accuracy with a significantly smaller sketch dimension than what the theorem dictates.

Approximate matrix multiplication as in (15) also holds for TensorSketch; see Lemma B.1 in [10] for a formal statement and proof.

C Detailed algorithms

In the detailed algorithms below, we will use the following notations: CS(A, h, s) will denote the application of CountSketch to the matrix A using the hash function h and the sign function s. TS(A, {h_n}_{n=1}^{N}, {s_n}_{n=1}^{N}) will denote the application of TensorSketch to the matrix A with the set of hash functions {h_n}_{n=1}^{N} and sign functions {s_n}_{n=1}^{N}. We will overload the function TS in the following way. With the same notation as in (22), suppose A_s^(n) = (FFT(CS(A^(n), h_n, s_n)))^⊤ = (FFT(S^(n) A^(n)))^⊤. We will define TS with a single argument as

    TS( {A_s^(n)}_{n=1}^{N} ) = FFT^(−1)( ( ⊙_{n=1}^{N} A_s^(n) )^⊤ ).    (23)

In Algorithm 4, we give a version of TUCKER-TS where we define a new TensorSketch operator before each least-squares sketch. In Algorithm 5, we give a detailed version of the TUCKER-TS algorithm in the main manuscript in the case when the dimension is sufficiently small for double-sketching to be unnecessary, i.e., when each I_n ≤ J_1 + J_2 and improvement (c) of Remark 3.2 is not used. In Algorithm 6, we give a detailed version of TUCKER-TS for when I_n ≥ J_1 + J_2 for all n ∈ [N] and improvement (c) is applied to all dimensions. In cases when some I_n are large and some small, it is straightforward to combine elements from both algorithms to get the optimal algorithm. We give the algorithm for TUCKER-TS in these two versions to simplify complexity analysis. In Algorithm 7, we give a detailed version of the TUCKER-TTMTS algorithm in the main manuscript.

C.1 Orthogonalization

In Algorithms 4 and 5 we orthogonalize using QR factorization in the following way. The computation

    Q^(n) R^(n) = A^(n)   (reduced QR factorization)    (24)
    A^(n) = Q^(n)    (25)
    G = G ×_n R^(n)    (26)

ensures that each A^(n) is orthogonal, and that the Tucker decomposition remains unchanged. In Algorithm 6, we cannot do this computation since we only maintain sketched versions of the factor matrices. We will use the following heuristic instead. After updating G on line 13, we compute an approximation of the normalization of G as follows: set G_normalized = G, and for n = 1, ..., N,

    Q^(n) R^(n) = A_s2^(n)   (reduced QR factorization)    (27)
    G_normalized = G_normalized ×_n Q^(n).    (28)

We then check for convergence based on ‖G_normalized‖ instead of ‖G‖.
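A minimal numpy version of the update (24)–(26) is given below, with an n-mode product helper of our own naming; it simply pushes each R factor into the core so that ⟦G; A^(1), ..., A^(N)⟧ is unchanged.

```python
import numpy as np

def mode_n_product(X, A, n):
    """n-mode product X x_n A."""
    return np.moveaxis(np.tensordot(A, np.moveaxis(X, n, 0), axes=1), 0, n)

def orthogonalize(G, factors):
    """Eqs. (24)-(26): replace each factor by its Q and absorb R into the core."""
    for n, A in enumerate(factors):
        Q, R = np.linalg.qr(A)          # reduced QR factorization
        factors[n] = Q                  # A^(n) <- Q^(n)
        G = mode_n_product(G, R, n)     # G <- G x_n R^(n)
    return G, factors
```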


C.2 Complexity analysis

We analyze the complexity of TUCKER-ALS given in the main paper, as well as Algorithms 4–6. One can show that the leading order cost of applying a TensorSketch T ∈ R^(J × ∏_n I_n) to ⊗_{n=1}^{N} A^(n), with each A^(n) ∈ R^(I_n × R_n), is O(Σ_n I_n R_n + J log J Σ_n R_n + J log J ∏_n R_n). For simplicity, we will assume that I_n = I and R_n = R for all n ∈ [N]. We divide costs into costs per iteration of the main loop, and one-time costs due to computations outside the main loop.

C.2.1 TUCKER-ALS (algorithm given in main paper)

The dominant cost in TUCKER-ALS is the TTM product. Assuming that I > NR, the cost per iteration is O(R I^N). The final computation of G also costs O(R I^N). If Y is sparse, these costs will be lower.

C.2.2 Algorithm 4: TUCKER-TS, multi-pass

Algorithm 4: TUCKER-TS, multi-pass (detailed)

input : Y, (R_1, R_2, ..., R_N), (J_1, J_2)
output: Rank-(R_1, R_2, ..., R_N) Tucker decomposition ⟦G; A^(1), ..., A^(N)⟧ of Y

1   Initialize G, A^(2), A^(3), ..., A^(N)
2   repeat
3       for n = 1, ..., N do
4           For k ∈ [N] \ {n}, determine hash functions h_k^(1) : [I_k] → [J_1] and sign functions s_k : [I_k] → {−1, +1}
5           M_1 = TS( ⊗_{i=N, i≠n}^{1} A^(i), {h_i^(1)}_{i=N, i≠n}^{1}, {s_i}_{i=N, i≠n}^{1} )
6           Y_(n),s1^⊤ = TS( Y_(n)^⊤, {h_i^(1)}_{i=N, i≠n}^{1}, {s_i}_{i=N, i≠n}^{1} )
7           A^(n) = argmin_A ‖ M_1 G_(n)^⊤ A^⊤ − Y_(n),s1^⊤ ‖_F^2
8       end
9       For k ∈ [N], determine hash functions h_k^(2) : [I_k] → [J_2] and sign functions s_k : [I_k] → {−1, +1}
10      M_2 = TS( ⊗_{i=N}^{1} A^(i), {h_i^(2)}_{i=N}^{1}, {s_i}_{i=N}^{1} )
11      y_(:),s2 = TS( y_(:), {h_i^(2)}_{i=N}^{1}, {s_i}_{i=N}^{1} )
12      G = argmin_Z ‖ M_2 z_(:) − y_(:),s2 ‖_2^2
13  until termination criteria met
14  return G, A^(1), ..., A^(N)

One-time cost. Negligible.

Cost per iteration. The inner for loop has the following costs:

• Line 5: We can apply the TensorSketch efficiently at a cost of O(N I R J_1 + N R J_1 log J_1 + R^(N−1) J_1 log J_1).

• Line 6: Since we have to compute this as a CountSketch, the cost is nnz(Y).

• Line 7: The cost for computing M_1 G_(n)^⊤ is O(J_1 R^N), the cost for QR factorizing this quantity is O(J_1 R^2), and the cost for solving each of the I least-squares problems is O(I J_1 R + I R^2).


So the cost for the inner for loop is O(N^2 I R J_1 + N^2 R J_1 log J_1 + N R^(N−1) J_1 log J_1 + N nnz(Y) + N J_1 R^N + N I J_1 R + N I R^2). The costs of the remaining lines are:

NIJ1R+NIR2). The cost of the remaining lines are:

• Line 10: We can TensorSketch efficiently to a cost of O(NIRJ2 +NRJ2 log J2 +RNJ2 log J2).

• Line 11: Since we have to compute this as a CountSketch, the cost is nnz(Y).

• Line 12: The cost for solving this system using e.g. QR factorization is O(J2R2N ).

The total cost per iteration is therefore, to leading order,

    O( N nnz(Y) + N^2 I R J_1 + N^2 R J_1 log J_1 + N R^(N−1) J_1 log J_1 + N J_1 R^N    (29)
       + N I J_1 R + N I R^2 + N I R J_2 + N R J_2 log J_2 + R^N J_2 log J_2 + J_2 R^(2N) ).    (30)

C.2.3 Algorithm 5: TUCKER-TS, one-pass with I < J1 + J2

Algorithm 5: TUCKER-TS, one-pass with all I_n < J_1 + J_2 (detailed)

input : Y, (R_1, R_2, ..., R_N), (J_1, J_2)
output: Rank-(R_1, R_2, ..., R_N) Tucker decomposition ⟦G; A^(1), ..., A^(N)⟧ of Y

1   For n ∈ [N], determine hash functions h_n^(l) : [I_n] → [J_l] with l = 1, 2, and sign functions s_n : [I_n] → {−1, +1}
2   Initialize G. For n ∈ [N], initialize A^(n) and compute A_s1^(n) = (FFT(CS(A^(n), h_n^(1), s_n)))^⊤
3   For n ∈ [N], Y_(n),s1^⊤ = TS( Y_(n)^⊤, {h_i^(1)}_{i=N, i≠n}^{1}, {s_i}_{i=N, i≠n}^{1} )
4   y_(:),s2 = TS( y_(:), {h_i^(2)}_{i=N}^{1}, {s_i}_{i=N}^{1} )
5   repeat
6       for n = 1, ..., N do
7           M_1 = TS( {A_s1^(i)}_{i=N, i≠n}^{1} )
8           A^(n) = argmin_A ‖ M_1 G_(n)^⊤ A^⊤ − Y_(n),s1^⊤ ‖_F^2
9           A_s1^(n) = (FFT(CS(A^(n), h_n^(1), s_n)))^⊤
10      end
11      M_2 = TS( ⊗_{i=N}^{1} A^(i), {h_i^(2)}_{i=N}^{1}, {s_i}_{i=N}^{1} )
12      G = argmin_Z ‖ M_2 z_(:) − y_(:),s2 ‖_2^2
13  until termination criteria met
14  return G, A^(1), ..., A^(N)

One-time cost

• Line 2: The cost of the $N-1$ CountSketches and FFTs of factor matrices is $O(NIR + NRJ_1\log J_1)$.

• Lines 3 and 4: The sketching costs $O(N\,\mathrm{nnz}(\mathcal{Y}))$.

So the total one-time cost is $O(N\,\mathrm{nnz}(\mathcal{Y}) + NIR + NRJ_1\log J_1)$.
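The CountSketch-plus-FFT representation used on lines 2 and 7 rests on the identity $\mathrm{TS}(A^{(1)} \otimes A^{(2)}) = \mathrm{FFT}^{-1}\big(\text{row-wise Kronecker of } \mathrm{FFT}(\mathrm{CS}(A^{(1)})) \text{ and } \mathrm{FFT}(\mathrm{CS}(A^{(2)}))\big)$, so the Kronecker product is never formed explicitly. The following NumPy sketch (illustrative only; the function names and sizes are ours, not the authors') checks this identity for two small factor matrices.

```python
import numpy as np

def countsketch(A, h, s, J):
    """CountSketch of the rows of A: row i is added, with sign s[i], to output row h[i]."""
    out = np.zeros((J, A.shape[1]))
    np.add.at(out, h, s[:, None] * A)
    return out

def tensorsketch_kron(A1, A2, h1, s1, h2, s2, J):
    """TensorSketch of A1 kron A2 via FFTs of the two CountSketches (Kronecker never formed)."""
    F1 = np.fft.fft(countsketch(A1, h1, s1, J), axis=0)     # J x R1
    F2 = np.fft.fft(countsketch(A2, h2, s2, J), axis=0)     # J x R2
    P = (F1[:, :, None] * F2[:, None, :]).reshape(J, -1)    # row-wise Kronecker, J x (R1*R2)
    return np.real(np.fft.ifft(P, axis=0))

# Verify against a CountSketch of the explicitly formed Kronecker product on a tiny example.
rng = np.random.default_rng(1)
I1, I2, R1, R2, J = 6, 5, 2, 3, 16
A1, A2 = rng.standard_normal((I1, R1)), rng.standard_normal((I2, R2))
h1, h2 = rng.integers(0, J, I1), rng.integers(0, J, I2)
s1, s2 = rng.choice([-1, 1], I1), rng.choice([-1, 1], I2)
H = (h1[:, None] + h2[None, :]) % J                         # combined hash on the product index
S = s1[:, None] * s2[None, :]                               # combined sign
direct = countsketch(np.kron(A1, A2), H.reshape(-1), S.reshape(-1), J)
fast = tensorsketch_kron(A1, A2, h1, s1, h2, s2, J)
print(np.allclose(direct, fast))                            # True
```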


Cost per iteration In the inner loop, we have the following costs:

• Line 7: The inverse FFT is the dominant cost at $O(R^{N-1}J_1\log J_1)$.

• Line 8: The cost for computing $M_1 G_{(n)}^\top$ is $O(J_1R^N)$, the cost for QR factorizing this quantity is $O(J_1R^2)$, and the cost for solving each of the $I$ least-squares problems is $O(IJ_1R + IR^2)$.

• Line 9: The cost of a CountSketch and an FFT is $O(IR + RJ_1\log J_1)$.

So the total cost for the inner loop is $O(NR^{N-1}J_1\log J_1 + NJ_1R^N + NIJ_1R + NIR^2)$. We have the following additional costs per outer iteration:

• Line 11: The TensorSketch can be applied efficiently at a cost of $O(NIRJ_2 + NRJ_2\log J_2 + R^NJ_2\log J_2)$.

• Line 12: The cost for solving this system using e.g. QR factorization is $O(J_2R^{2N})$. However, since the design matrix in the normal equation formulation of this problem is usually well-conditioned, we instead choose to solve this step using conjugate gradient, at a cost of $O(R^{2N})$ (a minimal sketch of this step follows the cost summary below).

The total cost per iteration is therefore, to leading order,
$$O\big(NR^{N-1}J_1\log J_1 + NJ_1R^N + NIJ_1R + NIR^2 + NIRJ_2 \qquad (31)$$
$$\quad + NRJ_2\log J_2 + R^NJ_2\log J_2 + R^{2N}\big). \qquad (32)$$
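As a concrete illustration of the conjugate-gradient option for line 12 (referenced in the bullet above), the following NumPy sketch solves the normal equations $(M_2^\top M_2)z = M_2^\top y_{(:),s_2}$ without forming $M_2^\top M_2$ explicitly. The function name and sizes are illustrative assumptions, not the authors' code.

```python
import numpy as np

# Hypothetical sketch of the CG solve for the core update on line 12: rather than
# QR-factorizing the J2 x R^N sketched design matrix M2, iterate on the normal
# equations, applying M2^T M2 through two matrix-vector products per iteration.
def core_update_cg(M2, y_s2, maxiter=100, tol=1e-12):
    b = M2.T @ y_s2
    z = np.zeros_like(b)
    r = b.copy()                      # residual of the normal equations
    p = r.copy()
    rs = r @ r
    for _ in range(maxiter):
        Ap = M2.T @ (M2 @ p)          # apply M2^T M2 without forming it
        alpha = rs / (p @ Ap)
        z += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol * np.linalg.norm(b):
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return z                          # vectorized core g(:), of length R^N

# Toy usage with illustrative sizes (R = 3, N = 3, J2 = 200):
rng = np.random.default_rng(2)
J2, RN = 200, 27
M2 = rng.standard_normal((J2, RN))
y_s2 = rng.standard_normal(J2)
z = core_update_cg(M2, y_s2)
print(np.linalg.norm(M2.T @ (M2 @ z) - M2.T @ y_s2))  # small residual
```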

C.2.4 Algorithm 6: TUCKER-TS, one-pass with $I > J_1 + J_2$

One-time costs

• Line 2: The cost of the $N-1$ CountSketches and FFTs of factor matrices is $O(NIR + NRJ_1\log J_1)$.

• Line 3: The TensorSketch costs $O(N\,\mathrm{nnz}(\mathcal{Y}))$ and the subsequent smaller CountSketches cost $O(N\min(IJ_1, \mathrm{nnz}(\mathcal{Y})))$.

• Line 4: Since we have to compute this as a CountSketch, the cost is $O(\mathrm{nnz}(\mathcal{Y}))$.

So the upfront cost is $O(N\,\mathrm{nnz}(\mathcal{Y}) + NIR + NRJ_1\log J_1)$, but with some additional smaller computations compared to Algorithm 5. We will also have some additional one-time costs after the end of the main loop:

• Line 16: The inverse FFT is the dominant cost at $O(NR^{N-1}J_1\log J_1)$ for $N$ repetitions.

• Line 17: The cost for computing $M_1 G_{(n)}^\top$ is $O(J_1R^N)$, the cost for QR factorizing this quantity is $O(J_1R^2)$, and the cost for solving each of the $I$ least-squares problems is $O(IJ_1R + IR^2)$. So the dominant cost, for $N$ repetitions, is $O(NJ_1R^N + NIJ_1R + NIR^2)$.

• Line 19: The cost of the CountSketch and the FFT is $O(NIR + NRJ_1\log J_1)$ for $N-1$ repetitions.

• Line 22: The cost of applying TensorSketch is $O(NIRJ_2 + NRJ_2\log J_2 + R^NJ_2\log J_2)$.

• Line 23: The cost for solving this system using e.g. QR factorization is $O(J_2R^{2N})$.

The total one-time cost is therefore
$$O\big(N\,\mathrm{nnz}(\mathcal{Y}) + NR^{N-1}J_1\log J_1 + NJ_1R^N + NIJ_1R + NIR^2 + NIRJ_2 \qquad (33)$$
$$\quad + NRJ_2\log J_2 + R^NJ_2\log J_2 + J_2R^{2N}\big). \qquad (34)$$


Algorithm 6: TUCKER-TS, one-pass with all $I_n > J_1 + J_2$ (detailed)

input:  $\mathcal{Y}$, $(R_1, R_2, \ldots, R_N)$, $(J_1, J_2)$
output: Rank-$(R_1, R_2, \ldots, R_N)$ Tucker decomposition $[\![\mathcal{G}; A^{(1)}, \ldots, A^{(N)}]\!]$ of $\mathcal{Y}$

 1  For $n \in [N]$, determine hash functions $h_n^{(l)} : [I_n] \to [J_l]$ with $l = 1, 2$, and $s_n : [I_n] \to \{-1,+1\}$
 2  Initialize $\mathcal{G}$. For $n \in [N]$, initialize $A^{(n)}$ and compute $A^{(n)}_{s_1} = \big(\mathrm{FFT}\big(\mathrm{CS}(A^{(n)}, h_n^{(1)}, s_n)\big)\big)^\top$
 3  For $n \in [N]$, $Y_{(n),s_1}^\top = \mathrm{TS}\big(Y_{(n)}^\top, \{h_i^{(1)}\}_{i=N, i\neq n}^{1}, \{s_i\}_{i=N, i\neq n}^{1}\big)$, $Y_{(n),s_1'} = \mathrm{CS}\big(Y_{(n),s_1}, h_n^{(1)}, s_n\big)$, $Y_{(n),s_1''} = \mathrm{CS}\big(Y_{(n),s_1}, h_n^{(2)}, s_n\big)$
 4  $y_{(:),s_2} = \mathrm{TS}\big(y_{(:)}, \{h_i^{(2)}\}_{i=N}^{1}, \{s_i\}_{i=N}^{1}\big)$
 5  repeat
 6      for $n = 1, \ldots, N$ do
 7          $M_1 = \mathrm{TS}\big(\{A^{(i)}_{s_1}\}_{i=N, i\neq n}^{1}\big)$
 8          $[A_{s_1}, A_{s_2}] = \arg\min_A \big\| M_1 G_{(n)}^\top A^\top - [Y_{(n),s_1'}^\top, Y_{(n),s_1''}^\top] \big\|_F^2$
 9          $A^{(n)}_{s_1} = (\mathrm{FFT}(A_{s_1}))^\top$
10          $A^{(n)}_{s_2} = (\mathrm{FFT}(A_{s_2}))^\top$
11      end
12      $M_2 = \mathrm{FFT}^{-1}\big(\big(\bigodot_{i=N}^{1} A^{(i)}_{s_2}\big)^\top\big)$
13      $\mathcal{G} = \arg\min_{\mathcal{Z}} \big\| M_2 z_{(:)} - y_{(:),s_2} \big\|_2^2$
14  until termination criteria met
15  for $n = 1, \ldots, N$ do
16      $M_1 = \mathrm{TS}\big(\{A^{(i)}_{s_1}\}_{i=N, i\neq n}^{1}\big)$
17      $A^{(n)} = \arg\min_A \big\| M_1 G_{(n)}^\top A^\top - Y_{(n),s_1}^\top \big\|_F^2$
18      if $n < N$ then
19          $A^{(n)}_{s_1} = \big(\mathrm{FFT}\big(\mathrm{CS}(A^{(n)}, h_n^{(1)}, s_n)\big)\big)^\top$
20      end
21  end
22  $M_2 = \mathrm{TS}\big(\bigotimes_{i=N}^{1} A^{(i)}, \{h_i^{(2)}\}_{i=N}^{1}, \{s_i\}_{i=N}^{1}\big)$
23  $\mathcal{G} = \arg\min_{\mathcal{Z}} \big\| M_2 z_{(:)} - y_{(:),s_2} \big\|_2^2$
24  return $\mathcal{G}$, $A^{(1)}, \ldots, A^{(N)}$

Cost per iteration

• Line 7: The inverse FFT is the dominant cost at $O(R^{N-1}J_1\log J_1)$.

• Line 8: The cost for computing $M_1 G_{(n)}^\top$ is $O(J_1R^N)$, the cost for QR factorizing this quantity is $O(J_1R^2)$, and the cost for solving each of the $J_1 + J_2$ least-squares problems is $O(J_1^2R + J_1R^2 + J_1J_2R + J_2R^2)$. The leading-order cost for this line is $O(J_1R^N + J_1^2R + J_1J_2R + J_2R^2)$.

• Lines 9 and 10: The cost of these two lines is $O(RJ_1\log J_1 + RJ_2\log J_2)$.


The total cost for the inner for loop is therefore $O(NR^{N-1}J_1\log J_1 + NJ_1R^N + NJ_1^2R + NJ_1J_2R + NJ_2R^2 + NRJ_2\log J_2)$. The costs of the remaining lines are:

• Line 12: The inverse FFT is the dominant cost at $O(R^NJ_2\log J_2)$.

• Line 13: The cost of solving this system, e.g. via QR factorization, is $O(J_2R^{2N})$.

The total cost per iteration is therefore
$$O\big(NR^{N-1}J_1\log J_1 + NJ_1R^N + NJ_1^2R + NJ_1J_2R \qquad (35)$$
$$\quad + NJ_2R^2 + NRJ_2\log J_2 + R^NJ_2\log J_2 + J_2R^{2N}\big). \qquad (36)$$

C.2.5 Algorithm 7: TUCKER-TTMTS

Algorithm 7: TUCKER-TTMTS (detailed)

input:  $\mathcal{Y}$, $(R_1, R_2, \ldots, R_N)$, $(J_1, J_2)$
output: Rank-$(R_1, R_2, \ldots, R_N)$ Tucker decomposition $[\![\mathcal{G}; A^{(1)}, \ldots, A^{(N)}]\!]$ of $\mathcal{Y}$

 1  For $n \in [N]$, determine hash functions $h_n^{(l)} : [I_n] \to [J_l]$ with $l = 1, 2$, and $s_n : [I_n] \to \{-1,+1\}$
 2  Initialize $\mathcal{G}$. For $n \in [N]$, initialize $A^{(n)}$ and compute $A^{(n)}_{s_1} = \big(\mathrm{FFT}\big(\mathrm{CS}(A^{(n)}, h_n^{(1)}, s_n)\big)\big)^\top$
 3  For $n \in [N]$, $Y_{(n),s_1}^\top = \mathrm{TS}\big(Y_{(n)}^\top, \{h_i^{(1)}\}_{i=N, i\neq n}^{1}, \{s_i\}_{i=N, i\neq n}^{1}\big)$
 4  $y_{(:),s_2} = \mathrm{TS}\big(y_{(:)}, \{h_i^{(2)}\}_{i=N}^{1}, \{s_i\}_{i=N}^{1}\big)$
 5  repeat
 6      for $n = 1, \ldots, N$ do
 7          $M_1 = \mathrm{TS}\big(\{A^{(i)}_{s_1}\}_{i=N, i\neq n}^{1}\big)$
 8          $Z^{(n)} = Y_{(n),s_1} M_1$
 9          $A^{(n)} = R_n$ leading left singular vectors of $Z^{(n)}$
10      end
11  until termination criteria met
12  $M_2 = \mathrm{TS}\big(\bigotimes_{i=N}^{1} A^{(i)}, \{h_i^{(2)}\}_{i=N}^{1}, \{s_i\}_{i=N}^{1}\big)$
13  $g_{(:)} = M_2^\top y_{(:),s_2}$
14  return $\mathcal{G}$, $A^{(1)}, \ldots, A^{(N)}$

One-time costs

• Line 2: The cost of the $N-1$ CountSketches and FFTs of factor matrices is $O(NIR + NRJ_1\log J_1)$.

• Lines 3 and 4: The sketching costs $O(N\,\mathrm{nnz}(\mathcal{Y}))$.

• Line 12: The TensorSketch can be applied efficiently at a cost of $O(NIRJ_2 + NRJ_2\log J_2 + R^NJ_2\log J_2)$.

• Line 13: The cost of this matrix-vector multiplication is $O(J_2R^N)$.

The total one-time cost is therefore $O(NIRJ_2 + N\,\mathrm{nnz}(\mathcal{Y}) + NRJ_1\log(J_1) + NRJ_2\log(J_2) + R^NJ_2\log(J_2))$.


Cost per iteration

• Line 7: The inverse FFT is the dominant cost at $O(R^{N-1}J_1\log J_1)$.

• Line 8: The cost of this matrix-matrix multiplication is $O(J_1IR^{N-1})$.

• Line 9: We use the randomized SVD of [16], the cost of which is $O(IR^{N-1}\log(R) + R^{N+1})$.

Each of these is repeated $N$ times per outer iteration, so the total cost per outer iteration is therefore $O(NR^{N-1}J_1\log(J_1) + NJ_1IR^{N-1} + NIR^{N-1}\log(R) + NR^{N+1})$. A small illustrative sketch of the randomized SVD step is given below.
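The following NumPy sketch illustrates the randomized SVD step on line 9 (as referenced above). For simplicity it uses a Gaussian test matrix rather than the structured test matrix that achieves the stated $O(IR^{N-1}\log(R) + R^{N+1})$ cost; the function name, oversampling parameter and sizes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Hypothetical sketch of line 9: extract the R_n leading left singular vectors of
# Z = Y_(n),s1 @ M1 (of size I x R^(N-1)) with a randomized range finder in the
# spirit of Halko et al., using a Gaussian test matrix and a small oversampling p.
def leading_left_singular_vectors(Z, r, p=5):
    rng = np.random.default_rng(0)
    Omega = rng.standard_normal((Z.shape[1], r + p))   # random test matrix
    Q, _ = np.linalg.qr(Z @ Omega)                     # orthonormal basis for an approximate range
    U_small, _, _ = np.linalg.svd(Q.T @ Z, full_matrices=False)
    return (Q @ U_small)[:, :r]                        # approximate top-r left singular vectors

# Toy usage: Z of size I x R^(N-1) with I = 1000, R = 5, N = 4 (so R^(N-1) = 125).
rng = np.random.default_rng(1)
Z = rng.standard_normal((1000, 25)) @ rng.standard_normal((25, 125))
A_n = leading_left_singular_vectors(Z, r=5)
print(A_n.shape)   # (1000, 5)
```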

C.2.6 Summary of complexity results

In Tables 2–5 we summarize the results above. For clarity, we have made the additional assumption that $J_1 = KR^{N-1}$ and $J_2 = KR^N$ for some constant $K > 1$. We have found that this choice of $J_1$ and $J_2$ often works well in practice. Tables 2 and 3 give the leading-order complexities for one-time costs and per-iteration costs, respectively. Tables 4 and 5 give the corresponding complexities for each of the cases when the variables $I$, $R$ and $K$ are assumed to be large.

Alg.   | One-time cost
T.-ALS | $RI^N$
4      | Negligible
5      | $N\,\mathrm{nnz}(\mathcal{Y}) + NIR + NKR^N\log K + N^2KR^N\log R$
6      | $N\,\mathrm{nnz}(\mathcal{Y}) + NR^{2N-2}K\log K + N^2KR^{2N-2}\log R + NKR^{2N-1} + NIKR^{N+1} + R^{2N}K\log K + NKR^{2N}\log R + KR^{3N}$
7      | $NIKR^{N+1} + N\,\mathrm{nnz}(\mathcal{Y}) + NKR^{N+1}\log K + N^2KR^{N+1}\log R + KR^{2N}\log K + KNR^{2N}\log R$

Table 2: Comparison of computational complexity for one-time computations for TUCKER-ALS and Algorithms 4–7. All complexities are to leading order.

Alg.   | Per-iteration cost
T.-ALS | $RI^N$
4      | $N\,\mathrm{nnz}(\mathcal{Y}) + N^2IKR^N + N^2KR^N\log K + N^3KR^N\log R + NKR^{2N-2}\log K + N^2KR^{2N-2}\log R + NIKR^{N+1} + KR^{2N}\log K + NKR^{2N}\log R + KR^{3N}$
5      | $NR^{2N-2}K\log K + N^2KR^{2N-2}\log R + NKR^{2N-1} + NIKR^{N+1} + R^{2N}K\log K + NKR^{2N}\log R$
6      | $NR^{2N-2}K\log K + N^2KR^{2N-2}\log R + NK^2R^{2N} + R^{2N}K\log K + NKR^{2N}\log R + KR^{3N}$
7      | $NKR^{2N-2}\log K + KN^2R^{2N-2}\log R + NKIR^{2N-2}$

Table 3: Comparison of computational complexity for per-iteration computations for TUCKER-ALS and Algorithms 4–7. All complexities are to leading order.

D Proof of Proposition 3.1

Let $V$ be the subspace spanned by the columns of $U \overset{\mathrm{def}}{=} \bigotimes_{i=N}^{1} A^{(i)}$. Note that since each $A^{(i)}$ has orthonormal columns, so does $U$ [26]. Suppose $J_2 \geq \big(\prod_n R_n\big)^2(2 + 3^N)/(\varepsilon^2\delta)$.


One-time cost when the given variable is large

Alg.   | $I$                                           | $R$                  | $K$
T.-ALS | $RI^N$                                        | $RI^N$               | n/a
4      | Negligible                                    | Negligible           | Negligible
5      | $N\,\mathrm{nnz}(\mathcal{Y}) + NIR$          | $N^2KR^N\log R$      | $NR^NK\log K$
6      | $N\,\mathrm{nnz}(\mathcal{Y}) + NIKR^{N+1}$   | $KR^{3N}$            | $(NR^{2N-2} + R^{2N})K\log K$
7      | $N\,\mathrm{nnz}(\mathcal{Y}) + NIKR^{N+1}$   | $KNR^{2N}\log R$     | $(NR^{N+1} + R^{2N})K\log K$

Table 4: Comparison of computational complexity for one-time computations for TUCKER-ALS and Algorithms 4–7 when each of the variables $I$, $R$ and $K$ is assumed to be large. All complexities are to leading order.

Per-iteration cost when the given variable is large

Alg.   | $I$                                                    | $R$                    | $K$
T.-ALS | $RI^N$                                                 | $RI^N$                 | n/a
4      | $N\,\mathrm{nnz}(\mathcal{Y}) + N^2IKR^N + NIKR^{N+1}$ | $KR^{3N}$              | $(N^2R^N + NR^{2N-2} + R^{2N})K\log K$
5      | $NIKR^{N+1}$                                           | $NKR^{2N}\log R$       | $NR^{2N-2}K\log K + R^{2N}K\log K$
6      | Negligible                                             | $KR^{3N}$              | $NK^2R^{2N}$
7      | $NIKR^{2N-2}$                                          | $KN^2R^{2N-2}\log R$   | $NR^{2N-2}K\log K$

Table 5: Comparison of computational complexity for per-iteration computations for TUCKER-ALS and Algorithms 4–7 when each of the variables $I$, $R$ and $K$ is assumed to be large. All complexities are to leading order.

Using Theorem 4.2.2 in [17], we have that the maximum and minimum eigenvalues of $M$ satisfy
$$\lambda_{\max}(M) = \max_{x : \|x\| \leq 1} \big\| T^{(N+1)} U x \big\|_2^2, \qquad (37)$$
$$\lambda_{\min}(M) = \min_{x : \|x\| \leq 1} \big\| T^{(N+1)} U x \big\|_2^2. \qquad (38)$$

Due to Lemma B.1 in [10], we therefore have
$$\lambda_{\max}(M) \leq (1+\varepsilon)^2 \max_{x : \|x\| \leq 1} \|Ux\|_2^2 = (1+\varepsilon)^2 \max_{x : \|x\| \leq 1} \|x\|_2^2 = (1+\varepsilon)^2, \qquad (40)$$
$$\lambda_{\min}(M) \geq (1-\varepsilon)^2 \min_{x : \|x\| \leq 1} \|Ux\|_2^2 = (1-\varepsilon)^2 \min_{x : \|x\| \leq 1} \|x\|_2^2 = (1-\varepsilon)^2, \qquad (41)$$

with probability at least $1-\delta$, since $Ux \in V$ for all $x \in \mathbb{R}^{\prod_n R_n}$. Since $M$ is symmetric and positive semi-definite, its singular values coincide with its eigenvalues: $\sigma_{\max}(M) = \lambda_{\max}(M) \leq (1+\varepsilon)^2$ and $\sigma_{\min}(M) = \lambda_{\min}(M) \geq (1-\varepsilon)^2$. The result now follows:
$$\kappa(M) = \frac{\sigma_{\max}}{\sigma_{\min}} \leq \frac{(1+\varepsilon)^2}{(1-\varepsilon)^2}. \qquad (43)$$
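A quick numerical illustration of this bound (our own sanity check with assumed sizes, not part of the paper): for two factor matrices with orthonormal columns, TensorSketch the Kronecker product and inspect the condition number of $M = (TU)^\top (TU)$; it should be modest and approach 1 as $J_2$ grows.

```python
import numpy as np

def countsketch(A, h, s, J):
    out = np.zeros((J, A.shape[1]))
    np.add.at(out, h, s[:, None] * A)
    return out

# Factor matrices with orthonormal columns (N = 2, R = 3) and a fairly large sketch dimension.
rng = np.random.default_rng(3)
I1, I2, R, J2 = 60, 60, 3, 2000
A1, _ = np.linalg.qr(rng.standard_normal((I1, R)))
A2, _ = np.linalg.qr(rng.standard_normal((I2, R)))
h1, h2 = rng.integers(0, J2, I1), rng.integers(0, J2, I2)
s1, s2 = rng.choice([-1, 1], I1), rng.choice([-1, 1], I2)

# TensorSketch of U = A1 kron A2 via FFTs of the CountSketches (U itself is never formed).
F1 = np.fft.fft(countsketch(A1, h1, s1, J2), axis=0)
F2 = np.fft.fft(countsketch(A2, h2, s2, J2), axis=0)
TU = np.real(np.fft.ifft((F1[:, :, None] * F2[:, None, :]).reshape(J2, -1), axis=0))
M = TU.T @ TU
print(np.linalg.cond(M))   # modest condition number, decreasing toward 1 as J2 grows
```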


E Formal statement and proof of Proposition 3.3

Let
$$\mathcal{G} = \mathcal{Y} \times_1 A^{(1)\top} \times_2 A^{(2)\top} \cdots \times_N A^{(N)\top}. \qquad (44)$$
Moreover, let $A^*$ and $\mathcal{G}^*$ be the updated values on lines 5 and 8 in Algorithm 1 in the main manuscript, and let $A$ and $\mathcal{G}$ be the corresponding updates on lines 6 and 9 in Algorithm 3 in the main manuscript.

Proposition E.1 (TUCKER-TTMTS). Assume that $G_{(n)}G_{(n)}^\top$ is invertible, with $\mathcal{G}$ given in (44), and that each TensorSketch operator is redefined prior to being used. If the sketch dimension $J_1 \geq (2 + 3^{N-1})/(\varepsilon^2\delta)$, then with probability at least $1-\delta$ it holds that
$$\bigg\| \Big( \bigotimes_{\substack{i=N \\ i\neq n}}^{1} A^{(i)} \Big) G_{(n)}^\top A^\top - Y_{(n)}^\top \bigg\|_F \leq \bigg\| \Big( \bigotimes_{\substack{i=N \\ i\neq n}}^{1} A^{(i)} \Big) G_{(n)}^\top A^{*\top} - Y_{(n)}^\top \bigg\|_F \qquad (45)$$
$$\quad + \varepsilon \Big\| \big(G_{(n)}G_{(n)}^\top\big)^{-1} \Big\|_F \, \|\mathcal{G}\|^2 \, \|\mathcal{Y}\| \prod_{\substack{i=1 \\ i\neq n}}^{N} R_i. \qquad (46)$$
Moreover, if the sketch dimension $J_2 \geq (2 + 3^N)/(\varepsilon^2\delta)$, then with probability at least $1-\delta$ it holds that
$$\bigg\| \Big( \bigotimes_{i=N}^{1} A^{(i)} \Big) g_{(:)} - y_{(:)} \bigg\|_2 \leq \bigg\| \Big( \bigotimes_{i=N}^{1} A^{(i)} \Big) g^*_{(:)} - y_{(:)} \bigg\|_2 + \varepsilon \|\mathcal{Y}\| \sqrt{\prod_{i=1}^{N} R_i}. \qquad (47)$$

Proof. It is straightforward to show that $A^*$ solves
$$A^* = \arg\min_{A \in \mathbb{R}^{I_n \times R_n}} \bigg\| \Big( \bigotimes_{\substack{i=N \\ i\neq n}}^{1} A^{(i)} \Big) G_{(n)}^\top A^\top - Y_{(n)}^\top \bigg\|_F^2; \qquad (48)$$
see e.g. the discussion in Subsection 4.2 in [21]. Setting the gradient of the objective function equal to zero and rearranging, we get that
$$G_{(n)}G_{(n)}^\top A^{*\top} = G_{(n)} \Big( \bigotimes_{\substack{i=N \\ i\neq n}}^{1} A^{(i)} \Big)^\top Y_{(n)}^\top. \qquad (49)$$

By a similar argument, we find
$$G_{(n)}G_{(n)}^\top A^\top = G_{(n)} \Big( \bigotimes_{\substack{i=N \\ i\neq n}}^{1} A^{(i)} \Big)^\top T^{(n)\top} T^{(n)} Y_{(n)}^\top. \qquad (50)$$

The matrix $A^\top$ is therefore the solution to a perturbed version of the system in (49). Using Lemma B.1 in [10], we get that the perturbation satisfies
$$\bigg\| G_{(n)} \Big( \bigotimes_{\substack{i=N \\ i\neq n}}^{1} A^{(i)} \Big)^\top Y_{(n)}^\top - G_{(n)} \Big( \bigotimes_{\substack{i=N \\ i\neq n}}^{1} A^{(i)} \Big)^\top T^{(n)\top} T^{(n)} Y_{(n)}^\top \bigg\|_F \qquad (51)$$
$$\quad \leq \varepsilon \, \|\mathcal{G}\| \, \bigg\| \bigotimes_{\substack{i=N \\ i\neq n}}^{1} A^{(i)} \bigg\|_F \, \|\mathcal{Y}\| \qquad (52)$$


with probability at least $1-\delta$ if $J_1 \geq (2 + 3^{N-1})/(\varepsilon^2\delta)$. Standard sensitivity analysis in linear algebra (see e.g. Subsection 2.6 in [14]) now gives that
$$\big\| A^{*\top} - A^\top \big\|_F \leq \varepsilon \, \Big\| \big(G_{(n)}G_{(n)}^\top\big)^{-1} \Big\|_F \, \|\mathcal{G}\| \, \bigg\| \bigotimes_{\substack{i=N \\ i\neq n}}^{1} A^{(i)} \bigg\|_F \, \|\mathcal{Y}\| \qquad (53)$$

with probability at least $1-\delta$ if $J_1 \geq (2 + 3^{N-1})/(\varepsilon^2\delta)$. It follows that
$$\bigg\| \Big( \bigotimes_{\substack{i=N \\ i\neq n}}^{1} A^{(i)} \Big) G_{(n)}^\top A^\top - Y_{(n)}^\top \bigg\|_F \qquad (54)$$
$$\leq \bigg\| \Big( \bigotimes_{\substack{i=N \\ i\neq n}}^{1} A^{(i)} \Big) G_{(n)}^\top A^{*\top} - Y_{(n)}^\top \bigg\|_F + \bigg\| \bigotimes_{\substack{i=N \\ i\neq n}}^{1} A^{(i)} \bigg\|_F \, \|\mathcal{G}\| \, \big\| A^{*\top} - A^\top \big\|_F \qquad (55)$$
$$\leq \bigg\| \Big( \bigotimes_{\substack{i=N \\ i\neq n}}^{1} A^{(i)} \Big) G_{(n)}^\top A^{*\top} - Y_{(n)}^\top \bigg\|_F + \varepsilon \Big\| \big(G_{(n)}G_{(n)}^\top\big)^{-1} \Big\|_F \, \|\mathcal{G}\|^2 \, \|\mathcal{Y}\| \prod_{\substack{i=1 \\ i\neq n}}^{N} R_i \qquad (56)$$
with probability at least $1-\delta$ if $J_1 \geq (2 + 3^{N-1})/(\varepsilon^2\delta)$, where we used the fact that $\bigotimes_{\substack{i=N \\ i\neq n}}^{1} A^{(i)}$ is an orthogonal matrix of size $\big(\prod_{i\neq n} I_i\big) \times \big(\prod_{i\neq n} R_i\big)$ and therefore
$$\bigg\| \bigotimes_{\substack{i=N \\ i\neq n}}^{1} A^{(i)} \bigg\|_F^2 = \prod_{\substack{i=1 \\ i\neq n}}^{N} R_i. \qquad (58)$$

For the other claim, Lemma B.1 in [10] gives that
$$\big\| g^*_{(:)} - g_{(:)} \big\|_2 = \bigg\| \Big( \bigotimes_{i=N}^{1} A^{(i)} \Big)^\top y_{(:)} - \Big( T^{(N+1)} \bigotimes_{i=N}^{1} A^{(i)} \Big)^\top T^{(N+1)} y_{(:)} \bigg\|_2 \qquad (59)$$
$$\leq \varepsilon \, \bigg\| \bigotimes_{i=N}^{1} A^{(i)} \bigg\|_F \, \|\mathcal{Y}\| \qquad (60)$$

with probability at least $1-\delta$ if $J_2 \geq (2 + 3^N)/(\varepsilon^2\delta)$. It follows that
$$\bigg\| \Big( \bigotimes_{i=N}^{1} A^{(i)} \Big) g_{(:)} - y_{(:)} \bigg\|_2 \qquad (61)$$
$$\leq \bigg\| \Big( \bigotimes_{i=N}^{1} A^{(i)} \Big) g^*_{(:)} - y_{(:)} \bigg\|_2 + \bigg\| \bigotimes_{i=N}^{1} A^{(i)} \bigg\|_2 \, \big\| g^*_{(:)} - g_{(:)} \big\|_2 \qquad (62)$$
$$\leq \bigg\| \Big( \bigotimes_{i=N}^{1} A^{(i)} \Big) g^*_{(:)} - y_{(:)} \bigg\|_2 + \varepsilon \|\mathcal{Y}\| \sqrt{\prod_{i=1}^{N} R_i}, \qquad (63)$$
where we again use the fact that $\bigotimes_{i=N}^{1} A^{(i)}$ is a $\big(\prod_i I_i\big) \times \big(\prod_i R_i\big)$ orthogonal matrix, and its 2-norm is therefore 1 and its Frobenius norm is equal to $\sqrt{\prod_i R_i}$.
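The two norm identities used at the end of the proof can also be checked numerically; the snippet below (our own illustration, with arbitrary small sizes) verifies that a Kronecker product of matrices with orthonormal columns has spectral norm 1 and squared Frobenius norm equal to the product of the column counts.

```python
import numpy as np

# Sanity check (illustrative sizes) of the norm facts used above for U = A1 kron A2
# when A1 and A2 have orthonormal columns.
rng = np.random.default_rng(4)
A1, _ = np.linalg.qr(rng.standard_normal((20, 3)))
A2, _ = np.linalg.qr(rng.standard_normal((15, 4)))
U = np.kron(A1, A2)
print(np.isclose(np.linalg.norm(U, 2), 1.0))             # spectral norm is 1
print(np.isclose(np.linalg.norm(U, 'fro') ** 2, 3 * 4))  # squared Frobenius norm is R1 * R2
```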


F Further experiments

In this section, we present some further experiments. Figures 7 and 8 show additional examples of how the error of TUCKER-TS and TUCKER-TTMTS, relative to that of TUCKER-ALS, changes with the sketch dimension parameter K. The figures also show how the algorithms perform when each TensorSketch operator is redefined prior to being used (called “multi-pass” in the figures). We note that our approach of defining the TensorSketch operators upfront consistently leads to a lower error compared to when the sketch operators are redefined before being used each time. Figures 9 and 10 show how the algorithms scale with increased dimension size I for 4-way tensors. TUCKER-ALS, MET and MACH run out of memory when I = 1e+5. For reasons explained in the main manuscript, FSTD1 does not work well on tensors that are very sparse, which we can observe here.

[Figure 7: five panels, (a) sparse 3-way tensor with $R_{\mathrm{true}} = R$, (b) sparse 4-way tensor with $R_{\mathrm{true}} = R$, (c) sparse 4-way tensor with $R_{\mathrm{true}} > R$, (d) dense 3-way tensor with $R_{\mathrm{true}} = R$, (e) dense 3-way tensor with $R_{\mathrm{true}} > R$; each plots error relative to TUCKER-ALS against the sketch dimension parameter K (K = 5 to 20) for TUCKER-TS multi-pass, TUCKER-TS and TUCKER-ALS.]

Figure 7: Further examples of how the error of TUCKER-TS, relative to that of TUCKER-ALS, is impacted by the sketch dimension parameter K. For plot (a), the tensor size is 500 × 500 × 500 with nnz(Y) ≈ 1e+6, and with both the true and algorithm target rank equal to (10, 10, 10). For plot (b), the tensor size is 100 × 100 × 100 × 100 with nnz(Y) ≈ 1e+7, and with both the true and algorithm target rank equal to (5, 5, 5, 5). For plot (c), the tensor size is 100 × 100 × 100 × 100 with nnz(Y) ≈ 1e+7, with true rank (10, 10, 10, 10) and algorithm target rank (5, 5, 5, 5). For plot (d), the tensor is dense and of size 500 × 500 × 500, and with both the true and algorithm target rank equal to (10, 10, 10). For plot (e), the tensor is dense and of size 500 × 500 × 500, and with true rank (15, 15, 15) and algorithm target rank (10, 10, 10).


[Figure 8: five panels with the same layout as Figure 7, plotting error relative to TUCKER-ALS against the sketch dimension parameter K (K = 5 to 20) for TUCKER-TTMTS multi-pass, TUCKER-TTMTS and TUCKER-ALS.]

Figure 8: Further examples of how the error of TUCKER-TTMTS, relative to that of TUCKER-ALS, is impacted by the sketch dimension parameter K. The tensor properties for each subplot are the same as in Figure 7.

[Figure 9: two panels plotting (a) relative error and (b) run time in seconds against the dimension size I (1e+2 to 1e+5) for TUCKER-TS (proposal), TUCKER-TTMTS (proposal), TUCKER-ALS/MET, MET(1)–MET(3), FSTD1 and MACH; "out of memory" annotations mark methods that fail at the largest sizes.]

Figure 9: Comparison of relative error and run time for random sparse 4-way tensors. The number of nonzero elements is nnz(Y) ≈ 1e+6. Both the true and target ranks are (5, 5, 5, 5). We used K = 10 in these experiments.


[Figure 10: two panels plotting (a) relative error and (b) run time in seconds against the dimension size I (1e+2 to 1e+5) for TUCKER-TS (proposal), TUCKER-TTMTS (proposal), TUCKER-ALS/MET, MET(1)–MET(3), FSTD1 and MACH; "out of memory" annotations mark methods that fail at the largest sizes.]

Figure 10: Comparison of relative error and run time for random sparse 4-way tensors. The number of nonzero elements is nnz(Y) ≈ 1e+6. The true rank is (7, 7, 7, 7) and the algorithm target rank is (5, 5, 5, 5). In these experiments, we used K = 10 and a convergence tolerance of 1e-1.
