

Recovery of Future Data via Convolution Nuclear Norm Minimization

Guangcan Liu, Senior Member, IEEE, Wayne Zhang


Abstract—This paper is about recovering the unseen future data from a given sequence of historical samples, referred to as future data recovery, a significant problem closely related to time series forecasting. For the first time, we study the problem from the perspective of tensor completion. Namely, we convert future data recovery into a more inclusive problem called sequential tensor completion (STC), which is to recover a tensor of sequential structure from some entries sampled arbitrarily from the tensor. Unlike the ordinary tensor completion (OTC) problem studied in the majority of the literature, STC has a distinctive setup in that the target tensor is sequential and not permutable, which means that the target owns rich spatio-temporal structures. This enables the possibility of restoring arbitrarily selected missing entries, which is not possible under the framework of OTC. We then propose two methods to address STC, namely Discrete Fourier Transform based ℓ1 minimization (DFTℓ1) and Convolution Nuclear Norm Minimization (CNNM), where DFTℓ1 is indeed a special case of CNNM. Whenever the target is low-rank in some convolution domain, CNNM provably succeeds in solving STC. This immediately gives that DFTℓ1 also succeeds when the Fourier transform of the target is sparse, as convolution low-rankness is a generalization of Fourier sparsity. Experiments on univariate time series, images and videos show encouraging results.

Index Terms—data forecasting, sequential tensor, matrix completion, Fourier transform, convolution, low-rankness, sparsity.

1 INTRODUCTION

In many circumstances it is desirable to predict what would happen in the future; e.g., the safety of a self-driving vehicle depends on predicting the motion of other vehicles and pedestrians. However, it is impossible to observe directly the things that lie ahead in time. Given this pressing situation, it is crucial to develop computational methods that can recover the future data from historical samples, i.e., data forecasting. Since the data samples can often be organized in tensor form, we formulate the forecasting problem mathematically as follows:

Problem 1.1 (Future Data Recovery). Let p and q be two pre-specified positive integers. Suppose that $\{M_t\}_{t=1}^{p+q}$ is a sequence of p + q order-(n − 1) tensors with n ≥ 1, i.e., $M_t \in \mathbb{R}^{m_1\times\cdots\times m_{n-1}}, \forall t \ge 1$. Given $\{M_t\}_{t=1}^{p}$, can we identify the values of $\{M_t\}_{t=p+1}^{p+q}$?

• G. Liu is with B-DAT and CICAEET, School of Automation, Nanjing University of Information Science and Technology, Nanjing, China 210044. Email: [email protected].
• W. Zhang is with SenseTime Research, Harbour View 1 Podium, 12 Science Park East Ave, Hong Kong Science Park, Shatin, N.T., Hong Kong 999077. Email: [email protected].

The above problem is essentially a generalization of time series forecasting [1–3], highlighting the identifiability issue as well as the step of generalizing univariate or multivariate sequences to tensor-valued ones, in which each value $M_t$ is itself a structured tensor. In general, Problem 1.1 has an immense scope that covers a wide range of forecasting problems. For example, when n = 1, $M_t$ is a real number and thus Problem 1.1 falls back to univariate time series forecasting [4]. In the case of n = 3, $M_t$ is matrix-valued, and thereby Problem 1.1 embodies the challenging task of video prediction [5], which aims at obtaining the next q image frames of a given sequence of p frames.

A great many approaches for time series analysis have been proposed in the literature; see the surveys in [2, 3]. Classical methods, e.g., Auto-Regressive Moving Average (ARMA) [1] and its extensions, often preset specific models to characterize the evolution of one or multiple variables through time. This type of method may rely heavily on its parametric settings, and exhaustive model selection is usually required to attain satisfactory results on fresh datasets. It is now prevalent to adopt the data-driven paradigm, which suggests utilizing a huge mass of historical sequences to train expressive models such that the evolution law may emerge autonomously. Such a seemingly “untalented” idea is indeed useful, leading to dozens of promising methods for forecasting [6]. In particular, the methods based on Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM) [7] have shown superior performance on many datasets, e.g., [8–10]. While impressive, existing studies still have some limitations:

• For the current deep architectures (e.g., [8–10]) to solve Problem 1.1, the most critical hypothesis is that the desired evolution law can be learnt by using sufficient historical sequences to train an over-parameterized network. To obey this hypothesis, one essentially needs to gather a great many training sequences that are similar to $\{M_t\}_{t=1}^{p+q}$ in structure. This, however, is not always viable in reality, because finding extra data similar to $\{M_t\}_{t=1}^{p+q}$ is a challenging problem by itself. In that sense, the traditional scheme of working with only one sequence is still of considerable significance, and thus we focus on the setup of no extra data in this paper.



• In many applications such as meteorology, the given data $\{M_t\}_{t=1}^{p}$ itself may be incomplete. It is generally challenging for most forecasting methods to cope with such situations [11, 12].

• To our knowledge, there is still a lack of theories that figure out the conditions under which the desired future data is identifiable. Such theories are in fact crucially important because, without any restrictions, $\{M_t\}_{t=p+1}^{p+q}$ can be arbitrary and thus unpredictable.

To tackle the above issues, we suggest a different methodology; namely, we convert Problem 1.1 into a more general problem as follows. For a tensor-valued time series $\{M_t\}_{t=1}^{p+q}$ with $M_t \in \mathbb{R}^{m_1\times\cdots\times m_{n-1}}$, we define an order-n tensor $M \in \mathbb{R}^{m_1\times\cdots\times m_n}$ as

[M]_{i_1,\cdots,i_n} = [M_{i_n}]_{i_1,\cdots,i_{n-1}}, \quad 1 \le i_j \le m_j, \; 1 \le j \le n,

where $m_n = p + q$ and $[\cdot]_{i_1,\cdots,i_n}$ is the $(i_1,\cdots,i_n)$th entry of an order-n tensor. That is, M is formed by concatenating a sequence of order-(n − 1) tensors into an order-n one. Subsequently, we define a sampling set $\Omega \subset \{1,\cdots,m_1\}\times\cdots\times\{1,\cdots,m_n\}$ as in the following:

\Omega = \{(i_1,\cdots,i_n) : 1 \le i_j \le m_j, \forall 1 \le j \le n-1 \text{ and } 1 \le i_n \le p\}. \qquad (1)

That is, Ω is a set consisting of the locations of the observations in $\{M_t\}_{t=1}^{p}$, while the complement set of Ω, denoted as $\Omega^{\perp}$, stores the locations of the future entries in $\{M_t\}_{t=p+1}^{p+q}$. With these notations, we turn to a more inclusive problem named sequential tensor completion (STC).1

Problem 1.2 (Sequential Tensor Completion). Let $M \in \mathbb{R}^{m_1\times\cdots\times m_n}$ be an order-n data tensor formed from some tensor-valued time series, and let $L_0 \in \mathbb{R}^{m_1\times\cdots\times m_n}$ be a sequential tensor of interest that satisfies some regularity conditions and $L_0 \approx M$. Suppose that we are given a subset of the entries in M and a sampling set $\Omega \subset \{1,\cdots,m_1\}\times\cdots\times\{1,\cdots,m_n\}$ consisting of the locations of the observed entries. The configuration of Ω is quite flexible; namely, the locations of the missing entries included in $\Omega^{\perp}$ are allowed to be distributed arbitrarily. Can we restore the target tensor $L_0$? If so, under which conditions?

In this way, Problem 1.1 is incorporated into the scope of tensor completion, which is to fill in the missing entries of a partially observed tensor. Such a scenario has an immediate advantage; that is, it becomes straightforward to handle the difficult cases where the given data $\{M_t\}_{t=1}^{p}$ itself is incomplete. Moreover, we intentionally allow a deviation between M and $L_0$ in consideration of practical situations: very often the data tensor M may not strictly obey the regularity conditions imposed on the target $L_0$.

1. To understand this paper, it is very important to see the difference between sequential and ordinary tensors. In general, ordinary tensors are permutable and can be equivalently transformed by reordering their slices, while sequential tensors contain rich spatio-temporal structures and are therefore not permutable. For example, the ratings from some users to a set of movies form an ordinary matrix, whose properties are invariant to the permutation of its columns and rows. In contrast, the matrix that represents a natural image is sequential, because the position of each pixel has a certain meaning that is not invariant to permutation. Note that we have no intention to claim that sequential tensors are superior to ordinary ones; actually, they have very different scopes.

While appealing, Problem 1.2 cannot be solved by ordinary tensor completion (OTC) methods, e.g., [13–20], which treat $L_0$ as an ordinary, permutable tensor. This is because permutability can be in conflict with identifiability: whenever the target $L_0$ is permutable and meanwhile some of its slices are wholly missing, it is unrealistic to identify $L_0$. In other words, all the models that are invariant to permutation have no application to STC (see Footnote 2). Specifically, OTC methods may fill in the missing entries with zeros when the sampling set Ω is configured as in (1). Yet, the STC problem can be addressed by a framework similar to OTC:

\min_L \; \phi(L) + \frac{\lambda}{2}\sum_{(i_1,\cdots,i_n)\in\Omega}\big([L]_{i_1,\cdots,i_n} - [M]_{i_1,\cdots,i_n}\big)^2, \qquad (2)

where φ(·) is some regularizer and λ > 0 is a parameter. As pointed out by [21], the key to solving the forecasting problem is to devise a proper regularizer that can capture the spatio-temporal structures existing in data. To handle the STC problem, the regularizer φ(·) should, at least, be sensitive to permutation.2 Based on this, it is easy to see that the recent regularizer in [22] is inapplicable either.

In general, the spatio-temporal structures underlying data are varied, and it is hard to handle all of them with a single model. As an example, we would like to consider natural images, videos, or similar data. For such data, it is well known that the Fourier coefficients of the data tensor are mostly close to zero [23–26], which motivates us to revisit a classical method termed Discrete Fourier Transform (DFT) based ℓ1 minimization (DFTℓ1). Given a collection of observations, DFTℓ1 seeks a tensor that not only possesses the sparsest Fourier representation but also minimizes a squared loss on the observations:

\min_L \; \|\mathcal{F}(L)\|_1 + \frac{\lambda}{2}\sum_{(i_1,\cdots,i_n)\in\Omega}\big([L]_{i_1,\cdots,i_n} - [M]_{i_1,\cdots,i_n}\big)^2, \qquad (3)

where F(·) is the DFT operator, ‖·‖1 denotes the ℓ1 norm of a tensor seen as a long vector, and λ > 0 is a parameter. It is provable that DFTℓ1 strictly succeeds in solving STC, as long as $L_0$ has a sparse representation in the Fourier domain and $M = L_0$. When there is a discrepancy between M and $L_0$, DFTℓ1 is still guaranteed to produce a near recovery of $L_0$. Remarkably, our analysis explains why DFTℓ1 can solve STC: no matter how the missing entries are placed, DFTℓ1 converts STC to an OTC problem that always has a well-posed sampling pattern (see Section 4.1 and Section 4.2). In addition, the optimization problem in (3) is convex and can be solved by any of the many first-order methods, e.g., the Alternating Direction Method of Multipliers (ADMM) [27, 28]. To solve problem (3) for an m-dimensional tensor with $m = \Pi_{j=1}^{n}m_j$, ADMM needs only $O(m\log m)$ time.

2. Consider the setup of (1). Suppose that $L_1$ is an optimal solution to (2). Then we could obtain a different $L_2$ by interchanging two slices of $L_1$ whose locations are within $p+1 \le i_n \le p+q$. If φ(·) is invariant to permutation, then $L_2$ is also optimal to (2), which means that the model cannot identify $L_0$.

While theoretically effective and computationally efficient, DFTℓ1 suffers from the drawback that every direction of the tensor is treated equally. This may lead to undesired artifacts when applying the method to heterogeneous data, e.g., images and videos. To achieve better recovery performance, we further propose a novel regularizer called the convolution nuclear norm, which is the sum of the convolution eigenvalues [29], which in turn are a generalization of Fourier frequencies. The derived method, Convolution Nuclear Norm Minimization (CNNM), performs tensor completion via the following convex program:

\min_L \; \|L\|_{\mathrm{cnn}} + \frac{\lambda}{2}\sum_{(i_1,\cdots,i_n)\in\Omega}\big([L]_{i_1,\cdots,i_n} - [M]_{i_1,\cdots,i_n}\big)^2, \qquad (4)

where ‖·‖cnn is the convolution nuclear norm of a tensor and λ > 0 is a parameter. In general, CNNM is a generalization of DFTℓ1, which is indeed equivalent to setting the kernel in CNNM to have the same size as the target $L_0$. By choosing a proper kernel size according to the structure of the data, CNNM can dramatically outperform DFTℓ1 in terms of recovery accuracy. Similar to DFTℓ1, CNNM is also guaranteed to exactly (resp. nearly) recover the target $L_0$, as long as $L_0$ is low-rank in some convolution domain and $M = L_0$ (resp. $M \approx L_0$). Indeed, the theories for CNNM are generalizations of those for DFTℓ1, and they can explain why it is beneficial to control the kernel size. The optimization problem in (4) is convex and can be solved by ADMM in $O(mk^2)$ time with $k \ll m$. Experiments on univariate time series, images and videos demonstrate the superior performance of the proposed methods. To summarize, the contributions of this paper mainly include:

• We propose a new methodology for investigating the data forecasting problem; namely, we cast the recovery of future data as a special tensor completion problem called STC. Grounded on our scheme, it is straightforward to handle the difficult cases where the historical data itself is incomplete.

• We propose a novel method termed CNNM to tackle the STC problem, and we prove that CNNM is successful whenever the target tensor is low-rank in some convolution domain. Our analysis, for the first time, provides a sampling bound for recovering the unseen future observations from historical data.

• Our analysis also implies a new result for DFTℓ1, which shows that DFTℓ1 succeeds in solving STC provided that the Fourier transform of the target is sparse. Unlike the previous compressive sensing theories, our result is independent of the Restricted Isometry Property (RIP) [30], which cannot be obeyed under the context of tensor completion [13].

The rest of this paper is organized as follows. Section 2 summarizes the mathematical notations used throughout the paper. Section 3 explains the technical insights behind the proposed methods. Section 4 mainly consists of theoretical analysis. Section 5 shows the mathematical proofs of the proposed theories. Section 6 demonstrates empirical results, and Section 7 concludes this paper.

2 SUMMARY OF MAIN NOTATIONS

Capital letters are used to represent tensors of order 1 or higher, including vectors, matrices and high-order tensors. For an order-n tensor X, $[X]_{i_1,\cdots,i_n}$ is the $(i_1,\cdots,i_n)$th entry of X. Two types of tensor norms are used frequently throughout the paper: the Frobenius norm, given by $\|X\|_F = \sqrt{\sum_{i_1,\cdots,i_n}|[X]_{i_1,\cdots,i_n}|^2}$, and the ℓ1 norm, denoted by ‖·‖1 and given by $\|X\|_1 = \sum_{i_1,\cdots,i_n}|[X]_{i_1,\cdots,i_n}|$, where |·| denotes the magnitude of a real or complex number. Another two frequently used norms are the operator norm and nuclear norm [31, 32] of order-2 tensors (i.e., matrices), denoted by ‖·‖ and ‖·‖∗, respectively.

Calligraphic letters, such as $\mathcal{F}$, $\mathcal{P}$ and $\mathcal{A}$, are used to denote linear operators. In particular, $\mathcal{I}$ denotes the identity operator and I is the identity matrix. For a linear operator $\mathcal{L} : \mathcal{H}_1 \to \mathcal{H}_2$ between Hilbert spaces, its Hermitian adjoint (or conjugate) is denoted as $\mathcal{L}^*$ and given by

\langle\mathcal{L}(X), Y\rangle_{\mathcal{H}_2} = \langle X, \mathcal{L}^*(Y)\rangle_{\mathcal{H}_1}, \quad \forall X \in \mathcal{H}_1, Y \in \mathcal{H}_2, \qquad (5)

where $\langle\cdot,\cdot\rangle_{\mathcal{H}_i}$ is the inner product in the Hilbert space $\mathcal{H}_i$ (i = 1 or 2). The subscript is omitted whenever $\mathcal{H}_i$ refers to a Euclidean space.

The symbol Ω is reserved to denote the sampling set consisting of the locations of observed entries. For $\Omega \subset \{1,\cdots,m_1\}\times\cdots\times\{1,\cdots,m_n\}$, its mask tensor is denoted by $\Theta_\Omega$ and given by

[\Theta_\Omega]_{i_1,\cdots,i_n} = \begin{cases} 1, & \text{if } (i_1,\cdots,i_n) \in \Omega, \\ 0, & \text{otherwise.} \end{cases}

Denote by $\mathcal{P}_\Omega$ the orthogonal projection onto Ω. Then we have the following:

\mathcal{P}_\Omega(X) = \Theta_\Omega \odot X \quad \text{and} \quad \Omega = \mathrm{supp}(\Theta_\Omega), \qquad (6)

where ⊙ denotes the Hadamard product and supp(·) is the support set of a tensor.

In most cases, we work with real-valued matrices (i.e., order-2 tensors). For a matrix X, $[X]_{i,:}$ is its ith row, and $[X]_{:,j}$ is its jth column. Let $\omega = \{j_1,\cdots,j_l\}$ be a 1D sampling set. Then $[X]_{\omega,:}$ denotes the submatrix of X obtained by selecting the rows with indices $j_1,\cdots,j_l$, and similarly for $[X]_{:,\omega}$. For a 2D sampling set $\Omega \subset \{1,\cdots,m_1\}\times\{1,\cdots,m_2\}$, we imagine it as a sparse matrix and define its “rows”, “columns” and “transpose” in a similar way as for matrices. Namely, the ith row of Ω is denoted by $\Omega^i$ and given by $\Omega^i = \{i_2 : (i_1,i_2)\in\Omega, i_1=i\}$, the jth column is defined as $\Omega_j = \{i_1 : (i_1,i_2)\in\Omega, i_2=j\}$, and the transpose is given by $\Omega^T = \{(i_2,i_1) : (i_1,i_2)\in\Omega\}$.

Letters U and V are reserved for the left and right singular vectors of a real-valued matrix, respectively. The orthogonal projection onto the column space U is denoted by $\mathcal{P}_U$ and given by $\mathcal{P}_U(X) = UU^TX$, and similarly for the row space, $\mathcal{P}_V(X) = XVV^T$. The same notation is also used to represent a subspace of matrices; e.g., we say that $X \in \mathcal{P}_U$ for any matrix X obeying $\mathcal{P}_U(X) = X$.

3 DFTℓ1 AND CNNM

This section introduces the technical details of DFTℓ1 and CNNM, as well as some preliminary knowledge helpful for understanding the proposed techniques.

3.1 Multi-Directional DFT

Multi-directional DFT, also known as multi-dimensional DFT, is a very important concept in signal processing and compressive sensing. Its definition is widely available on public websites. Here, we present a definition that would be easy for engineers to understand. First consider the case of n = 1, i.e., the DFT of a vector $X \in \mathbb{R}^m$. In this particular case, F(X) can be simply expressed as

\mathcal{F}(X) = U_1 X,

with $U_1 \in \mathbb{C}^{m\times m}$ being a complex-valued, symmetric matrix that satisfies $U_1^H U_1 = U_1 U_1^H = mI$, where $(\cdot)^H$ is the conjugate transpose of a complex-valued matrix. We call $U_1$ the 1D Fourier transform matrix, whose entries are determined once the dimension m is given. Actually, one can use the Matlab function “dftmtx” to generate this transform matrix.

Similarly, when n = 2, the DFT of a matrix $X \in \mathbb{R}^{m_1\times m_2}$ is given by

\mathcal{F}(X) = U_1 X U_2,

where $U_1 \in \mathbb{C}^{m_1\times m_1}$ and $U_2 \in \mathbb{C}^{m_2\times m_2}$ are the 1D Fourier transform matrices for the columns and rows of X, respectively. For a general order-n tensor $X \in \mathbb{R}^{m_1\times\cdots\times m_n}$, its DFT is given by

\mathcal{F}(X) = X \times_1 U_1 \cdots \times_n U_n,

where $U_i \in \mathbb{C}^{m_i\times m_i}$ is the 1D Fourier transform matrix for the ith direction, and $\times_j$ (1 ≤ j ≤ n) is the mode-j product [33] between a tensor and a matrix.
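As a minimal numerical sketch (ours, not part of the paper; it assumes NumPy, and the helper names `mode_product` and `dft_via_mode_products` are hypothetical), the mode-product definition above can be checked against the standard multi-dimensional FFT:

```python
import numpy as np

def mode_product(X, U, axis):
    # Mode-j product: multiply tensor X by matrix U along the given axis.
    return np.moveaxis(np.tensordot(U, X, axes=([1], [axis])), 0, axis)

def dft_via_mode_products(X):
    # Multi-directional DFT as successive mode products with the 1D
    # Fourier transform matrices U_j (the analogue of Matlab's dftmtx).
    out = X.astype(complex)
    for axis, m in enumerate(X.shape):
        U = np.fft.fft(np.eye(m))   # 1D Fourier transform matrix, U^H U = m I
        out = mode_product(out, U, axis)
    return out

X = np.random.randn(4, 5, 6)
assert np.allclose(dft_via_mode_products(X), np.fft.fftn(X))
```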

3.2 Convolution Matrix

The concept of (discrete) convolution, the most fundamental concept in signal processing, plays an important role in this paper. Its definition, though mostly unique, has multiple variants, depending on which boundary condition is used. What we consider in this paper is the circular convolution, i.e., the convolution equipped with the circulant boundary condition [34]. Let us begin with the simple case of n = 1, i.e., the circular convolution procedure of converting $X \in \mathbb{R}^m$ and $K \in \mathbb{R}^k$ (k ≤ m) into $X \star K \in \mathbb{R}^m$:

[X \star K]_i = \sum_{s=1}^{k} [X]_{i-s}[K]_s, \quad i = 1,\cdots,m,

where ⋆ denotes the convolution operator, and it is assumed that $[X]_{i-s} = [X]_{i-s+m}$ for $i \le s$; this is the so-called circulant boundary condition. Note that we assume k ≤ m throughout this paper, and we call the smaller tensor K the kernel. In general, the convolution operator is linear and can be converted into matrix multiplication:

X \star K = \mathcal{A}_k(X)K, \quad \forall X, K,

where $\mathcal{A}_k(X)$ is the convolution matrix [29] of a tensor, and the subscript k is included to remind the reader that the convolution matrix is always associated with a certain kernel size k. In the light of circular convolution, the convolution matrix of a vector $X = [x_1,\cdots,x_m]^T$ is a truncated circulant matrix of size m × k:

\mathcal{A}_k(X) = \begin{bmatrix} x_1 & x_m & \cdots & x_{m-k+2} \\ x_2 & x_1 & \cdots & x_{m-k+3} \\ \vdots & \vdots & \ddots & \vdots \\ x_m & x_{m-1} & \cdots & x_{m-k+1} \end{bmatrix}.

In other words, the jth column of $\mathcal{A}_k(X)$ is exactly $\mathcal{S}^{j-1}(X)$, with $\mathcal{S}$ being the circular shift operator:

\mathcal{S}(X) = [x_m, x_1, x_2, \cdots, x_{m-1}]^T. \qquad (7)

This shift operator is implemented by the Matlab function “circshift”. In the special case of k = m, the convolution matrix $\mathcal{A}_m(X)$ is exactly an m × m circulant matrix.

Now we turn to the general case of order-n tensors, with n ≥ 1. Let $X \in \mathbb{R}^{m_1\times\cdots\times m_n}$ and $K \in \mathbb{R}^{k_1\times\cdots\times k_n}$ be two real-valued order-n tensors. Again, it is assumed that $k_j \le m_j, \forall 1 \le j \le n$, and K is called the kernel. Then the procedure of circularly convolving X and K into $X \star K \in \mathbb{R}^{m_1\times\cdots\times m_n}$ is given by

[X \star K]_{i_1,\cdots,i_n} = \sum_{s_1,\cdots,s_n} [X]_{i_1-s_1,\cdots,i_n-s_n}[K]_{s_1,\cdots,s_n}.

The above convolution procedure can also be converted into matrix multiplication. Let vec(·) be the vectorization of a tensor; then we have

\mathrm{vec}(X \star K) = \mathcal{A}_k(X)\mathrm{vec}(K), \quad \forall X, K, \qquad (8)

where $\mathcal{A}_k(X)$ is an m × k matrix, with $m = \Pi_{j=1}^{n}m_j$ and $k = \Pi_{j=1}^{n}k_j$. To compute the convolution matrix of an order-n tensor, one just needs to replace the one-directional circular shift operator given in (7) with a multi-directional one (see Section 5.1), so as to stay in step with the structure of high-order tensors.
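The following is a small illustrative sketch (ours, not the authors' code; it assumes NumPy and covers the 1D case only) of building the convolution matrix $\mathcal{A}_k(X)$ from circular shifts and checking the identity (8) against FFT-based circular convolution:

```python
import numpy as np

def conv_matrix(X, k):
    # A_k(X): m x k matrix whose j-th column is X circularly shifted by j
    # positions (the truncated circulant matrix of the text).
    return np.column_stack([np.roll(X, j) for j in range(k)])

m, k = 32, 5
X, K = np.random.randn(m), np.random.randn(k)

# vec(X * K) = A_k(X) vec(K), compared with circular convolution
# computed in the Fourier domain (K zero-padded to length m).
lhs = conv_matrix(X, k) @ K
rhs = np.real(np.fft.ifft(np.fft.fft(X) * np.fft.fft(np.r_[K, np.zeros(m - k)])))
assert np.allclose(lhs, rhs)
```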

3.3 Connections Between DFT and Convolution

Whenever the kernel K has the same size as the tensor X, i.e., $k_j = m_j, \forall 1 \le j \le n$, the produced convolution matrix, $\mathcal{A}_m(X)$, is diagonalized by the DFT. The cases of n = 1 and n = 2 are well known and have been widely used in the literature, e.g., [35, 36]. In effect, the conclusion holds for any n ≥ 1, as pointed out by [37]. To be more precise, let the DFT of X be $\mathcal{F}(X) = X \times_1 U_1 \cdots \times_n U_n$, and denote $U = U_1 \otimes \cdots \otimes U_n$ with ⊗ being the Kronecker product. Then $U\mathcal{A}_m(X)U^H$ is a diagonal matrix, namely $U\mathcal{A}_m(X)U^H = m\Sigma$ with $\Sigma = \mathrm{diag}(\sigma_1,\cdots,\sigma_m)$. Since the first column of U is a vector of all ones, it is easy to see that

\mathrm{vec}(\mathcal{F}(X)) = U\mathrm{vec}(X) = U[\mathcal{A}_m(X)]_{:,1} = [U\mathcal{A}_m(X)]_{:,1} = [\Sigma U]_{:,1} = [\sigma_1,\cdots,\sigma_m]^T.

That is, the eigenvalues of the convolution matrix $\mathcal{A}_m(X)$ are exactly the Fourier frequencies given by F(X). Hence, for any $X \in \mathbb{R}^{m_1\times\cdots\times m_n}$, we have

\|\mathcal{F}(X)\|_0 = \mathrm{rank}(\mathcal{A}_m(X)), \quad \|\mathcal{F}(X)\|_1 = \|\mathcal{A}_m(X)\|_*,

where ‖·‖0 is the ℓ0 (pseudo) norm, i.e., the number of nonzero entries in a tensor. As a consequence, the DFTℓ1 program (3) is equivalent to the following real-valued convex optimization problem:

\min_L \; \|\mathcal{A}_m(L)\|_* + \frac{\lambda}{2}\|\mathcal{P}_\Omega(L - M)\|_F^2, \qquad (9)

where λ > 0 is a parameter. Although real-valued and convex, the above problem is hard to solve efficiently. Thus the formulation (9) is used only for the purpose of theoretical analysis. It is worth noting that, in the general case of $k_j < m_j$, the convolution matrix $\mathcal{A}_k(X)$ is a tall matrix instead of a square one. Such convolution matrices, however, cannot be diagonalized by the DFT.
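To make the diagonalization concrete, here is a short numerical check (ours; NumPy, 1D case) that the singular values of the full convolution matrix $\mathcal{A}_m(X)$ coincide with the Fourier magnitudes, so that $\|\mathcal{F}(X)\|_1 = \|\mathcal{A}_m(X)\|_*$ and $\|\mathcal{F}(X)\|_0 = \mathrm{rank}(\mathcal{A}_m(X))$:

```python
import numpy as np

def conv_matrix(X, k):
    # Truncated circulant matrix: column j is X circularly shifted by j.
    return np.column_stack([np.roll(X, j) for j in range(k)])

X = np.random.randn(64)
A_full = conv_matrix(X, X.size)              # k = m: square circulant matrix
sv = np.linalg.svd(A_full, compute_uv=False)
# Singular values of A_m(X) equal the magnitudes of the Fourier frequencies.
assert np.allclose(np.sort(sv), np.sort(np.abs(np.fft.fft(X))))
```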


Fig. 1. Visualizing the convolution eigenvectors $\kappa_i(X)$ and filtered signals $X \star \kappa_i(X)$ for i = 1, ..., 4, using a 200 × 200 Lena image as the experimental data and setting $k_1 = k_2 = 13$. For the sake of visual effects, the convolution eigenvector $\kappa_i(X)$ is processed such that its values are in the range of 0 to 1, and the filtered signal $X \star \kappa_i(X)$ is normalized to have a maximum of 1.

3.4 Convolution Eigenvalues

The concept of convolution eigenvalues was first proposed and investigated by [29], under the context of image deblurring. Although made specific to order-2 tensors (i.e., matrices), the definitions given in [29] can be easily generalized to order-n tensors with any n ≥ 1.

Definition 3.1 (Convolution Eigenvalues and Eigenvectors [29]). For a tensor $X \in \mathbb{R}^{m_1\times\cdots\times m_n}$ associated with a certain kernel size $k_1\times\cdots\times k_n$ ($k_j \le m_j, \forall j$), its first convolution eigenvalue is denoted as $\sigma_1(X)$ and given by

\sigma_1(X) = \max_{K \in \mathbb{R}^{k_1\times\cdots\times k_n}} \|X \star K\|_F, \quad \text{s.t. } \|K\|_F = 1.

The maximizer of the above problem is called the first convolution eigenvector, denoted as $\kappa_1(X) \in \mathbb{R}^{k_1\times\cdots\times k_n}$.

Similarly, the ith ($i = 2,\ldots,k$, with $k = \Pi_{j=1}^{n}k_j$) convolution eigenvalue, denoted as $\sigma_i(X)$, is defined as

\sigma_i(X) = \max_{K \in \mathbb{R}^{k_1\times\cdots\times k_n}} \|X \star K\|_F, \quad \text{s.t. } \|K\|_F = 1, \; \langle K, \kappa_j(X)\rangle = 0, \forall j < i.

The maximizer of the above problem is the ith convolution eigenvector, denoted as $\kappa_i(X) \in \mathbb{R}^{k_1\times\cdots\times k_n}$.

Figure 1 shows the first four convolution eigenvectors of a natural image. It can be seen that the convolution eigenvectors are essentially a series of ordered filters, in which the later ones have higher cut-off frequencies. Note that, unlike the traditional singular values of tensors, the convolution eigenvalues of a fixed tensor are not fixed and may vary with the kernel size.

Due to the relationship given in (8), the convolution eigenvalues are indeed nothing more than the singular values of the convolution matrix, and thus the so-called convolution nuclear norm is exactly the nuclear norm of the convolution matrix:

\|X\|_{\mathrm{cnn}} = \|\mathcal{A}_k(X)\|_*, \quad \forall X \in \mathbb{R}^{m_1\times\cdots\times m_n}.

Since $\mathcal{A}_k$ is a linear operator, ‖X‖cnn is a convex function of X. As a result, the CNNM program (4) is indeed equivalent to the following real-valued convex program:

\min_L \; \|\mathcal{A}_k(L)\|_* + \frac{\lambda k}{2}\|\mathcal{P}_\Omega(L - M)\|_F^2, \qquad (10)

where we amplify the parameter λ by a factor of $k = \Pi_{j=1}^{n}k_j$ for the purpose of normalizing the two objectives to a similar scale. In the extreme case of $k_j = m_j, \forall j$, CNNM falls back to DFTℓ1. As aforementioned, it is often desirable to control the kernel size such that k ≪ m. This is actually the primary cause of CNNM's superiority over DFTℓ1.
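Accordingly, the convolution eigenvalues and the convolution nuclear norm can be computed directly from an SVD of the convolution matrix; a small 1D sketch (ours, assuming NumPy) follows:

```python
import numpy as np

def conv_matrix(X, k):
    return np.column_stack([np.roll(X, j) for j in range(k)])

def conv_nuclear_norm(X, k):
    # ||X||_cnn = ||A_k(X)||_*: the convolution eigenvalues are the
    # singular values of the m x k convolution matrix.
    sigma = np.linalg.svd(conv_matrix(X, k), compute_uv=False)
    return sigma.sum()

X = np.sin(2 * np.pi * np.arange(100) / 100)   # smooth, convolution low-rank signal
print(conv_nuclear_norm(X, 10))
```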

3.5 Optimization Algorithms

Algorithm for CNNM: The problem in (10) is convex and can be solved by the standard ADMM algorithm. We first convert (10) to the following equivalent problem:

\min_{L,Z} \; \|Z\|_* + \frac{\lambda k}{2}\|\mathcal{P}_\Omega(L - M)\|_F^2, \quad \text{s.t. } \mathcal{A}_k(L) = Z.

Then the ADMM algorithm minimizes the augmented Lagrangian function,

\|Z\|_* + \frac{\lambda k}{2}\|\mathcal{P}_\Omega(L - M)\|_F^2 + \langle\mathcal{A}_k(L) - Z, Y\rangle + \frac{\tau}{2}\|\mathcal{A}_k(L) - Z\|_F^2,

with respect to L and Z, respectively, by fixing the other variables and then updating the Lagrange multiplier Y and the penalty parameter τ. Namely, while fixing the other variables, the variable Z is updated by

Z = \arg\min_Z \; \frac{1}{\tau}\|Z\|_* + \frac{1}{2}\Big\|Z - \Big(\mathcal{A}_k(L) + \frac{Y}{\tau}\Big)\Big\|_F^2,

which is solved via Singular Value Thresholding (SVT) [38]. While fixing the others, the variable L is updated via

L = (\lambda\mathcal{P}_\Omega + \tau\mathcal{I})^{-1}\Big(\frac{\mathcal{A}_k^*(\tau Z - Y)}{k} + \lambda\mathcal{P}_\Omega(M)\Big),

where the inverse operator is indeed simply entry-wise tensor division. The multiplier Y and penalty parameter τ are updated via $Y = Y + \tau(\mathcal{A}_k(L) - Z)$ and $\tau = 1.05\tau$, respectively. The convergence of ADMM with two or fewer blocks is well understood, and researchers have even developed advanced techniques to improve its convergence speed; see [27, 39]. While solving our CNNM problem, the computation of each ADMM iteration is dominated by the SVT step, which has a complexity of $O(mk^2)$. Usually, the algorithm needs about 200 iterations to converge.
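For concreteness, the following is a minimal 1D sketch of this ADMM scheme (our own illustration rather than the authors' released code; it assumes NumPy, and `conv_matrix`/`conv_adjoint` implement $\mathcal{A}_k$ and $\mathcal{A}_k^*$ for vectors only):

```python
import numpy as np

def conv_matrix(X, k):
    # A_k(X): column j is X circularly shifted by j positions.
    return np.column_stack([np.roll(X, j) for j in range(k)])

def conv_adjoint(Z):
    # A_k^*(Z): undo each column's shift and sum (so A_k^* A_k = k * Id).
    return sum(np.roll(Z[:, j], -j) for j in range(Z.shape[1]))

def svt(G, thresh):
    # Singular value thresholding: proximal operator of the nuclear norm.
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    return (U * np.maximum(s - thresh, 0.0)) @ Vt

def cnnm_admm(M, mask, k, lam=1000.0, n_iter=200):
    # ADMM for problem (10): min ||A_k(L)||_* + (lam*k/2)||P_Omega(L - M)||_F^2,
    # where `mask` is the 0/1 indicator Theta_Omega of observed entries.
    L = mask * M
    Z = np.zeros((M.size, k)); Y = np.zeros_like(Z); tau = 1.0
    for _ in range(n_iter):
        Z = svt(conv_matrix(L, k) + Y / tau, 1.0 / tau)          # Z-update (SVT)
        rhs = conv_adjoint(tau * Z - Y) / k + lam * mask * M     # L-update
        L = rhs / (lam * mask + tau)                             # entry-wise division
        Y = Y + tau * (conv_matrix(L, k) - Z)                    # multiplier update
        tau *= 1.05
    return L
```

Forecasting the last q entries of a sequence then amounts to calling `cnnm_admm` with the mask set to zero on those positions.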

Algorithm for DFTℓ1: The problem in (3) is solved in a similar way to CNNM. The main difference happens in updating the variable Z, which requires solving the following convex problem:

Z = \arg\min_Z \; \frac{1}{\tau}\|Z\|_1 + \frac{1}{2}\Big\|Z - \Big(\mathcal{F}(L) + \frac{Y}{\tau}\Big)\Big\|_F^2.

Note here that the variable Z is complex-valued, and thus one needs to invoke Lemma 4.1 of [40] to obtain a closed-form solution; namely,

Z = h_{1/\tau}\Big(\mathcal{F}(L) + \frac{Y}{\tau}\Big),


where $h_\alpha(\cdot)$, a mapping parameterized by α > 0, is an entry-wise shrinkage operator given by

h_\alpha(z) = \begin{cases} \frac{|z|-\alpha}{|z|}z, & \text{if } |z| > \alpha, \\ 0, & \text{otherwise,} \end{cases} \quad \forall z \in \mathbb{C}. \qquad (11)

While fixing the others, the variable L is updated via

L = (\lambda\mathcal{P}_\Omega + \tau\mathcal{F}^*\mathcal{F})^{-1}(\mathcal{F}^*(\tau Z - Y) + \lambda\mathcal{P}_\Omega(M)),

where $\mathcal{F}^*$ denotes the Hermitian adjoint of the DFT and is indeed given by $m\mathcal{F}^{-1}$. Unlike CNNM, which needs to perform Singular Value Decomposition (SVD) on tall matrices, DFTℓ1 can use the Fast Fourier Transform (FFT) [41] algorithm to compute the DFT, and thus has a computational complexity of only $O(m\log m)$ with $m = \Pi_{j=1}^{n}m_j$.
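A matching sketch of the DFTℓ1 solver (again our own illustration, assuming NumPy; `shrink` implements the operator $h_\alpha$ of (11), and fftn/ifftn play the roles of $\mathcal{F}$ and $\mathcal{F}^{-1}$):

```python
import numpy as np

def shrink(Z, alpha):
    # Entry-wise complex shrinkage h_alpha of Eq. (11).
    mag = np.abs(Z)
    return np.where(mag > alpha, (mag - alpha) / np.maximum(mag, 1e-12), 0.0) * Z

def dft_l1_admm(M, mask, lam=1000.0, n_iter=200):
    # ADMM for problem (3); fftn/ifftn handle tensors of any order.
    m = M.size
    L = mask * M
    Z = np.zeros(M.shape, dtype=complex)
    Y = np.zeros(M.shape, dtype=complex)
    tau = 1.0
    for _ in range(n_iter):
        Z = shrink(np.fft.fftn(L) + Y / tau, 1.0 / tau)       # Z-update
        # L-update: F^* = m F^{-1}, hence F^* F = m * Identity.
        rhs = m * np.fft.ifftn(tau * Z - Y) + lam * mask * M
        L = np.real(rhs) / (lam * mask + tau * m)
        Y = Y + tau * (np.fft.fftn(L) - Z)                    # multiplier update
        tau *= 1.05
    return L
```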

4 ANALYSIS

In this section, we provide theoretical analyses to validate the recovery ability of DFTℓ1 and CNNM, uncovering the mystery of why our methods can deal with arbitrarily selected missing entries.

4.1 Effects of Convolution

According to the conclusions made by most existing studies about tensor completion, it seems “unrealistic” to restore the target $L_0$ when some of its slices are wholly missing. Consider the case of n = 2, i.e., matrix completion. It is certain that losing entirely some columns or rows of $L_0$ will violate the isomeric condition [42, 43], which is however necessary for the identifiability of $L_0$. Nevertheless, all these assertions are made under the context that $L_0$ is permutable, which is not the case with STC. As one may have noticed, convolution is indeed a key part of DFTℓ1 and CNNM, playing a central role in solving the STC problem. In short, the effects of convolution are two-fold: 1) converting a possibly ill-posed sampling pattern into a well-posed one, and 2) reforming the representation of $L_0$ such that low-rankness may emerge.

Convolution Sampling Set: In order to deal with a sampling set Ω with an arbitrary, possibly ill-posed pattern, we introduce a concept called the convolution sampling set:

Definition 4.1 (Convolution Sampling Set). For $\Omega \subset \{1,\cdots,m_1\}\times\cdots\times\{1,\cdots,m_n\}$ with $m = \Pi_{j=1}^{n}m_j$, its convolution sampling set associated with kernel size $k_1\times\cdots\times k_n$ is denoted by $\Omega_\mathcal{A}$ and given by

\Theta_{\Omega_\mathcal{A}} = \mathcal{A}_k(\Theta_\Omega) \quad \text{and} \quad \Omega_\mathcal{A} = \mathrm{supp}(\Theta_{\Omega_\mathcal{A}}), \qquad (12)

where $\Theta_\Omega \in \mathbb{R}^{m_1\times\cdots\times m_n}$ and $\Theta_{\Omega_\mathcal{A}} \in \mathbb{R}^{m\times k}$ are the mask tensor and mask matrix of Ω and $\Omega_\mathcal{A}$, respectively. Note here that the subscript k is omitted from $\Omega_\mathcal{A}$ for the sake of simplicity.

In general, $\Omega_\mathcal{A}$ is a convolution counterpart of Ω, and the corresponding orthogonal projection onto $\Omega_\mathcal{A}$ is given by $\mathcal{P}_{\Omega_\mathcal{A}}(Y) = \Theta_{\Omega_\mathcal{A}} \odot Y, \forall Y \in \mathbb{R}^{m\times k}$. No matter how the missing entries are selected, the convolution sampling set $\Omega_\mathcal{A}$ always exhibits a well-posed pattern. Namely, each column of the mask matrix $\Theta_{\Omega_\mathcal{A}}$ has exactly $\rho_0 m$ ones and $(1-\rho_0)m$ zeros, and each row of $\Theta_{\Omega_\mathcal{A}}$ has at most $(1-\rho_0)m$ zeros.

Fig. 2. Illustrating the effects of convolution. Left: the mask matrix $\Theta_\Omega$ of a 2D sampling set Ω with $m_1 = m_2 = 10$, where the last 5 columns are wholly missing. Right: the mask matrix $\Theta_{\Omega_\mathcal{A}}$ of the convolution sampling set $\Omega_\mathcal{A}$ with $k_1 = k_2 = 10$.

Whenever $k_j = m_j, \forall j$, each row of $\Theta_{\Omega_\mathcal{A}}$ also has exactly $\rho_0 m$ ones and $(1-\rho_0)m$ zeros, as shown in Figure 2.
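The well-posedness of $\Omega_\mathcal{A}$ is easy to verify numerically; a 1D sketch (ours, NumPy) with the forecasting pattern of (1):

```python
import numpy as np

def conv_matrix(X, k):
    return np.column_stack([np.roll(X, j) for j in range(k)])

m, p = 10, 6
theta = np.zeros(m); theta[:p] = 1          # Theta_Omega: only the first p entries observed
theta_A = conv_matrix(theta, m)             # Theta_{Omega_A} with k = m
# Every column of the mask matrix contains exactly rho_0 * m = p ones,
# no matter where the missing entries are located.
assert np.all(theta_A.sum(axis=0) == p)
```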

The following lemma shows some algebraic properties of $\Omega_\mathcal{A}$, which play an important role in the proofs.

Lemma 4.1. Let $\Omega \subset \{1,\cdots,m_1\}\times\cdots\times\{1,\cdots,m_n\}$. Suppose that the kernel size used to define $\mathcal{A}_k$ is given by $k_1\times\cdots\times k_n$ with $k_j \le m_j, \forall 1 \le j \le n$. Denote $m = \Pi_{j=1}^{n}m_j$ and $k = \Pi_{j=1}^{n}k_j$. Let $\Omega_\mathcal{A} \subset \{1,\cdots,m\}\times\{1,\cdots,k\}$ be the 2D convolution sampling set defined as in (12). For any $Y \in \mathbb{R}^{m\times k}$ and $X \in \mathbb{R}^{m_1\times\cdots\times m_n}$, we have the following:

\mathcal{A}_k^*\mathcal{A}_k(X) = kX,
\mathcal{A}_k\mathcal{P}_\Omega(X) = \mathcal{P}_{\Omega_\mathcal{A}}\mathcal{A}_k(X),
\mathcal{A}_k^*\mathcal{P}_{\Omega_\mathcal{A}}(Y) = \mathcal{P}_\Omega\mathcal{A}_k^*(Y),

where $\mathcal{A}_k^*$ is the Hermitian adjoint of $\mathcal{A}_k$.
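These three identities can be checked numerically in the 1D case (an illustrative sketch of ours, assuming NumPy; the Hadamard product ⊙ is implemented as elementwise multiplication):

```python
import numpy as np

def conv_matrix(X, k):
    return np.column_stack([np.roll(X, j) for j in range(k)])

def conv_adjoint(Z):
    return sum(np.roll(Z[:, j], -j) for j in range(Z.shape[1]))

m, k = 12, 4
X = np.random.randn(m)
Y = np.random.randn(m, k)
mask = (np.random.rand(m) > 0.5).astype(float)   # Theta_Omega
mask_A = conv_matrix(mask, k)                    # Theta_{Omega_A}

assert np.allclose(conv_adjoint(conv_matrix(X, k)), k * X)                    # A_k^* A_k = k Id
assert np.allclose(conv_matrix(mask * X, k), mask_A * conv_matrix(X, k))      # A_k P_Omega = P_{Omega_A} A_k
assert np.allclose(conv_adjoint(mask_A * Y), mask * conv_adjoint(Y))          # A_k^* P_{Omega_A} = P_Omega A_k^*
```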

Convolution Rank and Coherence: In the STC problem, the target tensor $L_0$ itself need not be Tucker low-rank. Consider the extreme case of n = 1, i.e., $L_0$ is a vector. In this case, $L_0$ is always full-rank in the light of Tucker decomposition. Interestingly, such tensors can still have a low convolution rank, which is defined as follows. For a tensor $X \in \mathbb{R}^{m_1\times\cdots\times m_n}$ with kernel size $k_1\times\cdots\times k_n$, its convolution rank is denoted by $r_k(X)$ and defined as the number of nonzero convolution eigenvalues, i.e., the rank of its convolution matrix:

r_k(X) = \mathrm{rank}(\mathcal{A}_k(X)).

It is worth noting that the convolution rank is not a generalization of the Tucker rank, as a tensor with low Tucker rank is not necessarily low-rank in the convolution domain, and vice versa. In some special examples such as symmetric textures [44], convolution low-rankness and Tucker low-rankness can coexist. But, in general cases, there is no essential connection between these two techniques, and they may have quite different purposes. In short, the Tucker rank is oriented to ordinary tensors, while the convolution rank is specific to sequential tensors.

To prove that CNNM can recover $L_0$ from a subset of its entries with an arbitrary sampling pattern, we need to assume that the convolution rank of $L_0$ is fairly low. This condition cannot be met by all kinds of sequential tensors, but there does exist a wide variety of compliant cases, as we will explain in Section 4.4. Moreover, since $\mathcal{A}_k(L_0) \in \mathbb{R}^{m\times k}$ is a truncation of $\mathcal{A}_m(L_0) \in \mathbb{R}^{m\times m}$, the sparsity degree of the Fourier representation is indeed an upper bound of the convolution rank; namely,

r_k(L_0) = \mathrm{rank}(\mathcal{A}_k(L_0)) \le \mathrm{rank}(\mathcal{A}_m(L_0)) = \|\mathcal{F}(L_0)\|_0.

As in most existing matrix completion theories, we also need the concept of coherence [13, 45]. For a tensor $X \in \mathbb{R}^{m_1\times\cdots\times m_n}$, let the (skinny) SVD of its convolution matrix be $\mathcal{A}_k(X) = U\Sigma V^T$. Then the convolution coherence of X is defined as the coherence of its convolution matrix $\mathcal{A}_k(X)$:

\mu_k(X) = \frac{m}{r_k(X)}\max\Big(\max_{1\le i\le m}\|[U]_{i,:}\|_F^2, \; \max_{1\le j\le k}\|[V]_{j,:}\|_F^2\Big),

where $m = \Pi_{j=1}^{n}m_j$ and $k = \Pi_{j=1}^{n}k_j$. As we will show later, the sampling complexity of CNNM, i.e., the percentage of observed entries that the method needs in order to restore $L_0$ successfully, may rely on $\mu_k(L_0)$. Candes and Recht [13] proved that, under certain conditions, the coherence parameters of matrices are bounded from above by some numerical constant.

4.2 Dual Problem

By the definition of the convolution matrix, recovering $L_0$ ensures recovery of its convolution matrix $\mathcal{A}_k(L_0)$. On the other hand, Lemma 4.1 implies that obtaining $\mathcal{A}_k(L_0)$ also suffices to recover $L_0$. So, STC can be converted into the following ordinary matrix completion (OMC) problem:

Problem 4.1 (Dual Problem). Use the same notations as in Problem 1.2. Denote by $\Omega_\mathcal{A}$ the convolution sampling set of Ω. Assume $\mathcal{A}_k(L_0) \approx \mathcal{A}_k(M)$. Given a subset of the entries in $\mathcal{A}_k(M)$ and the corresponding sampling set $\Omega_\mathcal{A}$, the goal is to recover the convolution matrix $\mathcal{A}_k(L_0)$.

As aforementioned, the pattern of $\Omega_\mathcal{A}$ is always well-posed, in the sense that some observations are available in every column and row of the matrix.3 Hence, provided that $\mathcal{A}_k(L_0)$ is low-rank, Problem 4.1 is exactly the low-rank OMC problem widely studied in the literature [13, 20, 45–47]. However, unlike the setting of random sampling adopted by most studies, the sampling regime here is deterministic rather than random, and thereby we have to rely on the techniques established by [42, 43]. For completeness of presentation, we briefly introduce the key techniques from [42, 43].

First of all, our analysis needs the concepts of the isomeric condition (or isomerism) and relative well-conditionedness.

Definition 4.2 ($\Omega/\Omega^T$-Isomeric [42]). Let $X \in \mathbb{R}^{m_1\times m_2}$ be a matrix and $\Omega \subset \{1,\cdots,m_1\}\times\{1,\cdots,m_2\}$ be a 2D sampling set. Suppose $\Omega^i \ne \emptyset$ and $\Omega_j \ne \emptyset$, $\forall i, j$. Then X is Ω-isomeric iff

\mathrm{rank}([X]_{\Omega_j,:}) = \mathrm{rank}(X), \quad \forall j = 1,\cdots,m_2.

Furthermore, the matrix X is called $\Omega/\Omega^T$-isomeric iff X is Ω-isomeric and $X^T$ is $\Omega^T$-isomeric.

As pointed out by [43], isomerism is merely a necessary condition for solving OMC, but not a sufficient one. To compensate for this weakness of isomerism, they proposed an additional condition called relative well-conditionedness.

3. To meet this, the kernel size needs to be chosen properly. For example, under the setup of Problem 1.1, $k_n$ should be greater than q.

Fig. 3. Investigating the difference between future data and randomly chosen missing entries. (a) The sine sequence used for experiments, $\{M_t\}_{t=1}^{m}$ with $M_t = \sin(2t\pi/m)$, so that $L_0$ is an m-dimensional vector with $r_k(L_0) = 2$ and $\mu_k(L_0) = 1$, $\forall m \ge k > 2$. (b) The sampling complexity under the setup of forecasting, where $\{M_t\}_{t=\rho_0 m+1}^{m}$ is the missing data. (c) The sampling complexity under the context of random sampling. In these experiments, the complexity is calculated as the smallest fraction of observed entries for the methods to succeed in recovering $L_0$, in the sense that the recovery accuracy measured by Peak Signal-to-Noise Ratio (PSNR) is greater than 50. Compared methods: DFTℓ1, CNNM ($k_1 = 0.3m$) and CNNM ($k_1 = 0.7m$).

Definition 4.3 ($\Omega/\Omega^T$-Relative Condition Number [43]). Use the same notations as in Definition 4.2. Suppose that $[X]_{\Omega_j,:} \ne 0$ and $[X]_{:,\Omega^i} \ne 0$, $\forall i, j$. Then the Ω-relative condition number of X is denoted by $\gamma_\Omega(X)$ and given by

\gamma_\Omega(X) = \min_{1\le j\le m_2} 1/\|X([X]_{\Omega_j,:})^+\|^2,

where $(\cdot)^+$ is the Moore-Penrose pseudo-inverse of a matrix. Furthermore, the $\Omega/\Omega^T$-relative condition number of X is denoted by $\gamma_{\Omega,\Omega^T}(X)$ and given by $\gamma_{\Omega,\Omega^T}(X) = \min(\gamma_\Omega(X), \gamma_{\Omega^T}(X^T))$.

In order to show that CNNM succeeds in recovering $L_0$ even when the missing entries are arbitrarily placed, it is necessary to prove that $\mathcal{A}_k(L_0)$ is $\Omega_\mathcal{A}/\Omega_\mathcal{A}^T$-isomeric and that $\gamma_{\Omega_\mathcal{A},\Omega_\mathcal{A}^T}(\mathcal{A}_k(L_0))$ is reasonably large as well. To do this, we need to apply the following lemma.

Lemma 4.2 ([43]). Use the same notations as in Definition 4.2, and let $\Theta_\Omega$ be the mask matrix of Ω. Denote by $\mu_X$ and $r_X$ the coherence and rank of the matrix X, respectively. Define a quantity ρ as

\rho = \min\Big(\min_{1\le i\le m_1}\|[\Theta_\Omega]_{i,:}\|_0/m_2, \; \min_{1\le j\le m_2}\|[\Theta_\Omega]_{:,j}\|_0/m_1\Big).

For any $0 \le \alpha < 1$, if $\rho > 1 - (1-\alpha)/(\mu_X r_X)$ then X is $\Omega/\Omega^T$-isomeric and $\gamma_{\Omega,\Omega^T}(X) > \alpha$.

4.3 Main Results

First consider the ideal case where the observed data is precise and noiseless, i.e., $\mathcal{P}_\Omega(M - L_0) = 0$. In this case, as shown below, the target tensor $L_0$ can be exactly recovered by a simplified version of CNNM (10).

Theorem 4.1 (Noiseless). Let $L_0 \in \mathbb{R}^{m_1\times\cdots\times m_n}$ and $\Omega \subset \{1,\cdots,m_1\}\times\cdots\times\{1,\cdots,m_n\}$. Let the adopted kernel size be $k_1\times\cdots\times k_n$ with $k_j \le m_j, \forall 1 \le j \le n$. Denote $k = \Pi_{j=1}^{n}k_j$, $m = \Pi_{j=1}^{n}m_j$ and $\rho_0 = \|\Theta_\Omega\|_0/m$. Denote by $r_k(L_0)$ and $\mu_k(L_0)$ the convolution rank and convolution coherence of the target $L_0$, respectively. Assume that $\mathcal{P}_\Omega(M) = \mathcal{P}_\Omega(L_0)$. If

\rho_0 > 1 - \frac{k}{4\mu_k(L_0)r_k(L_0)m},

then the exact solution, $L = L_0$, is the unique minimizer of the following convex optimization problem:

\min_L \; \|\mathcal{A}_k(L)\|_*, \quad \text{s.t. } \mathcal{P}_\Omega(L - M) = 0. \qquad (13)


By setting the kernel to have the same size as the target $L_0$, CNNM falls back to DFTℓ1. Thus, the following is indeed an immediate consequence of Theorem 4.1.

Corollary 4.1 (Noiseless). Use the same notations as in Theorem 4.1, and set $k_j = m_j, \forall 1 \le j \le n$. Assume that the observations are noiseless, namely $\mathcal{P}_\Omega(M) = \mathcal{P}_\Omega(L_0)$. If

\rho_0 > 1 - \frac{1}{4\mu_m(L_0)\|\mathcal{F}(L_0)\|_0},

then the exact solution, $L = L_0$, is the unique minimizer of the following convex optimization problem:

\min_L \; \|\mathcal{F}(L)\|_1, \quad \text{s.t. } \mathcal{P}_\Omega(L - M) = 0, \qquad (14)

which is a simplified version of DFTℓ1 (3).

The above theories imply that the sampling complexity, i.e., the lower bound on $\rho_0$, has no direct link to the tensor dimension m. This is quite unlike the random sampling based tensor completion theories (e.g., [13, 48]), which suggest that the sampling complexity for restoring the convolution matrix, $\mathcal{A}_m(L_0) \in \mathbb{R}^{m\times m}$, can be as low as $O(\mu_m(L_0)r_m(L_0)(\log m)^2/m)$ [48], and which would imply that the complexity should tend to decrease as m grows. In fact, there is no conflict because, as pointed out by [43], the sampling regime under forecasting is indeed deterministic rather than random. Figure 3 illustrates that the results derived from random sampling do not apply to forecasting, confirming the validity of our result.

Corollary 4.1 implies that Problem 1.1 is solved by the DFTℓ1 program (14) provided that $M = L_0$ and

p > q(4\mu_m(L_0)\|\mathcal{F}(L_0)\|_0 - 1).

So, in order to succeed in forecasting the future data, it could be helpful to increase the number of historical samples. This becomes clearer by looking into the sparsity of $\mathcal{F}(L_0)$. For example, when $\|\mathcal{F}(L_0)\|_0 < m^\beta/\mu_m(L_0)$ with $0 \le \beta < 1$, Corollary 4.1 gives that the next q values of a given sequence of p tensors are identified provided that

p > 4^{1/(1-\beta)}m^{\beta/(1-\beta)}q^{1/(1-\beta)},

where $m = \Pi_{j=1}^{n-1}m_j$ is the dimension of the values in a tensor-valued time series. The interpretation of Theorem 4.1 is similar. The main difference is that CNNM can further reduce the sampling complexity by choosing a proper kernel size, as we will explain in Section 4.4.
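As a concrete instance of the bound (plugging in the sine sequence of Figure 3, for which $\|\mathcal{F}(L_0)\|_0 = 2$ and $\mu_m(L_0) = 1$; this is only an illustrative substitution, not an additional result):

```latex
p > q\,\big(4\,\mu_m(L_0)\,\|\mathcal{F}(L_0)\|_0 - 1\big)
  = q\,(4 \cdot 1 \cdot 2 - 1) = 7q,
% i.e., for this particular signal the sufficient condition of
% Corollary 4.1 asks for at least seven historical samples per future value.
```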

The programs in (13) and (14) are designed for the case where the observations are noiseless. This is usually not true in practice and, even more, the convolution matrix of M is often not strictly low-rank. So, it is more reasonable to assume that there is a deviation between M and $L_0$; namely, $\|\mathcal{P}_\Omega(M - L_0)\|_F \le \varepsilon$. In this case, as we will show soon, the target $L_0$ can still be accurately recovered by the following program, which is equivalent to (10):

\min_L \; \|\mathcal{A}_k(L)\|_*, \quad \text{s.t. } \|\mathcal{P}_\Omega(L - M)\|_F \le \varepsilon, \qquad (15)

where ε > 0 is a parameter.

Theorem 4.2 (Noisy). Use the same notations as in Theorem 4.1. Suppose that $\|\mathcal{P}_\Omega(M - L_0)\|_F \le \varepsilon$. If

\rho_0 > 1 - \frac{0.22k}{\mu_k(L_0)r_k(L_0)m},

then any optimal solution $L^o$ to the CNNM program (15) gives a near recovery of the target tensor $L_0$, in the sense that

\|L^o - L_0\|_F \le (1 + \sqrt{2})(38\sqrt{k} + 2)\varepsilon.

Fig. 4. Exploring the influences of the parameters in CNNM, using a 50 × 50 image patch as the experimental data. (a) Recovery accuracy (PSNR) as a function of the kernel size ($k_1 = k_2$), for missing rates of 5%, 35% and 65%. (b) Averaged coding length as a function of the kernel size, for θ = 0.1, 0.05 and 0.01. (c) Recovery accuracy as a function of the parameter λ (on a log10 scale), for noise levels σ = 0, 0.01 and 0.04. For the experiments in (c), the missing rate is set as 35% and the observed entries are contaminated by iid Gaussian noise with mean 0 and standard deviation σ. Note that in this paper the PSNR measure is evaluated only on the missing entries.

Again, the relationship between CNNM and DFTℓ1 leads to the following result:

Corollary 4.2 (Noisy). Use the same notations as in Theorem 4.1, and set $k_j = m_j, \forall j$. Suppose that $\|\mathcal{P}_\Omega(M - L_0)\|_F \le \varepsilon$ and $L^o$ is an optimal solution to the following DFTℓ1 program:

\min_L \; \|\mathcal{F}(L)\|_1, \quad \text{s.t. } \|\mathcal{P}_\Omega(L - M)\|_F \le \varepsilon. \qquad (16)

If $\rho_0 > 1 - 0.22/(\mu_m(L_0)\|\mathcal{F}(L_0)\|_0)$, then $L^o$ gives a near recovery of $L_0$, in the sense that

\|L^o - L_0\|_F \le (1 + \sqrt{2})(38\sqrt{m} + 2)\varepsilon.

It is worth noting that, interestingly, the results in Theorem 4.2 and Corollary 4.2 are valid even if the data tensor M does not have any low-rank structure at all. This is because, for any tensor M, one can always decompose it into $M = L_0 + N$ with $\mathcal{A}_k(L_0)$ being strictly low-rank and $\|N\|_F \le \varepsilon$. This, however, does not mean that CNNM and DFTℓ1 can work well on all datasets: whenever low-rankness is absent from M, the residual ε could be large and the obtained estimate is not necessarily accurate. For our methods to produce satisfactory results, it is required that the convolution rank of the data tensor M is low or approximately so.

4.4 Discussions

On Influences of Parameters: The hyper-parameters in CNNM mainly include the kernel size $k_1\times\cdots\times k_n$ and the regularization parameter λ. According to the results in Figure 3, it seems beneficial to use large kernels. But this is not the case with most real-world datasets. As shown in Figure 4(a), the recovery accuracy of CNNM increases with the enlargement of the adopted kernel size at first, but then drops eventually as the kernel size continues to grow. In fact, both phenomena are consistent with our theories, which imply that the sampling complexity $\rho_0$ is bounded from below by a quantity dominated by $r_k(L_0)/k$. For the particular example in Figure 3, $r_k(L_0) \equiv 2, \forall k \ge 2$, and thus a large k produces better recovery. However, in most real-world datasets, the convolution rank may increase as the kernel size grows.


Fig. 5. Periodicity generally leads to Fourier sparsity, no matter how complicated the structure of a single period is. The sparsity degree is measured by the Gini index [50], which ranges from 0 to 1. (a) The signal of one period, consisting of 50 random numbers (Gini = 0.28). (b) The signal of three periods, obtained by repeating the signal in (a) three times (Gini = 0.76). (c) Gini index versus the number of periods.

To show the consistency in this case, we investigate empirically the coding length [49] of the convolution matrix of $L_0$:

\mathrm{CL}_\theta(\mathcal{A}_k(L_0)) = \frac{1}{2}(m + k)\log\det\Big(\mathrm{I} + \frac{m}{k\theta^2}\mathcal{A}_k(L_0)(\mathcal{A}_k(L_0))^T\Big),

where det(·) is the determinant of a matrix and θ > 0 is a parameter. In general, $\mathrm{CL}_\theta(\mathcal{A}_k(L_0))$ is no more than a computationally friendly approximation of $r_k(L_0)$, and thereby a reasonable approximation of $r_k(L_0)/k$ is given by

\mathrm{ACL}_\theta(\mathcal{A}_k(L_0)) = \frac{\mathrm{CL}_\theta(\mathcal{A}_k(L_0))}{k} = \frac{1}{2}\Big(\frac{m}{k} + 1\Big)\log\det\Big(\mathrm{I} + \frac{m}{k\theta^2}\mathcal{A}_k(L_0)(\mathcal{A}_k(L_0))^T\Big),

where $\mathrm{ACL}_\theta(\cdot)$ is the averaged coding length of a matrix.

As we can see from Figure 4(b), the averaged coding length of $\mathcal{A}_k(L_0)$ is minimized at some value between k = 1 and k = m, and, interestingly, the minimizer can coincide with the point that maximizes the recovery accuracy provided that the parameter θ is chosen properly. So, to achieve the best performance, the kernel size in CNNM has to be chosen carefully, which is not easy. Fortunately, CNNM is in fact a reliable method and seldom breaks down unless the kernel is too small (e.g., $k_j = 1, \forall j$). When applying CNNM to natural images, it is moderately good to use $k_1 = k_2 = 13$, as will be shown in our experiments.
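A possible way to automate this choice is to scan kernel sizes and pick the minimizer of $\mathrm{ACL}_\theta$; the following is a 1D sketch of ours (NumPy; it evaluates the determinant through the $k\times k$ Gram matrix, using $\det(\mathrm{I}_m + cAA^T) = \det(\mathrm{I}_k + cA^TA)$):

```python
import numpy as np

def conv_matrix(X, k):
    return np.column_stack([np.roll(X, j) for j in range(k)])

def avg_coding_length(X, k, theta=0.05):
    # ACL_theta(A_k(X)), a surrogate for r_k(X)/k used to pick the kernel size.
    A = conv_matrix(X, k)
    m = X.size
    gram = np.eye(k) + (m / (k * theta ** 2)) * (A.T @ A)   # k x k instead of m x m
    return 0.5 * (m / k + 1.0) * np.linalg.slogdet(gram)[1]

X = np.sin(2 * np.pi * np.arange(200) / 200) + 0.01 * np.random.randn(200)
k_best = min(range(2, 60), key=lambda k: avg_coding_length(X, k))
print(k_best)
```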

Figure 4(c) shows the influence of the parameter λ. It can be seen that there is no loss in enlarging λ. The reason is that we evaluate the recovery accuracy on the missing entries only. So, no matter whether the observations are contaminated by noise or not, we suggest setting λ = 1000 for CNNM and DFTℓ1. To remove the noise possibly existing in the observations, one may use some denoising method to postprocess the results.

On Fourier Sparsity and Convolution Low-Rankness: The proposed methods, DFTℓ1 and its generalization CNNM, depend directly or indirectly on Fourier sparsity, the phenomenon that most of the Fourier coefficients of a signal are zero or approximately so, which appears frequently in many domains, ranging from images [23] and videos [51] to Boolean functions [52] and wideband channels [53]. For a 1D signal represented by an m-dimensional vector, it is well known that sparsity will show up whenever a pattern is observed repeatedly over time or, in other words, the signal is periodic (or nearly periodic) and the series has lasted long enough (i.e., m is large), as confirmed by Figure 5. Such an explanation, however, does not generalize to high-order tensors, in which there are multiple time-like directions.

Convolution low-rankness, a generalization of Fourier sparsity, is more interpretable. When n = 1 and $L_0$ is a vector, the jth column of the convolution matrix $\mathcal{A}_k(L_0)$ is simply the vector obtained by circularly shifting the entries of $L_0$ by j − 1 positions. Now, one may see why or why not $\mathcal{A}_k(L_0)$ is low-rank. More precisely, when $L_0$ possesses substantial continuity and the shift degree is relatively small, the signals before and after the circular shift are mostly the same and therefore the low-rankness of $\mathcal{A}_k(L_0)$ will be present. Otherwise, the shift operator might cause mis-alignment and prevent low-rankness; but this is not absolute: a non-periodic signal full of dynamics and discontinuities can happen to have a convolution matrix that is approximately low-rank (see the lower left part of Figure 7). The general case of order-n tensors is similar, and the only difference is that a multi-directional shift operator is used to move the entries of a tensor along multiple directions.

Beneath it all, Fourier sparsity and convolution low-rankness are not intuitive phenomena whose interpretations come out very clearly, and thus it is not easy to determine the exact scope of CNNM and DFTℓ1. Our theories show that, when the degree of convolution low-rankness declines, the performance of CNNM (and DFTℓ1) may be depressed, in the sense that the required sampling complexity has to increase; the forecasting ability will be lost when the convolution rank exceeds a certain threshold. Hence, roughly, our methods are applicable to natural images, videos, and similar data.

Connections Between DFTℓ1 and RNN: In the light of learning-based optimization [54, 55], DFTℓ1 is actually related to RNN. Consider the optimization problem in (3). Let the DFT of L be $\mathcal{F}(L) = L \times_1 U_1 \cdots \times_n U_n$. Then we have $\mathrm{vec}(\mathcal{F}(L)) = U\mathrm{vec}(L)$, where $U = U_1 \otimes \cdots \otimes U_n$. Taking $Y = \mathrm{vec}(L)$, one can see that the formulation of DFTℓ1 is indeed equivalent to

\min_{Y\in\mathbb{R}^m} \; \|UY\|_1 + \frac{\lambda}{2}\|SY - B\|_F^2,

where $B = \mathrm{vec}(\mathcal{P}_\Omega(M)) \in \mathbb{R}^m$ and $S \in \mathbb{R}^{m\times m}$ is a selection matrix that satisfies $S\mathrm{vec}(Z) = \mathrm{vec}(\mathcal{P}_\Omega(Z)), \forall Z \in \mathbb{R}^{m_1\times\cdots\times m_n}$. Now, taking $X = UY$ simply leads to

\min_{X\in\mathbb{C}^m} \; \|X\|_1 + \frac{\lambda}{2}\|AX - B\|_F^2,

where $A = SU^H$. The above problem can be solved by the Iterative Shrinkage and Thresholding Algorithm (ISTA) [56], which iterates the following procedure:

X_{k+1} = h_{1/(\lambda\rho)}\Big(X_k - \frac{A^H(AX_k - B)}{\rho}\Big),

where $h_\alpha(\cdot)$ is the entry-wise shrinkage operator given by (11). Take $W = \mathrm{I} - A^HA/\rho = \mathrm{I} - USU^H/\rho$ and $D = A^HB/\rho = US\mathrm{vec}(\mathcal{P}_\Omega(M))/\rho$. Then we have

X_{k+1} = h_{1/(\lambda\rho)}(WX_k + D).


So, DFTℓ1 is essentially a particular RNN with activation function $h_\alpha(\cdot)$, and the resulting network parameters are fixed complex numbers determined by the Fourier basis.
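For illustration, the recurrence can be run as plain ISTA on a generic pair (A, B) (a sketch of ours, assuming NumPy; in the DFTℓ1 case A corresponds to $SU^H$ and ρ to an upper bound on $\|A\|^2$):

```python
import numpy as np

def ista_l1(A, B, lam=1000.0, n_iter=500):
    # ISTA for min_X ||X||_1 + (lam/2)||A X - B||^2, written as the fixed
    # recurrence X <- h_{1/(lam*rho)}(W X + D) discussed in the text.
    rho = np.linalg.norm(A, 2) ** 2 + 1e-12     # step-size constant, rho >= ||A||^2
    W = np.eye(A.shape[1]) - A.conj().T @ A / rho
    D = A.conj().T @ B / rho
    X = np.zeros(A.shape[1], dtype=complex)
    thresh = 1.0 / (lam * rho)
    for _ in range(n_iter):
        G = W @ X + D
        mag = np.abs(G)
        # Each iteration is one step of a recurrent map with weights W,
        # bias D, and activation h_{1/(lam*rho)}.
        X = np.where(mag > thresh, (mag - thresh) / np.maximum(mag, 1e-12), 0.0) * G
    return X
```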

CNNM is somewhat related to the well-known Convolutional Neural Network (CNN) [57]. But the exact relationship is complicated and hard to figure out.

5 MATHEMATICAL PROOFS

This section presents in detail the proofs of the proposed lemmas and theorems.

5.1 Proof of Lemma 4.1

Proof. Denote by S the “circshift” operator in Matlab; namely, S(G, u, v) circularly shifts the elements in tensor G by u positions along the vth direction. For any $(i_1,\cdots,i_n) \in \{1,\cdots,k_1\}\times\cdots\times\{1,\cdots,k_n\}$, we define an invertible operator $\mathcal{T}_{(i_1,\cdots,i_n)} : \mathbb{R}^{m_1\times\cdots\times m_n} \to \mathbb{R}^m$ as

\mathcal{T}_{(i_1,\cdots,i_n)}(G) = \mathrm{vec}(G_n), \quad \forall G \in \mathbb{R}^{m_1\times\cdots\times m_n},

where vec(·) is the vectorization operator and $G_n$ is determined by the following recursive rule:

G_0 = G, \quad G_h = S(G_{h-1}, i_h - 1, h), \quad 1 \le h \le n.

Suppose that $j = 1 + \sum_{a=1}^{n}(i_a - 1)\Pi_{b=0}^{a-1}k_b$, where it is conveniently assumed that $k_0 = 1$. Then we have

[\mathcal{A}_k(X)]_{:,j} = \mathcal{T}_{(i_1,\cdots,i_n)}(X). \qquad (17)

According to the definition of the Hermitian adjoint opera-tor given in (5), we have

A∗k(Z) =∑

i1,··· ,in

T −1(i1,··· ,in)([Z]:,j),∀Z ∈ Rm×k, (18)

where it is worth noting that the number j functionallydepends on the index (i1, · · · , in). By (17) and (18),

A∗kAk(X) =∑

i1,··· ,in

T −1(i1,··· ,in)T(i1,··· ,in)(X) = kX.

The second claim is easy to prove. By (6), (12) and (17),

[AkPΩ(X)]:,j = T(i1,··· ,in)(ΘΩ X) =

T(i1,··· ,in)(ΘΩ) T(i1,··· ,in)(X) = [ΘΩA ]:,j [Ak(X)]:,j

= [PΩAAk(X)]:,j .

It remains to prove the third claim. By (6) and (12),

A∗kPΩA(Y ) = A∗k(ΘΩA Y ) = A∗k(Ak(ΘΩ) Y ),

which, together with (17) and (18), gives that

A∗kPΩA(Y ) =∑

i1,··· ,in

T −1(i1,··· ,in)([Ak(ΘΩ) Y ]:,j)

=∑

i1,··· ,in

T −1(i1,··· ,in)([Ak(ΘΩ)]:,j) T −1

(i1,··· ,in)([Y ]:,j)

=∑

i1,··· ,in

ΘΩ T −1(i1,··· ,in)([Y ]:,j) = PΩA∗k(Y ).
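Readers who wish to verify these identities numerically can do so with a few lines of NumPy. The sketch below is our own illustration (the column ordering may differ from the index map j used above): it implements the multi-directional shift operator, the convolution operator A_k built from it, its adjoint A*_k, and checks the first claim, A*_k A_k(X) = kX, on a random 2-D tensor.

```python
import numpy as np
from itertools import product

def A_k(X, kernel):
    """Convolution operator: each column of the m x k matrix is a
    multi-directionally shifted, vectorized copy of X."""
    axes = tuple(range(X.ndim))
    cols = [np.roll(X, shifts, axis=axes).ravel()
            for shifts in product(*[range(ki) for ki in kernel])]
    return np.stack(cols, axis=1)

def A_k_adj(Z, shape, kernel):
    """Adjoint of A_k: undo each shift and sum the columns up."""
    axes = tuple(range(len(shape)))
    out = np.zeros(shape)
    for j, shifts in enumerate(product(*[range(ki) for ki in kernel])):
        out += np.roll(Z[:, j].reshape(shape),
                       tuple(-s for s in shifts), axis=axes)
    return out

shape, kernel = (6, 5), (3, 4)
k = int(np.prod(kernel))
X = np.random.randn(*shape)

# First claim of Lemma 4.1: A*_k A_k(X) = k X
lhs = A_k_adj(A_k(X, kernel), shape, kernel)
print(np.allclose(lhs, k * X))   # expected: True
```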

5.2 Proof of Theorem 4.1

The proof process is quite standard. We shall first prove the following lemma, which establishes the conditions under which the solution to (13) is unique and exact.

Lemma 5.1. Suppose that the SVD of the convolution matrix of L_0 is given by A_k(L_0) = U_0 Σ_0 V_0^T. Denote by P_{T_0}(·) = U_0 U_0^T(·) + (·)V_0 V_0^T − U_0 U_0^T(·)V_0 V_0^T the orthogonal projection onto the sum of U_0 and V_0. Then L_0 is the unique minimizer to the problem in (13) provided that:

1. P⊥_{Ω_A} ∩ P_{T_0} = {0}.
2. There exists Y ∈ R^{m×k} such that P_{T_0} P_{Ω_A}(Y) = U_0 V_0^T and ‖P⊥_{T_0} P_{Ω_A}(Y)‖ < 1.

Proof. Take W = P⊥_{T_0} P_{Ω_A}(Y). Then A*_k(U_0 V_0^T + W) = A*_k P_{Ω_A}(Y). By Lemma 4.1,

A*_k P_{Ω_A}(Y) = P_Ω A*_k(Y) ∈ P_Ω.

By the standard convexity arguments shown in [58], L_0 is an optimal solution to the convex optimization problem in (13).

It remains to prove that L_0 is the unique minimizer. To do this, we consider a feasible solution L_0 + ∆ with P_Ω(∆) = 0, and we shall show that the objective value strictly increases unless ∆ = 0. Due to the convexity of the nuclear norm, we have

‖A_k(L_0 + ∆)‖_* − ‖A_k(L_0)‖_* ≥ 〈A*_k(U_0 V_0^T + H), ∆〉 = 〈U_0 V_0^T + H, A_k(∆)〉,

where H ∈ P⊥_{T_0} and ‖H‖ ≤ 1. By the duality between the operator and nuclear norms, we can always choose an H such that

〈H, A_k(∆)〉 = ‖P⊥_{T_0} A_k(∆)‖_*.

In addition, it follows from Lemma 4.1 that

〈P_{Ω_A}(Y), A_k(∆)〉 = 〈Y, P_{Ω_A} A_k(∆)〉 = 〈Y, A_k P_Ω(∆)〉 = 0.

Hence, we have

〈U_0 V_0^T + H, A_k(∆)〉 = 〈P_{Ω_A}(Y) + H − W, A_k(∆)〉 = 〈H − W, A_k(∆)〉 ≥ (1 − ‖W‖)‖P⊥_{T_0} A_k(∆)‖_*.

Since ‖W‖ < 1, ‖A_k(L_0 + ∆)‖_* is greater than ‖A_k(L_0)‖_* unless A_k(∆) ∈ P_{T_0}. Note that P_{Ω_A} A_k(∆) = A_k P_Ω(∆) = 0, i.e., A_k(∆) ∈ P⊥_{Ω_A}. Since P⊥_{Ω_A} ∩ P_{T_0} = {0}, it follows that A_k(∆) = 0, which immediately leads to ∆ = 0.

In the rest of the proof, we shall show how to prove the dual conditions listed in Lemma 5.1. Notice that, even if the locations of the missing entries are arbitrarily distributed, each column of Ω_A has a cardinality of exactly ρ_0 m, and each row of Ω_A contains at least k − (1 − ρ_0)m elements. Denote by ρ the smallest sampling complexity of each row and column of Ω_A. Provided that ρ_0 > 1 − 0.25k/(μ_k(L_0) r_k(L_0) m), we have

ρ ≥ (k − (1 − ρ_0)m)/k > 1 − 0.25/(μ_k(L_0) r_k(L_0)).

Then it follows from Lemma 4.2 that A_k(L_0) is Ω_A/Ω_A^T-isomeric and γ_{Ω_A, Ω_A^T}(A_k(L_0)) > 0.75. Thus, according to Lemma 5.11 of [43], we have

‖P_{T_0} P⊥_{Ω_A} P_{T_0}‖ ≤ 2(1 − γ_{Ω_A, Ω_A^T}(A_k(L_0))) < 0.5 < 1,


which, together with Lemma 5.6 of [43], results in P⊥_{Ω_A} ∩ P_{T_0} = {0}. As a consequence, we can define Y as

Y = P_{Ω_A} P_{T_0}(P_{T_0} P_{Ω_A} P_{T_0})^{−1}(U_0 V_0^T).

It can be verified that P_{T_0} P_{Ω_A}(Y) = U_0 V_0^T. Moreover, it follows from Lemma 5.12 of [43] that

‖P⊥_{T_0} P_{Ω_A}(Y)‖ ≤ ‖P⊥_{T_0} P_{Ω_A} P_{T_0}(P_{T_0} P_{Ω_A} P_{T_0})^{−1}‖ ‖U_0 V_0^T‖ = √(1/(1 − ‖P_{T_0} P⊥_{Ω_A} P_{T_0}‖) − 1) < 1,

which finishes the construction of the dual certificate.

5.3 Proof of Theorem 4.2

Proof. Let N = L_o − L_0 and denote N_A = A_k(N). Notice that ‖P_Ω(L_o − M)‖_F ≤ ε and ‖P_Ω(M − L_0)‖_F ≤ ε. By the triangle inequality, ‖P_Ω(N)‖_F ≤ 2ε. Thus,

‖P_{Ω_A}(N_A)‖_F^2 = ‖A_k P_Ω(N)‖_F^2 = k‖P_Ω(N)‖_F^2 ≤ 4kε^2.

To bound ‖N‖_F, it is sufficient to bound ‖N_A‖_F. So, it remains to bound ‖P⊥_{Ω_A}(N_A)‖_F. To do this, we define Y and W in the same way as in the proof of Theorem 4.1. Since L_o = L_0 + N is an optimal solution to (15), we have the following:

0 ≥ ‖A_k(L_0 + N)‖_* − ‖A_k(L_0)‖_* ≥ (1 − ‖W‖)‖P⊥_{T_0}(N_A)‖_* + 〈P_{Ω_A}(Y), N_A〉.

Provided that ρ_0 > 1 − 0.22/(μ(L_0) r(L_0)), we can prove that ‖W‖ = ‖P⊥_{T_0} P_{Ω_A}(Y)‖ < 0.9. As a consequence, we have the following:

‖P⊥_{T_0}(N_A)‖_* ≤ −10〈P_{Ω_A}(Y), P_{Ω_A}(N_A)〉 ≤ 10‖P_{Ω_A}(Y)‖‖P_{Ω_A}(N_A)‖_* ≤ 19‖P_{Ω_A}(N_A)‖_* ≤ 19√k‖P_{Ω_A}(N_A)‖_F ≤ 38kε,

from which it follows that ‖P⊥_{T_0}(N_A)‖_F ≤ ‖P⊥_{T_0}(N_A)‖_* ≤ 38kε, and which simply leads to

‖P⊥_{T_0} P⊥_{Ω_A}(N_A)‖_F ≤ (38k + 2√k)ε.

We also have

‖P_{Ω_A} P_{T_0} P⊥_{Ω_A}(N_A)‖_F^2 = 〈P_{T_0} P_{Ω_A} P_{T_0} P⊥_{Ω_A}(N_A), P_{T_0} P⊥_{Ω_A}(N_A)〉 ≥ (1 − ‖P_{T_0} P⊥_{Ω_A} P_{T_0}‖)‖P_{T_0} P⊥_{Ω_A}(N_A)‖_F^2 ≥ (1/2)‖P_{T_0} P⊥_{Ω_A}(N_A)‖_F^2,

which gives that

‖P_{T_0} P⊥_{Ω_A}(N_A)‖_F^2 ≤ 2‖P_{Ω_A} P_{T_0} P⊥_{Ω_A}(N_A)‖_F^2 = 2‖P_{Ω_A} P⊥_{T_0} P⊥_{Ω_A}(N_A)‖_F^2 ≤ 2(38k + 2√k)^2 ε^2.

Combining the above justifications, we have

‖N_A‖_F ≤ ‖P_{T_0} P⊥_{Ω_A}(N_A)‖_F + ‖P_{T_0} P_{Ω_A}(N_A)‖_F + ‖P⊥_{T_0}(N_A)‖_F ≤ (√2 + 1)(38k + 2√k)ε.

Finally, the fact ‖N‖_F = ‖N_A‖_F/√k finishes the proof.

[Two panels: deterministic sampling and random sampling; horizontal axis: observed entries (%); vertical axis: convolution rank.]
Fig. 6. Investigating the recovery performance of DFTℓ1 under the setups of random sampling and forecasting (i.e., deterministic sampling). The white and black areas mean "success" and "failure", respectively, where a recovery is regarded as successful iff PSNR > 50.

6 EXPERIMENTS

All experiments are conducted on a workstation equipped with two Intel(R) Xeon(R) E5-2620 v4 2.10GHz CPU processors and 256GB RAM. All the methods considered for comparison are implemented in Matlab 2019a, and the codes are available at https://github.com/gcliu1982/CNNM.

6.1 Simulations

We shall verify the proven theories with synthetic data. We generate M and L_0 according to the following model: [M]_t = [L_0]_t = Σ_{i=1}^{a} sin(2tiπ/m), where m = 1000, t = 1, · · · , m and a = 1, · · · , 19. So the target L_0 is a univariate time series of dimension 1000, with ‖F(L_0)‖_0 = 2a and μ_m(L_0) = 1. The values in L_0 are further normalized to have a maximum of 1. Regarding the locations of the observed entries, we consider two settings: one is to randomly select a subset of entries as in Figure 3(c), referred to as "random sampling"; the other is the forecasting setup used in Figure 3(b), referred to as "deterministic sampling". The observation fraction is set as ρ_0 = 0.05, 0.1, · · · , 0.95. For the setup of random sampling, we run 20 trials, and thus there are 19 × 19 × 21 = 7581 simulations in total.
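For concreteness, the synthetic series can be generated with a few lines of NumPy (a sketch under our reading of the model above; the normalization step and the sparsity threshold are illustrative):

```python
import numpy as np

m, a = 1000, 5                       # a ranges over 1, ..., 19 in the experiments
t = np.arange(1, m + 1)
L0 = sum(np.sin(2 * t * i * np.pi / m) for i in range(1, a + 1))
L0 = L0 / np.abs(L0).max()           # normalize the maximum magnitude to 1

# The DFT of a sum of a sinusoids has exactly 2a nonzero coefficients
F = np.fft.fft(L0)
print("||F(L0)||_0 =", int(np.sum(np.abs(F) > 1e-8 * np.abs(F).max())))   # expected: 2a
```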

The results are shown in Figure 6, the left part of which illustrates that the bound proven in Corollary 4.1 is indeed tight. As we can see, recovering future observations is much more challenging than restoring missing entries chosen at random. In fact, forecasting could be regarded as the worst case of tensor completion. In these experiments, the convolution rank r_k(L_0) does not grow with the kernel size, so there is no benefit in trying different kernel sizes in CNNM.

6.2 Univariate Time Series Forecasting

Now we consider two real-world time series downloaded from the Time Series Data Library (TSDL): one for the annual Wolfer sunspot numbers from 1770 to 1869, with m = 100, and the other for the highest mean monthly levels of Lake Michigan from 1860 to 1955, with m = 96. We consider for comparison the well-known methods of ARMA and LSTM.4 ARMA contains many hyper-parameters, which are manually tuned to maximize its recovery accuracy (in terms of PSNR) on the first sequence.

4. Unlike the recently proposed deep architectures, the classical LSTM supports the setup of only one sequence.


[Panel annotations, one row per series: Gini = 0.65: ARMA(5,4) PSNR = 19.86, LSTM PSNR = 17.27, DFTℓ1 PSNR = 19.10, CNNM (k1 = 56) PSNR = 19.30; Gini = 0.94: ARMA(5,4) PSNR = 36.38, LSTM PSNR = 33.98, DFTℓ1 PSNR = 40.40, CNNM (k1 = 74) PSNR = 41.47.]
Fig. 7. Results for univariate time series forecasting. The left figure shows the entire series used for the experiments, where the observed and future entries are plotted with black and green markers, respectively.

TABLE 1
Comparison results obtained on 10 time series from TSDL, in terms of PSNR and running time.

                          PSNR                          averaged
methods             max     min     mean    std      time (secs)
ARMA(2,3)           33.84   5.63    20.77   7.26     1.67
LSTM                34.88   13.17   20.38   7.03     46.35
DFTℓ1               40.40   12.86   21.58   7.89     0.01
CNNM(k1 = 0.3m)     40.47   14.23   21.92   7.79     1.02
CNNM(k1 = 0.4m)     40.85   14.23   21.93   7.80     1.77
CNNM(k1 = 0.5m)     41.15   14.12   21.88   7.83     2.77

The LSTM architecture used for this experiment consists of four layers, including an input layer with 1 unit, a hidden LSTM layer with 200 units, a fully connected layer and a regression layer. The results are shown in Figure 7. By manually choosing the best parameters, ARMA can achieve the best performance on the first dataset, but its results are unsatisfactory when the same parametric setting is applied to the second one. By contrast, DFTℓ1 produces reasonable results on both datasets, which differ greatly in their evolution rules. This is not so surprising, because the method never assumes explicitly how the future entries are related to the previously observed ones; it is indeed the Fourier sparsity of the series itself that enables the recovery of the unseen future data.

We have in hand 10 univariate time series from TSDL. These sequences have different dimensions, ranging from 56 to 1461, and Fourier Gini indices from 0.52 to 0.94. For each sequence, its last 10 percent is treated as the unseen future data. Since the previously used ARMA(5,4) model needs a considerable number of historical samples to train and occasionally crashes when dealing with short sequences, we consider instead an ARMA(2,3) model as the baseline. The comparison results are shown in Table 1. It can be seen that DFTℓ1 outperforms ARMA(2,3) and LSTM in terms of both recovery accuracy and computational efficiency. By choosing proper kernel sizes, CNNM can further outperform DFTℓ1, though the improvements here are not dramatic. When dealing with images and videos, the recovery accuracy of CNNM is distinctly higher than that of DFTℓ1, as will be shown in the next two subsections.
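To illustrate how CNNM can be applied to such a forecasting task, below is a minimal NumPy sketch of one possible ADMM-style solver for a regularized variant of the CNNM program (convolution nuclear norm plus a quadratic data-fit term) in the univariate case. It is an assumption-laden illustration rather than the released Matlab implementation; in particular, the solver structure and the values of lam, rho and the iteration count are our own choices.

```python
import numpy as np

def A_k(x, k):
    """Convolution matrix of a vector: column j is x circularly shifted by j."""
    return np.stack([np.roll(x, j) for j in range(k)], axis=1)

def A_k_adj(Z):
    """Adjoint of A_k: undo each shift and sum the columns."""
    return sum(np.roll(Z[:, j], -j) for j in range(Z.shape[1]))

def svt(Z, tau):
    """Singular value thresholding, the proximal operator of the nuclear norm."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def cnnm_forecast(M, mask, k, lam=1e3, rho=1.0, iters=300):
    """Minimize ||A_k(L)||_* + lam/2 ||mask*(L - M)||_F^2 by ADMM with the
    splitting Z = A_k(L), using A_k* A_k = k I (Lemma 4.1) in the L-update."""
    m = len(M)
    L = M * mask
    Y = np.zeros((m, k))
    for _ in range(iters):
        Z = svt(A_k(L, k) + Y / rho, 1.0 / rho)
        # closed-form L-update: (lam*mask + rho*k) * L = lam*mask*M + rho*A_k*(Z - Y/rho)
        L = (lam * mask * M + rho * A_k_adj(Z - Y / rho)) / (lam * mask + rho * k)
        Y = Y + rho * (A_k(L, k) - Z)
    return L

# Forecast the last 10% of a short synthetic series with kernel size k1 = 0.3 m
m = 100
t = np.arange(m)
M = np.sin(2 * np.pi * 3 * t / m) + 0.3 * np.sin(2 * np.pi * 8 * t / m)
mask = np.ones(m); mask[-10:] = 0.0
L_hat = cnnm_forecast(M, mask, k=30)
print("error on the future part:", np.linalg.norm((L_hat - M)[mask == 0]))
```

The kernel size k plays the same role as k1 in Table 1; since the per-iteration cost is dominated by an SVD of the m × k convolution matrix, larger kernels are more expensive, consistent with the O(mk²) complexity discussed in the image-completion experiments below.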

Fig. 8. Evaluating the recovery performance of various methods (LRMC, DFTℓ1 and CNNM), using the 200×200 Lena image as the experimental data. The plot reports PSNR against the fraction of missing entries (%). The missing entries are chosen uniformly at random, and the numbers plotted are averaged over 20 random trials.

TABLE 2
PSNR and running time on the 200 × 200 Lena image. In these experiments, 60% of the image pixels are randomly chosen to be missing. The numbers shown below are collected from 20 runs.

Methods           PSNR            Time (seconds)
LRMC              21.65 ± 0.06    18.7 ± 0.5
DFTℓ1             25.13 ± 0.06    0.51 ± 0.01
CNNM(13 × 13)     27.20 ± 0.11    65.3 ± 0.8
CNNM(23 × 23)     27.45 ± 0.12    257 ± 3
CNNM(33 × 33)     27.53 ± 0.12    717 ± 12
CNNM(43 × 43)     27.57 ± 0.12    1478 ± 24
CNNM(53 × 53)     27.58 ± 0.11    2630 ± 69
CNNM(63 × 63)     27.57 ± 0.11    5482 ± 96

6.3 Image Completion

The proposed methods, DFTℓ1 and CNNM, are indeed general methods for completing sequential tensors rather than specific forecasting models. To validate their completion performance, we consider the task of restoring the 200×200 Lena image from its incomplete versions. We also include the results of LRMC [13] for comparison. Figure 8 evaluates the recovery performance of the various methods. It is clear that LRMC is distinctly outperformed by DFTℓ1, which is in turn outperformed by a large margin by CNNM.


[Panel annotations: inputs with 35% and 55% of pixels missing; LRMC: PSNR = 5.01 and 22.47; DFTℓ1: PSNR = 26.28 and 25.71; CNNM: PSNR = 28.95 and 28.13.]
Fig. 9. Examples of image completion. From left to right: the input data, the results by LRMC, the results by DFTℓ1, and the results by CNNM. The Gini index of the Fourier transform of the original Lena image is 0.79. Note that the PSNR value is computed only on the missing entries. When evaluating over the whole image, the PSNR of the image at the lower right corner is about 36.

Fig. 10. All 62 frames of the 50 × 50 × 62 video used for the experiments. This sequence records the entire process of a bus passing through a certain location on a highway. The Gini index of the Fourier transform of this dataset is 0.73.

Figure 9 shows that the images restored by CNNM are visually better than those of DFTℓ1, whose results contain many artifacts but are still better than those of LRMC. In particular, the second row of Figure 9 illustrates that CNNM can handle well the difficult cases where some rows and columns of an image are wholly missing. Table 2 shows some detailed evaluation results. As we can see, when the kernel size rises from 13×13 to 53×53, the PSNR produced by CNNM only slightly increases, from 27.20 to 27.58. However, since the computational complexity of CNNM is O(mk²), the running time grows rapidly as the kernel size enlarges. So, we would suggest setting k1 = k2 = 13 when running CNNM on natural images.

6.4 Video Completion and Prediction

We create a 50 × 50 × 62 video consisting of a sequence of 50 × 50 image patches taken from the CDnet 2014 database [59]; see Figure 10. We first consider a completion task of restoring the video from some randomly chosen entries. To show the advantages of the proposed methods, we include three OTC methods for comparison.

Fig. 11. Evaluation results (PSNR against the fraction of missing entries, %) of restoring the 50 × 50 × 62 video from randomly selected entries, comparing TaM, TNNM, MLRT, DFTℓ1 and CNNM. The numbers plotted are averaged over 20 random trials.

These baselines are Tensor as a Matrix (TaM) [13], Tensor Nuclear Norm Minimization (TNNM) [15] and Mixture of Low-Rank Tensors (MLRT) [17]. As shown in Figure 11, DFTℓ1 dramatically outperforms all the OTC methods, which ignore the spatio-temporal structures encoded in the video. But DFTℓ1 is in turn distinctly outperformed by CNNM (13 × 13 × 13); this confirms the benefits of controlling the kernel size.

We now consider a forecasting task of recovering the last 6 frames from the former 56 frames. Similar to the experimental setup of Section 6.2, the LSTM network used here also contains 4 layers; the numbers of input units and LSTM hidden units are 2500 and 500, respectively.


TABLE 3
Evaluation results (PSNR) of video prediction. The task is to forecast the last 6 frames given the former 56 frames. For CNNM, the first two quantities of the kernel size are fixed as k1 = k2 = 13.

Methods           1st     2nd     3rd     4th     5th     6th
TaM               5.22    5.49    5.75    5.96    6.15    6.23
TNNM              5.22    5.49    5.75    5.96    6.15    6.23
MLRT              5.22    5.49    5.75    5.96    6.15    6.23
LSTM              20.17   18.80   17.01   16.49   14.73   11.83
DFTℓ1             22.56   19.53   18.72   19.43   21.21   26.20
CNNM(k3 = 13)     23.48   20.64   20.02   20.80   23.07   28.18
CNNM(k3 = 31)     24.27   21.65   21.17   22.10   24.49   29.72
CNNM(k3 = 62)     24.57   22.01   21.56   22.56   24.98   30.32

[Panels show the ground truth (GT) and the frames predicted by LSTM, DFTℓ1, and CNNM with kernel sizes 13 × 13 × 13, 13 × 13 × 31 and 13 × 13 × 62, for the 1st through 6th predicted frames.]
Fig. 12. Visual results of video prediction. The goal of the task is to forecast the last 6 frames given the former 56 frames.

Since there is only one sequence, the learning problem for training the LSTM network may be severely ill-conditioned. Table 3 compares the performance of the various methods in terms of PSNR. It can be seen that the OTC methods (including TaM, TNNM and MLRT) produce very poor results; in fact, all of them predict the future data as zero. The PSNR values achieved by LSTM are also low; this is not strange, because there are no extra sequences for training the network. As shown in Figure 12, DFTℓ1 has a certain ability to forecast the future data, but the obtained images are full of artifacts. By contrast, the quality of the images predicted by CNNM is much higher.

7 CONCLUSION AND FUTURE WORK

In this work, we studied a significant problem called future data recovery, with the purpose of tackling the identifiability issue of the unseen future observations. We first reformulated the problem as a special completion task termed STC, which aims at restoring a set of missing entries selected, in an arbitrary fashion, from a sequential tensor containing rich spatio-temporal structures. We then showed that the STC problem, under certain conditions, can be well solved by a novel method termed CNNM. Namely, we proved that, whenever the target is fairly low-rank in some convolution domain, CNNM strictly succeeds in solving STC. Even if this prior expectation is not met exactly, our method still ensures recovery to some extent. Finally, experiments on realistic datasets illustrated that CNNM is a promising method for forecasting and completion.

While effective, as aforementioned, CNNM addresses only a certain case of the STC problem; namely, the target tensor should possess particular structures such that its convolution matrix is low-rank or approximately so. There are many other sequential tensors that cannot be handled by CNNM, for which the idea of graph convolution [60] might be useful. Moreover, with the help of learning-based optimization [54, 55], it is possible to generalize CNNM to some learnable deep architecture. We leave these as future work.

ACKNOWLEDGEMENT

This work is supported in part by the New Generation AI Major Project of the Ministry of Science and Technology of China under Grant 2018AAA0102501, and in part by the SenseTime Research Fund.

REFERENCES

[1] C. Chatfield, Time-Series Forecasting. New York, USA: Chapman and Hall/CRC, 2000.
[2] J. G. De Gooijer and R. J. Hyndman, "25 years of time series forecasting," International Journal of Forecasting, vol. 22, no. 3, pp. 443–473, 2006.
[3] N. I. Sapankevych and R. Sankar, "Time series prediction using support vector machines: A survey," IEEE Computational Intelligence Magazine, vol. 4, no. 2, pp. 24–38, 2009.
[4] G. P. Zhang, "Time series forecasting using a hybrid ARIMA and neural network model," Neurocomputing, vol. 50, pp. 159–175, 2003.
[5] M. Mathieu, C. Couprie, and Y. LeCun, "Deep multi-scale video prediction beyond mean square error," in International Conference on Learning Representations, 2016, pp. 1–14.
[6] J. Brownlee, Deep Learning for Time Series Forecasting. Ebook, 2018.
[7] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[8] X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-k. Wong, and W.-c. Woo, "Convolutional LSTM network: A machine learning approach for precipitation nowcasting," in Advances in Neural Information Processing Systems. Curran Associates, Inc., 2015, pp. 802–810.
[9] N. Srivastava, E. Mansimov, and R. Salakhutdinov, "Unsupervised learning of video representations using LSTMs," in International Conference on Machine Learning, 2015, pp. 843–852.
[10] Y. Wang, Z. Gao, M. Long, J. Wang, and P. S. Yu, "PredRNN++: Towards a resolution of the deep-in-time dilemma in spatiotemporal predictive learning," in International Conference on Machine Learning, 2018, pp. 5123–5132.
[11] S.-F. Wu, C.-Y. Chang, and S.-J. Lee, "Time series forecasting with missing values," in International Conference on Industrial Networks and Intelligent Systems, 2015, pp. 151–156.
[12] Z. Che, S. Purushotham, K. Cho, D. Sontag, and Y. Liu, "Recurrent neural networks for multivariate time series


with missing values," Scientific Reports, vol. 8, no. 1, pp. 85–98, 2018.

[13] E. Candes and B. Recht, "Exact matrix completion via convex optimization," Foundations of Computational Mathematics, vol. 9, no. 6, pp. 717–772, 2009.
[14] D. Kressner, M. Steinlechner, and B. Vandereycken, "Low-rank tensor completion by Riemannian optimization," BIT Numerical Mathematics, vol. 54, no. 2, pp. 447–468, 2014.
[15] J. Liu, P. Musialski, P. Wonka, and J. Ye, "Tensor completion for estimating missing values in visual data," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 208–220, 2013.
[16] M. Signoretto, R. V. de Plas, B. D. Moor, and J. A. Suykens, "Tensor versus matrix completion: A comparison with application to spectral data," IEEE Signal Processing Letters, vol. 18, no. 7, pp. 403–406, 2011.
[17] R. Tomioka, K. Hayashi, and H. Kashima, "Estimation of low-rank tensors via convex optimization," arXiv, vol. 1, no. 1, pp. 1–19, 2010.
[18] M. Yuan and C.-H. Zhang, "On tensor completion via nuclear norm minimization," Foundations of Computational Mathematics, vol. 16, pp. 1031–1068, 2016.
[19] P. Zhou, C. Lu, Z. Lin, and C. Zhang, "Tensor factorization for low-rank tensor completion," IEEE Transactions on Image Processing, vol. 27, no. 3, pp. 1152–1163, 2018.
[20] D. Gross, "Recovering low-rank matrices from few coefficients in any basis," IEEE Transactions on Information Theory, vol. 57, no. 3, pp. 1548–1566, 2011.
[21] H.-F. Yu, N. Rao, and I. S. Dhillon, "Temporal regularized matrix factorization for high-dimensional time series prediction," in Advances in Neural Information Processing Systems 28, 2016, pp. 847–855.
[22] C. Lu, J. Feng, Y. Chen, W. Liu, Z. Lin, and S. Yan, "Tensor robust principal component analysis with a new tensor nuclear norm," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 1, no. 1, pp. 1–14, 2019.
[23] D. Donoho, "Compressed sensing," IEEE Transactions on Information Theory, vol. 52, no. 4, pp. 1289–1306, 2006.
[24] H. Hassanieh, P. Indyk, D. Katabi, and E. Price, "Simple and practical algorithm for sparse Fourier transform," in Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, 2012, pp. 1183–1194.
[25] E. Candes, J. Romberg, and T. Tao, "Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information," IEEE Transactions on Information Theory, vol. 52, no. 2, pp. 489–509, 2006.
[26] E. Candes and M. Wakin, "An introduction to compressive sampling," IEEE Signal Processing Magazine, vol. 25, no. 2, pp. 21–30, 2008.
[27] D. Gabay and B. Mercier, "A dual algorithm for the solution of nonlinear variational problems via finite element approximation," Computers and Mathematics with Applications, vol. 2, no. 1, pp. 17–40, 1976.
[28] Z. Lin, M. Chen, and Y. Ma, "The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices," UIUC Technical Report UILU-ENG-09-2215, 2009.
[29] G. Liu, S. Chang, and Y. Ma, "Blind image deblurring using spectral properties of convolution operators," IEEE Transactions on Image Processing, vol. 23, no. 12, pp. 5047–5056, 2014.
[30] E. Candes and T. Tao, "Decoding by linear programming," IEEE Transactions on Information Theory, vol. 51, no. 12, pp. 4203–4215, 2005.
[31] M. Fazel, "Matrix rank minimization with applications," PhD thesis, 2002.
[32] B. Recht, M. Fazel, and P. Parrilo, "Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization," SIAM Review, vol. 52, no. 3, pp. 471–501, 2010.
[33] L. D. Lathauwer, B. D. Moor, and J. Vandewalle, "A multilinear singular value decomposition," SIAM Journal on Matrix Analysis and Applications, vol. 21, no. 4, pp. 1253–1278, 2000.
[34] M. F. Fahmy, G. M. A. Raheem, U. S. Mohamed, and O. F. Fahmy, "A new fast iterative blind deconvolution algorithm," Journal of Signal and Information Processing, vol. 3, no. 1, pp. 98–108, 2012.
[35] M. K. Ng, R. H. Chan, and W.-C. Tang, "A fast algorithm for deblurring models with Neumann boundary conditions," SIAM Journal on Scientific Computing, vol. 21, no. 3, pp. 851–866, 1999.
[36] Y. Wang, J. Yang, W. Yin, and Y. Zhang, "A new alternating minimization algorithm for total variation image reconstruction," SIAM Journal on Imaging Sciences, vol. 1, no. 3, pp. 248–272, 2008.
[37] M. Combescure, "Block-circulant matrices with circulant blocks, Weil sums, and mutually unbiased bases. II. The prime power case," Journal of Mathematical Physics, vol. 50, no. 3, pp. 1–14, 2009.
[38] J. Cai, E. Candes, and Z. Shen, "A singular value thresholding algorithm for matrix completion," SIAM Journal on Optimization, vol. 20, no. 4, pp. 1956–1982, 2010.
[39] C. Fang, F. Cheng, and Z. Lin, "Faster and non-ergodic O(1/K) stochastic alternating direction method of multipliers," in Advances in Neural Information Processing Systems, 2017, pp. 4476–4485.
[40] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma, "Robust recovery of subspace structures by low-rank representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 171–184, 2013.
[41] M. Frigo and S. G. Johnson, "FFTW: An adaptive software architecture for the FFT," in International Conference on Acoustics, Speech and Signal Processing, vol. 3, 1998.
[42] G. Liu, Q. Liu, and X.-T. Yuan, "A new theory for matrix completion," in Neural Information Processing Systems, 2017, pp. 785–794.
[43] G. Liu, Q. Liu, X.-T. Yuan, and M. Wang, "Matrix completion with deterministic sampling: Theories and methods," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 1, no. 1, pp. 1–18, 2019.
[44] X. Liang, X. Ren, Z. Zhang, and Y. Ma, "Repairing sparse low-rank texture," in European Conference on Computer Vision, 2012, pp. 482–495.
[45] G. Liu and P. Li, "Low-rank matrix completion in the presence of high coherence," IEEE Transactions on Signal Processing, vol. 64, no. 21, pp. 5623–5633, 2016.
[46] R. Ge, J. D. Lee, and T. Ma, "Matrix completion has no spurious local minimum," in Neural Information


Processing Systems, 2016, pp. 2973–2981.
[47] J. Yoon, J. Jordon, and M. van der Schaar, "GAIN: Missing data imputation using generative adversarial nets," in International Conference on Machine Learning, 2018, pp. 5689–5698.

[48] Y. Chen, "Incoherence-optimal matrix completion," IEEE Transactions on Information Theory, vol. 61, no. 5, pp. 2909–2923, 2015.
[49] Y. Ma, H. Derksen, W. Hong, and J. Wright, "Segmentation of multivariate mixed data via lossy data coding and compression," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 9, pp. 1546–1562, 2007.
[50] N. Hurley and S. Rickard, "Comparing measures of sparsity," IEEE Transactions on Information Theory, vol. 55, no. 10, pp. 4723–4741, 2009.
[51] A. Chandrakasan, V. Gutnik, and T. Xanthopoulos, "Data driven signal processing: An approach for energy efficient computing," in International Symposium on Low Power Electronics and Design, 1996, pp. 347–352.
[52] R. O'Donnell, "Some topics in analysis of Boolean functions," in Annual ACM Symposium on Theory of Computing, 2008, pp. 569–578.
[53] M. Lin and A. P. Vinod, "A low complexity high resolution cooperative spectrum-sensing scheme for cognitive radios," Circuits, Systems, and Signal Processing, vol. 31, no. 3, pp. 1127–1145, 2012.
[54] K. Gregor and Y. LeCun, "Learning fast approximations of sparse coding," in International Conference on Machine Learning, 2010, pp. 399–406.
[55] X. Xie, J. Wu, Z. Zhong, G. Liu, and Z. Lin, "Differentiable linearized ADMM," in International Conference on Machine Learning, vol. 97, 2019, pp. 6902–6911.
[56] I. Daubechies, M. Defrise, and C. D. Mol, "An iterative thresholding algorithm for linear inverse problems with a sparsity constraint," Communications on Pure and Applied Mathematics, vol. 57, no. 11, pp. 1413–1457, 2004.
[57] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, "Handwritten digit recognition with a back-propagation network," in Advances in Neural Information Processing Systems, 1990, pp. 396–404.
[58] R. T. Rockafellar, Convex Analysis. Princeton, NJ, USA: Princeton University Press, 1970.
[59] Y. Wang, P.-M. Jodoin, F. Porikli, J. Konrad, Y. Benezeth, and P. Ishwar, "CDnet 2014: An expanded change detection benchmark dataset," in IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2014, pp. 393–400.
[60] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, "Spectral networks and locally connected networks on graphs," in International Conference on Learning Representations, 2014, pp. 1–14.