KAMA-NNs: low-dimensional rotation based neural networks

Krzysztof Choromanski* Aldo Pacchiano* Jeffrey Pennington* Yunhao Tang*

Google Brain Robotics UC Berkeley Google Brain Columbia University

Abstract

We present new architectures for feedforward neural networks built from products of learned or random low-dimensional rotations that offer substantial space compression and computational speedups in comparison to unstructured baselines. Models using them are also competitive with the baselines and often, due to the imposed orthogonal structure, outperform them accuracy-wise. We propose to use our architectures in two settings. We show that in the non-adaptive scenario (random neural networks) they lead to asymptotically more accurate, space-efficient and faster estimators of the so-called PNG-kernels (for any activation function defining the PNG). This generalizes several recent theoretical results about orthogonal estimators (e.g. orthogonal JLTs, orthogonal estimators of angular kernels and more). In the adaptive setting we propose efficient algorithms for learning products of low-dimensional rotations and show how our architectures can be used to improve the space and time complexity of state of the art reinforcement learning (RL) algorithms (e.g. PPO, TRPO). Here they offer up to 7x compression of the network in comparison to the unstructured baselines and outperform, reward-wise, state of the art structured neural networks that offer similar computational gains and are based on low displacement rank matrices.

1 Introduction

Structured transforms play an important role in many machine learning algorithms. Several recently proposed scalable kernel methods using random feature maps [Rahimi and Recht, 2007] apply structured matrices to either reduce the time and space complexity of kernel estimators or improve their accuracy [Choromanska et al., 2016, Choromanski et al., 2018b, Bojarski et al., 2017, Choromanski et al., 2017, Choromanski et al., 2018a, Yu et al., 2016, Choromanski and Sindhwani, 2016, Vybíral, 2011, Zhang and Cheng, 2013]. Structured matrices are also applied in some of the fastest known cross-polytope LSH algorithms [Andoni et al., 2015] and in neural networks [Sindhwani et al., 2015, Choromanski et al., 2018a, Choromanski et al., 2018c]. In the latter setting they were used in particular to scale up architectures for mobile speech recognition [Sindhwani et al., 2015], predictive state recurrent neural networks [Choromanski et al., 2018a] and, more recently, to encode policy architectures for RL tasks [Choromanski et al., 2018c]. Compressed neural networks encoded by structured matrices enable practitioners to train RL policies with evolutionary strategy algorithms (recently becoming a serious alternative to state of the art policy gradient methods [Salimans et al., 2017, Mania et al., 2018]) on a single machine instead of clusters of thousands of machines. The time and space complexity reduction is obtained by applying structured matrices for which matrix-vector multiplication can be conducted in sub-quadratic time with the use of the Fast Fourier Transform (e.g. low displacement rank matrices from [Choromanski and Sindhwani, 2016, Sindhwani et al., 2015, Choromanski et al., 2018c]) or the Fast Walsh-Hadamard Transform (e.g. random Hadamard matrices from [Andoni et al., 2015, Choromanski et al., 2017]).

Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS) 2019, Naha, Okinawa, Japan. PMLR: Volume 89. Copyright 2019 by the author(s). * Equal contribution.

However, simultaneous time and space complexity reduction and accuracy improvements over unstructured baselines were recently proven theoretically only for random Hadamard matrices and only for the linear kernel (see the dimensionality reduction mechanisms in [Choromanski et al., 2017]). Furthermore, it is known that the obtained accuracy gains are due to orthogonality, and similar guarantees cannot be achieved for low displacement rank matrices. Other orthogonal transforms for which accuracy improvements were proven (angular kernel estimators and, asymptotically, estimators of certain classes of RBF-kernels) were built from random orthogonal matrices constructed via Gram-Schmidt orthogonalization [Yu et al., 2016]. These do not offer any compression of the number of parameters or computational gains.

We propose here a class of structured neural network architectures, called KAMA-NNs, whose matrices of connections can be decomposed as products of low-dimensional learned or random rotations. We show that the best features of all the aforementioned structured families are encapsulated in this class. First, it provides orthogonality, and we show that in the non-adaptive scenario (random neural networks) it consequently leads to asymptotically more accurate, space-efficient and faster estimators of the so-called PNG (Pointwise Nonlinear Gaussian) kernels (for any activation function defining the PNG) and RBF-kernels. This generalizes several recent theoretical results about orthogonal estimators (e.g. orthogonal JLTs, orthogonal estimators for angular and Gaussian kernels and more). Furthermore, it achieves time and space complexity gains over unstructured baselines, matching or outperforming low displacement rank matrices accuracy-wise. Finally, in the adaptive setting, where low-dimensional rotations are learned, it improves state of the art reinforcement learning algorithms (e.g. PPO, TRPO). In the RL setting our architectures offer up to 7x compression of the network in comparison to the unstructured baselines and outperform, reward-wise, state of the art structured neural networks that offer similar computational gains and are based on low displacement rank matrices [Choromanski et al., 2018c]. In the adaptive setting, we also give an interesting geometric interpretation of our algorithms. We explain how optimizing over a sequence of low dimensional rotations is akin to performing coordinate descent/ascent on the manifold of all rotation matrices. This sheds light on the effectiveness of the KAMA-NN mechanism also in the adaptive setting.

We highlight our main contributions below:

• In Section 2 we formally introduce KAMA-NN architectures and discuss their space and time complexity.

• In Section 3 we discuss the capacity of models based on products of low dimensional rotations, in both the adaptive and non-adaptive setting, and the connection to random matrix theory.

• In Section 4 we establish the connection between random neural networks and PNG kernels and show that random KAMA-NNs lead to asymptotically more accurate estimators of these kernels.

• In Section 5 we analyze the adaptive mechanism, where the low dimensional rotations defining KAMA-NN architectures are trained, and provide convergence results for optimizing certain classes of blackbox functions via KAMA-NNs.

• In Section 6 we give an exhaustive empirical evaluation of KAMA-NN architectures. In the non-adaptive setting (random KAMA-NNs) we perform an empirical study of the accuracy of PNG kernel estimators built with KAMA-NNs. In the adaptive setting we apply KAMA-NNs to encode RL policies and compare them to unstructured and other structured architectures for different policy gradient algorithms and on several RL tasks.

2 KAMA-NN architectures

KAMA-NNs are feedforward neural networks with matrices of connections constructed from products of 2-dimensional learned or random rotations called Givens rotations. For fixed I, J ∈ {0, 1, ..., d−1} (I ≠ J) and Θ ∈ [0, 2π) we define a Givens rotation G^Θ_{I,J} ∈ R^{d×d} as follows:

G^Θ_{I,J}[i, j] =
  1        if i = j and i ∉ {I, J}
  0        if i ≠ j and {i, j} ≠ {I, J}
  cos Θ    if i = j and i ∈ {I, J}
  sin Θ    if i = J and j = I
  −sin Θ   if i = I and j = J

A Givens random rotation is a Givens rotation where Θ ∼ Unif[0, 2π) and I, J are chosen uniformly at random.

Each matrix M ∈ R^{d1×d2} of connections of the KAMA-NN is obtained from a product of k Givens rotations with max(d1, d2) rows and columns each (where k may differ from layer to layer) by taking its first d1 rows (d2 columns) if d1 ≤ d2 (d1 > d2) and then renormalizing these min(d1, d2) rows (columns). The renormalization is conducted by multiplying these rows (columns) by fixed min(d1, d2) scalars s_1, ..., s_{min(d1,d2)}. Random KAMA-NNs apply matrices M built from Givens random rotations chosen independently and with scalars s_1, ..., s_{min(d1,d2)} drawn independently from a given one-dimensional probability distribution Φ. In general these, as well as the angles Θ and indices I, J of the Givens rotations, are learned. Notice that products of independent Givens random rotations are sometimes called Kac's random walk matrices, since they correspond to the Kac's random walk Markov chain [Kac, 1954]. KAMA stands for Kac's Asymptotic Matrix Approximators since, as we will see in Section 3, these constructions can be used to approximate many classes of matrices.
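To make this construction concrete, the following NumPy sketch (ours, not code released with the paper) builds a random KAMA connection matrix M ∈ R^{d1×d2} explicitly; the scalar distribution Φ is chosen here, purely for illustration, as the norm of a d-dimensional Gaussian vector (the choice used later in Section 4).

```python
import numpy as np

def givens_rotation(d, I, J, theta):
    """Dense d x d Givens rotation G^theta_{I,J} acting in the (e_I, e_J) plane."""
    G = np.eye(d)
    G[I, I] = G[J, J] = np.cos(theta)
    G[I, J], G[J, I] = -np.sin(theta), np.sin(theta)
    return G

def random_kama_matrix(d1, d2, k, seed=0):
    """Product of k Givens random rotations, truncated and renormalized (Section 2)."""
    rng = np.random.default_rng(seed)
    d = max(d1, d2)
    M = np.eye(d)
    for _ in range(k):
        I, J = rng.choice(d, size=2, replace=False)
        M = M @ givens_rotation(d, I, J, rng.uniform(0.0, 2.0 * np.pi))
    # Truncate: keep the first d1 rows if d1 <= d2, otherwise the first d2 columns.
    M = M[:d1, :] if d1 <= d2 else M[:, :d2]
    # Renormalize the kept rows (columns) by scalars s_i ~ Phi; here Phi is taken,
    # for illustration, to be the distribution of ||g||_2 for g ~ N(0, I_d).
    s = np.linalg.norm(rng.standard_normal((min(d1, d2), d)), axis=1)
    return s[:, None] * M if d1 <= d2 else M * s[None, :]

M = random_kama_matrix(d1=16, d2=64, k=int(5 * 64 * np.log(64)))
print(M.shape)  # (16, 64)
```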


Space and time complexity gains: Matrix-vector multiplication with matrices M = G_1 · ... · G_k ∈ R^{d×d}, where the G_i are Givens rotations, can be conducted in time O(k), since multiplication by each Givens rotation can be trivially done in time O(1) (it requires exactly four scalar multiplications and two additions). Furthermore, the total number of parameters needed to encode matrix M together with the renormalization parameters s_i (see the discussion above) equals 3k + d (two indices and one angle per Givens rotation and d renormalization scalars). We will see later (Section 3) that in practice k = O(d log(d)) Givens matrices suffice to encode architectures capable of learning good quality models. Thus KAMA-NNs provide faster inference than their unstructured counterparts and form a class of compact yet expressive neural network architectures.
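The O(k) claim follows because each Givens factor touches only two coordinates of its input. A minimal sketch of ours of the implicit matrix-vector product for the square (non-truncated) case, under our own parameterization by index pairs, angles and renormalization scalars:

```python
import numpy as np

def kama_matvec(x, pairs, thetas, scales):
    """y = diag(scales) . G_1 ... G_k . x, touching only two coordinates per factor.

    pairs:  list of k index pairs (I, J), one per Givens factor
    thetas: k rotation angles
    scales: d renormalization scalars s_1, ..., s_d
    """
    y = x.copy()
    for (I, J), theta in zip(pairs[::-1], thetas[::-1]):  # rightmost factor G_k acts first
        c, s = np.cos(theta), np.sin(theta)
        yI, yJ = y[I], y[J]
        y[I] = c * yI - s * yJ
        y[J] = s * yI + c * yJ
    return scales * y

# Sanity check against the dense product (square, non-truncated case); the dense
# matvec costs O(d^2), the implicit one O(k) rotations plus O(d) rescaling.
rng = np.random.default_rng(1)
d, k = 8, 40
pairs = [tuple(rng.choice(d, size=2, replace=False)) for _ in range(k)]
thetas = rng.uniform(0.0, 2.0 * np.pi, size=k)
scales = np.linalg.norm(rng.standard_normal((d, d)), axis=1)
x = rng.standard_normal(d)

M = np.eye(d)
for (I, J), th in zip(pairs, thetas):
    G = np.eye(d)
    G[I, I] = G[J, J] = np.cos(th)
    G[I, J], G[J, I] = -np.sin(th), np.sin(th)
    M = M @ G
assert np.allclose(kama_matvec(x, pairs, thetas, scales), scales * (M @ x))
```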

3 Capacity of KAMA-NNs

Products of low dimensional rotations have been the subject of voluminous research, partially because of their applications in physics [Kac, 1954, Janvresse, 2001, Mischler and Mouhot, 2013]. Kac's random walk, where transitions from the previous to the next d-dimensional state are defined by independent Givens random rotations, was introduced in [Kac, 1954]. It was shown that products of such rotations converge (in a certain sense) to a truly random rotation, yet are much more efficient to compute. For instance, it was recently proven that Kac's random walk on the d-sphere mixes in d log(d) steps [Pillai and Smith, 2015]. This result suggests that products of a relatively small number of low dimensional rotations may serve as a good proxy for truly random rotation matrices sampled from the distribution corresponding to the Haar measure, while providing solid theoretical guarantees at the same time (as opposed to other structured random matrices giving computational speedups, such as random Hadamard matrices, for which only vague theoretical guarantees have been given so far). This suggests straightforward applications in machine learning (for instance to produce fast random feature map based estimators of RBF or PNG kernels [Rahimi and Recht, 2007]), yet surprisingly, to the best of our knowledge, mechanisms based on Givens random rotations have so far been proposed only in the context of dimensionality reduction and Johnson-Lindenstrauss Transforms [Ailon and Chazelle, 2006].

Not much is known either about the adaptive setting, where Givens rotations are learned. In [Mathieu and LeCun, 2014] learned Givens rotations were applied to approximate Hessian matrices for certain optimization problems. It is believed though that products of a relatively small number of learned Givens rotations can accurately approximate matrices from many classes of rotations. In particular, we will focus on the following family.

Definition 1. Denote by GIV(d) the class of all Givens rotations in R^{d×d}. For a constant C > 0, let G_C be the family of matrices in R^{d×d} defined as G_C = {G_1 · ... · G_k : G_i ∈ GIV(d), i = 1, ..., k}, where k = ⌈C d log(d)/2⌉.

Even though those families are not dense in the set of all rotation matrices, as we will show later, in practice they can accurately approximate many rotation matrices in both the adaptive and non-adaptive setting. The renormalization scalars s_i defined in Section 2 can then "stretch" certain dimensions of the input vectors rotated by such matrices. This mechanism has sufficient capacity to provide estimators of all PNG kernels corresponding to random neural networks that are accurate and superior to unstructured baselines, as we will see next. We then show that it also suffices, in the adaptive setting, to learn good quality RL policies.

4 Random KAMA-NNs

Consider a random neural network with input and output layers of size d, nonlinearity f applied to neurons in the output layer and weights taken independently at random from the Gaussian distribution N(0, 1/√d). This is a standard choice for weight initialization in feedforward neural networks. We call it an unstructured random NN. Several recent results [Pennington and Worah, 2017], [Pennington et al., 2017], [Pennington et al., 2018] focus on understanding statistical properties of unstructured random NNs and their connection to random matrix theory. Recent work [Pennington et al., 2017, Xiao et al., 2018, Chen et al., 2018] also shows that orthogonal random initialization of neural networks leads to better learning profiles, even though not much is known about this phenomenon from the theoretical point of view. We shed light on it by showing that random KAMA-NNs, as well as previously analyzed random orthogonal constructions, lead to asymptotically (as d → ∞) more accurate estimators of the so-called PNG (Pointwise Nonlinear Gaussian) kernels.

Definition 2 (PNG-kernels). The PNG kernel (shortly: PNG) defined by the mapping f is a function K_f : R^d × R^d → R given as follows for x, y ∈ R^d:

K_f(x, y) = E_{g∼N(0,I_d)}[f(g⊤x) f(g⊤y)].     (1)

PNG kernels are important in the analysis of unstructured random NNs since such networks can be equivalently thought of as transformations that translate one of the most basic similarity measures between feature vectors, namely the linear (dot-product) kernel, into the PNG corresponding to the particular mapping f. Indeed, consider a linear kernel between activation vectors a(x) and a(y) corresponding to given input vectors x, y ∈ R^d. The following is true:

a(x)⊤a(y) = (1/d) Σ_{i=1}^{d} f(g_i⊤x) f(g_i⊤y),     (2)

where g_i is the i-th row of the connection matrix. Therefore unstructured random NNs become unbiased Monte Carlo (MC) estimators of the values of particular PNG kernels. We denote these unbiased estimators as URN_f(x, y). Notice that URNs choose the values of the weights independently from N(0, 1). The so-called orthogonal random NN is obtained by replacing the Gaussian matrix of connections G by its orthogonal variant G_ort. Matrix G_ort is obtained from G by conducting Gram-Schmidt orthogonalization and then using scalars s_1, ..., s_d to renormalize the rows of the obtained orthonormal matrix, where the s_i are sampled independently from the distribution of ‖g‖_2 for g ∼ N(0, I_d). We will denote the corresponding estimator as ORN_f(x, y). Finally, if instead we use random KAMA-NNs constructed from k blocks and with scalars s_i chosen in the same way as for orthogonal random NNs, then the corresponding estimator will be denoted as KRN^k_f(x, y).
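For illustration, here is a small NumPy sketch of ours of the two estimators: URN_f evaluates Eq. (2) with an unstructured Gaussian connection matrix, while KRN^k_f replaces it by a product of Givens random rotations whose rows are rescaled by s_i distributed as ‖g‖_2 for g ∼ N(0, I_d). The nonlinearity f and the inputs are arbitrary placeholders.

```python
import numpy as np

def png_kernel_mc(x, y, W, f):
    """Monte Carlo PNG-kernel estimate (1/d) sum_i f(g_i^T x) f(g_i^T y), cf. Eq. (2);
    the g_i are the rows of the connection matrix W."""
    return np.dot(f(W @ x), f(W @ y)) / W.shape[0]

rng = np.random.default_rng(0)
d = 128
x, y = rng.standard_normal(d), rng.standard_normal(d)
f = np.tanh  # any nonlinearity defining the PNG

# URN: unstructured Gaussian connection matrix with N(0, 1) entries.
W_urn = rng.standard_normal((d, d))

# KRN^k: product of k Givens random rotations; each factor only mixes two columns.
k = int(5 * d * np.log(d))
W_kama = np.eye(d)
for _ in range(k):
    I, J = rng.choice(d, size=2, replace=False)
    c, s = np.cos(rng.uniform(0.0, 2.0 * np.pi)), np.sin(rng.uniform(0.0, 2.0 * np.pi))
    colI, colJ = W_kama[:, I].copy(), W_kama[:, J].copy()
    W_kama[:, I] = c * colI + s * colJ
    W_kama[:, J] = -s * colI + c * colJ
# Rescale rows by s_i ~ ||g||_2 for g ~ N(0, I_d), as for orthogonal random NNs.
W_kama *= np.linalg.norm(rng.standard_normal((d, d)), axis=1)[:, None]

print(png_kernel_mc(x, y, W_urn, f), png_kernel_mc(x, y, W_kama, f))
```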

For ε > 0, we denote by B(ε) the ball centered at 0 and of radius ε. Our main theoretical results in this section are given below.

Theorem 1 (orthogonal random NNs for PNGs). Let f : R → R be a function and B ⊆ R^d be a bounded region (for instance the unit sphere). Then for every constant ε > 0 there exists a constant K(ε) > 0 such that for every x, y ∈ B \ B(ε) the following holds for d large enough:

MSE(ORN_f(x, y)) ≤ MSE(URN_f(x, y)) − K(ε)/d,     (3)

where MSE stands for the mean squared error.

If instead of orthogonal random NNs we use KAMA-NNs, then the following is true:

Theorem 2 (KAMA-NNs for PNGs). Let f : R → R be a function and B ⊆ R^d be a bounded region (for instance the unit sphere). Then for every constant ε > 0 there exist constants K(ε), L(ε) > 0 such that for k = L(ε) d log(d) and for every x, y ∈ B \ B(ε) the following holds for d large enough:

MSE(KRN^k_f(x, y)) ≤ MSE(URN_f(x, y)) − K(ε)/d.     (4)

The above theorems show that orthogonal random NNs as well as random KAMA-NNs provide asymptotically (as d → ∞) more accurate estimators of PNG kernels for any nonlinear function f. Previously these results were known only for orthogonal random NNs with sin/cos nonlinear mappings corresponding to RBF kernels [Choromanski et al., 2018b] and with f(x) = sgn(x) corresponding to angular PNG kernels [Choromanski et al., 2017]. Not only do random KAMA-NNs give accuracy gains, but they also lead to faster inference (O(d log(d)) versus O(d^2) time) and compression of the model (O(d log(d)) versus O(d^2) space) that orthogonal random NNs are not capable of. Our empirical results in Section 6 confirm all our theoretical findings and also show that the mean squared error guarantees translate to guarantees on more downstream tasks.

5 Learning KAMA-NNs

5.1 Givens rotations and manifold coordinate ascent

The space of d × d orthogonal matrices with positive determinant forms a connected manifold known as the special orthogonal group, denoted SO(d) [Gallier and Xu, 2003]. This set is a (d−1)d/2-dimensional manifold with the property that each point M ∈ SO(d) has an associated tangent space, a (d−1)d/2-dimensional vector space where the tangent directions to SO(d) live. The tangent space, denoted T_M SO(d), is defined as T_M SO(d) = {A ∈ R^{d×d} : A = MΩ, Ω = −Ω⊤}, i.e. the set of skew-symmetric matrices premultiplied by M.

We show that maximizing a function F : SO(d) → R by performing coordinate gradient ascent over the manifold SO(d) naturally yields a solution that equals a product of Givens rotations.

5.1.1 Ascent and descent directions in the manifold

When moving along a manifold, the right generalization of a straight line between two points is the notion of a geodesic curve. For any given point M ∈ SO(d) and direction MΩ ∈ T_M SO(d), there is a single geodesic that passes through M in direction MΩ. In the case of the special orthogonal group these curves can be written in the parametric form γ_Ω : R → SO(d), with γ_Ω(Θ) = M exp(ΘΩ). The expression exp(A) for A ∈ R^{d×d} denotes the matrix exponential of A. exp(·) maps any skew-symmetric matrix Ω to an orthogonal matrix. As a consequence, γ_Ω(Θ) = M exp(ΘΩ) is an orthogonal matrix for all Θ ∈ R provided M ∈ SO(d). Let {A_{I,J}}_{1≤I<J≤d} be the following basis of the space of skew-symmetric d × d matrices:

A_{I,J}[i, j] =
  1    if i = I and j = J
  −1   if i = J and j = I
  0    otherwise

The exponentials of scalar multiples of these basis elements equal Givens rotations: exp(−Θ A_{I,J}) = G^Θ_{I,J}. As a result, the geodesic passing through M in direction A_{I,J} equals γ_{I,J}(Θ) = M G^Θ_{J,I} [Gallier and Xu, 2003].
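This identity can be checked numerically; a quick sanity-check sketch of ours using SciPy's matrix exponential (scipy.linalg.expm is assumed to be available):

```python
import numpy as np
from scipy.linalg import expm

d, I, J, theta = 5, 1, 3, 0.7

# Basis element A_{I,J} of the skew-symmetric matrices.
A = np.zeros((d, d))
A[I, J], A[J, I] = 1.0, -1.0

# Givens rotation G^theta_{I,J} as defined in Section 2.
G = np.eye(d)
G[I, I] = G[J, J] = np.cos(theta)
G[I, J], G[J, I] = -np.sin(theta), np.sin(theta)

assert np.allclose(expm(-theta * A), G)   # exp(-theta A_{I,J}) = G^theta_{I,J}
```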

Let F : SO(d) → R be a differentiable function over the manifold of special orthogonal matrices. Similarly to the Euclidean space definition, the directional derivative along A_{I,J} evaluated at M ∈ SO(d) is the scalar

∇_{I,J}F(M) := d/dΘ F(γ_{I,J}(Θ))|_{Θ=0} = d/dΘ F(M G^Θ_{J,I})|_{Θ=0}.

The Riemannian gradient of F at M, denoted ∇F(M), is a matrix in T_M SO(d) of the form MΩ with Ω = Σ_{1≤I<J≤d} A_{I,J} ∇_{I,J}F(M). The Frobenius norm of ∇F(M) equals the Frobenius norm of Ω.

Algorithm 1 Givens coordinate gradient descent on SO(d)
Input: M_0 ∈ SO(d), F : SO(d) → R
for t = 1, 2, . . . do
  1. for all I, J such that 1 ≤ I < J ≤ d do:
  2.      Θ^{I,J}_t = argmin_Θ F(M_{t−1} G^Θ_{J,I})
  3.      B^{I,J}_t = F(M_{t−1} G^{Θ^{I,J}_t}_{J,I})
  4. Let (I_t, J_t) = argmin_{I,J} B^{I,J}_t
  5. M_t = M_{t−1} G^{Θ^{I_t,J_t}_t}_{J_t,I_t}

Assume that for all I, J and M ∈ SO(d), the function Φ_{I,J}(Θ) = F(M G^Θ_{I,J}) has second derivative bounded by a constant B. The following theorem holds, a derandomized version of Theorem 2 in [Shalit and Chechik, 2014]:

Theorem 3.

F(M_0) − F(M_t) ≥ Σ_{u=0}^{t−1} ‖∇F(M_u)‖²_F / (2 d (d−1) B).

As a consequence, if throughout the algorithm's run ‖∇F(M_u)‖²_F ≥ D for some constant D, then an objective gain of Ω(1/d²) is guaranteed at each step. Since SO(d) is compact and F continuous, F must be lower bounded by a finite value and therefore Algorithm 1 must converge to a stationary point M* with gradient ‖∇F(M*)‖²_F = 0.
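A compact Python sketch of Algorithm 1 (ours, not the authors' implementation): the exact one-dimensional argmin over Θ in step 2 is approximated here by a grid search, F can be any objective on SO(d), and the example objective and its parameters are purely illustrative.

```python
import numpy as np

def givens(d, I, J, theta):
    G = np.eye(d)
    G[I, I] = G[J, J] = np.cos(theta)
    G[I, J], G[J, I] = -np.sin(theta), np.sin(theta)
    return G

def givens_coordinate_descent(F, M0, steps=50, grid=np.linspace(0.0, 2.0 * np.pi, 64)):
    """Sketch of Algorithm 1: at every step pick the index pair (I, J) and angle theta
    whose update M <- M G^theta_{J,I} most decreases F; the exact 1-D argmin over theta
    is approximated by a grid search."""
    M = M0.copy()
    d = M.shape[0]
    for _ in range(steps):
        best_val, best_M = F(M), None
        for I in range(d):
            for J in range(I + 1, d):
                for theta in grid:
                    cand = M @ givens(d, J, I, theta)   # move along the (I, J) geodesic
                    val = F(cand)
                    if val < best_val:
                        best_val, best_M = val, cand
        if best_M is None:   # no coordinate direction improves F: (approximate) stationarity
            break
        M = best_M
    return M

# Illustrative objective: distance to a target rotation T in Frobenius norm.
rng = np.random.default_rng(0)
d = 4
T, _ = np.linalg.qr(rng.standard_normal((d, d)))
if np.linalg.det(T) < 0:        # make sure the target lies in SO(d)
    T[:, 0] = -T[:, 0]
M_star = givens_coordinate_descent(lambda M: np.linalg.norm(M - T), np.eye(d))
print(np.linalg.norm(M_star - T))
```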

Figure 1: Empirical MSE (mean squared error) for the pointwise evaluation of the angular kernel. The following estimators are compared: baseline using independent Gaussian vectors (URN), structured using random Hadamard matrices with renormalized rows (HORN), structured using k = 5d log(d) Givens random rotations (KAMA) and structured using matrices G_ort (ORN). Panels: (a) boston, (b) wine.

6 Experiments

6.1 The non-adaptive setting: random KAMA-NNs

Here we consider KAMA-NNs using products of random Givens rotations (Kac's random walk matrices) and show that they outperform unstructured PNG kernel estimators on the example of the angular kernel, matching the most accurate orthogonal estimators proposed recently, yet superior to them in terms of space and time complexity.

Pointwise kernel approximation: In the first set of experiments we computed the empirical mean squared error (MSE) of several random-matrix-based estimators of the angular kernel defined as K_ang(x, y) = 1 − 2θ_{x,y}/π, where θ_{x,y} stands for the angle between x and y. This kernel can be equivalently rewritten as K_ang(x, y) = E_{g∼N(0,I_d)}[f(g⊤x) f(g⊤y)], where f(x) = sgn(x). We compared the following estimators: KRN^k_f with k = 5d log(d) Givens random rotations, corresponding to random KAMA-NNs; the baseline URN using unstructured Gaussian matrices; the estimator ORN built on matrices G_ort; as well as estimators applying random Hadamard matrices (HORN) [Yu et al., 2016]. Experiments were conducted on the following datasets: boston and wine. Results are presented in Figure 1. We see that KAMA-NNs provide moderate accuracy gains over unstructured baselines. Next we show that these moderate gains translate to more substantial accuracy gains on more downstream tasks.

Approximating kernel matrices: Here we test the relative error of kernel matrix estimation via different angular kernel estimators based on random matrices, in particular those applying random KAMA-NNs.

For a given dataset X = {x_1, ..., x_N} and an angular kernel K_ang, denote by K(X) = [K_ang(x_i, x_j)]_{i,j∈{1,...,N}} the corresponding kernel matrix and by K̂(X) its approximate version obtained by using the values proposed by a given estimator. The relative error of the kernel matrix estimation is given as ε = ‖K̂(X) − K(X)‖_F / ‖K(X)‖_F, where ‖·‖_F stands for the Frobenius norm. We use the following datasets: g50, boston, cpu, insurance, wine and parkinson. Following [Choromanski et al., 2017], we plot the mean error obtained from r = 1000 repetitions of each mechanism. Kernel matrices are computed on a randomly selected subset of N = 550 datapoints from each dataset.
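A short NumPy sketch of ours of this error metric: the exact angular kernel matrix is computed from pairwise angles, and the approximation below uses sign random features built from an unstructured Gaussian matrix (URN); any of the structured matrices discussed above can be substituted for W. The synthetic dataset is only a stand-in for the real datasets.

```python
import numpy as np

def angular_kernel_matrix(X):
    """Exact K_ang(x_i, x_j) = 1 - 2 * theta_{ij} / pi from pairwise angles."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    cos = np.clip(Xn @ Xn.T, -1.0, 1.0)
    return 1.0 - 2.0 * np.arccos(cos) / np.pi

def approx_angular_kernel_matrix(X, W):
    """MC estimate with sign random features f(x) = sgn(x); rows of W play the role of g_i."""
    Z = np.sign(X @ W.T) / np.sqrt(W.shape[0])
    return Z @ Z.T

rng = np.random.default_rng(0)
X = rng.standard_normal((550, 20))                   # stand-in for an N = 550 subset of a dataset
W = rng.standard_normal((X.shape[1], X.shape[1]))    # URN; a HORN/ORN/KAMA matrix can replace it

K = angular_kernel_matrix(X)
K_hat = approx_angular_kernel_matrix(X, W)
eps = np.linalg.norm(K_hat - K, "fro") / np.linalg.norm(K, "fro")
print(eps)
```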

Results are presented in Figure 2. In all plots the different orthogonal mechanisms show similar performance (almost identical curves, substantially better than for the URN mechanism), while KAMA-NNs outperform the other orthogonal transforms speed-wise.

6.2 The adaptive setting: learning RL policies

We show that KAMA-NNs can substantially reduce the number of policy parameters in reinforcement learning (RL) benchmark tasks, while still providing good performance. We choose standard unstructured fully connected feedforward neural network policy architectures and structured neural network policies with Toeplitz matrices as baselines [Choromanski et al., 2018c]. With many more parameters, fully-connected policies can represent a much larger policy space, which facilitates easier optimization and leads to better performance in practice. On the other hand, Toeplitz policies greatly compress the parameter space, but at the cost of a significant degradation of policy performance. We show that KAMA-NN policies achieve a desirable middle ground between these two extremes: drastically reducing the number of parameters compared to fully-connected policies, while achieving better performance than Toeplitz policies and providing the same computational speed-ups.

Our fully-connected neural network policies consist of two hidden layers, each with h = 64 hidden units for the PPO algorithm and h = 32 hidden units for the TRPO algorithm (see below). Let x and y = σ(Wx + b) be the activations at the first and second hidden layer respectively, where W ∈ R^{h×h} is a weight matrix, b ∈ R^h is a bias vector and σ(·) is the non-linear activation function. To construct a compact policy using the KAMA mechanism, we replace the unstructured weight matrix W by a sequence of K Givens rotations. We do the same for the first and last matrix of connections, but this time also apply the truncation mechanism described in Section 2. When using the Toeplitz mechanism, we replace all three matrices of connections by Toeplitz matrices.

Figure 2: Normalized Frobenius norm error for the angular PNG kernel matrix approximation. The following estimators are compared: baseline using independent Gaussian vectors (URN), structured using random Hadamard matrices with renormalized rows (HORN), structured using k = 5d log(d) Givens random rotations (KAMA) and structured using matrices G_ort (ORN). Experiments are run on six datasets: (a) g50, (b) boston, (c) cpu, (d) insurance, (e) wine and (f) parkinson.

Learning low dimensional rotations: We now introduce the way we parameterize and learn rotations for the KAMA mechanism. Upon initialization, we randomly sample the 2D linear subspaces in which the rotations defined by the Givens matrices G_i are conducted, from the set of subspaces spanned by two vectors of the canonical basis e_1, ..., e_n. For each G_i only the angle Θ_i of the rotation is learned. Unstructured matrices are replaced by (truncated) matrices of the form G_1 G_2 ... G_K. All rotation angles Θ_i are learned by back-propagation. For each matrix we also learn the renormalization scalars s_i (see Section 2).

Remark 1. Note that in the above setting we do not need to explicitly store the structured matrices S = G_1 G_2 ... G_K. It suffices to keep Θ_1, ..., Θ_K to efficiently compute Sx for any input x.
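A hypothetical PyTorch-style sketch of ours of such a layer (the paper's implementation builds on OpenAI Baselines and is not shown). The class name KamaLinear and all of its details are our own illustrative choices: index pairs are fixed at initialization, only the K angles and the d renormalization scalars are stored and trained, and the rotations are applied implicitly without ever forming the dense matrix S. The Python loop over K rotations is written for clarity, not speed.

```python
import torch
import torch.nn as nn

class KamaLinear(nn.Module):
    """Hypothetical sketch of a square KAMA layer x -> diag(s) G_1 ... G_K x.
    Index pairs are fixed at initialization; only the K angles Theta_i and the d
    renormalization scalars s_i are stored and learned (cf. Remark 1)."""

    def __init__(self, d, K, seed=0):
        super().__init__()
        g = torch.Generator().manual_seed(seed)
        pairs = torch.stack([torch.randperm(d, generator=g)[:2] for _ in range(K)])
        self.register_buffer("pairs", pairs)                      # (K, 2) fixed 2D subspaces
        self.thetas = nn.Parameter(2.0 * torch.pi * torch.rand(K, generator=g))
        self.scales = nn.Parameter(torch.ones(d))

    def forward(self, x):                                         # x: (batch, d)
        # Apply the rightmost rotation first: G_1 (G_2 (... (G_K x))).
        for k in reversed(range(self.pairs.shape[0])):
            I, J = self.pairs[k]
            c, s = torch.cos(self.thetas[k]), torch.sin(self.thetas[k])
            xi, xj = x[:, I], x[:, J]
            new_cols = torch.stack((c * xi - s * xj, s * xi + c * xj), dim=1)
            x = x.index_copy(1, self.pairs[k], new_cols)          # out-of-place column update
        return x * self.scales

layer = KamaLinear(d=64, K=200)
y = torch.tanh(layer(torch.randn(32, 64)))   # e.g. a hidden-layer activation
y.sum().backward()                           # gradients flow to thetas and scales only
```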

Algorithms and Tasks: We test our policies with state of the art RL algorithms: Trust Region Policy Optimization (TRPO) [Schulman et al., 2015] and Proximal Policy Optimization (PPO) [Schulman et al., 2017]. Using these two algorithms, we compare different policy architectures. All implementations use OpenAI Baselines [Dhariwal et al., 2017]. The benchmark tests are based on MuJoCo locomotion tasks provided by OpenAI Gym [Brockman et al., 2016, Todorov et al., 2012] and Roboschool [Schulman et al., 2017]. We take the following environments: Double Pendulum, Inverted Pendulum, Swimmer, Hopper, HalfCheetah.

6.2.1 Proximal Policy Optimization (PPO)

In Figure 3, we show training results on MuJoCo benchmark tasks with Proximal Policy Optimization (PPO) [Schulman et al., 2017]. Here we use K = 200 Givens rotations to construct all three structured matrices in the policy. We train the policy on each task for a fixed number of time steps and record the cumulative rewards during training. We show the mean ± std performance across 5 random seeds. As seen in Figure 3, across most tasks KAMA-NN policies achieve better performance than Toeplitz policies.

6.2.2 Trust Region Policy Optimization (TRPO)

In Figure 4, we show training results on MuJoCo benchmarks with Trust Region Policy Optimization (TRPO) [Schulman et al., 2015]. Here we use K = 100 Givens rotations to construct all three structured matrices. As before, we train the policy on each task for a fixed number of time steps and record the cumulative rewards during training. We show the mean ± std performance across 5 random seeds. As seen in Figure 4, across most tasks KAMA-NN policies achieve significantly better performance than Toeplitz policies.

6.2.3 Parameter Compression

By replacing unstructured matrices in the fully connected architecture, structured policies can achieve significant compression of the number of parameters. In settings where unstructured models are large, structured models can offer much faster inference during training and require much less storage. In Table 1, we list the ratio of the total number of parameters used by structured policies relative to unstructured policies. The two structured policies that we compare (built from KAMA-NNs and Toeplitz networks) provide the same computational speed-ups for inference (a similar number of floating point multiplications).

Figure 3: Illustration of KAMA-NN policies on MuJoCo benchmarks with PPO. KAMA-NNs are compared with unstructured baselines and architectures based on low displacement rank matrices (Toeplitz). For each task we train the policy with PPO for a fixed number of steps and show the mean ± std performance. The vertical axis is the cumulative reward and the horizontal axis is the number of time steps. Panels: (a) DoublePendulum, (b) InvertedPendulum, (c) Swimmer, (d) Hopper, (e) HalfCheetah, (f) Walker.

On benchmark tasks, KAMA-NN based policies achieve 7x compression relative to the unstructured model. Though the Toeplitz policy reduces the number of parameters even further, the significant drop in performance observed in Figure 3 and Figure 4 is not desirable. We also show in the Appendix that we can further compress the KAMA-NNs by decreasing the number of Givens rotations in the first and third structured matrices, without affecting the learned policy.

6.2.4 Ablation Analysis

One advantage of KAMA-NN architectures over structured architectures based on low displacement rank matrices (such as Toeplitz) is that they easily allow adjusting the trade-off between the capacity of the model and its compactness by simply varying the number of rotations K. Intuitively, when K is large, the architecture becomes more expensive to use and the performance improves; when K is small, it becomes more compact at the cost of worse performance.

Figure 4: Illustration of KAMA-NN policies on MuJoCo benchmarks with TRPO, with the same setup as for the PPO experiments from Figure 3. Experiments marked (R) are taken from Roboschool. Panels: (a) Swimmer, (b) Walker, (c) Hopper, (d) HalfCheetah, (e) Hopper (R), (f) HalfCheetah (R).

We carry out an ablation study on the effect of the number of rotations K. In Figure 5, we show the training curves for varying K ∈ {10, 20, 50, 100, 200} with both PPO and TRPO on a set of benchmark tasks from OpenAI Gym. We see that the policy performance improves as the number of rotations K increases.

Table 1: Ratio of the total number of parameters used in the structured matrices relative to the unstructured model. KAMA-NN architectures apply K = 200 Givens rotations for PPO and K = 100 rotations for TRPO.

PPO       | HalfCheetah | Walker | Hopper
KAMA-NN   | 15%         | 15%    | 17%
Toeplitz  | 7%          | 7%     | 8%

TRPO      | HalfCheetah | Walker | Hopper
KAMA-NN   | 24%         | 24%    | 28%
Toeplitz  | 12%         | 12%    | 13%

Figure 5: Ablation study on the effect of the number of rotations K. The experiments are performed with both PPO and TRPO on a set of benchmark tasks from OpenAI Gym. The vertical axis is the cumulative reward and the horizontal axis is the number of time steps. Panels: (a) PPO-Hopper, (b) TRPO-Hopper, (c) PPO-HalfCheetah, (d) TRPO-HalfCheetah.

6.3 Learning low-dimensional subspaces for rotations

We also conducted experiments where not only the rotation angles Θ but also the 2-dimensional subspaces in which the rotations are conducted were learned. We did not observe any quality gains in comparison to the proposed algorithm (where the 2-dimensional subspaces are chosen randomly), showing empirically that models with random subspaces and learned angles have sufficient capacity.

7 Conclusions

We presented a new class of compact architectures for feedforward fully connected neural networks based on low dimensional learned or random rotations. We empirically showed their advantages over the state of the art in both the adaptive (where the parameters are learned) and the non-adaptive regime, on various tasks such as PNG-kernel approximation and learning RL policies. We further provided theoretical guarantees shedding new light on the effectiveness of (random) orthogonal compact transforms in machine learning. In particular, we showed that KAMA-NNs lead to asymptotically faster and more accurate estimators of PNG-kernels related to random neural networks. Our architectures provide practitioners with an easy way of adjusting the complexity (and thus also the capacity) of the neural network model to their needs by changing the number of low dimensional rotations used as building blocks of KAMA-NNs.


Acknowledgements. The authors would like to acknowledge the cloud credits provided by Amazon Web Services.

References

[Abadi et al., 2016] Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. (2016). TensorFlow: a system for large-scale machine learning. In OSDI, volume 16, pages 265–283.

[Ailon and Chazelle, 2006] Ailon, N. and Chazelle, B. (2006). Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. In STOC.

[Andoni et al., 2015] Andoni, A., Indyk, P., Laarhoven, T., Razenshteyn, I. P., and Schmidt, L. (2015). Practical and optimal LSH for angular distance. In NIPS.

[Bojarski et al., 2017] Bojarski, M., Choromanska, A., Choromanski, K., Fagan, F., Gouy-Pailler, C., Morvan, A., Sakr, N., Sarlos, T., and Atif, J. (2017). Structured adaptive and random spinners for fast machine learning computations. In AISTATS.

[Brockman et al., 2016] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. (2016). OpenAI Gym. arXiv preprint arXiv:1606.01540.

[Chen et al., 2018] Chen, M., Pennington, J., and Schoenholz, S. S. (2018). Dynamical isometry and a mean field theory of RNNs: Gating enables signal propagation in recurrent neural networks. arXiv preprint arXiv:1806.05394.

[Choromanska et al., 2016] Choromanska, A., Choromanski, K., Bojarski, M., Jebara, T., Kumar, S., and LeCun, Y. (2016). Binary embeddings with structured hashed projections. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 344–353.

[Choromanski et al., 2018a] Choromanski, K., Downey, C., Boots, B., Holtmann-Rice, D., and Kumar, S. (2018a). Initialization matters: Orthogonal predictive state recurrent neural networks. In ICLR 2018.

[Choromanski et al., 2018b] Choromanski, K., Rowland, M., Sarlos, T., Sindhwani, V., Turner, R., and Weller, A. (2018b). The geometry of random features. In AISTATS 2018.

[Choromanski et al., 2018c] Choromanski, K., Rowland, M., Sindhwani, V., Turner, R. E., and Weller, A. (2018c). Structured evolution with compact architectures for scalable policy optimization. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 969–977.

[Choromanski and Sindhwani, 2016] Choromanski, K. and Sindhwani, V. (2016). Recycling randomness with structure for sublinear time kernel expansions. In ICML.

[Choromanski et al., 2017] Choromanski, K. M., Rowland, M., and Weller, A. (2017). The unreasonable effectiveness of structured random orthogonal embeddings. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 218–227.

[Dhariwal et al., 2017] Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman, J., Sidor, S., Wu, Y., and Zhokhov, P. (2017). OpenAI Baselines. https://github.com/openai/baselines.

[Gallier and Xu, 2003] Gallier, J. and Xu, D. (2003). Computing exponentials of skew-symmetric matrices and logarithms of orthogonal matrices. International Journal of Robotics and Automation, 18(1):10–20.

[Janvresse, 2001] Janvresse, E. (2001). Spectral gap for Kac's model of the Boltzmann equation. The Annals of Probability, 29(1).

[Kac, 1954] Kac, M. (1954). Foundations of kinetic theory. Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, 3.

[Mania et al., 2018] Mania, H., Guy, A., and Recht, B. (2018). Simple random search provides a competitive approach to reinforcement learning. arXiv preprint arXiv:1803.07055.

[Mathieu and LeCun, 2014] Mathieu, M. and LeCun, Y. (2014). Fast approximation of rotations and Hessian matrices. CoRR, abs/1404.7195.

[Mischler and Mouhot, 2013] Mischler, S. and Mouhot, C. (2013). Kac's program in kinetic theory. Inventiones mathematicae, 193(1).

[Patrascu and Necoara, 2015] Patrascu, A. and Necoara, I. (2015). Efficient random coordinate descent algorithms for large-scale structured nonconvex optimization. Journal of Global Optimization, 61(1):19–46.

[Pennington et al., 2017] Pennington, J., Schoenholz, S., and Ganguli, S. (2017). Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. In Advances in Neural Information Processing Systems, pages 4785–4795.

[Pennington et al., 2018] Pennington, J., Schoenholz, S. S., and Ganguli, S. (2018). The emergence of spectral universality in deep networks. In International Conference on Artificial Intelligence and Statistics, AISTATS 2018, 9-11 April 2018, Playa Blanca, Lanzarote, Canary Islands, Spain, pages 1924–1932.

[Pennington and Worah, 2017] Pennington, J. and Worah, P. (2017). Nonlinear random matrix theory for deep learning. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 2634–2643.

[Pillai and Smith, 2015] Pillai, N. and Smith, A. (2015). Kac's walk on n-sphere mixes in n log(n) steps. arXiv preprint.

[Rahimi and Recht, 2007] Rahimi, A. and Recht, B. (2007). Random features for large-scale kernel machines. In NIPS.

[Salimans et al., 2017] Salimans, T., Ho, J., Chen, X., and Sutskever, I. (2017). Evolution strategies as a scalable alternative to reinforcement learning. CoRR, abs/1703.03864.

[Schulman et al., 2015] Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015). Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897.

[Schulman et al., 2017] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

[Shalit and Chechik, 2014] Shalit, U. and Chechik, G. (2014). Coordinate-descent for learning orthogonal matrices through Givens rotations. In International Conference on Machine Learning, pages 548–556.

[Sindhwani et al., 2015] Sindhwani, V., Sainath, T. N., and Kumar, S. (2015). Structured transforms for small-footprint deep learning. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 3088–3096.

[Todorov et al., 2012] Todorov, E., Erez, T., and Tassa, Y. (2012). MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 5026–5033. IEEE.

[Vybíral, 2011] Vybíral, J. (2011). A variant of the Johnson-Lindenstrauss lemma for circulant matrices. Journal of Functional Analysis, 260(4):1096–1105.

[Xiao et al., 2018] Xiao, L., Bahri, Y., Sohl-Dickstein, J., Schoenholz, S. S., and Pennington, J. (2018). Dynamical isometry and a mean field theory of CNNs: How to train 10,000-layer vanilla convolutional neural networks. arXiv preprint arXiv:1806.05393.

[Yu et al., 2016] Yu, F., Suresh, A., Choromanski, K., Holtmann-Rice, D., and Kumar, S. (2016). Orthogonal random features. In NIPS, pages 1975–1983.

[Zhang and Cheng, 2013] Zhang, H. and Cheng, L. (2013). New bounds for circulant Johnson-Lindenstrauss embeddings. CoRR, abs/1308.6339.