

Supplementary Materials: Controllable Orthogonalization in Training DNNs

Lei Huang1 Li Liu1 Fan Zhu1 Diwen Wan1,2 Zehuan Yuan3 Bo Li4 Ling Shao1

1 Inception Institute of Artificial Intelligence (IIAI), Abu Dhabi, UAE; 2 University of Electronic Science and Technology of China, Chengdu, China; 3 ByteDance AI Lab, Beijing, China; 4 University of Illinois at Urbana-Champaign, Illinois, USA

Algorithm I  Orthogonalization by Newton's Iteration
1: Input: proxy parameters Z ∈ R^{n×d} and iteration number T.
2: Bounding Z's singular values: V = Z / ‖Z‖_F.
3: Calculate the covariance matrix: S = VV^T.
4: B_0 = I.
5: for t = 1 to T do
6:     B_t = (3/2) B_{t−1} − (1/2) B_{t−1}^3 S.
7: end for
8: W = B_T V.
9: Output: orthogonalized weight matrix W ∈ R^{n×d}.
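For concreteness, the following is a minimal PyTorch sketch of the forward pass in Algorithm I; the function name `oni_orthogonalize` is ours, and the paper's released implementation (wrapped around convolutional layers, with centering and a learnable scale) may differ in details.

```python
# A minimal PyTorch sketch of Algorithm I (forward pass only).
import torch

def oni_orthogonalize(Z: torch.Tensor, T: int = 5) -> torch.Tensor:
    """Orthogonalize an n x d proxy matrix Z by Newton's iteration (Algorithm I)."""
    # Line 2: bound Z's singular values.
    V = Z / Z.norm(p='fro')
    # Line 3: covariance matrix S = V V^T (n x n).
    S = V @ V.t()
    # Line 4: B_0 = I.
    B = torch.eye(Z.shape[0], dtype=Z.dtype, device=Z.device)
    # Lines 5-7: Newton's iteration B_t = 1.5 * B_{t-1} - 0.5 * B_{t-1}^3 S.
    for _ in range(T):
        B = 1.5 * B - 0.5 * (B @ B @ B) @ S
    # Line 8: W = B_T V.
    return B @ V

if __name__ == "__main__":
    Z = torch.randn(64, 256)                        # n <= d, so W W^T -> I is attainable
    W = oni_orthogonalize(Z, T=5)
    print((W @ W.t() - torch.eye(64)).norm())       # small for a sufficient iteration number
```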

A. Derivation of Back-Propagation

Given the layer-wise orthogonal weight matrix W, we can perform the forward pass to calculate the loss of the deep neural network (DNN). It is necessary to back-propagate through the orthogonalization transformation, because we aim to update the proxy parameters Z. For illustration, we first describe the proposed orthogonalization by Newton's iteration (ONI) in Algorithm I. Given the gradient with respect to the orthogonalized weight matrix ∂L/∂W, our goal is to compute ∂L/∂Z. The back-propagation is based on the chain rule. From Line 2 in Algorithm I, we have:

$$\begin{aligned}
\frac{\partial L}{\partial Z} &= \frac{1}{\|Z\|_F}\frac{\partial L}{\partial V} + \mathrm{tr}\Big(\frac{\partial L}{\partial V}^{T}Z\Big)\,\frac{\partial \frac{1}{\|Z\|_F}}{\partial \|Z\|_F}\,\frac{\partial \|Z\|_F}{\partial Z} \\
&= \frac{1}{\|Z\|_F}\frac{\partial L}{\partial V} - \mathrm{tr}\Big(\frac{\partial L}{\partial V}^{T}Z\Big)\frac{1}{\|Z\|_F^{2}}\frac{Z}{\|Z\|_F} \\
&= \frac{1}{\|Z\|_F}\Big(\frac{\partial L}{\partial V} - \frac{\mathrm{tr}(\frac{\partial L}{\partial V}^{T}Z)}{\|Z\|_F^{2}}Z\Big), \qquad (1)
\end{aligned}$$

where tr(·) indicates the trace of the corresponding matrix, and ∂L/∂V can be calculated from Lines 3 and 8 in Algorithm I:

$$\frac{\partial L}{\partial V} = (B_T)^{T}\frac{\partial L}{\partial W} + \Big(\frac{\partial L}{\partial S} + \frac{\partial L}{\partial S}^{T}\Big)V. \qquad (2)$$

We thus need to calculate ∂L/∂S, which can be computed from Lines 5, 6 and 7 in Algorithm I:

$$\frac{\partial L}{\partial S} = -\frac{1}{2}\sum_{t=1}^{T}(B_{t-1}^{3})^{T}\frac{\partial L}{\partial B_{t}}, \qquad (3)$$

where ∂L/∂B_T = (∂L/∂W)V^T, and {∂L/∂B_{t−1}, t = T, ..., 1} can be iteratively calculated from Line 6 in Algorithm I as follows:

$$\frac{\partial L}{\partial B_{t-1}} = -\frac{1}{2}\Big(\frac{\partial L}{\partial B_{t}}(B_{t-1}^{2}S)^{T} + (B_{t-1}^{2})^{T}\frac{\partial L}{\partial B_{t}}S^{T} + B_{t-1}^{T}\frac{\partial L}{\partial B_{t}}(B_{t-1}S)^{T}\Big) + \frac{3}{2}\frac{\partial L}{\partial B_{t}}. \qquad (4)$$

In summary, the back-propagation of Algorithm I is shown in Algorithm II.

Algorithm II  Back-propagation of ONI
1: Input: ∂L/∂W ∈ R^{n×d} and variables from the respective forward pass: Z, V, S, {B_t}_{t=1}^{T}.
2: ∂L/∂B_T = (∂L/∂W)V^T.
3: for t = T down to 1 do
4:     ∂L/∂B_{t−1} = −(1/2)((∂L/∂B_t)(B_{t−1}^2 S)^T + (B_{t−1}^2)^T(∂L/∂B_t)S^T + B_{t−1}^T(∂L/∂B_t)(B_{t−1}S)^T) + (3/2)(∂L/∂B_t).
5: end for
6: ∂L/∂S = −(1/2) Σ_{t=1}^{T} (B_{t−1}^3)^T (∂L/∂B_t).
7: ∂L/∂V = (B_T)^T(∂L/∂W) + (∂L/∂S + (∂L/∂S)^T)V.
8: ∂L/∂Z = (1/‖Z‖_F)(∂L/∂V − (tr((∂L/∂V)^T Z)/‖Z‖_F^2) Z).
9: Output: ∂L/∂Z ∈ R^{n×d}.
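As a sanity check, the manual gradients of Algorithm II can be compared against PyTorch autograd applied to the forward pass of Algorithm I. The sketch below implements both and verifies that they agree; the function names are ours.

```python
# A sketch of Algorithm II in PyTorch, checked against autograd.
import torch

def oni_forward_with_cache(Z, T=5):
    V = Z / Z.norm(p='fro')
    S = V @ V.t()
    Bs = [torch.eye(Z.shape[0], dtype=Z.dtype)]           # B_0 = I
    for _ in range(T):
        B = Bs[-1]
        Bs.append(1.5 * B - 0.5 * (B @ B @ B) @ S)
    W = Bs[-1] @ V
    return W, (V, S, Bs)

def oni_backward_manual(dW, Z, cache):
    V, S, Bs = cache
    T = len(Bs) - 1
    dB = dW @ V.t()                                        # Line 2: dL/dB_T
    dS = torch.zeros_like(S)
    for t in range(T, 0, -1):                              # Lines 3-6
        Bp = Bs[t - 1]
        dS -= 0.5 * (Bp @ Bp @ Bp).t() @ dB                # accumulate dL/dS
        dB = -0.5 * (dB @ (Bp @ Bp @ S).t()
                     + (Bp @ Bp).t() @ dB @ S.t()
                     + Bp.t() @ dB @ (Bp @ S).t()) + 1.5 * dB
    dV = Bs[-1].t() @ dW + (dS + dS.t()) @ V               # Line 7
    r = Z.norm(p='fro')
    return (dV - torch.sum(dV * Z) / r**2 * Z) / r         # Line 8

if __name__ == "__main__":
    torch.manual_seed(0)
    Z = torch.randn(8, 16, dtype=torch.double, requires_grad=True)
    W, cache = oni_forward_with_cache(Z, T=5)
    dW = torch.randn_like(W)
    dZ_auto, = torch.autograd.grad(W, Z, grad_outputs=dW)
    with torch.no_grad():
        dZ_manual = oni_backward_manual(dW, Z.detach(), cache)
    print(torch.allclose(dZ_auto, dZ_manual, atol=1e-8))   # True
```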

We further derive the back-propagation of the accelerated ONI method with the centering and more compact spectral bounding operation, as described in Section 3.3 of the paper. For illustration, Algorithm III describes the forward pass of the accelerated ONI.

Algorithm III  ONI with acceleration
1: Input: proxy parameters Z ∈ R^{n×d} and iteration number N.
2: Centering: Z_c = Z − (1/d) Z 1 1^T.
3: Bounding Z's singular values: V = Z_c / √(‖Z_c Z_c^T‖_F).
4: Execute Steps 3 to 8 in Algorithm I.
5: Output: orthogonalized weight matrix W ∈ R^{n×d}.

Following the calculation in Algorithm II, we can obtain ∂L/∂V. To simplify the derivation, we represent Line 3 of Algorithm III with the following formulations:

$$M = Z_c Z_c^{T}, \qquad (5)$$
$$\delta = \sqrt{\|M\|_F}, \qquad (6)$$
$$V = \frac{Z_c}{\delta}. \qquad (7)$$

It’s easy to calculate ∂L∂Zc

from Eqn.5 and Eqn.7 as fol-lows:

∂L∂Zc

=1

δ

∂L∂V

+ (∂L∂M

+∂L∂M

T

)Zc, (8)

where ∂L/∂M can be computed based on Eqn. 6 and Eqn. 7:

$$\begin{aligned}
\frac{\partial L}{\partial M} &= \frac{\partial L}{\partial \delta}\,\frac{\partial \delta}{\partial \|M\|_F}\,\frac{\partial \|M\|_F}{\partial M} \\
&= \mathrm{tr}\Big(\frac{\partial L}{\partial V}^{T}Z_c\Big)\Big(-\frac{1}{\delta^{2}}\Big)\frac{1}{2\sqrt{\|M\|_F}}\frac{M}{\|M\|_F} \\
&= -\frac{\mathrm{tr}(\frac{\partial L}{\partial V}^{T}Z_c)}{2\delta^{5}}M. \qquad (9)
\end{aligned}$$

Based on Line 2 in Algorithm III, we can obtain ∂L/∂Z as follows:

$$\frac{\partial L}{\partial Z} = \frac{\partial L}{\partial Z_c} - \frac{1}{d}\frac{\partial L}{\partial Z_c}\mathbf{1}\mathbf{1}^{T}. \qquad (10)$$

In summary, Algorithm IV describes the back-propagation of Algorithm III.

Algorithm IV  Back-propagation of ONI with acceleration
1: Input: ∂L/∂W ∈ R^{n×d} and variables from the respective forward pass: Z_c, V, S, {B_t}_{t=1}^{T}.
2: Calculate ∂L/∂V from Line 2 to Line 7 in Algorithm II.
3: Calculate M and δ from Eqn. 5 and Eqn. 6.
4: Calculate ∂L/∂M based on Eqn. 9.
5: Calculate ∂L/∂Z_c based on Eqn. 8.
6: Calculate ∂L/∂Z based on Eqn. 10.
7: Output: ∂L/∂Z ∈ R^{n×d}.
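A minimal sketch of the forward pass of Algorithm III (centering plus the more compact spectral bounding) is given below, assuming a 2-D proxy matrix; the function name `oni_accelerated` is ours, and the backward pass can be left to autograd or implemented as in Algorithm IV.

```python
# A sketch of Algorithm III: accelerated ONI with centering and the
# more compact spectral bounding.
import torch

def oni_accelerated(Z: torch.Tensor, N: int = 5) -> torch.Tensor:
    n, d = Z.shape
    # Line 2: centering, Z_c = Z - (1/d) Z 1 1^T (subtract each row's mean).
    Zc = Z - Z.mean(dim=1, keepdim=True)
    # Line 3: compact spectral bounding V = Z_c / sqrt(||Z_c Z_c^T||_F).
    V = Zc / (Zc @ Zc.t()).norm(p='fro').sqrt()
    # Line 4: Steps 3-8 of Algorithm I.
    S = V @ V.t()
    B = torch.eye(n, dtype=Z.dtype, device=Z.device)
    for _ in range(N):
        B = 1.5 * B - 0.5 * (B @ B @ B) @ S
    return B @ V
```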

B. Proof of Convergence Condition for Newton's Iteration

In Section 3 of the paper, we show that bounding the spectrum of the proxy parameter matrix by

$$V = \phi_N(Z) = \frac{Z}{\|Z\|_F} \qquad (11)$$

and

$$V = \phi_N(Z) = \frac{Z}{\sqrt{\|ZZ^{T}\|_F}} \qquad (12)$$

satisfies the convergence condition of Newton's iterations:

$$\|I - S\|_2 < 1, \qquad (13)$$


where S = VV^T and the singular values of Z are nonzero. Here we prove this conclusion, and we also prove that $\|Z\|_F > \sqrt{\|ZZ^T\|_F}$.

Proof. By definition, ‖Z‖_F can be represented as $\|Z\|_F = \sqrt{\mathrm{tr}(ZZ^T)}$. Given Eqn. 11, we calculate

$$S = VV^{T} = \frac{ZZ^{T}}{\mathrm{tr}(ZZ^{T})}. \qquad (14)$$

Let’s denote M = ZZT and the eigenvalues of M are{λ1, ...,λn}. We have λi > 0, since M is a real symmetricmatrix and the singular values of Z are nonzero. We alsohave S = M

tr(M) and the eigenvalues of S are λi�ni=1 λi

. Fur-

thermore, the eigenvalues of (I− S) are 1− λi�ni=1 λi

, thussatisfying the convergence condition described by Eqn.13.

Similarly, given $V = \phi_N(Z) = Z/\sqrt{\|ZZ^T\|_F}$, we have $S = ZZ^T/\|ZZ^T\|_F = M/\|M\|_F$, and its corresponding eigenvalues are $\lambda_i/\sqrt{\sum_{j=1}^{n}\lambda_j^2}$. Therefore, the singular values of (I − S) are $1 - \lambda_i/\sqrt{\sum_{j=1}^{n}\lambda_j^2}$, also satisfying the convergence condition described by Eqn. 13.

We have $\|Z\|_F = \sqrt{\mathrm{tr}(M)} = \sqrt{\sum_{i=1}^{n}\lambda_i}$ and $\sqrt{\|ZZ^T\|_F} = \sqrt{\|M\|_F} = \sqrt[4]{\sum_{i=1}^{n}\lambda_i^2}$. It is easy to demonstrate that $\|Z\|_F > \sqrt{\|ZZ^T\|_F}$, since $(\sum_{i=1}^{n}\lambda_i)^2 > \sum_{i=1}^{n}\lambda_i^2$.
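The convergence condition and the norm inequality above can also be checked numerically; the following sketch does so for a random proxy matrix (the matrix size is an arbitrary choice of ours).

```python
# A quick numerical check of ||I - S||_2 < 1 for both spectral-bounding
# choices (Eqns. 11 and 12), and of the inequality ||Z||_F > sqrt(||Z Z^T||_F).
import torch

torch.manual_seed(0)
Z = torch.randn(32, 64)

V1 = Z / Z.norm(p='fro')                       # bounding by ||Z||_F (Eqn. 11)
S1 = V1 @ V1.t()
V2 = Z / (Z @ Z.t()).norm(p='fro').sqrt()      # bounding by sqrt(||Z Z^T||_F) (Eqn. 12)
S2 = V2 @ V2.t()

I = torch.eye(32)
print(torch.linalg.matrix_norm(I - S1, ord=2) < 1)          # tensor(True)
print(torch.linalg.matrix_norm(I - S2, ord=2) < 1)          # tensor(True)
print(Z.norm(p='fro') > (Z @ Z.t()).norm(p='fro').sqrt())   # tensor(True)
```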

In Section 3 of the paper, we show that Newton's iteration with the spectral bounding of Eqn. 11 is equivalent to the Newton's iteration proposed in [11]. Here, we provide the details. In [11], the covariance matrix M = ZZ^T is bounded by its trace, as M/tr(M). It is clear that the S used in Algorithm I is equal to M/tr(M), based on Eqn. 14 shown in the previous proof.

C. Orthogonality for Group-Based Methods

In Section 3.4 of the paper, we argue that group-based methods cannot ensure that the whole matrix W ∈ R^{n×d} is either row or column orthogonal when n > d. Here we provide more details.


configurations                        cudnn   cudnn + ONI-T1   cudnn + ONI-T3   cudnn + ONI-T5   cudnn + ONI-T7
Fh = Fw = 3, n = d = 256,  m = 256    118.6       122.1            122.9            124.4            125.7
Fh = Fw = 3, n = d = 256,  m = 32      15.8        18.3             18.9             19.5             20.8
Fh = Fw = 3, n = d = 1024, m = 32      71.1        81.7             84.3             89.5             94.2
Fh = Fw = 1, n = d = 256,  m = 256     28.7        31.5             32.1             33.7             34.6
Fh = Fw = 1, n = d = 256,  m = 32      10.1        13.0             13.6             14.2             15.3
Fh = Fw = 1, n = d = 1024, m = 32      22.2        27.6             29.7             32.9             37.0

Table A. Comparison of wall-clock time (ms). We fix the input size to h = w = 32. We evaluate the total wall-clock time of training for each iteration (forward pass + back-propagation pass). Note that 'cudnn + ONI-T5' indicates the 'cudnn' convolution wrapped in our ONI method, using an iteration number of 5.

              δRow     δColumn
ONI-Full      5.66     0
OLM-G32       8.00     5.66
OLM-G16       9.85     8.07
OLM-G8       10.58     8.94

Table B. Evaluation of row and column orthogonality for the group-based methods. The entries of the proxy matrix Z ∈ R^{64×32} are sampled from the Gaussian distribution N(0, 1). We evaluate the row orthogonality δRow = ‖WW^T − I‖_F and the column orthogonality δColumn = ‖W^TW − I‖_F. 'OLM-G32' indicates the eigen-decomposition-based orthogonalization method described in [10], with a group size of 32.

We follow the experiment described in Figure 3 of the paper, where we sample the entries of the proxy matrix Z ∈ R^{64×32} from the Gaussian distribution N(0, 1). We apply the eigen-decomposition-based orthogonalization method [10] with group size G to obtain the orthogonalized matrix W, varying G ∈ {32, 16, 8}. We evaluate the corresponding row orthogonality δRow = ‖WW^T − I‖_F and column orthogonality δColumn = ‖W^TW − I‖_F. The results are shown in Table B. We observe that the group-based orthogonalization method cannot ensure that the whole matrix W is either row or column orthogonal, while our ONI ensures column orthogonality. We also observe that the orthogonality of the group-based method degenerates with decreasing group size.

We also conduct an experiment with n = d, where we sample the entries of the proxy matrix Z ∈ R^{64×64} from the Gaussian distribution N(0, 1). We vary the group size G ∈ {64, 32, 16, 8}; note that G = 64 corresponds to full orthogonalization. Figure I shows the distribution of the eigenvalues of WW^T. We again observe that the group-based method cannot ensure that the whole weight matrix is row orthogonal. Furthermore, orthogonalization with a smaller group size tends to be worse.

Figure I. The distribution of the eigenvalues of WW^T with different group sizes G. The entries of the proxy matrix Z ∈ R^{64×64} are sampled from the Gaussian distribution N(0, 1).
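The following sketch illustrates the Table B measurement. We use per-group symmetric orthogonalization W_g = (Z_g Z_g^T)^{-1/2} Z_g as a stand-in for the eigen-decomposition-based method of [10], so the exact numbers depend on that choice; the full-matrix case can be checked analogously with the ONI sketch given after Algorithm I.

```python
# A sketch of the row/column orthogonality check behind Table B.
import torch

def group_orthogonalize(Z: torch.Tensor, G: int) -> torch.Tensor:
    rows = []
    for g in torch.split(Z, G, dim=0):                    # groups of G rows
        evals, evecs = torch.linalg.eigh(g @ g.t())       # eigen-decomposition of Z_g Z_g^T
        inv_sqrt = evecs @ torch.diag(evals.rsqrt()) @ evecs.t()
        rows.append(inv_sqrt @ g)                         # (Z_g Z_g^T)^{-1/2} Z_g
    return torch.cat(rows, dim=0)

torch.manual_seed(0)
Z = torch.randn(64, 32)
for G in (32, 16, 8):
    W = group_orthogonalize(Z, G)
    d_row = (W @ W.t() - torch.eye(64)).norm()            # delta_row
    d_col = (W.t() @ W - torch.eye(32)).norm()            # delta_column
    print(f"G={G}: delta_row={d_row:.2f}, delta_col={d_col:.2f}")
```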

D. Comparison of Wall-Clock Times

In Section 3.6 of the paper, we show that, given a convolutional layer with filters W ∈ R^{n×d×Fh×Fw} and m mini-batch examples {x_i ∈ R^{d×h×w}}_{i=1}^{m}, the relative computational cost of ONI over the convolutional layer is $\frac{2n}{mhw} + \frac{3Nn^2}{mdhwF_hF_w}$. In this section, we compare the wall-clock time of the convolution wrapped with our ONI against the standard convolution. In this experiment, our ONI is implemented in Torch [4] and we wrap it around the 'cudnn' convolution [3]. The experiments are run on a TITAN Xp.

clock time between the convolution wrapping with our ONIand the standard convolution. In this experiment, our ONIis implemented based on Torch [4] and we wrap it to the‘cudnn’ convolution [3]. The experiments are run on a TI-TAN Xp.

We fix the input to size h = w = 32, and vary the kernelsize (Fh and Fw), the feature dimensions (n and d) and thebatch size m. Table A shows the wall clock time underdifferent configurations. We compare the standard ‘cudnn’convolution (denoted as ‘cudnn’) and the ‘cudnn’ wrappedwith our ONI (denoted as ‘cudnn + ONI’).

We observe that our method introduces negligible computational cost when using a 3 × 3 convolution, a feature dimension of n = d = 256 and a batch size of m = 256. Our method may degenerate in efficiency with a smaller kernel size, larger feature dimension and smaller batch size, as suggested by the computational complexity analysis. However, even under the worst configuration (Fh = Fw = 1, n = d = 1024 and m = 32), our method with an iteration number of 5 ('cudnn + ONI-T5') only costs 1.48× the time of the standard convolution ('cudnn').
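The wrapping itself can be sketched as follows: the 4-D filter is reshaped to an n × (d·Fh·Fw) matrix, orthogonalized by Newton's iteration, reshaped back and passed to the cudnn convolution. `ONIConv2d` below is our own simplified module (no centering and no learnable scale), not the paper's released code.

```python
# A minimal sketch of a convolution wrapped with ONI.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ONIConv2d(nn.Module):
    def __init__(self, d, n, k, T=5, stride=1, padding=0):
        super().__init__()
        self.Z = nn.Parameter(torch.randn(n, d, k, k) * 0.05)   # proxy parameters
        self.T, self.stride, self.padding = T, stride, padding

    def forward(self, x):
        n = self.Z.shape[0]
        Zm = self.Z.reshape(n, -1)                      # n x (d * Fh * Fw)
        V = Zm / Zm.norm(p='fro')
        S = V @ V.t()
        B = torch.eye(n, dtype=Zm.dtype, device=Zm.device)
        for _ in range(self.T):
            B = 1.5 * B - 0.5 * (B @ B @ B) @ S
        W = (B @ V).reshape_as(self.Z)                  # orthogonalized filters
        return F.conv2d(x, W, stride=self.stride, padding=self.padding)

# Example: a 3x3 layer with n = d = 256, as in Table A.
layer = ONIConv2d(d=256, n=256, k=3, T=5, padding=1)
y = layer(torch.randn(8, 256, 32, 32))
```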


E. Proof of Theorems

Here we prove the two theorems described in Sections 3.5 and 3.6 of the paper.

Theorem 1. Let h = Wx, where WW^T = I and W ∈ R^{n×d}. Assume: (1) E_x(x) = 0 and cov(x) = σ_1^2 I; and (2) E_{∂L/∂h}(∂L/∂h) = 0 and cov(∂L/∂h) = σ_2^2 I. If n = d, we have the following properties: (1) ‖h‖ = ‖x‖; (2) E_h(h) = 0, cov(h) = σ_1^2 I; (3) ‖∂L/∂x‖ = ‖∂L/∂h‖; (4) E_{∂L/∂x}(∂L/∂x) = 0, cov(∂L/∂x) = σ_2^2 I. In particular, if n < d, properties (2) and (3) hold; if n > d, properties (1) and (4) hold.

Proof. Based on n = d and WW^T = I, W is a square orthogonal matrix, and we thus have W^TW = I. Besides, we have ∂L/∂x = (∂L/∂h)W (we follow the common setup where vectors are column vectors while their derivatives are row vectors).

(1) We have

$$\|h\|^{2} = h^{T}h = x^{T}W^{T}Wx = x^{T}x = \|x\|^{2}. \qquad (15)$$

We thus get ‖h‖ = ‖x‖.

(2) It is easy to calculate:

$$E_h(h) = E_x(Wx) = W E_x(x) = 0. \qquad (16)$$

The covariance of h is given by

$$\begin{aligned}
\mathrm{cov}(h) &= E_h\big((h - E_h(h))(h - E_h(h))^{T}\big) \\
&= E_x\big(W(x - E_x(x))\,(W(x - E_x(x)))^{T}\big) \\
&= W E_x\big((x - E_x(x))(x - E_x(x))^{T}\big)W^{T} \\
&= W\,\mathrm{cov}(x)\,W^{T} = W\sigma_1^{2}IW^{T} = \sigma_1^{2}WW^{T} = \sigma_1^{2}I. \qquad (17)
\end{aligned}$$

(3) Similar to the proof of (1),

$$\Big\|\frac{\partial L}{\partial x}\Big\|^{2} = \frac{\partial L}{\partial x}\frac{\partial L}{\partial x}^{T} = \frac{\partial L}{\partial h}WW^{T}\frac{\partial L}{\partial h}^{T} = \frac{\partial L}{\partial h}\frac{\partial L}{\partial h}^{T} = \Big\|\frac{\partial L}{\partial h}\Big\|^{2}. \qquad (18)$$

We thus have ‖∂L/∂x‖ = ‖∂L/∂h‖.

(4) Similar to the proof of (2), we have

$$E_{\frac{\partial L}{\partial x}}\Big(\frac{\partial L}{\partial x}\Big) = E_{\frac{\partial L}{\partial h}}\Big(\frac{\partial L}{\partial h}W\Big) = E_{\frac{\partial L}{\partial h}}\Big(\frac{\partial L}{\partial h}\Big)W = 0. \qquad (19)$$


The covariance of ∂L/∂x is given by

$$\begin{aligned}
\mathrm{cov}\Big(\frac{\partial L}{\partial x}\Big) &= E_{\frac{\partial L}{\partial x}}\Big(\big(\tfrac{\partial L}{\partial x} - E(\tfrac{\partial L}{\partial x})\big)^{T}\big(\tfrac{\partial L}{\partial x} - E(\tfrac{\partial L}{\partial x})\big)\Big) \\
&= E_{\frac{\partial L}{\partial h}}\Big(\big((\tfrac{\partial L}{\partial h} - E(\tfrac{\partial L}{\partial h}))W\big)^{T}\big(\tfrac{\partial L}{\partial h} - E(\tfrac{\partial L}{\partial h})\big)W\Big) \\
&= W^{T}E_{\frac{\partial L}{\partial h}}\Big(\big(\tfrac{\partial L}{\partial h} - E(\tfrac{\partial L}{\partial h})\big)^{T}\big(\tfrac{\partial L}{\partial h} - E(\tfrac{\partial L}{\partial h})\big)\Big)W \\
&= W^{T}\mathrm{cov}\Big(\frac{\partial L}{\partial h}\Big)W = W^{T}\sigma_2^{2}IW = \sigma_2^{2}W^{T}W = \sigma_2^{2}I. \qquad (20)
\end{aligned}$$

Besides, if n < d, it is easy to show that properties (2) and (3) hold; if n > d, properties (1) and (4) hold.
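The square case (n = d) of Theorem 1 can be verified empirically with a random orthogonal matrix; the sample size and tolerance below are arbitrary choices of ours.

```python
# A Monte-Carlo check of Theorem 1 for the square case n = d (sigma_1 = 1).
import torch

torch.manual_seed(0)
n = d = 64
W, _ = torch.linalg.qr(torch.randn(n, d))        # random orthogonal W, W W^T = I

x = torch.randn(100000, d)                       # E(x) = 0, cov(x) = I
h = x @ W.t()                                    # h = W x for each sample

print(torch.allclose(h.norm(dim=1), x.norm(dim=1), atol=1e-4))   # property (1)
print(h.mean(dim=0).abs().max())                                  # ~0, property (2)
print((h.t() @ h / len(h) - torch.eye(n)).abs().max())            # cov(h) ~ I
```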

Theorem 2. Let h = max(0, Wx), where WW^T = σ^2 I and W ∈ R^{n×d}. Assume x follows a normal distribution with E_x(x) = 0 and cov(x) = I. Denote the Jacobian matrix as J = ∂h/∂x. If σ^2 = 2, we have E_x(JJ^T) = I.

Proof. We use A_{i:} and A_{:j} to denote the i-th row and the j-th column of A, respectively. Based on WW^T = σ^2 I, we obtain W_{i:}(W_{j:})^T = 0 for i ≠ j and W_{i:}(W_{i:})^T = σ^2 otherwise. Let ĥ = Wx denote the pre-activation. This yields

$$J_{i:} = \frac{\partial h_i}{\partial \hat{h}_i}\frac{\partial \hat{h}_i}{\partial x} = \frac{\partial h_i}{\partial \hat{h}_i}W_{i:}. \qquad (21)$$

Denote M = JJ^T. From Eqn. 21, this yields

$$M_{ij} = J_{i:}(J_{j:})^{T} = \frac{\partial h_i}{\partial \hat{h}_i}W_{i:}(W_{j:})^{T}\frac{\partial h_j}{\partial \hat{h}_j} = \frac{\partial h_i}{\partial \hat{h}_i}\frac{\partial h_j}{\partial \hat{h}_j}\big(W_{i:}(W_{j:})^{T}\big). \qquad (22)$$

If i ≠ j, we obtain M_{ij} = 0. For i = j, we have

$$M_{ii} = J_{i:}(J_{i:})^{T} = \Big(\frac{\partial h_i}{\partial \hat{h}_i}\Big)^{2}\sigma^{2} = \mathbb{1}(\hat{h}_i > 0)\,\sigma^{2}, \qquad (23)$$

where 1(ĥ_i > 0) equals 1 for ĥ_i > 0 and 0 otherwise. Since x follows a normal distribution with E_x(x) = 0 and cov(x) = I, ĥ = Wx also follows a zero-mean normal distribution (following the argument of Theorem 1), so that P(ĥ_i > 0) = 1/2. We thus obtain

$$E_x(M_{ii}) = E_{\hat{h}_i}\big[\mathbb{1}(\hat{h}_i > 0)\big]\sigma^{2} = \frac{1}{2}\sigma^{2}. \qquad (24)$$

Therefore, E_x(JJ^T) = E_x(M) = (σ^2/2) I = I.
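Theorem 2 can likewise be checked by Monte-Carlo estimation of E_x(JJ^T); the dimensions and sample size below are our choices.

```python
# A Monte-Carlo check of Theorem 2: with W W^T = 2I and h = max(0, W x),
# the expected Jacobian Gram matrix E_x[J J^T] is close to I.
import torch

torch.manual_seed(0)
n, d, num_samples = 32, 64, 200000
Q, _ = torch.linalg.qr(torch.randn(d, n))      # d x n with orthonormal columns
W = (2.0 ** 0.5) * Q.t()                       # n x d, so W W^T = 2 I

x = torch.randn(num_samples, d)
pre = x @ W.t()                                # pre-activation \hat{h} = W x
mask = (pre > 0).float()                       # d h_i / d \hat{h}_i = 1(\hat{h}_i > 0)
# (J J^T)_{ij} = 1_i 1_j (W W^T)_{ij}; averaging the masks estimates E_x[J J^T].
E_JJt = ((mask.t() @ mask) / num_samples) * (W @ W.t())
print((E_JJt - torch.eye(n)).abs().max())      # ~0
```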


Figure II. The magnitude of the activations and gradients for each layer of a 20-layer MLP. (a) The mean absolute value of the activation, σ_x, for each layer; (b) the mean absolute value of the gradient, σ_{∂L/∂h}, for each layer.

F. Details and More Experimental Results on Discriminative Classification

F.1. MLPs on Fashion-MNIST

Details of Experimental Setup  Fashion-MNIST consists of 60k training and 10k test images. Each image has a size of 28 × 28 and is associated with a label from one of 10 classes. We use MLPs of varying depth, with 256 neurons in each layer, and ReLU [17] as the nonlinearity. The weights in each layer are initialized by random initialization [13], and we use an iteration number of 5 for ONI, unless otherwise stated. We employ stochastic gradient descent (SGD) with a batch size of 256, and the learning rate is selected from {0.05, 0.1, 0.5, 1} based on the validation set (5,000 samples from the training set).

F.1.1 Vanished Activations and Gradients

In Section 4.1.1 of the paper, we observe that deeper neural networks with orthogonal weight matrices are difficult to train unless the weights are scaled by a factor of √2. We argue that the main reason is that the activations and gradients vanish exponentially. Here we provide the details.

We evaluate the mean absolute value of the activation, $\sigma_x = \frac{1}{md}\sum_{i=1}^{m}\sum_{j=1}^{d}|x_{ij}|$, for the layer-wise input x ∈ R^{m×d}, and the mean absolute value of the gradient, $\sigma_{\frac{\partial L}{\partial h}} = \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}\big|\frac{\partial L}{\partial h}_{ij}\big|$, for the layer-wise gradient ∂L/∂h ∈ R^{m×n}. Figure II shows the results on the 20-layer MLP. 'ONI-NS-Init' ('ONI-NS-End') indicates the ONI method without the scaling factor of √2 at initialization (at the end of training). We observe that 'ONI-NS' suffers from vanishing activations and gradients during training, whereas our 'ONI' with a scaling factor of √2 has no vanishing gradients.
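The measurement behind Figure II can be mimicked in a few lines: propagate a random batch through a deep ReLU MLP whose weights are orthogonal, with and without the √2 scaling, and record the mean absolute activation per layer. The depth, width and use of torch.nn.init.orthogonal_ are our choices for this sketch.

```python
# Mean absolute activation per layer for orthogonal weights, with and
# without the sqrt(2) scaling, in a deep ReLU MLP.
import torch
import torch.nn as nn

def mean_abs_per_layer(scale, depth=20, width=256, batch=256):
    torch.manual_seed(0)
    x = torch.randn(batch, width)
    stats = []
    for _ in range(depth):
        W = torch.empty(width, width)
        nn.init.orthogonal_(W)                  # W W^T = I
        x = torch.relu(x @ (scale * W).t())
        stats.append(x.abs().mean().item())
    return stats

print(mean_abs_per_layer(1.0)[-1])              # shrinks toward 0 with depth
print(mean_abs_per_layer(2.0 ** 0.5)[-1])       # stays roughly constant
```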

F.1.2 Effects of Groups

We further explore the group-based orthogonalization method [10] combined with ONI. We vary the group size G in {16, 32, 64, 128, 256} and show the results in Figure III. We observe that our ONI achieves slightly better performance with increasing group size. The main reason is that group-based orthogonalization cannot ensure that the whole weight matrix is orthogonal.

Figure III. Effects of the group size G in the proposed 'ONI'. We evaluate the (a) training and (b) test errors on a 10-layer MLP.

F.2. CNNs on CIFAR-10

We use the official training set of 50,000 images and the standard test set of 10,000 images. The data preprocessing and data augmentation follow the commonly used mean-and-standard-deviation normalization and random flipping and translation, as described in [8].

F.2.1 VGG-Style Networks

Details of Network Architectures  The network starts with a convolutional layer of 32k filters, where k is the varying width based on different configurations. We then sequentially stack three blocks, each of which has g convolutional layers; the corresponding convolutional layers have 32k, 64k and 128k filters, and feature map sizes of 32 × 32, 16 × 16 and 8 × 8, respectively. We use the first convolution in each block with a stride of 2 to carry out spatial sub-sampling of the feature maps. The network ends with global average pooling, followed by a linear transformation. We vary the depth with g ∈ {2, 3, 4} and the width with k ∈ {1, 2, 3}.

Experimental Setup  We use SGD with a momentum of 0.9 and a batch size of 128. The best initial learning rate is chosen from {0.01, 0.02, 0.05} over the validation set of 5,000 samples from the training set, and we divide the learning rate by 5 at 80 and 120 epochs, ending the training at 160 epochs. For 'OrthReg', we report the best results using a regularization coefficient λ in {0.0001, 0.0005}. For 'OLM', we use a group size of G = 64 and full orthogonalization, and report the best result.

Training Performance  In Section 4.1.2 of the paper, we mention that 'ONI' and 'OLM-√2' converge faster than the other baselines, in terms of training epochs. Figure IV shows the training curves under different configurations (depth and width). It is clear that 'ONI' and 'OLM-√2' converge faster than the other baselines under all network configurations, in terms of training epochs. The results support our conclusion that maintaining orthogonality can benefit optimization.


Figure IV. Comparison of training errors on VGG-style networks for CIFAR-10 image classification. From (a) to (i), we vary the depth 3g + 2 and the width 32k, with g ∈ {2, 3, 4} and k ∈ {1, 2, 3}.

F.2.2 Residual Network without Batch Normalization

Here we provide the details of the experimental setup and training performance for the experiments on a 110-layer residual network [8] without batch normalization (BN) [12], described in Section 4.1.2 of the paper.

Experimental Setup  We run the experiments on one GPU. We apply SGD with a batch size of 128, a momentum of 0.9 and a weight decay of 0.0001. We set the initial learning rate to 0.1 by default, divide it by 10 at 80 and 120 epochs, and terminate training at 160 epochs. For Xavier Init [5, 1], we search the initial learning rate over {0.1, 0.01, 0.001} and report the best result. For group normalization (GN) [23], we search the group size over {64, 32, 16} and report the best result. For our ONI, we use the data-dependent initialization method of [21] to initialize the learnable scale parameters.

For the small-batch-size experiments, we adapt the initial learning rate to the batch size following the linear learning-rate scaling rule [7].

Training Performance  Figure V (a) and (b) show the training and test curves, respectively. We observe that 'ONI' converges significantly faster than 'BN' and 'GN', in terms of training epochs.

Figure V. Training performance comparison on a 110-layer residual network without batch normalization on the CIFAR-10 dataset. 'w/BN' indicates with BN. We show the (a) training error and (b) test error with respect to the training epochs.

F.3. Details of Experimental Setup on ImageNet

ImageNet-2012 consists of 1.28M images from 1,000 classes [19]. We use the official 1.28M labeled images for training and evaluate the top-1 and top-5 classification errors on the validation set of 50k images.


We keep almost all the experimental settings the same as the publicly available PyTorch implementation [18]: we apply SGD with a momentum of 0.9 and a weight decay of 0.0001; we train for 100 epochs in total and set the initial learning rate to 0.1, lowering it by a factor of 10 at epochs 30, 60 and 90. For 'WN' and 'ONI', we do not use weight decay on the learnable scalar parameters.

VGG Network  We run the experiments on one GPU, with a batch size of 128. Apart from our 'ONI', all other methods ('plain', 'WN', 'OrthInit' and 'OrthReg') have difficulty training with a large learning rate of 0.1. We thus run these experiments with initial learning rates of {0.01, 0.05} and report the best result.

Residual Network  We run the experiments on one GPU for the 18- and 50-layer residual networks, and on two GPUs for the 101-layer residual network, with a batch size of 256. Considering that 'ONI' can improve the optimization efficiency, as shown in the ablation study on ResNet-18, we run the 50- and 101-layer residual networks with a weight decay of {0.0001, 0.0002} and report the best result from these two configurations for each method, for a fairer comparison.

F.4. Ablation Study on the Iteration Number

We provide the details of the training performance of ONI on Fashion-MNIST in Figure VI. We vary T over the range 1 to 7, and show the training (Figure VI (a)) and test (Figure VI (b)) errors with respect to the training epochs. We also provide the distribution of the singular values of the orthogonalized weight matrix W, using our ONI with different iteration numbers T. Figure VI (c) shows the results for the first layer, at the 200th iteration. We obtain similar observations for the other layers.

G. Details on Training GANs

For completeness, we describe the main concepts used in the paper as follows.

Inception Score (IS)  The Inception score (IS; the higher the better) was introduced by Salimans et al. [20]:

$$I_D = \exp\big(E_D[\mathrm{KL}(p(y|x)\,\|\,p(y))]\big), \qquad (25)$$

where KL(·‖·) denotes the Kullback-Leibler divergence, p(y) is approximated by $\frac{1}{N}\sum_{i=1}^{N}p(y|x_i)$, and p(y|x_i) is given by the trained Inception model [22]. Salimans et al. [20] reported that this score is highly correlated with subjective human judgment of image quality. Following [20] and [16], we calculate the score for 5,000 randomly generated examples from each trained generator to evaluate the IS. We repeat this 10 times and report the average and standard deviation of the IS.
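For reference, the IS computation of Eqn. 25 can be sketched as follows, assuming the class probabilities p(y|x_i) from a pretrained Inception model are already available (obtaining them is omitted here).

```python
# A sketch of the Inception Score (Eqn. 25) from predicted class probabilities.
import torch

def inception_score(probs: torch.Tensor) -> float:
    """probs: (N, num_classes) rows p(y|x_i), each summing to 1."""
    p_y = probs.mean(dim=0, keepdim=True)                   # marginal p(y)
    kl = (probs * (probs.log() - p_y.log())).sum(dim=1)     # KL(p(y|x_i) || p(y))
    return kl.mean().exp().item()

# Example with random (softmaxed) predictions for 5000 samples.
probs = torch.softmax(torch.randn(5000, 1000), dim=1)
print(inception_score(probs))
```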

Figure VI. Effects of the iteration number T in the proposed 'ONI' on a 10-layer MLP. We show (a) the training errors; (b) the test errors; and (c) the distribution of the singular values of the orthogonalized weight matrix W from the first layer, at the 200th iteration.

Fréchet Inception Distance (FID)  The Fréchet Inception distance (FID) [9] (the lower the better) is another measure of the quality of generated examples; it uses second-order information from the final layer of the Inception model applied to the examples. The Fréchet distance itself is the 2-Wasserstein distance between two distributions p_1 and p_2, assuming they are both multivariate Gaussian:

$$F(p_1, p_2) = \|\mu_{p_1} - \mu_{p_2}\|_2^{2} + \mathrm{tr}\big(C_{p_1} + C_{p_2} - 2(C_{p_1}C_{p_2})^{\frac{1}{2}}\big), \qquad (26)$$

where {μ_{p_1}, C_{p_1}} and {μ_{p_2}, C_{p_2}} are the means and covariances of samples from p_1 and p_2, respectively, and tr(·) indicates the trace operation. We calculate the FID between the 10K test examples (true distribution) and 5K randomly generated samples (generated distribution).
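Similarly, a sketch of the FID computation of Eqn. 26 from sample moments is given below; extracting the Inception features is omitted, and scipy is used for the matrix square root.

```python
# A sketch of FID (Eqn. 26) from feature means and covariances.
import numpy as np
from scipy import linalg

def fid(mu1, cov1, mu2, cov2):
    covmean, _ = linalg.sqrtm(cov1 @ cov2, disp=False)    # (C1 C2)^{1/2}
    covmean = covmean.real                                # drop tiny imaginary parts
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(cov1 + cov2 - 2 * covmean))

# Example with random features standing in for Inception activations.
rng = np.random.default_rng(0)
f1, f2 = rng.normal(size=(5000, 64)), rng.normal(size=(10000, 64))
print(fid(f1.mean(0), np.cov(f1, rowvar=False), f2.mean(0), np.cov(f2, rowvar=False)))
```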

GAN with Non-Saturating Loss  The standard objective for adversarial training is:

$$L(G, D) = E_{x\sim q(x)}[\log D(x)] + E_{z\sim p(z)}[\log(1 - D(G(z)))], \qquad (27)$$

where q(x) is the distribution of the real data, z ∈ R^{d_z} is a latent variable, p(z) is the standard normal distribution N(0, I), and G is a deterministic generator function. d_z is set to 128 for all experiments. Based on the suggestions in [6, 16], we use the alternate (non-saturating) cost −E_{z∼p(z)}[log D(G(z))] to update G, while using the original cost defined in Eqn. 27 to update D.

GAN with Hinge Loss  The hinge loss for adversarial learning is:

$$L_D(G, D) = E_{x\sim q(x)}[\max(0, 1 - D(x))] + E_{z\sim p(z)}[\max(0, 1 + D(G(z)))], \qquad (28)$$

$$L_G(G, D) = -E_{z\sim p(z)}[D(G(z))], \qquad (29)$$

for the discriminator and the generator, respectively. This type of loss has already been used in [14, 16, 24, 2].
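For completeness, the two losses can be written in PyTorch as follows, operating on discriminator logits (with D = sigmoid(logits) in the non-saturating case); these are our own minimal definitions, not the released code.

```python
# Minimal PyTorch sketches of the adversarial losses in Eqns. 27-29
# (written as losses to be minimized).
import torch
import torch.nn.functional as F

def d_loss_nonsat(d_real_logits, d_fake_logits):
    # -E[log D(x)] - E[log(1 - D(G(z)))], with D = sigmoid(logits)
    return F.softplus(-d_real_logits).mean() + F.softplus(d_fake_logits).mean()

def g_loss_nonsat(d_fake_logits):
    # alternate (non-saturating) cost: -E[log D(G(z))]
    return F.softplus(-d_fake_logits).mean()

def d_loss_hinge(d_real_logits, d_fake_logits):
    # Eqn. 28
    return F.relu(1.0 - d_real_logits).mean() + F.relu(1.0 + d_fake_logits).mean()

def g_loss_hinge(d_fake_logits):
    # Eqn. 29
    return -d_fake_logits.mean()
```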


Setting    α        β1     β2      n_dis
A          0.0001   0.5    0.9     5
B          0.0001   0.5    0.999   1
C          0.0002   0.5    0.999   1
D          0.001    0.5    0.9     5
E          0.001    0.5    0.999   5
F          0.001    0.9    0.999   5

Table C. Hyper-parameter settings for the stability experiments on DCGAN, following [16].

Figure VII. Comparison of SN and ONI on DCGAN. (a) The IS with respect to training epochs. (b) The stability experiments on the six configurations described in [16].

Our code is implemented in PyTorch [18], and the trained Inception model is the official one from PyTorch [18]. The IS and FID for the real training data are 10.20 ± 0.13 and 3.07, respectively. Note that we do not use the learnable scalar in any of the GAN experiments, and we set σ = 1 in ONI, for a more consistent comparison with SN.

G.1. Experiments on DCGAN

The DCGAN architecture follows the configuration in [16], and we provide the details in Table D for completeness. The spectral normalization (SN) and our ONI are only applied to the discriminator, following the experimental setup of the SN paper [16].

Figure VII (a) shows the IS of SN and ONI when varying Newton's iteration number T from 0 to 5. We make the same observation as with the FID evaluation shown in Section 4.2 of the paper.

As discussed in Section 4.2 of the paper, we conduct experiments to validate the stability of our proposed ONI under different experimental configurations, following [16]. Table C shows the corresponding configurations (denoted by A-F) when varying the learning rate α, the first momentum β1, the second momentum β2, and the number of discriminator updates per generator update, n_dis. The results evaluated by IS are shown in Figure VII (b). We observe that our ONI is consistently better than SN under the IS evaluation.

G.2. Implementation Details of ResNet-GAN

The ResNet architecture also follows the configuration in [16], and we provide the details in Table E for completeness.

Figure VIII. Comparison of SN and ONI on the ResNet GAN. We show the IS with respect to training epochs using (a) the non-saturating loss and (b) the hinge loss.

The SN and our ONI are only applied to the discriminator, following the experimental setup of the SN paper [16]. We provide the results of SN and ONI, evaluated by IS, in Figure VIII.

G.3. Qualitative Results of GANs

We provide the generated images in Figures IX, X and XI. Note that we do not hand-pick the images, and we show all results at the end of training.

We provide the generated images in Figure IX, X and XI.Note that we don’t hand-pick the images, and show all theresults at the end of the training.

(a) Generator:
z ∈ R^{128} ∼ N(0, I)
4 × 4, stride=1 deconv. BN 512 ReLU → 4 × 4 × 512
4 × 4, stride=2 deconv. BN 256 ReLU
4 × 4, stride=2 deconv. BN 128 ReLU
4 × 4, stride=2 deconv. BN 64 ReLU
3 × 3, stride=1 conv. 3 Tanh

(b) Discriminator:
RGB image x ∈ R^{32×32×3}
3 × 3, stride=1 conv 64 lReLU
4 × 4, stride=2 conv 64 lReLU
3 × 3, stride=1 conv 128 lReLU
4 × 4, stride=2 conv 128 lReLU
3 × 3, stride=1 conv 256 lReLU
4 × 4, stride=2 conv 256 lReLU
3 × 3, stride=1 conv 512 lReLU
dense → 1

Table D. DCGAN architectures for the CIFAR-10 dataset in our experiments. 'lReLU' indicates the leaky ReLU [15] with a slope of 0.1.

(a) Generator:
z ∈ R^{128} ∼ N(0, I)
dense, 4 × 4 × 128
ResBlock up 128
ResBlock up 128
ResBlock up 128
BN, ReLU, 3 × 3 conv, 3 Tanh

(b) Discriminator:
RGB image x ∈ R^{32×32×3}
ResBlock down 128
ResBlock down 128
ResBlock 128
ResBlock 128
ReLU
Global sum pooling
dense → 1

Table E. ResNet architectures for the CIFAR-10 dataset in our experiments. We use the same ResBlock as the SN paper [16].


References

[1] Nils Bjorck, Carla P. Gomes, Bart Selman, and Kilian Q. Weinberger. Understanding batch normalization. In NeurIPS, 2018.
[2] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In ICLR, 2019.
[3] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cuDNN: Efficient primitives for deep learning. CoRR, abs/1410.0759, 2014.
[4] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop, 2011.
[5] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
[6] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, 2014.
[7] Priya Goyal, Piotr Dollar, Ross B. Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. CoRR, abs/1706.02677, 2017.
[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[9] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, 2017.

[10] Lei Huang, Xianglong Liu, Bo Lang, Adams Wei Yu, Yongliang Wang, and Bo Li. Orthogonal weight normalization: Solution to optimization over multiple dependent Stiefel manifolds in deep neural networks. In AAAI, 2018.
[11] Lei Huang, Yi Zhou, Fan Zhu, Li Liu, and Ling Shao. Iterative normalization: Beyond standardization towards efficient whitening. In CVPR, 2019.
[12] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[13] Yann LeCun, Leon Bottou, Genevieve B. Orr, and Klaus-Robert Muller. Efficient backprop. In Neural Networks: Tricks of the Trade, 1998.
[14] Jae Hyun Lim and Jong Chul Ye. Geometric GAN. CoRR, abs/1705.02894, 2017.
[15] Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In ICML Workshop on Deep Learning for Audio, Speech and Language Processing, 2013.
[16] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In ICLR, 2018.
[17] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.
[18] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NeurIPS Autodiff Workshop, 2017.
[19] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211-252, 2015.
[20] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In NeurIPS, 2016.
[21] Tim Salimans and Diederik P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In NeurIPS, 2016.
[22] C. Szegedy, Wei Liu, Yangqing Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[23] Yuxin Wu and Kaiming He. Group normalization. In ECCV, 2018.
[24] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. In ICML, 2019.


Figure IX. Generated images for CIFAR-10 by our ONI with different iteration numbers, using DCGAN [16]. Panels (a)-(f) correspond to T = 0, 1, 2, 3, 4, 5.

Figure X. Generated images for CIFAR-10 by SN and ONI, using DCGAN [16]. Panels (a)-(c) show SN and panels (d)-(f) show ONI, with configurations A, B and C.


Figure XI. Generated images for CIFAR-10 by SN and ONI, using ResNet [16]. Panels (a) and (b) show SN with the non-saturating and hinge losses; panels (c) and (d) show ONI with the non-saturating and hinge losses.