

Self-learning Monte Carlo with Deep Neural Networks

Huitao Shen,1, ∗ Junwei Liu,1, 2, † and Liang Fu1

1 Department of Physics, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
2 Department of Physics, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong, China

The self-learning Monte Carlo (SLMC) method is a general algorithm to speed up MC simulations. Its efficiency has been demonstrated in various systems by introducing an effective model to propose global moves in the configuration space. In this paper, we show that deep neural networks can be naturally incorporated into SLMC and, without any prior knowledge, can learn the original model accurately and efficiently. Demonstrated in quantum impurity models, we reduce the complexity of a local update from O(β²) in the Hirsch-Fye algorithm to O(β ln β), which is a significant speedup especially for systems at low temperatures.

As an unbiased method, Monte Carlo (MC) simulation plays an important role in understanding condensed matter systems. Although great successes have been made in the past several decades, there are still many interesting systems that are practically beyond the capability of conventional MC methods, due to the strong autocorrelation of local updates or due to the heavy computational cost of a single local update. In the midst of recent developments of machine learning techniques in physics [1–17], a general method called "self-learning Monte Carlo (SLMC)" was introduced to reduce or solve these problems, first in classical statistical mechanics models [18, 19], later extended to classical spin-fermion models [20], determinantal quantum Monte Carlo (DQMC) [21], continuous-time quantum Monte Carlo [22, 23] and hybrid Monte Carlo [24]. Recently, it helped understand the itinerant quantum critical point by setting a new record of system size in DQMC simulations [25].

Designed under the philosophy of "first learn, then earn", the central ingredient of SLMC is an effective model that is trained to resemble the dynamics of the original model. The advantage of SLMC is two-fold. First, simulating the effective model is much faster, which enables the machine to propose global moves to accelerate MC simulations on the original model. Second, the effective model can directly reveal the underlying physics, such as the RKKY interaction in the double-exchange model [20] and the localized spin-spin imaginary-time correlation [23]. We note that there have been many previous works incorporating effective potentials or proposing various kinds of global moves to improve Monte Carlo simulation efficiency [26–30].

The efficiency of SLMC depends on the accuracy of the effective model, which is usually invented based on human understanding of the original system [18, 20–23, 25]. To further extend SLMC to complex systems where an accurate effective model is difficult to write down, in this work we employ deep neural networks (DNN) as effective models in SLMC. Instead of treating neural networks as black boxes with a huge number

∗ [email protected]
† [email protected]

of parameters and training them blindly, we show how to design highly efficient neural networks that respect the symmetry of the system, with very few parameters yet capturing the dynamics of the original model quantitatively. The generality of this approach is guaranteed by the mathematical fact that DNNs are able to accurately approximate any continuous function given enough fitting parameters [31, 32]. Practically, our DNNs can be trained with ease using back-propagation-based algorithms [33], and can be directly evaluated on dedicated hardware [34]. Compared with other machine learning models in SLMC such as the restricted Boltzmann machine [19, 24], which can be regarded as a fully-connected neural network with one hidden layer, our DNNs have greater expressibility (more hidden layers) and flexibility (respecting the symmetry of the model).

As a concrete example, we demonstrate SLMC with DNNs on quantum impurity models. In the following, we first review SLMC for fermion systems. We then implement the simplest neural networks and test their performance. Next, we show how the visualization of these networks helps design a more sophisticated convolutional neural network that is more accurate and efficient. Finally, we discuss the complexity of our algorithm.

SLMC for Fermions  For an interacting fermion system, the partition function is given by Z = Tr[e^{-βĤ_f}], where β = 1/T is the inverse temperature and the trace is over the grand-canonical ensemble. One often applies the Trotter decomposition $e^{-\beta \hat H_f} = \prod_{i=1}^{L} e^{-\Delta\tau \hat H_f}$, Δτ = β/L, the Hubbard-Stratonovich transformation $\mathrm{Tr}[e^{-\Delta\tau \hat H_f}] = \sum_{s^j=\pm 1}^{N_s} \mathrm{Tr}[e^{-\Delta\tau \hat H[s^j]}]$, and then integrates out the fermions. We denote s_i^j as the j-th auxiliary Ising spin on the i-th imaginary-time slice. At this stage, the partition function is written purely in terms of the auxiliary Ising spin degrees of freedom S ≡ {s_i^j} [35–37]:

$$ Z = \sum_{\mathcal{S}} \det\left[\, I + \prod_{i=1}^{L} e^{-\Delta\tau \hat H[s_i]} \right] \equiv \sum_{\mathcal{S}} W[\mathcal{S}]. \tag{1} $$

The Monte Carlo sampling is in the configuration space of S. The probability p of accepting a move, for example in the Metropolis-Hastings algorithm, is the weight ratio between two configurations: p(S₁ → S₂) = min(1, W[S₂]/W[S₁]).


Generally, one must evaluate the determinant in Eq. (1), which is time-consuming.
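To make the cost concrete, the toy sketch below (ours, not the authors' code) evaluates the weight of Eq. (1) by explicitly multiplying the single-particle propagators over all time slices and taking a determinant. The 4-orbital Hamiltonian and the way the auxiliary spin enters are made-up placeholders; the point is only to show where the expensive linear algebra sits.

```python
import numpy as np
from scipy.linalg import expm  # matrix exponential

def weight(spins, h_of_s, dtau):
    """W[S] = det[ I + prod_i exp(-dtau * H[s_i]) ]  (Eq. 1), for a toy
    single-particle Hamiltonian h_of_s(s) returned as a dense matrix."""
    n = h_of_s(spins[0]).shape[0]
    prod = np.eye(n)
    for s in spins:                      # product over imaginary-time slices
        prod = expm(-dtau * h_of_s(s)) @ prod
    return np.linalg.det(np.eye(n) + prod)

# toy example: 4 orbitals, auxiliary spin shifts one local level (placeholder physics)
rng = np.random.default_rng(0)
hop = rng.normal(size=(4, 4)); hop = (hop + hop.T) / 2
h_of_s = lambda s: hop + s * np.diag([1.0, 0.0, 0.0, 0.0])
spins = rng.choice([-1, 1], size=20)     # L = 20 time slices
print(weight(spins, h_of_s, dtau=0.1))
```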

The idea of SLMC is to introduce an effective model H_eff[S;α] that depends on some trainable parameters α. We would like to optimize α so that W_eff[S;α] ≡ e^{-βH_eff[S;α]} and W[S] are as close as possible. More formally, we would like to minimize the mean squared error (MSE) of the logarithmic weight difference:

$$ \min_{\alpha}\ \mathbb{E}_{\mathcal{S}\sim W[\mathcal{S}]/Z} \left( \ln W_{\mathrm{eff}}[\mathcal{S};\alpha] - \ln W[\mathcal{S}] \right)^2. \tag{2} $$

The rationale for minimizing this error will be discussed shortly. Following the maximum-likelihood principle, in practice one minimizes the MSE on a given data set, the training set {S, W[S]}, obtained from the Monte Carlo simulation of the original model. This training set is small compared with the whole configuration space, but it is considered typical of the distribution of the original model because it is generated by importance sampling. Importantly, the training data taken from the Markov chain should be independently distributed in order for the maximum-likelihood estimation to work well.

One then uses the trained effective model to propose global moves. Starting from a given configuration S₁, one first performs standard Monte Carlo simulations on the effective model, S₁ → S₂ → … → Sₙ. The configuration Sₙ is accepted by the original Markov chain with the probability [18, 20]

$$ p(\mathcal{S}_1 \to \mathcal{S}_n) = \min\left( 1,\ \frac{W[\mathcal{S}_n]}{W[\mathcal{S}_1]}\, \frac{W_{\mathrm{eff}}[\mathcal{S}_1;\alpha]}{W_{\mathrm{eff}}[\mathcal{S}_n;\alpha]} \right). \tag{3} $$

As proven in the Supplemental Material [38], the acceptance rate ⟨p⟩, defined as the expectation of the acceptance probability p in Eq. (3), is directly related to the MSE in Eq. (2) through ⟨(ln p)²⟩ = MSE. The MSE therefore serves as a very good estimate of the acceptance rate. Indeed, we will see in the following that the two quantities track each other very well.
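The cumulative update described above can be summarized in a short sketch. This is a minimal illustration assuming generic log_w and log_w_eff callables for the exact and effective log-weights; it is not the authors' implementation.

```python
import numpy as np

def slmc_global_move(spins, log_w, log_w_eff, n_local, rng):
    """Propose a global move by simulating the effective model with local
    Metropolis updates, then accept/reject with Eq. (3) so that the exact
    distribution W[S] is preserved."""
    s1 = spins.copy()
    s = spins.copy()
    # cheap local updates on the effective model only
    for _ in range(n_local):
        i = rng.integers(len(s))
        old = log_w_eff(s)
        s[i] *= -1
        if np.log(rng.random()) >= log_w_eff(s) - old:
            s[i] *= -1                    # reject the local flip
    # accept the accumulated move with the exact/effective weight ratio, Eq. (3)
    log_p = (log_w(s) - log_w(s1)) - (log_w_eff(s) - log_w_eff(s1))
    if np.log(rng.random()) < min(0.0, log_p):
        return s                          # global move accepted
    return s1                             # global move rejected
```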

The acceleration of SLMC can be analyzed as follows. Denote the computational cost of computing W[S] and W_eff[S;α] for a given S as T and T_eff, and the autocorrelation time of a measurement without SLMC as τ. Supposing T ≥ T_eff, one can always make enough (≳ τ) updates in the effective model so that the proposed configuration Sₙ is uncorrelated with S₁. In this way, obtaining two independent configurations takes time O(τT) without SLMC and O((τT_eff + T)/⟨p⟩) with SLMC. If simulating the effective model is efficient, τT_eff ≪ T, as demonstrated by the cases studied in Refs. [20, 21], the acceleration is of order ⟨p⟩τ.

In principle, there is no limitation on the functional form of H_eff[S;α]. In the following, we choose to construct H_eff[S;α] using neural networks of different architectures. To be concrete, we study the asymmetric Anderson model with a single impurity [39]

$$ H = H_0 + H_1, \tag{4} $$
$$ H_0 = \sum_{k\sigma} \left[ \varepsilon_k c_{k\sigma}^{\dagger} c_{k\sigma} + V \left( c_{k\sigma}^{\dagger} d_{\sigma} + \mathrm{h.c.} \right) \right] + \mu\, n_d, \tag{5} $$
$$ H_1 = U \left( n_{d,\uparrow} - \tfrac{1}{2} \right)\left( n_{d,\downarrow} - \tfrac{1}{2} \right), \tag{6} $$

where d_σ and c_k are the fermion annihilation operators for the impurity and for the conduction electrons, respectively, n_{d,σ} ≡ d_σ†d_σ, and n_d = Σ_{σ=↑/↓} n_{d,σ}. With different fillings, this model hosts very different low-temperature behaviors, identified by three regimes: local moment, mixed valence and empty orbital [40]. The Hubbard-Stratonovich transformation on the impurity site is (up to a constant factor)

$$ e^{-\Delta\tau H_1} = \frac{1}{2} \sum_{s=\pm 1} e^{\lambda s (n_{d,\uparrow} - n_{d,\downarrow})}, \tag{7} $$

with cosh λ = e^{ΔτU/2}. In total there are L auxiliary spins, one at each imaginary-time slice, denoted as a vector s ≡ S. For impurity problems, one may integrate out the continuous band of conduction electrons explicitly and update with the Hirsch-Fye algorithm [41, 42]. In the following, the conduction band is assumed to have the semicircular density of states ρ(ε) = (2/πD)√(1 − (ε/D)²). The half bandwidth D = 1 is set as the energy unit.
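The discrete Hubbard-Stratonovich identity in Eq. (7) can be checked numerically on the four impurity occupation states. The sketch below (with arbitrarily chosen Δτ and U) verifies that the two sides agree up to a single overall constant, as stated in the text.

```python
import numpy as np

# Check Eq. (7): exp(-dtau*U*(nu-1/2)*(nd-1/2)) equals, up to one overall
# constant, (1/2) * sum_{s=+-1} exp(lam*s*(nu-nd)), with cosh(lam) = exp(dtau*U/2).
dtau, U = 0.25, 3.0                      # illustrative values, not from the paper
lam = np.arccosh(np.exp(dtau * U / 2))

for nu in (0, 1):
    for nd in (0, 1):
        lhs = np.exp(-dtau * U * (nu - 0.5) * (nd - 0.5))
        rhs = 0.5 * sum(np.exp(lam * s * (nu - nd)) for s in (+1, -1))
        print(nu, nd, lhs / rhs)         # the same constant exp(-dtau*U/4) each time
```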

Fully-Connected Neural Networks  To gain some insight into how to design the neural network as the effective model, we first implement the simplest neural network, which consists of several fully-connected layers. Its structure is shown schematically in the inset of Fig. 1. The effect of the i-th fully-connected layer can be summarized as a_{i+1} = f_i(W_i a_i + b_i), where a_i is the input of the i-th layer (the output of the (i−1)-th layer), and W_i, b_i, f_i are the weight matrix, bias vector, and nonlinear activation function of that layer. The layer is said to have N_i neurons when W_i is of size N_i × N_{i−1}. This structure has recently been studied extensively as a variational wavefunction for quantum many-body systems [43–52].

We take the auxiliary Ising spin configuration as the input vector s ≡ a₁. It is propagated through two nonlinear hidden layers with N₁ and N₂ neurons, and then a linear output layer. The output is a single number representing the corresponding weight ln W_eff[s]. We trained fully-connected neural networks of different architectures and in different physical regimes. Details on the networks and training can be found in the Supplemental Material [38].
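As a concrete illustration, a fully-connected effective model of this shape can be written in a few lines of TensorFlow/Keras (the Supplemental Material states that the networks are trained with TensorFlow, but the code below is our sketch, not the authors'). The layer widths follow the N₁ = 100, N₂ = 50 network of Fig. 1.

```python
import tensorflow as tf

L = 120  # number of imaginary-time slices / auxiliary spins

# Fully-connected effective model: spin configuration s in {-1,+1}^L  ->  ln W_eff[s]
fc_model = tf.keras.Sequential([
    tf.keras.Input(shape=(L,)),
    tf.keras.layers.Dense(100, activation="relu"),   # N1 = 100
    tf.keras.layers.Dense(50, activation="relu"),    # N2 = 50
    tf.keras.layers.Dense(1),                        # linear output layer
])
fc_model.summary()
```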

As shown in Fig. 1, the trained neural network resembles the original model very well. It retains high acceptance rates (> 70%) steadily throughout all the parameter regimes. In addition, e^{-√MSE} indeed shares the same trend as the acceptance rate ⟨p⟩. This suggests that to compare different effective models, one can directly compare the MSE instead of computing the acceptance rate every time.


FIG. 1. The performance of the effective model at different chemical potentials. The dashed blue line is the estimate of the acceptance rate through the MSE computed on a test data set [38]: e^{-√MSE}. The discrepancy between the solid and the dashed lines is due to the fact that in general ⟨p⟩ ≠ e^{-√⟨(ln p)²⟩} = e^{-√MSE} and the effective model is biased. Here β = 20, U = 3.0, V = 1.0 and L = 120. The numbers of neurons in the first and second hidden layers are set as N₁ = 100 and N₂ = 50. The activation function is the rectified linear unit f(x) = max{x, 0}. Inset: A schematic of the fully-connected neural network. The red circles represent neurons in the input layer with L = 4, and the blue circles represent neurons in two hidden layers with N₁ = 6 and N₂ = 3. The last layer is a linear output layer.

To extract more features from the neural networks, we visualize the weight matrix of the first layer in Fig. 2. The most striking feature is its sparsity. Although the network is fully connected by construction, most of the weight matrix elements vanish after training, and thus the network is essentially sparsely connected. Clearly, even without any prior knowledge of the system, and using only field configurations and their corresponding energies, the neural network can actually "learn" that the correlation between auxiliary spins is short ranged in imaginary time.

The weight matrices of neural networks trained at different chemical potentials look similar. The main difference lies in the magnitude of the matrix elements. As shown in Fig. 2, the neural network captures the relative effective interaction strength. When the chemical potential moves away from half-filling, the lower occupation of the impurity site ⟨n_d⟩ leads to a weaker coupling between the auxiliary spins and the impurity electrons, according to the Hubbard-Stratonovich transformation Eq. (7). This in turn weakens the interaction between the auxiliary spins induced by the conduction electrons.

FIG. 2. Upper: Weight matrix W₁ taken from the fully-connected neural network of µ = 0 in Fig. 1. Lower: Average magnitude of the nonzero matrix elements in W₁ for the fully-connected neural networks in Fig. 1. A "nonzero" matrix element is defined as one greater than 10% of the maximum element over all 7 weight matrices W₁ from networks trained at 7 different chemical potentials. Only around 3% of the elements are nonzero in these weight matrices.

We end this section by briefly discussing the complexity of fully-connected neural networks. The forward propagation of the spin configuration through the fully-connected network involves three matrix-vector multiplications. Each multiplication takes computational cost O(L²), with L the number of imaginary-time slices, which is usually proportional to βU. Thus the running time for a local update in the effective model is T_eff = O(L²). In the Hirsch-Fye algorithm, each local update also takes time T = O(L²) [41, 42]. Since T_eff = T, SLMC based on fully-connected networks has no advantage in speed over the original Hirsch-Fye algorithm. In the next section, we show that by taking advantage of the sparse connectivity found in the trained network, we can reduce the complexity and make acceleration possible.

Exploit the Sparsity  The sparsity found in the previous section inspires us to design a neural network that is sparsely connected by construction. Moreover, it is known physically that the interaction between the auxiliary spins in imaginary time is translationally invariant, i.e., it depends only on |τ_i − τ_j|. A neural network with both properties is known as a "(1D) convolutional neural network" [33]. Instead of doing matrix multiplications as in fully-connected networks, convolutional networks produce their output by the sliding inner product denoted by ∗: a_{i+1} = f_i(a_i ∗ h_i + b_i) (Fig. 3). h_i and b_i are called the kernel and the bias of the layer. A detailed mathematical description of such networks can be found in the Supplemental Material [38].

FIG. 3. The structure of the convolutional neural network. The first layer is a convolutional layer with two kernels (dark and light blue) of size 3. The stride of the sliding inner product is 2. The second layer is a fully-connected layer and the last layer is a linear layer.

The key parameters of a convolutional neural network are the number and size of the kernels and the stride of the sliding inner product. The choice of these parameters can be guided by the fully-connected networks: the number and size of the kernels are determined according to the pattern of the weight matrices in the fully-connected networks, and the stride can be chosen to be half the kernel size to avoid missing local correlations. For example, for the model whose weight matrix is shown in Fig. 2, we could choose 2 kernels of size 9 for the first convolutional layer with a stride of 3. Several convolutional layers designed in the same spirit are then stacked until the size of the output is small enough. Finally, one adds a fully-connected layer to produce the final output.
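A sketch of such a convolutional effective model is shown below, using the worked numbers from this paragraph (2 kernels of size 9, stride 3) plus a second convolutional layer and a small fully-connected head in the spirit of Table II of the Supplemental Material. This is our illustration, not the authors' code; note in particular that Keras Conv1D has no built-in periodic padding, so "same" padding stands in for the paper's periodic boundary condition.

```python
import tensorflow as tf

L = 120

# First layer: 2 kernels of size 9 with stride 3; then a second convolutional
# layer, a small fully-connected layer and a linear output producing ln W_eff.
conv_model = tf.keras.Sequential([
    tf.keras.Input(shape=(L, 1)),                      # spins as an L x 1 "image"
    tf.keras.layers.Conv1D(2, 9, strides=3, padding="same", activation="relu"),
    tf.keras.layers.Conv1D(2, 4, strides=2, padding="same", activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(7, activation="relu"),
    tf.keras.layers.Dense(1),                          # ln W_eff
])
print(conv_model.count_params())   # a few hundred trainable parameters
```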

Compared with the fully-connected network, the convolutional network has far fewer trainable parameters, which explicitly reduces the redundancy in the parametrization. The fully-connected networks in Fig. 2 typically have 10⁵ trainable parameters, while the convolutional networks have only about 10², smaller by three orders of magnitude.

The performance of the convolutional networks is shown in Fig. 4. The measured fermion imaginary-time Green's function is shown in the Supplemental Material [38]. Interestingly, the acceptance rate of global moves proposed by convolutional networks is sometimes even higher than that of moves proposed by fully-connected networks. In principle, fully-connected networks with more parameters have greater expressibility. However, for this specific Anderson model, the parametrization of the effective model has a lot of redundancy, since the auxiliary-spin interactions are local and translationally invariant. The convolutional networks remove this redundancy by construction, and are easier to train, potentially due to the smaller parameter space.

FIG. 4. (a) The performance of the convolutional network compared with the fully-connected network in Fig. 1. The number of trainable parameters is 211 (2 kernels in the first convolutional layer) or 291 (6 kernels in the first convolutional layer). Here β = 20, U = 3.0, V = 1.0 and L = 120. (b) The performance of the convolutional network at different temperatures. Here U = 3.0, µ = −1.0, V = 1.0 and L = 2βU. The convolutional network details are described in the Supplemental Material [38].

The fewer parameters not only make the network easier to train, but also faster to evaluate. Each sliding inner product has computational cost O(L). It is important to notice that the strides of the sliding inner products are greater than one, so that the dimensions of the intermediate outputs keep decreasing. Because of the large stride, the number of convolutional layers is of order O(ln L). The final fully-connected layer is small, and propagating through it only costs a constant computational time that is insensitive to L. To summarize, each local update on the effective model has complexity T_eff = O(L ln L), while that in the Hirsch-Fye algorithm is T = O(L²). Since the autocorrelation time for the desired observable is at least of order τ = Ω(L) in order for every auxiliary spin to be updated once, i.e., τT_eff ≥ T, the acceleration with respect to the original Hirsch-Fye algorithm is ⟨p⟩τT/(τT_eff + T) ≈ ⟨p⟩T/T_eff, of order ⟨p⟩L/ln L. This is especially significant for large L. The efficiency allows us to train effective models at very low temperatures (Fig. 4(b)), whereas training a fully-connected network there is very costly, if possible at all.
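A tiny counting sketch (our illustration, with an arbitrary cutoff size) makes the depth argument explicit: with stride t > 1 the time direction shrinks geometrically, so roughly log_t L convolutional layers suffice before the output is of order one.

```python
import math

def conv_depth(L, stride=3, min_size=8):
    """Count how many strided convolutional layers are needed before the
    time direction shrinks to a constant size: roughly log_stride(L) layers."""
    sizes, n = [L], L
    while n > min_size:
        n = math.ceil(n / stride)
        sizes.append(n)
    return sizes

for L in (120, 240, 480, 960):          # L = 2*beta*U grows with beta
    sizes = conv_depth(L)
    print(L, len(sizes) - 1, sizes)     # depth grows like ln(L)
```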

Conclusion  In this paper, we showed how to integrate deep neural networks into the framework of SLMC. Both the architecture of the networks and the way we design them are general and not restricted to impurity models. This work can help design neural networks as effective models in more complicated systems, thereby introducing state-of-the-art deep learning hardware into the field of computational physics.

Particularly for impurity models, we demonstrated that the complexity of the convolutional network for a local update is improved to O(L ln L). We note that there exist continuous-time Monte Carlo algorithms that generally outperform the discrete-time Hirsch-Fye algorithm [53, 54]. Although similar self-learning approaches have already been implemented in these systems [22, 23], designing an accurate effective model for continuous-time algorithms is not straightforward, as the size of the field configuration keeps changing during the simulation. Looking forward, there have already been attempts to introduce machine learning into dynamical mean-field theory (DMFT) [55, 56]. It will be interesting to accelerate DMFT simulations by integrating SLMC into their impurity solvers [57]. Moreover, it is worthwhile to develop more advanced network architectures beyond convolutional networks, e.g., networks that are invariant under permutations of the input [58]. We leave these attempts for future work.

ACKNOWLEDGMENTS

H.S. thanks Bo Zeng for helpful discussions. This work is supported by the DOE Office of Basic Energy Sciences, Division of Materials Sciences and Engineering, under Award DE-SC0010526. J.L. is supported by start-up funds from HKUST. L.F. is partly supported by the David and Lucile Packard Foundation.

[1] J. Carrasquilla and R. G. Melko, Nat. Phys. 13, 431 (2017).
[2] L. Wang, Phys. Rev. B 94, 195105 (2016).
[3] A. Tanaka and A. Tomiya, J. Phys. Soc. Jpn. 86, 063001 (2017).
[4] T. Ohtsuki and T. Ohtsuki, J. Phys. Soc. Jpn. 85, 123706 (2016).
[5] E. P. L. van Nieuwenburg, Y.-H. Liu, and S. D. Huber, Nat. Phys. 13, 435 (2017).
[6] Y. Zhang and E.-A. Kim, Phys. Rev. Lett. 118, 216401 (2017).
[7] K. Mills, M. Spanner, and I. Tamblyn, Phys. Rev. A 96, 042113 (2017).
[8] S. J. Wetzel, Phys. Rev. E 96, 022140 (2017).
[9] W. Hu, R. R. P. Singh, and R. T. Scalettar, Phys. Rev. E 95, 062122 (2017).
[10] F. Schindler, N. Regnault, and T. Neupert, Phys. Rev. B 95, 245134 (2017).
[11] S. Lu, S. Huang, K. Li, J. Li, J. Chen, D. Lu, Z. Ji, Y. Shen, D. Zhou, and B. Zeng, arXiv:1705.01523.
[12] M. Cristoforetti, G. Jurman, A. I. Nardelli, and C. Furlanello, arXiv:1705.09524.
[13] C. Wang and H. Zhai, Phys. Rev. B 96, 144432 (2017).
[14] P. Zhang, H. Shen, and H. Zhai, Phys. Rev. Lett. 120, 066401 (2018).
[15] W.-J. Rao, Z. Li, Q. Zhu, M. Luo, and X. Wan, Phys. Rev. B 97, 094207 (2018).
[16] N. Yoshioka, Y. Akagi, and H. Katsura, Phys. Rev. B 97, 205110 (2018).
[17] P. Huembeli, A. Dauphin, and P. Wittek, Phys. Rev. B 97, 134109 (2018).
[18] J. Liu, Y. Qi, Z. Y. Meng, and L. Fu, Phys. Rev. B 95, 041101 (2017).
[19] L. Huang and L. Wang, Phys. Rev. B 95, 035105 (2017).
[20] J. Liu, H. Shen, Y. Qi, Z. Y. Meng, and L. Fu, Phys. Rev. B 95, 241104 (2017).
[21] X. Y. Xu, Y. Qi, J. Liu, L. Fu, and Z. Y. Meng, Phys. Rev. B 96, 041119 (2017).
[22] L. Huang, Y.-f. Yang, and L. Wang, Phys. Rev. E 95, 031301 (2017).
[23] Y. Nagai, H. Shen, Y. Qi, J. Liu, and L. Fu, Phys. Rev. B 96, 161102 (2017).
[24] A. Tanaka and A. Tomiya, arXiv:1712.03893.
[25] Z. H. Liu, X. Y. Xu, Y. Qi, K. Sun, and Z. Y. Meng, arXiv:1706.10004.
[26] S. Baroni and S. Moroni, Phys. Rev. Lett. 82, 4745 (1999).
[27] C. Pierleoni and D. M. Ceperley, ChemPhysChem 6, 1872 (2005).
[28] O. F. Syljuasen and A. W. Sandvik, Phys. Rev. E 66, 046701 (2002).
[29] V. G. Rousseau, Phys. Rev. E 78, 056707 (2008).
[30] G. Carleo, F. Becca, S. Moroni, and S. Baroni, Phys. Rev. E 82, 046710 (2010).
[31] G. Cybenko, Math. Control Signals Syst. 2, 303 (1989).
[32] K. Hornik, Neural Networks 4, 251 (1991).
[33] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (MIT Press, 2016), http://www.deeplearningbook.org.
[34] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P.-l. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan, D. Killebrew, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon, in Proceedings of the 44th Annual International Symposium on Computer Architecture, ISCA '17 (ACM, New York, NY, USA, 2017) pp. 1–12.
[35] R. Blankenbecler, D. J. Scalapino, and R. L. Sugar, Phys. Rev. D 24, 2278 (1981).
[36] J. E. Hirsch, Phys. Rev. B 31, 4403 (1985).
[37] S. R. White, D. J. Scalapino, R. L. Sugar, E. Y. Loh, J. E. Gubernatis, and R. T. Scalettar, Phys. Rev. B 40, 506 (1989).
[38] See Supplemental Material for (i) a proof of the relation between the acceptance rate and the training MSE; (ii) basics of neural networks; (iii) details of the neural network structure and the training hyperparameters; (iv) the measured fermion imaginary-time Green's function.
[39] P. W. Anderson, Phys. Rev. 124, 41 (1961).
[40] F. D. M. Haldane, Phys. Rev. Lett. 40, 416 (1978).
[41] J. E. Hirsch and R. M. Fye, Phys. Rev. Lett. 56, 2521 (1986).
[42] R. M. Fye and J. E. Hirsch, Phys. Rev. B 38, 433 (1988).
[43] G. Carleo and M. Troyer, Science 355, 602 (2017).
[44] J. Chen, S. Cheng, H. Xie, L. Wang, and T. Xiang, Phys. Rev. B 97, 085104 (2018).
[45] D.-L. Deng, X. Li, and S. Das Sarma, Phys. Rev. X 7, 021021 (2017).
[46] X. Gao and L.-M. Duan, Nat. Commun. 8, 662 (2017).
[47] Y. Huang and J. E. Moore, arXiv:1701.06246.
[48] G. Torlai, G. Mazzola, J. Carrasquilla, M. Troyer, R. Melko, and G. Carleo, Nat. Phys. 14, 447 (2018).
[49] H. Saito and M. Kato, J. Phys. Soc. Jpn. 87, 014001 (2018).
[50] Y. Nomura, A. S. Darmawan, Y. Yamaji, and M. Imada, Phys. Rev. B 96, 205152 (2017).
[51] S. R. Clark, J. Phys. A: Math. Theor. 51, 135301 (2018).
[52] I. Glasser, N. Pancotti, M. August, I. D. Rodriguez, and J. I. Cirac, Phys. Rev. X 8, 011006 (2018).
[53] E. Gull, P. Werner, A. Millis, and M. Troyer, Phys. Rev. B 76, 235123 (2007).
[54] E. Gull, A. J. Millis, A. I. Lichtenstein, A. N. Rubtsov, M. Troyer, and P. Werner, Rev. Mod. Phys. 83, 349 (2011).
[55] L.-F. Arsenault, A. Lopez-Bezanilla, O. A. von Lilienfeld, and A. J. Millis, Phys. Rev. B 90, 155136 (2014).
[56] L.-F. Arsenault, O. A. von Lilienfeld, and A. J. Millis, arXiv:1506.08858.
[57] A. Georges, G. Kotliar, W. Krauth, and M. J. Rozenberg, Rev. Mod. Phys. 68, 13 (1996).
[58] R. Kondor, H. T. Son, H. Pan, B. Anderson, and S. Trivedi, arXiv:1801.02144.


Supplemental Material for “Self-learning Monte Carlo with Deep Neural Networks”

Huitao Shen,1, ∗ Junwei Liu,1, 2, † and Liang Fu1

1 Department of Physics, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
2 Department of Physics, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong, China

I. RELATION BETWEEN MSE AND THE ACCEPTANCE RATE

Define the weight difference between the original model and the effective model as Δ(S) ≡ ln W[S] − ln W_eff[S], where we have suppressed the α index. Suppose the MSE is minimized as min_α E_{S∼W[S]/Z} Δ(S)² ≡ l². The expectation of the logarithmic acceptance rate is estimated as [1]

$$ \begin{aligned} &\mathbb{E}_{\mathcal{S}_1\sim W[\mathcal{S}_1]/Z,\ \mathcal{S}_2\sim W[\mathcal{S}_2]/Z}\, [\ln p(\mathcal{S}_1\to\mathcal{S}_2)]^2 &\quad (1)\\ =\ &\mathbb{E}_{\mathcal{S}_1\sim W[\mathcal{S}_1]/Z,\ \mathcal{S}_2\sim W[\mathcal{S}_2]/Z}\, [\min(0,\, \Delta(\mathcal{S}_2)-\Delta(\mathcal{S}_1))]^2 &\quad (2)\\ =\ &\mathbb{E}_{\mathcal{S}_1\sim W[\mathcal{S}_1]/Z,\ \mathcal{S}_2\sim W[\mathcal{S}_2]/Z}\, \tfrac{1}{2}[\Delta(\mathcal{S}_2)-\Delta(\mathcal{S}_1)]^2 &\quad (3)\\ =\ &\mathbb{E}_{\mathcal{S}\sim W[\mathcal{S}]/Z}\, \Delta(\mathcal{S})^2 &\quad (4)\\ =\ &l^2. &\quad (5) \end{aligned} $$

From Eq. (2) to Eq. (3), we use the fact that half of the configuration pairs have Δ(S₂) − Δ(S₁) < 0. From Eq. (3) to Eq. (4) we have assumed E_{S∼W[S]/Z} Δ(S) = 0, meaning our effective model is unbiased. Hence we obtain ⟨(ln p)²⟩ = l² = MSE.

If the effective model is biased, with E_{S∼W[S]/Z} Δ(S) = µ, the formula becomes ⟨(ln p)²⟩ = MSE − µ². This is a manifestation of the bias-variance tradeoff in statistics.

Compared with the coefficient of determination R² defined in Ref. [2], this quantity is directly related to the acceptance rate, which in turn determines the efficiency of SLMC. It is therefore more meaningful to use this quantity to characterize how well the effective model resembles the original model.
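The identity above is easy to verify numerically. The sketch below draws synthetic, zero-mean Δ values (a Gaussian stand-in for the true distribution of weight differences) and checks that E[min(0, Δ₂ − Δ₁)²] matches E[Δ²].

```python
import numpy as np

# Numerical check of Eqs. (1)-(5): for an unbiased effective model
# (E[Delta] = 0), E[min(0, Delta2 - Delta1)^2] equals E[Delta^2] = MSE.
rng = np.random.default_rng(1)
delta = rng.normal(loc=0.0, scale=0.7, size=10**6)   # synthetic Delta(S), mean 0
d1, d2 = delta[:500_000], delta[500_000:]            # two independent samples

lhs = np.mean(np.minimum(0.0, d2 - d1) ** 2)         # <(ln p)^2>
mse = np.mean(delta ** 2)                            # l^2
print(lhs, mse)                                      # both close to 0.49
```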

II. FULLY-CONNECTED NEURAL NETWORKS

Mathematical Description  A fully-connected neural network is composed of several fully-connected layers. The effect of the i-th fully-connected layer can be summarized as a_{i+1} = f_i(W_i a_i + b_i), where a_i is the input of the i-th layer (the output of the (i−1)-th layer), and W_i, b_i, f_i are the weight matrix, bias vector, and nonlinear activation function of that layer. The layer is said to have N_i neurons when W_i is of size N_i × N_{i−1}.

Training Data  The training set is small compared with the whole configuration space. In order for the neural network to perform well, the limited training data should be as typical as possible. They are therefore generated by importance sampling from the Monte Carlo simulation of the original model. These data are also sufficiently independent from each other.

The training data {s, ln W[s]} are fed directly into the neural network without normalization. The input auxiliary spins are ±1, which are already well-conditioned. We do not normalize ln W[s] (typically of order 10) because the third layer is a linear layer, which should automatically scale the data.

The size of the training set is 30000, which is always larger than the number of trainable parameters, to prevent over-fitting. The size of the test set is 10000.

Training Hyperparameters  The neural networks are trained with TensorFlow [3] using the back-propagation algorithm. All the network training in this paper was done on a laptop, and every training task was completed within 10 minutes.

We use mini-batch training with batch size 128. All the training uses the Adam optimizer [4] with a learning rate of 0.001. Xavier initialization [5] is adopted for the weights, and constant initialization to 0.01 for the biases. Since our networks are relatively small, the loss converges to a satisfactory level within several hundred epochs when training from scratch. A typical loss curve during training is shown in Fig. 1.

[email protected][email protected]


FIG. 1. Typical MSE loss for a fully-connected network of µ = −1 in Fig. 1 in the main text (blue, circle) and for a convolutionalnetwork of µ = −1 in Fig. 4 in the main text (green, diamond) during a training instance.

TABLE I. Performance of fully-connected neural networks of different sizes. Here β = 20, U = 3.0, µ = −1, V = 1.0 and L = 120.

(N1, N2)    Trainables*   MSE (×10⁻¹)   Acceptance Rate (%)
(140, 70)   26881         1.31          73.06
(120, 60)   21841         1.22          72.92
(100, 50)   17201         1.33          73.34
(80, 40)    12961         1.54          68.94
(60, 30)     9121         1.87          66.40
(40, 20)     5681         2.55          61.90

[*] "Trainables" is the number of trainable parameters in the neural network. The same abbreviation is used in Table II.

In order to obtain a good visualization, a proper choice of regularization is necessary. We always use both L1 and L2 regularization with strength 0.005 to encourage sparsity.
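Putting these hyperparameters together, a training script along the following lines reproduces the setup described above (batch size 128, Adam with learning rate 0.001, Xavier weight initialization, constant 0.01 bias initialization, combined L1 and L2 regularization of strength 0.005). The data arrays are placeholders, and Glorot-uniform is our assumption for the unspecified Xavier variant; this is a sketch, not the authors' script.

```python
import tensorflow as tf

L = 120
reg = tf.keras.regularizers.l1_l2(l1=0.005, l2=0.005)     # L1 + L2, strength 0.005
init_w = tf.keras.initializers.GlorotUniform()            # Xavier initialization (assumed uniform)
init_b = tf.keras.initializers.Constant(0.01)             # constant bias initialization

model = tf.keras.Sequential([
    tf.keras.Input(shape=(L,)),
    tf.keras.layers.Dense(100, activation="relu", kernel_initializer=init_w,
                          bias_initializer=init_b, kernel_regularizer=reg),
    tf.keras.layers.Dense(50, activation="relu", kernel_initializer=init_w,
                          bias_initializer=init_b, kernel_regularizer=reg),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="mse")                                  # MSE of Eq. (2) in the main text

# spins: (30000, L) array of +-1 configurations; logw: (30000,) array of ln W[s]
# obtained from the Hirsch-Fye simulation of the original model (placeholders here).
# model.fit(spins, logw, batch_size=128, epochs=300, validation_split=0.1)
```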

Iterative Training  When computing Fig. 1 in the main text, we used the technique of iterative training. Instead of training a new effective model from scratch for every chemical potential µ, we only do so once (for example, for the µ = 0 model). For the other chemical potentials, we train effective models using the already trained model as the starting point. The loss then typically converges in less than 5 epochs. This technique is particularly useful when many similar models need to be simulated systematically.

Performance Benchmark In benchmarking the performance of the neural network, the MSE is evaluated on a testset of independent configurations. The acceptance rate is averaged over 5000 consecutive proposals of global moves.Each global move involves 10L local updates on the effective model so that the proposed configuration is uncorrelatedwith the original one.

Effect of Network Size  The performance of the neural networks with respect to the network size is summarized in Table I. For simplicity, we always fix N2 = N1/2. As long as N1 ∼ L, the MSE can be minimized to a reasonable value. More complex neural networks do not enhance the performance significantly, possibly due to the increasing redundancy in large networks.

Effect of Activation Function  We have tested three activation functions: the rectified linear unit (ReLU), sigmoid, and tanh:

$$ \mathrm{ReLU{:}}\quad f(x) = \max\{x, 0\}, \tag{6} $$
$$ \mathrm{sigmoid{:}}\quad f(x) = \frac{1}{1 + e^{-x}}, \tag{7} $$
$$ \mathrm{tanh{:}}\quad f(x) = \tanh x = \frac{1 - e^{-2x}}{1 + e^{-2x}}. \tag{8} $$

The performance of the effective models is not very sensitive to the choice of activation function. Nevertheless, we empirically found that ReLU networks are much easier to train, especially in the empty orbital regime, i.e., small µ.


TABLE II. Structures of the convolutional networks used in this paper. For convolutional layers, (n, d, t) is the (kernel number, kernel size, stride); for fully-connected layers, N is the number of neurons.

β     1st          2nd       3rd       4th   Trainables
10    (6,11,5)     (2,4,2)   6         -     195
20    (2/6,11,5)   (2,4,2)   7         -     211/291
30    (6,11,5)     (2,4,2)   7         -     375
40    (6,11,5)     (2,6,2)   (2,4,2)   7     319
50    (6,11,5)     (2,6,2)   (2,6,2)   7     355


FIG. 2. The effect of (a) the kernel number and (b) the kernel size of the first convolutional layer in the network on theperformance of the effective model. The second layer is fixed to have 2 kernels of size 4 and stride 2. Here β = 20, U = 3.0, µ =−1, V = 1.0 and L = 120. For (a), the size of the kernel is fixed to be 11 and the stride is 5. For (b), the size of the finalfully-connected layer is adjusted correspondingly.

III. CONVOLUTIONAL NEURAL NETWORKS

Mathematical Description The convolutional neural network used in this paper is composed of several convolutionallayers and several fully-connected layers.

The input and the output of each convolutional layer, instead of being vectors, are generally matrices. The input vector of auxiliary spins can be seen as an L × 1 matrix. The large weight matrix is replaced by several small matrices called "kernels". The effect of the convolutional layer is the sliding inner product (also called cross-correlation) between the kernels h and the input matrix a, defined as

$$ [a * h]_{i,j} \equiv [a * h^j]_i \equiv \sum_{k=1}^{d} \sum_{l=1}^{a_y} a_{t(i-1)+k,\, l}\; h^j_{k,l}, \tag{9} $$

where h^j is the j-th kernel, d is the size of the kernel, t is the stride of the sliding inner product, and l runs over the kernels of the previous convolutional layer. We take the periodic boundary condition (also called padding) for the sliding inner product (not shown in Fig. 3 in the main text for simplicity). Suppose there are n kernels and the dimension of the input matrix is a_x × a_y; then the dimension of each kernel is d × a_y, and the dimension of the output matrix is a_x/t × n. The effect of the i-th such layer can be written compactly as a_{i+1} = f_i(a_i ∗ h_i + b_i), where b_i is the bias associated with kernel h_i.
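For reference, Eq. (9) with periodic padding can be implemented directly in NumPy as follows. This is a literal, unoptimized sketch of the definition, not the code used for the networks in this paper.

```python
import numpy as np

def sliding_inner_product(a, kernels, stride):
    """Eq. (9) with periodic padding: a has shape (ax, ay), kernels has shape
    (n, d, ay); returns an (ax // stride, n) output matrix."""
    ax, ay = a.shape
    n, d, _ = kernels.shape
    out = np.empty((ax // stride, n))
    for i in range(ax // stride):
        rows = (stride * i + np.arange(d)) % ax        # periodic boundary condition
        patch = a[rows, :]                             # shape (d, ay)
        out[i] = np.tensordot(kernels, patch, axes=([1, 2], [0, 1]))
    return out

# tiny example: L = 12 spins, 2 kernels of size 3, stride 2
rng = np.random.default_rng(0)
spins = rng.choice([-1.0, 1.0], size=(12, 1))
kernels = rng.normal(size=(2, 3, 1))
print(sliding_inner_product(spins, kernels, stride=2).shape)   # (6, 2)
```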

After propagating the input through several convolutional layers, the matrix is flattened into a vector so that it canbe used as the input of subsequent fully-connected layers. In the literature, this operation is often formally performedby a “flatten layer” in between the convolutional layers and the fully-connected layers.

Structures of the Networks  See Table II. The activation function for all convolutional networks is ReLU.

Training Hyperparameters  The size of the data set, the initialization, the regularization and the learning rate are all the same as those of the fully-connected networks, except that there is no L1 regularization for the convolutional networks. A typical loss curve during training is shown in Fig. 1. Iterative training is also used in computing Fig. 4(a) in the main text.

We note that convolutional networks can get stuck at local minima of the loss function, possibly because the loss landscape is rough due to the small number of


FIG. 3. Measured impurity electron Green’s function G(τ) for different chemical potentials µ. The measurement is averagedover 5000 independent configurations of auxiliary spins. Here β = 20, U = 3.0, V = 1.0 and L = 2βU .

trainable parameters. One should always train the network two or three times and pick the best one on the validation set.

Effect of Kernel Number  The performance of the convolutional networks with respect to the kernel number is shown in Fig. 2(a). Here only the kernel number of the first layer is varied. The performance decreases significantly when the number is smaller than two, consistent with Fig. 2 in the main text, which shows mainly two features in the auxiliary spin configuration. The performance improves only slightly if the kernel number is increased beyond two.

Effect of Kernel Size  The performance of the convolutional networks with respect to the kernel size is shown in Fig. 2(b). Here only the kernel size of the first layer is varied. A larger kernel size leads to better acceptance rates, probably because these networks have fewer trainable parameters.

IV. MEASURED GREEN’S FUNCTION

The Green’s function of impurity electrons is defined as

$$ G_{\sigma\sigma'}(\tau - \tau') = \langle T_\tau\, d_\sigma(\tau)\, d_{\sigma'}^\dagger(\tau') \rangle_\beta, \tag{10} $$

where ⟨···⟩_β ≡ Tr[e^{-βH} ···]/Tr[e^{-βH}] is the thermal ensemble average, T_τ is the time-ordering operator, and d_σ is the impurity electron annihilation operator (see Eqs. (4)-(6) in the main text). In the definition, we have already exploited the translational symmetry in imaginary time. Since spin is conserved in the asymmetric Anderson model, we have G_{↑↓}(τ) = G_{↓↑}(τ) = 0 and G(τ) ≡ G_{↑↑}(τ) = G_{↓↓}(τ). In Fig. 3 we show the measured G(τ) using SLMC.

[1] We would like to thank an anonymous referee for pointing out a mistake in our previous derivation.
[2] J. Liu, H. Shen, Y. Qi, Z. Y. Meng, and L. Fu, Phys. Rev. B 95, 241104 (2017).
[3] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," (2015), software available from tensorflow.org.
[4] D. P. Kingma and J. Ba, in International Conference on Learning Representations (ICLR 2015), arXiv:1412.6980.
[5] X. Glorot and Y. Bengio, in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, Vol. 9, edited by Y. W. Teh and M. Titterington (PMLR, Chia Laguna Resort, Sardinia, Italy, 2010) pp. 249–256.