
High-Performance Computing for Exact Numerical Approaches to Quantum Many-Body Problems on the Earth Simulator

Susumu Yamada
Japan Atomic Energy Agency
6-9-3 Higashi-Ueno, Taito-ku, Tokyo, 110-0015, Japan
[email protected]

Toshiyuki Imamura
The University of Electro-Communications
1-5-1 Chofugaoka, Chofu-shi, Tokyo, 182-8585, Japan
[email protected]

Masahiko Machida* †
Japan Atomic Energy Agency
6-9-3 Higashi-Ueno, Taito-ku, Tokyo, 110-0015, Japan
[email protected]

Takuma Kano
Japan Atomic Energy Agency
6-9-3 Higashi-Ueno, Taito-ku, Tokyo, 110-0015, Japan
[email protected]

ABSTRACT
In order to study intriguing features of quantum many-body problems, we develop two matrix diagonalization codes, one of which solves only the ground state together with a few excited states, and the other of which solves all quantum states. The target model in both codes is the Hubbard model with a confinement potential, which describes an atomic Fermi gas loaded on an optical lattice and, in part, High-Tc cuprate superconductors. For the former code, we perform parallel tuning to attain the best performance and to extend the matrix size limit on the Earth Simulator. Consequently, we obtain 18.692 TFlops (57% of the peak) as the best performance when calculating the ground state of a 100-billion-dimensional matrix. From these large-scale calculations, we find that the confinement effect leads to atomic-scale inhomogeneous superfluidity, a new and challenging subject for physicists. For the latter code, we develop or install the best routine for each of its three calculation stages and succeed in solving a matrix whose dimension is 375,000 with 18.396 TFlops (locally 24.613 TFlops, 75% of the peak). The numerical calculations reveal a novel quantum feature: the change from a Schrödinger's-cat state to a classical one can be controlled by tuning the interaction. This is in marked contrast to the general notion that the change occurs with increasing system size.

* Corresponding author
† CREST(JST), 4-1-8 Honcho, Kawaguchi-shi, Saitama 330-0012, Japan, and CTC(JSPS) project

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SC2006 November 2006, Tampa, Florida, USA.
0-7695-2700-0/06 $20.00 (c) 2006 IEEE.

1. INTRODUCTION
Since the establishment of quantum mechanics in the last century, its mysterious features have attracted not only scientists but also ordinary people. Nowadays, the foundations of almost all fields of physics and chemistry rest on quantum mechanics, and even computer science has a deep connection with it through quantum computing.

In quantum mechanics, the most interesting topic is the behavior of quantum many-body systems. If one tries to solve quantum many-body problems analytically, one soon notices that the best course is to give up the attempt: the problem is terribly difficult. However, this situation changes drastically if one uses computers. In particular, modern parallel supercomputers enable us to access the exact quantum behavior of many-body systems composed of more than 10 interacting quantum particles, as shown in Section 3.4. In this paper, we take on this topic using the Earth Simulator and show our latest advances, i.e., how large a quantum system and how fast we can solve it in such exact numerical attempts.

Very recently, a novel quantum state, i.e., a new type of superfluid state, has been discovered as the ground state in trapped atomic Fermi gases. In these systems, the interaction is surprisingly tunable via the so-called Feshbach resonance, and a whole range from weakly to strongly interacting many-body systems is observable. Furthermore, by irradiating the gas with two counter-propagating laser beams, one can create a many-body fermion system under a lattice-like potential, which is equivalent to an electronic system inside a solid. Thus, the atomic Fermi gas is regarded as a flexible stage on which various concepts of quantum many-body systems can be tested. In this paper, we concentrate on the lattice fermion system, modeled by the fermion-Hubbard model with flexible confinement potentials, and perform two types of exact numerical calculations on this model in order to study the ground state and the quantum dynamics of such flexible systems.

The fermion-Hubbard model is one of the models most intensively studied by computer because it exhibits very rich physics although its expression is quite simple [1]. The Hamiltonian of the Hubbard model with a harmonic-well confinement potential [2, 3], which describes an atomic Fermi gas under an optical lattice potential, is given as

H = -t \sum_{i,j,\sigma} \left( a^{\dagger}_{j\sigma} a_{i\sigma} + \mathrm{H.c.} \right) + U \sum_{i} n_{i\uparrow} n_{i\downarrow} + \left( \frac{2}{N} \right)^{2} V \sum_{i,\sigma} n_{i\sigma} \left( i - \frac{N}{2} \right)^{2}, \qquad (1)

where t, U, and V are the hopping parameter from the i-th to the j-th site (normally j is a nearest-neighbor site of i), the repulsive energy for on-site double occupation by two fermions, and the parameter characterizing the strength of the confinement (harmonic-well) potential, respectively, as schematically shown in Fig. 1, and a_{i\sigma}, a^{\dagger}_{i\sigma}, and n_{i\sigma} are the annihilation, creation, and number operators of a fermion with pseudo-spin \sigma (= \uparrow (up) or \downarrow (down)) on the i-th site, respectively. Here, we note that the confinement potential in Eq. (1) may vary in time in atomic Fermi gases. In this paper, we investigate quantum states and their dynamics in the presence of the harmonic-well potential given in Eq. (1), a double-well potential, and dynamically varying potentials.

Computational finite-size approaches to the Hubbard model are roughly classified into three types. The first is exact diagonalization of the Hamiltonian matrix using the Lanczos method [4], the second is quantum Monte Carlo [1], and the third is the density matrix renormalization group (DMRG). The first exactly calculates the ground and low-lying excited states of the model, while the second and third reach larger systems by discarding almost irrelevant high-energy quantum states to save memory. However, we note that the second and third approaches require special attention since, in contrast to the first, reliable results are not always guaranteed.

Here, let us focus on the first approach. If one studies a reasonably large many-body system with it, one finds that high-performance computing (HPC) techniques [5], i.e., parallelization techniques, are simply crucial. Parallelization of matrix diagonalization has been one of the main issues in HPC, since any advancement is useful for several other types of applications. In this paper, we therefore report HPC issues in two types of matrix diagonalization: exact diagonalization and full matrix diagonalization. Although the former calculates only the ground state and a few excited states, it has historically been called "exact diagonalization", since the other approaches are not exact. The latter has frequently been used for studying exact quantum many-body dynamics, molecular orbitals, and so on. In this paper, we study the exact dynamics of the Hubbard model under trap potentials whose shapes change rapidly.

Our main supercomputer in this research project is the Earth Simulator, which is still one of the most attractive supercomputers in HPC, since its CPUs have vector pipelines; vector processors are now a minority, but they have high potential because the floating-point processing speed and the memory bandwidth are well balanced. The theoretical peak performance is 40.96 TFlops, and the total memory size is 10 TB. Although these resources now place it below the top 5, the Earth Simulator is well known to have shown excellent total throughput for scientific and technological applications [6]. In fact, there have been several reports of applications that achieved excellent performance and won the Gordon Bell Prize at the Supercomputing conferences [7, 8, 9, 10, 11]. For comparison with a typical scalar machine, we also use an Altix 3700/Bx2 at JAEA for the full diagonalization and briefly report on its performance (see Appendix).

The contents of this paper are as follows. In Section 2, we report recent developments in exact diagonalization. Previously, we suggested a profitable new algorithm, the preconditioned conjugate gradient (PCG) method, and showed that PCG is generally about three times faster in total CPU time and attains about 1.5 times higher performance than the conventional Lanczos method. The best performance of PCG was then 16.447 TFlops for the 1-dimensional Hubbard model [5]. In this paper, we apply the PCG algorithm to the 2-dimensional Hubbard model and design a new parallelization scheme from the viewpoints of performance, throughput, and memory saving. We test the scheme on the Earth Simulator and discuss its effectiveness. The best performance in this paper is 18.692 TFlops, which is 57% of the peak performance. After these HPC issues, we briefly show typical numerical results and discuss their physical meaning and significance. In Section 3, we switch to HPC issues in the full diagonalization solver. Since the solver can be divided into three calculation stages, we compare widely used routines and self-developed ones at each stage and construct the best combination on the Earth Simulator. We report the scalability of the solver with respect to the number of CPUs and the matrix size and clarify a problem at the current tuning level. Furthermore, we show that we succeed in solving all eigenstates of a 375,000-dimensional matrix with 18.396 TFlops (locally 24.613 TFlops, 75% of the peak, for the back-Householder transformation routine: the best peak-ratio data on the Earth Simulator). After these HPC issues, we briefly introduce a problem of quantum dynamics and explain a new insight into quantum dynamics obtained from the large-matrix full diagonalization. Section 4 gives our conclusions.

2. EXACT DIAGONALIZATION

2.1 Two methods
So far, a tremendous number of physicists have numerically studied the ground and low-lying excited states of strongly correlated electronic systems and lattice quantum spin systems by so-called "exact diagonalization" [4]. Typical target models are the Hubbard model, its extended versions, the Heisenberg model, the t-J model, and so on. Since these model Hamiltonians are directly represented by large sparse matrices, the Lanczos method has traditionally been employed. However, the Lanczos method has known drawbacks, and an improvement or an alternative method is required. Previously, at SC2005, we suggested the PCG method as an alternative and reported that PCG for the 1-dimensional Hubbard model is about three times faster than Lanczos in average total CPU time [5]. In this paper, we show that PCG for the 2-dimensional Hubbard model is more than 10 times faster in total CPU time.

Figure 1: A schematic figure of the one-dimensional fermion-Hubbard model with a confinement potential. Here, t and U are the hopping parameter and the repulsive energy for double occupation of a site, respectively. The up-arrows and down-arrows stand for fermions with up and down pseudo-spin, respectively.

In this section, after briefly explaining the algorithms of the Lanczos and PCG methods, we discuss HPC issues and compare their performance on the Earth Simulator.

2.1.1 Lanczos method
The Hubbard Hamiltonian with a confinement potential, i.e., Eq. (1), is represented by a large, sparse, symmetric matrix. In Eq. (1), the first term gives the non-zero off-diagonal components, while the second and third terms contribute only to the diagonal elements. If the first term describes only nearest-neighbor hopping, the matrix is clearly quite sparse. Thus, the Lanczos method has traditionally been employed to save memory and to handle matrices as large as possible [4]. The algorithm, summarized in Fig. 2(a), is quite simple. However, because of this simplicity, the following weak points in terms of performance have been pointed out. 1) It requires fixing the iteration count before the actual calculation to guarantee sufficient precision, which means the calculation always includes wasteful iterations. 2) Since the calculation is too large to retain all the Lanczos basis sets generated during the iteration, an extra iteration is required to obtain the eigenvector. Moreover, one needs to pay attention to orthogonality when calculating multiple eigenvalues.

2.1.2 Preconditioned Conjugate Gradient Method
Since the CPU time generally grows as the matrix size increases, performance improvement is a crucial factor for systematic studies that repeat huge matrix diagonalizations. Thus, we have suggested an alternative algorithm based on conjugate gradient (CG) theory. Among various CG methods, we employ an algorithm proposed by Knyazev [12, 13], which searches for the lowest eigenvector using a search direction computed from the Ritz vector, as shown in Fig. 2(b). Since the PCG method basically needs to keep only three vectors, i.e., the eigenvector x, the search direction p, and the residual w, its basic memory requirement is equivalent to that of the Lanczos method. However, to avoid the high computational cost of the projection onto the subspace, one prepares three additional vectors X, P, and W, which hold Hx, Hp, and Hw. An excellent point of the PCG method is the controllability of the iteration: one can flexibly stop the iteration according to the residual w or the eigenvector x. Moreover, the convergence depends strongly on the preconditioning matrix, just as for the PCG method for linear systems. Knyazev pointed out that a good choice of preconditioning matrix may drastically reduce the iteration count [12, 13].

x_0 := an initial guess
β_0 := 1, v_{-1} := 0, v_0 := x_0/‖x_0‖
do i = 0, 1, ..., m-1, or until β_i < ε
    u_i := H v_i - β_i v_{i-1}
    α_i := (u_i, v_i)
    w_{i+1} := u_i - α_i v_i
    β_{i+1} := ‖w_{i+1}‖
    v_{i+1} := w_{i+1}/β_{i+1}
enddo

(a) Lanczos method

x_0 := an initial guess, p_0 := 0
x_0 := x_0/‖x_0‖, X_0 := H x_0, P_0 := 0, μ_{-1} := (x_0, X_0)
w_0 := X_0 - μ_{-1} x_0
do k = 0, 1, ... until convergence
    W_k := H w_k
    S_A := {w_k, x_k, p_k}^T {W_k, X_k, P_k}
    S_B := {w_k, x_k, p_k}^T {w_k, x_k, p_k}
    Solve the smallest eigenvalue μ and the corresponding vector v of
        S_A v = μ S_B v, v = (α, β, γ)^T
    μ_k := (μ + (x_k, X_k))/2
    x_{k+1} := α w_k + β x_k + γ p_k,    x_{k+1} := x_{k+1}/‖x_{k+1}‖
    p_{k+1} := α w_k + γ p_k,            p_{k+1} := p_{k+1}/‖p_{k+1}‖
    X_{k+1} := α W_k + β X_k + γ P_k,    X_{k+1} := X_{k+1}/‖x_{k+1}‖
    P_{k+1} := α W_k + γ P_k,            P_{k+1} := P_{k+1}/‖p_{k+1}‖
    w_{k+1} := T (X_{k+1} - μ_k x_{k+1}),  w_{k+1} := w_{k+1}/‖w_{k+1}‖
enddo

(b) Preconditioned conjugate gradient method

Figure 2: Algorithms of the two eigenvalue solvers.

For these reasons, the PCG method is quite attractive and promising.
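The PCG algorithm of Fig. 2(b) follows Knyazev's locally optimal block preconditioned conjugate gradient (LOBPCG) idea [12, 13]. As a serial illustration only, the sketch below applies SciPy's lobpcg routine with a simple diagonal preconditioner to a small sparse stand-in matrix; it is not the parallel Earth Simulator implementation described in this paper.

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lobpcg

# Small sparse symmetric stand-in (a 1-D Laplacian); the real Hamiltonian is
# far larger and distributed, so this only illustrates the algorithmic idea.
n = 2000
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")

rng = np.random.default_rng(0)
X = rng.standard_normal((n, 1))      # initial guess for the lowest eigenvector

# The matrix T in Fig. 2(b) is the preconditioner; a simple Jacobi
# (diagonal) preconditioner is assumed here for illustration.
M = sp.diags(1.0 / A.diagonal())

# largest=False selects the smallest eigenvalue, i.e. the ground state.
eigval, eigvec = lobpcg(A, X, M=M, tol=1e-8, maxiter=500, largest=False)
print("smallest eigenvalue:", eigval[0])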

2.2 Core Calculation and its Parallelization
In both the Lanczos and PCG methods (see Figs. 2(a) and (b)), the core calculation is Hv, where H is the Hamiltonian matrix and v is a vector. Both methods repeat this core calculation several times. We therefore describe the numerical algorithm and the parallelization scheme for Hv. In matrix representation, the Hubbard Hamiltonian H of Eq. (1) is given as

H = I \otimes A + A \otimes I + D, \qquad (2)

where I, A, and D are the identity matrix, the sparse matrix arising from hopping between neighboring sites for up (down) spins, and the diagonal matrix originating from the on-site repulsion, respectively. Here, we note that when the dimension of A is n, the dimension of the Hamiltonian H is n^2. By dividing and rearranging the eigenvector v = (v_1, v_2, \dots, v_{n^2}), we obtain the following dense matrix

V = \left( (v_1, \dots, v_n)^T, \dots, (v_{(n-1)n+1}, \dots, v_{n^2})^T \right).

With the matrix V, we can represent the core calculation Hv as

Hv = A V + V A^{T} + D \circ V,

where the k-th diagonal element d_k of the matrix D is mapped onto the matrix D according to the same rule as in v → V, and the operator \circ denotes elementwise (Hadamard) multiplication. In order to distribute the calculation load of Hv uniformly over the processors, we execute the following sequence of calculations and communications:

CAL1: E^c = D^c \circ V^c,
CAL2: W_1^c = E^c + A V^c,
COM1: communication to transpose V^c into V^r,
CAL3: W_2^r = V^r A^T,
COM2: communication to transpose W_2^r into W_2^c,
CAL4: W^c = W_1^c + W_2^c,

where CAL and COM denote calculation and communication, respectively, and the superscripts c and r denote columnwise and rowwise data partitioning, respectively. In this sequence, COM1 and COM2 are all-to-all communications. Since the ES developer recommends the use of MPI_Put for all-to-all communication, we use MPI_Put for COM1 and COM2. Here, let us compare the calculation and communication costs of all CALs and COMs. Figure 3 shows the elapsed time of each stage. The example problem is the 2-dimensional, 24-site (6 up-spins, 6 down-spins) Hubbard model, whose matrix dimension is about 18 billion. From Fig. 3, the sum of the costs of COM1 and COM2 is almost half of the total cost. This indicates that overlapping communication with calculation is crucial to improve performance. The overlap scheme is as follows. From the dependencies between the stages, the calculations CAL1 and CAL2 and the communication COM1 are clearly independent, so their overlap is straightforward. Moreover, although the relationship between CAL3 and COM2 is not as simple, the overlap can in principle be realized in a pipelined fashion, as shown in Fig. 4. Thus, the two communication stages (COM1 and COM2) are hidden behind the calculations.

Figure 3: The cost distribution of the core calculation (Hv).
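The identity behind Eq. (2) and the matrix form Hv = A V + V A^T + D \circ V can be checked serially on a tiny example; the NumPy sketch below is only such a check under an illustrative block convention and does not reflect the parallel CAL/COM implementation.

import numpy as np

# Serial check of the structure behind Eq. (2): for H = I(x)A + A(x)I + D,
# H v equals A V + V A^T + D o V once v (and diag(D)) are reshaped into
# n-by-n matrices built from consecutive length-n blocks of v.
n = 6
rng = np.random.default_rng(1)
A = rng.standard_normal((n, n)); A = (A + A.T) / 2   # symmetric hopping block
d = rng.standard_normal(n * n)                        # diagonal of D
v = rng.standard_normal(n * n)

# Explicit dense Hamiltonian, feasible only for tiny n.
I = np.eye(n)
H = np.kron(I, A) + np.kron(A, I) + np.diag(d)
ref = H @ v

# Matrix form used in the core calculation; '*' is the elementwise product.
V = v.reshape(n, n)
D = d.reshape(n, n)
W = A @ V + V @ A.T + D * V
print(np.allclose(ref, W.reshape(-1)))                # True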

2.3 A Parallelization Technique for PCG
Here, let us describe a parallelization technique for saving memory. This is important for the PCG method, which requires more memory than the Lanczos method. When using the algorithm with communication overlap shown in Fig. 4, COM1 and COM2 generally require three N-word buffers, where N is the dimension of the Hamiltonian matrix H. We find that a fence (MPI_Win_fence) makes it possible to reuse a buffer and thus to save buffer memory, as schematically shown in Fig. 5. However, since the cost of a fence is high, repeated fences degrade performance. Table 1 shows how the elapsed time grows with increasing fence count.¹ From these data, we find that 4 is the most effective fence count for this algorithm. We call this method "PCG with fence" in the following.
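For readers unfamiliar with the one-sided pattern used for COM1/COM2 and the buffer reuse via MPI_Win_fence, the following is a minimal mpi4py sketch of MPI_Put between two fences; the window size, data layout, and names are illustrative and do not correspond to the Earth Simulator code.

# Minimal sketch of the MPI_Put / MPI_Win_fence pattern (run with e.g.
# `mpiexec -n 4 python put_fence.py`).  Sizes and names are illustrative.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

chunk = 4
recv = np.zeros(size * chunk)                       # memory exposed to remote Puts
win = MPI.Win.Create(recv, disp_unit=recv.itemsize, comm=comm)

send = np.full(chunk, float(rank))                  # local piece for every rank

win.Fence()                                         # open the access epoch
for dest in range(size):
    # one-sided write of our piece into slot `rank` of the destination window
    win.Put(send, dest, target=(rank * chunk, chunk, MPI.DOUBLE))
win.Fence()                                         # close the epoch; the same
                                                    # window can then be reused,
                                                    # which is the buffer saving
if rank == 0:
    print(recv)
win.Free()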

2.4 Performance on the Earth Simulator
Let us show the performance for a huge-scale matrix diagonalization using 512 nodes (4096 PEs and 8 TB of memory). The problem is the 2-dimensional, 22-site (see Fig. 6), 16-fermion (8 up-spins, 8 down-spins) Hubbard model with a harmonic confinement potential. Its matrix dimension is about 100 billion. For comparison, we show the performance not only of the PCG with fence but also of the Lanczos method. Here, we note that the PCG without fence cannot handle this problem because of memory shortage; the PCG with fence removes one N-word communication buffer. The performance comparison in Table 2 shows that, in total, the PCG with fence and the Lanczos method take 18.7 and 266.2 seconds, respectively, to calculate the ground state. The PCG with fence is thus about 14 times faster than the conventional Lanczos method. The performance of the PCG with fence and the Lanczos method is 18.692 TFlops and 14.397 TFlops, respectively, so the PCG with fence performs about 1.3 times better than the Lanczos method. However, we note that the memory requirement of the Lanczos method is about 0.7 times that of the PCG with fence, so the Lanczos method still has an advantage for larger calculations. Thus, the PCG with fence is mainly used in this project, while the Lanczos method is employed only when the problem size exceeds the limit of the PCG with fence.

2.5 Simulation Result I: Inhomogeneous Superfluidity
Recently, since atomic-scale inhomogeneity has been observed in High-Tc cuprate superconductors [15], whether this microscopic inhomogeneity is intrinsic or not has been intensively debated. This is because such an inhomogeneity challenges conventional superconductivity theory. Generally, when a system enters the superconducting phase, the spatial variation of superconductivity is set by the coherence length (the Cooper-pair radius), which is usually far larger than atomic scales.

¹A latency of MPI_Win_fence on 512 nodes is 223.75 microseconds [14].

Figure 4: A data-transfer diagram overlapping calculation (CAL3) with one-sided communication (COM2, MPI_Put) in a case using three nodes.

Figure 5: A data-transfer diagram overlapping calculations (CAL3 and CAL4) with communication (COM2).

Table 1: Relationship between fence counts and performance for the multiplication of an 18-billion-dimensional matrix and a vector on 128 nodes of the Earth Simulator.

Fence counts   Elapsed time (sec)   No. of buffers
1              0.24480              3
2              0.24930              3
4              0.24780              2
8              0.25815              1.5
16             0.27818              1.25
32             0.29784              1.125
64             0.34454              1.0625
128            0.39398              1.03125

Table 2: Performance of the Lanczos method and the PCG method with fence on the Earth Simulator.

Method    No. of iter.   Elapsed time (sec)   TFlops   Error      Memory (TB)
Lanczos   250            266.157              14.397   2.1×10⁻⁹   5.4
PCG       29             18.681               18.693   8.7×10⁻⁹   6.9

Figure 6: 2-dimensional 22-site model.

Thus, atomic-scale disorder is normally smeared out, and the atomic-scale inhomogeneity widely observed in the High-Tc cuprates remains unexplained. In this paper, we therefore numerically explore how atomic-scale inhomogeneity arises in attractively interacting fermions. Our model is the 2-dimensional Hubbard model with a harmonic confinement potential and a negative on-site interaction U. Here, we note that a sufficiently large negative U is well known to drive the crossover from the weak-coupling Bardeen-Cooper-Schrieffer superconducting state to the strong-coupling superconducting state. Thus, the problem reduces to whether the presence of a smooth harmonic confinement potential is responsible for lattice-scale, i.e., atomic-scale, inhomogeneity. Let us explain why we choose the harmonic confinement potential. There are two reasons. The first is that the harmonic-well potential qualitatively captures the confinement of low-energy particles inside a vortex core [16], which has been studied intensively in High-Tc cuprate and other superconductors. We stress that atomic-scale inhomogeneity is experimentally remarkable especially inside the vortex core [17]. The second reason is that the system fully describes an atomic Fermi gas loaded on an optical lattice [2, 3], which has been realized very recently. This implies that the present simulation results for this model can be confirmed directly by experiments.

Now, let us show typical calculation results in Fig. 7 and Fig. 8. These figures display the particle (electron) density distribution on a 2-dimensional, 25-site lattice. The numbers of fermions in Fig. 7 and Fig. 8 are (4 ↑, 4 ↓) and (6 ↑, 6 ↓), respectively, and U/t is -10 in both cases. These results clearly show that atomic-scale inhomogeneity emerges, because the particle density does not follow the shape of the harmonic-well potential. The particle density profile shows checkerboard patterns, as seen in Figs. 7 and 8. In addition, the scale characterizing the inhomogeneous structure is found to be just the lattice spacing. We note that these results are qualitatively similar to the experimental electronic structures obtained by scanning tunneling spectroscopy inside High-Tc superconducting vortex cores [15, 17], which show a checkerboard-like pattern of bound-state levels [18]. We also note that these calculations are a first, and that the HPC techniques are crucial for this numerical observation, since such a checkerboard pattern cannot be reached without the enlargement of the system size enabled by the parallelization techniques.

3. FULL DIAGONALIZATION
In order to investigate the exact dynamics of quantum many-body systems, we develop a full diagonalization solver and a simulation code for their time evolution. Full diagonalization is an essential scheme for studies of quantum dynamics and for other issues in quantum chemistry and physics, and the development of effective solvers is frequently an important theme in these fields. It should be noted that the numerical algorithm for full diagonalization is completely different from the exact diagonalization scheme, which solves only the ground and low-lying excited states.

3.1 Three Steps and their Algorithms
When solving for all eigenstates, it is generally advantageous, in terms of CPU and memory resources, to use a direct method for a dense matrix even if the target matrix is sparse. Figure 9 schematically shows the three steps for solving all eigenstates of a dense matrix in ScaLAPACK [19], one of the most famous parallel libraries. The three steps are as follows:

1. Tridiagonalization of the symmetric matrix by the Householder transformation,

2. Calculation of all eigenvalues and eigenvectors of the tridiagonal matrix,

3. Calculation of all eigenvectors of the original matrix by the back transformation.

In the following, we compare typical algorithms for the three steps and either select the best one or develop an alternative ourselves. For the first step, the Householder tridiagonalization of ScaLAPACK is widely used, but several papers report that its performance is poor. As an alternative, we therefore develop an original routine adopting the loop fusion method proposed by Naono et al. [20] and optionally using Bischof's algorithm [21]. Yamamoto [22] reported that, since Level 3 BLAS can be utilized when executing Bischof's algorithm, it is expected to be twice as fast as the standard Dongarra algorithm. However, since the cost of the back transformation becomes large in this case, Bischof's algorithm may not be suitable for calculating many or all eigenvectors. To check whether this is the case, we compare the actual elapsed-time ratio of the tridiagonalization to the back transformation. The ratios are approximately 4:3 and 3:2 on the Earth Simulator and the SGI Altix 3700Bx2, respectively (see Figs. 14 and 18). Bischof's algorithm pays off only if the cost of the tridiagonalization is more than twice that of the back transformation, and the above ratios do not satisfy this condition. Thus, we finally select Dongarra's algorithm for the first step; the WY block algorithm [23] is then automatically selected for the third step. We develop programs for the first and third steps based on these algorithms ourselves. For the second step, there are several methods, such as bisection plus inverse iteration, the QR method, the divide and conquer method, and the MRRR (Multiple Relatively Robust Representations) method [24]. At present, the most stable and fastest algorithm in ScaLAPACK is the divide and conquer method, while PLAPACK (Revision 3.2) [25], developed at the University of Texas at Austin, includes a routine for the MRRR method. Since there is a report that the Householder transformation of PLAPACK does not show better performance [26], we adopt the divide and conquer routine (pdstedc) of ScaLAPACK. However, ScaLAPACK is not officially supported on the Earth Simulator, so we install ScaLAPACK on the Earth Simulator ourselves.
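As a serial, small-scale illustration of the three stages discussed above (tridiagonalization, tridiagonal eigensolver, back transformation), the following SciPy sketch uses library routines in place of the parallel Householder, divide and conquer, and WY back-transformation codes; it relies on scipy.linalg.hessenberg, which applies Householder reflections and yields a tridiagonal form for symmetric input.

import numpy as np
from scipy.linalg import hessenberg, eigh_tridiagonal

# Serial sketch of the three stages on a small symmetric matrix; the paper's
# solver performs the same steps in parallel on matrices of dimension 375,000.
n = 500
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n)); A = (A + A.T) / 2

# Step 1: orthogonal reduction A = Q T Q^T; for symmetric A the Hessenberg
# form is tridiagonal (Householder reflections are applied internally).
T, Q = hessenberg(A, calc_q=True)
d, e = np.diag(T), np.diag(T, 1)

# Step 2: all eigenpairs of the tridiagonal matrix T.
w, S = eigh_tridiagonal(d, e)

# Step 3: back transformation to eigenvectors of the original matrix.
V = Q @ S

print(np.allclose(A @ V, V * w))   # A V = V diag(w)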

3.2 Parallelization
Here, let us briefly describe the parallelization of the full diagonalization solver. The original sequential algorithms for the Householder transformation and its back transformation are given in Fig. 10 and Fig. 11, respectively, and their parallel versions are shown in Fig. 12 and Fig. 13. As seen in Figs. 12 and 13, we adopt a 2-dimensional partition for the data parallelization, as in ScaLAPACK. For the first step we use a (simple-cyclic, simple-cyclic) distribution rather than (block-cyclic, block-cyclic). On the other hand, we employ the block-cyclic distribution for the divide and conquer routine, since the simple cyclic distribution turned out not to be effective in our tests on the Earth Simulator. Based on the test data, we also find that the optimal block width is 48.
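To make the two distributions concrete, the small sketch below maps a global matrix element (i, j) to its owner on a P x Q process grid under a (simple-cyclic, simple-cyclic) and a (block-cyclic, block-cyclic) layout; the 0-based indexing convention and the helper names are illustrative, with the block width 48 taken from the tuning result quoted above.

# Owner process of global element (i, j) on a P x Q process grid under the
# two distributions discussed above; 0-based indices, conventions illustrative.
def owner_cyclic(i, j, P, Q):
    # (simple-cyclic, simple-cyclic): consecutive rows/columns wrap around
    # the process grid one at a time.
    return (i % P, j % Q)

def owner_block_cyclic(i, j, P, Q, nb=48):
    # (block-cyclic, block-cyclic) with block width nb, as used for the
    # divide and conquer stage (nb = 48 was found optimal in our tests).
    return ((i // nb) % P, (j // nb) % Q)

if __name__ == "__main__":
    P, Q = 4, 4
    print(owner_cyclic(100, 7, P, Q))        # -> (0, 3)
    print(owner_block_cyclic(100, 7, P, Q))  # -> (2, 0)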

3.3 Performance Test
Let us present the performance test data of our routines on the Earth Simulator (see the Appendix for the performance test data on the SGI Altix 3700Bx2 at JAEA). The test matrix is a Frank matrix. First, we show test data on 64 nodes (512 CPUs) of the Earth Simulator. In this case, our solver takes 1,225 seconds (1.652 TFlops, 40.3% of the peak) and 3,907 seconds (2.299 TFlops, 56.1% of the peak) to solve the eigenvalue problems for a 60,000-dimensional and a 100,000-dimensional matrix, respectively. This result indicates that the solver works reasonably up to 100,000 dimensions.

Figure 7: The particle density distribution for the 2-dimensional, 25-site, 8-fermion (4 ↑, 4 ↓) model. U/t = -10 and V/t = 1.

Figure 8: The particle density distribution for the 2-dimensional, 25-site, 12-fermion (6 ↑, 6 ↓) model. U/t = -10 and V/t = 1.

Figure 9: Three steps for calculating all eigenstates of a symmetric dense matrix (ScaLAPACK: pdsyevd). Step 1: Householder tridiagonalization (pdsytrd), O(N³); Step 2: divide and conquer for the tridiagonal eigenproblem (pdstedc), O(N²) to O(N³); Step 3: back transformation (pdormtr), O(N³).

Next, let us show the performance data for a 280,000-dimensional matrix on 256 nodes (2048 CPUs), 384 nodes (3072 CPUs), 480 nodes (3840 CPUs), and 512 nodes (4096 CPUs) in Fig. 14. We confirm that our code keeps speeding up to 512 nodes (4096 CPUs). Moreover, we study the matrix-size dependence of the elapsed time on 512 nodes (4096 CPUs) in Fig. 15, where the matrix size varies from 280,000 to 375,000. These results show that all eigenvalues and eigenvectors of a 375,000-dimensional matrix are solved within 10,000 seconds. To our knowledge, this 375,000-dimensional result is the world record for full diagonalization in real applications, and it is also important information for scientific fields that need huge-matrix full diagonalization, i.e., quantum physics and chemistry. Next, the Flops rates of the routines (for the 375,000-dimensional matrix on 512 nodes) are as follows:

• Red (Householder transformation): 13.2 TFlops (40% of the peak),

• Backtrafo (back transformation): 24.6 TFlops (75% of the peak).

These results indicate that our solver achieves excellent performance. In particular, 24.6 TFlops (75% of the peak) is the best peak ratio reported on the Earth Simulator, although it is a local performance figure.

Here, let us summarize the present status and the remaining issues of the parallel tuning. In our solver, the matrix data is partitioned in a 2-dimensional cyclic fashion, and the processor group can be arranged flexibly in arbitrary configurations. Thus, we can minimize the communication cost. Moreover, we tune the solver for a scalar processor as well as for the vector one (see the Appendix for the tuning on a scalar machine), i.e., we achieve high performance on both architectures. At present, our solver can solve the eigenproblem for a 375,000-dimensional matrix, while there still remains room for optimization. In the future, we will advance the full diagonalization of huge matrices beyond 500,000 dimensions after sufficient analysis at the 375,000-dimension level.

3.4 Simulation Results II: From Quantum to Classical Dynamics

! Householder transformation, so-called Dongarra's algorithm
for j = N, ..., 1 step -M
    U ← ∅, V ← ∅, W ← A(*, j-M+1:j)
    for k = 0, ..., M-1
        (1) Householder reflector: u(k) = H(W(*, j-k))
        (2) Matrix-vector multiplication: v(k-2/3) ← A(1:j-k-1, 1:j-k-1) u(k)
        (3) v(k-1/3) ← v(k-2/3) - (U V^T + V U^T) u(k)
        (4) v(k) ← v(k-1/3) - ((u(k), v(k-1/3)) / 2|u(k)|^2) u(k)
            U ← [U, u(k)], V ← [V, v(k)]
        (5) Local update:
            W(*, j-k:j) ← W(*, j-k:j) - (u(k) v(k)^T + v(k) u(k)^T)(*, j-k:j)
    endfor
    A(*, j-M+1:j) ← W
    (6) 2M rank-update:
        A(1:j-M, 1:j-M) ← A(1:j-M, 1:j-M) - (U V^T + V U^T)(1:j-M, 1:j-M)
endfor

Figure 10: Algorithm of the Householder transformation.

! Householder back transformation with WY representation
! X ← (I + α_2 u_2 u_2^T) ... (I + α_{N-2} u_{N-2} u_{N-2}^T)(I + α_{N-1} u_{N-1} u_{N-1}^T) X
for k = N-1, ..., 2 step -M'
    Form WY representation:
        W ← ∅, Y ← ∅
        for j = k, MAX(k-M'+1, 2) step -1
            z = (I + W Y^T) u_j
            W ← [W, z], Y ← [Y, α_j u_j]
        endfor
    X ← (I + W Y^T) X
endfor

Figure 11: Algorithm of the back transformation.

Quantum dynamics changes into classical dynamics when the number of particles in a system becomes enormous. This concept is well known through a very famous story, Schrödinger's cat. The story tells us that if a system is composed of not too many atoms or molecules and obeys the quantum law, then the so-called Schrödinger's cat linked to that quantum system can wander between alive and dead until someone observes it. From this story a fundamental question arises: how many atoms or molecules do we need before the Schrödinger's cat turns into a classical one? This question has attracted a great number of physicists, but no one has answered it conclusively.

Our initial motivation for this study using full diagonalization comes in part from this mysterious question about Schrödinger's cat. On the other hand, several experimental confirmations of quantum dynamics in atomic gases keep inspiring us to study various new quantum ideas. In particular, in atomic gases the interaction is freely tunable from attraction to repulsion, which drives us to test how the interaction affects the quantum dynamics [18]. In the story of Schrödinger's cat, effects of the interaction are neglected, but we hypothesize that the interaction is also a crucial factor controlling the cat's state. The reason is that the interaction entangles many quantum states and effectively enhances the number of degrees of freedom, even though the number of particles is fixed. In order to test this idea, we start studies of exact quantum dynamics with the full diagonalization solver (see Sections 3.1-3.3 for details of the development and analysis of the solver). Our computational target is to perform full diagonalization for matrices whose dimensions exceed 300,000, which realizes the exact quantum dynamics of systems composed of more than 10 particles. In this paper, we study the quantum dynamics of the Hubbard model with a time-varying confinement potential and clarify the effect of the interaction on the dynamics. The Hubbard model is so simple that one can study the exact quantum dynamics of systems composed of more than 10 particles if one uses a top-class supercomputer like the Earth Simulator.

Figure 16 shows the time-varying confinement potential, whose shape is assumed to change from the left-hand-side form to the right-hand-side form at time = 0. From this change of potential shape, we expect that all particles initially assemble at the bottom of the deep left-hand-side well (time < 0) and, if the system behaves quantum mechanically, scatter equally into both wells of the symmetric double-well potential during the time evolution (time > 0). On the other hand, if the system behaves classically, the particles remain at the bottom of the deep well, since the potential barrier at the center prohibits the transfer (tunneling) of particles.

! R: rowwise-split process group, partial matrix, vector, or partial sum;
!    for example, x^R = {x_myrow, x_myrow+nrows, x_myrow+2nrows, ...}^T.
! C: columnwise-split counterparts, defined as in R.
for j = N, ..., 1 step -M
    U ← ∅, V ← ∅ in both R and C, W ← A_owning[*, j-M+1:j]
    for k = 0, ..., M-1
        (1) (u(k))^R ← H(W(*, j-k))
            If (I own (u(k))^R) then
                Broadcast (u(k))^R over R
            else
                Receive (u(k))^R from owner on R
            endif
        (2) (v(k-2/3))^R ← [A(,) u(k)]^R
        (3) s^R ← (V^R)^T (u(k))^R, t^R ← (U^R)^T (u(k))^R
            Global sum s^R → s, t^R → t over C
            (v(k-1/3))^R ← (v(k-2/3))^R - (U^R s + V^R t)
        (4) s^R ← ((u(k))^R)^T (v(k-1/3))^R, t^R ← ‖(u(k))^R‖^2
            Global sum s^R → s, t^R → t over C
            (v(k))^R ← (v(k-1/3))^R - (s/2t) (u(k))^R
            Redistribute {(u(k))^R} → (u(k))^C, {(v(k))^R} → (v(k))^C over C
        (5) W_owning[*, j-k:j] ← W_owning[*, j-k:j] - ((u(k))^R ((v(k))^C)^T + (v(k))^R ((u(k))^C)^T)
            V ← [V, v(k)], U ← [U, u(k)] in both R and C
    endfor
    A_owning[*, j-M+1:j] ← W
    A_owning[1:j-M, 1:j-M] ← A_owning[1:j-M, 1:j-M] - (U^R (V^C)^T + V^R (U^C)^T)
endfor

Figure 12: Parallel algorithm of the Householder transformation.

! 2-dimensional implementation of the back transformation with WY representation.
! R: rowwise-split process group, partial matrix, vector, or partial sum;
!    for example, x^R = {x_myrow, x_myrow+nrows, x_myrow+2nrows, ...}^T.
! C: columnwise-split counterparts, defined as in R.
for k = N-1, ..., 2 step -M'
    k' = MAX(k-M'+1, 2)
    for j = k, k' step -1
        If (I own u_j^R) then
            Broadcast u_j^R over R
        else
            Receive u_j^R from owner on R
        endif
    endfor
    W^R ← ∅, Y^R ← ∅, s(j, j')^R ← α_j (u_j^R)^T u_{j'}^R, (k' ≤ j' < j ≤ k)
    Global sum s(j, j')^R → s(j, j') over C
    for j = k, k' step -1
        z^R ← u_j^R + Σ_{j'=k'}^{j-1} s(j, j') w_{j'}^R
        W^R ← [W^R, z^R (= w_j^R)], Y^R ← [Y^R, α_j u_j^R (= y_j^R)]
    endfor
    s(j)^R ← (y_j^R)^T X^R_owning, (j = k', ..., k)
    Global sum s(j)^R → s(j) over C
    X^R_owning ← X^R_owning - Σ_{j=k'}^{k} w_j^R s(j)
endfor

Figure 13: Parallel algorithm of the back transformation.

Figure 14: The elapsed time vs. the number of CPUs for the full diagonalization of a 280,000-dimensional matrix. Total, Red, Eig, and Backtrafo denote the total time, the Householder transformation, the divide and conquer step, and the back transformation, respectively.

Figure 15: The elapsed time vs. the matrix dimension on 512 nodes (4096 PEs). The other conditions are the same as in Fig. 14.

In order to confirm whether such a change from quantum to classical behavior occurs when the interaction is changed, we perform the following simulation. First, we obtain the ground state in the presence of the left-hand-side asymmetric potential in Fig. 16, and second, we solve for all eigenstates in the presence of the right-hand-side symmetric potential in Fig. 16. From the overlaps of the ground state with all the eigenstates and from all the eigenvalues, we trace the exact time evolution of the initial state through the time-dependent Schrödinger equation.
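The spectral time evolution described in this paragraph can be written compactly once all eigenpairs of the post-quench Hamiltonian are available. The NumPy sketch below is a generic illustration with placeholder dense Hamiltonians (hbar = 1); it is not the actual Hubbard code.

import numpy as np

# Spectral time evolution: expand the pre-quench ground state in the eigenbasis
# of the post-quench Hamiltonian and attach phases exp(-i E_n t)  (hbar = 1).
# The two dense matrices are placeholders for the Hamiltonians with the two
# potential shapes of Fig. 16.
n = 400
rng = np.random.default_rng(0)
H_before = rng.standard_normal((n, n)); H_before = (H_before + H_before.T) / 2
H_after = rng.standard_normal((n, n)); H_after = (H_after + H_after.T) / 2

# Ground state of the initial (deep, asymmetric) potential ...
_, vecs0 = np.linalg.eigh(H_before)
psi0 = vecs0[:, 0]

# ... and the full spectrum of the final (symmetric double-well) potential.
E, V = np.linalg.eigh(H_after)

def evolve(psi0, t):
    c = V.conj().T @ psi0                  # expansion coefficients c_n = <n|psi0>
    return V @ (np.exp(-1j * E * t) * c)   # psi(t) = sum_n c_n e^{-i E_n t} |n>

psi_t = evolve(psi0, t=400.0)
print(np.vdot(psi_t, psi_t).real)          # norm is conserved (= 1)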

Let us show the simulation results. Figure 17 shows the particle distributions at t = 0 (left-hand side) and after a sufficiently long time (right-hand side); the upper and lower rows correspond to U/t = 10 and U/t = 1, respectively. Comparing the upper and lower time evolutions, we find that the case U/t = 1 behaves like a quantum system, while the case U/t = 10 behaves like a classical one. This result indicates that the interaction also plays an important role in the change from quantum to classical-like dynamics.

Figure 16: The time-varying potential shape. At time = 0, the shape suddenly changes from the left-hand side to the right-hand side.

Figure 17: The time evolution for the Hubbard model with the time-varying potential shown in Fig. 16, at time = 0 (left) and time = 400 (a.u., right). The upper and lower rows correspond to U/t = 10 and U/t = 1, respectively.

We are now exploring the mechanism in more detail; we find that many high-energy states become involved in the dynamics when the strong interaction is switched on.

4. CONCLUSIONS
We developed two different types of eigenvalue solvers in order to study quantum many-body systems. The first is the so-called exact diagonalization, which solves only the ground state and a few excited states of a huge sparse matrix; our HPC challenge was to calculate them faster and to extend the matrix size limit. To accomplish this, we employed the PCG algorithm and performed parallel tuning that includes a memory-saving technique. Consequently, we succeeded in calculating the ground state of a 100-billion-dimensional matrix at 18.692 TFlops. The elapsed time of this calculation (about 18.7 sec) is more than 10 times shorter than that of the conventional Lanczos algorithm (about 266.2 sec). These advancements made it possible to study various cases systematically. The physical insight obtained by the exact diagonalization is the atomic-scale inhomogeneity of superfluidity, which cannot be observed below a problem of about 31 billion dimensions (the 25-site Hubbard model). The second approach is to solve all eigenstates of a matrix, which has always been a central issue in HPC. For the three steps of the full diagonalization, we developed solvers ourselves where proper routines were not provided and installed the proper routines where they were available. The model Hamiltonian in this study is the same as in the first study. We succeeded in diagonalizing a 375,000-dimensional matrix within 10,000 sec and obtained 24.6 TFlops (75% of the peak) in the local routine. To our knowledge, these figures (the matrix dimension for full diagonalization and the local peak ratio) are world records. Through such huge calculations, we find that the interaction may also be an important control parameter, in addition to the system size, for the change from a quantum system to a classical-like one.

Acknowledgements
The authors at CCSE, JAEA thank G. Yagawa, T. Hirayama, N. Nakajima, and C. Arakawa for their support, and acknowledge all staff members of the Earth Simulator for their support of the present calculations. One of the authors (M.M.) thanks H. Matsumoto, Y. Ohashi, and T. Koyama for their collaboration on optical-lattice fermion systems, and T. Ishida and K. Kadowaki for their financial support. Two of the authors (S.Y. and M.M.) thank Y. Oyanagi and T. Hotta for illuminating discussions on diagonalization techniques.

The work (M.M.) was partially supported by a Grant-in-Aid for Science Research on Priority Area "Physics on new quantum phases in superclean materials" (Grant Nos. 18043022 and 18043005) from the Ministry of Education, Culture, Sports, Science and Technology of Japan. This work was also supported by Grants-in-Aid for Science Research from MEXT, Japan (Grant Nos. 17540368 and 18500033).

5. REFERENCES

[1] See, for example, M. Rasetti, ed., The Hubbard Model: Recent Results, World Scientific, Singapore, 1991; A. Montorsi, ed., The Hubbard Model, World Scientific, Singapore, 1992.

[2] M. Machida, S. Yamada, Y. Ohashi, and H. Matsumoto, Novel superfluidity in a trapped gas of Fermi atoms with repulsive interaction loaded on an optical lattice, Phys. Rev. Lett., 93, 200402, 2004.

[3] M. Rigol, A. Muramatsu, G. G. Batrouni, and R. T. Scalettar, Local quantum criticality in confined fermions on optical lattices, Phys. Rev. Lett., 91, 130403, 2003.

[4] For example, see E. Dagotto, Correlated electrons in high-temperature superconductors, Rev. Mod. Phys., 66, 763, 1994.

[5] S. Yamada, T. Imamura, and M. Machida, 16.447 TFlops and 159-Billion-dimensional Exact-diagonalization for Trapped Fermion-Hubbard Model on the Earth Simulator, Proc. of SC2005, 2005. http://sc05.supercomputing.org/schedule/pdf/pap188.pdf

[6] The Earth Simulator Center, http://www.es.jamstec.go.jp/esc/eng/index.html

[7] S. Shingu, H. Takahara, H. Fuchigami, M. Yamada, Y. Tsuda, W. Ohfuchi, Y. Sasaki, K. Kobayashi, T. Hagiwara, S. Habata, M. Yokokawa, H. Itoh, and K. Otsuka, A 26.58 Tflops Global Atmospheric Simulation with the Spectral Transform Method on the Earth Simulator, Proc. of SC2002, 2002. http://sc-2002.org/paperpdfs/pap.pap331.pdf

[8] H. Sakagami, H. Murai, Y. Seo, and M. Yokokawa, 14.9 TFLOPS Three-dimensional Fluid Simulation for Fusion Science with HPF on the Earth Simulator, Proc. of SC2002, 2002. http://sc-2002.org/paperpdfs/pap.pap147.pdf

[9] M. Yokokawa, K. Itakura, A. Uno, T. Ishihara, and Y. Kaneda, 16.4 Tflops Direct Numerical Simulation of Turbulence by Fourier Spectral Method on the Earth Simulator, Proc. of SC2002, 2002. http://sc-2002.org/paperpdfs/pap.pap273.pdf

[10] D. Komatitsch, S. Tsuboi, C. Ji, and J. Tromp, A 14.6 billion degrees of freedom, 5 teraflops, 2.5 terabyte earthquake simulation on the Earth Simulator, Proc. of SC2003, 2003. http://www.sc-conference.org/sc2003/paperpdfs/pap124.pdf

[11] A. Kageyama, M. Kameyama, S. Fujihara, M. Yoshida, M. Hyodo, and Y. Tsuda, A 15.2 TFlops Simulation of Geodynamo on the Earth Simulator, Proc. of SC2004, 2004. http://www.sc-conference.org/sc2004/schedule/pdfs/pap234.pdf

[12] A. V. Knyazev, Preconditioned eigensolvers - an oxymoron?, Electronic Transactions on Numerical Analysis, Vol. 7, 104-123, 1998.

[13] A. V. Knyazev, Toward the optimal eigensolver: Locally optimal block preconditioned conjugate gradient method, SIAM J. Sci. Comput., 23, 517-541, 2001.

[14] H. Uehara, M. Tamura, K. Itakura, and M. Yokokawa, MPI Performance Evaluation on the Earth Simulator (in Japanese), Transactions on High Performance Computing Systems, 44 (SIG 1 (HPS 6)), 24-34, 2003.

[15] J. E. Hoffman, E. W. Hudson, K. M. Lang, V. Madhavan, H. Eisaki, S. Uchida, and J. C. Davis, A four unit cell periodic pattern of quasi-particle states surrounding vortex cores in Bi2Sr2CaCu2O8+δ, Science, Vol. 295, 2002.

[16] M. Machida and T. Koyama, Friedel oscillation in charge profile and position dependent screening around a superconducting vortex core, Phys. Rev. Lett., Vol. 90, 077003, 2003.

[17] G. Levy, M. Kugler, A. A. Manuel, and Ø. Fischer, Fourfold structure of vortex-core states in Bi2Sr2CaCu2O8+δ, Phys. Rev. Lett., Vol. 95, 257005, 2005.

[18] M. Machida and T. Koyama, Structure of a quantized vortex near the BCS-BEC crossover in an atomic Fermi gas, Phys. Rev. Lett., Vol. 94, 140401, 2005.

[19] ScaLAPACK, http://www.netlib.org/scalapack/scalapack_home.html

[20] K. Naono, Y. Yamamoto, M. Igai, and H. Hirayama, "High performance implementation of tridiagonalization on the SR8000", in Proc. of High Performance Computing in Asia-Pacific Region (HPC-ASIA2000), Beijing, pp. 206-219, 2000.

[21] C. Bischof, X. Sun, and B. Lang, "Parallel tridiagonalization through two-step band reduction", in Proc. of the Scalable High Performance Computing Conference, pp. 23-27, IEEE, 1994.

[22] Y. Yamamoto, Performance and Accuracy of Algorithms for Computing the Eigenvalues of Real Symmetric Matrices on Cache-based Multiprocessors (in Japanese), Transactions on High Performance Computing Systems, 46 (SIG 3 (ACS 8)), 81-91, 2005.

[23] C. Bischof and C. Van Loan, "The WY Representation for Products of Householder Matrices", SIAM J. Sci. Stat. Comput., Vol. 8, No. 1, 1987.

[24] P. Bientinesi, I. S. Dhillon, and R. van de Geijn, "A Parallel Eigensolver for Dense Symmetric Matrices Based on Multiple Relatively Robust Representations", SIAM J. Sci. Comput., Vol. 27, No. 1, pp. 43-66, 2005.

[25] P. Alpatov et al., "PLAPACK: Parallel Linear Algebra Package", in Proc. of the SIAM Parallel Processing Conference, 1997. http://www.cs.utexas.edu/~plapack/

[26] E. Breitmoser and A. G. Sunderland, "A performance study of the PLAPACK and ScaLAPACK Eigensolvers on HPCx for the standard problem", Technical Report from the HPCx Consortium, 2004. http://www.hpcx.ac.uk/research/hpc/HPCxTR0406.pdf

APPENDIX

Performance Test on the SGI Altix 3700Bx2
In this appendix, we show the performance test results of our full diagonalization solver on an SGI Altix 3700Bx2 with Itanium 2 processors at 1.6 GHz (JAEA) as reference data. Figure 18 compares the vendor-supplied ScaLAPACK routines (back row) and our routines (front row) when solving a 20,000-dimensional matrix. Note that a tuning that accounts for the cache memory of the scalar processor is implemented in our solver. The result shows that the elapsed time of our solver is about 30% less than that of the ScaLAPACK routines. Figure 19 and Figure 20 show the performance data (MFlops vs. matrix dimension) for the Householder transformation (Red) and the back transformation (Backtrafo), respectively. These results show that both transformations achieve quite high performance; in particular, more than 50% of the theoretical peak performance in the back transformation is remarkable. These data indicate that our solver offers high scalability on a scalar-parallel machine as well as on the vector machine.

Figure 18: The elapsed time vs. the number of CPUs for the full diagonalization of a 20,000-dimensional matrix on the Altix 3700Bx2. The front and back rows correspond to data of our solver and ScaLAPACK, respectively. The other conditions are the same as in Fig. 14.

Figure 19: MFlops vs. matrix dimension for the Householder transformation on 128 PEs and 256 PEs on the Altix 3700Bx2.

Figure 20: MFlops vs. matrix dimension for the back transformation on 128 PEs and 256 PEs on the Altix 3700Bx2.