FPGA accelerated FDTD Simulation on the Cray XD1 using Impulse C
CUG 2006, Lugano, Switzerland, May 9, 2006
Peter Messmer*, David Smithe, Paul Schoessow
Tech-X Corporation
Ralph Bodenner
Impulse Accelerated Technologies
Outline
• The algorithm
– FDTD and its applications
• The tools used
– XD1 system overview, FPGA
– Impulse C
• Porting FDTD to FPGA
– Pure software based optimizations
– Initial port to Impulse C/FPGA
– Further optimizations
• Summary and Conclusion
Complex electromagnetic phenomena require simulations
www.txcorp.com/products/VORPAL
Particle beam in a cavity, simulated by
the Plasma Simulation Framework
VORPAL. At the heart of this code sits
an implementation of the FDTD
algorithm.
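Time-dependent solution of Maxwell’s equations
• Initial condition:
– Solution to Poisson’s equation known: $\nabla \cdot \mathbf{E} = \rho$
– No magnetic monopoles: $\nabla \cdot \mathbf{B} = 0$
• Evolve dynamic Maxwell’s equations in time:
$\frac{\partial \mathbf{E}}{\partial t} = \nabla \times \mathbf{B} - \mathbf{J}$
$\frac{1}{c^2} \frac{\partial \mathbf{B}}{\partial t} = -\nabla \times \mathbf{E}$
• Current satisfies continuity equation: $\nabla \cdot \mathbf{J} = -\frac{\partial \rho}{\partial t}$
For this project: no charges/currents!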
FDTD / Yee Grid
• Discretization of Maxwell’s equations
– Finite differences for curl and time derivative, e.g.
Bz(t+1) = Bz(t) + dt ( (Ex(y+1) – Ex) / dy – (Ey(x+1) – Ey) / dx )
• Yee grid
– Arrangement of EM field components
– Spatially centred finite differences
• Second order in space
– Leap-frog of Ampere/Faraday
• Second order in time
[Figure: Yee grid cell showing the positions of the Ex, Ey and Bz field components]
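The example Bz update on this slide can be sketched in C as follows. This is a minimal illustration, not the authors' code: the function name `update_bz`, the row-major array layout, the use of `double`, and the periodic boundaries are all assumptions made for the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* One leap-frog half step for Bz on a 2D grid, following the slide:
 *   Bz(t+1) = Bz(t) + dt * ((Ex(y+1) - Ex)/dy - (Ey(x+1) - Ey)/dx)
 * Assumptions of this sketch: row-major storage (index = x + y*nx)
 * and periodic wrap-around at the grid edges. */
void update_bz(double *bz, const double *ex, const double *ey,
               size_t nx, size_t ny, double dt, double dx, double dy)
{
    assert(nx > 0 && ny > 0 && dx > 0.0 && dy > 0.0);
    for (size_t y = 0; y < ny; y++) {
        size_t yp = (y + 1) % ny;           /* +y neighbour, periodic */
        for (size_t x = 0; x < nx; x++) {
            size_t xp = (x + 1) % nx;       /* +x neighbour, periodic */
            size_t i = x + y * nx;
            double dex_dy = (ex[x + yp * nx] - ex[i]) / dy;
            double dey_dx = (ey[xp + y * nx] - ey[i]) / dx;
            bz[i] += dt * (dex_dy - dey_dx);
        }
    }
}
```

Every cell reads only old E values and writes only its own Bz, so all cells of one half step can be updated independently of each other.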
FPGAs have potential to accelerate FDTD
• Large simulations require >10^9 cells per processor and thousands of time-steps
• Straightforward CPU-based implementation:
~5·10^6 3D cell updates / second
⇒ need to accelerate cell updates
• High degree of parallelism in FDTD
– Cell half-updates independent of each other
• Various groups: FDTD on FPGA with custom made pipelines
– E.g. Culley et al. U. Cincinnati
• Today: High-level tools available for non-FPGA experts
– FDTD simple enough to experiment with FPGAs and new tools
• Have access to an FPGA enhanced system
– Cray XD1
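As a rough illustration of the scale involved, assume exactly 10^9 cells and 10^3 time steps (round values chosen here to match the first bullet) and the quoted CPU rate of about 5·10^6 cell updates per second:

```latex
t_{\text{step}} \approx \frac{10^{9}\ \text{cells}}{5 \cdot 10^{6}\ \text{cell updates/s}} = 200\ \text{s},
\qquad
t_{\text{run}} \approx 10^{3} \cdot 200\ \text{s} = 2 \cdot 10^{5}\ \text{s} \approx 2.3\ \text{days}
```

Under these assumptions a single run takes days on one processor, which is the motivation for accelerating the cell update.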
Cray XD1 System
• Cluster of XD1 chassis
– 2 XD1 chassis
• 12 nodes total
• 2 x 2.2 GHz dual-core AMD Opterons
– 1 chassis equipped with Application Accelerator
• User-programmable FPGA
– Field-programmable gate array, ‘configurable matrix of logic’
– ‘Programmable matrix of logic hardware’
– Xilinx Virtex-II Pro FPGA
• Acts as a configurable (co-)processor
• FPGA runs at 200 MHz (!)
– Have to exploit parallelism to compete with 2.2 GHz Opteron.
– “Need more than factor 10 in parallelism”
FPGA programming using Impulse C
• C-to-hardware/HDL translator
– Programming model: Application = set of processes
– Processes communicate via streams, shared memories or signals
– Processes located on either FPGA or CPU
• Impulse C dialect
– API for streams, shared memory and signals
– Functions for specifying location of processes (CPU or FPGA)
– Directives for tuning (pipelines, loop unrolling, timing)
FPGA programming using Impulse C (cont.)
• Support for Cray XD1
– Impulse C communication primitives on top of RapidArray Transport IP
– Generates all necessary hardware and software interfaces
– Generates makefiles and Xilinx ISE / XST project files
• Current limitations
– XD1 support currently in beta stage
• Only streams supported, no shared memory
– Only limited support for floating-point IP
• At time of project start not available for Cray XD1
• For more information: www.impulsec.com
Pellerin and Thibault, Practical FPGA Programming in C, Prentice Hall, 2005.
The long road from source code to an FPGA enabled application
• Create/port source using Impulse C
– Both for hardware and software processes
– Simulate in pure software
• Create VHDL or Verilog from C code
– Simulate the VHDL code
• Export application
– VHDL, interfaces, source for software processes
– makefiles/project files
• Translate VHDL into a bitstream
– Compile VHDL
– Place & Route
• Transfer bitstream and software process source to XD1
• Append header to bitstream using Cray tools
• Build CPU application
• Execute CPU application
– Impulse C wrapper loads bitstream to FPGA
Initial optimization: Rescaling
• At start of project, Impulse C limited to fixed-point arithmetic
– Floating point support currently being implemented
• Rescaling of variables
– E -> E dx, B-> B dx/dt
– Avoids most divisions and multiplications
– Only integers needed for variables
• Update is reduced to
Ez += ( By[x+1] – By – Bx[y+1] + Bx ) / cdx
… similar for Ex, Ey, Bx, By, Bz
• Benchmark problem: 400 x 200 x 1 cells, 100 timesteps
• Pure software implementation: 0.22 s (≈ 3.6·10^7 cell updates/s)
Software implementation of FDTD algorithm
// Loop over 3D grid
for (int z = 0; z < nz; z++)
  for (int y = 0; y < ny; y++)
    for (int x = 0; x < nx; x++) {
      // Index computation (periodic wrap-around in each direction)
      int ind = x + y * nx + z * nx * ny;
      int ind_px = (x + 1) % nx + y * nx + z * nx * ny;
      int ind_py = x + ((y + 1) % ny) * nx + z * nx * ny;
      int ind_pz = x + y * nx + ((z + 1) % nz) * nx * ny;
      // Curl computation
      ex[ind] += (bz[ind_py] - bz[ind] - by[ind_pz] + by[ind]) / cdx4;
      ey[ind] += (bx[ind_pz] - bx[ind] - bz[ind_px] + bz[ind]) / cdx4;
      ez[ind] += (by[ind_px] - by[ind] - bx[ind_py] + bx[ind]) / cdx4;
    }
… and similar for the B-field update.
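To try the E-field update on a small grid, the loop can be wrapped in a function. The sketch below keeps the loop body from the slide; the function name `update_e`, the argument order, and passing `cdx4` as a parameter are assumptions made here, since the slide does not show how these variables are declared.

```c
#include <assert.h>

/* E-field half step on a periodic nx x ny x nz grid with integer fields.
 * The loop body is taken from the slide; the wrapper and its signature
 * are hypothetical and only serve to make the fragment callable. */
void update_e(int *ex, int *ey, int *ez,
              const int *bx, const int *by, const int *bz,
              int nx, int ny, int nz, int cdx4)
{
    assert(nx > 0 && ny > 0 && nz > 0 && cdx4 != 0);
    for (int z = 0; z < nz; z++)
        for (int y = 0; y < ny; y++)
            for (int x = 0; x < nx; x++) {
                /* Linear index of this cell and of its +x, +y, +z
                 * neighbours, wrapping around at the grid edges. */
                int ind    = x + y * nx + z * nx * ny;
                int ind_px = (x + 1) % nx + y * nx + z * nx * ny;
                int ind_py = x + ((y + 1) % ny) * nx + z * nx * ny;
                int ind_pz = x + y * nx + ((z + 1) % nz) * nx * ny;

                /* Rescaled curl of B, one component per line. */
                ex[ind] += (bz[ind_py] - bz[ind] - by[ind_pz] + by[ind]) / cdx4;
                ey[ind] += (bx[ind_pz] - bx[ind] - bz[ind_px] + bz[ind]) / cdx4;
                ez[ind] += (by[ind_px] - by[ind] - bx[ind_py] + bx[ind]) / cdx4;
            }
}
```

C integer division truncates toward zero, so differences smaller than `cdx4` are dropped. A hardware version that shifts each operand before summing, as the Impulse C listing later in the talk does with `>> 2`, can round differently from dividing the sum.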
Initial FPGA implementation: Put curl computation onto FPGA
CPU (E field update, B field update):
• Send 9 values
• Receive 3 values
FPGA (for each field update):
• Receive 9 data elements
• Compute curl
• Return 3 data elements
Curl E only : 283 s
Curl E and Curl B : 445 s
(FPGA running at 90 MHz)
Some source code…
// Regular C function, streams as parameters
void CurlProcess(co_stream In, co_stream Out) {
  co_int32 bx, bx2, by, by2, bz, bz2;
  ….
  do {
    // Opening streams
    co_stream_open(In, O_RDONLY, INT_TYPE(32));
    co_stream_open(Out, O_WRONLY, INT_TYPE(32));
    while(!co_stream_eos(In)) {
      // Directives for tuning
#pragma CO pipeline
      // Streams IO
      co_stream_read(In, &bx, sizeof(co_int32)); bx2 = bx >> 2;
      co_stream_read(In, &by, sizeof(co_int32)); by2 = by >> 2;
      ….
      ex = bz_py2 - bz2 - by_pz2 + by2;
      co_stream_write(Out, &ex, sizeof(co_int32));
      ….
    }
    co_stream_close(Out);
    co_stream_close(In);
    // Macros for simulation
    IF_SIM(break;) // Only run once during desktop simulation
  } while (1);
}
Curl on FPGA: Works, but..
• … low performance
• Possible causes
– Large amount of data transfer CPU <-> FPGA
• 32 bit transfers
– Very little computation on FPGA
• No pipelining, no unrolling, large sequential part
Curl on FPGA : 445 s
(CPU only: 0.2s)
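A back-of-the-envelope estimate makes the transfer cost concrete. The sketch below uses the benchmark size and the 9-in/3-out word counts from the slides; it assumes that every cell's inputs are resent for each E and each B half-update, and the function names are hypothetical.

```c
#include <assert.h>

/* Benchmark parameters taken from the slides. */
enum { NX = 400, NY = 200, NZ = 1, STEPS = 100 };
enum { WORDS_SENT = 9, WORDS_RETURNED = 3, BYTES_PER_WORD = 4 };

/* Number of curl evaluations, assuming one E and one B half-update
 * per cell and time step. */
double curl_calls(void)
{
    return 2.0 * NX * NY * NZ * STEPS;
}

/* Total bytes crossing the CPU <-> FPGA link over the whole run. */
double bytes_transferred(void)
{
    return curl_calls() * (WORDS_SENT + WORDS_RETURNED) * BYTES_PER_WORD;
}

/* Average wall-clock time per curl evaluation, in microseconds. */
double microseconds_per_call(double runtime_s)
{
    assert(runtime_s > 0.0);
    return 1e6 * runtime_s / curl_calls();
}
```

Under these assumptions the 445 s run moves about 768 MB in total and spends roughly 28 µs per curl evaluation, an effective rate below 2 MB/s. That suggests the cost is dominated by per-transfer overhead rather than by raw link bandwidth.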
Going parallel and optimizing streams
• Three components of curl can be computed in parallel
• Each component a separate, identical process
• Out += ( In1 – In2 – In3 + In4 )
• Pack 2 x 32 bit words into 64 bit word
[Diagram: CPU and FPGA]
FPGA:
• Receive 5 values
• Compute curl
• Return result
3 curl processes : 243 s
(FPGA running at 140 MHz)
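The two ideas on this slide, one identical process per curl component and two 32-bit values per 64-bit stream word, can be sketched in plain C. The helper names and the choice of which value goes in the upper half are assumptions of this sketch, not taken from the talk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack two signed 32-bit field values into one 64-bit stream word.
 * 'lo' goes in bits 0..31 and 'hi' in bits 32..63; this ordering is an
 * arbitrary choice for the sketch. */
uint64_t pack_pair(int32_t lo, int32_t hi)
{
    return ((uint64_t)(uint32_t)hi << 32) | (uint64_t)(uint32_t)lo;
}

/* Inverse of pack_pair. Converting back to int32_t relies on
 * two's-complement wrap-around, as provided by mainstream compilers. */
void unpack_pair(uint64_t word, int32_t *lo, int32_t *hi)
{
    assert(lo != NULL && hi != NULL);
    *lo = (int32_t)(uint32_t)(word & 0xFFFFFFFFu);
    *hi = (int32_t)(uint32_t)(word >> 32);
}

/* One curl component as written on the slide:
 *   Out += (In1 - In2 - In3 + In4)
 * The same function serves all three components. */
int32_t curl_component(int32_t out, int32_t in1, int32_t in2,
                       int32_t in3, int32_t in4)
{
    return out + (in1 - in2 - in3 + in4);
}
```

Packing halves the number of stream transactions for the same payload, which matters when per-transfer overhead dominates.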
Fighting the data transfer bottleneck
• Communication bottleneck too big!
– No computation, just data ping-pong: 241 s
– Plain Opteron implementation: < 0.2 s!
• Avoid problem by putting entire application onto FPGA
• ‘Right way’ via shared memory, once it’s available
• Grid size limited to FPGA BRAM
[Diagram: CPU and FPGA, linked by Start Process and Stop Process]
FPGA only : 5.9 s
(FPGA running at 140 MHz)
Full FDTD on FPGA
• Initial implementation : 5.9 s
• Pipelining of curl computation : 5.1 s
• Unrolling curl : 5.0 s
• Only one memory access per clock cycle
– Splitting E, B array -> Ex, Ey, Ez, Bx, By, Bz array
• Array splitting : 4.7 s
• Still factor 10 away from plain software implementation
• Currently main loop pipeline 10 cycles per result, 2 cycles per curl
• Further array splitting (odd/even split)
• System level parallelism
Exploiting System Level Parallelism
• Only about 8 % of FPGA real estate used
• Multiple FDTD pipelines in parallel
• Domain decomposition
[Diagram: CPU and FPGA, linked by Start Process and Stop Process]
Preliminary results:
1 pipeline : 4.7 s
2 pipelines : 2.3 s
4 pipelines : 1.0 s
(FPGA running at 140 MHz)
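A minimal sketch of how a grid might be divided among several pipelines, together with the speedups implied by the preliminary timings. The slides do not say how the domain was split; cutting it into slabs of rows along y, and the names `row_range`, `slab_for_pipeline` and `speedup`, are assumptions made here.

```c
#include <assert.h>

/* Half-open range [begin, end) of grid rows assigned to one pipeline. */
typedef struct { int begin; int end; } row_range;

/* Split ny rows as evenly as possible among 'pipelines' slabs.
 * The first (ny % pipelines) slabs get one extra row. */
row_range slab_for_pipeline(int ny, int pipelines, int p)
{
    assert(ny > 0 && pipelines > 0 && p >= 0 && p < pipelines);
    int base = ny / pipelines;
    int extra = ny % pipelines;
    row_range r;
    r.begin = p * base + (p < extra ? p : extra);
    r.end = r.begin + base + (p < extra ? 1 : 0);
    return r;
}

/* Speedup of a run relative to the single-pipeline time. */
double speedup(double t_single, double t_parallel)
{
    assert(t_parallel > 0.0);
    return t_single / t_parallel;
}
```

With the reported times, `speedup(4.7, 2.3)` is about 2.0 and `speedup(4.7, 1.0)` is 4.7, i.e. roughly linear scaling given that the times are rounded and preliminary. In a slab decomposition of a stencil code like FDTD, neighbouring slabs also need each other's boundary rows at every step.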
Optimizations Summary
[Figure: bar chart of Time FPGA/CPU (logarithmic axis, 1 to 10000) versus Optimization. Bar labels: Curl on FPGA, 32bit transfer; Curl on FPGA, 64bit transfer; Curl on FPGA, 1 process; Main loop in HW; Main loop in HW, 1 process; Single component field; Single component field; Two pipelines; Four pipelines. Legend: Main loop on CPU; Main loop on FPGA; Multiple pipelines.]
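The run times quoted on the preceding slides can be collected into a small table to recompute the ratio plotted on the vertical axis. The stage names below paraphrase the slide text, and which chart bar each entry corresponds to is not asserted here.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { const char *stage; double seconds; } timing;

/* Run times quoted on the slides for the benchmark problem
 * (400 x 200 x 1 cells, 100 time steps). */
static const timing fpga_timings[] = {
    { "Curl E only on FPGA",            283.0 },
    { "Curl E and curl B on FPGA",      445.0 },
    { "3 curl processes, 64-bit words", 243.0 },
    { "Full FDTD on FPGA, initial",       5.9 },
    { "Pipelined curl",                   5.1 },
    { "Unrolled curl",                    5.0 },
    { "Array splitting",                  4.7 },
    { "2 pipelines",                      2.3 },
    { "4 pipelines",                      1.0 },
};

/* Pure software implementation on the Opteron. */
static const double cpu_seconds = 0.22;

size_t timing_count(void)
{
    return sizeof fpga_timings / sizeof fpga_timings[0];
}

/* Ratio "time FPGA / time CPU" for entry i of the table. */
double slowdown(size_t i)
{
    assert(i < timing_count());
    return fpga_timings[i].seconds / cpu_seconds;
}
```

The ratio falls from roughly 2000 for the first streaming version to about 4.5 with four pipelines.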
Conclusion and Summary
• Optimized an FDTD implementation on AMD Opteron
• Ported it to FPGA
• Experimented with various optimizations
– Avoiding bus bottleneck, pipelining
– Multiple concurrent processes, Domain decomposition
• Not quite at the performance of a single CPU, but getting closer
– Potential is there!
• High-Level tools enable domain scientists to experiment with FPGA
• Cray XD1 system provides ideal platform for these experiments
• FPGA FDTD optimization
– Getting speedup by combining domain decomposition and pipeline optimization
We would like to thank David Strenski (Cray) and Roy White (Xilinx) for providing access to various tools and resources. Access to the Cray XD1 system was provided through the Cray Marketing Partner Network.