Fast Reflectarray Antenna Analysis and
Synthesis on GPUs
GPU Technology Conference
San Jose, California, March 18-21, 2013
Amedeo Capozzoli, Angelo Liseno
Acknowledgements
A-periodic Conformal Reflectarrays are covered by a World Patent
recently purchased by the European Space Agency
The research activity on reflectarrays at the DIETI (Antenna
Lab) of the Università di Napoli Federico II also involves, or has
involved:
prof. Giuseppe D’Elia
dr. Claudio Curcio
The research activity on A-periodic Conformal Reflectarrays is
being developed in cooperation with dr. Giovanni Toso from
Antenna and Sub-Millimeter Wave Section, Electromagnetics
Division, TEC-EEA European Space Agency, ESA ESTEC.
The research activity on A-periodic Conformal Reflectarrays is
now funded by the European Space Agency
High Performance Antennas
Pencil beam
•The antenna radiates in a well-defined direction.
•It is used when a link between two points is required.
Steerable beams
•The antenna changes its pointing direction according to needs.
•Useful in civil and military applications, in radar systems and in wireless networks.
Shaped reconfigurable beam
•The antenna radiates a pattern with a prescribed shape.
•Useful in satellite applications, when a region of the Earth's surface must be covered without illuminating other countries or desolate locations.
Multi-beam antennas
•The antenna radiates more than a single beam at the same time.
•Useful when a link between a point and a set of points is required.
Traditional antenna systems
Array antennas
The pattern is controlled by acting on the excitation coefficients of the elements.
Advantages:
• Versatility
Drawbacks:
• Complex beam-forming network
Reflector antennas
The pattern is controlled by acting on the geometry of the reflecting surface and/or by exploiting a cluster of feeds.
Advantages:
• High gains
• Large bandwidth
Drawbacks:
• Weight, dimensions and cost
• Mechanical reconfiguration
• Poor electronic reconfiguration capabilities
Reflectarrays
A reflectarray antenna is made of an array of passive elements, illuminated, as in
traditional reflectors, by a primary source located at a fixed distance.
•The radiation pattern can be controlled by acting on the characteristics (amplitude and
phase) of the field reflected by each element.
•The reflected field can be controlled, for instance, by acting on the geometrical characteristics of the
elements.
How does a reflectarray work?
By changing the length of the transmission lines, we can control the phase of the
reflected field and, as a consequence, the radiated pattern.
Reflectarrays combine the advantages of reflector antennas with those of array antennas.
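As an illustration of the phase control described above, the following minimal sketch computes the phase that each element of a flat reflectarray should add so that all contributions sum in phase in a chosen direction. The function name `required_phase` and its parameters are hypothetical, and the sign convention (feed path delay `-beta*r`, steering term `+beta*(u0*x + v0*y)`) is an assumption chosen to match the array factor used later in the talk.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not from the slides): phase, in [0, 2*pi), that the
// element at (x, y) on a flat reflectarray must add so that the reflected
// contributions sum in phase towards the direction cosines (u0, v0).
// beta is the free-space wavenumber, r the feed-to-element distance.
double required_phase(double beta, double r, double x, double y,
                      double u0, double v0) {
    assert(beta > 0.0 && r > 0.0);
    const double two_pi = 2.0 * std::acos(-1.0);
    // Cancel the feed path delay -beta*r and the steering term beta*(u0*x + v0*y).
    double phase = std::fmod(beta * (r - u0 * x - v0 * y), two_pi);
    if (phase < 0.0) phase += two_pi;   // fmod keeps the sign of its first argument
    return phase;
}
```

In a transmission-line loaded patch, this phase would then be translated into a line length; that mapping depends on the element technology and is not sketched here.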
The first reflectarray
In 1963, the first reflectarray, based on waveguide technology, was proposed and
built by Berry, Malech and Kennedy.
By properly defining the length of each waveguide, the
phase of the field reflected by each element can be
controlled in order to satisfy the design specifications
on the far-field pattern.
The waveguide technology has not favored reflectarrays as a valid
alternative to reflectors and arrays:
• Unfavorable dimensions and weight.
• Difficulties related to their practical use.
• Complex manufacturing process.
Recently, thanks to the impressive advancements in high-frequency printed-circuits
technologies, reflectarrays are being proposed as an attractive solution to the
drawbacks of array and reflector antennas.
Why are printed reflectarrays becoming attractive?
The printed reflectarray combines the advantages of reflectors with those of classical arrays:
•Flexibility of arrays retained.
•Complexity of the feeding structure dismissed.
•Simple realization process.
•Low cost.
•Low weight.
•Moderate conformability of the reflecting surface to the geometry of the installation site.
•Easy installation and deployment.
How to control the reflected field?
Patches loaded with “passive” reactive elements (planar geometry)
Advantages:
• “Direct” design of the array
Drawbacks:
• Spurious radiation from stubs
• Large dimensions of the reflecting elements and difficulty of integration
Patches with different resonant dimensions
Advantages:
• No spurious radiation from stubs
• Compact patches
Drawbacks:
• No “direct” design of the array
• Spurious diffraction effects due to the abrupt variation of the patch geometry
Patches loaded with “passive” reactive elements (stacked geometry)
Advantages:
• No spurious radiation from stubs
• No spurious diffraction effects due to the abrupt variation of the patch geometry
Drawbacks:
• Complexity
[Figure: patch elements loaded with susceptances jB]
Patches loaded with “active” reactive elements
Advantages:
• No spurious radiation from stubs
• No spurious diffraction effects due to the abrupt variation of the patch geometry
• Electronic reconfiguration
Drawbacks:
• Complexity
• Biasing and driving network
Conformal A-Periodic Reflectarrays
The aim of the research activity on Conformal A-periodic Reflectarrays is to
develop new tools for advanced reflectarray antennas, able to make the best
use of all the degrees of freedom of the structure.
Degrees of Freedom
• Positions of the scattering elements
• Reflecting surface shape
• Characteristics of the scattering elements
• Orientations of the scattering elements
Reflectarray degrees of freedom
Why can additional degrees of freedom be useful?
Positions of the reflecting elements
• Since reflectarray antennas essentially allow only phase control of the reflected field, the additional degrees of freedom related to the element positions could be exploited to obtain an equivalent tapering behavior.
• As in a-periodic arrays, bandwidth improvements could be expected.
Orientations of the reflecting elements
• The orientations of the reflecting elements can be designed to improve the cross-polar pattern of the antenna.
Two key aspects
Design of a Reflectarray
Analysis: furnishes the scattering behavior of the radiating elements as a function of the control parameters.
Synthesis: furnishes the control parameters guaranteeing a pattern that satisfies the specifications.
Algorithms
Computing hardware
•Fermi
•Kepler (issued in November 2012)
Reflectarray Synthesis Issues
The large number of radiating elements and control parameters makes the
analysis and the synthesis of a reflectarray antenna a challenging task.
An advanced synthesis tool is required, taking into account:
• Accuracy and efficiency: high accuracy in the pattern prediction entails a high computational burden.
• Effectiveness: it is strictly related to the choice of the optimization method. Global and/or local optimization algorithms are usually employed.
• Constraints: the enforcement of constraints drawn from the physics of the problem seriously affects the convergence of the optimization algorithm.
The synthesis approach
The synthesis procedure requires the solution of an inverse problem for the operator $A$.
The solutions can be obtained by finding the global minimum of the functional
$$\Phi(x) = \left\| A(x) - P_{C_{PP}}\big(A(x)\big) \right\|_Y^2, \qquad x \in X,$$
where $P_{C_{PP}}$ is the projector onto $C_{PP}$.
•$X$: the space of the unknowns, to be defined according to the tolerable computational complexity.
•$D_A$: the effective subset in which the unknowns are searched. It is defined according to physical constraints, the design specifications, and the limits of the physical-mathematical model.
•$A: X \to Y$: the radiation operator mapping the unknowns into the far-field squared amplitude pattern.
•$C_{PP}$: the set of far-field patterns meeting the design specifications.
[Figure: the operator A maps the subset D_A of X onto A(D_A) in Y, shown together with the set C_PP]
Synthesis Tools: optimization approach
Fine radiation pattern control:
• multiple spots and/or shaped beams;
• local control of the directivity/gain over and outside the coverage.
The specifications for each component are enforced by means of proper mask functions.
[Figure: |ECO|2 with the mask functions Ms and MI]
The objective functional to be minimized is given by:
$$\Phi(x) = \left\| A_{CO}(x) - P_{CO}\big(A_{CO}(x)\big) \right\|_Y^2 + \left\| A_{CR}(x) - P_{CR}\big(A_{CR}(x)\big) \right\|_Y^2$$
Abstract formulation of the algorithm
The Trapping Problem
A serious issue is the trapping of the optimization process in false solutions
(local optima of the objective functional).
[Figure: objective functional with the starting point, a sub-optimal solution and the optimal solution]
We need the mathematical expression for the
radiation operator A.
We need a physical-mathematical model for the
scattering by each patch.
The multi-stage Synthesis
The synthesis is performed by using, in sequence, several synthesis tools based on different radiative models and optimization techniques;
the number of degrees of freedom of the structure, the accuracy and the computational complexity are progressively increased across the stages.
[Figure: Tool 1, Tool 2, …, with accuracy, number of unknowns and complexity increasing from one tool to the next]
In the synthesis tools, the unknowns of interest are obtained by minimizing a proper
objective functional by means of local and/or global optimizers.
The multi-stage Synthesis: ARA design
CCAS: Constrained Conformal Aperture Synthesis
APRPOS: A-Periodic Phase Only Synthesis
APRACS: A-Periodic Accurate Synthesis
Essential structure of the algorithm
Aperture Synthesis
• CCAS – Global
• CCAS – Local
Reflectarray Synthesis, Phase Only Radiative Model
• APRPOS – Local – Stage A: Zernike
• APRPOS – Local – Stage B: Impulse
Reflectarray Synthesis, Accurate Radiative Model
• APRACS
Output: Antenna Layout
Synthesis Tools: optimization approach
Global Optimizer
A “smart” multistart approach has been
implemented
Local Optimizer
An iterative gradient-based procedure relying
on a self-scaled version of the Broyden-
Fletcher-Goldfarb-Shanno (BFGS) scheme has
been implemented
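The global/local combination can be illustrated with a deliberately simplified sketch. The slides do not detail the “smart” multistart strategy, so the code below only shows a basic multistart; the local optimizer is a plain fixed-step gradient descent rather than self-scaled BFGS, and `toy_objective` is an invented one-dimensional function with one local and one global minimum, not the reflectarray functional.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Toy one-dimensional objective with a local and a global minimum; it stands
// in for the functional Phi(x) and is NOT the reflectarray functional.
double toy_objective(double x) { return (x * x - 1.0) * (x * x - 1.0) + 0.3 * x; }
double toy_gradient(double x) { return 4.0 * x * (x * x - 1.0) + 0.3; }

// Plain fixed-step gradient descent, used here instead of self-scaled BFGS
// to keep the sketch short.
double local_descent(double x, double step, int iterations) {
    assert(step > 0.0 && iterations > 0);
    for (int it = 0; it < iterations; ++it) x -= step * toy_gradient(x);
    return x;
}

// Basic multistart: run the local optimizer from every starting point and
// keep the result with the lowest objective value.
double multistart(const std::vector<double>& starts) {
    assert(!starts.empty());
    double best = local_descent(starts.front(), 0.01, 5000);
    for (double s : starts) {
        const double candidate = local_descent(s, 0.01, 5000);
        if (toy_objective(candidate) < toy_objective(best)) best = candidate;
    }
    return best;
}
```

Started only from x = 2, the descent stops in the local minimum near x = 0.96; adding further starting points lets the search reach the global minimum near x = -1.04, which is the trapping issue in miniature.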
Computational complexity (for both optimizers): far-field pattern and gradient evaluation.
The objective functional to be minimized is given by:
$$\Phi(x) = \left\| A_{CO}(x) - P_{CO}\big(A_{CO}(x)\big) \right\|_Y^2 + \left\| A_{CR}(x) - P_{CR}\big(A_{CR}(x)\big) \right\|_Y^2$$
Abstract formulation of the algorithm
Non-Uniform FFT (NUFFT)
FFT
Synthesis Tools: computational efficiency
Pattern Evaluation
$N$: number of patches
$E_{co}$, $E_{cr}$: co-polar and cross-polar field components
$\beta$: wavenumber
$u$, $v$: direction cosines of the observation point
$r$: radial coordinate of the observation point
$S_n$: scattering matrix of the n-th patch
$E_f$: feed field
$Q$: matrix converting Cartesian components to co-polar and cross-polar components
$(x_n, y_n, z_n)$: coordinates of the n-th patch
Computational complexity
Brute Force (BF) Summation
Optimized Matrix Vector Multiplication (OMVM)
O(N2)
O(Nlog5N)
O(NlogN)
O(NlogN)
Only Flat Periodic Arrays
Periodic and A-Periodic Flat Arrays
Synthesis Tools: computational efficiency
Pattern Evaluation
FFT: only periodic flat arrays.
Non-Uniform FFT (NUFFT): periodic and a-periodic flat arrays.
NUFFT routines evaluate a DFT starting from a non-uniform grid of radiating
elements and/or over a non-uniform grid of observation points (NUFFT Type 1, Type 2 and Type 3). This gives:
• a more flexible control of the synthesized pattern;
• a reduction of the spectral region of interest, and hence of the computational burden.
The pattern evaluation cannot always be performed by FFT and NUFFT routines. Their use depends on:
Antenna Geometry
• Planar geometry: FFT/NUFFT.
• Facetted and conformal structures prevent the direct use of FFT/NUFFT, which can be restored by the P-Series approach.
Radiative model
• Simplified (Phase Only): FFT/NUFFT.
• Accurate: OMVM.
Phase Only (PO) hypotheses
PO model: FFT/NUFFT (planar geometry).
• The dependence of the $S_n$'s on the features of the different patches can be described by a phase factor $\exp(j\varphi_n)$ only and by a term $S_0$ common to all the $S_n$'s, that is, $S_n(u,v) \simeq S_0(u,v)\exp(j\varphi_n)$.
• The dependence of $S$ on the incidence angle is neglected.
• The feed field at the n-th patch is
$$\underline{E}_f^{\,n} = \underline{E}_{f0}\, w_n\, \frac{e^{-j\beta r_n}}{r_n}, \qquad w_n = \cos^{m_f}(\theta_n),$$
where $r_n$ is the distance between the feed and the n-th patch, $\underline{E}_{f0}$ is a vector independent of the index $n$, and $w_n$ is the feed illumination factor.
The p-series Approach
PO model: FFT/NUFFT (planar geometry).
When dealing with facetted or conformal structures, the use of
FFT/NUFFT can be restored thanks to the P-Series approach.
Array factor coefficients:
$$a_n = w_n\, \frac{e^{-j\beta r_n}}{r_n}\, e^{j\varphi_n}$$
Let $(u_0,v_0,w_0)$ be the values of $(u,v,w)$ related to the main beam direction,
$u'=u-u_0$, $v'=v-v_0$, $w'=w-w_0$ and $a'_n = a_n \exp\{j(u_0 x_n + v_0 y_n + w_0 z_n)\}$.
A $P$-term Taylor expansion of $\exp(j w' z_n)$ yields a computational complexity of $O(PN\log N)$.
For the mildly conformal or facetted structures usually considered, the value of $P$ is below 5, allowing a satisfactory speedup.
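To give a feeling for why a handful of terms is enough, the sketch below evaluates the truncated Taylor series of a complex exponential and can be compared against the exact value. The function name `p_series_exp` is hypothetical; its argument `t` stands for the product of the spectral offset `w - w0` and the element height `z_n`, which is assumed to be small for mildly curved surfaces.

```cpp
#include <cassert>
#include <complex>

// Truncated Taylor series of exp(j*t): sum_{p=0}^{P-1} (j*t)^p / p!.
// In the P-series approach t plays the role of (w - w0) * z_n, which is
// small for mildly curved surfaces.
std::complex<double> p_series_exp(double t, int P) {
    assert(P >= 1);
    const std::complex<double> jt(0.0, t);
    std::complex<double> term(1.0, 0.0);   // (j*t)^p / p! for p = 0
    std::complex<double> sum(0.0, 0.0);
    for (int p = 0; p < P; ++p) {
        sum += term;
        term *= jt / static_cast<double>(p + 1);
    }
    return sum;
}
```

For t = 0.3 and P = 5 the truncation error is bounded by 0.3^5/5!, about 2e-5, which illustrates how few terms a small argument requires.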
Unknowns representation
To make the approach effective, proper representations for the synthesis
parameters should be adopted.
$$\begin{bmatrix} E_{co} \\ E_{cr} \end{bmatrix} \propto Q(u,v)\, S_0(u,v) \sum_{m=1}^{M} w_m\, \frac{e^{-j\beta r_m}}{r_m}\, e^{j\varphi_m}\, e^{j\beta(u x_m + v y_m + w z_m)}$$
$\varphi_m$: m-th control phase
$(x_m, y_m)$: m-th element position
$g$: surface equation, $z_m = g(x_m, y_m)$
Control phases
$$\varphi_m = \varphi(x_m, y_m) = \sum_k c_k\, f_k(x_m, y_m)$$
In the first stages, Zernike polynomials are adopted.
In the final stage, impulsive functions are used so that all the command phase DoFs are exploited.
Element positions
$$x_m = \sum_i d_i^x\, p_i^x(\xi_m, \eta_m), \qquad y_m = \sum_i d_i^y\, p_i^y(\xi_m, \eta_m)$$
A uniform grid in the $(\xi,\eta)$ plane is mapped into a non-uniform grid in the $(x,y)$ plane.
Surface shape
A similar modal expansion is used for $g$:
$$g(x,y) = \sum_k e_k\, g_k(x,y)$$
Constraints
Element positions
• Constraints on the element spacings are crucial: small spacings must be avoided, to prevent complex inter-element effects and apparent superdirectivity.
• Constraints on the maximum spacing are also necessary, to avoid exceedingly large RAs.
Reflecting surface
• Constraints on the smoothness of the surface shape are also needed.
Command phases
• Constraints on the command phases can become crucial to avoid abrupt variations between adjacent elements that are not practically achievable.
[Figure: abrupt changes in the element size due to phase wrap]
Canada+ConUS coverage
Number of x and y elements: 44x44
Working frequency: 14.25 GHz
x and y spacing (min/max): 0.5λ/0.7λ
Feed location: zfeed = 1 m, yfeed = 26.7 cm
Feed pointing angle: θf = 14.94°
Feed illumination factor mf: 12
Target coverage: Continental US + Canada
Reference minimum gain: 28.4 dB
[Figure: reflectarray and feed geometry in the (yRA, zRA) plane, with feed position (yfeed, zfeed), feed axes (yf, zf), pointing angle θf, and dimensions F and D]
Canada+ConUS coverage
Periodic RA (λ/2 spacing along x and y): mean directivity 29.71 dB, minimum directivity 27.66 dB
Aperiodic RA: mean directivity 30.97 dB, minimum directivity 27.98 dB
Outline
Minimization of the objective functional
$$\Phi(x) = \left\| A_{CO}(x) - P_{CO}\big(A_{CO}(x)\big) \right\|_Y^2 + \left\| A_{CR}(x) - P_{CR}\big(A_{CR}(x)\big) \right\|_Y^2$$
Keypoints
• Radiated field
• Functional gradient
• Optimization
Flow: starting guess → calculate radiated field (Aco and Acr) → calculate gradient → specs fulfilled? If no, update the unknowns x and repeat; if yes, the solution has been found.
Calculating the field and the gradient at each step is highly demanding for large reflectarrays (40x40 elements or larger).
GPUs make the synthesis of large reflectarrays feasible in reasonable computing times.
• Radiated field and gradient for POS: CUDA implementation
• Radiated field and gradient for the accurate model: Jacket implementation
Outline
Computing Hardware
Kepler
• Higher double-precision throughput
• Faster atomic operations
• Dynamic parallelism
Implementations on both Fermi
and Kepler architectures.
Fast evaluation of the radiated field (POS)
[Figure: reflecting surface S with M radiating elements]
$$\begin{bmatrix} E_{co} \\ E_{cr} \end{bmatrix} \propto Q(u,v)\, S_0(u,v) \sum_{m=1}^{M} w_m\, \frac{e^{-j\beta r_m}}{r_m}\, e^{j\varphi_m}\, e^{j\beta(u x_m + v y_m + w z_m)}$$
Array factor
$$F_h = F(u_h, v_h) = \sum_{l=0}^{M-1} a_l\, e^{j[u_h x_l + v_h y_l + w_h z_l]}$$
$F_h$ cannot necessarily be expressed in terms of a standard DFT of $a_l$ due to:
• the exponential term $\exp(j w_h z_l)$ (conformal reflectarray) and/or
• the possibly irregular $(x_l,y_l)$ spatial grid and/or
• the possible need to calculate the pattern on an irregular $(u_h,v_h)$ spectral grid
Standard FFT routines (having a convenient $O(M\log M)$ computational complexity) cannot
necessarily be employed.
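For reference, the array factor can always be evaluated by direct summation, whatever the element and spectral grids; this is the brute-force baseline that the fast algorithms aim to beat. The sketch below is a host-side illustration only: the `Element` and `Direction` types and the function name are hypothetical, and the spectral variables are assumed to already include the wavenumber.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

struct Element { double x, y, z; cplx a; };   // position and excitation a_l
struct Direction { double u, v, w; };         // spectral sample (u_h, v_h, w_h)

// Direct evaluation of F_h = sum_l a_l * exp(j*(u_h*x_l + v_h*y_l + w_h*z_l)).
// Cost is (number of directions) x (number of elements); no regularity of
// either grid is required.
std::vector<cplx> array_factor_direct(const std::vector<Element>& elements,
                                      const std::vector<Direction>& directions) {
    assert(!elements.empty());
    std::vector<cplx> F(directions.size(), cplx(0.0, 0.0));
    for (std::size_t h = 0; h < directions.size(); ++h) {
        for (const Element& e : elements) {
            const double phase = directions[h].u * e.x + directions[h].v * e.y
                               + directions[h].w * e.z;
            F[h] += e.a * std::polar(1.0, phase);
        }
    }
    return F;
}
```

With M elements and a comparable number of directions the cost grows as M squared, which is why FFT-like schemes become attractive for large reflectarrays.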
Fast evaluation of the radiated field (POS)
An algorithm for the fast analysis of irregular arrays, with the same computational complexity as standard FFTs and employing advanced (parallel) hardware, is therefore needed.
Fast algorithms
• P-series
• Subarray approach
• Non-Uniform FFT (NUFFT)
Hardware
• Advanced computing hardware (Graphics Processing Units, GPUs)
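P-series and subarray approach
P-series
If $S$ has a mild curvature, and denoting by $w_0$ the value of $w$ corresponding to the main beam center,
$$e^{j(w_h - w_0) z_l} \simeq \sum_{p=0}^{P-1} \frac{[j(w_h - w_0)]^p}{p!}\, z_l^p$$
With $a'_l = a_l\, e^{j w_0 z_l}$:
$$F_h \simeq \sum_{p=0}^{P-1} \frac{[j(w_h - w_0)]^p}{p!} \sum_{l=0}^{M-1} z_l^p\, a'_l\, e^{j[u_h x_l + v_h y_l]}$$
Each inner sum is evaluated by a NUFFT.
Subarray approach
$$F_h \simeq \sum_{q=1}^{Q} \sum_{p=0}^{P-1} \frac{[j(w_h - w_0)]^p}{p!} \sum_{l=M_{q-1}}^{M_q-1} z_l^p\, a'_l\, e^{j[u_h x_l + v_h y_l]}$$
The inner sum runs over the q-th subarray. This speeds up the convergence of the p-series.
Optimized Matrix Vector Multiplication – OMVM
$$F_h = \sum_{l=0}^{M-1} B_{hl}\, a_l, \qquad B_{hl} = e^{j[u_h x_l + v_h y_l + w_h z_l]}$$
Computational complexity ~ $O(M^2)$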
Fast evaluation of the radiated field (POS)
Surface     Patch lattice   Spectral lattice   Numerical tool
Flat        Regular         Regular            FFT
Flat        Irregular       Regular            NED-NUFFT
Flat        Regular         Irregular          NER-NUFFT
Flat        Irregular       Irregular          Type 3-NUFFT
Conformal   […]             […]                P-Series + above tools
NED: Non-Equispaced Data. NER: Non-Equispaced Results.
NER-NUFFT
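Non-Uniform FFT (NUFFT) - NER
NER-type (Non-Equispaced Results) DFT, 1D case:
$$\hat{z}_l = \sum_{k=-N/2}^{N/2} z_k\, e^{-j2\pi x_l k/N}, \qquad l=1,\ldots,M$$
$x_l$: non-uniform result sampling points
The NUFFT exploits the Poisson summation formula, which expresses each “non-uniformly sampled” exponential as an infinite sum of “uniformly sampled” exponentials:
$$e^{-j2\pi x_l k/N} = \frac{(2\pi)^{-1/2}}{\phi(2\pi k/(cN))} \sum_{m\in\mathbb{Z}} \hat{\phi}(c x_l - m)\, e^{-j2\pi mk/(cN)}$$
$c$: oversampling factor
$\phi$: proper window function, with support in $(-\pi/c, \pi/c)$
$\hat{\phi}$: transform of $\phi$, which should be concentrated in $(-K, K)$
With $\mu_l = \mathrm{Int}[c x_l]$, the Poisson summation formula becomes an interpolation formula specifically tailored to “non-uniformly sampled” exponentials:
$$e^{-j2\pi x_l k/N} \simeq \frac{(2\pi)^{-1/2}}{\phi(2\pi k/(cN))} \sum_{m=-K}^{K} \hat{\phi}(c x_l - \mu_l - m)\, e^{-j2\pi(\mu_l+m)k/(cN)}$$
Steps to calculate the NER-NUFFT
$$\hat{z}_l \simeq (2\pi)^{-1/2} \sum_{m=-K}^{K} \hat{\phi}(c x_l - \mu_l - m) \sum_{k=-N/2}^{N/2} \frac{z_k}{\phi(2\pi k/(cN))}\, e^{-j2\pi(\mu_l+m)k/(cN)}, \qquad l=1,\ldots,M$$
• Scaling and zero padding: the factor $z_k/\phi(2\pi k/(cN))$
• Standard FFT on $cN$ points: the inner sum over $k$
• Convolution (interpolation): the outer sum over $m$
A possible, although suboptimal, choice for the NUFFT windows is
$$\phi(\xi) = \begin{cases} I_0\!\left(K\sqrt{\alpha^2-\xi^2}\right), & |\xi| \le \alpha \\ 0, & |\xi| > \alpha \end{cases} \qquad \hat{\phi}(x) = \sqrt{\frac{2}{\pi}}\, \frac{\sinh\!\left(\alpha\sqrt{K^2-x^2}\right)}{\sqrt{K^2-x^2}} \qquad \alpha = \pi\left(2 - \frac{1}{c}\right) - 0.01$$
$I_0$: modified Bessel function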
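The interpolation formula can be checked numerically for a single exponential. The sketch below is a host-side illustration under stated assumptions: the function names are hypothetical, `bessel_i0` uses the plain power series of I0 rather than the rational Chebyshev scheme adopted on the GPU, and the window transform is written with the factor (2π)^(-1/2) already absorbed, as in the kernels shown in this talk.

```cpp
#include <cassert>
#include <cmath>
#include <complex>

const double kPi = std::acos(-1.0);

// Modified Bessel function I0 by its power series, sum_k (x^2/4)^k / (k!)^2.
// Chosen for brevity; it is not the rational Chebyshev scheme used on the GPU.
double bessel_i0(double x) {
    const double q = 0.25 * x * x;
    double term = 1.0, sum = 1.0;
    for (int k = 1; k < 200; ++k) {
        term *= q / (static_cast<double>(k) * k);
        sum += term;
        if (term < 1e-17 * sum) break;
    }
    return sum;
}

// Window phi(xi) = I0(K*sqrt(alpha^2 - xi^2)) for |xi| <= alpha, 0 otherwise.
double window(double xi, int K, double alpha) {
    assert(K > 0 && alpha > 0.0);
    const double d = alpha * alpha - xi * xi;
    return d >= 0.0 ? bessel_i0(K * std::sqrt(d)) : 0.0;
}

// (2*pi)^(-1/2) * phi_hat(x) = sinh(alpha*sqrt(K^2 - x^2)) / (pi*sqrt(K^2 - x^2)),
// continued with sin(...) for |x| > K.
double scaled_window_transform(double x, int K, double alpha) {
    const double p = static_cast<double>(K) * K - x * x;
    if (p > 0.0) return std::sinh(alpha * std::sqrt(p)) / (kPi * std::sqrt(p));
    if (p < 0.0) return std::sin(alpha * std::sqrt(-p)) / (kPi * std::sqrt(-p));
    return alpha / kPi;
}

// Approximates exp(-j*2*pi*x*k/N) with 2K+1 uniformly sampled exponentials.
std::complex<double> interpolated_exponential(double x, int k, int N, int c, int K) {
    const double alpha = kPi * (2.0 - 1.0 / c) - 0.01;
    const double mu = std::rint(c * x);
    std::complex<double> sum(0.0, 0.0);
    for (int m = -K; m <= K; ++m)
        sum += scaled_window_transform(c * x - (mu + m), K, alpha)
             * std::polar(1.0, -2.0 * kPi * (mu + m) * k / (c * N));
    return sum / window(2.0 * kPi * k / (c * N), K, alpha);
}
```

With c = 2 and K = 6, values chosen here only for illustration, the 13-term sum reproduces the exponential far below single-precision accuracy for every k between -N/2 and N/2.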
NER-NUFFT: operations count
• Scaling and zero padding: 2N operations.
• Standard FFT on cN points: O(cNlog(cN)) operations.
• Interpolation: M(2K+1) operations.
• Spatial and spectral windows: the cost depends on the accuracy desired in calculating the involved functions (special functions).
For N,M>>K, the computational complexity is O(cNlog(cN)).
NER-NUFFT: parallel CUDA implementation
• Scaling and zero padding: implemented by a specific kernel. Intrinsically parallel step.
• Standard FFT on cN points: cuFFT library.
• Interpolation: each thread is assigned to a different l and calculates a summation of 2K+1 terms.
• Evaluation of the spatial window function: modified Bessel function evaluated by rational Chebyshev approximations. Calculated in advance by a specific kernel.
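The three steps listed above can be put together in a compact one-dimensional host-side sketch. It is an illustration under stated assumptions, not the CUDA implementation of the talk: the cN-point FFT is replaced by a direct DFT for brevity, I0 is computed by its power series, the defaults c = 2 and K = 6 are chosen here for illustration, and all function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;
const double kPi = std::acos(-1.0);

// I0 by power series (a stand-in for the rational Chebyshev evaluation).
double bessel_i0(double x) {
    const double q = 0.25 * x * x;
    double term = 1.0, sum = 1.0;
    for (int k = 1; k < 200; ++k) {
        term *= q / (static_cast<double>(k) * k);
        sum += term;
        if (term < 1e-17 * sum) break;
    }
    return sum;
}

// (2*pi)^(-1/2) * phi_hat(x), with the sin(...) continuation for |x| > K.
double scaled_phi_hat(double x, int K, double alpha) {
    const double p = static_cast<double>(K) * K - x * x;
    if (p > 0.0) return std::sinh(alpha * std::sqrt(p)) / (kPi * std::sqrt(p));
    if (p < 0.0) return std::sin(alpha * std::sqrt(-p)) / (kPi * std::sqrt(-p));
    return alpha / kPi;
}

// 1D NER-NUFFT sketch. z[i] holds z_k for k = i - N/2; x holds the M
// non-uniform output points. Returns zhat_l = sum_k z_k exp(-j*2*pi*x_l*k/N).
std::vector<cplx> ner_nufft_1d(const std::vector<cplx>& z, const std::vector<double>& x,
                               int c = 2, int K = 6) {
    const int N = static_cast<int>(z.size());
    assert(N > 0 && N % 2 == 0 && c >= 2 && K >= 1);
    const int cN = c * N;
    const double alpha = kPi * (2.0 - 1.0 / c) - 0.01;

    // Step 1: scaling by 1/phi(2*pi*k/(cN)); zero padding is implicit because
    // the sum in step 2 only runs over the N nonzero samples.
    std::vector<cplx> scaled(N);
    for (int i = 0; i < N; ++i) {
        const double xi = 2.0 * kPi * (i - N / 2) / cN;
        scaled[i] = z[i] / bessel_i0(K * std::sqrt(alpha * alpha - xi * xi));
    }

    // Step 2: DFT on cN points (direct sum here, where an FFT would be used).
    std::vector<cplx> U(cN);
    for (int s = 0; s < cN; ++s) {
        cplx acc(0.0, 0.0);
        for (int i = 0; i < N; ++i)
            acc += scaled[i] * std::polar(1.0, -2.0 * kPi * s * (i - N / 2) / cN);
        U[s] = acc;
    }

    // Step 3: interpolation with 2K+1 window samples per output point.
    std::vector<cplx> zhat(x.size());
    for (std::size_t l = 0; l < x.size(); ++l) {
        const long mu = std::lrint(c * x[l]);
        cplx acc(0.0, 0.0);
        for (int m = -K; m <= K; ++m) {
            const long idx = ((mu + m) % cN + cN) % cN;   // wrap into 0..cN-1
            acc += scaled_phi_hat(c * x[l] - static_cast<double>(mu + m), K, alpha) * U[idx];
        }
        zhat[l] = acc;
    }
    return zhat;
}
```

Step 3 is the part mapped onto one GPU thread per output point; since each output is written by a single thread, no atomic operations are needed in the NER case.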
Calculation of the modified Bessel function I0
    __device__ double bessi0(double x) {
        double num, den, x2;
        x2 = x * x;
        x = fabs(x);
        if (x > 15.0) {
            /* Large arguments: polynomial in 1/x, scaled by exp(x)/sqrt(x) */
            den = 1.0 / x;
            num = -4.4979236558557991E+006;
            num = fma (num, den, 2.7472555659426521E+006);
            num = fma (num, den, -6.4572046640793153E+005);
            [...]
            num = fma (num, den, 3.9894228040143265E-001);
            num = num * den;
            den = sqrt (x);
            num = num * den;
            den = exp (0.5 * x);  /* prevent premature overflow */
            num = num * den;
            num = num * den;
            return num;
        } else {
            /* Small arguments: rational function of x^2 */
            num = -0.27288446572737951578789523409E+010;
            num = fma (num, x2, -0.6768549084673824894340380223E+009);
            num = fma (num, x2, -0.4130296432630476829274339869E+008);
            […]
            den = -0.2728844657273795156746641315E+010;
            den = fma (den, x2, 0.5356255851066290475987259E+007);
            […]
            return num/den;
        }
    }
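$$I_0(x) \simeq \frac{\sum_{j=0}^{16} p_j\, x^{2j}}{\sum_{j=0}^{3} q_j\, x^{2j}}, \qquad |x| \le 15$$
$$I_0(x) \simeq x^{-1/2}\, e^{x} \sum_{j=0}^{25} p_j\, T_j\!\left(\frac{30}{x} - 1\right), \qquad x \ge 15$$
$p_j$, $q_j$: expansion coefficients
$T_j$: Chebyshev polynomials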
J.M. Blair, “Rational Chebyshev approximations for the modified Bessel
functions I0 and I1”, Math. of Comput., vol. 28, n. 126, pp. 581-583, Apr. 1974.
Special function calculation Bessel function I0 not available in CUDA libraries.
Implemented according to Blair’s approach.
NER-NUFFT: Interpolation
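$$\hat{z}_l \simeq (2\pi)^{-1/2} \sum_{m=\mu_l-K}^{\mu_l+K} \hat{\phi}(c x_l - m) \sum_{k=-N/2}^{N/2} z_k\, \frac{1}{\phi(2\pi k/(cN))}\, e^{-j2\pi mk/(cN)}, \qquad \mu_l = \mathrm{Int}[c x_l]$$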
    /* c, K, alfa, pi, modulo() and IDX2R() are defined elsewhere (not shown on the slide) */
    __global__ void Interpolation(const double2* __restrict__ U_d,
                                  const double* __restrict__ x1_d,
                                  const double* __restrict__ x2_d,
                                  double2* __restrict__ tr,
                                  const int N1, const int N2, const int N) {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        if (i<N) {
            int ind_i,ind_j;
            double x1 = x1_d[i], x2 = x2_d[i],
                   mu1 = rint(c*x1), mu2 = rint(c*x2),
                   phicap1, phicap2, tempd, p1, p2, expon;
            double2 UU, temp = make_cuDoubleComplex(0.0,0.0);
            for (int m1=-K; m1<=K; m1++) {
                /* Window sample along the first dimension */
                ind_i = modulo((int)mu1 + m1 + c*N1,c*N1);
                expon = (c*x1-(mu1+(double)m1));
                p1 = K*K-expon*expon;
                if(p1<0.) {tempd=rsqrt(-p1); phicap1 = (1./pi)*((sin(alfa/tempd))*tempd); }
                else if(p1>0.) {tempd=rsqrt(p1); phicap1 = (1./pi)*((sinh(alfa/tempd))*tempd); }
                else phicap1 = alfa/pi;
                for (int m2 = -K; m2<=K; m2++) {
                    /* Window sample along the second dimension */
                    ind_j = modulo((int)mu2 + m2 + c*N2,c*N2);
                    expon = (c*x2-(mu2+(double)m2));
                    p2 = K*K-expon*expon;
                    if(p2<0.) {tempd=rsqrt(-p2); phicap2 = (1./pi)*((sin(alfa/tempd))*tempd); }
                    else if(p2>0.) {tempd=rsqrt(p2); phicap2 = (1./pi)*((sinh(alfa/tempd))*tempd); }
                    else phicap2 = alfa/pi;
                    /* Accumulate the weighted FFT sample */
                    UU = U_d[IDX2R(ind_j,ind_i,c*N2)];
                    temp.x = temp.x+phicap1*phicap2*UU.x;
                    temp.y = temp.y+phicap1*phicap2*UU.y;
                }
            }
            tr[i] = temp;
        }
    }
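$$\hat{\phi}(x) = \sqrt{\frac{2}{\pi}}\, \frac{\sinh\!\left(\alpha\sqrt{K^2-x^2}\right)}{\sqrt{K^2-x^2}}$$
Optimizations adopted in the kernel:
• Overlap between memory loads and computation.
• Analysis of branch paths: efficiency close to 100%.
• Reciprocal square root (rsqrt).
• Modulo by a power of 2 computed with a bit mask: foo%n==foo&(n-1) when n is a power of 2.
• Read-only cache for Kepler.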
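The bit-mask form of the modulo operation mentioned above can be sketched as follows. The helper name `modulo_pow2` is hypothetical (the slides only show a call to `modulo`), and the wrap-around of negative values relies on two's-complement integers, which is an assumption about the target platform.

```cpp
#include <cassert>

// Wraps v into [0, n) when n is a power of two, using a bit mask instead of
// the % operator. Relies on two's-complement integers, so negative v wraps
// upward (e.g. -1 maps to n - 1), unlike v % n in C++.
int modulo_pow2(int v, int n) {
    assert(n > 0 && (n & (n - 1)) == 0);   // n must be a power of two
    return v & (n - 1);
}
```

The upward wrap of negative arguments is convenient for the interpolation indices, which can fall below zero near the edge of the oversampled grid.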
NED-NUFFT
NED-NUFFT: Parallel CUDA implementation
$$\hat{z}_k \simeq \frac{(2\pi)^{-1/2}}{\phi(2\pi k/(cN))} \sum_{s=-cN/2}^{cN/2-1} \Bigg[ \sum_{l}\ \sum_{m:\,|c x_l - (s+mcN)| \le K} z_l\, \hat{\phi}\big(c x_l - (s+mcN)\big) \Bigg] e^{-j2\pi sk/(cN)}, \qquad k=-N/2,\ldots,N/2$$
• Interpolation (the bracketed sum): each thread is assigned to a different s and calculates a summation of 2K+1 terms. Atomic operations required.
• Standard FFT on cN points (the sum over s): cuFFT library.
• Scaling and decimation (the factor in front of the sum): implemented by a specific kernel. Intrinsically parallel step.
• Evaluation of the spatial window function: modified Bessel function evaluated by rational Chebyshev approximations. Calculated in advance by a specific kernel.
Steps to calculate the NED-NUFFT and operations count
• Interpolation: M(2K+1) operations.
• Standard FFT on cN points: O(cNlog(cN)) operations.
• Scaling and decimation: 2N operations.
• Spatial and spectral windows: the cost depends on the accuracy desired in calculating the involved functions.
For N,M>>K, the computational complexity is O(cNlog(cN)).
NED-NUFFT: Interpolation
    /* Kernel body only: the kernel signature is not shown on the slide.
       cc, K, alfa, pi_double, modulo() and IDX2R() are defined elsewhere. */
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if(i<M) {
        /* Loads moved inside the bounds check to avoid out-of-range reads */
        double cc_points1=cc*x[i], r_cc_points1=rint(cc_points1), cc_diff1 = cc_points1-r_cc_points1;
        double cc_points2=cc*y[i], r_cc_points2=rint(cc_points2), cc_diff2 = cc_points2-r_cc_points2;
        double phi_cap1, phi_cap2, P1, P2, tempd;
        int PP1, PP2;
        for(int m=0; m<(2*K+1); m++) {
            P1 = K*K-(cc_points1-(r_cc_points1+(m-K)))*(cc_points1-(r_cc_points1+(m-K)));
            PP1 = modulo((r_cc_points1+(m-K)+N1*cc/2),(cc*N1));
            if(P1<0.) {tempd=rsqrt(-P1); phi_cap1 = (1./pi_double)*((sin(alfa/tempd))*tempd); }
            else if(P1>0.) {tempd=rsqrt(P1); phi_cap1 = (1./pi_double)*((sinh(alfa/tempd))*tempd); }
            else phi_cap1 = alfa/pi_double;
            for(int n=0; n<(2*K+1); n++) {
                P2 = K*K-(cc_points2-(r_cc_points2+(n-K)))*(cc_points2-(r_cc_points2+(n-K)));
                PP2 = modulo((r_cc_points2+(n-K)+N2*cc/2),(cc*N2));
                if(P2<0.) {tempd=rsqrt(-P2); phi_cap2 = (1./pi_double)*((sin(alfa/tempd))*tempd); }
                else if(P2>0.) {tempd=rsqrt(P2); phi_cap2 = (1./pi_double)*((sinh(alfa/tempd))*tempd); }
                else phi_cap2 = alfa/pi_double;
                /* Several input samples may contribute to the same grid point */
                atomicAdd(&result[IDX2R(PP1,PP2,cc*N2)].x,data[i].x*phi_cap1*phi_cap2);
                atomicAdd(&result[IDX2R(PP1,PP2,cc*N2)].y,data[i].y*phi_cap1*phi_cap2);
            }
        }
    }
The double-precision atomicAdd routine is the one given in the CUDA Programming
Guide.
NED-NUFFT: Interpolation with dynamic parallelism
Child kernel function
    /* One thread per (m, n) term of the double summation */
    __global__ void series_terms(double2 temp_data, double2* __restrict__ result,
                                 const double r_cc_points1, const double cc_diff1,
                                 const double r_cc_points2, const double cc_diff2,
                                 const int N1, const int N2) {
        int m = threadIdx.x;
        int n = threadIdx.y;
        double tempd, phi_cap, P;   /* P was not declared on the slide */
        P = K*K-(cc_diff1-(m-K))*(cc_diff1-(m-K));
        if(P<0.) {tempd=rsqrt(-P); phi_cap = (1./pi_double)*((sin(alfa/tempd))*tempd); }
        else if(P>0.) {tempd=rsqrt(P); phi_cap = (1./pi_double)*((sinh(alfa/tempd))*tempd); }
        else phi_cap = alfa/pi_double;
        P = K*K-(cc_diff2-(n-K))*(cc_diff2-(n-K));
        if(P<0.) {tempd=rsqrt(-P); phi_cap = phi_cap*(1./pi_double)*((sin(alfa/tempd))*tempd); }
        else if(P>0.) {tempd=rsqrt(P); phi_cap = phi_cap*(1./pi_double)*((sinh(alfa/tempd))*tempd); }
        else phi_cap = phi_cap*alfa/pi_double;
        int PP1 = modulo((r_cc_points1+(m-K)+N1*cc/2),(cc*N1));
        int PP2 = modulo((r_cc_points2+(n-K)+N2*cc/2),(cc*N2));
        atomicAdd(&result[IDX2R(PP1,PP2,cc*N2)].x,temp_data.x*phi_cap);
        atomicAdd(&result[IDX2R(PP1,PP2,cc*N2)].y,temp_data.y*phi_cap);
    }
Parent kernel function
    /* One thread per input sample; each launches a child grid for its terms */
    __global__ void dynamic_interpolation(const double2* __restrict__ data,
                                          double2* __restrict__ result,
                                          const double* __restrict__ x,
                                          const double* __restrict__ y,
                                          const int N1, const int N2, int M) {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        if(i<M) {
            /* Loads moved inside the bounds check to avoid out-of-range reads */
            double cc_points1=cc*x[i];
            double r_cc_points1=rint(cc_points1);   // Equivalent of mu
            const double cc_diff1 = cc_points1-r_cc_points1;
            double cc_points2=cc*y[i];
            double r_cc_points2=rint(cc_points2);   // Equivalent of mu
            const double cc_diff2 = cc_points2-r_cc_points2;
            double2 temp_data = data[i];
            dim3 dimBlock(13,13);
            dim3 dimGrid(1,1);
            series_terms<<<dimGrid,dimBlock>>>(temp_data,result,r_cc_points1,cc_diff1,r_cc_points2,cc_diff2,N1,N2);
        }
    }
NED-NUFFT: fftshift
The NED-NUFFT requires a DFT with centered indices:
$$\hat{f}_k = \sum_{s=-cN/2}^{cN/2-1} f_s\, e^{-j2\pi sk/(cN)}, \qquad k=-cN/2,\ldots,cN/2$$
cuFFT requires summation indices starting from 0 and returns DFTs with indices starting from 0 (indices 0,…,cN-1).
Solution with memory movements (swap): ifftshift, FFT, fftshift.
Solution without memory movements: multiply the input by $(-1)^s$ and the output by $(-1)^k$. Indeed,
$$(-1)^k \sum_{s=0}^{cN-1} (-1)^s f_s\, e^{-j2\pi sk/(cN)} = e^{-j\pi k} \sum_{s=0}^{cN-1} f_s\, e^{-j\pi s}\, e^{-j2\pi sk/(cN)}$$
$$= e^{-j2\pi k(cN/2)/(cN)} \sum_{s=0}^{cN-1} f_s\, e^{-j2\pi s(cN/2)/(cN)}\, e^{-j2\pi sk/(cN)}$$
$$= e^{-j2\pi k(cN/2)/(cN)} \sum_{s=0}^{cN-1} f_s\, e^{-j2\pi s(k+cN/2)/(cN)}$$
The sign alternation of the input shifts the output index by cN/2, and the sign alternation of the output accounts for the shift of the input index. The centered DFT is thus obtained, up to the constant factor $(-1)^{cN/2}$, directly from the cuFFT output.
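The shift-free approach can be verified on the host with a small sketch. It is an illustration under stated assumptions: a direct DFT stands in for the cuFFT call, the length is taken as a multiple of 4 so that the constant factor (-1)^(n/2) equals 1, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <vector>

using cplx = std::complex<double>;

// Plain DFT with indices 0..n-1 on input and output, standing in for an FFT
// library call: F[k] = sum_s f[s] * exp(-j*2*pi*s*k/n).
std::vector<cplx> dft(const std::vector<cplx>& f) {
    const int n = static_cast<int>(f.size());
    const double two_pi = 2.0 * std::acos(-1.0);
    std::vector<cplx> F(n);
    for (int k = 0; k < n; ++k) {
        cplx acc(0.0, 0.0);
        for (int s = 0; s < n; ++s)
            acc += f[s] * std::polar(1.0, -two_pi * s * k / n);
        F[k] = acc;
    }
    return F;
}

// Centered DFT (both indices in -n/2..n/2-1, stored with an offset of n/2)
// obtained without fftshift/ifftshift: alternate the sign of the input,
// transform, alternate the sign of the output. Assumes n is a multiple of 4
// so that the constant (-1)^(n/2) equals 1.
std::vector<cplx> centered_dft_by_sign_flips(std::vector<cplx> f) {
    const int n = static_cast<int>(f.size());
    assert(n > 0 && n % 4 == 0);
    for (int s = 1; s < n; s += 2) f[s] = -f[s];
    std::vector<cplx> F = dft(f);
    for (int k = 1; k < n; k += 2) F[k] = -F[k];
    return F;
}
```

On the GPU the two sign alternations can be fused into the kernels that precede and follow the FFT, so no extra pass over memory is needed.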
NUFFT NER Results
[Figures: speedup and execution time. MxM is the number of elements. Oversampling of 2 for the power pattern.]
NUFFT NED Results
[Figures: speedup and execution time. MxM is the number of elements. Oversampling of 2 for the power pattern.]
Dynamic parallelism significantly improves the result.
Fast evaluation of the radiated field (accurate)
Acceleration of the pattern evaluation by Accelereyes Jacket
$$\begin{bmatrix} E_{co} \\ E_{cr} \end{bmatrix} = Q(u,v) \sum_{m=1}^{M} S_m(u,v)\, E_f^{\,m}\, e^{j\beta(u x_m + v y_m + w z_m)}$$
NUFFT routines can no longer be used.
The algorithm has been written as a Matlab script accelerated by functions
of the Accelereyes Jacket toolbox; the sum is evaluated by the Fast Matrix-Vector Product
routine by Accelereyes.
Machine: Genesis Tesla I-7950 workstation, 6 GB of RAM
CPU: Intel Core i7-950 (4 cores, 8 threads), 3.06 GHz
GPU: Nvidia Tesla C2050 (448 cores), 1.15 GHz, 2.8 GB RAM
Speedup ≈ 8
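Gradient evaluation
Phase-Only Synthesis case
[Equation: derivative of the functional with respect to the control phase of the m-th element, expressed as the imaginary part of a sum over the H spectral points (u_h, v_h, w_h) involving E_co, its conjugate, the projection P(|E_co|^2), Q and S_0, and the factor exp(j(u_h x_m + v_h y_m + w_h z_m))]
Calculated by the P-series + NUFFT approach.
Accurate synthesis case
[Equation: derivative of the functional with respect to the parameter p_l, expressed as the real part of a scalar product in Y involving the derivative of A_CO with respect to p_l, which depends on the derivative of S_l with respect to p_l, on the feed field at the l-th element and on exp(jk(u x_l + v y_l))]
Evaluation of the scalar product by the Fast Matrix-Vector Product routine by
Accelereyes.
The acceleration of the gradient follows the same computation scheme as
the radiated field for both synthesis cases.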
Conclusions
Acceleration of reflectarray synthesis by
• P-Series
• Non-Uniform FFT (NUFFT)
• Optimized Matrix Vector Multiplication (OMVM)
Implementation on GPUs (tested on Fermi and Kepler architectures)
makes the reflectarray synthesis feasible in reasonable time:
• the POS synthesis stage (CUDA implementation) takes about 3/4 hours for a 44x44 reflectarray;
• the accurate synthesis stage (Jacket implementation) takes about 3/4 days for a 44x44 reflectarray.
NUFFT algorithms are of interest in many other application fields.
For electromagnetic applications, we have successfully employed the described NUFFTs in:
• Near-field antenna characterization;
• Synthetic Aperture Radar fast processing.