Fast Reflectarray Antenna Analysis and Synthesis on GPUs

GPU Technology Conference

San Jose, California, March 18-21, 2013

Amedeo Capozzoli, Angelo Liseno


Acknowledgements

A-periodic Conformal Reflectarrays are covered by a World Patent recently purchased by the European Space Agency.

The research activity on reflectarrays at the DIETI (Antenna Lab) of the Università di Napoli Federico II involves, and has also involved:

prof. Giuseppe D’Elia

dr. Claudio Curcio

The research activity on A-periodic Conformal Reflectarrays is being developed in cooperation with dr. Giovanni Toso from the Antenna and Sub-Millimeter Wave Section, Electromagnetics Division, TEC-EEA, European Space Agency, ESA ESTEC.

The research activity on A-periodic Conformal Reflectarrays is now funded by the European Space Agency.

High Performance Antennas

Pencil beam

• The antenna radiates in a well-defined direction.
• It is used when a link between two points is required.

Steerable beams

• The antenna changes its pointing direction according to needs.
• Useful in civil and military applications, in radar systems and in wireless networks.

Shaped reconfigurable beam

• The antenna radiates a pattern with a prescribed shape.
• Useful in satellite applications, when we need to cover a region of the Earth's surface without illuminating other countries or desolate locations.

Multi-beam antennas

• The antenna radiates more than a single beam at the same time.
• Useful when a link between a point and a set of points is required.

Traditional antenna systems

Reflector antennas

The pattern is controlled by acting on the geometry of the reflecting surface and/or by exploiting a cluster of feeds.

Advantages:
• High gains
• Large bandwidth

Drawbacks:
• Weight, dimensions and cost
• Mechanical reconfiguration
• Poor electronic reconfiguration capabilities

Array antennas

The pattern is controlled by acting on the excitation coefficients of the elements.

Advantages:
• Versatility

Drawbacks:
• Complex beam-forming network

Reflectarrays

A reflectarray antenna is made of an array of passive elements illuminated, as in traditional reflectors, by a primary source located at a fixed distance.

• The radiation pattern can be controlled by acting on the characteristics (amplitude and phase) of the field reflected by each element.

• The reflected field can be controlled, for instance, by acting on the geometrical characteristics of the elements.


How does a reflectarray work?

By changing the length of the transmission lines, we can control the phase of the reflected field and, as a consequence, the radiated pattern.

Reflectarrays combine the advantages of reflector antennas with those of array antennas.

The first reflectarray

The first reflectarray, based on waveguide technology, was proposed and realized in 1963 by Berry, Malech and Kennedy.

By properly defining the length of each waveguide, the phase of the field reflected by each element can be controlled in order to satisfy the design specifications on the far-field pattern.

The waveguide technology did not make reflectarrays a valid alternative to reflectors and arrays, because of:

• Unfavorable dimensions and weight.
• Difficulties related to their practical use.
• Complex manufacturing process.

Recently, thanks to the impressive advancements in high-frequency printed-circuit technologies, reflectarrays are being proposed as an attractive solution to the drawbacks of array and reflector antennas.

Why are printed reflectarrays becoming attractive?

The printed reflectarray combines the advantages of reflectors with those of classical arrays:

• Flexibility of arrays retained.
• Complex feeding structure avoided.
• Simple realization process.
• Low cost.
• Low weight.
• Moderate conformability of the reflecting surface to the geometry of the installation site.
• Easy installation and deployment.

How to control the reflected field?

Patches loaded with “passive” reactive elements (planar geometry)

Advantages:
• “Direct” design of the array

Drawbacks:
• Spurious radiation from stubs
• Large dimensions of the reflecting elements and difficulty of integration

Patches with different resonant dimensions

Advantages:
• No spurious radiation from stubs
• Compact patches

Drawbacks:
• No “direct” design of the array
• Spurious diffraction effects due to the abrupt variation of the patch geometry

Patches loaded with “passive” reactive elements (stacked geometry)

[Figure: patches loaded with reactive susceptances jB]

Advantages:
• No spurious diffraction effects due to the abrupt variation of the patch geometry

Drawbacks:
• Complexity

Patches loaded with “active” reactive elements

Advantages:
• No spurious diffraction effects due to the abrupt variation of the patch geometry
• Electronic reconfiguration

Drawbacks:
• Complexity
• Biasing and driving network


Conformal A-Periodic Reflectarrays

The aim of the research activity on Conformal A-periodic Reflectarrays is to develop new tools for advanced reflectarray antennas, able to make the best use of all the degrees of freedom of the structure:

• Positions of the scattering elements
• Reflecting surface shape
• Characteristics of the scattering elements
• Orientations of the scattering elements

Reflectarray degrees of freedom: why can additional degrees of freedom be useful?

Positions of the reflecting elements

Since reflectarray antennas essentially allow only the phase control of the reflected field, the additional degrees of freedom related to the element positions could be exploited to get an equivalent amplitude-tapering behavior. As in a-periodic arrays, bandwidth improvements could be expected.

Orientations of the reflecting elements

The orientations of the reflecting elements can be designed to improve the cross-polar pattern of the antenna.

Design of a Reflectarray: two key aspects

Analysis: furnishes the scattering behavior of the radiating elements as a function of the control parameters.

Synthesis: furnishes the control parameters guaranteeing a pattern that satisfies the specifications.

Both rely on algorithms and on computing hardware (NVIDIA GPU architectures):
• Fermi
• Kepler (released in November 2012)

Reflectarray Synthesis Issues

The large number of radiating elements and control parameters makes the analysis and the synthesis of a reflectarray antenna a challenging task.

An advanced synthesis tool is required, accounting for accuracy, efficiency, effectiveness and constraints:

• High accuracy in the pattern prediction requires a high computational burden.

• The effectiveness is strictly related to the choice of the optimization method. Global and/or local optimization algorithms are usually employed.

• The enforcement of constraints drawn from the physics of the problem seriously affects the convergence of the optimization algorithm.

The synthesis approach

[Figure: the operator A maps the admissible set DA ⊂ X into A(DA) ⊂ Y; CPP ⊂ Y is the set of acceptable patterns]

The synthesis procedure requires the solution of an inverse problem for the operator A. The solutions can be obtained by finding the global minimum of the functional

$$\Phi(x) = \left\| \mathcal{A}(x) - P_{C_{PP}}\big(\mathcal{A}(x)\big) \right\|^2, \qquad x \in X,$$

where $P_{C_{PP}}$ is the projector onto $C_{PP}$, and:

• X is the space of the unknowns, to be defined according to the tolerable computational complexity.

• DA is the effective subset wherein the unknowns should be searched. It is defined according to the physical constraints, the design specifications, and the limits of the physical-mathematical model.

• A: X → Y is the radiation operator mapping the unknowns into the far-field squared-amplitude pattern.

• CPP is the set of far-field patterns meeting the design specifications.

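Synthesis Tools: optimization approach

[Figure: co-polar power pattern |ECO|² between the upper mask MS and the lower mask MI]

The specifications for each component are enforced by means of proper mask functions, which allow a fine radiation pattern control:

• multiple spots and/or shaped beams;
• local control of the directivity/gain over and outside the coverage.

The objective functional to be minimized is:

$$\Phi(x) = \left\| \mathcal{A}_{CO}(x) - P_{CO}\big(\mathcal{A}_{CO}(x)\big) \right\|_Y^2 + \left\| \mathcal{A}_{CR}(x) - P_{CR}\big(\mathcal{A}_{CR}(x)\big) \right\|_Y^2$$

Abstract formulation of the algorithm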
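A minimal C++ sketch of how this functional can be evaluated on sampled patterns. It assumes that the projector onto the admissible set is a pointwise clamp between the lower mask MI and the upper mask MS, and it replaces the Y-norm with an unweighted sum over the observation points; the function names are illustrative, not the authors' code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed form of the projector: clamp each sample between the masks,
// MI[h] <= A[h] <= MS[h].
std::vector<double> project_onto_masks(const std::vector<double>& pattern,
                                       const std::vector<double>& mask_low,
                                       const std::vector<double>& mask_up) {
  std::vector<double> proj(pattern.size());
  for (std::size_t h = 0; h < pattern.size(); ++h)
    proj[h] = std::min(std::max(pattern[h], mask_low[h]), mask_up[h]);
  return proj;
}

// Squared distance ||A - P(A)||^2, Y-norm approximated by a plain sum.
double mask_distance2(const std::vector<double>& pattern,
                      const std::vector<double>& mask_low,
                      const std::vector<double>& mask_up) {
  const std::vector<double> proj = project_onto_masks(pattern, mask_low, mask_up);
  double sum = 0.0;
  for (std::size_t h = 0; h < pattern.size(); ++h) {
    const double r = pattern[h] - proj[h];
    sum += r * r;
  }
  return sum;
}

// Phi = co-polar term + cross-polar term, each with its own pair of masks.
double objective(const std::vector<double>& a_co, const std::vector<double>& co_low,
                 const std::vector<double>& co_up, const std::vector<double>& a_cr,
                 const std::vector<double>& cr_low, const std::vector<double>& cr_up) {
  return mask_distance2(a_co, co_low, co_up) + mask_distance2(a_cr, cr_low, cr_up);
}
```

Samples lying inside the masks contribute nothing, so Φ vanishes exactly when both components meet their specifications.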

The Trapping Problem

A serious issue is the trapping of the optimization process into false solutions (local optima of the objective functional).

[Figure: objective functional with a starting point, a sub-optimal solution and the optimal solution]




We need the mathematical expression for the radiation operator A and, therefore, a physical-mathematical model for the scattering by each patch.

The multi-stage Synthesis

The synthesis is performed by using, in sequence, several synthesis tools (Tool 1, Tool 2, …), based on different radiative models and optimization techniques; the number of degrees of freedom of the structure, the accuracy and the computational complexity are progressively increased across the stages.

[Figure: sequence of synthesis tools with increasing accuracy, number of unknowns and complexity]

In each synthesis tool, the unknowns of interest are obtained by minimizing a proper objective functional by means of local and/or global optimizers.

The multi-stage Synthesis: ARA design

CCAS: Constrained Conformal Aperture Synthesis
APRPOS: A-Periodic Phase Only Synthesis
APRACS: A-Periodic Accurate Synthesis

Essential structure of the algorithm, from the antenna layout to the reflectarray synthesis:

Antenna layout → CCAS – Global (aperture synthesis) → CCAS – Local → APRPOS – Local – Stage A: Zernike → APRPOS – Local – Stage B: Impulse → APRACS → Reflectarray

The APRPOS stages rely on the phase-only radiative model; APRACS relies on the accurate radiative model.

Synthesis Tools: optimization approach

Global optimizer: a “smart” multistart approach has been implemented.

Local optimizer: an iterative gradient-based procedure relying on a self-scaled version of the Broyden-Fletcher-Goldfarb-Shanno (BFGS) scheme has been implemented.

Computational complexity (for both optimizers): dominated by the far-field pattern and gradient evaluations.


Synthesis Tools: computational efficiency

Pattern Evaluation

• N: number of patches
• Eco: co-polar component; Ecr: cross-polar component
• k: wavenumber
• u, v: direction cosines of the observation point
• r: radial coordinate of the observation point
• Sn: scattering matrix of the n-th patch
• Ef: feed field
• Q: matrix converting the Cartesian components into the co-polar and cross-polar ones
• (xn, yn, zn): coordinates of the n-th patch

Computational complexity:
• Brute Force (BF) summation
• Optimized Matrix Vector Multiplication (OMVM)
• FFT: only flat periodic arrays
• Non-Uniform FFT (NUFFT): periodic and a-periodic flat arrays

[Figure: computational complexity of the methods above; values shown: O(N²), O(N log⁵N), O(N log N), O(N log N)]
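A brute-force summation of the array factor, sketched in C++ for reference: each of the H observation directions costs O(N), so the total is O(NH), i.e. O(N²) when H is of the order of N. Positions are assumed to be already multiplied by the wavenumber k, and directions are assumed to lie in the visible region; `Element` and `array_factor_bf` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

// One radiating element: position (scaled by k) and its complex coefficient a_n.
struct Element {
  double x, y, z;
  cd a;
};

// F(u_h, v_h) = sum_n a_n exp(j(u_h x_n + v_h y_n + w_h z_n)), O(N) per direction.
std::vector<cd> array_factor_bf(const std::vector<Element>& elements,
                                const std::vector<double>& u,
                                const std::vector<double>& v) {
  std::vector<cd> F(u.size());
  for (std::size_t h = 0; h < u.size(); ++h) {
    // w = sqrt(1 - u^2 - v^2), clamped at 0 (visible region assumed)
    const double w = std::sqrt(std::max(0.0, 1.0 - u[h] * u[h] - v[h] * v[h]));
    for (const Element& e : elements)
      F[h] += e.a * std::polar(1.0, u[h] * e.x + v[h] * e.y + w * e.z);
  }
  return F;
}
```

For example, two unit-coefficient elements spaced by π (in units of 1/k) add in phase at broadside and cancel at u = 1.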


Pattern Evaluation

NUFFT routines perform a DFT starting from a non-uniform grid of radiating elements and/or over a non-uniform grid of observation points (NUFFT Type 1, Type 2 and Type 3). Unlike the FFT, which is restricted to periodic flat arrays, they handle both periodic and a-periodic flat arrays, and they allow:

• a more flexible control of the synthesized pattern;
• a reduction of the spectral region of interest and, hence, of the computational burden.


Pattern Evaluation

The pattern evaluation cannot always be performed by using FFT and NUFFT routines. In particular, their use is strictly related to:

• Antenna geometry: facetted and conformal structures prevent the use of FFT/NUFFT; their use can be restored by the P-series approach.

• Radiative model: the simplified (Phase Only) model allows FFT/NUFFT (planar geometry), while the accurate model requires OMVM.

The PO Model

Phase Only (PO) hypotheses:

• the dependence of the Sn's on the features of the different patches can be described only by a phase factor exp(jφn) and by a term S0 common to all the Sn's, that is, Sn(u,v) ≈ S0(u,v) exp(jφn);

• the dependence of S on the incidence angle is neglected.

The feed field impinging on the n-th patch is

$$\underline{E}_f^{\,n} = w_n\,\frac{e^{-jkr_n}}{r_n}\,\underline{e}_f, \qquad w_n = \cos^{m_f}(\theta_n),$$

where rn is the distance between the feed and the n-th patch, e_f is a vector independent of the index n, and wn is the feed illumination factor.

PO model → FFT/NUFFT (planar geometry)
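Under the PO hypotheses each element enters the pattern only through a complex coefficient. The C++ sketch below computes a_n = w_n e^{-jkr_n}/r_n e^{jφ_n} for one element. It assumes (this is not stated on the slide) that θ_n is the angle between the feed pointing direction and the feed-to-element direction, and it sets w_n = 0 behind the feed; `Vec3` and `po_coefficient` are illustrative names.

```cpp
#include <cassert>
#include <cmath>
#include <complex>

using cd = std::complex<double>;

struct Vec3 {
  double x, y, z;
};

// a_n = w_n * exp(-j k r_n) / r_n * exp(j phi_n), with w_n = cos^{m_f}(theta_n).
// 'axis' is the unit feed pointing direction (assumed definition of theta_n).
cd po_coefficient(const Vec3& feed, const Vec3& axis, const Vec3& elem,
                  double k, double m_f, double phi_n) {
  const Vec3 d{elem.x - feed.x, elem.y - feed.y, elem.z - feed.z};
  const double r_n = std::sqrt(d.x * d.x + d.y * d.y + d.z * d.z);
  const double cos_theta = (d.x * axis.x + d.y * axis.y + d.z * axis.z) / r_n;
  const double w_n = cos_theta > 0.0 ? std::pow(cos_theta, m_f) : 0.0;  // no back lobe
  return w_n / r_n * std::polar(1.0, phi_n - k * r_n);
}
```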

The P-series Approach

When dealing with facetted or conformal structures, the use of FFT/NUFFT can be restored thanks to the P-series approach.

Array factor:

$$F(u,v) = \sum_{n} a_n\, e^{j(ux_n + vy_n + wz_n)}, \qquad a_n = w_n\,\frac{e^{-jkr_n}}{r_n}\,e^{j\phi_n}$$

Let (u0,v0,w0) be the values of (u,v,w) related to the main beam direction, u' = u − u0, v' = v − v0, w' = w − w0 and a'n = an exp{j(u0xn + v0yn + w0zn)}. A P-term Taylor expansion of exp(jw'zn) gives

$$F(u,v) \approx \sum_{p=0}^{P-1}\frac{(jw')^p}{p!}\sum_{n} a'_n\, z_n^p\, e^{j(u'x_n + v'y_n)},$$

whose inner sums are planar and can be computed by FFT/NUFFT, with computational complexity O(PN log N).

For the mildly conformal or faceted structures usually considered, P is below 5, allowing a satisfactory speedup.
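A small C++ check of the P-series idea: the exact array factor is compared with its P-term expansion around (u0, v0, w0). For clarity both inner sums are evaluated directly; in the actual scheme each inner sum, being planar, is what an FFT/NUFFT would compute on a grid. Positions are assumed scaled by the wavenumber, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

// Exact array factor at one direction (u, v, w).
cd af_exact(const std::vector<cd>& a, const std::vector<double>& x,
            const std::vector<double>& y, const std::vector<double>& z,
            double u, double v, double w) {
  cd F = 0.0;
  for (std::size_t n = 0; n < a.size(); ++n)
    F += a[n] * std::polar(1.0, u * x[n] + v * y[n] + w * z[n]);
  return F;
}

// F ~ sum_{p<P} (j w')^p / p! * sum_n a'_n z_n^p exp(j(u' x_n + v' y_n)),
// with a'_n = a_n exp(j(u0 x_n + v0 y_n + w0 z_n)).
cd af_pseries(const std::vector<cd>& a, const std::vector<double>& x,
              const std::vector<double>& y, const std::vector<double>& z,
              double u, double v, double w, double u0, double v0, double w0, int P) {
  const double du = u - u0, dv = v - v0, dw = w - w0;
  cd F = 0.0, coeff = 1.0;  // coeff = (j dw)^p / p!
  for (int p = 0; p < P; ++p) {
    cd inner = 0.0;  // planar sum: the part an FFT/NUFFT would evaluate
    for (std::size_t n = 0; n < a.size(); ++n) {
      const cd a_shift = a[n] * std::polar(1.0, u0 * x[n] + v0 * y[n] + w0 * z[n]);
      inner += a_shift * std::pow(z[n], p) * std::polar(1.0, du * x[n] + dv * y[n]);
    }
    F += coeff * inner;
    coeff *= cd(0.0, dw) / double(p + 1);
  }
  return F;
}
```

In this toy case |w'zn| is of the order of 10⁻², and three terms already bring the error below 10⁻⁵.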

Unknowns representation

To make the approach effective, proper representations of the synthesis parameters should be adopted in the radiated field

$$\begin{bmatrix}E_{co}\\ E_{cr}\end{bmatrix}(u,v) = \frac{e^{-jkr}}{r}\,Q(u,v)\,S_0(u,v)\,\underline{e}_f\sum_{m=1}^{M} w_m\,\frac{e^{-jkr_m}}{r_m}\,e^{j\phi_m}\,e^{jk(ux_m + vy_m + wz_m)},$$

where φm is the m-th control phase, (xm, ym) is the m-th position and g: zm = g(xm, ym) is the surface equation.

Control phases:

$$\phi(x_m,y_m) = \sum_k c_k\, f_k(x_m,y_m)$$

In the first stages, Zernike polynomials are adopted as the fk's. In the final stage, impulsive functions are used, so that all the command-phase DoFs are exploited.

Element positions:

$$x_m = \sum_i d_i^x\, p_i^x(\xi_m,\eta_m), \qquad y_m = \sum_i d_i^y\, p_i^y(\xi_m,\eta_m),$$

mapping a uniform grid in the (ξ,η) plane into a non-uniform grid in the (x,y) plane.

Surface shape: a similar modal expansion is used for g,

$$g(x,y) = \sum_k e_k\, g_k(x,y).$$
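As an illustration of the modal representation of the control phases, the C++ sketch below evaluates φm = Σk ck fk(xm, ym) with the first six Zernike polynomials. The slide specifies neither ordering nor normalization, so this assumes unnormalized terms in Cartesian form (piston, tilts, defocus, astigmatisms) and positions normalized by a hypothetical aperture radius R.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// First six Zernike polynomials, unnormalized, on the unit disk:
// 1, x, y, 2(x^2 + y^2) - 1, x^2 - y^2, 2xy.
std::array<double, 6> zernike6(double x, double y) {
  const double r2 = x * x + y * y;
  return {1.0, x, y, 2.0 * r2 - 1.0, x * x - y * y, 2.0 * x * y};
}

// phi_m = sum_k c_k f_k(x_m / R, y_m / R), R = aperture radius (assumption).
double control_phase(const std::array<double, 6>& c, double xm, double ym, double R) {
  const std::array<double, 6> f = zernike6(xm / R, ym / R);
  double phi = 0.0;
  for (std::size_t k = 0; k < c.size(); ++k) phi += c[k] * f[k];
  return phi;
}
```

A handful of coefficients thus controls the phases of all elements, which keeps the early stages small; the impulsive representation of the final stage corresponds to treating every φm as an independent unknown.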

Constraints

Element positions: constraints on the element spacings are crucial, since small spacings must be avoided to prevent complex inter-element effects and apparent superdirectivity. Constraints on the maximum spacing are also necessary, to avoid exceedingly large RAs.

Reflecting surface: constraints on the smoothness of the surface shape are also needed.

Command phases: constraints on the command phases can become crucial to avoid abrupt variations between adjacent elements, which are not practically achievable (abrupt changes in the element size due to phase wrap).
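The slide does not say how the spacing constraints enter the optimization. One common option, assumed here, is a quadratic penalty added to Φ that vanishes when every spacing lies within [min, max]. The C++ sketch works along one row of elements with positions in wavelengths; `spacing_penalty` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Quadratic penalty on spacing violations along one row (positions in wavelengths).
// Returns 0 iff every spacing lies in [min_sp, max_sp].
double spacing_penalty(std::vector<double> pos, double min_sp, double max_sp) {
  std::sort(pos.begin(), pos.end());
  double pen = 0.0;
  for (std::size_t i = 1; i < pos.size(); ++i) {
    const double d = pos[i] - pos[i - 1];
    if (d < min_sp) pen += (min_sp - d) * (min_sp - d);  // too close
    if (d > max_sp) pen += (d - max_sp) * (d - max_sp);  // too far apart
  }
  return pen;
}
```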

Canada+ConUS coverage

Number of x and y elements: 44x44
Working frequency: 14.25 GHz
(min/max) x and y spacing: 0.5λ/0.7λ
Feed location: zfeed = 1 m, yfeed = 26.7 cm
Feed pointing angle: θf = 14.94°
Feed illumination factor mf: 12
Target coverage: Continental US + Canada
Reference min. gain: 28.4 dB

[Figure: reflectarray in the (yRA, zRA) frame with the feed at (yfeed, zfeed), pointing angle θf, and the quantities F and D marked]
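A quick C++ computation of what the table implies in physical units at 14.25 GHz: the wavelength, the minimum and maximum spacings, and the resulting center-to-center extent of 44 elements. The extent formula (n − 1) × spacing is a simple estimate introduced here, not a figure from the slide.

```cpp
#include <cassert>
#include <cmath>

constexpr double kSpeedOfLight = 299792458.0;  // m/s

struct ArrayGeometry {
  double lambda, min_spacing, max_spacing, min_side, max_side;  // meters
};

// n_elems elements per side span (n_elems - 1) gaps, center to center.
ArrayGeometry example_geometry(double freq_hz, int n_elems, double min_sp_wl, double max_sp_wl) {
  const double lambda = kSpeedOfLight / freq_hz;
  return {lambda, min_sp_wl * lambda, max_sp_wl * lambda,
          (n_elems - 1) * min_sp_wl * lambda, (n_elems - 1) * max_sp_wl * lambda};
}
```

With the table's values this gives λ ≈ 21.04 mm, spacings of about 10.5 mm and 14.7 mm, and a side between roughly 0.45 m and 0.63 m.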

Canada+ConUS coverage

Periodic RA (λ/2 spacing along x and y): mean directivity 29.71 dB, minimum directivity 27.66 dB.

Aperiodic RA: mean directivity 30.97 dB, minimum directivity 27.98 dB.

Outline: minimization of the objective functional Φ(x)

Keypoints:
• Radiated field
• Functional gradient
• Optimization

Flow: starting guess → calculate the radiated field (ACO and ACR) → calculate the gradient → specs fulfilled? If yes, solution; if no, update the unknowns x and iterate.

The calculation of the field and of the gradient at each step is highly demanding for large reflectarrays (40x40 elements or larger). GPUs make the synthesis of large reflectarrays feasible in reasonable computing times:

• radiated field and gradient for POS: CUDA implementation;
• radiated field and gradient for the accurate synthesis: Jacket implementation.

Outline: Computing Hardware

Kepler:
• Higher double-precision throughput
• Faster atomic operations
• Dynamic parallelism

Implementations on both the Fermi and Kepler architectures.

Fast evaluation of the radiated field (POS)

[Figure: surface S hosting M radiating elements]

$$\begin{bmatrix}E_{co}\\ E_{cr}\end{bmatrix}(u,v) = \frac{e^{-jkr}}{r}\,Q(u,v)\,S_0(u,v)\,\underline{e}_f\sum_{m=1}^{M} w_m\,\frac{e^{-jkr_m}}{r_m}\,e^{j\phi_m}\,e^{jk(ux_m + vy_m + wz_m)}$$

The sum is an array factor; at the observation points (uh, vh) it reads

$$F_h = F(u_h,v_h) = \sum_{l=0}^{M-1} a_l\, e^{j[u_h x_l + v_h y_l + w_h z_l]}.$$

Fh cannot necessarily be expressed as a standard DFT of the al's, due to:

• the exponential term exp(jwh zl) (conformal reflectarray) and/or
• the possibly irregular (xl, yl) spatial grid and/or
• the possible need of calculating the pattern on an irregular (uh, vh) spectral grid.

Therefore, standard FFT routines (having a convenient O(M log M) computational complexity) cannot necessarily be employed.


Fast algorithms:
• P-series
• Subarray approach
• Non-Uniform FFT (NUFFT)

Hardware: employ advanced computing hardware (Graphics Processing Units, GPUs).

An algorithm for the fast analysis of irregular arrays, having the same computational complexity as standard FFTs and employing advanced (parallel) hardware, is therefore needed.

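P-series and subarray approach

If S has a mild curvature, denoting by w0 the value of w corresponding to the main beam center,

$$e^{j(w_h-w_0)z_l} = \sum_{p=0}^{\infty}\frac{[j(w_h-w_0)z_l]^p}{p!} \approx \sum_{p=0}^{P-1}\frac{[j(w_h-w_0)z_l]^p}{p!}.$$

P-series:

$$F_h \approx \sum_{p=0}^{P-1}\frac{[j(w_h-w_0)]^p}{p!}\sum_{l=0}^{M-1} a'_l\, z_l^p\, e^{j(u_h x_l + v_h y_l)}, \qquad a'_l = a_l\, e^{jw_0 z_l},$$

where each inner sum is computed by a NUFFT.

Subarray approach: the elements are grouped into Q subarrays, the q-th subarray containing the elements of the set $\mathcal{M}_q$ with reference height $z_{0q}$, and the P-series is applied to each of them:

$$F_h \approx \sum_{q=1}^{Q} e^{j(w_h-w_0)z_{0q}}\sum_{p=0}^{P-1}\frac{[j(w_h-w_0)]^p}{p!}\sum_{l\in\mathcal{M}_q} a'_l\,(z_l-z_{0q})^p\, e^{j(u_h x_l + v_h y_l)}.$$

Since the height excursion within each subarray is smaller, the subarray approach speeds up the convergence of the p-series.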

Optimized Matrix Vector Multiplication – OMVM

$$F_h = \sum_{l=0}^{M-1} B_{hl}\, a_l, \qquad B_{hl} = e^{j[u_h x_l + v_h y_l + w_h z_l]}$$

Computational complexity ~ O(M²).
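The slide does not detail what makes the matrix-vector product “optimized”. The C++ sketch below assumes one simple reading: since positions and observation directions are fixed within a synthesis stage, B is built once and reused for every new coefficient vector, so each evaluation is a plain dense mat-vec. `OmvmOperator` is a hypothetical name, and positions are assumed scaled by the wavenumber.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

// Dense operator B_{hl} = exp(j(u_h x_l + v_h y_l + w_h z_l)), stored row-major.
class OmvmOperator {
 public:
  OmvmOperator(const std::vector<double>& x, const std::vector<double>& y,
               const std::vector<double>& z, const std::vector<double>& u,
               const std::vector<double>& v)
      : rows_(u.size()), cols_(x.size()), B_(rows_ * cols_) {
    for (std::size_t h = 0; h < rows_; ++h) {
      const double w = std::sqrt(std::max(0.0, 1.0 - u[h] * u[h] - v[h] * v[h]));
      for (std::size_t l = 0; l < cols_; ++l)
        B_[h * cols_ + l] = std::polar(1.0, u[h] * x[l] + v[h] * y[l] + w * z[l]);
    }
  }

  // F = B a: O(H M) per evaluation, i.e. ~O(M^2) when H ~ M.
  std::vector<cd> apply(const std::vector<cd>& a) const {
    std::vector<cd> F(rows_);
    for (std::size_t h = 0; h < rows_; ++h)
      for (std::size_t l = 0; l < cols_; ++l) F[h] += B_[h * cols_ + l] * a[l];
    return F;
  }

 private:
  std::size_t rows_, cols_;
  std::vector<cd> B_;
};
```

The trade-off is memory: B holds H × M complex entries, which is what the O(M²) figure reflects.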

Surface / Patch lattice / Spectral lattice → Numerical tool

• Flat / Regular / Regular → Regular FFT
• Flat / Irregular / Regular → NED-NUFFT
• Flat / Regular / Irregular → NER-NUFFT
• Flat / Irregular / Irregular → Type 3-NUFFT
• Conformal / […] / […] → P-Series + above tools (FFT, Non-Equispaced Data (NED) NUFFT, Non-Equispaced Results (NER) NUFFT, Type 3 NUFFT)

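Non-Uniform FFT (NUFFT) - NER

NER-type (Non-Equispaced Results) DFT, 1D case:

$$\hat z(x_l) = \sum_{k=-N/2}^{N/2} z_k\, e^{-j2\pi k x_l/N}, \qquad l = 1,\dots,M,$$

with xl the non-uniform result sampling points.

The NUFFT exploits the Poisson summation formula, expressing each “non-uniformly sampled” exponential as an infinite sum of “uniformly sampled” exponentials:

$$e^{-j2\pi k x_l/N} = \frac{1}{\varphi(2\pi k/(cN))}\sum_{m=-\infty}^{+\infty}\hat\varphi(c x_l - m)\, e^{-j2\pi m k/(cN)},$$

where c is the oversampling factor, φ is a proper window function and $\hat\varphi(x) = \frac{1}{2\pi}\int \varphi(\xi)\, e^{jx\xi}\, d\xi$ is its transform. The identity holds if φ does not vanish on [−π/c, π/c] and is supported in (−α, α) with α < π(2 − 1/c). Moreover, φ̂ should be concentrated in (−K, K), so that the series can be truncated to the 2K+1 terms closest to c xl:

$$e^{-j2\pi k x_l/N} \approx \frac{1}{\varphi(2\pi k/(cN))}\sum_{m=\mu_l-K}^{\mu_l+K}\hat\varphi(c x_l - m)\, e^{-j2\pi m k/(cN)}, \qquad \mu_l = \mathrm{round}(c x_l).$$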

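The Poisson summation formula thus becomes an interpolation formula specifically tailored to “non-uniformly sampled” exponentials:

$$\hat z(x_l) \approx \sum_{m=\mu_l-K}^{\mu_l+K}\hat\varphi(c x_l - m)\, U_m = \mathrm{Int}[U](c x_l), \qquad U_m = \sum_{k=-N/2}^{N/2}\frac{z_k}{\varphi(2\pi k/(cN))}\, e^{-j2\pi m k/(cN)},$$

where Um is periodic in m with period cN.

Steps to calculate the NER-NUFFT:

1. Scaling and zero padding of the zk's to cN samples;
2. Standard FFT on cN points;
3. Convolution (interpolation), l = 1, …, M.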

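A possible, although suboptimal, choice for the NUFFT windows is the Kaiser-Bessel pair

$$\varphi(\xi) = \begin{cases} I_0\big(K\sqrt{\alpha^2-\xi^2}\big), & |\xi| \le \alpha \\ 0, & |\xi| > \alpha \end{cases} \qquad \hat\varphi(x) = \frac{\sinh\big(\alpha\sqrt{K^2-x^2}\big)}{\pi\sqrt{K^2-x^2}}, \qquad \alpha = \pi\left(2-\frac{1}{c}\right) - 0.01,$$

where I0 is the modified Bessel function of the first kind and order zero. For |x| > K the square roots become imaginary and sinh turns into sin; at |x| = K, φ̂ = α/π.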
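The window pair can be checked numerically. The C++ sketch below evaluates φ and φ̂ (with the same three branches used in the CUDA kernels) and compares φ̂ with a Simpson quadrature of (1/2π)∫φ(ξ)cos(xξ)dξ. I0 is computed here by its power series, a simple substitute for Blair's rational Chebyshev approximation used on the GPU; K = 6 matches the 13×13 child block used later, while c = 2 is an assumption.

```cpp
#include <cassert>
#include <cmath>

constexpr double kPi = 3.14159265358979323846;

// I0(x) = sum_k ((x/2)^2)^k / (k!)^2, power series (substitute for Blair's approach).
double bessel_i0(double x) {
  const double q = 0.25 * x * x;
  double term = 1.0, sum = 1.0;
  for (int k = 1; k < 1000; ++k) {
    term *= q / (static_cast<double>(k) * k);
    sum += term;
    if (term < 1e-17 * sum) break;
  }
  return sum;
}

// phi(xi) = I0(K sqrt(alpha^2 - xi^2)) for |xi| <= alpha, 0 elsewhere.
double kb_window(double xi, double K, double alpha) {
  if (std::fabs(xi) > alpha) return 0.0;
  return bessel_i0(K * std::sqrt(alpha * alpha - xi * xi));
}

// phi_hat(x): sinh branch for |x| < K, sin branch for |x| > K, alpha/pi at |x| = K.
double kb_window_hat(double x, double K, double alpha) {
  const double p = K * K - x * x;
  if (p > 1e-12) return std::sinh(alpha * std::sqrt(p)) / (kPi * std::sqrt(p));
  if (p < -1e-12) return std::sin(alpha * std::sqrt(-p)) / (kPi * std::sqrt(-p));
  return alpha / kPi;
}

// (1/2pi) * integral over (-alpha, alpha) of phi(xi) cos(x xi), Simpson rule, n even.
double kb_window_hat_quadrature(double x, double K, double alpha, int n) {
  const double h = 2.0 * alpha / n;
  double sum = 0.0;
  for (int i = 0; i <= n; ++i) {
    const double xi = -alpha + i * h;
    const double wgt = (i == 0 || i == n) ? 1.0 : (i % 2 == 1 ? 4.0 : 2.0);
    sum += wgt * kb_window(xi, K, alpha) * std::cos(x * xi);
  }
  return sum * h / 3.0 / (2.0 * kPi);
}
```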




NER-NUFFT: operations count

• Scaling and zero padding: 2N operations.
• Standard FFT on cN points: O(cN log(cN)) operations.
• Interpolation (l = 1, …, M): M(2K+1) operations.
• Spatial and spectral windows: the cost depends on the accuracy desired in calculating the involved special functions.

For N, M ≫ K, the computational complexity is O(cN log(cN)).
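A CPU reference sketch, in C++, of the three NER steps for the 1D case, checked against the direct DFT. Assumptions: the N-term range k = −N/2, …, N/2−1, an integer oversampling factor c, the Kaiser-Bessel pair above, and a naive O((cN)²) DFT standing in for cuFFT; all names are illustrative.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;
constexpr double kPi = 3.14159265358979323846;

double bessel_i0(double x) {  // power series (stand-in for Blair's approximation)
  const double q = 0.25 * x * x;
  double term = 1.0, sum = 1.0;
  for (int k = 1; k < 1000 && term >= 1e-17 * sum; ++k) {
    term *= q / (static_cast<double>(k) * k);
    sum += term;
  }
  return sum;
}
double kb_window(double xi, int K, double alpha) {
  return std::fabs(xi) > alpha ? 0.0 : bessel_i0(K * std::sqrt(alpha * alpha - xi * xi));
}
double kb_window_hat(double x, int K, double alpha) {
  const double p = double(K) * K - x * x;
  if (p > 1e-12) return std::sinh(alpha * std::sqrt(p)) / (kPi * std::sqrt(p));
  if (p < -1e-12) return std::sin(alpha * std::sqrt(-p)) / (kPi * std::sqrt(-p));
  return alpha / kPi;
}

// Reference: z(x_l) = sum_{k=-N/2}^{N/2-1} z_k exp(-j 2 pi k x_l / N), O(N M).
std::vector<cd> ner_direct(const std::vector<cd>& z, const std::vector<double>& x) {
  const int N = static_cast<int>(z.size());
  std::vector<cd> out(x.size());
  for (std::size_t l = 0; l < x.size(); ++l)
    for (int k = -N / 2; k < N / 2; ++k)
      out[l] += z[k + N / 2] * std::polar(1.0, -2.0 * kPi * k * x[l] / N);
  return out;
}

// NER-NUFFT; z[k + N/2] holds z_k (N even).
std::vector<cd> ner_nufft(const std::vector<cd>& z, const std::vector<double>& x, int c, int K) {
  const int N = static_cast<int>(z.size()), L = c * N;
  const double alpha = kPi * (2.0 - 1.0 / c) - 0.01;
  // 1) Scaling and zero padding to L = cN samples (sample k stored at k mod L).
  std::vector<cd> g(L);
  for (int k = -N / 2; k < N / 2; ++k)
    g[(k + L) % L] = z[k + N / 2] / kb_window(2.0 * kPi * k / L, K, alpha);
  // 2) Length-L forward DFT (naive stand-in for cuFFT).
  std::vector<cd> U(L);
  for (int m = 0; m < L; ++m)
    for (int s = 0; s < L; ++s)
      U[m] += g[s] * std::polar(1.0, -2.0 * kPi * double(s) * m / L);
  // 3) Interpolation: 2K+1 terms around mu = round(c x_l), indices wrapped mod L.
  std::vector<cd> out(x.size());
  for (std::size_t l = 0; l < x.size(); ++l) {
    const double t = c * x[l];
    const long mu = std::lround(t);
    for (long m = mu - K; m <= mu + K; ++m)
      out[l] += kb_window_hat(t - double(m), K, alpha) * U[((m % L) + L) % L];
  }
  return out;
}
```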



NER-NUFFT: parallel CUDA implementation

• Scaling and zero padding: implemented by a specific kernel; intrinsically parallel step.
• Standard FFT on cN points: cuFFT library.
• Interpolation (l = 1, …, M): each thread is assigned to a different l and calculates a summation of 2K+1 terms.
• Evaluation of the spatial window function: the modified Bessel function is evaluated by rational Chebyshev approximations, calculated in advance by a specific kernel.

Calculation of the modified Bessel function I0:

__device__ double bessi0(double x)
{
  double num, den, x2;
  x2 = fabs(x * x);  /* fabs: abs() is not the double-precision absolute value */
  x = fabs(x);
  if (x > 15.0) {
    den = 1.0 / x;
    num = -4.4979236558557991E+006;
    num = fma(num, den, 2.7472555659426521E+006);
    num = fma(num, den, -6.4572046640793153E+005);
    /* [...] */
    num = fma(num, den, 3.9894228040143265E-001);
    num = num * den;
    den = sqrt(x);
    num = num * den;
    den = exp(0.5 * x); /* prevent premature overflow */
    num = num * den;
    num = num * den;
    return num;
  } else {
    num = -0.27288446572737951578789523409E+010;
    num = fma(num, x2, -0.6768549084673824894340380223E+009);
    num = fma(num, x2, -0.4130296432630476829274339869E+008);
    /* […] */
    den = -0.2728844657273795156746641315E+010;
    den = fma(den, x2, 0.5356255851066290475987259E+007);
    /* […] */
    return num / den;
  }
}

Rational Chebyshev approximations:

$$I_0(x) \approx \frac{\sum_{j=0}^{16} p_j\, x^{2j}}{\sum_{j=0}^{3} q_j\, x^{2j}}, \quad |x| \le 15; \qquad I_0(x) \approx x^{-1/2}\, e^{x}\sum_{j=0}^{25} p_j\, T_j\!\left(\frac{30}{x}-1\right), \quad x > 15,$$

with pj, qj the expansion coefficients and Tj the Chebyshev polynomials.

J.M. Blair, “Rational Chebyshev approximations for the modified Bessel functions I0 and I1”, Math. of Comput., vol. 28, n. 126, pp. 581-583, Apr. 1974.

Special function calculation: the Bessel function I0 is not available in the CUDA libraries; it has been implemented according to Blair’s approach.

NER-NUFFT: Interpolation


// c, K, alfa, pi, modulo() and IDX2R() are defined elsewhere (constants, helper, indexing macro)
__global__ void Interpolation(const double2* __restrict__ U_d, const double* __restrict__ x1_d,
                              const double* __restrict__ x2_d, double2* __restrict__ tr,
                              const int N1, const int N2, const int N)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  if (i < N) {
    int ind_i, ind_j;
    double x1 = x1_d[i], x2 = x2_d[i], mu1 = rint(c*x1), mu2 = rint(c*x2),
           phicap1, phicap2, tempd, p1, p2, expon;
    double2 UU, temp = make_cuDoubleComplex(0.0, 0.0);
    for (int m1 = -K; m1 <= K; m1++) {
      ind_i = modulo((int)mu1 + m1 + c*N1, c*N1);
      expon = (c*x1 - (mu1 + (double)m1));
      p1 = K*K - expon*expon;
      if (p1 < 0.) { tempd = rsqrt(-p1); phicap1 = (1./pi)*((sin(alfa/tempd))*tempd); }
      else if (p1 > 0.) { tempd = rsqrt(p1); phicap1 = (1./pi)*((sinh(alfa/tempd))*tempd); }
      else phicap1 = alfa/pi;
      for (int m2 = -K; m2 <= K; m2++) {
        ind_j = modulo((int)mu2 + m2 + c*N2, c*N2);
        expon = (c*x2 - (mu2 + (double)m2));
        p2 = K*K - expon*expon;
        if (p2 < 0.) { tempd = rsqrt(-p2); phicap2 = (1./pi)*((sin(alfa/tempd))*tempd); }
        else if (p2 > 0.) { tempd = rsqrt(p2); phicap2 = (1./pi)*((sinh(alfa/tempd))*tempd); }
        else phicap2 = alfa/pi;
        UU = U_d[IDX2R(ind_j, ind_i, c*N2)];
        temp.x = temp.x + phicap1*phicap2*UU.x;
        temp.y = temp.y + phicap1*phicap2*UU.y;
      }
    }
    tr[i] = temp;
  }
}  // closing brace of the kernel (missing in the original listing)


Optimization notes:
• Overlap between memory loads and computation.
• Analysis of branch paths: efficiency close to 100%.
• Reciprocal square root (rsqrt).
• foo % n == foo & (n-1) for n a power of 2 and foo ≥ 0.
• Read-only cache for Kepler.
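The modulo trick coincides with the mathematical (non-negative) modulo only for non-negative operands, while the kernels also wrap negative indices. A short C++ sketch of two index-wrapping helpers (hypothetical names, loosely mirroring the `modulo()` helper the kernels call but do not show):

```cpp
#include <cassert>

// Non-negative remainder in [0, n) for any int i and n > 0.
int wrap_index(int i, int n) {
  const int r = i % n;  // in C++ the sign of r follows the sign of i
  return r < 0 ? r + n : r;
}

// Bitmask version for n a power of two; with two's complement integers
// (guaranteed since C++20) it also maps negative i into [0, n).
int wrap_index_pow2(int i, int n) { return i & (n - 1); }
```

Note that the plain % operator returns −3 for −3 % 8, whereas both helpers return 5.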

NED-NUFFT



NED-NUFFT: parallel CUDA implementation

NED-type (Non-Equispaced Data) DFT: $\hat z_k = \sum_l z_l\, e^{-j2\pi k x_l/N}$, evaluated as

$$\hat z_k \approx \frac{1}{\varphi(2\pi k/(cN))}\sum_{s=-cN/2}^{cN/2-1}\Bigg[\sum_{l}\;\sum_{m:\,|c x_l-(s+mcN)|\le K} z_l\,\hat\varphi\big(c x_l-(s+mcN)\big)\Bigg] e^{-j2\pi k s/(cN)}, \qquad k=-N/2,\dots,N/2$$

• Interpolation: each thread is assigned to a different l and adds its 2K+1 contributions to the grid samples; atomic operations are required.
• Standard FFT on cN points: cuFFT library.
• Scaling and decimation: implemented by a specific kernel; intrinsically parallel step.
• Evaluation of the spatial window function: the modified Bessel function is evaluated by rational Chebyshev approximations, calculated in advance by a specific kernel.


Steps to calculate the NED-NUFFT and operations count

• Interpolation: M(2K+1) operations.
• Standard FFT on cN points: O(cN log(cN)) operations.
• Scaling and decimation: 2N operations.
• Spatial and spectral windows: the cost depends on the accuracy desired in calculating the involved functions.

For N, M ≫ K, the computational complexity is O(cN log(cN)).
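The NED steps can likewise be sketched on the CPU in C++: spreading onto the oversampled grid (the step that needs atomics on the GPU), a length-cN DFT evaluated at the N needed frequencies, and scaling. Same assumptions as for the NER case (Kaiser-Bessel pair, integer c, naive DFT instead of cuFFT, k = −N/2, …, N/2−1); all names are illustrative.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;
constexpr double kPi = 3.14159265358979323846;

double bessel_i0(double x) {  // power series (stand-in for Blair's approximation)
  const double q = 0.25 * x * x;
  double term = 1.0, sum = 1.0;
  for (int k = 1; k < 1000 && term >= 1e-17 * sum; ++k) {
    term *= q / (static_cast<double>(k) * k);
    sum += term;
  }
  return sum;
}
double kb_window(double xi, int K, double alpha) {
  return std::fabs(xi) > alpha ? 0.0 : bessel_i0(K * std::sqrt(alpha * alpha - xi * xi));
}
double kb_window_hat(double x, int K, double alpha) {
  const double p = double(K) * K - x * x;
  if (p > 1e-12) return std::sinh(alpha * std::sqrt(p)) / (kPi * std::sqrt(p));
  if (p < -1e-12) return std::sin(alpha * std::sqrt(-p)) / (kPi * std::sqrt(-p));
  return alpha / kPi;
}

// Reference: zhat_k = sum_l z_l exp(-j 2 pi k x_l / N), stored at index k + N/2.
std::vector<cd> ned_direct(const std::vector<cd>& z, const std::vector<double>& x, int N) {
  std::vector<cd> out(N);
  for (int k = -N / 2; k < N / 2; ++k)
    for (std::size_t l = 0; l < z.size(); ++l)
      out[k + N / 2] += z[l] * std::polar(1.0, -2.0 * kPi * k * x[l] / N);
  return out;
}

std::vector<cd> ned_nufft(const std::vector<cd>& z, const std::vector<double>& x,
                          int N, int c, int K) {
  const int L = c * N;
  const double alpha = kPi * (2.0 - 1.0 / c) - 0.01;
  // 1) Interpolation (spreading): each point adds 2K+1 weighted samples to the
  //    oversampled grid; on the GPU these accumulations need atomic operations.
  std::vector<cd> grid(L);
  for (std::size_t l = 0; l < z.size(); ++l) {
    const double t = c * x[l];
    const long mu = std::lround(t);
    for (long m = mu - K; m <= mu + K; ++m)
      grid[((m % L) + L) % L] += z[l] * kb_window_hat(t - double(m), K, alpha);
  }
  // 2) DFT on L points, evaluated only at the N needed frequencies (naive sums);
  // 3) scaling and decimation: divide by phi(2 pi k / L).
  std::vector<cd> out(N);
  for (int k = -N / 2; k < N / 2; ++k) {
    cd acc = 0.0;
    for (int s = 0; s < L; ++s) acc += grid[s] * std::polar(1.0, -2.0 * kPi * double(k) * s / L);
    out[k + N / 2] = acc / kb_window(2.0 * kPi * k / L, K, alpha);
  }
  return out;
}
```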

NED-NUFFT: Interpolation

// Body of the spreading kernel (the signature is not shown on the slide);
// cc, K, alfa, pi_double, modulo() and IDX2R() are defined elsewhere.
int i = threadIdx.x + blockDim.x * blockIdx.x;
if (i < M) {
  // x[i] and y[i] are read only after the bounds check
  double cc_points1 = cc*x[i], r_cc_points1 = rint(cc_points1);
  double cc_points2 = cc*y[i], r_cc_points2 = rint(cc_points2);
  double phi_cap1, phi_cap2, P1, P2, tempd;
  int PP1, PP2;
  for (int m = 0; m < (2*K+1); m++) {
    P1 = K*K - (cc_points1 - (r_cc_points1 + (m-K)))*(cc_points1 - (r_cc_points1 + (m-K)));
    PP1 = modulo((r_cc_points1 + (m-K) + N1*cc/2), (cc*N1));
    if (P1 < 0.) { tempd = rsqrt(-P1); phi_cap1 = (1./pi_double)*((sin(alfa/tempd))*tempd); }
    else if (P1 > 0.) { tempd = rsqrt(P1); phi_cap1 = (1./pi_double)*((sinh(alfa/tempd))*tempd); }
    else phi_cap1 = alfa/pi_double;
    for (int n = 0; n < (2*K+1); n++) {
      P2 = K*K - (cc_points2 - (r_cc_points2 + (n-K)))*(cc_points2 - (r_cc_points2 + (n-K)));
      PP2 = modulo((r_cc_points2 + (n-K) + N2*cc/2), (cc*N2));
      if (P2 < 0.) { tempd = rsqrt(-P2); phi_cap2 = (1./pi_double)*((sin(alfa/tempd))*tempd); }
      else if (P2 > 0.) { tempd = rsqrt(P2); phi_cap2 = (1./pi_double)*((sinh(alfa/tempd))*tempd); }
      else phi_cap2 = alfa/pi_double;
      atomicAdd(&result[IDX2R(PP1, PP2, cc*N2)].x, data[i].x*phi_cap1*phi_cap2);
      atomicAdd(&result[IDX2R(PP1, PP2, cc*N2)].y, data[i].y*phi_cap1*phi_cap2);
    }
  }
}

Fermi and Kepler GPUs have no native double-precision atomicAdd: the routine given in the CUDA C Programming Guide, built on atomicCAS, is used.
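For reference, the compare-and-swap pattern behind that routine can be illustrated on the host with C++ `std::atomic<double>`. This is an analogue of the idea, not the Programming Guide's device code (which operates on the 64-bit pattern through atomicCAS); `atomic_add_double` is a hypothetical name.

```cpp
#include <atomic>
#include <cassert>

// Retry until no other thread modified the value between the load and the exchange.
double atomic_add_double(std::atomic<double>& target, double val) {
  double old = target.load();
  while (!target.compare_exchange_weak(old, old + val)) {
    // on failure, 'old' has been refreshed with the current value: retry
  }
  return old;  // value before the addition, like CUDA's atomicAdd
}
```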

NED-NUFFT: Interpolation with dynamic parallelism

// Child kernel: one thread per (m, n) term of the (2K+1)x(2K+1) interpolation stencil.
// cc, K, alfa, pi_double, modulo() and IDX2R() are defined elsewhere.
__global__ void series_terms(double2 temp_data, double2* __restrict__ result,
                             const double r_cc_points1, const double cc_diff1,
                             const double r_cc_points2, const double cc_diff2,
                             const int N1, const int N2)
{
  int m = threadIdx.x;
  int n = threadIdx.y;
  double tempd, phi_cap, P;  // P was used without being declared
  P = K*K - (cc_diff1 - (m-K))*(cc_diff1 - (m-K));
  if (P < 0.) { tempd = rsqrt(-P); phi_cap = (1./pi_double)*((sin(alfa/tempd))*tempd); }
  else if (P > 0.) { tempd = rsqrt(P); phi_cap = (1./pi_double)*((sinh(alfa/tempd))*tempd); }
  else phi_cap = alfa/pi_double;
  P = K*K - (cc_diff2 - (n-K))*(cc_diff2 - (n-K));
  if (P < 0.) { tempd = rsqrt(-P); phi_cap = phi_cap*(1./pi_double)*((sin(alfa/tempd))*tempd); }
  else if (P > 0.) { tempd = rsqrt(P); phi_cap = phi_cap*(1./pi_double)*((sinh(alfa/tempd))*tempd); }
  else phi_cap = phi_cap*alfa/pi_double;
  int PP1 = modulo((r_cc_points1 + (m-K) + N1*cc/2), (cc*N1));
  int PP2 = modulo((r_cc_points2 + (n-K) + N2*cc/2), (cc*N2));
  atomicAdd(&result[IDX2R(PP1, PP2, cc*N2)].x, temp_data.x*phi_cap);
  atomicAdd(&result[IDX2R(PP1, PP2, cc*N2)].y, temp_data.y*phi_cap);
}

// Parent kernel: one thread per data point, each launching a child grid.
__global__ void dynamic_interpolation(const double2* __restrict__ data, double2* __restrict__ result,
                                      const double* __restrict__ x, const double* __restrict__ y,
                                      const int N1, const int N2, int M)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  if (i < M) {  // bounds check before reading x[i], y[i], data[i]
    double cc_points1 = cc*x[i];
    double r_cc_points1 = rint(cc_points1);  // equivalent of mu
    const double cc_diff1 = cc_points1 - r_cc_points1;
    double cc_points2 = cc*y[i];
    double r_cc_points2 = rint(cc_points2);  // equivalent of mu
    const double cc_diff2 = cc_points2 - r_cc_points2;
    double2 temp_data = data[i];
    dim3 dimBlock(13, 13);  // (2K+1)x(2K+1) threads, i.e. K = 6
    dim3 dimGrid(1, 1);
    series_terms<<<dimGrid, dimBlock>>>(temp_data, result, r_cc_points1, cc_diff1,
                                        r_cc_points2, cc_diff2, N1, N2);
  }
}

NED-NUFFT: fftshift

cuFFT requires summation indices starting from 0 and returns DFTs with indices starting from 0, whereas the NUFFT sums run over s = −cN/2, …, cN/2−1 and the outputs are needed for k = −N/2, …, N/2.

Solution with memory movements (swap): ifftshift/fftshift around the FFT.

NED-NUFFT: fftshift

Solution without memory movements. For cN even:

$$(-1)^k\sum_{s=0}^{cN-1}(-1)^s f_s\, e^{-j2\pi k s/(cN)} = \sum_{s=0}^{cN-1} f_s\, e^{j2\pi s (cN/2)/(cN)}\, e^{j2\pi k (cN/2)/(cN)}\, e^{-j2\pi k s/(cN)} = e^{j2\pi (cN/2)^2/(cN)}\sum_{s=0}^{cN-1} f_s\, e^{-j2\pi (k-cN/2)(s-cN/2)/(cN)}$$

Hence, for cN even, the centered DFT needed by the NUFFT is obtained from the standard one without any memory movement:

$$\hat f_{k'} = \sum_{s'=-cN/2}^{cN/2-1} f_{s'+cN/2}\, e^{-j2\pi k' s'/(cN)} = e^{-j\pi cN/2}\,(-1)^{k}\sum_{s=0}^{cN-1}(-1)^{s} f_s\, e^{-j2\pi k s/(cN)}, \qquad k' = k - cN/2,$$

where the sum on the right is computed by cuFFT (which returns indices 0, …, cN−1): the input is multiplied by (−1)^s and the output by (−1)^k.
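The index identity can be verified numerically. The C++ sketch below compares a directly computed centered DFT with the modulation approach, using a naive 0-based DFT as a stand-in for cuFFT; it covers both cN/2 even and odd to exercise the constant factor. Function names are illustrative.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <vector>

using cd = std::complex<double>;
constexpr double kPi = 3.14159265358979323846;

// 0-based forward DFT (stand-in for cuFFT).
std::vector<cd> dft(const std::vector<cd>& f) {
  const int L = static_cast<int>(f.size());
  std::vector<cd> out(L);
  for (int k = 0; k < L; ++k)
    for (int s = 0; s < L; ++s) out[k] += f[s] * std::polar(1.0, -2.0 * kPi * double(k) * s / L);
  return out;
}

// Centered DFT: out[k' + L/2] = sum_{s'} f[s' + L/2] e^{-j 2 pi k' s' / L}, s', k' in [-L/2, L/2).
std::vector<cd> centered_dft_direct(const std::vector<cd>& f) {
  const int L = static_cast<int>(f.size());
  std::vector<cd> out(L);
  for (int kp = -L / 2; kp < L / 2; ++kp)
    for (int sp = -L / 2; sp < L / 2; ++sp)
      out[kp + L / 2] += f[sp + L / 2] * std::polar(1.0, -2.0 * kPi * double(kp) * sp / L);
  return out;
}

// Same result without shifts: (-1)^s on input, 0-based DFT, (-1)^k and (-1)^{L/2} on output.
std::vector<cd> centered_dft_by_modulation(std::vector<cd> f) {
  const int L = static_cast<int>(f.size());
  for (int s = 1; s < L; s += 2) f[s] = -f[s];
  std::vector<cd> out = dft(f);
  const double sign_half = (L / 2) % 2 == 0 ? 1.0 : -1.0;  // e^{-j pi L/2}
  for (int k = 0; k < L; ++k) out[k] *= (k % 2 == 0 ? sign_half : -sign_half);
  return out;
}
```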

NUFFT NER Results

[Figure: speedup and execution time versus the number of elements M×M; oversampling of 2 for the power pattern]

NUFFT NED Results

[Figure: speedup and execution time versus the number of elements M×M; oversampling of 2 for the power pattern]

Dynamic parallelism significantly improves the results.

Acceleration of the pattern evaluation by AccelerEyes Jacket

With the accurate model, NUFFT routines can no longer be used. The algorithm has been written as a Matlab script accelerated by functions exploiting the AccelerEyes Jacket toolbox; the radiated field

$$\begin{bmatrix}E_{co}\\ E_{cr}\end{bmatrix}(u,v) = Q(u,v)\sum_{m=1}^{M} S_m(u,v)\,\underline{E}_f^{\,m}\, e^{jk(ux_m + vy_m + wz_m)}$$

is evaluated by the Fast Matrix-Vector Product routine by AccelerEyes.

Machine: Genesis Tesla I-7950 workstation, 6 GB of RAM
CPU: Intel Core i7-950 (4 cores, 8 threads), 3.06 GHz
GPU: Nvidia Tesla C2050 (448 cores), 1.15 GHz, 2.8 GB RAM

Speedup ≈ 8 (fast evaluation of the radiated field, accurate model).

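Gradient evaluation

Phase-Only Synthesis case (co-polar term; the cross-polar term is analogous):

$$\frac{\partial \Phi}{\partial \phi_m} = -4\,\mathrm{Im}\left\{ w_m\frac{e^{-jkr_m}}{r_m}\,e^{j\phi_m}\sum_h \Big[|E_{co}|^2 - P_{CO}\big(|E_{co}|^2\big)\Big]_h E_{co,h}^{*}\, C_h\, e^{jk(u_hx_m+v_hy_m+w_hz_m)}\right\}, \qquad C_h = \Big[\frac{e^{-jkr}}{r}\,Q(u_h,v_h)\,S_0(u_h,v_h)\,\underline{e}_f\Big]_{co}$$

The sum over h has the same structure as the radiated field, so it is calculated by the P-series + NUFFT approach.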

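Accurate synthesis case (co-polar term, control parameter pl of the l-th element):

$$\frac{\partial \Phi}{\partial p_l} = 4\,\mathrm{Re}\left\langle \mathcal{A}_{CO} - P_{CO}(\mathcal{A}_{CO}),\; E_{CO}^{*}\,\frac{\partial E_{CO}}{\partial p_l}\right\rangle_Y, \qquad \frac{\partial E_{CO}}{\partial p_l}(u,v) = \Big[Q(u,v)\,\frac{\partial S_l}{\partial p_l}(u,v)\,\underline{E}_f^{\,l}\Big]_{CO} e^{jk(ux_l+vy_l)}$$

The scalar product is evaluated by the Fast Matrix-Vector Product routine by AccelerEyes.

The acceleration of the gradient follows the same computation scheme as the radiated field for both synthesis cases.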
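A finite-difference check of the phase-only gradient expression on a toy problem, sketched in C++. It assumes a single (co-polar) component, unit weights on the observation points, the clamp-type projector between the masks, and lumps the common factor C_h and the feed terms into fixed element weights b_m; `PhaseOnlyProblem` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

// E_h = sum_m b_m exp(j(phi_m + psi_hm)), A_h = |E_h|^2, Phi = sum_h (A_h - clamp(A_h))^2.
struct PhaseOnlyProblem {
  std::vector<cd> b;                      // fixed element weights (feed, distance, C_h)
  std::vector<std::vector<double>> psi;   // psi[h][m]: spatial phase of element m at point h
  std::vector<double> mask_low, mask_up;  // lower/upper masks on A_h

  std::vector<cd> field(const std::vector<double>& phi) const {
    std::vector<cd> E(psi.size());
    for (std::size_t h = 0; h < psi.size(); ++h)
      for (std::size_t m = 0; m < b.size(); ++m) E[h] += b[m] * std::polar(1.0, phi[m] + psi[h][m]);
    return E;
  }
  // A_h - P(A_h): zero inside the masks.
  double residual(double A, std::size_t h) const {
    return A - std::min(std::max(A, mask_low[h]), mask_up[h]);
  }
  double objective(const std::vector<double>& phi) const {
    const std::vector<cd> E = field(phi);
    double sum = 0.0;
    for (std::size_t h = 0; h < E.size(); ++h) {
      const double r = residual(std::norm(E[h]), h);
      sum += r * r;
    }
    return sum;
  }
  // dPhi/dphi_m = -4 Im{ b_m e^{j phi_m} sum_h (A_h - P(A_h)) conj(E_h) e^{j psi_hm} }
  std::vector<double> gradient(const std::vector<double>& phi) const {
    const std::vector<cd> E = field(phi);
    std::vector<double> g(b.size());
    for (std::size_t m = 0; m < b.size(); ++m) {
      cd acc = 0.0;
      for (std::size_t h = 0; h < E.size(); ++h)
        acc += residual(std::norm(E[h]), h) * std::conj(E[h]) * std::polar(1.0, psi[h][m]);
      g[m] = -4.0 * std::imag(b[m] * std::polar(1.0, phi[m]) * acc);
    }
    return g;
  }
};
```

The inner sum over h is the part that, in the real problem, has the structure of a (non-uniform) Fourier transform and is therefore accelerated like the field itself.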

Conclusions

The acceleration of the reflectarray synthesis by
• P-Series,
• Non-Uniform FFT (NUFFT),
• Optimized Matrix Vector Multiplication (OMVM),
together with the implementation on GPUs (tested on the Fermi and Kepler architectures), makes the reflectarray synthesis feasible in reasonable time.

The POS synthesis stage (CUDA implementation) takes about 3-4 hours for a 44x44 reflectarray.

The accurate synthesis stage (Jacket implementation) takes about 3-4 days for a 44x44 reflectarray.

NUFFT algorithms are of interest in many other application fields. For electromagnetic applications, we have successfully employed the described NUFFTs in:
• near-field antenna characterization;
• Synthetic Aperture Radar fast processing.
