
A PARALLEL MODEL FOR THE

HETEROGENEOUS COMPUTATION OF

RADIO ASTRONOMY SIGNAL CORRELATION

by

Christopher John Harris

B.Sc.(Hons)

This thesis is presented for the degree of

Doctor of Philosophy

at the School of Physics

July 2009

© 2009

Christopher J. Harris

Abstract

The computational requirements of scientific research are constantly growing. In the field of radio astronomy, observations have evolved from using single telescopes, to interferometer arrays of many telescopes, and arrays of massive scale are currently under development. These interferometers use signal and image processing to produce data that is useful to radio astronomy, and the amount of processing required scales quadratically with the size of the array. Traditional computational approaches will be unable to meet this demand in the near future.

This thesis explores the use of heterogeneous parallel processing to meet the computational demands of radio astronomy. In heterogeneous computing, multiple hardware architectures are used for processing. In this work, the Graphics Processing Unit (GPU) is used as a co-processor along with the Central Processing Unit (CPU) for the computation of signal processing algorithms. Specifically, the suitability of the GPU to accelerate the correlator algorithms used in radio astronomy is investigated.

This work first implemented an FX correlator on the GPU, with a performance increase of one to two orders of magnitude over a serial CPU approach. The FX correlator algorithm combines pairs of telescope signals in the Fourier domain. Given N telescope signals from the interferometer array, N² conjugate multiplications must be calculated in the algorithm. For extremely large arrays (N >> 30), this is a huge computational requirement. Testing shows that the GPU correlator produces results equivalent to those of a software correlator implemented on the CPU. However, the algorithm itself is adapted in order to take advantage of the processing power of the GPU. This research examined how correlator parameters, in particular the number of telescope signals and the Fast Fourier Transform (FFT) length, affected the results.

The conjugate multiply and accumulate (CMAC) stage of the correlator requires computation that increases quadratically with the number of telescope signals in the interferometer array. Because the other stages of the correlator scale linearly, this stage becomes the bottleneck for large radio telescope arrays. This work investigates a number of potential parallel approaches in order to determine which performs best. This research shows that, of those approaches, two are superior.

An important consideration in the design of radio telescope infrastructure is the ongoing power usage of compute systems. The increasing processing requirements are causing the cost of electricity to be an important budgeting concern. Thus power-efficient compute architectures are now desired for scientific research. This work has investigated the power usage of both the GPU and CPU. The processing performance per watt of the parallel correlator implementation is shown to be higher than that of the serial implementation by up to a factor of 30.

Finally, the addition of a parallel polyphase filter front end demonstrates the adaptability of the GPU correlator implementation. This filter is commonly used in signal processing to efficiently reduce the effect of spectral leakage in the Fourier transform. However, it comes at the cost of additional processing and memory access within the algorithm. The implementation on the graphics processing unit supports 1, 2, 4 and 8 filter taps, and a filter length of 128. The number of taps is the number of signal phases combined in the decimation part of the polyphase filter. This research shows that, as the number of taps increases, the increase in processing time for the filter stage is a quarter of what direct scaling with the additional computation would give.


Acknowledgements

I acknowledge the efforts of my supervisors Karen Haines, Lister Staveley-Smith

and David Blair. I have appreciated Karen’s friendship, motivation, and excellent

advice since I first walked into her office at the beginning of my honours year. I

thank her for consistently getting me to a plethora of conferences directly related to

my work around the world, introducing me to the leading people in my field, and

always providing me with the most cutting edge hardware. Lister’s knowledge and

experience of the radio astronomy field is of great benefit to my research. I have

valued his insight into the possible directions for my work. I am grateful for David

offering his supervision when I began in the Physics department. David’s drive and

passion for research is an inspiration to his students.

In the field of radio astronomy, I thank all who have assisted my learning

over the past four years. This includes Chris Phillips for his time spent testing

live data streaming of VLBI data on to the GPU; Steve Tingay and Adam Deller

for introducing me to software correlation; Katherine Blundell and Ben Mort for

hosting me at Oxford; Peter Quinn for his advice; Frank Briggs for his Fortran

correlator code which served as a standard for my own implementations both serial

and parallel; and Randall Wayth for his insightful discussions.


In the field of graphics hardware, I thank Mark Harris of NVIDIA and Justin

Hensley of AMD for sharing their knowledge of the graphics products of their respective

companies. I am grateful to Ed Buckingham for hosting me on a day trip

to visit the GPU compute group at AMD.

I thank all the staff at the Western Australian Supercomputer Program. Jason,

Akos, and Khanh have been stalwart in maintaining a high standard of computing

support. Paul’s insight in visualisation has been of great help. I appreciate the work

Rosie and Anna-Lee have put in organising conference visits.

I appreciate the support and patience of my family and friends. In particular, I

thank my parents for encouraging my interests in science and technology. Finally, I

thank my teachers and mentors across all fields of endeavour.

Contents

Abstract
Acknowledgements
List of Figures
List of Variables
1 Introduction
2 Background
  2.1 Radio Astronomy
    2.1.1 Radio Spectrometry
    2.1.2 Radio Interferometry
    2.1.3 Aperture Synthesis
  2.2 Signal Correlation
    2.2.1 Digital FX Correlator
  2.3 Polyphase Filter Techniques
  2.4 Parallel Processing
  2.5 Programmable Graphics
    2.5.1 Development of the Programmable GPU
    2.5.2 Compute Unified Device Architecture (CUDA)
3 Literature Review
  3.1 GPU Programming Languages
  3.2 Fast Fourier Transform
  3.3 GPUs in Astronomy and Astrophysics
4 Model
  4.1 CMAC Stage Optimisation
  4.2 Memory Management
  4.3 Polyphase Filter
5 Testing
  5.1 Preliminary Testing
  5.2 CMAC Stage Testing
  5.3 GPU Correlator Results
  5.4 Polyphase Filter Testing
6 Discussion
  6.1 Preliminary Analysis
  6.2 Optimisation Analysis
  6.3 GPU FX Correlator Analysis
  6.4 Power and Cost Analysis
  6.5 Adaptability Analysis
7 Conclusion
  7.1 Thesis Summary
  7.2 Future Research
References
A Code
  A.1 Unpack Stage Kernel
  A.2 CMAC Stage Kernels
  A.3 Polyphase Filter Kernel

List of Figures

2.1 Jansky aerial system
2.2 Parkes radio telescope
2.3 Angular resolution of a telescope
2.4 Resolution of point sources
2.5 A two-element interferometer
2.6 Correlator architectures
2.7 Serial FX correlator algorithm
2.8 Fourier transform spectral leakage
  (a) Aligned frequency signal
  (b) Non-aligned frequency signal
2.9 Fourier transform spectral response
2.10 Polyphase Fourier transform response
2.11 Rectangular polyphase filter
2.12 Polyphase sinc filter
  (a) Aligned frequency signal
  (b) Non-aligned frequency signal
2.13 Algorithmic structure tree
2.14 Flynn’s taxonomy
2.15 Parallel performance
  (a) Amdahl’s law
  (b) Gustafson’s law
2.16 Programmable rendering pipeline
2.17 Development of the GPU pipeline
2.18 A model of the graphics processing unit
2.19 Evolution of the GPU
  (a) Processor performance
  (b) Memory bandwidth
2.20 Thread topology
2.21 Memory locations available to CUDA threads
2.22 GPU-enabled system architecture
4.1 GPU FX correlator pipeline
4.2 Parallelism of the approaches
  (a) Frequency parallel approach
  (b) Stream parallel approach
  (c) Group parallel approach
  (d) Pair parallel approach
4.3 GPU FX correlator data flow
5.1 Test data
5.2 Bandwidth testing
5.3 PCI-express data transfer rates
5.4 Fast Fourier transform testing
5.5 GPU fast Fourier transform
5.6 CMAC stage testing
5.7 CMAC stage results for a varying number of signals
  (a) High L = 1024, varying N
  (b) Low L = 128, varying N
5.8 CMAC stage results for different transform lengths
  (a) High N = 64, varying L
  (b) Low N = 4, varying L
5.9 GPU correlator testing
5.10 Test output
5.11 Overview of stream bandwidth
5.12 The variation of stream bandwidth with N
5.13 The variation of stream bandwidth with L
5.14 Total data throughput
5.15 Correlator FLOPS
5.16 Performance per watt
  (a) L = 128 Fourier transforms
  (b) L = 1024 Fourier transforms
5.17 Polyphase filter testing
5.18 Polyphase filter performance
  (a) Performance by stream
  (b) Performance by tap

List of Variables

Units are given in square brackets. Variables with no units are dimensionless.

a accumulation index within A

A size of accumulation

b longest array baseline [m]

b[k] polyphase filter output [V]

c speed of light [m/s]

C[ν] complex visibility [W s]

d difference in path length [m]

D telescope dish diameter [m]

δ(ω) Dirac delta function acting on ω

E(t) induced electromagnetic potential [V]

ε computational complexity per stream element

f frequency index within F

f(t) a time varying function

G size of stream groups

j iterator index

k discrete time [s]

K number of stream groups

l length index within L

L length of fast Fourier transform (FFT)


L_C length of complex to complex FFT

L_R length of real to complex FFT

λ wavelength [m]

n index of telescope signal

m index of second telescope signal

µ substitution for 2π(ν − ν′)/L

N number of signals from the telescope array

N number of units of execution in a parallel application

ν discrete frequency [Hz]

ν′ a particular discrete frequency [Hz]

ω continuous frequency [Hz]

p tap index within P

p parallel instruction proportion by Amdahl’s law

p′ parallel instruction proportion by Gustafson’s law

p_auto number of autocorrelations

p_total total number of correlations

p_cross number of crosscorrelations

φ angle of signal [rad]

Φ(ν) power spectral density [W s]

r performance factor in Amdahl’s law

r′ performance factor in Gustafson’s law

r[k] quantised, discrete-time sampling of E(t) [V]

ρ[l] polyphase filter weight

s serial instruction proportion by Amdahl’s law

s′ serial instruction proportion by Gustafson’s law

S(ω) continuous signal spectra [V s]

S(ν) discrete signal spectra [V s]


t continuous time [s]

T number of taps in the polyphase filter

τ sampling period [s]

θ angular resolution [rad]

u_S unpack scaling factor

u_B unpack bias compensation

U unpack operator

x(t) time-varying telescope signal [V]

x[k] unpacked signal stream [V]

y(τ) spatially correlated signals [V²]


Chapter 1

Introduction

Computation is now essential to scientific endeavours. Where previously science

has analysed models, or simple systems of models, now large complex systems are of

interest. This is true across a broad variety of fields. In the field of climate modelling,

simulations once dealt with a localised area and single aspects such as air pressure

or ocean currents. Now, they strive for a complex system accounting for all such

aspects on a global scale [11]. In the field of neural networking, computation has

progressed from simulations of single neurons, to simple systems of neurons [37], and

is now progressing to complex systems on the scale of mammalian brains [54]. In the

field of radio astronomy, observations have evolved from using single telescopes, to

interferometer arrays consisting of many telescopes, and are now looking to develop

an array of massive scale [38]. These emerging scientific endeavours exceed

the capabilities of traditional compute methods.

The growth of single-core central processing unit (CPU) technology has become limited by a power wall [53]. The power wall limits serial CPU processing because higher processing speeds would require additional power consumption, which increases the heat generation beyond what can physically be dissipated. As a result, software can no longer rely on improving processor speeds to improve performance. To overcome these problems, parallel architectures have been developed. These architectures use many processing cores to obtain more processing performance than a single core. However, serial code must be rewritten in parallel to take advantage of these performance gains.

There is a wide range of parallel architectures currently available, and the number of processing cores varies. Serial single-core CPUs have been replaced by multicore CPUs, which contain several processing cores. Aside from the multicore CPU, other parallel architectures exist in the form of a coprocessor. A coprocessor is a second processor that assists the primary processor, in order to improve computational performance. The Cell is one such architecture, containing one main core called a Power Processing Element (PPE) and eight assisting cores called Synergistic Processing Elements (SPE) [81]. The GPU is another example, located at the other end of the parallel architecture spectrum with hundreds of processing cores.

There are two significant advantages to hardware-accelerated parallel approaches. Firstly, the additional processing performance will facilitate the growth to petascale compute systems that will be required for future science. The improvements in computational performance span orders of magnitude over legacy serial architectures, effectively enabling the implementation of algorithms in real-time that would be otherwise unattainable due to their computational cost. Secondly, the parallel approaches are more cost effective because they use processor resources more efficiently, resulting in both lower initial purchasing costs and lower ongoing power requirements for a desired level of performance. Low power consumption is also relevant in minimising the environmental impact of research. Both of these advantages are essential to the development of affordable large scale compute systems that will ultimately be required for future science.

However, scientific algorithms must be parallelised in order to realise the performance of these new architectures. This research explores how heterogeneous parallel computing can be applied to radio astronomy signal correlators. Heterogeneous computing is the strategy of deploying multiple types of processing elements within a single workflow, allowing each to perform the tasks to which it is best suited [94]. This work utilises GPU computing techniques, in which the CPU and GPU are used together to perform real-time signal correlation on a significantly larger scale than that achievable by the CPU alone.

To observe the radio universe, signals from multiple telescopes can be correlated

to form an interferometer array. Data collected from the telescopes is used to obtain

images with an angular resolution, the ability to resolve distant objects, greater

than would be achievable with a single dish. In order to achieve superior images,

additional array elements are required to increase the collecting area and to provide

more unique viewpoints on the sky. However, increasing the size of the array also

increases the amount of computation necessary to correlate the signals. Given the

size of next generation telescope interferometers such as the Square Kilometre Array,

which consist of hundreds of telescopes, this computation is on a massive exaflop

scale [18].

Of the correlator stages, the conjugate multiply and accumulate (CMAC) stage

is a computational bottleneck. It scales quadratically with the number of elements in

a radio telescope array. Thus, it is this part of the algorithm that is immediately of

interest for optimisation. This work demonstrates that of several potential parallel

approaches, two are superior depending on the size of the telescope array. One

is suited to small arrays, and a second to large arrays. For further performance


improvement, the acceleration of the other stages of the algorithm is also addressed.
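To illustrate the quadratic scaling described above, the following minimal Python sketch counts the correlations required for an array of N signals. It assumes that each signal is correlated once with itself and once with every other signal; the function name `correlation_counts` is hypothetical and is not taken from the implementation in this work.

```python
def correlation_counts(n_signals):
    """Return (auto, cross, total) correlation counts for n_signals inputs.

    Assumption: one autocorrelation per signal and one cross-correlation
    per unordered pair of distinct signals.
    """
    auto = n_signals
    cross = n_signals * (n_signals - 1) // 2
    return auto, cross, auto + cross


if __name__ == "__main__":
    for n in (4, 8, 16, 32, 64):
        auto, cross, total = correlation_counts(n)
        print(f"N={n:3d}  auto={auto:3d}  cross={cross:5d}  total={total:5d}")
```

Under this assumption, doubling N from 32 to 64 raises the total from 528 to 2080, close to a factor of four, whereas a stage that scales linearly would only double.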

Using the GPU as a co-processor, a digital radio astronomy signal correlator

is developed and tested. The model defines both the transfer of data between the

CPU host and GPU device, as well as the parallelisation of the correlator stages on

the GPU. This research demonstrates that this approach increases computational

performance by one to two orders of magnitude compared to a serial CPU approach.

Additional features are required to allow the GPU correlator to be used in a

wider range of radio astronomy applications. In order to investigate the reduction

of systematic noise known as spectral leakage or ringing [10], a polyphase filter is

implemented on the GPU. Testing examined the filter performance on the GPU for

a variety of filter lengths. It shall be shown that the additional operations for the

filter implementation are partially hidden by existing memory latency in the GPU

correlator. The addition of this feature to the parallel correlator demonstrates the

adaptability of the model.

Through the implementation and analysis of the GPU digital radio astronomy signal correlator, this work will show that GPU computing is well suited to accelerating digital radio astronomy correlation and related polyphase filtering techniques. This is significant because the next generation of radio telescopes, such as the Square Kilometre Array, will require correlator performance on a massive scale. The factor of a hundred performance improvement shown in this work makes significant progress towards meeting this scale.

Chapter 2 follows, covering the background to the field of radio astronomy, with an emphasis on the digital correlation algorithm. It also provides a background to the use of the GPU for scientific computation. This is followed by a literature review in Chapter 3, which discusses research relevant to that presented in this work. In Chapter 4, my heterogeneous computing model for the implementation of correlation algorithms on the GPU is presented. This model was implemented and tested, and the results are presented in Chapter 5. An analysis of these results follows in Chapter 6. Chapter 7 concludes this work, and presents potential areas for further research. The accompanying appendices contain code samples from my implementation.

Chapter 2

Background

The work presented in this thesis is multidisciplinary. It involves concepts from several different fields of research, including radio astronomy, signal processing, parallel computing, and computer graphics. To make this research accessible to a broad audience, this chapter introduces the key concepts of these fields. This begins with an introduction to radio astronomy.

2.1 Radio Astronomy

In 1931, while working to reduce radio frequency interference in telecommunications, Karl Jansky discovered radio signals emanating from beyond the Earth [43]. These signals originated from our galaxy, the Milky Way, and this observation by Jansky heralded a new field of astronomical research called radio astronomy. Up until Jansky’s discovery, astronomy had been limited for the most part to observation through the optical window, a range of frequencies in the visible electromagnetic spectrum that can penetrate the Earth’s atmosphere from space. A second window through the atmosphere exists in the radio frequencies of the electromagnetic spectrum. This radio window is bounded by absorption due to water vapour and oxygen molecules at higher frequencies around 1.5 THz, and by absorption and reflection due to the ionosphere at lower frequencies near 15 MHz [86]. The boundaries of the radio window vary with time, geographical location, and the sensitivity of the observing radio telescope.

The observation of radio waves from the universe beyond the Earth is of great

benefit to astronomy. This is because the radio window offers a different view of

the universe. Electromagnetic radiation in the radio frequencies is generated by

different physical processes, and interacts with matter differently to that of the

optical frequencies [10].

This view of the radio universe has led to the discovery of a number of new physical phenomena. The cosmic microwave background radiation was discovered in 1964 by Penzias and Wilson. This radiation is almost uniform in every direction, and is not associated with any particular celestial bodies [79]. The radiation, and the pattern of variations it contains, supports the big bang model for the creation of the universe. Another radio phenomenon, discovered by Bell Burnell and Hewish in 1967, is the pulsar. These rotating neutron stars emit beams of radiation in the radio spectrum, in a manner similar to that of a lighthouse [41]. Only radiation corresponding to when the pulsar beam is pointed toward the observer is received, resulting in a regular pattern of pulses. Radio galaxies are another phenomenon revealed by observation in the radio spectrum [50]. These observations have not only revealed additional structural information regarding known optical galaxies, but have also enabled the detection of galaxies that have no optical counterpart.


These discoveries have shown that radio astronomy is a significant branch of astronomy, which studies the universe using radio techniques [97]. This is typically achieved through the use of either a radio aerial or radio telescope. The Jansky aerial system, shown in Figure 2.1, is an example of a radio aerial. Electromagnetic radio waves passing through the aerial induce an electric potential that is measured by receiver equipment. The Parkes telescope, shown in Figure 2.2, is an example of a single dish radio telescope. The dish reflects incoming electromagnetic radio waves to a receiving aerial located at the focus. This increases the amount of electromagnetic flux passing through the aerial, and thus amplifies the observed signal. As a result, the telescope can detect weaker signals than the aerial alone. The science of radio spectrometry measures and quantifies these radio signals, as discussed in the following section.


Figure 2.1: Jansky aerial system. Shown is an example of a radio aerial. Observed electromagnetic radio waves induce an electric potential in the aerial that is measured by receiver equipment. This particular aerial was used by Jansky in his discovery of the first radio signals emanating from beyond the Earth. Image courtesy of NRAO/AUI [73, 74].


Figure 2.2: Parkes radio telescope. Shown is an example of a radio telescope dish. The dish reflects incoming electromagnetic radio waves to a receiving element located at the focus. This 64 m telescope is located in New South Wales, Australia. Image courtesy of CSIRO [19, 20].


2.1.1 Radio Spectrometry

A radio spectrometer is designed to measure the power spectral density of a radio

signal [86]. Power spectral density is the distribution of power over the frequency, ω,

of a signal detected by the antenna and receiver, denoted Φ(ω). The spectral energy

distribution of a radio source is of interest in radio astronomy because it provides

insights into the physical processes that produce the radio emissions in the source.

It also allows more sophisticated imaging techniques.

The initial approach to radio spectrometry was the swept frequency receiver, also

called a radio spectrograph [108]. In operation, this analog receiver is repeatedly

tuned across the same wide band of frequencies by a mechanical or electronic device

that generally produces several frequency sweeps per second. The speed of these

sweeps must be fast enough to reveal variation in the intensity of the signal between

sweeps [49]. Recording across an entire band of frequencies produces the power

spectral density [7]. The advantage of the swept frequency receiver is that it provides

a very high frequency resolution due to the sweeping mechanism producing output

for a continuum of frequencies. However, in order to achieve this, the antenna must

have a response that is nearly constant over this frequency range [96]. The main

drawback to this approach is that such scanning prevents significant integration

that would increase the signal to noise ratio. Additionally, only a single frequency is

observed at any given time, and thus signal features in the unobserved frequencies

are lost.

The solution to these drawbacks is the multichannel receiver. In this approach the entire band is analysed at the same time [10]. This is achieved by having multiple instances of the previous approach, each of which monitors a set frequency channel in the observation bandwidth [65]. Thus the multichannel receiver has a lower frequency resolution than the swept frequency receiver. However, it has a much higher sensitivity, as it does not require wide band aerials with even amplification over the entire frequency range [108].

The next evolution of radio spectrometry was the autocorrelation spectrometer. Consider the incoming time-varying electromagnetic signal E(t). The spectrum of this signal, S(ω), can be obtained via the Fourier transform:

$$S(\omega) = \int_{-\infty}^{\infty} E(t)\, e^{-i 2 \pi \omega t}\, dt \tag{2.1}$$

The spectrum is then multiplied by its complex conjugate to obtain the power spectral density of the signal, Φ(ω):

$$\Phi(\omega) = S(\omega)\, S^{*}(\omega) \tag{2.2}$$

This approach surpasses the limitations of the previous approaches. In theory, it

obtains the frequency continuum results of the swept frequency receiver, except the

entire spectrum is obtained simultaneously by the autocorrelation spectrometer.

However, in practice this approach can only be implemented by applying a discrete Fourier transform of length L to a digital sampling of the signal, E[k]:

$$S[\nu] = \sum_{k=0}^{L-1} E[k]\, e^{-i 2 \pi \nu k} \tag{2.3}$$

The power spectral density, Φ[ν], is obtained for the discrete range of frequencies, ν, using the equation:

$$\Phi[\nu] = S[\nu]\, S^{*}[\nu] \tag{2.4}$$
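Equations 2.3 and 2.4 can be sketched in a few lines of Python. This is an illustrative example only, using the standard library and a direct summation rather than an FFT. It assumes the common index convention in which ν and k are integer bin and sample indices, so that the exponent is written as −i2πνk/L; the function names `dft` and `power_spectrum` and the test tone are hypothetical.

```python
import cmath
import math


def dft(samples):
    """Direct discrete Fourier transform of a sequence of samples.

    Assumption: bin index nu and sample index k are integers, so the
    exponent is written as -i*2*pi*nu*k/L (the usual index convention).
    """
    length = len(samples)
    return [
        sum(samples[k] * cmath.exp(-2j * math.pi * nu * k / length)
            for k in range(length))
        for nu in range(length)
    ]


def power_spectrum(samples):
    """Power spectral density estimate: S[nu] times its complex conjugate."""
    return [(s * s.conjugate()).real for s in dft(samples)]


if __name__ == "__main__":
    L = 16
    # Hypothetical test signal: a cosine aligned with frequency bin 3.
    signal = [math.cos(2 * math.pi * 3 * k / L) for k in range(L)]
    for nu, power in enumerate(power_spectrum(signal)):
        print(f"bin {nu:2d}: {power:8.3f}")
```

For a tone aligned with a frequency bin, the power concentrates in that bin and its mirror image. A real-time system would replace the direct summation, which costs on the order of L² operations, with an FFT.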

Like all digital signal processing techniques, this approach is limited by the rate and

precision of the digital sampling of the signals. Assuming the required sampling rates


and precision can be physically achieved, the limitation for real-time observations

becomes the rate at which the digital hardware can process the signals. All of

these methods are also limited by the specification of the telescope used for the

observation. This can be improved by the use of multiple telescopes, which is next

discussed.
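As a rough illustration of the digital autocorrelation spectrometer, the following sketch computes a discrete Fourier transform of a sampled signal and multiplies each channel by its complex conjugate to obtain the power spectral density. The function names are illustrative, the transform is a direct O(L^2) sum rather than an FFT, and the frequency index is assumed to count cycles per L samples (so the exponent is divided by L).

```python
import cmath
import math

def dft(samples):
    """Direct O(L^2) discrete Fourier transform (illustrative, not an FFT)."""
    L = len(samples)
    return [sum(samples[k] * cmath.exp(-2j * math.pi * nu * k / L)
                for k in range(L))
            for nu in range(L)]

def power_spectral_density(samples):
    """Phi[nu] = S[nu] * conj(S[nu]), returned as real values."""
    return [(s * s.conjugate()).real for s in dft(samples)]

# Assumed test signal: a real cosine completing 3 cycles over 16 samples.
L = 16
signal = [math.cos(2 * math.pi * 3 * k / L) for k in range(L)]
psd = power_spectral_density(signal)

# Channel with the most power among the non-negative frequencies.
peak = max(range(L // 2 + 1), key=lambda nu: psd[nu])
```

For this assumed input the power concentrates in the channel matching the cosine frequency, which is the behaviour the spectrometer relies on.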

2.1.2 Radio Interferometry

Radio interferometry is the use of multiple telescopes to make observations with

a superior angular resolution compared to those made by a single telescope. The

angular resolution of a receiving instrument in radio astronomy is a measure of its

ability to separate two neighbouring sources, such as those shown in Figure 2.3. This

is desirable because superior angular resolutions reveal more detail of the observed

source. When observing radio waves of wavelength λ with a single dish of diameter

D, the angular resolution, θ, of the telescope is given by the Rayleigh Criterion [86]:

\[ \theta = \frac{\lambda}{D} \tag{2.5} \]

As seen in Figure 2.4, two sources with an angular separation significantly smaller than θ are indistinguishable from a single source. Given the longer wavelengths of radio waves, the size

of a single dish required to resolve the detail can surpass what can realistically be

constructed. The current largest single-aperture radio telescope is Arecibo, located

on the island of Puerto Rico. It has a diameter of 305 metres [48], forming a limit

on the obtainable angular resolution with a single telescope.

Radio interferometry can be applied in order to address this restriction [60].

Specifically, using two or more telescopes in an array with a maximum separation

b, referred to as a baseline, results in the resolving power

\[ \theta = \frac{\lambda}{b} \tag{2.6} \]

in the direction of the baseline projected onto the source. This resolving power can

be extended by rotating the baseline, or by acquiring baselines of a different direction

via additional telescopes. Thus two or more smaller telescopes with a baseline, b,

can be equivalent in resolving power to a large single dish with a diameter D equal

to that baseline distance. The collecting area of the array is equal to the sum of the

collecting area of the component telescopes.
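To make the comparison between a single dish and a baseline concrete, the following sketch evaluates θ = λ/D and θ = λ/b. The observing wavelength of 0.21 m and the 3000 km baseline are assumed values chosen only for illustration; the 305 m diameter is the Arecibo figure quoted above.

```python
import math

def angular_resolution_rad(wavelength_m, aperture_m):
    """theta = lambda / D for a dish, or lambda / b for a baseline (radians)."""
    return wavelength_m / aperture_m

def rad_to_arcsec(theta_rad):
    """Convert radians to arcseconds."""
    return math.degrees(theta_rad) * 3600.0

wavelength = 0.21  # metres; assumed observing wavelength
single_dish = angular_resolution_rad(wavelength, 305.0)   # 305 m single dish
baseline = angular_resolution_rad(wavelength, 3000.0e3)   # hypothetical 3000 km baseline

print(rad_to_arcsec(single_dish), rad_to_arcsec(baseline))
```

Because the resolution scales inversely with the aperture or baseline length, the hypothetical baseline resolves angles finer than the single dish by the ratio of the two lengths.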

Shown in Figure 2.5 is a simple single-baseline interferometer. The incoming

signal travelling at the speed of light, c, is measured by both receivers. However, due to the angle of the signal, φ, there is a difference in path length, d, between the two receivers given by d = b sin(φ). This path difference results in a time lag, t_lag, defined by t_lag = d/c. This, along with any other lag in the signal processing system caused by cabling distances, is removed by the delay compensation shown in Figure 2.5.
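The geometric delay can be sketched numerically as follows. This is a minimal example under assumed values (a 1 km baseline, a 30 degree signal angle, and a 64 MHz sampling rate); the helper names are hypothetical, and rounding the delay to whole samples is only one simple form of delay compensation.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # metres per second

def geometric_delay(baseline_m, angle_rad):
    """Return (d, t_lag) with d = b sin(phi) and t_lag = d / c."""
    d = baseline_m * math.sin(angle_rad)
    return d, d / SPEED_OF_LIGHT

def delay_in_samples(t_lag_s, sample_rate_hz):
    """Nearest whole-sample delay for a given sampling rate (assumed scheme)."""
    return round(t_lag_s * sample_rate_hz)

# Assumed example values.
d, t_lag = geometric_delay(1000.0, math.radians(30.0))
n_samples = delay_in_samples(t_lag, 64.0e6)
```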

A local oscillator signal is mixed with the signal to perform a frequency con-

version of the observed bandwidth. In radio astronomy this frequency conversion

is used to lower the frequency at which further processing occurs. This is because

it is technically easier to process signals at a set lower frequency, than to design

specialised equipment for any given observation frequency range [87]. The phase

shifter adjusts the phase of the local oscillator to account for any delay between the

two incoming signals. The resulting stream is then fed into a correlator for further

processing.


These principles were employed in the first radio interferometer, constructed

by Ryle and Vonberg with results published in 1946 [90]. The interferometer was

designed to observe the sun at 175 MHz (1.7 m wavelength) and consisted of two

aerial systems with a horizontal separation of several wavelengths. This allowed

discrimination between the galactic background radiation and the signal from the

sun. The signals from the two aerials were combined and the sun produced an

oscillating signal as the minima and maxima of the interferometer moved across the

sun during the day.

To reduce the amount of noise, an early improvement to the radio interferometer

was the addition of phase switching by Ryle in 1952 [89]. This allowed a weak point

source to be recorded independently from an extended source of greater intensity. It

was also used to improve the accuracy in the determination of the source position,

as well as the measurement of angular diameter and polarisation of weak sources.

The system added a phase switcher to one of the antenna cables. When activated,

the phase switcher created an additional delay in the signal of that antenna, corre-

sponding to half a wavelength in the observed signal. Rapidly alternating the phase

switcher on and off caused a square wave component to be created in the signal.

An amplifier that responds to this square wave signal, as well as a phase sensitive

rectifier, was used to convert the square wave to a direct current [97]. In this way,

the noise generated in the receiver and interference from background radiation and

extended sources near the observed point source was greatly reduced [96].

In order to achieve finer angular resolution, radio telescopes situated thousands

of kilometres from each other were used for interferometry. This was referred to as

Very Long Baseline Interferometry (VLBI) [63]. The distance between the telescopes

had additional challenges over smaller arrays, as the exact location and timing of

each of the observations at the telescopes is needed to combine the signals and form

an image. This was resolved using accurate timing instruments and reference sources

in the sky.

While these interferometry techniques provide superior angular resolution, they

also introduce the complication of having multiple telescope signals. To produce

two-dimensional images of the observed radio source, the one-dimensional signals

from the telescopes of an interferometer must be combined. This process is called

aperture synthesis, and is next discussed.


[Figure 2.3 diagram: a telescope of dish diameter D observing resolved and unresolved pairs of sources.]

Figure 2.3: Angular resolution of a telescope. Shown are two sources with an angular separation of ψ. When observed on a wavelength of λ, the angular resolution of a telescope with dish diameter D is θ = λ/D. When the angular resolution is finer than the angular separation of the sources, θ < ψ, the telescope can resolve the two objects as separate sources.

[Figure 2.4 plot: relative amplitude against angular separation for distant, medium, and close source pairs, ranging from resolved to unresolved.]

Figure 2.4: Resolution of point sources. Shown is a series of two point sources, sequentially moved closer together. As the angular distance, ψ, between any two sources decreases, they appear to overlap. Should two sources be sufficiently close such that ψ < θ, they are indistinguishable from a single source.


[Figure 2.5 diagram: block labels are Mixer (two), Local Oscillator, Phase Shifter, Delay Compensation (two), and Correlator; geometric labels are the baseline b, the path difference d, and the signal axis.]

Figure 2.5: A two-element interferometer. Shown is a simple interferometry system [87]. Two receivers are used in order to increase the angular resolution of the observation. The signal frequency is lowered via mixing with a local oscillator signal of the correct phase. The signals are offset by a lag created by the physical path difference, d, between the two signals. There could potentially exist additional electronic delays such as those arising from differing cable lengths. These delays are removed via the delay compensation shown. A correlator is then used to combine the signals.

2.1.3 Aperture Synthesis

Aperture synthesis is the process used by radio interferometer arrays to obtain two-

dimensional images. The correlator forms the first stage of processing. In this stage,

signals from the individual telescopes are unpacked, transformed to the frequency

domain, and then conjugate multiplied to form complex visibilities for each non-

redundant pairing of signals [87]. The complex visibilities are a correlation of the

two signals in each pair, expressed in the frequency domain. The resulting complex

visibilities are next calibrated.

Calibration removes effects of instrumental and atmospheric factors in the measurements [99]. With correct calibration, an isolated point source produces an ideal point spread function in the image space. This is the radio equivalent to

focusing an optical telescope [10]. Calibration values can be precalculated and then

repeatedly applied to a series of complex visibilities. The resulting output is called

a calibrated visibility.

The next stage of the image synthesis pipeline is to convert the calibrated vis-

ibilities to the spatial domain. The calibrated visibilities are first converted into a

two-dimensional grid. This process is referred to as gridding, and typically involves

the use of interpolation functions. Gridding is necessary because a two-dimensional

grid is required for the inverse Fast Fourier Transform (FFT) that is subsequently

used to convert to the spatial domain. The resulting transformed image still con-

tains artifacts produced by the synthesis algorithm, and is thus referred to as a dirty

image.

Following aperture synthesis, the artifacts in the dirty image are reduced via

deconvolution techniques. The two most common of these are the CLEAN algo-


rithm [42] and the maximum entropy method (MEM) [2]. The CLEAN method

iteratively removes point sources and their associated side lobe noise, and then re-

turns the sources to the image without the subtracted noise. The maximum entropy

method constructs a function to define the lack of information in an observation,

and then selects the outcome that corresponds to the maximum of this function.

The output of these noise reduction techniques is referred to as a clean image. Of

the stages used in aperture synthesis, the focus of this research is digital signal

correlation, which is next discussed.

2.2 Signal Correlation

A correlator combines the N signals of a radio telescope array to produce complex visibility spectra. These spectra are produced for each of the p_cross baseline pairs, given by

\[ p_{\mathrm{cross}} = \frac{N(N-1)}{2} \tag{2.7} \]

as well as each of the p_auto = N autocorrelation pairs in which signals are correlated with themselves. Thus the total number of output spectra p_total is given by

\begin{align}
p_{\mathrm{total}} &= p_{\mathrm{cross}} + p_{\mathrm{auto}} \tag{2.8} \\
                   &= \frac{N(N-1)}{2} + N \tag{2.9} \\
                   &= \frac{N(N+1)}{2} \tag{2.10}
\end{align}
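The pair counts can be checked by enumerating the pairings directly. The sketch below is illustrative only; the function names are hypothetical, and the ordering of the pairs (m ≤ n) is an assumed convention rather than one stated in the text.

```python
from itertools import combinations_with_replacement

def pair_counts(n_signals):
    """Return (p_cross, p_auto, p_total) for an array of n_signals inputs."""
    p_cross = n_signals * (n_signals - 1) // 2
    p_auto = n_signals
    return p_cross, p_auto, p_cross + p_auto

def enumerate_pairs(n_signals):
    """All non-redundant (m, n) pairs with m <= n, autocorrelations included."""
    return list(combinations_with_replacement(range(n_signals), 2))

counts = pair_counts(4)
pairs = enumerate_pairs(4)
```

The enumeration grows as N(N + 1)/2, which is the quadratic scaling that motivates the rest of this work.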

There are two processes required to obtain the pth complex visibility spectrum, C_p(ω), from the N input signals, x(t): the correlation of each pair of signals, denoted X, and the conversion to the frequency domain, denoted F. As shown in Figure 2.6, there are two types of correlator: the XF correlator and the FX correlator. The letters indicate the order of the correlation and the frequency transform.

The XF correlator first calculates the temporal correlation function, y_p(τ), using

\[ y_p(\tau) = \int_{-\infty}^{\infty} x_m(t)\, x_n^{*}(t - \tau)\, dt \tag{2.11} \]

for the p ∈ [0, p_total − 1] corresponding to the pairing of signals m, n ∈ [0, N − 1]. The result is then transformed to the frequency domain to obtain the complex visibilities with

\[ C_p(\omega) = \int_{-\infty}^{\infty} y_p(\tau)\, e^{-i 2\pi \omega \tau}\, d\tau \tag{2.12} \]

This order of processing is seen in the lower left path in Figure 2.6. This type of correlator is also called a lag correlator, because it multiplies the two signals at a series of incremental offsets referred to as lags.

In contrast, the FX correlator first transforms the signals into the frequency

domain to obtain the signal spectra using

\[ S_n(\omega) = \int_{-\infty}^{\infty} x_n(t)\, e^{-i 2\pi \omega t}\, dt \tag{2.13} \]

then performs the correlation via conjugate multiplication in the frequency domain [13, 104, 107, 12] using

\[ C_p(\omega) = S_m(\omega)\, S_n^{*}(\omega) \tag{2.14} \]

This order of processing is seen in the upper right path in Figure 2.6.

Note that because conjugate multiplication in the frequency domain is equivalent to correlation in the time domain, these two methods produce equivalent results. However, the FX correlator requires fewer operations than the XF correlator when operating on more than a couple of streams [87]. This is because the XF correlator requires N² FFTs compared to the N needed by the FX correlator. Due to the increasing trend in radio telescope array size, and the challenging scale of computation required for larger arrays, the FX correlator is the focus of this research and is detailed next.
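The equivalence of the two orderings can be checked numerically on a small example. The sketch below is a discrete analogue under stated assumptions: the signals are finite blocks, the lags are taken circularly (modulo the block length), and a direct O(L^2) transform stands in for the FFT. The function names and test signals are illustrative.

```python
import cmath
import math

def dft(x):
    """Direct discrete Fourier transform (stands in for an FFT)."""
    L = len(x)
    return [sum(x[k] * cmath.exp(-2j * math.pi * nu * k / L) for k in range(L))
            for nu in range(L)]

def xf_visibility(xm, xn):
    """XF order: circular lag correlation in time, then transform."""
    L = len(xm)
    lags = [sum(xm[t] * xn[(t - tau) % L].conjugate() for t in range(L))
            for tau in range(L)]
    return dft(lags)

def fx_visibility(xm, xn):
    """FX order: transform each stream, then conjugate multiply per channel."""
    return [a * b.conjugate() for a, b in zip(dft(xm), dft(xn))]

# Arbitrary complex test signals (assumed values).
xm = [complex(math.sin(0.7 * k), math.cos(0.3 * k)) for k in range(8)]
xn = [complex(math.cos(1.1 * k), math.sin(0.5 * k)) for k in range(8)]

max_difference = max(abs(a - b)
                     for a, b in zip(xf_visibility(xm, xn), fx_visibility(xm, xn)))
```

The two results agree to floating point rounding, while the XF path needs one transform per pair and the FX path needs one per stream.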

[Figure 2.6 diagram: nodes are x(t), data streams (time); S(ω), spectra (frequency); y(τ), correlated pairs (time); and C(ω), complex visibilities (frequency). The FX path applies an FFT followed by conjugate multiplication; the XF path applies correlation followed by an FFT.]

Figure 2.6: Correlator architectures. The role of the correlator is to convert the input time series (upper left) into the output visibility data (lower right). Shown are two approaches: the XF (or Lag) and the FX correlator. The XF correlator first performs a correlation in the time domain for each pair of input data streams, and then performs a fast Fourier transform (FFT) for each pair. The FX correlator performs an FFT for each data stream, and then a conjugate multiplication in the frequency domain for each pair of streams. The two approaches are mathematically equivalent. This diagram is similar to that of an autocorrelation [86], except there are two separate data streams undergoing correlation.


2.2.1 Digital FX Correlator

Prior to reaching the FX correlator, each of the N radio frequency signals, x(t), from the telescope receivers is digitally sampled and quantised to an integer bit representation, r_n[k]. This is represented by

\[ r_n[k] = x_n(k \Delta\tau) \tag{2.15} \]

for k ∈ Z and where x_n(t) is the quantisation of x(t). The sampling period is defined by ∆τ, the amount of time between consecutive samples. The reciprocal of the sampling period gives the sampling frequency. The sampling frequency must be at least the Nyquist rate, which is double the highest frequency present in the sampled signal. A more thorough analysis of the

effects of discrete sampling and quantisation is outside the scope of this background,

and the reader is referred to Oppenheim and Schafer for further details [75].

The digital FX correlator processes the digitised signals in the three stages shown

in Figure 2.7: the unpack, the frequency transform, and the conjugate multiply and

accumulate (CMAC) stages. The first stage unpacks the digitally sampled signals, r_n[k], to a floating point representation, x_n[k]. The nature of the unpacking operation is dependent on the data packing scheme used by the interferometer receiver hardware. If the signals are organised to be contiguous for a given time, the data must additionally be shuffled such that the samples for each stream are contiguous. This latter organisation is required for the fast Fourier transform. This shuffle operation is referred to as corner turning. This work assumes corner turning is not required, and only unpacks signals that are stream-contiguous in this stage.
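Corner turning can be pictured as a transpose from time-contiguous to stream-contiguous ordering. The sketch below assumes a simple interleaved layout (one sample per stream for each time step) and integer samples that convert directly to floating point; real receiver packing schemes differ, and the function names are hypothetical.

```python
def corner_turn(time_major, n_streams):
    """Reorder [t0s0, t0s1, ..., t1s0, t1s1, ...] into one list per stream."""
    if len(time_major) % n_streams != 0:
        raise ValueError("sample count must be a multiple of the stream count")
    return [time_major[s::n_streams] for s in range(n_streams)]

def unpack_to_float(samples):
    """Convert quantised integer samples to floating point values."""
    return [float(v) for v in samples]

# Three streams, three time steps; values encode (stream * 10 + time).
interleaved = [0, 10, 20, 1, 11, 21, 2, 12, 22]
streams = [unpack_to_float(s) for s in corner_turn(interleaved, 3)]
```

After the shuffle each stream occupies a contiguous run of samples, which is the layout the transform stage expects.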

The second stage transforms the floating point signals to the frequency domain,

typically utilising a discrete Fourier transform, given by

\[ S_{a,n}[\nu] = \sum_{l=0}^{L-1} x_n[l]\, e^{-i 2\pi \nu l / L} \tag{2.16} \]

This produces the output spectrum S_{a,n}[ν] over a discrete range of frequencies ν for the ath spectrum in the nth telescope stream. For the F desired frequency channels in the complex visibilities output by the FX correlator, an FFT of length L is used. This can be either a real to complex FFT of length L_R = 2F − 2, or a complex to complex FFT of length L_C = F, depending on whether the telescope data is real or complex.

The third stage of the FX correlator is the conjugate multiply and accumulate (CMAC) stage. Each m-n pair of frequency spectra is conjugate multiplied, and then accumulated across A transforms using

\[ C_{m,n}[\nu] = \sum_{a=0}^{A-1} S_{a,m}[\nu]\, S_{a,n}^{*}[\nu] \tag{2.17} \]

The size of the accumulation, A, is dependent on the interferometer specification.

The integration used in an accumulation reduces the effects of noise as the length of

the accumulation increases. This occurs because the frequency components of the

observed source are assumed to remain constant over the period of accumulation,

while the noise is a random process. However, accumulations must be sufficiently

short that effects due to the rotation of the earth are negligible. Furthermore,

variations in the observed signal that are significantly shorter than the accumulation

length are lost. This accumulation produces the complex visibilities Cm,n[ν] for the

p_total pairs of signals that were defined previously in Equation 2.8. For the p_auto

autocorrelation pairs, the complex visibility is an accumulation of the signal power

spectral density. Once an accumulation is complete, the results are output and the


next accumulation begins.
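The transform and CMAC stages can be sketched serially as follows. This is a minimal illustration, not the implementation developed in this work: it assumes complex, already unpacked, stream-contiguous samples, uses a direct transform in place of an FFT, and stores the visibilities in a dictionary keyed by (m, n) with m ≤ n. All names are hypothetical.

```python
import cmath
import math

def dft(x):
    """Direct discrete Fourier transform (stands in for an FFT)."""
    L = len(x)
    return [sum(x[l] * cmath.exp(-2j * math.pi * nu * l / L) for l in range(L))
            for nu in range(L)]

def fx_correlate(streams, fft_length, n_accumulate):
    """Accumulate S_m[nu] * conj(S_n[nu]) over n_accumulate transforms."""
    n_streams = len(streams)
    vis = {(m, n): [0j] * fft_length
           for m in range(n_streams) for n in range(m, n_streams)}
    for a in range(n_accumulate):
        start = a * fft_length
        # Stage 2: one transform per stream for this block of samples.
        spectra = [dft(s[start:start + fft_length]) for s in streams]
        # Stage 3: conjugate multiply and accumulate for every pair.
        for (m, n), acc in vis.items():
            for nu in range(fft_length):
                acc[nu] += spectra[m][nu] * spectra[n][nu].conjugate()
    return vis

# Assumed input: a tone in channel 2, and a copy shifted in phase by 90 degrees.
L, A = 8, 4
tone = [cmath.exp(2j * math.pi * 2 * k / L) for k in range(L * A)]
shifted = [v * 1j for v in tone]
visibilities = fx_correlate([tone, shifted], L, A)
```

For this input the autocorrelation entries are real, and the cross-correlation entry carries the phase difference between the two streams in channel 2.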

Because the FX approach is reliant on the Fourier transform, the algorithm

contains spectral leakage, which is also referred to as ringing [10]. This phenomenon

is due to frequencies not directly aligned with the output spectral channels appearing

to some extent in every other spectral channel. The FFT of an aligned, and a non-

aligned single frequency complex sinusoid signal is shown in Figure 2.8.

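To demonstrate the cause of this effect, consider a complex sinusoid time series x_{ν′}[k] of frequency ν′:

\[ x_{\nu'}[k] = e^{i 2\pi k \nu' / L} \tag{2.18} \]

Substituting this into the discrete Fourier transform results in:

\begin{align*}
X_{\nu'}[\nu] &= \sum_{k=0}^{L-1} x_{\nu'}[k]\, e^{-i 2\pi \nu k / L} \\
              &= \sum_{k=0}^{L-1} e^{i 2\pi k \nu' / L}\, e^{-i 2\pi \nu k / L} \\
              &= \sum_{k=0}^{L-1} e^{i 2\pi k (\nu' - \nu) / L} \\
              &= \sum_{k=0}^{L-1} e^{-i k \mu}, \qquad \mu = 2\pi(\nu - \nu')/L
\end{align*}

This geometric series can be evaluated to the following [52]:

\begin{align*}
X_{\nu'}[\nu] &= e^{-i \mu (L-1)/2}\, \frac{\sin(\mu L / 2)}{\sin(\mu / 2)} \\
              &= e^{-i\left(\pi(\nu - \nu') - \pi(\nu - \nu')/L\right)}\, \frac{\sin\left(\pi(\nu - \nu')\right)}{\sin\left(\pi(\nu - \nu')/L\right)}
\end{align*}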

Thus, the ringing is caused by the discrete spectral channel output sampling the continuous frequency response, X_{ν′}[ν], shown in Figure 2.9. In particular, it is the sampling of the side-lobes present on each side of the main lobe that causes the leakage. The frequency response of the FFT is caused by using a finite time series to represent a continuous signal. Weighting functions, such as Hamming and Hanning windows, are commonly used to efficiently reduce the leakage. A polyphase filterbank is another technique used in radio astronomy to reduce the leakage effect, and is used in combination with weighting functions. The polyphase filterbank is next detailed.
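The leakage effect can be reproduced with a short numerical experiment. The sketch below transforms an aligned tone and a non-aligned tone with a direct transform, and compares the non-aligned result against the closed form written in terms of µ. The transform length of 16 and the tone frequencies 4 and 4.5 are assumed values, and the function names are illustrative.

```python
import cmath
import math

def dft(x):
    """Direct discrete Fourier transform (stands in for an FFT)."""
    L = len(x)
    return [sum(x[k] * cmath.exp(-2j * math.pi * nu * k / L) for k in range(L))
            for nu in range(L)]

def tone(freq, L):
    """Complex sinusoid exp(i 2 pi k freq / L) for k = 0 .. L-1."""
    return [cmath.exp(2j * math.pi * k * freq / L) for k in range(L)]

def closed_form(freq, nu, L):
    """Closed-form response; valid when nu - freq is not a multiple of L."""
    mu = 2 * math.pi * (nu - freq) / L
    return cmath.exp(-1j * mu * (L - 1) / 2) * math.sin(mu * L / 2) / math.sin(mu / 2)

L = 16
aligned = [abs(v) / L for v in dft(tone(4.0, L))]   # response in one channel only
leaky = [abs(v) / L for v in dft(tone(4.5, L))]     # response in every channel
```

The aligned tone lands on the zeros of the response in every other channel, whereas the non-aligned tone samples the side-lobes and so appears in all of them.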


[Figure 2.7 flowchart: Initialise; Read digital samples; Unpack (Stage 1); Fast Fourier transform (Stage 2); CMAC (Stage 3); decision "Accumulation complete?" (NO/YES); Write accumulated complex visibilities; decision "Is there more data to process?" (NO/YES); Finalise.]

Figure 2.7: Serial FX correlator algorithm. For each of the N signal streams, a time series of data corresponding to the fast Fourier transform (FFT) length L is processed during each pass of the algorithm. Each correlation pass consists of three stages: unpack, frequency transform, and the CMAC. After sufficient passes complete an accumulation, a complex visibility for each antenna pair is produced and the subsequent accumulation begins.

[Figure 2.8 plots: amplitude against frequency channel (0 to 16). (a) Aligned frequency signal, freq = 4. (b) Non-aligned frequency signal, freq = 4.5.]

Figure 2.8: Fourier transform spectral leakage. Shown in the first graph is the Fourier transform of a complex sinusoid with a frequency aligned to one of the spectral channels. Shown in the second graph is the Fourier transform of another complex sinusoid with frequency directly between two of the spectral channels. When the frequency is aligned, the other spectral channels have no response. However, for a non-aligned frequency there is a response in every other spectral channel. This response is called spectral leakage, and is also referred to as ringing [10].


[Figure 2.9 plot: amplitude against frequency channel (0 to 16).]

Figure 2.9: Fourier transform spectral response. Shown in the graph is the spectral response for a Fourier transform of a complex sinusoid of a single frequency. The response is zero at values that differ from the driving frequency by a multiple of the spectral channel width. Between these zeros, there are side lobes, which cause spectral leakage for non-aligned frequencies.

2.3 Polyphase Filter Techniques

The use of a Fourier transform that weights all lags evenly leads to a result that

is the true autocorrelation multiplied by a weighting function that has a Fourier

transform corresponding to a sinc function. This produces side-lobes either side of

the main peak that are responsible for the spectral leakage. Reduction of such lobes

is addressed using weighting functions and polyphase filter approaches.

A polyphase filter bank (PFB) efficiently implements a bank of evenly spaced,

digital Finite Impulse Response (FIR) filters. This approach effectively improves the

filter response of each channel in the Fourier transform [78, 85]. Consider a digitally

sampled and unpacked time series x[n], with F frequency channels in the desired

FFT output spectra, S[ν]. The polyphase filter then consists of T FIR filters, or taps, each of which is the same length as the FFT.

An FIR filter of length L is represented by a series of L weights: ρ_0 ... ρ_l ... ρ_{L−1}. In a traditional filter approach, the filter would be applied to the time series and the spectra obtained via an FFT

\[ S[\nu] = \sum_{n=0}^{L-1} \rho[n]\, x[n]\, e^{-2\pi i \nu n / L} \tag{2.19} \]

As described by Bunton [9], the polyphase filter reduces the number of transform values calculated by a factor of the number of taps, T. Instead of ν ∈ [0, L − 1], output is calculated for ν′ ∈ [0, L/T − 1] using

\[ S[\nu'] = \sum_{n=0}^{L-1} \rho[n]\, x[n]\, e^{-2\pi i \nu' n T / L} \tag{2.20} \]


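This is rearranged using M = L/T to reduce redundant operations

\begin{align}
S[\nu'] &= \sum_{m=0}^{T-1} \sum_{n=0}^{M-1} \rho[n + mM]\, x[n + mM]\, e^{-2\pi i \nu' (n + mM) T / L} \tag{2.21} \\
        &= \sum_{m=0}^{T-1} \sum_{n=0}^{M-1} \rho[n + mM]\, x[n + mM]\, e^{-2\pi i \nu' n T / L}\, e^{-2\pi i \nu' m} \tag{2.22} \\
        &= \sum_{m=0}^{T-1} \sum_{n=0}^{M-1} \rho[n + mM]\, x[n + mM]\, e^{-2\pi i \nu' n / M} \tag{2.23} \\
        &= \sum_{n=0}^{M-1} \left[ \sum_{m=0}^{T-1} \rho[n + mM]\, x[n + mM] \right] e^{-2\pi i \nu' n / M} \tag{2.24} \\
        &= \sum_{n=0}^{M-1} b[n]\, e^{-2\pi i \nu' n / M} \tag{2.25}
\end{align}

Thus a smaller FFT of length M is used along with a filter b[n] defined as

\[ b[n] = \sum_{m=0}^{T-1} \rho[n + mM]\, x[n + mM] \tag{2.26} \]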
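The rearrangement can be verified numerically: summing the weighted samples across the T taps and applying a length-M transform should give the same values as taking every T-th channel of the weighted length-L transform. The sketch below assumes L is a multiple of T, uses a direct transform in place of an FFT, and uses a Hanning weighting and an arbitrary test signal purely as placeholders. The function names are hypothetical.

```python
import cmath
import math

def dft(x):
    """Direct discrete Fourier transform (stands in for an FFT)."""
    L = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * nu * n / L) for n in range(L))
            for nu in range(L)]

def polyphase_spectrum(weights, samples, taps):
    """Form b[n] by summing weighted samples over the taps, then transform."""
    M = len(samples) // taps
    b = [sum(weights[n + m * M] * samples[n + m * M] for m in range(taps))
         for n in range(M)]
    return dft(b)

def decimated_direct_spectrum(weights, samples, taps):
    """Weighted length-L transform, keeping only every taps-th channel."""
    full = dft([w * x for w, x in zip(weights, samples)])
    return full[::taps]

L, T = 32, 4
weights = [0.5 - 0.5 * math.cos(2 * math.pi * n / L) for n in range(L)]   # placeholder weights
samples = [complex(math.cos(0.9 * n), math.sin(0.4 * n)) for n in range(L)]

max_difference = max(abs(a - b) for a, b in zip(
    polyphase_spectrum(weights, samples, T),
    decimated_direct_spectrum(weights, samples, T)))
```

The polyphase form reaches the same M output channels with a transform that is T times shorter, at the cost of T multiply-adds per input to the transform.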

The choice of weighting function for the polyphase filter is crucial. While using

a rectangular weighting function reduces the sidelobes as shown in Figure 2.10, this

occurs because the non-aligned frequency components in the Fourier transform are

entirely removed as shown in Figure 2.11. For radio astronomy, loss of spectral

information is undesirable. A hanning-windowed sinc filter defined as

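\[ \rho[l] = \operatorname{sinc}\!\left( \frac{L/2 - l}{F} \right) \left( \frac{1}{2} - \frac{1}{2} \cos\!\left( \frac{2\pi l}{L} \right) \right) \tag{2.27} \]

where

\[ \operatorname{sinc}(x) = \begin{cases} \sin(\pi x)/(\pi x), & x \neq 0 \\ 1, & x = 0 \end{cases} \tag{2.28} \]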

is used to reduce ringing while retaining spectral features. Figure 2.12 shows that

the ringing is significantly reduced, while the spectral response for non-aligned fre-

quencies is retained when using a polyphase sinc filter.
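The Hanning-windowed sinc weights can be generated directly from Equations 2.27 and 2.28. The sketch below is illustrative; the filter length of 32 and the 8 channels are assumed values (corresponding to four taps), and the function names are hypothetical.

```python
import math

def sinc(x):
    """Normalised sinc: sin(pi x) / (pi x), with sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

def hanning_sinc_weights(filter_length, n_channels):
    """rho[l] = sinc((L/2 - l) / F) * (1/2 - 1/2 cos(2 pi l / L))."""
    L, F = filter_length, n_channels
    return [sinc((L / 2 - l) / F) * (0.5 - 0.5 * math.cos(2 * math.pi * l / L))
            for l in range(L)]

weights = hanning_sinc_weights(32, 8)
```

The weights peak at the centre of the filter, fall to zero at the start of the window, and pass through the zeros of the sinc at multiples of F samples from the centre.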

The FX correlation algorithm predominantly consists of floating point calcula-

tions, which are suitable for data parallel processing. Thus, it is an ideal candidate

for a GPU computing approach. This would utilise the parallelism of the GPU to

obtain processing performance, while maintaining some of the flexibility of software

correlation techniques traditionally applied to the CPU. To leverage the processing power of the GPU, the algorithm must be implemented in a parallel manner. The parallel implementation remains mathematically identical to the serial algorithm, while being structured to fit the GPU computing paradigm.


[Figure 2.10 plot: amplitude against relative frequency (0 to 16) for T = 1 and T = 2.]

Figure 2.10: Polyphase Fourier transform response. Shown in the graph is the spectral response of a polyphase filter and subsequent FFT applied to the same single frequency sinusoidal signal used in Figure 2.9. The responses of polyphase filters with taps T = 1 and T = 2 are shown. As the number of taps increases, the response in the side lobes decreases significantly.

[Figure 2.11 plot: effective filter weight (0 to 1) against relative frequency (−0.5 to 0.5) for T = 1, T = 2, T = 4, and T = 8.]

Figure 2.11: Rectangular polyphase filter. This graph shows the effect of a unit rectangular polyphase filter with T taps. The domain of the graph is a single Fourier transform frequency channel, where the origin represents an aligned frequency, and ±0.5 represents half the distance to the subsequent frequency channel. Thus as T increases, the polyphase filter shown selects closer to the aligned frequencies, and significantly reduces the non-aligned frequencies. This graph is periodic, and repeats for all frequency channels.


[Figure 2.12 plots: amplitude against frequency channel (0 to 16). (a) Aligned frequency signal, freq = 4. (b) Non-aligned frequency signal, freq = 4.5.]

Figure 2.12: Polyphase sinc filter. Shown in the first graph is the Fourier transform of a filtered complex sinusoid with a frequency aligned to one of the spectral channels. The other spectral channels have no response. Shown in the second graph is the Fourier transform of a filtered complex sinusoid with frequency directly between two of the spectral channels. There is only a response in the two spectral channels adjacent to the signal.

2.4 Parallel Processing

Parallel processing utilises multiple computational entities to solve a single prob-

lem [84]. In contrast, serial processing is limited to a single computational entity.

Following the conventions of Massingill, Mattson and Sanders [55], I will refer to a

unit of execution (UE) for a single computational entity. Computer programs must

make use of multiple UEs in order to utilise parallel hardware architectures.

Parallel architectures are classified as either homogeneous or heterogeneous. Ho-

mogeneous computing architectures use identical processing cores. An example of

a homogeneous parallel architecture is a multicore CPU. Homogeneous systems are

typically easier to program, given their single processor type. In contrast, hetero-

geneous computing architectures use more than one type of processing core. An

example of a heterogeneous parallel architecture is a machine that utilises both a

CPU and a GPU. The advantage of heterogeneous systems is that the different

processors can be used for the algorithms that they are best suited to. The re-

sulting performance of the heterogeneous architectures can be worth the additional

programming investment.

While parallel hardware has existed for decades, software has yet to adapt to

such architectures. In the past such adaption was not necessary, due to the steady

improvement of single core processing performance. However, the advent of the

power wall has restricted growth in processing speed of serial architectures [53].

The power wall limits the speed of serial processor technology, because increasing

power consumption beyond this wall causes the processors to overheat. In turn,

hardware architectures have adopted parallelism in order to continue performance

growth. Legacy serial codes must be adapted to use multiple UEs to take advantage


of these new parallel architectures.

Parallelisation of legacy serial code is achieved via several design stages. Firstly,

the application must be decomposed into a collection of tasks that can be processed

by UEs. Those tasks that can be executed concurrently are identified, along with

dependencies that prevent the parallel execution of some tasks. The tasks and their

dependencies are classified with an algorithmic structure tree. Shown in Figure 2.13,

the algorithmic structure tree organises different parallel algorithm approaches, or

patterns. Similar patterns are grouped into three categories that form the three main

branches of the tree: organisation by task, organisation by data decomposition, and

organisation by data flow [56].

The organisation by task branch contains patterns in which the focus of the parallelism is on the tasks performed by a program. Patterns of this branch require tasks within an application that can be processed independently of one another. If

these tasks are all identical and independent, the pattern is referred to as embar-

rassingly parallel. Another pattern in this branch is divide and conquer. The divide

and conquer pattern first splits a problem into smaller concurrent subtasks, and

merges the results of these subtasks to solve the original problem. If the tasks are

instead dissimilar, the structure is classified as task parallelism. The task parallelism

pattern uses a single UE for each identified task.
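The embarrassingly parallel pattern can be sketched with a pool of workers that each process an independent item. The example below is only a structural illustration: the task (computing the power in each channel of a spectrum) and the names are assumed, and thread-based workers in CPython do not necessarily speed up pure arithmetic, so the sketch shows the organisation of the work rather than a performance gain.

```python
from concurrent.futures import ThreadPoolExecutor

def channel_power(spectrum):
    """Independent task: squared magnitude of every channel in one spectrum."""
    return [abs(v) ** 2 for v in spectrum]

def parallel_map(task, items, n_workers=4):
    """Apply the same task to each item, one unit of execution per item."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(task, items))

spectra = [[1 + 1j, 2 + 0j], [3j, 0j], [0.5 + 0j, -2j]]
serial_result = [channel_power(s) for s in spectra]
parallel_result = parallel_map(channel_power, spectra)
```

Because no task depends on another, the parallel result matches the serial one regardless of the order in which the workers finish.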

The organisation by data decomposition branch of the algorithmic structure tree

contains patterns in which data parallelism is optimal. This type of pattern requires

the UEs to access the data in an independent manner. If the data access of the UEs is geometrically structured, and the UEs communicate only with close neighbours, the structure is classified as geometric decomposition. This case is similar to the embarrassingly parallel case; however, some data must be shared between UEs. If instead the data is inherently recursive, such as that of a tree

or graph data structure, the algorithmic structure is classified as a recursive data

pattern. In this case, the level of parallelism possible can vary as the data structure

is traversed.

The organisation by data flow branch of the algorithmic structure tree contains

patterns in which the focus of the parallelism is the flow of the data through the

program. In these algorithms, the data flows are processed by a series of tasks. UEs

processing one chunk of data may execute in parallel with other UEs processing

other data chunks. If all of the data flows through the same sequence of tasks,

this structure is classified as a pipeline. Each stage of the pipeline is executed

concurrently as the data chunks make their way through. If instead the tasks are

processed in an irregular manner, the structure is classified as event based. In this

pattern, UEs generate tasks that are processed by other UEs. Where these tasks

are independent, concurrent processing can occur.

Once an algorithm has been classified, the actual implementation is developed

on a particular parallel hardware architecture. The choice of hardware architecture

depends on the pattern of parallelism in the algorithm. This is because different

architectures are better suited to different types of parallelism. For example, task

parallel algorithms typically suit the multicore architecture, while data parallel al-

gorithms typically suit the GPU architecture. The architectures are classified to

enable a match between parallel pattern and architecture.

Computing architectures are classified using Flynn’s taxonomy [30], as shown

in Figure 2.14. There are four main classifications. Single Instruction Single Data

(SISD) architectures are serial, performing a single instruction on a single data

element at a time. Single Instruction Multiple Data (SIMD) architectures perform


the same instruction concurrently on multiple data elements at a time, and suit data

parallel algorithms. Multiple Instruction Single Data (MISD) architectures perform

multiple instructions on the same data element in a pipeline, and suit pipeline

algorithms. Multiple Instruction Multiple Data (MIMD) architectures are capable

of performing multiple instructions concurrently on multiple data elements.

There exists a sub-category of MIMD, known as single program multiple data

(SPMD) [46]. In this scheme, each UE executes the same program, but the UEs are not necessarily at the same stage of the program as one another. With scheduling capabilities, this allows memory latency to be hidden by the processing of other UEs. That is, while a UE waits for a memory access to be processed by the memory subsystem, it is suspended by the scheduler and a different UE is executed to make use of the otherwise idle processor. If a UE has a sufficiently high ratio of computation to data

transfer, or arithmetic intensity [21, 82], the memory latency may be completely

concealed. The GPU is an example of the SPMD architecture, and is suited to a va-

riety of the patterns in the algorithmic structure tree, including the embarrassingly

parallel, geometric decomposition, and pipeline patterns.

The performance improvement obtained from parallel methodology varies, and

is dependent on both the algorithmic structure of an application as well as the com-

puter architecture. Amdahl’s law [4] is used to estimate the maximum improvement

in performance. Specifically, for that proportion of the algorithm that is parallelised,

p, between N UEs, and the remaining serial proportion, s = 1− p, the performance

will improve by a factor r estimated using

\[ r = \frac{1}{s + \frac{1 - s}{N}} \tag{2.29} \]

Thus, the performance improvement of a parallel program is limited by the time

needed for the sequential fraction of the program. As shown in Figure 2.15(a), as

the number of processors N becomes sufficiently large (N >> (1 − s)/s) the value of r

becomes constant. The corresponding value of the plateau is r = 1/s.
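As a minimal sketch of Equation 2.29, the following Python function evaluates the estimated speedup for a given serial proportion and number of UEs. The function name `amdahl_speedup` and the sample value s = 0.1 are illustrative assumptions, not part of the original work.

```python
def amdahl_speedup(s, n):
    """Estimated speedup r (Equation 2.29) for serial proportion s on n UEs."""
    if not 0.0 < s <= 1.0:
        raise ValueError("serial proportion s must be in (0, 1]")
    if n < 1:
        raise ValueError("number of UEs n must be at least 1")
    return 1.0 / (s + (1.0 - s) / n)


# Illustrative values only: with s = 0.1 the speedup approaches,
# but never reaches, the plateau 1/s = 10 as n grows.
for n in (1, 4, 16, 64, 256, 1024):
    print(n, round(amdahl_speedup(0.1, n), 2))
```

The loop uses the same values of N as the axis of Figure 2.15(a), so the printed column can be compared against the shape of the plotted curves.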

However, as noted by Gustafson [36], the serial proportion typically decreases

with increasing parallelism. This is because the absolute size of the serial part of a

program does not necessarily increase as the program becomes more parallel. Thus

as the program becomes more parallel, the serial part of the program as a proportion

decreases. Gustafson’s law [36] uses a scheme in which proportions are calculated

based on the code run by a single UE, and thus the serial proportion, s′, is constant

with increasing parallelism for real world problems. As shown in Figure 2.15(b),

this results in additional UEs continuing to improve performance, estimated by

\[ s' = \frac{Ns}{(N - 1)s + 1} \tag{2.30} \]

\[ p' = \frac{p}{(1 - N)p + N} \tag{2.31} \]

\[ r' = s' + p'N \tag{2.32} \]

In the above equations, performance continues to increase for all values of N . Sub-


stituting Equations 2.30 and 2.31 into Equation 2.32

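\begin{align}
r' &= s' + p'N \tag{2.33} \\
   &= \frac{Ns}{(N - 1)s + 1} + \frac{p}{(1 - N)p + N} N \tag{2.34} \\
   &= \frac{Ns}{(N - 1)s + 1} + \frac{(1 - s)N}{(1 - N)(1 - s) + N} \tag{2.35} \\
   &= \frac{N}{Ns + 1 - s} \tag{2.36} \\
   &= \frac{1}{s + \frac{1 - s}{N}} \tag{2.37} \\
   &= r \tag{2.38}
\end{align}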

yields an equation for r′ identical to that of Equation 2.29. Thus for any given

problem, Amdahl’s and Gustafson’s laws provide identical performance predictions.
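The equivalence can also be checked numerically. The sketch below evaluates Equations 2.30 to 2.32 and compares the result with Equation 2.29; the function names and the test values of s and N are assumptions chosen for illustration.

```python
def amdahl_speedup(s, n):
    """Equation 2.29: speedup from the serial proportion s of the whole program."""
    return 1.0 / (s + (1.0 - s) / n)


def gustafson_proportions(s, n):
    """Equations 2.30 and 2.31: proportions as seen by a single UE."""
    p = 1.0 - s
    s_prime = n * s / ((n - 1) * s + 1.0)
    p_prime = p / ((1 - n) * p + n)
    return s_prime, p_prime


def gustafson_speedup(s, n):
    """Equation 2.32: speedup from the per-UE proportions."""
    s_prime, p_prime = gustafson_proportions(s, n)
    return s_prime + p_prime * n


# The two estimates agree for every combination tried here.
for s in (0.5, 0.2, 0.1, 0.05, 0.02):
    for n in (1, 4, 16, 64, 256, 1024):
        assert abs(gustafson_speedup(s, n) - amdahl_speedup(s, n)) < 1e-9
```

Note that s′ and p′ sum to one for any N, which is what makes them valid proportions of the code run by a single UE.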

[Figure 2.13: tree diagram. Root: Algorithmic Patterns. Type of decomposition: Task, Data, Data Flow. Patterns: Task Parallelism, Divide and Conquer, Embarrassingly Parallel, Geometric Decomposition, Recursive Data, Pipeline, Event based.]

Figure 2.13: Algorithmic structure tree. The algorithmic structure tree organises different parallel algorithm approaches, or patterns. Similar patterns are grouped into three categories that form the three main branches of the tree: decomposition by task, decomposition by data, and decomposition by data flow [56]. The shading indicates the patterns that are most suited to GPU processing. It should be noted that the task parallelism in GPU processing occurs in the distribution of work between the CPU and GPU, while the other patterns are suited to the parallel processing of work by the GPU.


[Figure 2.14: grid diagram. Columns: Single Data (SD), Multiple Data (MD). Rows: Single Instruction (SI), Multiple Instruction (MI). Cells: SISD, SIMD, MISD, MIMD, with SPMD also marked.]

Figure 2.14: Flynn’s taxonomy. Parallel computing architectures are classified using Flynn’s taxonomy [30]. There are four main classifications. Single Instruction Single Data (SISD) architectures are serial, performing a single instruction on a single data element at a time. Single Instruction Multiple Data (SIMD) architectures perform the same instruction concurrently on multiple data elements at a time, and suit data parallel algorithms. Multiple Instruction Single Data (MISD) architectures perform multiple instructions on the same data element in a pipeline, and suit pipeline algorithms. Multiple Instruction Multiple Data (MIMD) architectures are capable of performing multiple instructions concurrently on multiple data elements. Single program multiple data (SPMD) is a subcategory of MIMD, in which each process executes the same program of instructions, but is not necessarily at the same stage of the program as the other processes.

[Figure 2.15(a): Amdahl’s law. R (speedup) against N (number of parallel threads), both axes with ticks at 1, 4, 16, 64, 256 and 1024. Curves for s = 0.5, 0.2, 0.1, 0.05 and 0.02, and a line labelled R = 1/N.]

[Figure 2.15(b): Gustafson’s law. R (speedup) against N (number of parallel threads), both axes with ticks at 1, 4, 16, 64, 256 and 1024. Curves for s′ = 0.5, 0.2 and 0.1, and a line labelled R = 1/N.]

Figure 2.15: Parallel performance. Amdahl’s law shows that there is a limit to the effectiveness of parallel programming [4]. As shown in Figure 2.15(a), for a constant serial code portion s, the improvement in performance plateaus as the number of UEs increases. However, for real world problems the serial portion typically decreases with increasing parallelism. This was formalised in Gustafson’s law [36]. In this scheme, the portions are calculated based on a single UE, and thus the serial portion, s′, is constant with increasing parallelism. This results in Equation 2.32, in which additional UEs continue to improve performance as shown in Figure 2.15(b).


2.5 Programmable Graphics

Prior to the introduction of the graphics processing unit (GPU), video output on

the personal computer was transmitted to the screen via a video graphics array (VGA)

controller. The image was generated, or rendered, entirely by the CPU, and the VGA

controller functioned as an interface between the CPU and the computer screen [29].

Rendering is a multi-stage process that produces a two-dimensional screen image of

a virtual three-dimensional space, as shown in Figure 2.16.

The pipeline begins with a 3D scene of shapes defined by vertices. Vertices are

vectors in the three dimensional space of the scene that locate the corners of these

shapes. These vertices are then transformed by the vertex processor, from their

position in the 3D space in the scene into the corresponding 2D position in the

screen. The transformed vertices then undergo primitive assembly to obtain the

shapes they represent, called primitives. The next stage, rasterisation, uses the 2D

screen coordinates of the primitives to determine which of the screen pixels are within

the shape. These pixels along with additional data interpolated from the vertices

are output from the rasterisation unit and are referred to as fragments. In the final

stage of the pipeline the fragment processor determines the final colour of the pixels.

For each fragment from the rasteriser, it uses the contained interpolated data to

sample textures. Textures are typically two dimensional images, which are overlaid

onto the primitives by this process. The pixels are output to the framebuffer, from

where they may be displayed to the screen or saved back to the texture memory for

subsequent reuse.

The CPU is an excellent general purpose processor, with real estate split between

floating point, logic, and cache units. However, the desire to bring realtime video

output to photorealistic quality necessitated the creation of a processor dedicated

to rendering, the GPU. Because the GPU devotes the majority of the real estate to

floating point units [102], it could obtain superior performance for graphics render-

ing. Over the history of its development, the GPU gradually took over the various

stages of graphical computation from the CPU, as next discussed.


[Figure 2.16: pipeline diagram. Labels: Graphics API, Vertices, Vertex Processor & Primitive Assembly, Primitive, Rasterisation, Fragments, Fragment Processor, Textures, Pixels, Screen.]

Figure 2.16: Programmable rendering pipeline. Shown are the stages required to convert 3D virtual spaces to 2D images. In this example, a square in three dimensional space defined by its four corner vertices is rendered. The vertices are first transformed by the vertex processor into the 2D space of the screen, and assembled into a square primitive. The primitive is converted into fragments, which are the pixels that fall within the primitive along with additional information. The fragments are then textured by the fragment processor to produce a final 2D image that is either displayed to the screen or stored in texture memory.

2.5.1 Development of the Programmable GPU

Three dimensional (3D) computer graphics rendering began with the rendering

pipeline, shown in Figure 2.17. This pipeline was originally executed in software on

the CPU, or by specialised computer systems dedicated to this purpose [61]. How-

ever, both of these approaches were unable to meet a growing demand for consumer

graphics. The specialised computer systems were too expensive for this market.

While the CPU could produce quality graphics, it was computationally outmatched

by the introduction of consumer graphics hardware, called the Graphics Processing

Unit (GPU).

GPUs were introduced in the late 1990s with the release of graphics cards such as NVIDIA’s

RIVA TNT card, ATI’s Rage 128 card, and 3dfx’s Voodoo3. As shown in Fig-

ure 2.17, these first GPUs performed rasterisation and fragment shading with a

fixed function capacity. A fixed function pipeline stage performs standard render-

ing operations, and customisation is limited to altering the values of parameters.

Table 2.1 summarises the release dates and associated computational power of the

TNT and subsequent NVIDIA graphics hardware.

In 1999 the functionality of the GPU was increased to support the entire ren-

dering pipeline with fixed function capability. The NVIDIA Corporation released

the GeForce256 graphics card, and this was soon followed by the Radeon card from

ATI. The addition of the initial pipeline stages reduced the computational pressure

on the central processing unit. At this time, graphics cards were highly configurable

but not programmable. Thus, the cards were limited in their functionality to that

of the standard rendering pipeline.

This limitation was removed with the next generation of GPUs, released in 2001.


These included NVIDIA’s GeForce3 and later GeForce4, as well as ATI’s Radeon

8500 card. These early programmable GPUs contained a programmable vertex pro-

cessor, which allowed user-written programs to be executed on a per-vertex basis in

the rendering pipeline. The programmable vertex processor was located at the be-

ginning of the pipeline, prior to the primitive assembly, as can be seen in Figure 2.18.

While the inclusion of the vertex processor made general purpose programming on

graphical hardware possible, due to its location early in the pipeline the results of

the program were lost in the remaining graphics pipeline processing.

In 2003, NVIDIA’s GeForceFX and ATI’s Radeon 9700 graphics cards were re-

leased, and introduced a programmable fragment processor that allowed fragment

programs to be run on a per-fragment basis in the rendering pipeline. The pro-

grammable fragment processor was located at the end of the pipeline after the

rasterisation, as can be seen in Figure 2.18. Its proximity to the end of the pipeline

made the fragment processor preferred for general purpose computing, since the re-

sults of the program could be read directly from the framebuffer. The term GPGPU

was coined for the programming of the GPU for general purpose algorithms using

graphics programming interfaces.

The GeForceFX was the first graphics card that supported NVIDIA’s Cg lan-

guage [29]. Standing for “C for Graphics”, Cg is a C-like high level programming

language for NVIDIA’s programmable graphics hardware. The release of a high-level

programming language was an important step forward for programmable graphics,

comparable to the development of high level languages for the CPU. At this stage of

development, for the first time the GPU could be programmed for general purpose

applications with a high level graphics programming language.

NVIDIA released its GeForce6 card in September 2004. The card had significant

performance improvements, in particular the introduction of 32 bit floating point

calculations. This GPU also supported the new PCI Express bus architecture, which

is not only twice as fast as the previous AGP 8x bus, but also allowed the use of

several graphics cards in parallel. These two developments significantly increased

the advantages of the graphics hardware for general purpose processing.

The GeForce 8800 was released in late 2006. Combined with the release of

NVIDIA’s CUDA programming API [69, 72], this card revolutionised the use of

GPUs for general purpose computing. The underlying architecture of the GeForce

8800 broke the tradition of using the pipelined hardware implementations of previ-

ous generations. Instead it used a unified architecture, in which one parallel device

was used for each stage of the graphics pipeline via scheduling. At the same time,

the CUDA programming API allowed the programmer to use the GPU as purely a

computational device, by avoiding the graphical paradigms inherent in the languages

used previously in GPGPU techniques. This new non-graphical GPU programming

method is referred to as GPU computing. It made the GPU accessible to a broader

range of programmers, because a computer graphics background was no longer re-

quired to use the GPU.

In 2007, NVIDIA released the Tesla series. These systems, ranging from a single

card to a desk side server, were based on the GeForce 8800 architecture. However,

unlike the GeForce 8800, there was no video output port and significantly more on-

board memory on the Tesla. These cards were thus designed for the general GPU

computing market, in particular applications limited by the on board memory of

existing cards. The GeForce 9800 GX2, released in 2008, contained two GPUs and

surpassed the milestone of a teraflop of GPU processing power on a single card.

The growth of graphics hardware performance, shown in Figure 2.19, is another


incentive to explore its general purpose applications. The CPU speeds over the

past two decades have doubled every eighteen months in accordance with Moore’s

law [62]. This law defines the speed as the density of transistors at a given die size.

In contrast, the GPU speeds have doubled every six months, which NVIDIA refers

to as “Moore’s law cubed” [68]. This is due to the design of the GPU, which can be

improved not only in a similar manner to the CPU, but also in the degree of parallelism.

This enables GPUs to use additional transistors for computation, achieving higher

arithmetic density with the same transistor count [77]. With this current higher rate

of growth, the advantages gained through the utilisation of the graphics hardware

can only increase.
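The two doubling periods quoted above can be compared with a short calculation. The sketch below is illustrative only: the function name `growth_factor` is an assumption, and reading “Moore’s law cubed” as the GPU factor being the cube of the CPU factor over the same period is one interpretation of the phrase.

```python
def growth_factor(months, doubling_period_months):
    """Multiplicative growth after `months`, given a fixed doubling period."""
    return 2.0 ** (months / doubling_period_months)


# Over one eighteen month period:
cpu_factor = growth_factor(18, 18)  # one doubling
gpu_factor = growth_factor(18, 6)   # three doublings

# Under the quoted rates, the GPU factor is the cube of the CPU factor.
assert gpu_factor == cpu_factor ** 3
print(cpu_factor, gpu_factor)
```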

[Figure 2.17: timeline diagram. The rendering pipeline (3D scene, Vertex Transformation, Primitive Assembly, Rasterisation, Fragment Shading, 2D image) is shown against the rows previously, 1998, 2000, 2001, 2003 and 2006, with each stage marked as Programmable on CPU, Fixed Function on GPU, or Programmable on GPU, and with the annotation “Dawn of GPU Computing”.]

Figure 2.17: Development of the GPU pipeline. The rendering pipeline shown at the top of this figure was first implemented on the CPU. Over time, parts of the pipeline were implemented on the GPU with a fixed functionality. A fixed function pipeline stage performs standard rendering operations, and customisation is limited to altering the values of parameters. Over time, these fixed function stages became fully programmable.


Year  Product Name       Process    Transistors  Fill Rate  FLOPS
1998  RIVA TNT           0.25 µm    7 M          50 M*
1999  GeForce 256        0.22 µm    23 M         120 M*
2001  GeForce3           0.15 µm    57 M         800 M
2003  GeForce FX         0.13 µm    125 M        2000 M
2004  GeForce6           0.13 µm    220 M        6400 M
2006  GeForce 8800 GTS   0.09 µm    511 M        10.3 B     345.6 G
2006  GeForce 8800 GTX   0.09 µm    681 M        13.8 B     518 G
2008  GeForce 9800 GX2   0.065 µm   1508 M       19.2 B     1.152 T

Table 2.1: A history of NVIDIA graphics hardware. A table showing the development of the processing power of NVIDIA graphics hardware [29]. The number of transistors is measured in millions and is representative of the complexity of the hardware. The process is the minimum feature size, or die size, of the circuit on the silicon chip, measured in micrometers. The antialiasing fill rate is measured in million pixels per second and represents the speed of pixel computations. Values marked with a * are aliased fill rates due to a lack of hardware support for antialiased rendering. The polygon rate is measured in million polygons per second, and is a measure of the GPU’s ability to draw polygons. The GPU’s peak theoretical floating point operations per second (FLOPS) is also listed for the GPU computing products. The emphasised GeForce 8800 GTS is the GPU that was used for this research.

[Figure 2.18: block diagram. CPU: 3D API, Application. CPU/GPU Bus: AGP or PCI-Express. GPU: Front End, Vertex Index Stream, Pretransformed Vertices, Programmable Vertex Processor, Transformed Vertices, Primitive Assembly, Primitives, Rasterisation & Interpolation, Fragment Location Stream, Pretransformed Fragments, Programmable Fragment Processor, Transformed Fragments, Raster Operations, Pixels, Frame Buffer, Output.]

Figure 2.18: A model of the graphics processing unit. This diagram is adapted from NVIDIA’s Cg Manual [29]. It shows the location of the fragment and vertex processors in the rendering pipeline. The proximity of the fragment processor to the end of the pipeline made it easier to capture the output. As a result, the fragment processor was preferred for general purpose computing compared to the vertex processor.


[Figure 2.19(a): Processor performance. Compute speed (GFLOPS), with ticks at 10, 100 and 1000, against date (years) from 2003 to 2008, for the GPU and the CPU.]

[Figure 2.19(b): Memory bandwidth. Bandwidth (GB/s), with ticks at 10 and 100, against date (years) from 2003 to 2007, for the GPU and the CPU.]

Figure 2.19: Evolution of the GPU. Shown is the historical growth of the graphics and central processing units. The GPU has consistently increased its floating point performance and memory bandwidth at a significantly faster rate than the CPU. This has been achieved through the use of a massively parallel computing architecture.

2.5.2 Compute Unified Device Architecture (CUDA)

The Compute Unified Device Architecture (CUDA) is a parallel programming model

and software environment [72]. CUDA has dedicated libraries for the Fast Fourier

Transform (FFT) [71] and Basic Linear Algebra Subprograms (BLAS) [70]. As seen

in Figure 2.22, there are two main parts that make up a CUDA enabled machine:

a CPU host, and one or more devices that correspond to a graphics card. The main

process runs on the host machine, and coordinates the execution of lightweight

parallel programs on the device, called kernels.

Each kernel consists of a number of UEs, which are called threads. As shown in

Figure 2.20, the threads in CUDA are organised into a distinct topology [72]. In

this topology, threads are grouped into a block. Each thread is indexed within its

parent block, and this indexing can be in up to three dimensions. The blocks are

then grouped into a grid. The blocks each have an index within the grid, which

currently can be in up to two dimensions. Future versions of CUDA will support

three dimensional grids.
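To illustrate how the two levels of indexing combine, the following Python sketch emulates a two dimensional grid of two dimensional blocks and computes a unique global coordinate for each thread. It is plain Python rather than CUDA code; the function names and the grid and block sizes are assumptions for illustration. In CUDA itself the same calculation is conventionally written as `blockIdx.x * blockDim.x + threadIdx.x` for each dimension.

```python
def global_coords(block_idx, thread_idx, block_dim):
    """Combine a block index and a thread index into a global (x, y) position."""
    bx, by = block_idx
    tx, ty = thread_idx
    dx, dy = block_dim
    return (bx * dx + tx, by * dy + ty)


def enumerate_threads(grid_dim, block_dim):
    """Yield the global coordinates of every thread in the grid."""
    gx, gy = grid_dim
    dx, dy = block_dim
    for by in range(gy):
        for bx in range(gx):
            for ty in range(dy):
                for tx in range(dx):
                    yield global_coords((bx, by), (tx, ty), block_dim)


# Hypothetical launch: a 2x2 grid of blocks, each holding 4x2 threads.
coords = list(enumerate_threads((2, 2), (4, 2)))
print(len(coords), "threads, all distinct:", len(set(coords)) == len(coords))
```

Each thread typically uses its global coordinate to select the data element it is responsible for, which is why the mapping must be unique.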

A grid of threads may use up to the entire GPU processing resources. The grid

is broken into blocks to control the division of threads between the GPU multipro-

cessors. Threads within a block are guaranteed to be on the same multiprocessor

and thus can communicate using shared memory. Threads in different blocks may

not be on the same multiprocessor, and cannot use shared memory to communicate

between blocks.

Kernels can be classified based on their data access patterns. There are four

classifications relevant to this work: map, gather, scatter and reduction. In a map

kernel, each input value is independently processed to produce a corresponding


output value. The input and output values are arranged in the same order in

memory. In contrast, gather kernels read inputs in a non-ordered fashion and scatter

kernels write output in a non-ordered fashion. Reduction kernels have more than

one input for each output. The outputs are still produced independently, which

results in a reduction in the size of the data.
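The four access patterns can be sketched in serial Python as follows. This is an illustrative emulation, not GPU code: each list element stands in for the work of one thread, and the function names, index lists and group size are assumptions.

```python
def map_kernel(data, f):
    """Map: output i depends only on input i, in the same order."""
    return [f(x) for x in data]


def gather_kernel(data, read_indices):
    """Gather: output i reads from an arbitrary input location."""
    return [data[i] for i in read_indices]


def scatter_kernel(data, write_indices, out_size):
    """Scatter: input i is written to an arbitrary output location."""
    out = [None] * out_size
    for value, i in zip(data, write_indices):
        out[i] = value
    return out


def reduction_kernel(data, group_size, combine):
    """Reduction: each output combines group_size consecutive inputs."""
    outputs = []
    for start in range(0, len(data), group_size):
        acc = data[start]
        for x in data[start + 1:start + group_size]:
            acc = combine(acc, x)
        outputs.append(acc)
    return outputs


print(map_kernel([1, 2, 3], lambda x: x * x))
print(gather_kernel([10, 20, 30], [2, 0, 1]))
print(scatter_kernel([10, 20, 30], [2, 0, 1], 3))
print(reduction_kernel([1, 2, 3, 4, 5, 6], 2, lambda a, b: a + b))
```

In the reduction example each output is computed from its own group of inputs, so the outputs remain independent of one another while the data shrinks by the group size.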

In a typical CUDA program, data is first acquired and placed into the host

memory as depicted in Figure 2.22. The data acquisition may take many forms,

such as being generated by the host program, reading from a storage medium, or

via a network interface. The data is then transferred to the GPU for subsequent

processing via the PCI Express bus. The host process will then instantiate a kernel

on the device. The kernel reads data from the device memory, performs computations,

and writes results to device memory. Multiple kernels may be executed, and the

results remain resident in device memory between kernels. The final results are

transferred to the host machine by the host process for subsequent output.
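The sequence of steps in a typical program can be sketched with a small Python stand-in for the device. The `Device` class and its method names are hypothetical and do not correspond to the CUDA API; they only mirror the flow described above, in which the host stages data, launches kernels that read and write device memory, and retrieves the final result.

```python
class Device:
    """Stand-in for a GPU: a separate memory reached only through explicit copies."""

    def __init__(self):
        self._memory = {}

    def copy_from_host(self, name, host_data):
        # Host to device transfer.
        self._memory[name] = list(host_data)

    def launch(self, kernel, src, dst):
        # The kernel reads device memory, computes, and writes device memory.
        self._memory[dst] = [kernel(x) for x in self._memory[src]]

    def copy_to_host(self, name):
        # Device to host transfer.
        return list(self._memory[name])


host_input = [1.0, 2.0, 3.0]                     # data acquired on the host
device = Device()
device.copy_from_host("in", host_input)          # stage data on the device
device.launch(lambda x: x * 2.0, "in", "tmp")    # first kernel
device.launch(lambda x: x + 1.0, "tmp", "out")   # second kernel reuses resident data
host_output = device.copy_to_host("out")         # retrieve the final result
print(host_output)
```

The intermediate buffer `tmp` never returns to the host, which reflects the point that results stay resident in device memory between kernels.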

There are two modes available for transferring data between the host and device

in CUDA. The default mode transfers data to and from pageable memory on the host.

Pageable memory may be swapped out of the host memory into virtual memory

located on a storage device on the host, such as a hard disk drive. Data in virtual

memory must be read back from the storage device with significant latency in order

to be used. There is also a page-locked mode, in which page-locked memory is

allocated for transfer. Page-locked memory is never swapped out to virtual memory.

The advantage of the page-locked mode is that the underlying system copies directly

out of the page-locked memory. In contrast to this direct copy, there is an extra copy

involved with the normal pageable memory. However, the size of the allocatable

page-locked memory is relatively small, and for large data transfers the pageable

mode is superior.

On the device, CUDA utilises a number of memory spaces, as shown in Fig-

ure 2.21. There are two main types of memory spaces. The first are located on the

device processor. These memory spaces include the registers and shared memory.

The registers are used for holding small amounts of data to be operated on directly

by the kernel code. The shared memory is used as a programmer-manageable ad-

dress space for communication between threads located in the same block. As they

are part of the GPU itself, these memories are limited in size but have fast access

speeds.

In contrast the device memory is located next to the GPU on the device. It

has a larger capacity, but also a higher latency, compared to memories located

on the device processor. Device memory is capable of high parallel bandwidth.

However, memory access must be ordered according to specific coalescing rules to

realise these rates. The device memory has several designated memory areas with

separate purposes: global, constant, local, and texture memory. These memory areas

and their interaction with the thread topology are shown in Figure 2.21.

Global memory serves as a staging area for input data obtained by the host ma-

chine and for results ready for transfer back to the host machine. Data transfers

between the host and device are handled by the host machine. While running, GPU

kernels can read from and write to the global memory. However, a kernel may not

read from the same memory allocation to which it writes within a single kernel

execution. Data can remain resident in global memory between kernel executions

should it be required for later computation on the device. For the hardware inves-

tigated in this research, coalesced memory access requires consecutive threads in a

warp to access consecutive memory addresses, aligned to the total size of the

memory accessed. A warp is a group of 32 threads that are processed in a SIMD

manner by the GPU. More extensive details of coalesced memory access on the GPU


can be found in the CUDA Programming Guide [72].
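The coalescing condition stated above can be expressed as a simple check. The Python sketch below is a simplified reading of that one sentence, under the assumption that every thread accesses a word of the same size; the function name `is_coalesced` is hypothetical, and the actual hardware rules are more detailed than this model.

```python
def is_coalesced(addresses, word_size):
    """Check a simplified coalescing rule for one warp.

    addresses[i] is the byte address accessed by thread i of the warp.
    The access is treated as coalesced when consecutive threads touch
    consecutive words and the first address is aligned to the total
    number of bytes accessed by the warp.
    """
    if not addresses:
        return False
    total_bytes = len(addresses) * word_size
    aligned = addresses[0] % total_bytes == 0
    consecutive = all(b - a == word_size
                      for a, b in zip(addresses, addresses[1:]))
    return aligned and consecutive


warp = range(32)
print(is_coalesced([4 * t for t in warp], 4))      # consecutive and aligned
print(is_coalesced([4 + 4 * t for t in warp], 4))  # consecutive but misaligned
print(is_coalesced([8 * t for t in warp], 4))      # strided access
```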

Constant memory is used to pass values common to all threads to the CUDA

kernel. In this way, a kernel’s behaviour can be altered depending on the value of

the constant at runtime. The values of constants are set by the host machine, and

are used in the computation of kernels subsequently executed on the GPU.

Local memory is an overflow for the registers of a GPU. Being located in the

device memory rather than on the GPU itself, it has a much higher latency than

that of the registers. Ideally, a kernel will not require more registers than are

available, and local memory will not be required. Because every thread overflows

to local memory if any one thread does, local memory access conforms to the constraints

for parallel device memory access.

Texture memory is a feature still present from the GPU’s graphical heritage.

It supports hardware accelerated memory sampling, such as various interpolation

functions. For example, consider a value that is the interpolation of several data

elements stored in device memory. Using global memory access, all of these data

elements must be transferred, and then the interpolation calculated on the GPU

itself. If texture memory is used, the interpolation is calculated by the texturing

hardware and the single value is transferred to the GPU. This reduces both the

computational load on the GPU and the amount of data that is transferred.
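As an illustration of the kind of sampling involved, the sketch below performs a generic one dimensional linear interpolation between neighbouring data elements. It is an assumption-laden stand-in: the function name `sample_linear` is hypothetical, and the arithmetic is ordinary linear interpolation rather than the exact filtering performed by the texturing hardware.

```python
import math


def sample_linear(data, x):
    """Linearly interpolate `data` at the fractional position x."""
    if len(data) < 2:
        raise ValueError("need at least two data elements")
    if not 0 <= x <= len(data) - 1:
        raise ValueError("x is outside the data range")
    # Index of the left neighbour, clamped so that i + 1 stays in range.
    i = min(int(math.floor(x)), len(data) - 2)
    frac = x - i
    return (1.0 - frac) * data[i] + frac * data[i + 1]


samples = [0.0, 10.0, 20.0]
print(sample_linear(samples, 0.5))
print(sample_linear(samples, 1.25))
```

With global memory access, both neighbouring elements would be read and combined in the kernel; with texture memory, only the interpolated value reaches the kernel.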

While these memory spaces do not have the low latencies of those located on the

device processor itself, the latency can be hidden by processing. As discussed in

Section 2.4, the GPU is an SPMD machine, consisting of several SIMD multiprocessors.

These process the blocks of the topology in groups of threads called warps. Should

a warp be waiting on a global memory fetch, the thread scheduler can suspend that

warp, and begin the processing of another warp to amortise the latency. As long as

a kernel has sufficient arithmetic intensity, the ratio of computational operations to

memory access operations, the global memory latency can be completely hidden.
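A toy model makes the relationship between arithmetic intensity and latency hiding concrete. The sketch below assumes that each warp alternates between a fixed number of compute cycles and one memory access of fixed latency, and that switching between warps is free; the function names and all cycle counts are hypothetical and are not measurements of any hardware.

```python
import math


def arithmetic_intensity(compute_ops, memory_ops):
    """Ratio of computational operations to memory access operations."""
    return compute_ops / memory_ops


def warps_to_hide_latency(memory_latency_cycles, compute_cycles_per_access):
    """Warps needed so that some warp always has computation to run.

    While one warp waits on memory, the remaining warps must between
    them supply at least memory_latency_cycles of computation.
    """
    return 1 + math.ceil(memory_latency_cycles / compute_cycles_per_access)


# Hypothetical numbers: doubling the computation per memory access
# roughly halves the number of resident warps needed to cover the wait.
print(arithmetic_intensity(8, 2))
print(warps_to_hide_latency(400, 50))
print(warps_to_hide_latency(400, 100))
```

Under this model a kernel with higher arithmetic intensity needs fewer resident warps to keep the multiprocessor busy, which is the sense in which sufficient intensity hides the latency.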

The development of CUDA has made programming the GPU for general pur-

pose applications accessible to a wider range of programmers. This has been made

possible by pioneering research that stretched the capabilities of the GPU, when it

was used only for the graphical rendering for which it was designed. This research

is reviewed in the next chapter.


[Figure 2.20: diagram. HOST: Kernel 1, Kernel 2. DEVICE: Grid 1 containing Block 0,0, Block 1,0, Block 0,1 and Block 1,1, and Grid 2 containing Block 0 and Block 1; each block contains threads.]

Figure 2.20: Thread topology. Each kernel consists of a number of UEs, which are called threads. The threads in CUDA are organised into the topology shown [72]. In this topology, the threads are grouped into a block. Each thread is indexed within its parent block, and this indexing can be in up to three dimensions. The blocks are then grouped into a grid. The blocks each have an index within the grid, which currently can be in up to two dimensions. Future versions of CUDA will support three dimensional grids. In terms of the underlying hardware, the grid of threads may use up to the entire GPU processing resources. The grid is broken into blocks so as to mirror the hardware dividing the threads between the GPU multiprocessors. Threads within a block are guaranteed to be on the same multiprocessor and thus can communicate using shared memory, which is described later in this section. Threads in different blocks may not be on the same multiprocessor, and cannot use shared memory to communicate between blocks.

[Figure 2.21: diagram. A grid containing blocks; each block has shared memory and threads, and each thread has registers and local memory. Global memory, constant memory and texture memory are also shown.]

Figure 2.21: Memory locations available to CUDA threads. Shown are the various memories available for access from within a CUDA thread. Memories shaded yellow are located in the GPU multiprocessor and have a low access latency. Memories shaded green are located in the GPU device memory and have a higher latency. The host memory, not shown here but in Figure 2.22, is inaccessible from within a thread, and data must be explicitly transferred into the GPU device memory before kernel execution begins. Kernel results must be explicitly transferred back to the host.


[Figure 2.22: block diagram. HOST: Host Processor (CPU), Front Side Bus, Chipset (Memory Controller Hub), Memory Bus, Host Memory (RAM). DEVICE: Device Processor (GPU), GPU Memory Bus, Device Memory. The host and device are connected by the PCI Express Bus.]

Figure 2.22: GPU-enabled system architecture. Shown are the various processing, chipset and memory systems of a host connected to a GPU device. Code is executed on the host processor. The host processor first copies device programs to the device. It then controls the staging of data in the host memory to the device memory via the chipset. It then signals the device processor to commence processing of data, and is then free to work on other tasks. Once the device kernel is complete, the host processor then manages the retrieval of the data from the device memory to host memory.

Chapter 3

Literature Review

The next generation radio telescope, called the Square Kilometre Array (SKA),

will be far larger than the interferometer arrays of today [38]. As the number

of receiving elements in an array increases, the computational resources required

to process the data scale quadratically. The SKA will require a massive level of

computation compared to current arrays [17, 51, 25]. The traditional correlator

algorithms required for this computation are well researched [13, 104, 107, 12].

However, the processing traditionally has taken place on application specific in-

tegrated circuits (ASIC) or more recently on field programmable gate array (FPGA)

architectures [22]. Beowulf clusters consisting of commodity CPU processors have

been used recently, in applications such as very long baseline interferometry (VLBI) [23].

In preparation for the SKA, prototype pathfinder arrays are currently being devel-

oped, such as the Murchison Widefield Array (MWA) [26]. These prototypes provide

the opportunity to consider alternate computing architectures, and to assess poten-

tial gains that could be achieved through their use.

GPU acceleration of scientific algorithms is a field that has undergone rapid

development, from initial GPGPU research to GPU computing today. The GPU has

been shown to be a powerful co-processor in a variety of application areas, including

general mathematics [47], image processing [16, 83], signal processing [44, 105],

physical simulation [39], and cryptography [14]. To date, GPU computing has

featured in several mainstream graphics publications [28, 82, 67], in which the use

of the GPU for non-graphical computation is presented. In addition, there exist

several surveys of the GPU computing field [77, 76, 35]. These surveys highlight

both the initial pioneering GPU research as well as subsequent advances that are

significant to the entire GPU computing research area.

The potential power consumption of the hardware required to perform SKA-

scale correlation is sufficiently large to affect the choice of different designs being

considered [38]. The GPU architecture has been shown to be power efficient in

comparison to CPU architectures. While the GPU itself typically requires more

power than a CPU, when the corresponding processing capabilities are taken into

account the GPU consumes fewer watts per flop [59]. For a stand-alone system,

the addition of GPUs to a CPU cluster has been shown to produce higher speedups

but with smaller additional power consumption than upgrading to a CPU cluster

of comparable computational performance [101]. For computing clusters, expansion

via GPUs has been shown to result in double the processing speed for a 20% power

increase [33].

This chapter reviews research that is significant to this thesis. Firstly, the devel-

opment of GPU computing programming languages is discussed. Subsequently, the

development of a GPU implementation of the FFT is presented. This is followed by

a survey of GPU research in the field of astronomy and astrophysics.

3.1 GPU Programming Languages

As the vertex and pixel shader units of the GPU became programmable in the

first few years of this decade, a new field of research called general purpose GPU

(GPGPU) programming developed [77]. Rather than using the graphics hardware

for rendering, this field saw the use of the GPU’s parallel processing power applied

to the acceleration of general purpose computing algorithms. GPGPU programming

was achieved by reinterpreting the rendering pipeline concepts into those of general

computing, using graphics application programming interfaces (APIs). The two most

commonly used were OpenGL [106] and DirectX [24]. The host program ran on the

CPU and interacted with the GPU using these APIs. It compiled and transferred

small programs called shaders to the GPU for processing. The shaders were written

in specialised languages, such as Cg [29] and the OpenGL shader language [88].

A paradigm shift in GPGPU programming came in 2003, when a language called

Brook for GPUs emerged from the Stanford University Graphics Lab [8]. It adapted

aspects of the Brook programming language, which was designed as an extension of

C with support for stream processing, to the GPU. BrookGPU made use of

features such as inline shader programs [57] to abstract the underlying graphics

architecture, and instead presented to the programmer a parallel compute oriented

language. It had several backends for both GPUs and CPUs of the time, which

included the NVIDIA NV30 driver, the Microsoft DirectX9 driver, the OpenGL

ARB driver, multithreaded CPUs, and standard CPUs. This broad range gave

BrookGPU excellent portability.

In BrookGPU, streams were treated as variables, and accessed from the host by

transferring data to and from arrays. Kernels and reductions were written as

functions that could be called from the program, and were compiled into a fragment program

when the rest of the code was compiled. Parameters were passed to the kernel as a

variable. When compiled, programs were created for all backends. During runtime,

the backend was selected by setting an environment variable.

Another strength of BrookGPU was that it completely abstracted the GPU. The

programmer wrote stream programs that ran on any programmable graphics hard-

ware without needing to learn the languages for each manufacturer. Its integrated

support of GPU emulation on the CPU also enabled rudimentary time comparisons

during development. The function approach taken by BrookGPU did have its dis-

advantages. The stream was copied to the GPU prior to the kernel being processed,

and copied back to the CPU after processing. If the next kernel required the out-

put of the previous one, the stream was copied back to the GPU again. Thus, a

bottleneck was created between the CPU and GPU in programs with multiple ker-

nels. Consequently, BrookGPU was limited to applications with individual kernels

that were complex enough that the speed of calculation compensated for the stream

transfer time.

Following the development of BrookGPU, several other GPU computing languages

emerged. These include solutions from the graphics hardware vendors, such

as AMD’s Brook+ [3] and NVIDIA’s CUDA API [72]. Other third parties also developed

solutions, such as RapidMind [58]. BrookGPU, along with these subsequent

frameworks, removed the requirement of graphical computing knowledge for general

purpose GPU computing. For clarity, techniques that used graphics APIs are re-

ferred to as GPGPU, while the new non-graphical techniques are referred to as GPU

computing. Because a graphical background was no longer needed in GPU comput-

ing, it opened the field to a larger portion of the research community, including radio

astronomy.

3.2 Fast Fourier Transform

The development of efficient FFT algorithms on the GPU is of particular interest

to this work, because the FFT is required in the second computational stage of

the correlation algorithm. For radio interferometry involving only a small num-

ber of streams, this is the most computationally intensive stage of the correlation

algorithm. The FFT is a fundamental transform required by signal and image pro-

cessing. For this reason parallel implementations of the FFT have existed prior to

the use of the GPU for computational processing [80], and the GPU implementation

of the FFT has been well researched by the GPU computing field [64, 34, 71].

The Discrete Fourier Transform (DFT) enables the spectral analysis of discrete-

time signals [75]. However, a DFT of length N has a computational complexity of

O(N^2). In 1965, Cooley and Tukey derived the FFT [15], which obtained the result

of the DFT with a computational complexity of O(N log N). This significant

reduction in complexity led to the mainstream adoption of the FFT and associated

digital techniques for signal processing.
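
The difference between the two complexities can be illustrated with a short sketch. The code below is not taken from this work: it is a minimal Python illustration, assuming a power-of-two length and a textbook recursive radix-2 decomposition, that compares a direct DFT with an FFT. The function names `dft` and `fft_radix2` and the test signal are hypothetical.

```python
import cmath

def dft(x):
    """Direct DFT: every output bin sums over every input sample, O(N^2)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def fft_radix2(x):
    """Recursive radix-2 Cooley-Tukey FFT, O(N log N); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])   # transform of the even-indexed samples
    odd = fft_radix2(x[1::2])    # transform of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n) * odd[k]   # twiddle factor applied
        out[k] = even[k] + w                              # one butterfly
        out[k + n // 2] = even[k] - w
    return out

# Hypothetical 8-sample test signal
signal = [1.0, 2.0, 0.0, -1.0, 0.5, 0.0, -2.0, 1.5]
direct = dft(signal)
fast = fft_radix2(signal)
max_diff = max(abs(a - b) for a, b in zip(direct, fast))
```

Both functions produce the same spectrum to within rounding error; only the number of operations differs.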

The FFT consists of a number of intermediate stages; the sizes of these stages are

referred to as radices. Different combinations of radices can be used to obtain the same

result. However, a particular combination of radices may suit the memory caching

of a particular hardware architecture. For this reason, in 1997 Frigo and Johnson

developed the FFTW library [31]. This library conducts planning in which FFT

performance for a variety of radix combinations is measured. The best combination

is then used to provide the best performance on any given architecture during FFT

execution. The best combinations of radices, called plans, can be stored to avoid the

need to recalculate them during subsequent program operation.


The FFT was shown to be achievable on the parallel architecture of the GPU

by Moreland and Angel [64]. Their implementation consisted of a two-dimensional

FFT, and included both forward and reverse transforms. This utilised an approach

that alternated between FFT stages and tangle stages. The tangle stages were used

to allow for efficient packing of the FFT data. The resulting performance of this

approach was slower than a comparable CPU implementation by a factor of

six. However, this research was significant as the first implementation of the FFT

on a GPU.

Following this early implementation, improved performance was obtained by

using algorithms more suited to the GPU architecture. Key to the development of

these algorithms was a better understanding of the GPU memory, such as the model

developed by Govindaraju, Larsen, Gray and Manocha [34]. Their model describes

the underlying mechanisms used to access data and perform computation on the

GPU. Using this knowledge, they implemented a one-dimensional FFT algorithm

that used cache-efficient memory access patterns. This implementation achieved

performance results that were three times faster than dual-core CPU implementa-

tions.

Modern GPU computing languages now include FFT libraries [72, 71], which

are utilised by the implementations presented later in this work. These libraries

are developed by the GPU vendors for their hardware, and the knowledge of their

own architectures results in these libraries being highly optimised. Timings of the

CUDA FFT library are presented in Chapter 5.

3.3 GPUs in Astronomy and Astrophysics

In the field of astrophysics, GPUs have been used to simulate the behaviour of a

large number of astronomical bodies. This has traditionally been achieved by the use

of custom processing hardware, called GRAPE. Zwart, Belleman and Geldof used

the Cg shader language and GPGPU techniques to perform these simulations [109].

They subsequently updated their work with the advent of GPU computing, using

the CUDA language [5]. They note a large advantage of the GPU is that it can hold

significantly more particles in memory than the GRAPE.

GPUs have been shown to accelerate high performance computing clusters. Fan,

Qiu, Kaufman and Yoakum-Stover have developed a 30 node GPU cluster [27]. They

then simulated the dispersion of airborne contaminants in the Times Square area of

New York City. This resulted in performance increases of a factor of 4.6 compared to

a 30 node single-core CPU implementation. Schive, Chien, Wong, Tsai, and Chiueh

have subsequently developed a 16 node dual-card GPU cluster [92] for astrophysics.

They have performed the n-body simulations described previously, and have shown

this system to be capable of simulating up to 320 million particles. It outperforms a

custom hardware GRAPE-6A by a factor of two, at a superior performance-per-dollar ratio.

As well as research in the field of astrophysics, there is some research relevant

to radio astronomy signal correlation. For example, while not directly related to

astronomy, research into the transfer of data between the host machine and the GPU

device is required for correlation on the GPU device. As such, it is important to

ensure that the bandwidth between the two is used optimally. Research has shown

that data transfer should occur in large batches, rather than in smaller frequent

amounts [40]. This can affect the achieved throughput by up to a factor of four.


The conjugate multiply and accumulate (CMAC) stage is the most data intensive

stage of the correlation algorithm. The underlying sum-product machine operation

has been shown to be 270 times faster on the GPU [95]. Research by Schaaf and

Overeem has previously implemented the CMAC

stage on the GPU [91]. However, it was constrained due to its GPGPU implemen-

tation using a legacy graphics API, which resulted in heavy global memory access.

Additionally, because only the CMAC stage was implemented, the data transfer

between the host and device consisted of unpacked floating point values. As the

implementation used a graphical API, this data transfer was not able to occur con-

currently with kernel execution. The development of GPU computing APIs that are

not dependent on graphical paradigms allows greater flexibility in the implementa-

tion of the CMAC stage on the GPU.

Synthesis imaging consists of more than the correlation, and there has been some

research relevant to the post-correlation processing. In particular, the gridding of

data in preparation for a two dimensional inverse Fourier transform has been shown

to be possible for a variety of gridding methods. Schiwietz, Chang, Speier, and

Westermann demonstrate the gridding of data for magnetic resonance imaging (MRI)

reconstruction [98]. They show a two order of magnitude performance increase using

the GPU. Schomberg and Timmer also demonstrate the parallel gridding

of data for X-ray computed tomography (CT) [93]. As well as the one dimensional

FFTs required for correlation, two dimensional inverse FFTs for deconvolution are

available in the CUDA FFT library [71]. Wayth and Dale have implemented the

post-correlation realtime system for the Murchison Widefield Array (MWA) proto-

type on the GPU [103].

The current state of GPU computing research, as it pertains to radio signal

processing, has been presented. This has revealed the GPU to be a high performance

computing architecture with promise for power efficient processing. The related

research reviewed in this chapter indicates that radio signal correlation is a potential

application area for GPU computing. This work thus continues with a detailed

discussion of a GPU FX Correlation model, presented in the next chapter.

Chapter 4

Model

My model for a heterogeneous parallel FX correlator is now presented. This model

uses the Murchison Widefield Array (MWA) prototype [26] as a basis. The MWA

prototype uses the serial FX approach, shown previously in Figure 2.7. To develop

the GPU model, the parallel pattern methodology presented previously in Section 2.4

was applied to the serial design. The model uses these patterns to utilise both the

heterogeneous nature of the GPU-enabled host, as well as the data parallelism of

the GPU device.

The GPU computing architecture is inherently heterogeneous, in that both the

host processors (CPUs) and the device processors (GPUs) are available for process-

ing. For this reason task parallelism is utilised by the model. Shown in Figure 4.1

is the heterogeneous parallel model, in which tasks are split between the host and

device. In operation, the host first acquires radio signal data for processing. The

host next transfers this data to the device, and triggers the processing of the three

correlation stages on the device. While the device is processing these stages, the

host is free to work on other tasks, such as acquiring the next batch of data. Results

are retrieved from the device by the host. Upon completion of all data processing,

the host frees allocated memory on both the host and device.

For the GPU to process tasks, data must first be replicated in device memory

space. This requires the allocation of sufficient device memory to store the data. The

data must then be transferred onto the GPU device from the host memory prior to

GPU processing. Once a batch of processing is complete, the results are transferred

from the GPU device memory to the host memory. These memory transfers occur

via the PCI-express bus. Figure 4.1 shows these additional steps in the GPU FX

correlation algorithm. It is noted that between these stages the intermediary results

are stored in the GPU device memory, avoiding unnecessary data transfer.

Figure 4.1 shows the three correlation stages implemented with GPU kernels on

the device: the unpack stage, the Fourier transform stage, and the CMAC stage.

These kernels parallelise the serial correlator stages using embarrassingly parallel and

geometric decomposition patterns of parallelism. The application of the patterns to

each of the stages is next detailed. For reference, the kernel codes can be found in

Appendix A.

The unpack kernel of the FX correlator is an ideal candidate for processing

on the GPU, because it has an embarrassingly parallel pattern. The characteristic

of this pattern is that it consists of many identical and independent tasks that can be

computed in parallel. Each unpack task reads in a single data value and outputs a

single corresponding unpacked value. The input value is an 8 bit integer datatype

that is unpacked into a 32 bit floating point datatype for subsequent processing. For

the t-th packed value in the n-th stream, $r_n[t]$, the unpacked value, $x_n[t]$, is calculated

using the equation

$$x_n[t] = u_S\, r_n[t] + u_B \qquad (4.1)$$

The input value $r_n[t]$ is first converted to a 32 bit floating point value and multiplied

by a scaling factor $u_S$. Where the units of the signal are unimportant, or if such

considerations are accounted for post-correlation, this scaling factor may be omitted.

If a bias is present in the signal due to the packing scheme of the sampling hardware,

it is removed with $u_B$. The computation required to calculate Equation 4.1 has a

complexity of O(N). Thus, the amount of processing scales linearly with the number

of streams N. This is the least computationally intensive kernel.
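
The unpack operation can be sketched serially as follows. This is an illustration only, not the kernel from Appendix A. It assumes unsigned 8 bit samples with an offset of 128, so the bias term is set to -128; both the sample values and that offset are hypothetical. Python floats are 64 bit, whereas the kernel described above produces 32 bit values.

```python
def unpack(packed, scale=1.0, bias=0.0):
    """Apply Equation 4.1 to each packed sample: x = u_S * r + u_B.

    Every output depends on exactly one input, so on the GPU each
    element could be handled by an independent thread.
    """
    return [scale * float(r) + bias for r in packed]

# Hypothetical unsigned 8 bit samples with an assumed offset of 128
packed_stream = [0, 127, 128, 255]
unpacked = unpack(packed_stream, scale=1.0, bias=-128.0)
```

Because no element depends on any other, the list comprehension can be replaced by one thread per element without any synchronisation.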

As well as the kernel to be executed by each thread on the GPU, the thread

topology must also be defined. Topologies in the CUDA API were discussed in Sec-

tion 2.5.2. They define the distribution of threads across the GPU multiprocessors.

For the unpack kernel, there are two considerations affecting the choice of topology.

The primary consideration is that enough threads must be used on the GPU device

to ensure it is at full thread capacity. The secondary consideration is that there should

not be significantly more threads than required to satisfy this condition, to avoid

overheads related to thread

instantiation. The topology used for the unpack stage was three blocks, with 128

threads per block, for each of the twelve multiprocessors on the GPU. Once the

unpack kernel processing is complete, it is then followed by the Fourier transform

stage.
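
The unpack topology quoted above fixes the number of threads regardless of buffer size, so each thread must process more than one sample. The sketch below shows one way this could be arranged, assuming a strided loop in which a thread steps through the buffer by the total thread count. The strided loop and the buffer size of 10000 samples are assumptions for illustration; the kernel in Appendix A may distribute work differently.

```python
BLOCKS_PER_MULTIPROCESSOR = 3    # from the text
THREADS_PER_BLOCK = 128          # from the text
MULTIPROCESSORS = 12             # from the text

num_blocks = BLOCKS_PER_MULTIPROCESSOR * MULTIPROCESSORS
total_threads = num_blocks * THREADS_PER_BLOCK

def samples_for_thread(block_idx, thread_idx, num_samples):
    """Sample indices visited by one thread in an assumed strided loop."""
    first = block_idx * THREADS_PER_BLOCK + thread_idx
    return range(first, num_samples, total_threads)

# Check that a hypothetical 10000-sample buffer is covered exactly once
covered = sorted(i
                 for b in range(num_blocks)
                 for t in range(THREADS_PER_BLOCK)
                 for i in samples_for_thread(b, t, 10000))
```

With this arrangement, threads with consecutive indices touch consecutive samples on every pass through the loop.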

The Fourier transform kernel is a more challenging kernel to implement using

the GPU. This is because of the highly interleaved pattern of memory accesses that

occur during the transform. A single interleaved access is referred to as a butterfly.

However, as the Fourier transform is fundamental to the fields of signal and image

processing, a significant amount of research into its optimal implementation on the

GPU has occurred as discussed in Section 3.2. Resulting from that work, the CUDA

language has a high performance FFT library called CUFFT.

The CUFFT library provides an FFTW-style approach [31], in which a planning

stage is executed prior to any actual transforms. This planning stage runs a series

of test FFTs on the GPU device, with a variety of different radices, to determine

the optimal combination for the particular hardware configuration. As this planning stage

occurs once during initialisation, there is no additional overhead once the algorithm

begins to process data.

Consider the a-th transform of length L in the n-th telescope stream. CUFFT

takes the unpacked data, $x_{a,n}[t]$, in the time domain, t, as input and outputs the

spectra, $S_{a,n}[\nu]$, in the frequency domain, $\nu$, such that

$$S_{a,n}[\nu] = \sum_{t=0}^{L-1} x_{a,n}[t]\, e^{-i 2\pi \nu t / L} \qquad (4.2)$$

CUFFT implements this efficiently using a parallel FFT algorithm. It processes the

entire buffer in a single library call through the use of batching. The larger the

number of transforms in a batch, the better the performance of the library. Thus,

larger buffers are preferred for this stage.

For N data streams and a transform of length L, the computational complexity of

the Fourier transform stage is O(N L log2(L)). Extremely long transform lengths are

not typically used in FX correlation. For a given L, this stage also scales linearly

with N. The number of floating point operations used in this stage depends on

the particular radices selected in planning. Research in the field uses a standard

5 L log2(L) for the transform [31]. Thus, there are 5 log2(L) floating point operations

per stream element. The GPU FX correlator algorithm then continues with the

CMAC stage.
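
As a worked example of the operation count for the Fourier transform stage, the following sketch applies the 5 L log2(L) convention. The values N = 32, L = 1024 and 100 transforms per stream are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math

def fft_stage_flops(num_streams, transform_length, transforms_per_stream):
    """Estimated floating point operations using the 5 L log2(L) convention."""
    per_transform = 5 * transform_length * math.log2(transform_length)
    return num_streams * transforms_per_stream * per_transform

# Hypothetical parameters: N = 32 streams, L = 1024, 100 transforms per stream
flops = fft_stage_flops(num_streams=32, transform_length=1024,
                        transforms_per_stream=100)
elements = 32 * 1024 * 100
per_element = flops / elements   # equals 5 * log2(1024)
```

Doubling the number of streams doubles the estimate, which reflects the linear scaling with N noted above.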

The CMAC kernel is a reduction kernel that takes the FFT output spectra S

as input. For each m-n stream pair, it conjugate multiplies and accumulates a total

of A spectra pairs to produce the complex visibilities, C, using the equation

$$C_{m,n}[\nu] = \sum_{a=0}^{A-1} S_{a,m}[\nu]\, S^{*}_{a,n}[\nu] \qquad (4.3)$$

Should an accumulation span the spectra buffer, the complex visibility buffer is used

to store intermediary results. Mathematically, this is expressed as

$$C'_{m,n}[\nu] = C_{m,n}[\nu] + \sum_{j=p}^{q} S_{j,m}[\nu]\, S^{*}_{j,n}[\nu] \qquad (4.4)$$

for a range delimited by the memory indices p and q in the spectra buffer. C denotes

the previous accumulation subtotal, and C' denotes the new accumulation result.

The complex visibility buffer must be reset between accumulations.
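
The relationship between the full accumulation and the resumed accumulation can be sketched serially as below. This is an illustrative Python version for a single stream pair, not the GPU kernel; the function name `cmac`, its `subtotal` argument and the spectra values are hypothetical.

```python
def cmac(spectra_m, spectra_n, subtotal=None):
    """Conjugate multiply and accumulate spectra for one stream pair.

    spectra_m, spectra_n: lists of spectra, each a list of complex channels.
    subtotal: previous accumulation C, or None to start from zero.
    Returns the new accumulation C'.
    """
    num_channels = len(spectra_m[0])
    result = list(subtotal) if subtotal is not None else [0j] * num_channels
    for s_m, s_n in zip(spectra_m, spectra_n):
        for nu in range(num_channels):
            result[nu] += s_m[nu] * s_n[nu].conjugate()
    return result

# Hypothetical spectra: A = 3 spectra of 2 channels for streams m and n
S_m = [[1 + 2j, 0.5j], [2 - 1j, 1 + 1j], [-1j, 3 + 0j]]
S_n = [[1 - 1j, 2 + 0j], [0.5 + 0j, -1j], [1 + 1j, 1 - 2j]]

full = cmac(S_m, S_n)                                # whole accumulation at once
partial = cmac(S_m[:2], S_n[:2])                     # first buffer holds two spectra
resumed = cmac(S_m[2:], S_n[2:], subtotal=partial)   # continue after a buffer refresh
```

Passing no subtotal corresponds to resetting the complex visibility buffer before a new accumulation begins.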

The computational complexity of the CMAC stage is O(N^2). Specifically, 3N+3

floating point operations must be performed per stream operation. This complexity

scales quadratically with the number of data streams, while the complexity of the

previous two stages scales linearly. Therefore, the CMAC stage becomes a bottleneck

as the number of streams increases, and is the most important consideration for

optimisation. As such, optimisation of the CMAC stage is explored in greater detail

in the next section.

[Figure 4.1 flowchart. Recoverable box labels: Initialise; Read digital samples; Transfer data to GPU; Kernel 1 (Unpack); Kernel 2; Kernel 3 (CMAC); Retrieve results from GPU; Finalise. Two decision nodes with YES/NO branches.]

Figure 4.1: GPU FX correlator pipeline. Shown is a simplified diagram of the GPU correlator algorithm flow. Each pass of the algorithm processes significantly more data than the serial version in order for optimal parallelisation. Two data transfers outlined in bold have been added in which data is transferred to the device and results are retrieved from the device. Intermediate results remain in the device global memory between the kernels, and are not transferred to the host machine. Operations processed by the host are coloured yellow, while the kernels that execute on the device are coloured green.

4.1 CMAC Stage Optimisation

The advent of GPU computing and the associated high level languages has exposed

the full potential of the graphics hardware, and removed the burden of a graphics

rendering application programming interface (API). However, implementing optimal

algorithms on the GPU remains non-trivial. Algorithms that are both optimal and

generalised are difficult to program on the GPU. The cause is the thread and memory

architecture of the GPU.

In the pursuit of extreme parallelism, a large number of simultaneously execut-

ing threads has been made possible by restricting the resources available for each

thread. Consideration must be given to ensure that the individual threads remain

light enough to run on the GPU in terms of the required hardware resources, in particular

the memory requirements and access patterns. The thread topology must

also be managed to ensure enough threads are running on the hardware to utilise

its full capabilities. Controlling these factors for a number of algorithm variables is

challenging.

Since the CMAC stage scales quadratically with the number of telescope data

streams, it is the most computationally expensive stage of the FX correlator algorithm

for large telescope arrays. Because of this, optimal performance of this stage in the

GPU implementation is crucial to the overall performance of the algorithm. The

resources required by the CMAC stage vary with the number of telescope streams,

N, and the length of the FFT spectra, L. Several different approaches have been

explored to investigate the optimal implementation of the CMAC stage, which

are next discussed.

This work has investigated using several different levels of parallelism in order

to determine the best solution for a given set of correlation parameters. The dif-

ferent methods are presented in increasing levels of parallelism: a single thread

CPU approach, a frequency parallel approach (1xNxN), a stream parallel approach

(1x1xN), a group parallel approach (1xGxG), and finally a pair parallel approach

(1x1x1). The abbreviated form I use for these methods refers to the number of

pair results calculated by a single thread, or G threads in the case of 1xGxG. The

three values refer to the number of frequency channels, first streams, and second

streams respectively. N refers to the number of telescope streams, and G refers to

the number in a subset of those streams. Figures 4.2(a), 4.2(b), 4.2(c) and 4.2(d)

show the differences between the CUDA approaches respectively. Each individual

block in these figures represents the result for the f-th frequency channel for one

pair of streams m and n. The results calculated by each thread, or G threads for

1xGxG, are spaced apart in the figure from those calculated by other threads to

illustrate the parallelism of these approaches.

The different approaches used the block and grid dimensions of the CUDA topol-

ogy that maximised their performance. For the majority of the approaches, a block

consists of 64 threads, corresponding to 2 warps of 32 threads. As detailed previ-

ously in Section 2.5.2, a warp is a group of 32 threads that are processed on the GPU

in a SIMD paradigm. Consecutive threads in the block read adjacent frequency

channels to ensure coalesced global memory access to the complex transform data. The exception

is the 1xGxG approach, which utilises a two-dimensional block topology of 32x4,

consisting of four warps of 32 threads. The 4 warps sample the same 32 frequencies

for G = 4 adjacent streams in a staggered coalescent manner. That is, each individual

warp is coalesced; however, consecutive warps access non-consecutive

memory. This allows the kernel to acquire data for four different streams while still

maintaining coalesced memory access. For all approaches, the first dimension of a

grid ensures enough threads for all frequency channels. The second dimension then

allows for sufficient threads for the parallelism of the approach.
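
The mapping from threads to frequency channels for the 64 thread blocks can be sketched as follows. The index arithmetic is an assumption based on the usual CUDA convention of combining block and thread indices; the channel count of 1000 is hypothetical, and the kernels in Appendix A may differ in detail.

```python
THREADS_PER_BLOCK = 64   # from the text: two warps of 32 threads
WARP_SIZE = 32

def grid_width(num_channels):
    """Blocks needed along the first grid dimension to cover every channel."""
    return (num_channels + THREADS_PER_BLOCK - 1) // THREADS_PER_BLOCK

def channel_for_thread(block_x, thread_x):
    """Frequency channel read by a thread, assuming the usual index formula."""
    return block_x * THREADS_PER_BLOCK + thread_x

num_channels = 1000                       # hypothetical channel count
width = grid_width(num_channels)
# Channels read by the first warp of block 3: a contiguous run of 32 channels
first_warp_of_block_3 = [channel_for_thread(3, t) for t in range(WARP_SIZE)]
```

Because each warp reads a contiguous run of channels, the accesses of that warp fall in adjacent memory locations, which is the access pattern that coalescing relies on.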

The serial approach uses a single thread to compute all of the FN(N + 1)/2

cross spectra frequency values, where N is the number of data streams, and F is the

number of frequency channels in a complex visibility. The thread calculates results

for all frequencies and pairs serially. This method is the base case for comparison

purposes. The serial processing in this approach has all the results for each frequency

of a non-redundant stream pair processed by a single thread. In my implementation,

the CPU thread was processed by one core of a dual core CPU, while the other core

was used by the underlying operating system.

In the frequency parallel approach (1xNxN), each of F threads computes

N(N+1)/2 cross spectra frequency values. In Figure 4.2(a), the result of each (i, j)

pair of telescope streams for each frequency f is represented by a single cube. The

cubes are separated into slices corresponding to the results of all pairings of streams

for a single frequency. In the 1xNxN approach, a single thread calculates the results

for one frequency of all N pairings of all N streams, which corresponds to one slice

in the figure. Note that results for redundant pairs are not calculated.

In the stream parallel approach (1x1xN), each of NF threads computes N − n

cross spectra frequency values. In Figure 4.2(b), the result cubes are separated into

columns corresponding to the results for all pairings for a single stream and for a

single frequency. In the 1x1xN approach, a single thread calculates the results for

one frequency of one stream’s N pairs, which corresponds to one column in the

figure. Results for redundant pairs are not calculated in this approach either.

In the group parallel approach (1xGxG), the N streams are split into K

groups of size G, and K^2 G F/2 threads compute G cross spectra output frequencies.

In Figure 4.2(c), the result cubes are separated into square groups. In the 1xGxG

approach, G threads calculate the results for one frequency of G pairings of G

streams, which correspond to a single group in the figure. Groups composed entirely

of redundant pairs are not processed, and groups composed partially of redundant

pairs discard those results. The extra groups are included for efficient indexing, and

the extra pairs within groups are an unavoidable result of the SIMD (Single Instruction

Multiple Data) nature of blocks in CUDA [32], and become a negligible overhead

for sufficiently large K. The group size in the diagram, G = 4, matches the size

used in my testing.

In the pair parallel approach (1x1x1), N^2 F threads compute one cross spectra

output frequency each. This is the method with the largest degree of parallelism

investigated. In Figure 4.2(d), each result cube is separated from every other result

cube. In the 1x1x1 approach, each thread calculates results for one frequency of one

stream paired with one other stream. Threads for redundant pairs are launched but

perform no processing. These extra threads are included for efficient indexing.
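
The thread counts quoted for the four GPU approaches can be compared with a short calculation. The sketch below uses hypothetical parameters (N = 32, F = 512, G = 4) and assumes that G divides N. The choice of which half of the pair matrix is treated as redundant (n < m) is also an assumption made for illustration.

```python
def thread_counts(num_streams, num_channels, group_size):
    """Threads launched by each CMAC approach, using the counts given in the text."""
    n, f, g = num_streams, num_channels, group_size
    k = n // g   # number of groups K; assumes g divides n exactly
    return {
        "1xNxN": f,
        "1x1xN": n * f,
        "1xGxG": k * k * g * f // 2,
        "1x1x1": n * n * f,
    }

def pair_parallel_work(num_streams):
    """Stream pairs (m, n) that do real work for one channel in the 1x1x1 approach.

    Threads with n < m are assumed to be the redundant ones that exit early.
    """
    return [(m, n)
            for m in range(num_streams)
            for n in range(num_streams)
            if n >= m]

counts = thread_counts(num_streams=32, num_channels=512, group_size=4)
active_pairs = pair_parallel_work(32)
```

For these parameters, 528 of the 1024 pair threads per channel in the 1x1x1 approach perform work, which matches N(N+1)/2 non-redundant pairs.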

For all of these approaches, it was crucial to obtain global memory coalescence.

However, for a length L transform, the real to complex CUFFT library routine

produces spectra consisting of L/2 + 1 complex values. Since the value of L used

for radio astronomy is typically a power of two, the output spectra size is not. The

extra complex value adds to the offset of each subsequent spectrum. Thus, memory

access of the CMAC approaches is increasingly offset by one complex value. Since a

complex floating point value is 8 bytes in size, and coalescent global memory access

requires alignment to a minimum of 32 bytes, this offset prevents optimal memory

coalescence by the CMAC approaches.

[Figure 4.2 panels: (a) Frequency parallel approach (1xNxN); (b) Stream parallel approach (1x1xN); (c) Group parallel approach (1xGxG); (d) Pair parallel approach (1x1x1).]

Figure 4.2: Parallelism of the approaches. In each of these four diagrams, each block represents the result for a single frequency channel f of a single pair of streams m,n. The results are grouped together with other results calculated by the same thread. The three dimensions in the abbreviations for these approaches refer to the number of frequency channels, m streams and n streams computed by a single thread respectively. Non-shaded threads indicate where redundant threads have been instantiated with no instructions in order to simplify indexing.

For this reason, the complex to complex transform CUFFT library routine was

used. For a length L transform, this routine produces L complex values. When

used on a real signal, the extra values are complex conjugates of values among the

first L/2 + 1 outputs, and so carry no new information. This

data padding results in spectra that are aligned for coalesced global memory access.

Extremely small values of L, such as L < 16, would require alternate approaches.

However, such small values of L are not typically used in radio astronomy correlation.
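
The alignment argument can be checked with a small calculation. The sketch below assumes that the spectra buffer itself starts on an aligned address and that spectra are stored back to back. It compares the byte offset of successive spectra for the two output sizes, and also confirms numerically that the complex transform of a real signal is conjugate symmetric. The transform length L = 16 and the sample values are hypothetical, and a direct DFT stands in for the CUFFT routines.

```python
import cmath

COMPLEX_BYTES = 8      # one complex value: two 32 bit floats
ALIGNMENT_BYTES = 32   # minimum alignment quoted in the text

def spectrum_offsets(values_per_spectrum, num_spectra):
    """Start offset of each spectrum, modulo the alignment, in bytes."""
    return [(a * values_per_spectrum * COMPLEX_BYTES) % ALIGNMENT_BYTES
            for a in range(num_spectra)]

def dft(x):
    """Direct DFT, used here only as a stand-in for the library transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

L = 16
real_to_complex = spectrum_offsets(L // 2 + 1, 5)   # L/2 + 1 values per spectrum
complex_to_complex = spectrum_offsets(L, 5)         # L values per spectrum

# Hypothetical real samples, stored as complex values with zero imaginary part
samples = [float((7 * t) % 5) - 2.0 for t in range(L)]
spectrum = dft([complex(x, 0.0) for x in samples])
mirror_error = max(abs(spectrum[L - k] - spectrum[k].conjugate())
                   for k in range(1, L // 2))
```

With L/2 + 1 values per spectrum the start offset drifts by 8 bytes per spectrum, whereas with L values per spectrum every spectrum starts on an aligned boundary.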

For a complex to complex transform, the unpack algorithm must convert the real

input data to complex values, by padding each floating point unpacked data value

with an additional floating point value set to zero. These modifications increase the

size of the required device memory for the unpacked signal data and transformed

spectra. The use of host and device memory in the GPU FX correlator model is

now discussed.

4.2 Memory Management

Correct management of the host and device memory is critical to enabling optimal

performance. Kernel execution cannot commence until all the required data has

completed the host to device memory transfer. In a similar manner, device to host

transfers must complete before results can be accessed by the host. These two

additional transfer stages are shown in Figure 4.1. The communication of data

between host and device memory occurs via the PCI-express bus.

The PCI-express 1.1 bus currently supports a maximum transfer rate of 4 giga-

bytes per second in each direction. This is sufficient for my model. As shown in

Figure 4.1, data is transferred to the GPU once, and results are retrieved from the

GPU once. There are no additional transfers required between the three computa-

tional stages. Such transfers would have a significant detrimental effect on the per-

formance of the algorithm. Page-locked memory spaces, introduced in Section 2.5.2,

were not used based on the results of preliminary testing. Future versions of CUDA

will use multiple independent processing pipelines that will allow the transfer of

data to occur simultaneously with kernel execution. This will effectively hide the

necessary communication between the host and device. The resulting effect on the

correlator performance is left for future research to explore.

The size of memory buffers can potentially limit the parallelism of the kernels.

Because memory buffers cannot be refreshed during kernel execution, it is important

that large data buffers are used. Larger amounts of data available to a kernel

increase the scope for parallelism. The memory buffer size is set during host and

device memory allocation. The buffer size is also relevant to data transfer, as the

GPU is more efficient when transferring larger amounts of data [40]. Thus, entire

buffers of data should be copied between device and host in a single transfer, rather

than in a series of smaller transfers.
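
The preference for a single large transfer can be illustrated with a simple cost model. The model below is an assumption, not a measurement: it treats each copy as a fixed overhead plus the time to move the bytes at the peak bus rate. The 4 gigabytes per second figure is the peak rate stated above, while the 10 microsecond overhead per copy and the 64 MiB buffer are hypothetical placeholders.

```python
PCIE_BYTES_PER_SECOND = 4e9     # peak rate quoted in the text for PCI-express 1.1
PER_TRANSFER_OVERHEAD = 10e-6   # hypothetical fixed cost per copy, in seconds

def transfer_time(total_bytes, num_transfers):
    """Estimated time to move total_bytes split evenly over num_transfers copies."""
    return (num_transfers * PER_TRANSFER_OVERHEAD
            + total_bytes / PCIE_BYTES_PER_SECOND)

buffer_bytes = 64 * 1024 * 1024                  # hypothetical 64 MiB buffer
single_copy = transfer_time(buffer_bytes, 1)     # one large transfer
many_copies = transfer_time(buffer_bytes, 1024)  # same data in 1024 pieces
```

Under this model the bandwidth term is identical in both cases, so any difference comes from the per-copy overhead, which grows with the number of transfers.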

As the incoming data streams are of arbitrary length, the GPU operates on

a buffer that is a portion of the entire data stream. The size of the buffer for

these portions is therefore dependent on the capabilities of the GPU hardware.

However, the size of an accumulation is dependent on the desired science outcomes

of an observation and the specifications of the particular telescope array. Arising

from this are two algorithmic features not present in a simple CPU implementation:

the accumulation is not necessarily aligned with the buffer boundaries, and the

accumulation may span buffer boundaries. Consequently, the CMAC kernel uses

the result buffer to hold intermediary accumulation values while the spectra buffer

is refreshed. This enables accumulations that span consecutive spectra buffers of

data.
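
The handling of accumulations that span buffer boundaries can be illustrated with a short serial sketch. This is not the CMAC kernel itself: scalar values stand in for per-channel visibilities, and the function and variable names are assumptions made for illustration. The point is only that a partial sum survives each buffer refresh.

```python
def accumulate_across_buffers(buffers, accumulation_length):
    """Sum consecutive values in groups of `accumulation_length`, where the
    values arrive split across buffers whose sizes need not divide evenly."""
    results = []   # completed accumulations
    partial = 0.0  # intermediate value held in the result buffer between refreshes
    count = 0      # number of values folded into `partial` so far
    for buffer in buffers:  # each iteration models one refresh of the spectra buffer
        for value in buffer:
            partial += value
            count += 1
            if count == accumulation_length:
                results.append(partial)
                partial, count = 0.0, 0
    return results, partial, count


# Buffers of four values with an accumulation length of three: the second
# and third accumulations both span a buffer boundary.
done, partial, count = accumulate_across_buffers([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10]], 3)
print(done, partial, count)
```

The trailing partial sum and its count are returned so that a caller could carry them into the next buffer.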

The relationship between the GPU global memory and the GPU is similar in

some respects to that between the RAM and the CPU. It serves as a large data

staging area for the lower latency shared and register memory on the GPU itself,

as RAM does for the CPU cache. The interface between global memory and an

individual GPU multiprocessor is a parallel memory interface, which is accelerated

only for specific coalesced memory access patterns [72]. Hence the ordering of data

chosen in an algorithm has significant performance effects during GPU memory

operations.

For the GPU FX correlator algorithm, it is beneficial for the input data streams

to be grouped by stream rather than by time. That is, the sequential samplings of

any given data stream are contiguous in memory. Should the data instead have the

values corresponding to a given time from all streams contiguous in memory, corner


turning must be applied to shuffle the data into the correct ordering. The model

presented here assumed that corner turning is not required, and the implementation

of this operation is left to future research.
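
For reference, the reordering that corner turning performs can be sketched in a few lines of serial Python. This is an illustration only and not part of the model, which assumes the data already arrives grouped by stream; the function name and the flat list layout are assumptions.

```python
def corner_turn(time_major, num_streams):
    """Reorder samples from time-major order (all streams at t0, then t1, ...)
    to stream-major order (all samples of stream 0, then stream 1, ...)."""
    if len(time_major) % num_streams != 0:
        raise ValueError("sample count must be a multiple of the stream count")
    return [sample
            for stream in range(num_streams)
            for sample in time_major[stream::num_streams]]


# Two streams, a and b, with three time samples each.
print(corner_turn(["a0", "b0", "a1", "b1", "a2", "b2"], 2))
```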

Shown in Figure 4.3 is the data flow of my GPU correlator. During computation,

a series of memory buffers in device memory is used to store the packed signals, R;

unpacked signals, X; spectra, S; and complex visibilities, C. These buffers are

accessed by the GPU via the GPU memory bus in each stage of the algorithm.

In order to transfer data to and from the device, the initial and final buffers are

allocated in both the host and device memory. Data is transferred between these

buffers explicitly during the program execution, as detailed previously in Figure 4.1.

[Figure 4.3 diagram. Memory levels: Random Access Memory (RAM); GPU Global Memory; GPU Shared Memory and Registers. Stages: Unpack, Transform, Accumulate. Buffers: packed signal data (R), unpacked signal data (X), spectra (S), complex visibility data result (C).]

Figure 4.3: GPU FX correlator data flow. During computation, a series of memory buffers in device memory is used to store the packed signals, R; unpacked signals, X; spectra, S; and complex visibilities, C. These variables were introduced in Section 2.2. The buffers are accessed by the GPU via the GPU memory bus in each stage of the algorithm. In order to transfer data to and from the device, the first and last buffer also exist in the host’s RAM. Data is transferred between these host and device memory buffers explicitly during the program execution. This transfer occurs via the memory bus, chipset, and PCI-express bus as shown in Figure 2.22.


4.3 Polyphase Filter

The GPU polyphase filter builds on the approach taken for the unpacking kernel of

the vanilla GPU correlator. The stage launches sufficient threads to make use of the

GPU compute resources. Each thread processes a portion of the data in the packed

buffer, R, and outputs to the unpacked buffer, X. The threads access the input data

in a coalesced manner as discussed for the unpack stage in previous sections.

The main algorithmic addition is a circular buffer that is kept in shared memory.

This buffer stores multiple consecutive reads, up to the number of taps required

for the calculation, for each thread in a block. The minimum size of this buffer is

the number of threads in a warp multiplied by the number of taps. The maximum

size of this buffer is the shared memory available on each multiprocessor. Ideally,

the buffer should be small enough that multiple blocks may be run on the same

multiprocessor.

In operation, the polyphase kernel first fills the buffer with input data. Each

thread then unpacks each tap value in the buffer, multiplies by a preprocessed filter

function also located in shared memory, sums the resulting values, and outputs to

the unpacked buffer. The next tap is then read into the circular buffer, overwriting

the first tap, and the process continues until all data has been processed. The input

buffer is increased to provide enough data of the subsequent input buffer for the

final results to be read. This requires copying a negligibly small additional amount

of data in each data copy from the CPU to the GPU.
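
The circular buffer scheme described above can be illustrated with the following serial Python sketch. It is not the CUDA kernel: the deque stands in for the shared memory buffer, and the function names, the unpack scale and offset, and the layout of the coefficient table are assumptions made for illustration.

```python
from collections import deque


def unpack(packed_sample, scale=1.0, offset=0.0):
    """Convert a packed integer sample to a float with one multiply and one add."""
    return packed_sample * scale + offset


def polyphase_filter(blocks, coefficients):
    """Apply a T-tap polyphase filter to a sequence of equal-length packed blocks.

    coefficients[t][n] weights element n of the t-th block in the current window.
    The deque plays the role of the circular buffer: it holds the T most recent
    packed blocks, and appending a new block discards the oldest one.
    """
    taps = len(coefficients)
    window = deque(maxlen=taps)
    outputs = []
    for block in blocks:
        window.append(block)
        if len(window) < taps:
            continue  # still filling the buffer
        # Samples stay packed in the window, so each one is unpacked again
        # for every window in which it appears.
        outputs.append([
            sum(coefficients[t][n] * unpack(window[t][n]) for t in range(taps))
            for n in range(len(block))
        ])
    return outputs


# Two taps over three blocks of two samples each gives two output blocks.
print(polyphase_filter([[1, 2], [3, 4], [5, 6]], [[1, 1], [10, 10]]))
```

Keeping the window packed mirrors the trade-off discussed in this section: repeated unpacking in exchange for a smaller buffer.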

The unpacking operations in this scheme occur multiple times on the same data

element, once for each tap. This approach has been taken as it allows the circular

buffer to contain packed data, and thus have a much smaller size. This allows the


GPU multiprocessor to process multiple blocks concurrently, allowing for improved

performance. The additional unpacking operations should be hidden by memory

latency, with no loss in performance.

The unpack stage uses a kernel to launch enough threads to make use of all

the GPU’s available compute resources. These threads then process the data in

the packed buffer, R, and output to the unpacked buffer, X. For optimal processing

speeds, the global memory in which these buffers reside must be accessed in a

coalesced manner. That is, the sequential threads within the same single instruction

multiple data (SIMD) warp must access corresponding sequential memory addresses [72].

For interleaved array data, wherein the timesamples from all signals for

a given time are adjacent in memory, the data must be shuffled to a non-interleaved

form for the FFT, wherein all consecutive timesamples for a given signal are adjacent.

In order to both read and write to global memory in a coalesced manner, the

data must be shuffled in shared memory. This work has assumed non-interleaved

data, and has not investigated the interleaved case.

Chapter 5

Testing

A GPU FX correlator was successfully implemented and tested, using the hetero-

geneous parallel model presented in Chapter 4. The testing was used to investigate

the GPU FX correlator implementation, using a single core CPU implementation

for comparison. The purpose of this comparison is to determine the suitability of

the heterogeneous parallel architectures to radio astronomy signal correlation. This

suitability consists of a number of criteria. Most importantly, the GPU correlator

implementation must produce correct results with a sufficient performance increase

over the serial implementation to warrant the additional parallel programming over-

head. As power consumption has become a significant factor in computing, the

power usage of the correlator implementation is investigated. Finally, the adaptabil-

ity of heterogeneous parallel architectures is also considered. For this, a polyphase

filter was added to the GPU FX correlator implementation. The ability to add new

algorithmic features to a correlator widens the scope of its potential scientific appli-

cations. I address all of these criteria with the test results I present in this chapter,

which are summarised in Table 5.1.


The test data consisted of four digital signal streams recorded from prototype an-

tenna tiles in the Mileura Widefield Array (MWA) low frequency demonstrator [26].

A short sample of the data used for testing is shown in Figure 5.1. The signals were

sampled at a rate of 16 MHz, which corresponds to one sample every 62.5 nanosec-

onds. Each data sample had a precision of 8 bits. Four streams of data, totalling

one gigabyte, were collected for testing. For tests that required more than four signal

streams, the original four signals were replicated. The performance of the algorithm

is not dependent on the values of the input data.

I investigated a variety of values for the two most significant correlation parameters: the

length of the frequency transform L, and the number of data streams N. These

parameters have a significant impact on the thread resources, thread load on the

GPU, and overall memory usage that could affect the operating speed. For the

transform length parameter L, testing values varied by powers of two from L = 128

to L = 2048, since these are considered to be lengths for which a GPU correlator is

most likely to be used. For the number of data streams N , the values tested varied

in powers of two from N = 1 streams up to N = 128 streams. A lower limit of

N = 4 was chosen for some of the optimisation techniques, which became somewhat

trivialised for N = 1, 2. The upper limit was chosen as multiple GPU approaches

must be considered past this degree of processing. Such approaches would split the

streams between cards using the smaller stream sizes presented here.

The test system hardware consisted of a Tyan Thunder K8WE (S2895) motherboard,

with a Dual Core AMD Opteron Processor 265 CPU. The power consumption of this CPU

is rated at 95W. As this work has not addressed multicore CPU approaches,

only a single core was utilised in the testing. The GPU used was the

NVIDIA GeForce 8800 GTS with 320MB of memory. This GPU has 96 streaming

processors (SP) with a clock rate of 1200 MHz. With each SP capable of three


floating point operations per clock, the maximum theoretical performance for the

8800 GTS is 3 × 96 × 1200 × 10^6 = 345.6 GFLOPS. The maximum bandwidth to the

onboard memory of this GPU is 64 GB/s. The power consumption of the 8800 GTS

is rated at 135W. For RAM, the system had 2GB of DDR400 memory. A Seagate

Barracuda ST3250620AS 250GB SATA2 hard disk drive was used for storage. This

system utilised a PCI-Express bus architecture for communication between the host

and GPU device.
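
The peak figure quoted above follows directly from the hardware parameters. The short sketch below reproduces the arithmetic; the function name is an assumption, and the inputs are the values stated in the text.

```python
def peak_gflops(stream_processors, clock_mhz, flop_per_clock):
    """Theoretical peak rate in GFLOPS for a device with the given parameters."""
    return stream_processors * clock_mhz * 1e6 * flop_per_clock / 1e9


# 96 streaming processors at 1200 MHz, three floating point operations per clock.
print(peak_gflops(96, 1200, 3))
```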

The test system ran the Ubuntu Feisty Linux v7.04 operating system, using

version 2.6.20-16 of the Linux kernel. The GPU was accessed via the NVIDIA Linux

Display Driver x86 version 100.14.11. The libraries required for testing included:

libc 6, libgcc 4.1.2, libcuda 1, libcufft 1, and libfftw 3. The FFTW library was used

for Fourier transform processing by the CPU correlator. The CUDA and CUFFT

libraries supported CUDA compute version 1.0. All timing tests utilised the ftime

routine of the sys/timeb.h system header. Timing tests were run ten times and averaged

to produce results, with outliers due to the operating system removed. Aside from

these rare outliers, the obtained timing results were identical due to the 100ms

granularity of the system timer. Multiple iterations of tests were used to increase

the test runtimes to at least 10 seconds each to ensure several significant figures of

accuracy.


Section           Test                                                 Figure
Preliminary       PCI-express data transfer rates                      5.3
                  GPU fast Fourier transform                           5.5
CMAC Stage        CMAC stage results for a varying number of signals   5.7
                  CMAC stage results for different transform lengths   5.8
GPU Correlator    Test output                                          5.10
                  Overview of stream bandwidth                         5.11
                  The variation of stream bandwidth with N             5.12
                  The variation of stream bandwidth with L             5.13
                  Total data throughput                                5.14
                  Correlator FLOPS                                     5.15
                  Performance per watt                                 5.16
Polyphase Filter  Polyphase filter performance                         5.18

Table 5.1: Testing Summary. This table summarises the testing results presented in this chapter. The preliminary testing provided insight into the computational ability of the test system. The CMAC stage testing investigated how correlation parameters affected the different potential approaches for the CMAC kernel. Testing then examined the overall GPU correlator to determine the performance of the implementation. Finally, the addition of the polyphase filter and its associated performance was tested to explore the adaptability of the GPU implementation.

[Figure 5.1 plot: bit value (-16 to 16) against timesample (0 to 128).]

Figure 5.1: Test data. Shown is a short sample of the data used for testing. This data was sampled from signals observed by prototype antenna tiles in the Mileura Widefield Array (MWA) low frequency demonstrator [26]. The signals were sampled at a rate of 16 MHz, which corresponds to one sample every 62.5 nanoseconds.


5.1 Preliminary Testing

This section presents the results of preliminary tests performed prior to implement-

ing the full correlator algorithm. These results were required to obtain insight into

the computational ability of the test system. First examined are the available computational

resources of the GPU device. Following this, performance details regarding

the host-device bandwidth and the CUDA fast Fourier transform library, CUFFT,

are examined.

Device resources are an important consideration in obtaining optimum GPU

compute performance. Each resource examined is finite, and thus approaches that

exceed a given resource will not execute. Furthermore, some resources are shared

between threads. Thus the amount of a resource that each thread requires deter-

mines the maximum number of threads that may execute concurrently. This directly

affects the performance of the system if there are too few threads for computation.
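
The way shared resources bound the number of concurrent threads can be sketched with a simplified model. The default limits below are the per-multiprocessor register and shared memory figures reported for the test device in this section; the function name is an assumption, and the model deliberately ignores allocation granularity and the hardware limits on resident blocks and threads.

```python
def max_concurrent_threads(registers_per_thread, threads_per_block,
                           shared_bytes_per_block=0,
                           registers_per_mp=4096, shared_bytes_per_mp=16384):
    """Upper bound on threads resident on one multiprocessor (simplified model)."""
    registers_per_block = registers_per_thread * threads_per_block
    blocks = registers_per_mp // registers_per_block
    if shared_bytes_per_block > 0:
        blocks = min(blocks, shared_bytes_per_mp // shared_bytes_per_block)
    return blocks * threads_per_block


# Register-limited: three blocks of 128 threads fit.
print(max_concurrent_threads(10, 128, shared_bytes_per_block=2048))
# Exceeds the register file: no block fits, so the kernel cannot execute.
print(max_concurrent_threads(32, 256))
```

A result of zero corresponds to the case noted above, where an approach that exceeds a given resource will not execute.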

The CUDA SDK was used to determine the computational capabilities of the

GPU device in the test system. It contained 12 multiprocessors and had a compute

capability of 1.0. The thread topology could have blocks with maximum dimensions

of 512 by 512 by 64, with a maximum of 512 threads per block. The topology supported up

to two dimensional grids with dimensions not exceeding 65,535 by 65,535.

The CUDA SDK was used to probe the memory of each type on the GPU device.

Each multiprocessor on the GPU device contained 4096 32-bit registers and 16,384

bytes of shared memory. The GPU device contained 288,210,994 bytes of global

memory and 65,536 bytes of constant memory. However, the amount of allocatable

global memory could vary depending on the required usage for rendering of the

desktop user interface. As this value can change arbitrarily, no formal testing was


performed. All testing ensured a sufficient buffer of memory for the user interface

was maintained, and there was minimal user interface usage during testing. This

phenomenon could be avoided with the use of a dedicated compute GPU in addition

to that used for graphical rendering.

I next examined the rate of data transfer across the PCI-express bus between

the host and device memories, which occurs in the data transfer stage of the GPU

model shown in Figure 5.2. Since the signal data is one-dimensional, concerns

common to two-dimensional data transfers such as byte alignment and padding are

not considered. Measurements were taken for the page-locked mode [72] as well

as the normal transfers. I considered one CUDA API call to instantiate a host to

device transfer to correspond to one transfer. The bandwidth for the two modes

was measured for a variety of packet sizes, and the results are shown in Figure 5.3.

The performance of the CUFFT library was tested to compare performance to

the leading single core CPU fast Fourier transform software, FFTW [31]. These

tests were performed because the FFT is required for the second stage of the FX

correlation algorithm as shown in Figure 5.4. I investigated two modes of GPU

operation as well as the serial FFTW implementation on the CPU. In the first

GPU mode, the library directly transformed values resident in the GPU’s global

memory. In the second, values resident in host memory were transferred to the

GPU, transformed by the library, and then transferred back to host memory. The

GPU FX correlator model uses the FFT to process data already resident on the

GPU device; however, the latter mode was included in this testing to demonstrate

the costs incurred by transferring unpacked floating point data to the device, and

non-accumulated results back to the host. These modes were tested for transforms

of length L = 128 to L = 2^22, and the performance results are shown in Figure 5.5.


Figure 5.2: Bandwidth testing. Shown is a diagram of the GPU correlator algorithm flow. The bandwidth testing was performed to determine the maximum transfer rate achievable between the host and device as highlighted in this figure. Due to the accumulation in the third computational stage of the algorithm, there is significantly less data to be transferred from the device to the host later in the algorithm. For this reason only results of testing for the highlighted stage will be presented.

[Figure 5.3 plot: transfer bandwidth (bytes per second, 0 to 3G) against size of transfer packets (bytes, 0 to 67M). Series: Normal; Page-locked.]

Figure 5.3: PCI-express data transfer rates. This graph shows the achievable rate of data transfer across the PCI-express bus between the host and device memories. Rates are shown for a variety of transfer sizes. One CUDA API call to instantiate a host to device transfer is considered to correspond to one transfer. Measurements for the page-locked mode [72] are included as well as the normal transfers, to verify that page locking is not suitable for the data streaming used in the correlator algorithm.


Figure 5.4: Fast Fourier transform testing. Shown is a diagram of the GPU correlator algorithm flow, with the FFT kernel highlighted. The FFT kernel testing was performed to determine the performance of the CUFFT 1 library on the GPU device, as well as the FFTW 3 library on the host system. Testing of the CUFFT library included tests both with and without data transfer. Testing without data transfer is representative of the performance of this kernel in the GPU implementation. The tests that included data transfer used unpacked floating point input and non-accumulated output, which is not representative of the GPU implementation. These latter tests have been included to demonstrate the loss of performance incurred by these communications that can be reduced by the other two GPU kernel stages.

[Figure 5.5 plot: complex values transformed per second (10M to 1G) against transform length (128 to 4.2M). Series: GPU CUFFT; GPU CUFFT with transfer; CPU FFTW.]

Figure 5.5: GPU fast Fourier transform. This graph compares the performance of the CUFFT library to the leading single core CPU software fast Fourier transform implementation, FFTW [31]. Testing of the CUFFT library investigated two modes of operation. In the first, the library directly transformed values resident in the GPU’s global memory. In the second, values resident in host memory were transferred to the GPU, transformed by the library, and then transferred back to host memory.


5.2 CMAC Stage Testing

I next examined how correlation parameters affected the different potential ap-

proaches for the CMAC kernel, highlighted in Figure 5.6. The correlation parameters

tested included the length of the FFT, L, and the number of telescope signal streams,

N. Of the approaches presented in the previous chapter, testing considered the serial,

1x1xN, 1xGxG and 1x1x1 approaches. The 1xNxN approach cannot be

implemented on current NVIDIA hardware for more than a limited number of input

data streams. This is because the number of registers required by the kernel scales

quadratically with the number of streams, and soon exceeds the number available on the GPU.

I first investigated how the transform length parameter affected computational

speed. Testing values varied by powers of two from L = 128 to L = 2048, since

these are considered to be lengths for which a GPU correlator is most likely to be

used. Two sets of these results have been graphed. Figure 5.8(a) shows how the

performance of the various approaches varies with transform length for N = 64

streams. To examine the effect of a low thread configuration, Figure 5.8(b) shows

how the performance varies for N = 4 streams.

I next investigated how the number of streams in a correlation affected performance.

The values tested varied in powers of two from N = 4 streams up to N = 128

streams. The lower limit was chosen as the techniques become somewhat trivialised

for N = 1, 2 and the upper limit was chosen as multiple GPU approaches must be

considered past this degree of processing. Such approaches would split the streams

between cards using the smaller stream sizes presented here. Figure 5.7(a) shows

the effect of varying the number of streams on processing performance for transform

length L = 1024. To examine the effect of a low thread configuration, Figure 5.7(b)

shows the performance for transform length L = 128.


Figure 5.6: CMAC stage testing. Shown is a diagram of the GPU correlator algorithm flow, with the CMAC kernel highlighted. I will now examine a number of different potential implementations of this stage, in order to determine which performs the fastest for a given set of correlation parameters.

[Figure 5.7 plots: bandwidth per signal (10 kHz to 1 GHz) against number of data streams (4 to 128). Panels: (a) High L = 1024, varying N; (b) Low L = 128, varying N. Series: CPU; comp 1x1xN; comp 1xGxG; comp 1x1x1; real 1x1xN; real 1xGxG; real 1x1x1.]

Figure 5.7: CMAC stage results for a varying number of signals. Shown are the rates achieved. The number of signals, N, varied from 4 to 128. Each of the approaches was tested on real to complex (real) and complex to complex (comp) transform data. The bandwidth is half the number of samples per stream per second the correlator can compute in real time, assuming real input signals in accordance with Nyquist’s theorem.


[Figure 5.8 plots: bandwidth per signal against transform length (128 to 2048). Panels: (a) High N = 64, varying L (10 kHz to 10 MHz); (b) Low N = 4, varying L (1 MHz to 1 GHz).]

Figure 5.8: CMAC stage results for different transform lengths. Shown are the rates achieved. The transform length, L, varied from 128 to 2048. Each of the approaches was tested on real to complex (real) and complex to complex (comp) transform data. The bandwidth is half the number of samples per stream per second the correlator can compute in real time, assuming real input signals in accordance with Nyquist’s theorem.


5.3 GPU Correlator Results

Results of testing for the entire GPU correlator, shown in Figure 5.9, are now pre-

sented. I first examined correctness, to ensure the produced output was valid. Cor-

rectness tests ran the correlator implementations using the 1 gigabyte of MWA

signal data as input. Shown in Figure 5.10 is an autocorrelation of one of the sig-

nals produced by the GPU FX correlator. This output matched the standard output

supplied with the MWA tile data.

A direct comparison of the serial and parallel correlator output revealed slight

differences. Forty values taken from the real channels of the first autocorrelation

spectra were compared; these values are listed in Table 5.2. An average relative

error of 0.0000131 and a largest relative error of 0.0000477 were observed. These

variations were assumed to be due to the two implementations using different FFT

radices. Both the FFTW and CUFFT libraries automatically select the

optimal FFT radices for the hardware, and it is unlikely that the same set of radices

will be optimal for both CPU and GPU.
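
The relative error figures can be reproduced from the tabulated values. The sketch below applies the usual definition, |GPU − CPU| / |CPU|, to three of the value pairs listed in Table 5.2; that this is the definition used in the testing is an assumption, as is the function name. The calculation here uses double precision, so trailing digits may differ slightly from the single precision figures reported.

```python
def relative_error(reference, value):
    """Relative difference of `value` from `reference`."""
    return abs(value - reference) / abs(reference)


# (CPU result, GPU result) pairs taken from Table 5.2.
pairs = [
    (304972431360.0, 304975446016.0),
    (37821464.0, 37822180.0),
    (50482428.0, 50484836.0),
]
errors = [relative_error(cpu, gpu) for cpu, gpu in pairs]
for (cpu, gpu), err in zip(pairs, errors):
    print(f"{cpu:.0f} {gpu:.0f} {err:.6f}")
```

The third pair is the one with the largest relative error in the table.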

Shown in Figure 5.11 are the ranges of the correlation parameters that were varied for

the testing: the length of the Fourier transform, L, and the number of streams,

N. As discussed previously in the chapter, the transform length range was chosen

to include lengths typically used in current radio correlators: 128 <= L <= 1024,

incremented in powers of two. Figure 5.13 shows

how the real time bandwidth per stream varies with L. The number of streams tested

covered the range 1 <= N <= 128, starting from a single stream and incrementing

to the maximum value of N = 128. This value was chosen as the total global

memory of the card began to reduce the allowable transform length. Finally, the


total throughput of the correlation is shown in Figure 5.14.

In addition to the correlator throughput, the number of floating point operations

per second (FLOPS) performed by the implementation was calculated. This was

achieved by multiplying the measured throughput by the number of floating point

operations per stream sample, producing the results shown in Figure 5.15.

The number of floating point operations (FLOP) for each stage was derived in the following manner. In

the first stage, each sample must be unpacked. For the 8 bit samples used in the

testing, this required two floating point operations, a multiply and an add. Thus

the FLOP per sample in the first stage is A = 2. For a polyphase filter with T

taps, 2T −1 additional operations are required per element [100]. The card actually

performs 4T −3 additional operations since shared memory resources are scarce and

sharing packed data allows higher thread occupancy and thus better performance

despite the additional computation. However, the smaller value is used for the purpose of standard

comparison.

In the second stage, a fast Fourier transform is applied to a series of L samples,

where L is the size of the transform. This requires a number of floating point

operations that depends on this length, specifically 5L log2 L [31]. This is only

precise for radix-2 Cooley-Tukey algorithms, but is the standard used for comparison

with other approaches. Thus there are 5 log2 L FLOP per stream element in this

stage.

In the final stage, the multiply-add requires 6 floating point operations per channel

per pair. For N streams there are N(N+1)/2 pairs, including autocorrelations. The FFT

produces as many channels as there are samples, so the cost per stream element is

6 × N(N+1)/2 ÷ N = 3N + 3 operations in this stage. The total is therefore

2 + 5 log2 L + 3N + 3 = 3N + 5 log2 L + 5 floating point operations per sample.

A further 2T − 1 operations per element are required

for the polyphase filter stage.
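
The per-sample operation count derived above can be written as a short function. The stage costs are those given in the text; the function names are assumptions, and the throughput passed to the second function in the example is a hypothetical placeholder rather than a measured result.

```python
import math


def flop_per_sample(num_streams, transform_length, taps=0):
    """Floating point operations per stream sample in the correlation pipeline."""
    unpack = 2                                    # one multiply and one add
    fft = 5 * math.log2(transform_length)         # 5 L log2 L operations over L samples
    cmac = 3 * num_streams + 3                    # 6 * N(N+1)/2 operations over N streams
    polyphase = 2 * taps - 1 if taps > 0 else 0   # optional T-tap polyphase filter
    return unpack + fft + cmac + polyphase


def correlator_flops(total_samples_per_second, num_streams, transform_length, taps=0):
    """FLOPS implied by a throughput measured in total samples per second."""
    return total_samples_per_second * flop_per_sample(num_streams, transform_length, taps)


print(flop_per_sample(16, 1024))        # 3*16 + 5*10 + 5
print(correlator_flops(1e8, 16, 1024))  # hypothetical throughput of 1e8 samples per second
```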

The amount of power used by the entire computer under several different oper-

ating loads was measured, in order to determine the FLOP per watt efficiency of

the implementations. Results were taken for both the CPU and GPU correlation,

as well as when the machine was idle. This was repeated both with and without the

X11 graphical interface. It should be noted that the CPU power usage does include

an idle GPU in its power consumption. The test machine motherboard would not

support booting without a graphics card. However, the power rating of the CPU is

95 watts. These power results are shown in Table 5.3.

The performance results were divided by the number of watts required, to pro-

duce the graph shown in Figure 5.16. Also plotted are the maximum possible power

efficiencies, which assume that the power supply, motherboard, and all other pe-

ripherals except for the GPU or CPU draw zero power. The power ratings supplied

by the manufacturers were used in this calculation.
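
The difference between the measured and ideal efficiencies can be illustrated as follows. The power values are the average measurements with X11 from Table 5.3 and the manufacturer ratings quoted in this section; the throughput is a hypothetical placeholder, and the choice of the X11 measurements is an assumption made for the example.

```python
RATED_WATTS = {"CPU": 95.0, "GPU": 135.0}      # manufacturer power ratings
MEASURED_WATTS = {"CPU": 192.0, "GPU": 191.5}  # whole-system averages with X11, Table 5.3


def samples_per_second_per_watt(throughput, watts):
    """Correlation throughput (samples per second) per watt of power drawn."""
    return throughput / watts


throughput = 1.0e8  # hypothetical throughput in samples per second
for device in ("CPU", "GPU"):
    measured = samples_per_second_per_watt(throughput, MEASURED_WATTS[device])
    ideal = samples_per_second_per_watt(throughput, RATED_WATTS[device])
    print(f"{device}: measured {measured:.3g}, ideal {ideal:.3g} samples/s/W")
```

Because the ideal case counts only the processor's rated power, it always gives a higher efficiency than the whole-system measurement for the same throughput.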


Figure 5.9: GPU correlator testing. Shown is a diagram of the GPU correlator algorithm flow. Operations processed by the host are coloured yellow, while the kernels that execute on the device are coloured green. The entire algorithm is highlighted, as the correctness testing and performance testing now presented are representative of the entire algorithm.

[Figure 5.10 plot: relative power per channel [dB] (-30 to 0) against frequency channel.]

Figure 5.10: Test output. Shown is an autocorrelation of one of the signals, produced by my GPU FX correlator. This output was correct compared to the CPU FX correlator I also implemented, although there was a slight variation in the results attributable to different radices used in the FFT libraries.


CPU results            GPU results            relative error
304972431360.000000    304975446016.000000    0.000010
    37821464.000000        37822180.000000    0.000019
    37308172.000000        37309220.000000    0.000028
    37475216.000000        37475148.000000    0.000002
    37237392.000000        37238140.000000    0.000020
    37065256.000000        37065048.000000    0.000006
    37616392.000000        37616380.000000    0.000000
    37669348.000000        37670068.000000    0.000019
    38732304.000000        38731776.000000    0.000014
    37922300.000000        37923044.000000    0.000020
    38540296.000000        38540232.000000    0.000002
    39222000.000000        39222508.000000    0.000013
    39706708.000000        39706900.000000    0.000005
    40351944.000000        40352472.000000    0.000013
    40454904.000000        40455124.000000    0.000005
    40564384.000000        40565260.000000    0.000022
    40726024.000000        40726428.000000    0.000010
    41891344.000000        41891848.000000    0.000012
    42424596.000000        42424804.000000    0.000005
    43427080.000000        43427108.000000    0.000001
    43503580.000000        43503688.000000    0.000002
    44464684.000000        44464852.000000    0.000004
    45469620.000000        45470324.000000    0.000015
    46780332.000000        46780484.000000    0.000003
    46963360.000000        46964528.000000    0.000025
    48391492.000000        48392124.000000    0.000013
    50482428.000000        50484836.000000    0.000048
    51228304.000000        51229236.000000    0.000018
    53108032.000000        53108508.000000    0.000009
    54400584.000000        54398692.000000    0.000035
    56066080.000000        56066444.000000    0.000006
    58752968.000000        58751976.000000    0.000017
    59759816.000000        59760360.000000    0.000009
    61693108.000000        61692876.000000    0.000004
    64569492.000000        64570884.000000    0.000022
    66858544.000000        66859796.000000    0.000019
    69568000.000000        69569272.000000    0.000018
    73219928.000000        73221752.000000    0.000025
    77314432.000000        77313896.000000    0.000007
    79746600.000000        79746688.000000    0.000001

Table 5.2: Accuracy test data. Shown are the real values for the first forty frequency channels of the first autocorrelation for the first stream. No normalisation or calibration has been applied. The relative error is listed in the third column. A mean of 0.0000131 and maximum of 0.0000477 for the relative error was obtained directly from the output values using floating point precision arithmetic.

[Figure 5.11 plot: bandwidth per signal (10 kHz to 100 MHz) against number of signals (1 to 128) and length of transform (128 to 1024). Series: CPU; GPU.]

Figure 5.11: Overview of stream bandwidth. Shown is an overview of the real time bandwidth per stream for the range of correlator parameters tested: the length of Fourier transform and the number of data streams. Refer to the cross-sections shown in Figures 5.12 and 5.13 for a clear comparison of results.


[Figure 5.12 plot: bandwidth per stream (10 kHz to 100 MHz) against number of data streams (1 to 128). Series: GPU and CPU, each for L = 128 and L = 1024.]

Figure 5.12: The variation of stream bandwidth with N. Shown is the real time bandwidth per stream as the number of data streams, N, varies. Results are plotted for the minimum and maximum transform lengths tested, L = 128 and L = 1024. The bandwidth calculation assumes sampling at the Nyquist rate.

[Figure 5.13 plot: bandwidth per stream (10 kHz to 100 MHz) against length of FFT (128 to 1024). Series: GPU; CPU.]

Figure 5.13: The variation of stream bandwidth with L. Shown is the real time bandwidth per stream as the FFT length, L, varies. Results are plotted for N = 16. Although the magnitude of the lines differ for other values of N, the trends are similar to the two present in the CPU and GPU lines respectively.


[Figure 5.14 plot. Axes: total samples per second (100k to 1G) against number of streams (1 to 128). Series: GPU and CPU, each for L = 128 and L = 1024.]

Figure 5.14: Total data throughput. Shown is the total data throughput in samples per second. Results are shown as they vary with the number of streams for two transform lengths: L = 128 and L = 1024.


[Figure 5.15 plot. Axes: FLOPS (100M to 100G) against number of streams (1 to 128). Series: GPU and CPU, each for L = 128 and L = 1024.]

Figure 5.15: Correlator FLOPS. Shown is the rate of floating point operations per second (FLOPS) achieved by the correlators. For N streams and a length L fast Fourier transform, each stream element requires 3N + 5 log2(L) + 5 floating point operations in the correlation pipeline.


Correlation  X11  Min Watts  Max Watts  Av Watts
None         Yes  178        182        180
CPU          Yes  190        194        192
GPU          Yes  185        198        191.5
None         No   183        186        184.5
CPU          No   195        198        196.5
GPU          No   186        198        192

Table 5.3: Observed power usage. Shown is the amount of power used by the entire computer under several different operating loads. Results were taken for both the CPU and GPU correlation, as well as when the machine was idle. This was repeated both with and without the X11 graphical interface. It should be noted that the CPU power usage does include an idle GPU, as the test machine motherboard would not support booting without a graphics card. However, the power rating of the CPU is 95 watts, and the GPU is rated at 135 watts. These figures are used to produce the upper limits in Figure 5.16.


[Figure 5.16 plots. Panels: (a) L = 128 Fourier transforms; (b) L = 1024 Fourier transforms. Axes in both panels: total samples per second per watt (10k to 1M) against number of streams (1 to 128). Series: GPU, CPU, ideal GPU, ideal CPU.]

Figure 5.16: Performance per watt. Shown is the total correlation throughput divided by the power consumption. The throughput is measured in samples per second, and the power is measured in watts. The measured power usage for the system includes the power supply, motherboard, hard disk drive, optical drive, and peripherals excluding the display. Also shown are values calculated using the ideal peak usage taken from the power rating of the CPU and GPU respectively.


5.4 Polyphase Filter Testing

I next investigated the addition of a polyphase filter to the unpack stage of the GPU

FX correlation model. The modified correlator algorithm is shown in Figure 5.17.

This was achieved using kernel code to implement the polyphase filter, b[n], defined

in Equation 2.26. For testing, an implementation corresponding to a subsequent

FFT of length L = 128 was developed. A more generalised GPU polyphase filter is

left for future research.

For this implementation, testing examined how the rate of execution varied both

with the number of taps in the filter as well as the number of streams in the buffer.

The number of taps varied across powers of two from T = 1 to T = 8. The number

of streams varied across powers of two from N = 1 to N = 128. For the T = 1 case,

the polyphase filter is equivalent in performance to the unpack stage it replaces.

While it does contain an additional filter multiplication for each data value, this is

hidden by the latency of the memory fetch for that value.
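To make the role of the taps concrete, the following is a minimal sketch of a polyphase front end. Equation 2.26 is not reproduced in this section, so the textbook weighted overlap-add form is assumed here: T blocks of L samples are weighted by the coefficients b[n] and folded into a single block of L values for the subsequent FFT. The function name and the uniform coefficients are hypothetical, chosen only for illustration.

```python
def polyphase_frontend(x, b, fft_length, taps):
    """Weight taps * fft_length samples by b[n] and fold them into fft_length values."""
    if len(x) != fft_length * taps or len(b) != fft_length * taps:
        raise ValueError("x and b must both hold fft_length * taps values")
    return [
        # One multiplication per input sample, then a sum over the taps.
        sum(b[n + t * fft_length] * x[n + t * fft_length] for t in range(taps))
        for n in range(fft_length)
    ]


# Illustrative input: T = 2 taps feeding an L = 4 transform.
samples = [1, 2, 3, 4, 5, 6, 7, 8]
coefficients = [0.5] * 8  # hypothetical filter coefficients
folded = polyphase_frontend(samples, coefficients, fft_length=4, taps=2)
print(folded)
```

In this form the T = 1 case reduces to a single multiplication per sample, which is the additional filter multiplication referred to above.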

The polyphase filter kernel scales linearly with the number of streams, as opposed

to the quadratic scaling of the CMAC stage. For this reason, it takes significantly

less time than the CMAC stage. Thus, overall performance tests would not reveal

how the polyphase filter is affected by N and T . For this reason, the performance

tests consider the performance for only the polyphase filter kernel corresponding

to b[n], and not the complete correlation implementation. The rate at which the

GPU polyphase filter processed the data for various numbers of streams is shown

in Figure 5.18(a). This is also shown for various numbers of taps in Figure 5.18(b).


[Figure 5.17 flowchart. Node labels in order: Initialise; Read digital samples; Transfer data to GPU; Polyphase; Kernel 1; Kernel 2; CMAC; Kernel 3; Finalise. The flow includes two YES/NO decision branches.]

Figure 5.17: Polyphase filter testing. Shown is a diagram of the GPU correlator algorithm flow. Highlighted is the polyphase filter kernel, which has replaced the unpack kernel for testing. The resulting change in performance for a variety of correlation parameters is now examined.


[Figure 5.18 plots. Panel (a) Performance by stream: samples per stream per second (1M to 1G) against number of taps (1 to 8), with series N = 4, 8, 16, 32, 64, and 128. Panel (b) Performance by tap: samples per stream per second (1M to 1G) against number of streams (4 to 128), with series taps = 1, 2, 4, and 8.]

Figure 5.18: Polyphase filter performance. Shown is the total stream throughput for the GPU polyphase filter. In the first figure, performance is measured for a varying number of taps in the polyphase filter. The number of computational operations in the filter kernel scales with the number of taps. However, the performance of the filter is not fully affected by the additional computation, because the computation is hidden by global memory latency. The second graph shows the effect of varying the number of streams on the kernel. The performance is inversely proportional to the number of streams, which corresponds to the workload required for a given number of streams.

Chapter 6

Discussion

The results presented in the previous chapter are now discussed. I will examine the

performance gains, computational precision, memory bandwidth, power consump-

tion, and ease of programming of the GPU implementation. This will demonstrate

the suitability of the graphics processing unit to accelerate signal correlation algo-

rithms used in radio interferometry. Furthermore, I will illustrate that significant

progress toward satisfying the processing requirements of the next generation of

scientific computation can be achieved by parallel processing architectures.

This chapter will first discuss the preliminary investigation of the GPU. Sub-

sequently, the effect of the correlation parameters on the choice of kernel for the

CMAC stage is explored. This is followed by a discussion of the overall GPU cor-

relator implementation. An analysis of power usage by the implementation is then

presented. The chapter closes with a discussion of the relative ease with which the

GPU algorithm can be modified.



6.1 Preliminary Analysis

Research presented in the literature review revealed potential GPU computing bot-

tlenecks [40, 64]. These bottlenecks could significantly impact the performance of a

GPU implementation. Preliminary testing was thus carried out to ensure that per-

formance was not significantly impacted by these bottlenecks before development of

the GPU FX correlator began. There were two areas that were investigated: the

transfer of data between host and GPU device, and the GPU FFT library imple-

mentation.

The transfer of data between the host and GPU device has been shown to become

limited depending on the size of the packets used for transfer [40]. In CUDA, the

size of the packets refers to the number of bytes specified in a single cudaMemcpy

routine call. For this reason, a range of packet sizes were tested to determine their

corresponding performance. The preliminary testing also examined the difference

between pageable and pinned data transfer modes to determine the optimal

method of copying data to and from the GPU device. The results of these tests

were shown in Figure 5.3. These results showed that the pageable memory transfers

provide superior performance, and that the packet size must be larger than approx-

imately eight megabytes. This was used in the GPU FX correlator implementation

to minimise the performance impact of data transfer between the host and device.

Research that has used an implementation of the FFT on the GPU architecture

has seen mixed performance results [64, 34, 71]. As the FFT forms the second

computational stage of the FX correlation algorithm, it was important to determine

the performance of the CUDA FFT library. The results of the FFT testing revealed

that the overhead created by the transfer of unpacked floating point values before


and after the transform reduced the performance of the card. However, the GPU FX

correlator model does not transfer 32 bit unpacked floating point values. Instead it

transfers data to the GPU device in an 8 bit packed form, reducing the effect of the

data transfer to the GPU device by a factor of four. Furthermore, the accumulation

in the CMAC stage reduces the results to a negligible fraction of the original data

size. Thus the effect of data transfer from the GPU device is reduced.

Performing the additional correlation stages on the GPU device reduces the effect

of data transfer by approximately a factor of eight. This results from the factor of

four decrease in the size of the input data stream, and the factor of two decrease from

the reduction of the result data stream to a negligible size. However, this has not

considered the effect of moving the computation of the other two stages to the GPU

device from the CPU. Performance gains in these other stages will also mitigate the

cost of transferring data between the host and GPU device. The performance of the

CMAC stage is next discussed.

6.2 Optimisation Analysis

The optimisation analysis investigated how the correlation parameters of FFT length

and the number of telescope streams affected the best GPU CMAC approach. This

is important because the CMAC stage requires the greatest amount of computation

for a non-trivial number of telescope streams, and the trend in radio astronomy

interferometry is for an increasingly large number of telescopes in interferometer

arrays [38]. For the cross multiplication and accumulation stage of an FX correlator,

testing demonstrated the performance of the GPU. In the test results, shown previously

in Figures 5.7(a), 5.7(b), 5.8(a), and 5.8(b), the GPU was tens to hundreds of times

faster than the serial implementation. While there is potential for

optimisation in the CPU implementation, the achievable performance gains would

not be significant when compared to the orders of magnitude performance increase

required to reach that of the GPU.

The GPU correlator by Van Der Schaaf and Overeem [95], reviewed earlier in

Chapter 3, saw performance improve by just under a factor of five when compared

to a serial implementation. In contrast, the CMAC stage presented in Chapter 4

increased this improvement to over a factor of a hundred. There are two reasons for

this large leap in computational power. Firstly, the GPU has grown in power much

faster than the CPU in the intervening three years. This is shown in Figure 2.19(a).

Secondly, the advent of GPU computing discussed in Section 3.1 has allowed a

greater flexibility in algorithm design. This has resulted in a significant reduction

in the number of global memory accesses and kernel executions required.

Testing revealed two main factors that determine the best CMAC kernel for a

given set of correlation parameters. The first is the coalescence of the global memory

access on the GPU. The effect of this can be seen in the superior performance of the

complex to complex data implementations over those of the real to complex data.

This effect also explains why the 1xGxG approach, although it carries out some

redundant processing that the 1x1x1 and 1x1xN approaches avoid, is significantly

faster: its memory access is more efficient. The use of the shared memory as a cache to

share data between threads in a group results in a speed up proportional to the

reduced global memory access, as seen in the majority of the correlation parameter

space. However, this approach became inefficient for low FFT lengths and numbers

of telescope streams due to a lack of GPU processing threads.

The GPU has the capability of actively processing hundreds of threads of execution

simultaneously. Furthermore, it has the capability of scheduling threads. If a group

of threads stalls while waiting on the latency involved in a global memory access,

the execution of another group of threads proceeds. Thus a kernel may

be working on thousands of threads at any given time. However, as these threads

are processed in parallel, the GPU will take a similar amount of time to process one

thread as it would to process its maximum thread capacity. In this way, failing to

parallelise an algorithm to keep the GPU at maximum capacity will result in a loss

of performance, which I refer to as thread deficiency.

Thread deficiency is the second factor in determining the best approach for a

given set of correlation parameters. This can be seen for smaller transform lengths

and numbers of streams in Figures 5.7(b) and 5.8(b). The 1x1xN and 1xGxG methods

begin to lose performance, whereas the 1x1x1 approach remains unaffected. This is

because the finer parallelisation of this approach yields more threads than the others.

If there are not enough threads to fill the GPU, it takes the same amount of time to

process less work and thus performance drops. The thread deficiency in an approach

occurs approximately when the total number of threads for an approach drops be-

low the maximum thread occupancy of the GPU for that approach. This is distinct

from theoretical thread occupancy for a GPU, as other hardware restrictions for re-

sources such as shared memory and registers may cause the actual maximum thread

occupancy for an approach to be less than the theoretical maximum occupancy for

the GPU.
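The effect of thread deficiency can be illustrated with a toy model. This is only a sketch under a strong simplifying assumption taken from the description above: the GPU processes threads in batches of up to its maximum thread occupancy, and each batch takes the same time regardless of how full it is. The capacity of 4096 threads is a hypothetical figure, not a specification of the test hardware.

```python
import math


def relative_throughput(total_threads, max_resident_threads):
    """Fraction of peak throughput achieved in the toy batch model."""
    # Each batch costs one unit of time, whether it is full or nearly empty.
    batches = math.ceil(total_threads / max_resident_threads)
    return total_threads / (batches * max_resident_threads)


CAPACITY = 4096  # hypothetical maximum thread occupancy for an approach

for threads in (512, 1024, 2048, 4096, 8192):
    print(threads, relative_throughput(threads, CAPACITY))
```

In this model throughput falls in proportion to the thread count once the total drops below the occupancy limit, which is the regime in which the 1x1xN and 1xGxG approaches lose performance.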

These hardware specifications vary from one GPU to another. Thus there is

not a defined boundary of correlation parameters where one GPU CMAC

approach surpasses the other in performance. For this reason, it is recommended

that the performance of the approaches be measured for the desired correlation

parameters on the GPU hardware that is to be used in order to select the best


approach. Such a measurement can be performed during the initialisation stage of

the GPU correlator. Because this measurement only needs to be performed once, it

will not reduce the performance of the GPU correlator.
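The recommended measurement at initialisation could take the following shape. This is a minimal sketch rather than the code used in this work: the two candidate functions are hypothetical stand-ins for CMAC kernels, and plain Python timing is used in place of GPU timing.

```python
import time


def time_once(kernel, workload):
    """Wall-clock time of a single kernel run."""
    start = time.perf_counter()
    kernel(workload)
    return time.perf_counter() - start


def select_fastest(kernels, workload, repeats=5):
    """Time every candidate on the target workload and return the quickest."""
    timings = {
        name: min(time_once(kernel, workload) for _ in range(repeats))
        for name, kernel in kernels.items()
    }
    return min(timings, key=timings.get), timings


# Hypothetical stand-ins for two CMAC approaches; approach_b does 200 times the work.
def approach_a(samples):
    return sum(samples)


def approach_b(samples):
    total = 0
    for _ in range(200):
        total += sum(samples)
    return total


workload = list(range(2000))
best, timings = select_fastest(
    {"approach_a": approach_a, "approach_b": approach_b}, workload
)
print(best)
```

On real hardware the timing would need to wait for each kernel to complete on the device before the clock is stopped, and the workload would use the correlation parameters of the intended observation.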

Utilising the GPU to optimally process the cross multiplication and accumula-

tion stage of a correlation algorithm is non-trivial. However, the gains that can

be achieved both over a traditional CPU approach, and through the correct choice

of optimised approaches make this worthwhile. This work has investigated sev-

eral possible implementations to obtain a significant gain in overall GPU algorithm

performance.

6.3 GPU FX Correlator Analysis

The results of the preliminary testing and CMAC stage testing were used to develop

the GPU FX correlator, to investigate the performance of the GPU architecture for

FX correlation. An important consideration is whether the GPU implementation

produces correct results. The CUDA programming language used in the testing fol-

lows the IEEE-754 standard for single-precision binary floating-point arithmetic [1].

Some of the more advanced features of this standard are not supported. However,

it is more than sufficient for the calculations presented in this work.

The results of the correctness tests shown previously in Figure 5.10 match those

supplied with the test data. A comparison of the correctness tests from the CPU

and GPU implementation revealed an average relative difference between results of

0.000013. These differences arise because the order of floating point operations in

the two implementations may be different. In particular, the implementations select

the optimal set of FFT radices for each hardware architecture. While mathematically

these radices are interchangeable, minutely different results are obtained

when floating point arithmetic is used. In terms of a real world implementation,

noise both from the environment and the receiving equipment [6] would be far more

significant. It should be noted that both the CPU and GPU are providing 32 bit

floating point approximations to the correct result, and neither should be consid-

ered the absolute truth. More accurate results could be obtained by using double

precision floating point calculations at the cost of performance. Double precision is

available on both the CPU and modern GPU architectures.
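The sensitivity of single precision results to the order of operations can be shown with a small example. This is an illustrative sketch only: the values are chosen to make the effect visible and are not taken from the test data, and single precision is emulated by rounding each intermediate result through the standard `struct` module.

```python
import struct


def f32(x):
    """Round a Python float to the nearest IEEE-754 single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_f32(values):
    """Accumulate in single precision, rounding after every addition."""
    total = 0.0
    for v in values:
        total = f32(total + f32(v))
    return total


# One large value and eight small ones (illustrative, not thesis data).
values = [1.0e8] + [1.0] * 8

forward = sum_f32(values)             # large value first: each 1.0 is rounded away
backward = sum_f32(reversed(values))  # small values first: they survive as 8.0
relative_difference = abs(forward - backward) / backward

print(forward, backward, relative_difference)
```

Both sums are valid single precision approximations of the same mathematical result, yet they differ by a relative amount of roughly 8e-8.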

The GPU correlator consistently outperforms the CPU version. As seen in Figure 5.12,

the GPU performance advantage varies from a factor of 3 for the correlations

that least suit the GPU, to a factor of 70 for those parameters best suited

to the GPU. For the majority of the typical correlation parameter space, the GPU

correlator performs faster by over an order of magnitude. However, the performance

of the GPU implementation is reduced for correlation parameters that are low in

FFT lengths and number of telescope streams. This is due to thread deficiency,

since the low correlation parameters result in a lower number of threads for the CMAC

stage.

The computational load is most demanding for large transform lengths and num-

bers of streams. For this reason, the GPU optimisation concentrated on such param-

eter values. There are parallelisation approaches that could be applied to increase

the number of threads for low correlation parameters, and improve the GPU perfor-

mance. Currently, the GPU correlator uses one thread for each frequency channel

and each pair in the CMAC stage for cases where thread deficiency may be en-

countered. It may be possible to increase the number of threads by using multiple

threads in place of the current single thread to each accumulate a separate part.

This would add the overhead of an additional step to add these subtotals

together; however, this would most likely be more than offset by the resulting

performance boost. Investigation of improving the correlator performance in the

thread deficient regime is left for future research.
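The proposed split accumulation can be sketched as follows. This is a serial illustration of the idea, not a GPU implementation: each slice stands in for the work of one hypothetical thread, and the conjugation of the second stream is an assumed convention.

```python
def cmac(x, y):
    """Accumulate x[t] * conj(y[t]) over all time steps for one frequency channel."""
    total = 0j
    for a, b in zip(x, y):
        total += a * b.conjugate()
    return total


def cmac_split(x, y, parts):
    """Accumulate separate parts independently, then add the subtotals together."""
    n = len(x)
    bounds = [n * p // parts for p in range(parts + 1)]
    # Each slice could be handled by its own thread.
    subtotals = [cmac(x[lo:hi], y[lo:hi]) for lo, hi in zip(bounds, bounds[1:])]
    # The additional step: combine the subtotals.
    return sum(subtotals, 0j)


# Small integer-valued test signals so that the arithmetic is exact.
x = [complex(t % 5, t % 3) for t in range(64)]
y = [complex(t % 7, -(t % 2)) for t in range(64)]

print(cmac(x, y), cmac_split(x, y, parts=4))
```

With four parts there are four times as many independent accumulations available, at the cost of one short extra summation per channel and pair.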

A close examination of Figures 5.12 and 5.13 will reveal that the CPU correlator

prefers the smaller transform lengths (L), whereas the GPU prefers the longer trans-

form lengths. For the CPU, the net computational complexity per stream element

ε is given by

ε ∈ O[N + log2(L)]    (6.1)

for T timesamples per each of N streams and an FFT length L. Thus for a given

number of streams and total number of timesamples, the complexity will scale with

log(L). This increased complexity accounts for the slower CPU performance as L

increases.

However, for the GPU case the complexity is identical yet the results are the

reverse. This is due to the fact that longer transform lengths give the GPU corre-

lator more scope for parallelism, particularly in the CMAC stage of the algorithm.

This can be seen in the GPU L = 128 result in Figure 5.12 as the performance

drops significantly for low numbers of streams. The GPU L = 1024 result fares better

in this regime as the higher transform length results in more active threads during

the CMAC stage. It should be noted that at transform lengths higher than those

typically used in radio astronomy correlation, once there are sufficient threads for

the GPU to perform optimally, subsequent increases in transform length result in a

performance decline similar to that seen in the CPU. The GPU is by no means immune to

the limits of computational complexity, but rather has thread deficiency as an additional

consideration that skews the optimal transform length higher than it would otherwise be.


Aside from the effects of thread deficiency, the GPU correlator is bound by

the memory bandwidth to the GPU memory. This is demonstrated by the 1xGxG

method achieving twice the performance of the other methods in Figures 5.7 and 5.8,

because the shared memory techniques in the 1xGxG method reduce the memory

access by a factor of two. It should be noted that the GPU has the highest memory

bandwidth of the currently available commodity computing devices. The 2006 model

GeForce 8800 GTS used in this research has a memory bandwidth of 64 GB/sec.

Due to the use of the CUFFT library, an exact count of the memory operations in

the GPU correlator is not possible. However, assuming the minimum required access,

the total bytes of global memory access per 1 byte sample would be approximately

26 + 4N. Choosing N = 128 to avoid the effects of thread deficiency, a 64 GB/sec

memory bandwidth should result in a data rate of 118 megasamples per second.

This is consistent with Figure 5.14, with the relevant data point corresponding to

105 megasamples per second. It is expected that the rising trend for the GPU in

Figure 5.15 due to diminishing thread deficiency will level out having reached this

saturation point.
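The memory bound estimate above can be reproduced with a few lines of arithmetic. This sketch assumes that one gigabyte is 10^9 bytes and that the 26 + 4N figure applies to every 1 byte input sample; the function names are illustrative.

```python
def bytes_per_sample(n_streams):
    """Approximate global memory traffic per 1 byte input sample, from 26 + 4N."""
    return 26 + 4 * n_streams


def bandwidth_bound_rate(memory_bandwidth, n_streams):
    """Upper limit on samples per second when limited only by memory bandwidth."""
    return memory_bandwidth / bytes_per_sample(n_streams)


MEMORY_BANDWIDTH = 64e9  # bytes per second, assuming 1 GB = 10**9 bytes

rate = bandwidth_bound_rate(MEMORY_BANDWIDTH, n_streams=128)
print(bytes_per_sample(128), rate / 1e6)
```

The result falls between 118 and 119 megasamples per second, in line with the estimate quoted above and a little above the measured 105 megasamples per second.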

An estimate of the proportion of the GPU computational resources used by these

approaches can be obtained. The GPU used for this research has a maximum

theoretical performance of 345.6 GFLOPs. Using 3N + 5 log2(L) + 5, and selecting N =

128 and L = 1024 to avoid thread deficiency effects, each stream element requires

439 floating point operations. For the measured performance of 105 megasamples

per second from Figure 5.14, this corresponds to 46.1 GFLOPs. It should be noted

that this value only includes operations directly applied to the data. Additional

necessary operations, such as for memory addressing, have not been included as

they can vary between implementations. However, this result does show that if

memory operations could be reduced, the performance of the GPU implementation


could increase by up to a maximum theoretical factor of 7.5.
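The figures in this estimate follow directly from the operation count. The short sketch below repeats the arithmetic using only values quoted in the text; the function name is illustrative.

```python
import math


def flops_per_element(n_streams, fft_length):
    """Floating point operations per stream element: 3N + 5 log2(L) + 5."""
    return 3 * n_streams + 5 * math.log2(fft_length) + 5


PEAK_FLOPS = 345.6e9      # theoretical maximum of the GPU used
MEASURED_SAMPLES = 105e6  # samples per second, from Figure 5.14

ops = flops_per_element(n_streams=128, fft_length=1024)
achieved_flops = ops * MEASURED_SAMPLES
headroom = PEAK_FLOPS / achieved_flops

print(ops, achieved_flops / 1e9, headroom)
```

This gives 439 operations per element, about 46.1 GFLOPs achieved, and a headroom factor of about 7.5.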

Testing did not examine a number of streams beyond N = 128. This is because

the GPU global memory would begin to impose a restriction on the range of transform

lengths tested. This could be overcome by using a multiple GPU approach, in

which each GPU correlates a portion of the stream pairs. As the number of streams

increases such a solution would already be required in order to obtain real time

bandwidths.
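One possible way of dividing the stream pairs between several GPUs is sketched here. It is an illustration under stated assumptions rather than a tested design: autocorrelations are counted as pairs, giving N(N + 1)/2 in total, and the pairs are dealt out in round-robin order.

```python
from itertools import combinations_with_replacement


def stream_pairs(n_streams):
    """All unordered stream pairs, including each stream paired with itself."""
    return list(combinations_with_replacement(range(n_streams), 2))


def partition_pairs(n_streams, n_gpus):
    """Deal the pairs out to the GPUs in round-robin order."""
    pairs = stream_pairs(n_streams)
    return [pairs[g::n_gpus] for g in range(n_gpus)]


shares = partition_pairs(n_streams=8, n_gpus=2)
for gpu, share in enumerate(shares):
    print(gpu, len(share))
```

A round-robin split balances the CMAC workload, but each GPU would still need the transformed data for every stream that appears in its share, so a split that groups pairs by stream may reduce the data each device must hold.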

It is also clear that the PCI-express bus is not a bottleneck, as shown by

Figure 5.14. The graph shows the total rate of input data that the correlators can

process in realtime. The CPU and the GPU are bound by their computational

ability rather than the bus bandwidth through which the input data can reach

the device. For a correlation algorithm, the input bandwidth dominates and the

output bandwidth is negligible in comparison due to the data reduction effect of

accumulation. The GPU correlator is almost saturating the maximum SATA2 data

rate. However, realtime GPU correlator data acquisition will not occur via physical

hard disks, as the highest capacity disks would be processed in a matter of minutes.

Instead the host machine would receive the data streamed over a higher bandwidth

network connection.

With the current results, the GPU processing power would have to grow by

an order of magnitude to saturate the current PCI-express architecture. In the

meantime, future bus architectures such as PCI-express 2 and 3 will continue to

increase the bandwidth between host and device. Many of the compute vs bandwidth

concerns have already been addressed for the CPU, and suggested solutions suit the

parallel nature of the GPU [45]. It is possible that the CPU and GPU architectures

will merge in future hardware designs, removing the need for data transfer between


the two over PCI-Express.

The data used in the testing consisted of real 8 bit samples. While unpacking

data of a different bit precision should have no significant impact on the performance

of the GPU correlator, there may be a slight performance decrease for higher bit

precision due to the associated additional data transfer between the GPU and the

host machine. Conversely a lower bit precision should result in slightly higher per-

formance. The data streams themselves were not interleaved, and thus consecutive

timesamples of a given stream were contiguous in memory. For interleaved samples,

where the samples for all streams that correspond to a given time are contiguous

in memory, the samples would need to be deinterleaved prior to the Fast Fourier

transform. The best way to address this is left for future research.
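For reference, the two memory layouts differ as shown in the following sketch. It is a host-side illustration of deinterleaving with made-up sample values, and does not address how the step would best be performed on the GPU.

```python
def deinterleave(samples, n_streams):
    """Split time-interleaved samples into one contiguous block per stream."""
    # Sample t of stream s sits at index t * n_streams + s in the interleaved layout.
    return [samples[s::n_streams] for s in range(n_streams)]


# Three streams, three time steps: values are (stream + 1) * 10 + time step.
interleaved = bytes([10, 20, 30, 11, 21, 31, 12, 22, 32])
streams = deinterleave(interleaved, n_streams=3)

for stream in streams:
    print(list(stream))
```

After this step the consecutive timesamples of each stream are contiguous, which is the layout used in the testing.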

6.4 Power and Cost Analysis

The exponential growth in the computational performance of processors detailed in

Section 2.5.1 and shown in Figure 2.19(a) has come at the cost of a similar growth in

their power consumption. The power consumption of processing systems has become

a significant budgeting concern for the next generation of radio telescope arrays. For

this reason the energy usage of both the parallel and serial FX correlator models was

explored. This was achieved by measuring the power consumption of the GPU FX

correlator. The power consumption of the serial CPU implementation was estimated

using the power specification provided by the manufacturer. This value does not

include inefficiencies of the power supply, and additional power consumption by the

motherboard and other internal components of the test system.

The results of this exploration were presented in Section 5.3. The direct measurement

of the net power usage of the GPU FX correlator initially showed that it was

higher than the power rating of the serial CPU implementation. The performance

results for the correlator implementations were then taken into consideration. In

terms of performance per watt, the results of Figure 5.16 show the parallel imple-

mentation to be superior. Thus the GPU is significantly more power efficient than

the CPU for this application. Indeed, even if the CPU were running in a system

with a perfect power supply, a motherboard and peripherals with zero power requirements,

and no graphics card, it would still be less power efficient than the real

world GPU.
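The performance per watt metric is simply throughput divided by power. The sketch below pairs the 105 megasamples per second GPU data point with the average measured system power from Table 5.3 and with the 135 watt GPU rating. This pairing is an assumption made for illustration, since the correlation parameters used during the power measurements are not restated here.

```python
def samples_per_second_per_watt(samples_per_second, watts):
    """Correlation throughput divided by power consumption."""
    return samples_per_second / watts


GPU_THROUGHPUT = 105e6     # samples per second, N = 128 and L = 1024
GPU_SYSTEM_POWER = 191.5   # watts, average measured with X11 (Table 5.3)
GPU_RATED_POWER = 135.0    # watts, manufacturer rating

measured = samples_per_second_per_watt(GPU_THROUGHPUT, GPU_SYSTEM_POWER)
ideal = samples_per_second_per_watt(GPU_THROUGHPUT, GPU_RATED_POWER)

print(round(measured), round(ideal))
```

The measured figure is lower than the ideal one because it charges the whole system, including the power supply and motherboard, to the correlator.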

There is also a trend present in Figure 5.16. The power advantage of the GPU

scales with the size of the array. As the array gets larger the power efficiency

improves. This is caused by the superior performance of the CMAC stage kernel

for larger numbers of telescope streams. Thus for the truly large scale instruments

required for future radio astronomy science, a GPU-accelerated correlator should

provide a higher power efficiency than a correlator based on CPUs alone.

Providing a detailed analysis of the relative cost of the CPU and GPU imple-

mentation is problematic. A comparison of performance per dollar, based on the

purchasing price of the equipment, would not necessarily be representative. This is

because the cost of the hardware varies dramatically over time. It should be noted

that such a comparison should take into account the cost of the entirety of the two

systems, and not just compare the CPU and GPU components separately.


6.5 Adaptability Analysis

The real advantage of software correlators is their ability to be adapted easily to

new algorithms for different interferometer configurations and science outcomes.

The polyphase filter stage described in Section 2.3 was added to the GPU correlator

implementation to show that it retains this adaptability. As detailed in Section 4.3,

the polyphase filter stage was added to the unpack stage of the correlation algorithm.

In order to obtain the desired performance, the approach was critically analysed

in a manner similar to that presented in the CMAC stage. The hardware specifi-

cation was considered to ensure sufficient threads to realise the parallelism of the

GPU while not exceeding the available compute resources. These resources include

the number of registers, the available shared memory, and the thread capacity of

the GPU multiprocessor. While finding optimal solutions that fit within these con-

straints is certainly a time consuming process for a novice to the GPU computing

paradigm, with experience this becomes a more expedient process.

The testing of this filter stage was presented in Section 5.4. The results shown

in Figures 5.18(a) and 5.18(b) reveal that not only was this stage successfully

implemented, but that the resulting increase in processing time was less than the

additional computation required for the filter. This is possible due to the SPMD

memory latency hiding of the GPU architecture, described in Section 2.4. To sum-

marise, the additional computation occurred during memory latency already present

in the original algorithm. This has the implication that some additional features can

be added to the algorithm with little performance impact. Some of these features

are discussed as potential future research in the next chapter.

Chapter 7

Conclusion

This chapter summarises the work presented in this thesis. Beginning with the

concept that parallel computing architectures can be used to meet the processing

demands of science, this research has revealed significant results. This includes a

data parallel model of a FX radio signal correlator using a GPU computing ap-

proach. The model has shown that the techniques presented in this work can yield

a system with output matching that of traditional serial approaches, with perfor-

mance gains measured in orders of magnitude. At the same time, these results have

demonstrated that this performance is obtainable at a lower power cost per FLOP

than the serial approach and still maintains a degree of adaptability to new algorith-

mic features. This chapter summarises the individual contributions of this thesis,

and then concludes with future considerations for extending this work.



7.1 Thesis Summary

I first conducted preliminary testing, to address potential bottlenecks in the GPU

compute paradigm revealed by my background research. The purpose of this testing

was to ensure that these bottlenecks were not a significant obstacle before commit-

ting to further development on the GPU. The preliminary tests first investigated

factors affecting the transfer of data between the host and GPU device. The re-

sults of the tests indicated that pageable memory transfers with a minimum size

of eight megabytes resulted in the best host-device bandwidth. Preliminary

testing then investigated the performance of the CUDA fast Fourier transform library,

CUFFT. Results showed that the CUDA FFT was roughly ten times faster than

a FFTW CPU implementation, but that data transfer of 32 bit floating point val-

ues reduced this performance considerably. Since the correlator data consists of a

packed 8 bit integer format, it is one quarter the size of an equivalent 32 bit floating

point representation. I concluded that the transfer of data in its existing 8 bit in-

teger form and subsequent unpacking to floating point on the GPU would mitigate

the performance drop caused by data transfer to the GPU device. Furthermore, the

accumulation that occurs in the CMAC stage of the FX correlation algorithm would

reduce the data transfer from the GPU device to a negligible amount.

I then developed several potential parallel approaches for the CMAC stage kernel.

The purpose of these approaches was to investigate two main correlation parameters:

the length of the FFT, and the number of telescope data streams. The approaches

varied from memory efficient models that reused memory fetches from the GPU

device memory, to extremely parallel approaches that contained a larger number

of threads. These approaches were tested for the ranges of correlation parameters

commonly used in radio astronomy, to determine the best approach for a given set


of parameters. The results of my testing showed that for FFT lengths larger than

512, or numbers of telescope streams larger than 16, the memory efficient model was

superior. However, for small FFT lengths and small numbers of telescope streams,

the approach which contained more threads was more appropriate.

Taking the best CMAC stage kernels, I then implemented the entire GPU FX correlation algorithm. The purpose of this implementation was to determine the suitability of the GPU architecture to radio astronomy correlation. The GPU implementation was tested for correctness and performance. My results showed that the GPU implementation produced correct results, and performed up to a hundred times faster than a comparable serial CPU implementation. The performance trends of the previous CMAC stage testing were evident in the full FX correlation implementation results. This is because the CMAC stage is the most computationally intensive stage in the algorithm. From the performance results, I concluded that the GPU architecture was indeed suited to radio astronomy correlation.

However, the power usage of the GPU was a concern, since the power consumption of computing facilities has become a significant budget consideration. For this reason, I measured the power consumption of the GPU FX correlator. The power consumption of the serial CPU implementation was estimated using the power specification provided by the manufacturer. This value does not include inefficiencies of the power supply, or the additional power consumed by the motherboard and other internal components of the test system. While the GPU correlator did use more power than the CPU rating, I also considered the relative performance output of each implementation. In terms of performance per watt, the GPU implementation was superior by up to a factor of 30.
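The performance per watt comparison reduces to a single ratio. The sketch below uses a hypothetical function name and takes the speedup and the two power figures as inputs; it does not reproduce the measured values from this work.

```c
/* Ratio of GPU to CPU performance per watt.
 *   speedup   : GPU throughput divided by CPU throughput
 *   gpu_watts : measured power of the GPU correlator
 *   cpu_watts : CPU power from the manufacturer specification */
double perf_per_watt_gain(double speedup, double gpu_watts, double cpu_watts)
{
    return speedup * cpu_watts / gpu_watts;
}
```

For example, with placeholder inputs of a 100 times speedup at twice the power, the gain in performance per watt would be 50.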

Finally, I also modified the GPU implementation with the addition of a polyphase filter stage. The purpose of adding this stage was to investigate how easily GPU algorithms could be modified. In order to achieve desirable performance, I applied the GPU programming techniques developed while investigating the best CMAC stage approach. My understanding of the GPU computing paradigm was critical. The resulting polyphase filter implementation was then tested. Since the polyphase filter included additional computation, I expected the performance of the GPU implementation to drop accordingly. However, the implementation performance was better than expected. I concluded that some of the additional computation was hidden by the memory latency hiding mechanisms of the GPU hardware.

7.2 Future Research

This research has thoroughly investigated the parallel implementation of an FX correlator on the GPU architecture. However, there are many related areas yet to be explored. This section lists some of these areas. These include additional features of the correlator itself, and the rest of the aperture synthesis pipeline. The scaling of this work to cluster computing, and alternative hardware architectures, are also discussed.

The FX correlation algorithms, both CPU and GPU, represent a simple correlation benchmark framework. Additional features for specific correlation array configurations, such as delay compensation, corner turning, and fringe rotation, are not implemented. The omission of these features was due to time constraints, and there is no barrier to their implementation on the GPU. Should their computation fall within global memory latency, it is possible that there will be little additional overhead in the GPU correlator.


Other frequency filters could also be examined. The vanilla FFT approach contains inherent leakage of a non-aligned frequency into the other spectral bins, reducing the signal to noise ratio [52]. This is traditionally addressed in radio astronomy by the polyphase filter approach also presented in Chapter 4. Given the low arithmetic intensity of the FFT, it is possible that an alternative approach that traditionally has a higher computational cost may be viable on the GPU architecture.

The aperture synthesis pipeline, as introduced in Section 2.1.3, consists of several sequential parts that convert the one dimensional radio signals collected by the telescopes into two dimensional images of the radio source. The parallel FX correlator demonstrated in this work forms the first of these parts. As reviewed in Chapter 3, Wayth and Dale have implemented a parallel version of the latter stages of aperture synthesis [103]. Subsequent work could focus on parts of the pipeline not yet addressed, such as the image deconvolution techniques introduced in Section 2.1.3.

Although the parallel implementation of the correlator is significantly faster than a serial approach, it is still only able to process a finite amount of data in real time. In order to deal with the scale of data foreshadowed in Section 3, a multitude of GPU devices would be required. Consequently, the correlation implementation must be parallelised across multiple GPU devices.

A possible approach would be to copy the techniques used by radio spectrometry hardware. The multichannel receiver introduced in Section 2.1.1 splits the incoming frequencies into bands. This approach could also be used in the case of a GPU correlator cluster. In this scheme, each GPU device would correlate one band of the overall bandwidth.
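The band-splitting scheme amounts to mapping each device index to a contiguous range of frequency channels. The following is a minimal sketch with hypothetical names (band, band_for_device), assuming the channels are divided as evenly as possible and any remainder is given to the lowest numbered devices.

```c
/* A contiguous range of frequency channels assigned to one device. */
typedef struct {
    int first;  /* index of the first channel in the band */
    int count;  /* number of channels in the band */
} band;

/* Return the band handled by `device` when `channels` channels are
 * split across `devices` GPU devices (0 <= device < devices). */
band band_for_device(int channels, int devices, int device)
{
    int base = channels / devices;   /* channels every device gets */
    int extra = channels % devices;  /* leftover channels to spread */
    band b;
    b.first = device * base + (device < extra ? device : extra);
    b.count = base + (device < extra ? 1 : 0);
    return b;
}
```

Because the CMAC stage treats each frequency channel independently, bands assigned this way would not need to exchange data during correlation.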

Another potential approach would be to parallelise by data streams. In this scheme, each GPU device would process a group of baseline pairs. A drawback of this approach is that the unpacking and Fourier transform stages of the correlator pipeline, shown in Figure 4.1, would need to be processed multiple times for some streams. Although the computational complexity of these stages is less than that of the CMAC stage, they are by no means negligible.
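The repeated work can be quantified under one particular partitioning, which is an assumption made for illustration and not a scheme evaluated in this work: the streams are divided into equal groups, and each device is given one pair of groups (a, b) with a <= b from the triangle of baselines. The hypothetical helper below counts how many devices would have to unpack and transform the streams of a given group.

```c
/* Count the devices that need the streams of group k, when the
 * baseline triangle is tiled into group pairs (a, b) with a <= b
 * and each pair is assigned to its own device. */
int transforms_for_group(int groups, int k)
{
    int count = 0;
    for (int a = 0; a < groups; a++)
        for (int b = a; b < groups; b++)
            if (a == k || b == k)
                count++;
    return count;
}

/* Number of devices used by this tiling. */
int devices_needed(int groups)
{
    return groups * (groups + 1) / 2;
}
```

Under this tiling every stream is unpacked and transformed once per group, so the redundant work grows linearly with the number of groups while the number of devices grows quadratically.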

While CUDA is an excellent parallel language for implementing scientific algorithms on the GPU, it limits the resulting program to vendor-specific hardware. The GPU computing field is rapidly maturing, and approaching standardisation. OpenCL (Open Computing Language) is an open royalty-free standard for general purpose parallel programming across CPUs, GPUs, and other processors, giving software developers portable and efficient access to the power of these heterogeneous processing platforms [66]. The implementation of a parallel correlator in such a language would increase its accessibility for the radio astronomy community.

References

[1] IEEE standard for binary floating-point arithmetic. ANSI/IEEE Std 754-1985, 1985. Technical Report.

[2] J. G. Ables. Maximum Entropy Spectral Analysis. Astronomy and Astrophysics Supplement, 15:383–+, June 1974.

[3] AMD. AMD stream computing: Software stack. 2007. Internet, http://ati.amd.com/technology/streamcomputing/resources.html, accessed 03/12/2008.

[4] Gene M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. Readings in computer architecture, pages 79–81, 2000.

[5] R. G. Belleman, J. Bedorf, and S. Portegies Zwart. High Performance Direct Gravitational N-body Simulations on Graphics Processing Units – II: An implementation in CUDA. ArXiv e-prints, 707, July 2007.

[6] F. H. Briggs, J. F. Bell, and M. J. Kesteven. Removing Radio Interference from Contaminated Astronomical Spectra Using an Independent Reference Signal and Closure Relations. 120:3351–3361, December 2000. arXiv:astro-ph/0006222.

[7] R. H. Brown and A. C. B. Lovell. The exploration of space by radio. Chapman and Hall Ltd, 1957.

[8] Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston, and Pat Hanrahan. Brook for GPUs: stream computing on graphics hardware. ACM Trans. Graph., 23(3):777–786, 2004.

[9] John Bunton. Multi-resolution fx correlator. ALMA memo 447, Feb 2003.

[10] B. F. Burke and F. Graham-Smith. An Introduction to Radio Astronomy. Cambridge University Press, 1997.


[11] I-Liang Chern and Ian T. Foster. Parallel implementation of a control volume method for solving pdes on the sphere. In Proceedings of the Fifth SIAM Conference on Parallel Processing for Scientific Computing, pages 301–306, Philadelphia, PA, USA, 1992. Society for Industrial and Applied Mathematics.

[12] Y. Chikada, M. Ishiguro, H. Hirabayashi, M. Morimoto, K. I. Morita, K. Miyazawa, K. Nagane, K. Murata, A. Tojo, S. Inoue, T. Kanzawa, and H. Iwashita. A Digital FFT Spectro-Correlator for Radio Astronomy. In J. A. Roberts, editor, Indirect Imaging. Measurement and Processing for Indirect Imaging, page 387, 1984.

[13] S. Chikada, Y.; Ishiguro, M.; Hirabayashi, H.; Morimoto, M.; Morita, K.; Kanzawa, T.; Iwashita, H.; Nakazima, K.; Ishikawa, S.; Takahashi, T.; Handa, K.; Kasuga, T.; Okumura, S.; Miyazawa, T.; Nakazuru, T.; Miura, K.; Nagasawa. A 6 × 320-MHz 1024-channel FFT cross-spectrum analyzer for radio astronomy. Proceedings of the IEEE, 75(9):1203–1210, September 1987.

[14] D. Cook, J. Ioannidis, A. Keromytis, and J. Luck. Cryptographics: Secret key cryptography using graphics cards, 2005.

[15] James W. Cooley and John W. Tukey. An algorithm for the machine calculation of complex fourier series. Mathematics of Computation, 19(90):297–301, 1965.

[16] Greg Coombe, Mark J. Harris, and Anselmo Lastra. Radiosity on graphics hardware. In Graphics Interface, pages 161–168, 2004.

[17] T. J. Cornwell. Ska and Evla Computing Costs for Wide Field Imaging. Experimental Astronomy, 17:329–343, June 2004.

[18] T.J. Cornwell and Ger van Diepen. Scaling mount exaflop: from the pathfinders to the square kilometre array. 2008.

[19] CSIRO. The csiro parkes radio telescope. 2007. Internet, http://www.scienceimage.csiro.au/index.cfm?event=site.image.detail&id=4030, accessed 24/12/2008.

[20] CSIRO. Science image : Pricing and licences. 2007. Internet, http://www.scienceimage.csiro.au/index.cfm?event=site.pricing, accessed 24/12/2008. Permission to use images free of charge obtained via email.

[21] W. J. Dally, P. Hanrahan, M. Erez, T. J. Knight, F. Labonte, J-H A., N. Jayasena, U. J. Kapasi, A. Das, J. Gummaraju, and I. Buck. Merrimac: Supercomputing with streams. In SC’03, Phoenix, Arizona, November 2003.

[22] A. Deller, S. Tingay, M. Bailes, and C. West. Distributed FX software correlation for eVLBI. In Proceedings of the 8th European VLBI Network Symposium, 2006.


[23] Adam T. Deller, S. J. Tingay, M. Bailes, and C. West. DiFX: A software correlator for very long baseline interferometry using multi-processor computing environments. 2007. astro-ph/0702141.

[24] Kelly Dempski. Real-time Rendering Tricks and Techniques in DirectX. Thomson Course Technology, 2002.

[25] S. W. Ellingson and W. Cazemier. Efficient multibeam synthesis with interference nulling for large arrays. IEEE Transactions on Antennas and Propagation, 51:503–511, March 2003.

[26] Bowman J. D. et al. Field Deployment of Prototype Antenna Tiles for the Mileura Widefield Array Low Frequency Demonstrator. 133:1505–1518, April 2007. arXiv:astro-ph/0611751.

[27] Zhe Fan, Feng Qiu, Arie Kaufman, and Suzanne Yoakum-Stover. GPU cluster for high performance computing. In SC ’04: Proceedings of the 2004 ACM/IEEE conference on Supercomputing, page 47, Washington, DC, USA, 2004. IEEE Computer Society.

[28] Randima Fernando, editor. GPU Gems: Programming Techniques, Tips, and Tricks for Real-Time Graphics. Addison-Wesley, 2004.

[29] Randima Fernando and Mark J. Kilgard. The Cg Tutorial: The Definitive Guide to Programmable Real-Time Graphics. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2003.

[30] M.J. Flynn. Very high-speed computing systems. Proceedings of the IEEE, 54(12):1901–1909, Dec. 1966.

[31] M. Frigo and S. G. Johnson. The fastest fourier transform in the west. Technical report, Cambridge, MA, USA, 1997.

[32] Wilson W. L. Fung, Ivan Sham, George Yuan, and Tor M. Aamodt. Dynamic warp formation and scheduling for efficient gpu control flow. In MICRO ’07: Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, pages 407–420, Washington, DC, USA, 2007. IEEE Computer Society.

[33] Dominik Goddeke, Robert Strzodka, Jamaludin Mohd-Yusof, Patrick McCormick, Sven H. M. Buijssen, Matthias Grajewski, and Stefan Turek. Exploring weak scalability for fem calculations on a gpu-enhanced cluster. Parallel Comput., 33(10-11):685–699, 2007.

[34] Naga K. Govindaraju, Scott Larsen, Jim Gray, and Dinesh Manocha. A memory model for scientific algorithms on graphics processors. Technical report, UNC, 2006.


[35] GPGPU. General-purpose computation using graphics hardware. 2008. Internet, http://www.gpgpu.org/, accessed 16/12/2008.

[36] John L. Gustafson. Reevaluating amdahl’s law. Commun. ACM, 31(5):532–533, 1988.

[37] K.G. Haines, J.A. Moya, and T.P. Caudell. Modeling nonsynaptic communication between neurons in the lamina ganglionaris of musca domestica. Neural Networks, 1999. IJCNN ’99. International Joint Conference on, 1:131–136 vol.1, 1999.

[38] P. J. Hall. The Square Kilometre Array: An Engineering Perspective. The Square Kilometre Array: An Engineering Perspective, Edited by Peter J. Hall. 2005 V, 430 p. 1-4020-3797-X. Berlin: Springer, 2005., 2005.

[39] Mark J. Harris, Greg Coombe, Thorsten Scheuermann, and Anselmo Lastra. Physically-based visual simulation on graphics hardware. SIGGRAPH Eurographics Workshop on Graphics Hardware, 2002.

[40] Owen Harrison and John Waldron. Optimising data movement rates for parallel processing applications on graphics processors. In Parallel and Distributed Computing and Networks, 2007.

[41] A. Hewish, S. J. Bell, J. D. Pilkington, P. F. Scott, and R. A. Collins. Observation of a Rapidly Pulsating Radio Source. Nature, 217:709–+, February 1968.

[42] J. A. Hogbom. Aperture Synthesis with a Non-Regular Distribution of Interferometer Baselines. Astronomy and Astrophysics Supplement, 15:417, June 1974.

[43] K. G. Jansky. Directional Studies of Atmospherics at High Frequencies. In N. Kassim, M. Perez, W. Junor, and P. Henning, editors, Astronomical Society of the Pacific Conference Series, volume 345 of Astronomical Society of the Pacific Conference Series, pages 3–15, December 2005.

[44] Marcin Jedrzejewski and Krzyszt Marasek. Computation of room acoustics using programmable video hardware. In International Conference on Computer Vision and Graphics, September 2004.

[45] Eric E. Johnson. Graffiti on the memory wall. SIGARCH Comput. Archit. News, 23(4):7–8, 1995.

[46] Arvind Krishnamurthy and Katherine A. Yelick. Optimizing parallel spmd programs. In LCPC ’94: Proceedings of the 7th International Workshop on Languages and Compilers for Parallel Computing, pages 331–345, London, UK, 1995. Springer-Verlag.


[47] Jens Krueger and Ruediger Westermann. Linear algebra operators for GPU implementation of numerical algorithms. ACM Transactions on Graphics (TOG), 22(3):908–916, 2003.

[48] S. R. Kulkarni, S. B. Anderson, T. A. Prince, and A. Wolszczan. Old pulsars in the low-density globular clusters M13 and M53. Nature, 349:47–49, January 1991.

[49] Muckul. R. Kundu. Solar Radio Astronomy. John Wiley & Sons Inc, November 1965.

[50] S. J. Lilly. Discovery of a radio galaxy at a redshift of 3.395. Astrophysics Journal, 333:161–167, October 1988.

[51] Colin J. Lonsdale, Sheperd S. Doeleman, and Divya Oberoi. Efficient imaging strategies for next-generation radio arrays. The Square Kilometre Array: An Engineering Perspective, pages 345–362, January 2005.

[52] Richard G. Lyons. Understanding Digital Signal Processing (2nd Edition). Prentice Hall PTR, Upper Saddle River, NJ, USA, 2004.

[53] John Markoff. Intel’s big shift after hitting technical wall. The New York Times, 2004.

[54] H. Markram. The blue brain project. NATURE REVIEWS NEUROSCIENCE, 7(2):153–160, 2006.

[55] Berna L. Massingill, Timothy G. Mattson, and Beverly A. Sanders. Patterns for parallel application programs. In Proceedings of the Sixth Pattern Languages of Programs Workshop, 1999.

[56] Berna L. Massingill, Timothy G. Mattson, and Beverly A. Sanders. Reengineering for parallelism: an entry point into plpp for legacy applications: Research articles. Concurrency and Computation: Practice and Experience, 19(4):503–529, 2007.

[57] Michael D. McCool, Zheng Qin, and Tiberiu S. Popa. Shader metaprogramming. In HWWS ’02: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware, pages 57–68, Aire-la-Ville, Switzerland, Switzerland, 2002. Eurographics Association.

[58] Michael D. McCool, Kevin Wadleigh, Brent Henderson, and Hsin-Ying Lin. Performance evaluation of gpus using the rapidmind development platform. In SC ’06: Proceedings of the 2006 ACM/IEEE conference on Supercomputing, page 181, New York, NY, USA, 2006. ACM.

[59] J. Michalakes and M. Vachharajani. Gpu acceleration of numerical weather prediction. Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on, pages 1–7, April 2008.


[60] A. A. Michelson. On the Application of Interference Methods to Astronomical Measurements. Proceedings of the National Academy of Science, 6:474–475, August 1920.

[61] John S. Montrym, Daniel R. Baum, David L. Dignam, and Christopher J. Migdal. Infinitereality: a real-time graphics system. In SIGGRAPH, 1997.

[62] G. E. Moore. Cramming more components onto integrated circuits. Electronics, 38(8):114–117, 1965.

[63] J. M. Moran. Thirty Years of VLBI: Early Days, Successes, and Future. In J. A. Zensus, G. B. Taylor, and J. M. Wrobel, editors, IAU Colloq. 164: Radio Emission from Galactic and Extragalactic Compact Sources, volume 144 of Astronomical Society of the Pacific Conference Series, 1998.

[64] Kenneth Moreland and Edward Angel. The FFT on a GPU. Graphics Hardware, 2003.

[65] S. R. Mosier and J. Fainberg. A new high-speed solar spectrograph for meter and decameter wavelengths. Solar Physics, 40:501–509, February 1975.

[66] Aaftab Munshi. The OpenCL specification. Technical report, 2008.

[67] Hubert Nguyen, editor. GPU Gems 3. Addison-Wesley, 2007.

[68] NVIDIA. New nvidia GPU breaks one billion pixels per second barrier. Press Release, Internet, 2000. http://www.nvidia.com/.

[69] NVIDIA. Nvidia unveils cuda - the gpu computing revolution begins, November 2006. NVIDIA Press Release.

[70] NVIDIA. CUDA CUBLAS Library 1.0. June 2007.

[71] NVIDIA. CUDA CUFFT Library 1.0. June 2007.

[72] NVIDIA. CUDA Programming Guide 1.0. June 2007.

[73] National Radio Astronomy Observatory. Jansky antenna. 2008. Internet, http://images.nrao.edu/Historical/Telescopes/107, accessed 15/12/2008.

[74] National Radio Astronomy Observatory. Nrao image use policy. 2008. Internet, http://images.nrao.edu/image use.shtml, accessed 15/12/2008.

[75] Alan V. Oppenheim, Ronald W. Schafer, and John R. Buck. Discrete-Time Signal Processing (2nd Edition). Prentice Hall, February 1999.

[76] J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C. Phillips. Gpu computing. Proceedings of the IEEE, 96(5):879–899, May 2008.


[77] John D. Owens, David Luebke, Naga Govindaraju, Mark Harris, Jens Krüger, Aaron E. Lefohn, and Timothy J. Purcell. A survey of general-purpose computation on graphics hardware. Computer Graphics Forum, 26(1):80–113, 2007.

[78] Aaron Parsons, Donald Backer, Chen Chang, Daniel Chapman, Henry Chen, Patrick Crescini, Christina de Jesus, Chris Dick, Pierre Droz, David MacMahon, Kirsten Meder, Jeff Mock, Vinayak Nagpal, Borivoje Nikolic, Arash Parsa, Brian Richards, Andrew Siemion, John Wawrzynek, Dan Werthimer, and Melvyn Wright. Petaop/second fpga signal processing for seti and radio astronomy. Signals, Systems and Computers, 2006. ACSSC ’06. Fortieth Asilomar Conference on, pages 2031–2035, Oct.-Nov. 2006.

[79] R. B. Partridge. 3K: The Cosmic Microwave Background Radiation. Cambridge University Press, September 1995.

[80] Marshall C. Pease. An adaptation of the fast fourier transform for parallel processing. Journal of the ACM, 15(2):252–264, 1968.

[81] D.C. Pham, T. Aipperspach, D. Boerstler, M. Bolliger, R. Chaudhry, D. Cox, P. Harvey, P.M. Harvey, H.P. Hofstee, C. Johns, J. Kahle, A. Kameyama, J. Keaty, Y. Masubuchi, M. Pham, J. Pille, S. Posluszny, M. Riley, D.L. Stasiak, M. Suzuoki, O. Takahashi, J. Warnock, S. Weitzel, D. Wendel, and K. Yazawa. Overview of the architecture, circuit design, and physical implementation of a first-generation cell processor. IEEE Journal of Solid-State Circuits, 41:179–196, 2006.

[82] Matt Pharr, editor. GPU Gems 2: Programming Techniques for High-Performance Graphics and General-Purpose Computation. Addison-Wesley, 2005.

[83] Timothy J. Purcell, Ian Buck, William R. Mark, and Pat Hanrahan. Ray tracing on programmable graphics hardware. ACM Transactions on Graphics, 21(3):703–712, July 2002.

[84] Michael J. Quinn. Parallel Computing. McGraw-Hill Inc., 1994.

[85] Lawrence R. Rabiner. Multirate Digital Signal Processing. Prentice Hall PTR, Upper Saddle River, NJ, USA, 1996.

[86] K. Rohlfs, T. L. Wilson, and S. Huttemeister. Tools of Radio Astronomy. Springer, 2009.

[87] J. D. Romney. Cross Correlators, volume 180 of Astronomical Society of the Pacific Conference Series. 1999.

[88] Randi J. Rost. OpenGL(R) Shading Language (2nd Edition). Addison-Wesley Professional, 2005.


[89] M. Ryle. A new radio interferometer and its application to the observation of weak radio stars. Proceedings of the Royal Society of London. Series A, Mathematical and Physical Sciences, 211(1106):351–375, 1952.

[90] M. Ryle and D. D. Vonberg. Solar Radiation on 175 Mc./s. Nature, 158:339–340, September 1946.

[91] Kjeld Schaaf and Ruud Overeem. Cots correlator platform. Experimental Astronomy, 17(1-3):287–297, June 2004.

[92] Hsi-Yu Schive, Chia-Hung Chien, Shing-Kwong Wong, Yu-Chih Tsai, and Tzihong Chiueh. Graphic-card cluster for astrophysics (graCCA) performance tests. ArXiv e-prints, July 2007.

[93] H. Schomberg and J. Timmer. The gridding method for image reconstruction by fourier transformation. Medical Imaging, IEEE Transactions on, 14(3):596–607, Sep 1995.

[94] Amar Shan. Heterogeneous processing: a strategy for augmenting moore’s law. Linux Journal, Jan 2006.

[95] Mark Silberstein, Assaf Schuster, Dan Geiger, Anjul Patney, and John D. Owens. Efficient computation of sum-products on gpus through software-managed cache. In ICS ’08: Proceedings of the 22nd annual international conference on Supercomputing, pages 309–318, New York, NY, USA, 2008. ACM.

[96] A. G. Smith. Radio exploration of the sun. Van Nostrand Momentum Books, Princeton: Van Nostrand, 1967, 1967.

[97] J. L. Steinburg and J Lequeux. Radio Astronomy. McGraw-Hill Book Company, Inc., 1963.

[98] R. Westermann T. Schiwietz, T. Chang, P. Speier. MR image reconstruction using the GPU. In Proceedings of SPIE Medical Imaging 2006, San Diego, CA, February 2006. SPIE.

[99] A. R. Thompson, J. M. Moran, and G. W. Swenson, Jr. Interferometry and Synthesis in Radio Astronomy, 2nd Edition. Wiley, April 2001.

[100] Jack Tomlinson. Computation of flops requirements for a wideband spectrum analyzer. Texas Memory Systems, Inc, May 2004.

[101] P. Trancoso and M. Charalambous. Exploring graphics processor performance for general purpose applications. Digital System Design, 2005. Proceedings. 8th Euromicro Conference on, pages 306–313, Aug.-3 Sept. 2005.


[102] Suresh Venkatasubramanian. The graphics card as a stream computer. In SIGMOD-DIMACS Workshop on Management and Processing of Data Streams, 2003.

[103] R. Wayth, K. Dale, L. J. Greenhill, D. A. Mitchell, S. Ord, and H. Pfister. Data Processing Using GPUs for The MWA. In Bulletin of the American Astronomical Society, volume 38 of Bulletin of the American Astronomical Society, pages 744–+, December 2007.

[104] S. Weinreb, A. H. Barrett, M. L. Meeks, and J. C. Henry. Radio Observations of OH in the Interstellar Medium. Nature, 200:829–+, November 1963.

[105] Sean Whalen. Audio and the graphics processing unit. In IEEE Vis 2004 GPGPU Tutorial, March 2004.

[106] Mason Woo, Jackie Neider, Tom Davis, and Dave Shreiner. OpenGL Programming Guide: The Official Guide to Learning OpenGL, Version 1.2. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.

[107] J. L. Yen. The Role of Fast Fourier Transform Computers in Astronomy. Astronomy and Astrophysics Supplement, 15:483, June 1974.

[108] V. V. Zheleznyakov. Radio Emission of the Sun and Planets. Pergamon Press, 1970.

[109] Simon Portegies Zwart, Robert Belleman, and Peter Geldof. High performance direct gravitational n-body simulations on graphics processing units, 2007.


Appendix A

Code

The following sections contain the GPU kernels and accompanying wrapper functions for the various stages of the correlation algorithm, as well as for the polyphase filter. Additional host code pertaining to the initialisation, memory management, host-device memory transfer, I/O, and algorithm control has been omitted for the sake of brevity.

A.1 Unpack Stage Kernel

/**
 * These routines take N telescope streams consisting of
 * packed 8 bit samples and unpack them to a floating point
 * representation. For input, they take a pointer to a
 * GPU-resident buffer that contains the packed data, grouped
 * by stream. They output to a GPU-resident buffer specified
 * by a second pointer.
 *
 * The following variables are used:
 *   out  - a pointer to the output buffer
 *   in   - a pointer to the input buffer
 *   size - the total number of samples in the input buffer
 *
 * The following routines are available:
 *   up : the unpack routine
 */

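// GPU Kernel for unpack operation
// (called from the wrapper below)
__global__ void unpack(float2 *ubuff, uchar1 *pbuff, int s)
{
    // Global thread index and total number of threads in the grid
    const int index = __mul24(blockIdx.x,blockDim.x)+threadIdx.x;
    const int inc = __mul24(gridDim.x,blockDim.x);
    // Grid-stride loop: each thread unpacks every inc-th sample
    for (int pos = index; pos < s; pos += inc) {
        uchar1 word_c = pbuff[pos];
        float2 word_f;
        // Remove the offset of the unsigned 8 bit sample;
        // the imaginary part of the real input is zero
        word_f.x = (float)word_c.x-128.0f;
        word_f.y = 0.0f;
        ubuff[pos] = word_f;
    }
}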

// Kernel wrapper routine for unpack operation
// (in is typed uchar1* to match the kernel's packed input)
void up(float2 *out, uchar1 *in, int size) {
    // 36 blocks of 128 threads; the kernel strides over the buffer
    dim3 grid(3*12, 1, 1);
    dim3 block(128, 1, 1);
    unpack<<<grid,block>>>(out,in,size);
}
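When checking the unpack stage for correctness, a serial host-side reference is useful for comparison. The following is a minimal sketch of such a reference, not the CPU correlator used in this work; the type cfloat and the function unpack_ref are hypothetical names, with cfloat standing in for CUDA's float2.

```c
#include <stddef.h>

/* Host-side stand-in for CUDA's float2 (hypothetical name). */
typedef struct { float x, y; } cfloat;

/* Serial reference for the unpack kernel: convert each unsigned
 * 8 bit sample to a complex float with zero imaginary part,
 * removing the same offset of 128 as the GPU kernel. */
void unpack_ref(cfloat *out, const unsigned char *in, size_t size)
{
    for (size_t i = 0; i < size; i++) {
        out[i].x = (float)in[i] - 128.0f;
        out[i].y = 0.0f;
    }
}
```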

A.2 CMAC Stage Kernels

/**
 * These routines take N telescope streams, each consisting of a
 * timeseries of S spectra with L frequency channels, and
 * conjugate multiply and accumulate the signals to produce
 * N(N+1)/2 output spectra. For input, they take a
 * pointer to a GPU-resident buffer that contains the spectra,
 * grouped by stream. They output to a GPU-resident buffer
 * specified by a second pointer.
 *
 * The following variables are used:
 *   out - a pointer to the output buffer
 *   in  - a pointer to the input buffer
 *   l   - the length of the fourier transform used to produce
 *         the complex spectra
 *   n   - the number of telescope signals that are present
 *         in the input buffer
 *   t0  - the spectra index to begin accumulation for this
 *         kernel call
 *   tN  - the spectra index to stop accumulating for this
 *         kernel call
 *   tT  - the total number of spectra per accumulation,
 *         possibly spanning multiple calls.
 *
 * The following routines are available:
 *   a_1x1   : the 1x1x1 approach
 *   a_1xG_4 : the 1xGxG approach, for G=4
 *   a_1xN   : the 1x1xN approach
 */

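// GPU Kernel for 1x1x1 accumulation
// (called from the wrapper below)
__global__ void accumulate_1x1(float2 *out, float2 *in,
        int lo2, int n, int t0, int tN, int tT)
{
    // blockIdx.y enumerates all n*n stream pairs; only the
    // pairs with ni<=nj (one triangle) are computed
    int ni = blockIdx.y/n;
    int nj = blockIdx.y%n;
    if (ni<=nj)
    {
        // Frequency channel handled by this thread
        int idx = __mul24(blockIdx.x,blockDim.x)+threadIdx.x;
        float2 l_sum = make_float2(0.0f,0.0f);
        // Step through spectra t0..tN-1, one channel per spectrum
        for (int pos=t0*(lo2*2)+idx; pos<tN*(lo2*2); pos+=lo2*2)
        {
            float2 chj = in[nj*(lo2*2)*tT+pos];
            float2 chi = in[ni*(lo2*2)*tT+pos];
            // chj multiplied by the complex conjugate of chi
            l_sum.x += chj.x*chi.x + chj.y*chi.y;
            l_sum.y += chj.y*chi.x - chj.x*chi.y;
        }
        // Add the local sum to the running total in device memory
        int pos = (((nj*(nj+1))/2)+ni)*lo2+idx;
        float2 g_sum = out[pos];
        g_sum.x += l_sum.x;
        g_sum.y += l_sum.y;
        out[pos] = g_sum;
    }
}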

// Kernel wrapper routine for 1x1x1 accumulation
void a_1x1(float2 *out, float2 *in,
        int l, int n, int t0, int tN, int tT)
{
    // l/128 blocks of 64 threads cover the l/2 output channels
    dim3 grid(l/128, n*n, 1);
    dim3 block(64, 1, 1);
    accumulate_1x1<<<grid,block>>>(out,in,l/2,n,t0,tN,tT);
}

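// GPU Kernel for 1xGxG (G=4) accumulation
// (called from the wrapper below)
__global__ void accumulate_1xG_4(float2 *out, float2 *in,
        int lo2, int no4, int t0, int tN, int tT)
{
    // Each block handles a 4x4 tile of stream pairs
    int mj = blockIdx.y/no4;
    int mi = blockIdx.y%no4;
    if (mj<=mi)
    {
        int lx = __mul24(blockIdx.x,blockDim.x)+threadIdx.x;
        int nj = threadIdx.y;
        int xx = threadIdx.x;
        float2 l_sum0 = make_float2(0.0f,0.0f);
        float2 l_sum1 = make_float2(0.0f,0.0f);
        float2 l_sum2 = make_float2(0.0f,0.0f);
        float2 l_sum3 = make_float2(0.0f,0.0f);
        // The four mi-group streams are shared by the whole block
        __shared__ float2 x_ni[4][32];
        for (int tx=t0; tx<tN; tx++)
        {
            // Stream stride is tT spectra, as in accumulate_1x1
            float2 x_nj = in[((4*mj+nj)*tT+tx)*(lo2*2)+lx];
            x_ni[nj][xx] = in[((4*mi+nj)*tT+tx)*(lo2*2)+lx];
            __syncthreads();
            l_sum0.x += x_nj.x*x_ni[0][xx].x + x_nj.y*x_ni[0][xx].y;
            l_sum0.y += x_nj.y*x_ni[0][xx].x - x_nj.x*x_ni[0][xx].y;
            l_sum1.x += x_nj.x*x_ni[1][xx].x + x_nj.y*x_ni[1][xx].y;
            l_sum1.y += x_nj.y*x_ni[1][xx].x - x_nj.x*x_ni[1][xx].y;
            l_sum2.x += x_nj.x*x_ni[2][xx].x + x_nj.y*x_ni[2][xx].y;
            l_sum2.y += x_nj.y*x_ni[2][xx].x - x_nj.x*x_ni[2][xx].y;
            l_sum3.x += x_nj.x*x_ni[3][xx].x + x_nj.y*x_ni[3][xx].y;
            l_sum3.y += x_nj.y*x_ni[3][xx].x - x_nj.x*x_ni[3][xx].y;
            __syncthreads();
        }
        // Write back only the pairs that lie within the triangle
        int xj = 4*mj+nj;
        int xi;
        int pos;
        float2 g_sum;
        xi = 4*mi+0;
        if (xj<=xi)
        {
            pos = ((xi*(xi+1))/2+(xj))*lo2+lx;
            g_sum = out[pos];
            g_sum.x += l_sum0.x;
            g_sum.y += l_sum0.y;
            out[pos] = g_sum;
        }
        xi = 4*mi+1;
        if (xj<=xi)
        {
            pos = ((xi*(xi+1))/2+(xj))*lo2+lx;
            g_sum = out[pos];
            g_sum.x += l_sum1.x;
            g_sum.y += l_sum1.y;
            out[pos] = g_sum;
        }
        xi = 4*mi+2;
        if (xj<=xi)
        {
            pos = ((xi*(xi+1))/2+(xj))*lo2+lx;
            g_sum = out[pos];
            g_sum.x += l_sum2.x;
            g_sum.y += l_sum2.y;
            out[pos] = g_sum;
        }
        xi = 4*mi+3;
        if (xj<=xi)
        {
            pos = ((xi*(xi+1))/2+(xj))*lo2+lx;
            g_sum = out[pos];
            g_sum.x += l_sum3.x;
            g_sum.y += l_sum3.y;
            out[pos] = g_sum;
        }
    }
}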

// Kernel wrapper routine for 1xGxG (G=4) accumulation
void a_1xG_4(float2 *out, float2 *in,
        int l, int n, int t0, int tN, int tT)
{
    // 32x4 threads per block: 32 channels by 4 streams
    dim3 grid(l/64, (n/4)*(n/4), 1);
    dim3 block(32, 4, 1);
    accumulate_1xG_4<<<grid,block>>>(out,in,l/2,n/4,t0,tN,tT);
}

// GPU Kernel for 1x1xN accumulation
// (called from the wrapper below)
__global__ void accumulate_1xN(float2 *out, float2 *in,
        int lo2, int n, int t0, int tN, int tT)
{
    // Each thread handles one channel of stream nj and loops
    // over every partner stream ni<=nj
    int idx = __mul24(blockIdx.x,blockDim.x)+threadIdx.x;
    int nj = blockIdx.y;
    float2 l_sum;
    for (int ni=0; ni<=nj; ni++)
    {
        l_sum = make_float2(0.0f,0.0f);
        for (int pos=t0*(lo2*2)+idx; pos<tN*(lo2*2); pos+=lo2*2)
        {
            float2 chj = in[nj*(lo2*2)*tT+pos];
            float2 chi = in[ni*(lo2*2)*tT+pos];
            // chj multiplied by the complex conjugate of chi
            l_sum.x += chj.x*chi.x + chj.y*chi.y;
            l_sum.y += chj.y*chi.x - chj.x*chi.y;
        }
        int pos = (((nj*(nj+1))/2)+ni)*lo2+idx;
        float2 g_sum = out[pos];
        g_sum.x += l_sum.x;
        g_sum.y += l_sum.y;
        out[pos] = g_sum;
    }
}

// Kernel wrapper routine for 1x1xN accumulation
void a_1xN(float2 *out, float2 *in,
        int l, int n, int t0, int tN, int tT)
{
    // One row of blocks per stream; 64 threads per block
    dim3 grid(l/128, n, 1);
    dim3 block(64, 1, 1);
    accumulate_1xN<<<grid,block>>>(out,in,l/2,n,t0,tN,tT);
}
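The indexing used by the 1x1x1 and 1x1xN kernels can be made explicit with a serial host-side reference. The following is a minimal sketch for comparison purposes only, not the CPU correlator used in this work; cfloat and cmac_ref are hypothetical names, and the buffer layout (streams of tT spectra of l channels, with l/2 channels kept per product) is taken from the kernels above.

```c
/* Host-side stand-in for CUDA's float2 (hypothetical name). */
typedef struct { float x, y; } cfloat;

/* Serial conjugate multiply and accumulate with the same layout as
 * accumulate_1x1: product (nj, ni), ni <= nj, is stored at triangular
 * index nj(nj+1)/2 + ni, and results are added to the values in out. */
void cmac_ref(cfloat *out, const cfloat *in,
              int l, int n, int t0, int tN, int tT)
{
    int lo2 = l / 2;
    for (int nj = 0; nj < n; nj++)
        for (int ni = 0; ni <= nj; ni++)
            for (int c = 0; c < lo2; c++) {
                cfloat sum = {0.0f, 0.0f};
                for (int t = t0; t < tN; t++) {
                    cfloat chj = in[(nj * tT + t) * l + c];
                    cfloat chi = in[(ni * tT + t) * l + c];
                    /* chj times the complex conjugate of chi */
                    sum.x += chj.x * chi.x + chj.y * chi.y;
                    sum.y += chj.y * chi.x - chj.x * chi.y;
                }
                cfloat *g = &out[(nj * (nj + 1) / 2 + ni) * lo2 + c];
                g->x += sum.x;
                g->y += sum.y;
            }
}
```

As with the kernels, the output buffer is expected to be zeroed before the first call of an accumulation.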


A.3 Polyphase Filter Kernel

/**
 * These routines take N telescope streams consisting of
 * packed 8 bit samples, unpack them to a floating point
 * representation, and then pass them through a polyphase
 * filter (unpacking occurs partway through the filter).
 * For input, they take a pointer to a GPU-resident buffer
 * that contains the packed data, grouped by stream. They
 * output to a GPU-resident buffer specified by a second
 * pointer.
 *
 * The following variables are used:
 *   out  - a pointer to the output buffer
 *   in   - a pointer to the input buffer
 *   size - the total number of samples in the input buffer
 *   taps - the number of taps in the polyphase filter
 *   n    - the number of streams present in the input buffer
 *
 * The following routines are available:
 *   u_poly : the polyphase filter routine
 */

// GPU Kernel for the polyphase filter operation
// (called from the wrapper below)
// Note: pi is assumed to be defined elsewhere in this file, and
// taps is assumed to be a power of two (see the & masks below).
__global__ void upoly(float2 *out, uchar1 *in,
        int size, int taps)
{
    int x = threadIdx.x;
    int y = blockIdx.x;
    int nx = blockIdx.y;
    int n = gridDim.y;
    int l = blockDim.x;
    int pS = (size/(gridDim.x*n));
    int poff = (taps-1)*l;
    int p0 = (nx*(gridDim.x*pS+poff))+y*pS+x;
    int pN = p0-x+pS;
    int w = taps*l;
    int nxt = taps - 1; //circular buffer index
    __shared__ unsigned char s_in[128*8];
    __shared__ float s_f[128*8];
    // each thread computes its own column of the windowed filter
    for (int t=0; t<taps; t++)
    {
        int loc = l*t+x;
        s_f[loc]=(0.5-0.5*cos(loc*2*pi/w))*(sin((w/2-loc)*pi/l)/(pi*l));
    }
    // load initial buffer, bar the last tap
    // (each thread fills its own column x of every block)
    for (int p=0; p<w-l; p+=l)
    {
        s_in[p+x] = in[p0+p].x;
    }
    __syncthreads();
    // load, calculate, write loop
    int tab = (taps-1)*l;
    for (int p=p0; p<pN; p+=l)
    {
        // load next buffer
        s_in[l*nxt+x] = in[p+tab].x;
        __syncthreads();
        nxt = (nxt+1)&(taps-1);
        // multiply each value by filter and sum across taps
        float sum = 0.0f;
        for (int t=0; t<taps; t++)
        {
            int loc_v = l*((t+nxt+1)&(taps-1))+x;
            int loc_f = l*t+x;
            float val = (float)s_in[loc_v]-127.0f;
            sum += s_f[loc_f]*val;
        }
        // write filtered sum to output memory
        int po = p - (nx*(taps-1)*l);
        out[po] = make_float2(sum,0.0f);
    }
}

// Kernel wrapper routine for the polyphase filter
void u_poly(float2 *out, char *in,
        int size, int taps, int n, int l)
{
    // grid.x = available 8bytes in shared mem*multiprocessors
    // divided by required resources
    dim3 grid(2048*64/(n*l*taps), n, 1);
    // one thread per filter column
    dim3 block(l, 1, 1);
    upoly<<<grid,block>>>(out, (uchar1*)in, size, taps);
}
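The weighted sum across taps performed by the kernel above can be written as a short serial reference. This is a minimal sketch under stated assumptions: the function name polyphase_frontend is hypothetical, the input is taken to be already unpacked to floats, and the filter coefficients are supplied by the caller rather than computed from the window used in the kernel.

```c
#include <stddef.h>

/* Serial polyphase filter front end for one stream.
 *   out    : blocks * l filtered samples
 *   in     : (blocks + taps - 1) * l unpacked input samples
 *   coeff  : taps * l filter coefficients
 * Each output sample is the sum, over the taps, of one input sample
 * from each of `taps` consecutive blocks, weighted by the filter. */
void polyphase_frontend(float *out, const float *in, const float *coeff,
                        size_t l, size_t taps, size_t blocks)
{
    for (size_t b = 0; b < blocks; b++)
        for (size_t x = 0; x < l; x++) {
            float sum = 0.0f;
            for (size_t t = 0; t < taps; t++)
                sum += coeff[t * l + x] * in[(b + t) * l + x];
            out[b * l + x] = sum;
        }
}
```

Each block of l filtered samples would then be passed to the Fourier transform stage in place of the unfiltered samples.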
