Hardware Acceleration of Pulsar Search on FPGAs using OpenCL

Oliver Sinnen, Haomiao Wang & Prabu Thiagaraj (Manchester Uni)

Parallel and Reconfigurable Computing

Department of Electrical and Computer Engineering
University of Auckland

Computing for SKA, 2017

Strong-field Test of Gravity using Pulsars

Image Credit: NASA/Tod Strohmayer (GSFC)/Dana Berry (Chandra X-Ray Observatory)

Image credit: NASA

Outline

1 Overview and Task

2 FT Convolution Decomposition

3 High-level Techniques and Implementation

4 Evaluation

5 What’s Next

1 Overview and Task

Pulsar and Pulsar Search

Observed radiation is a pulse
Binary pulsar (Doppler effect)
Acceleration search: 1) Time-domain 2) Frequency-domain

Frequency-domain
Using the matched filtering technique in the Fourier domain to recover the signal into a single bin:

$$A_{r_0} \simeq \sum_{k=[r_0]-m/2}^{[r_0]+m/2} A_k \, A^{*}_{r_0-k},$$

where the frequency $r_0$ is unknown. The summation is computed at a range of frequencies $r$.
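The summation can be illustrated with a minimal pure-Python sketch. It assumes that the second factor in the sum is the conjugated template response, that `r0` is an integer bin, and that the template is stored for offsets −m/2 ... m/2. The names `matched_filter_response`, `spectrum` and `template` are hypothetical and do not come from the slides.

```python
def matched_filter_response(spectrum, template, r0):
    """Sum A_k * conj(T_{r0-k}) for k = r0 - m/2 .. r0 + m/2.

    spectrum : list of complex Fourier amplitudes A_k
    template : list of m + 1 complex values; template[j] is the predicted
               response at offset (j - m/2) from the centre (assumption)
    r0       : integer trial frequency bin (assumed integral here)
    """
    m = len(template) - 1
    half = m // 2
    total = 0j
    for k in range(r0 - half, r0 + half + 1):
        if 0 <= k < len(spectrum):
            offset = r0 - k                  # runs from +m/2 down to -m/2
            total += spectrum[k] * template[offset + half].conjugate()
    return total


# Toy data: a signal whose power is smeared over three bins around bin 8.
template = [0.5 + 0.5j, 1 + 0j, 0.5 - 0.5j]
spectrum = [0j] * 16
for k in (7, 8, 9):
    spectrum[k] = template[(8 - k) + 1]

# Because r0 is unknown, the sum is evaluated for every trial bin r.
best_bin = max(range(len(spectrum)),
               key=lambda r: abs(matched_filter_response(spectrum, template, r)))
print(best_bin)   # 8
```

Evaluating the sum at every trial bin is a correlation of the spectrum with the template, which is why the search can be treated as FIR filtering in the rest of the slides.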

Block Overview of Pulsar Search Engine

[Figure: block diagram of the pulsar search engine. Blocks and data buffers shown:]
Data Receptor (RCPT)
Beamformed Data (BFD)
Dedispersion Buffer Creator (DDBC)
Filterbank Data Chunks (FDC)
RFI Mitigation (RFIM)
Dedispersion Buffers (DB)
Dedispersion Transform (DDTR)
Flagged DB (FDB)
Periodicity Search Buffer Creator (PSBC)
Dedispersed Data Buffer
Complex Fourier Transform (CXFT)
Periodicity Search Buffer
Birdie Zapping (BRDZ)
Dereddening Spectrum (DRED)
Inverse Complex Fourier Transform (iCXFT)
Single Pulse Detector (SPCT)
Single Pulse Sifter (SPSIFT)
Single Pulse Optimiser (SPOPT)
Candidate Data Output Streamer (CDOS), to SDP
Fourier Domain Acceleration Search (FDAS)
Time Domain Resampler Transform (TDRT)
Fourier Transform and Power Spectrum (PWFT)
Harmonic Summing (HRMS)
Time Domain Candidate Optimisation (TDAO)
Candidate Sifting (SIFT)
Full Filterbank Buffer Creator (FFBC)
Candidate Folding and Optimisation (FLDO)
Fourier Domain Candidate Optimisation (FDAO)
Filterbank Data for Selected SP Candidates
Filterbank Data for Candidate Folding, from SDP
Legend: Common; Single Pulse; Time-domain Acc; Freq-domain Acc

Fourier-domain Acceleration Search (FDAS)

The FDAS module is applied to search for (binary) pulsars with constant frequency derivatives in the frequency domain

[Figure: FDAS data flow. Labels shown: over 2,000 beams are formed at 4,096 channels/beam; Beam_i signals are de-dispersed for 6,000 DMs (DM1, DM2, ..., DM6000); PSS Engine_i; Pre-Processing (RFIM, DDTR, PSBC, CXFT, BRDZ, DRED); Single Pulse Search Modules; Time Domain Acceleration or Harmonic-sum Module; FDAS Module; FT Convolution Module (FIR_k, ..., FIR_85; 85 FIR filters, maximum length is 421-tap); Post-processing.]

Specification of Task

Parameter   Description                               Value
B           # of beams                                1000∼2000
DM          # of dispersion measure (DM) trials       6000
Tobs        Observation period                        540 s
tlimit      Time for executing one sample group       88 ms
N           # of complex samples per group            2^22
M           # of templates/filters                    85
K           Average template/filter length            > 200
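A rough back-of-the-envelope estimate shows what these parameters imply for a direct time-domain implementation. It assumes N = 2^22 samples per group, K = 200 as a lower bound on the average filter length, and 8 real floating-point operations per complex multiply-accumulate (4 multiplies and 4 adds). The variable names are illustrative only.

```python
N = 2 ** 22          # complex samples per group
M = 85               # templates / filters
K = 200              # lower bound on the average filter length (table: > 200)
T_LIMIT = 0.088      # seconds allowed per sample group

FLOPS_PER_COMPLEX_MAC = 8   # assumption: 4 real multiplies + 4 real adds

complex_macs = N * M * K                      # one MAC per sample, filter, tap
flops = complex_macs * FLOPS_PER_COMPLEX_MAC
required_tflops = flops / T_LIMIT / 1e12

print(f"complex MACs per group: {complex_macs:.3e}")       # 7.130e+10
print(f"required throughput   : {required_tflops:.2f} TFLOPS")  # 6.48 TFLOPS
```

Under these assumptions a naive time-domain approach needs about 6.5 TFLOPS of sustained single-precision throughput for each sample group, which motivates the decomposition and frequency-domain techniques that follow.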

2 FT Convolution Decomposition

FT Convolution

Complex floating-point operations
Multiple long FIR filters
Large input size
Strict time limit

Number of acceleration devices (CapEx)

Energy consumption (OpEx)

Basic Element

Time-domain FIR Filter (TDFIR)

$$y_m[i] = \sum_{k=0}^{K-1} x_m[i-k]\, h_m[k], \quad \text{for } i = 0, 1, \ldots, N-1$$

Frequency-domain FIR Filter (FDFIR)

$$\mathcal{F}\{f * h\} = \mathcal{F}\{f\} \cdot \mathcal{F}\{h\}$$
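The equivalence of the two basic elements can be checked numerically. The sketch below is a minimal pure-Python illustration, not the OpenCL kernels from the talk: `tdfir` evaluates the sum directly, and `fdfir` uses the convolution theorem with a naive O(n^2) DFT standing in for an FFT engine. All function names are hypothetical.

```python
import cmath


def tdfir(x, h):
    """Direct-form FIR: y[i] = sum_k x[i-k] * h[k], with x[j] = 0 for j < 0."""
    return [sum(x[i - k] * h[k] for k in range(len(h)) if i - k >= 0)
            for i in range(len(x))]


def dft(a, inverse=False):
    """Naive O(n^2) DFT, adequate for a small illustration."""
    n = len(a)
    sign = 1 if inverse else -1
    out = [sum(a[j] * cmath.exp(sign * 2j * cmath.pi * j * k / n)
               for j in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out


def fdfir(x, h):
    """FIR via F{x * h} = F{x} . F{h}, zero-padded to avoid circular wrap-around."""
    n = len(x) + len(h) - 1
    X = dft(list(x) + [0j] * (n - len(x)))
    H = dft(list(h) + [0j] * (n - len(h)))
    y = dft([a * b for a, b in zip(X, H)], inverse=True)
    return y[:len(x)]            # keep the same N outputs as tdfir


x = [1 + 1j, 2, -1j, 3, 0.5j]
h = [1, 0.5j, -1]
print(all(abs(a - b) < 1e-9 for a, b in zip(tdfir(x, h), fdfir(x, h))))  # True
```

The time-domain form costs K complex multiply-accumulates per output sample, whereas the frequency-domain form replaces them with transforms and one element-wise multiplication, which is why it pays off for long filters.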

Hardware Limitation

Naïve Time Domain
DSP blocks: single-precision floating-point (SPF) multiplications
(A + iB) × (C + iD) = (A×C − B×D) + i(A×D + B×C)

Naïve Frequency Domain
Off-chip (global) memory: off-chip memory bandwidth
RAM blocks: on-chip (local) memory size
4-Million elements = 32 MBytes
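The 32 MBytes figure follows from storing each complex element as two single-precision floats. The short calculation below reproduces it and compares it with the 50 Mb of on-chip memory listed in the platform table of these slides. Reading "50 Mb" as 50 × 10^6 bits is an assumption.

```python
BYTES_PER_SPF = 4                        # single-precision float
BYTES_PER_COMPLEX = 2 * BYTES_PER_SPF    # real + imaginary part

n_elements = 2 ** 22                     # "4-Million elements"
input_bytes = n_elements * BYTES_PER_COMPLEX
input_mib = input_bytes / 2 ** 20

on_chip_bits = 50e6                      # assumption: 50 Mb = 50 * 10^6 bits
on_chip_mb = on_chip_bits / 8 / 1e6

print(f"input array : {input_mib:.0f} MiB")    # 32 MiB
print(f"on-chip RAM : {on_chip_mb:.2f} MB")    # 6.25 MB
```

A single input array is therefore roughly five times larger than the whole on-chip memory, so a naive frequency-domain kernel has to stream its data through off-chip memory.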

Decomposition Algorithms

Overlap-add Algorithm
Split the coefficient array –> OLA-TD

Overlap-save Algorithm
Split the input array –> OLS-FD

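[Figure: (a) OLA. The coefficients are split into groups C_1, C_2, ..., C_N of length Ncoef/N. The input data, padded with zeros (length Ncoef/N − 1), is convolved with subset coefficient group i to give Output data_i. Output data_1, Output data_2, ..., Output data_N are offset by Ncoef/N and added to form the output data.
(b) OLS. The input data, padded with zeros (length Ncoef − 1), is split into N small groups ID_1, ..., ID_N. Each ID_i is convolved with the FIR filter to give PD_i. The Ncoef − 1 overlapping elements of each PD_i are discarded, and PD_1 ... PD_N form the output data.]

(a) OLA (b) OLS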
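Both decompositions can be sketched in a few lines of pure Python. This is a minimal illustration under stated assumptions, not the kernels from the talk: a direct convolution stands in for the per-block FIR or FFT engine, and the names `convolve`, `ola_split_coefficients` and `ols_split_input` are hypothetical.

```python
def convolve(x, h):
    """Full linear convolution, length len(x) + len(h) - 1."""
    y = [0j] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for k, hk in enumerate(h):
            y[i + k] += xi * hk
    return y


def ola_split_coefficients(x, h, groups):
    """OLA as in panel (a): split the coefficients, then shift and add."""
    size = -(-len(h) // groups)              # ceil(len(h) / groups)
    y = [0j] * (len(x) + len(h) - 1)
    for g in range(groups):
        sub = h[g * size:(g + 1) * size]
        if not sub:
            continue
        part = convolve(x, sub)              # partial output for group g
        for i, v in enumerate(part):
            y[g * size + i] += v             # offset by the group's position in h
    return y[:len(x)]


def ols_split_input(x, h, block):
    """OLS as in panel (b): split the input, then discard the overlap."""
    overlap = len(h) - 1
    padded = [0j] * overlap + list(x)        # leading zeros, length Ncoef - 1
    y = []
    for start in range(0, len(x), block):
        chunk = padded[start:start + block + overlap]
        part = convolve(chunk, h)
        keep = min(block, len(x) - start)
        y.extend(part[overlap:overlap + keep])   # drop the first Ncoef - 1 values
    return y


x = [complex(i, -i) for i in range(1, 11)]
h = [1, 2j, -1, 0.5, 1j]
reference = convolve(x, h)[:len(x)]
print(all(abs(a - b) < 1e-9 for a, b in zip(ola_split_coefficients(x, h, 2), reference)))  # True
print(all(abs(a - b) < 1e-9 for a, b in zip(ols_split_input(x, h, 4), reference)))         # True
```

In the sketch the OLA variant keeps each sub-filter short, which suits a time-domain pipeline, while the OLS variant keeps each input block short enough for a fixed-size transform. This matches the OLA-TD and OLS-FD pairing on the slide.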

3 High-level Techniques and Implementation

High-level Techniques

Maxeler MaxCompiler, using Java to develop for FPGAs (HPCC 2016)

Open Computing Language (OpenCL) for FPGAs (Intel FPGA cards), GPUs, and CPUs (FPT 2016, best paper candidate)

[Figure: OpenCL platform with multiple FPGA cards. CPU cores (Core 1 ... Core 4) with memory (DDR3 and SSD) are shown alongside FPGA_0 ... FPGA_i. Each FPGA contains a DDR controller & PHY, a global memory interconnect, a local memory interconnect, block RAM and several kernel pipelines, and is attached to 2GB DDR3 x 2 memory (global memory).]

Kernel Structures–OLA

Kernel Structures–OLS

[Figure: two OpenCL kernel structures for OLS. In both, kernels are connected through channels and read from and write to global memory.
Six-kernel structure: FFT Data Fetch Kernel (NDRange), FFT and Multiplication Kernel (Single), FFT Bit-Reverse Kernel (NDRange), IFFT Data Fetch Kernel (NDRange), IFFT Kernel (Single), IFFT Bit-Reverse Kernel (NDRange). Global memory is split into Bank1 and Bank2; labelled data: input, processed coefficients, output.
Three-kernel structure, launched twice: Data Fetch and Multiplication Kernel (NDRange), FFT/IFFT Kernel (Single; 1st launch FFT, 2nd launch IFFT), Bit-Reverse Kernel (NDRange). One global memory block is labelled "1st launch: Bank1, 2nd launch: Bank2", the other "1st launch: Bank2, 2nd launch: Bank1". A switch selects the input on the 1st launch; the processed coefficients are used on the 2nd launch.]

4 Evaluation

Platform

Table: Details of FPGA and GPU Platforms

Device (Board)            Terasic DE5-Net                 Sapphire Nitro R7 370
Hardware                  Intel Stratix V 5SGXA7          AMD Radeon R7 370
Technology                28 nm                           28 nm
Compute resource          622,000 LEs, 256 DSP blocks     1024 Stream Processors
On-chip memory size       50 Mb                           —
Global memory size        2 x 2 GB DDR3                   3 GB GDDR5
Global memory frequency   800 MHz                         5,600 MHz
Memory interface width    2 x 64-bit                      256-bit
Max clock frequency       —                               985 MHz
OpenCL version            1.0                             1.2
Max power consumption     —                               150 W
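The memory rows of the table allow a rough estimate of theoretical peak off-chip bandwidth. The calculation below rests on two assumptions that the table does not state: that the 800 MHz DDR3 figure is the I/O clock (so data moves at 1600 M transfers per second), and that the 5,600 MHz GDDR5 figure is already the effective data rate. The helper name is hypothetical.

```python
def peak_bandwidth_gb_s(transfers_per_s, bus_width_bits):
    """Theoretical peak bandwidth in GB/s (10^9 bytes per second)."""
    return transfers_per_s * (bus_width_bits / 8) / 1e9


# FPGA board: 2 x 64-bit DDR3.
# ASSUMPTION: 800 MHz is the I/O clock, i.e. 1600 M transfers/s (double data rate).
fpga = peak_bandwidth_gb_s(2 * 800e6, 2 * 64)

# GPU: 256-bit GDDR5.
# ASSUMPTION: 5,600 MHz is already the effective transfer rate.
gpu = peak_bandwidth_gb_s(5600e6, 256)

print(f"FPGA board: {fpga:.1f} GB/s")   # 25.6 GB/s
print(f"GPU       : {gpu:.1f} GB/s")    # 179.2 GB/s
print(f"ratio     : {gpu / fpga:.1f}x") # 7.0x
```

If these assumptions hold, the GPU has about seven times the off-chip bandwidth of one FPGA board, so the FPGA kernels depend more heavily on keeping data on chip.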

Latency–TDFIR vs FDFIR

4 TDFIR Kernels
Naïve: TD-Naïve-64S, TD-Naïve-64N
OLA-64S, OLA-64N

5 FDFIR Kernels
Naïve: FD-Naïve
AOLS-1024, AOLS-2048, AOLS-4096
TOLS-1024

Latency–TDFIR vs FDFIR

Latencies of a single FPGA (Intel Stratix V A7) when processing the same input array using 9 different OpenCL kernels:

[Figure: latency versus FIR filter length (64, 128, 256, 421) for the kernels TD-Naïve-64S, TD-Naïve-64N, OLA-64S, OLA-64N, FD-Naïve, AOLS-1024, AOLS-2048, AOLS-4096 and TOLS-1024.]

Multiple FIR Filters

Even the fastest kernel cannot meet the time limit

=> Implement multiple FIR filters in parallel

Problem: Bandwidth of off-chip memory is the main bottleneck
Solution: Do more processing! Calculate the power of the complex values (needed as input in the next stage)

Problem: The number of DSP blocks limits the number of filters that can run in parallel
Solution: Downscale the FFT engine input size: 8 points –> 4 points
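One reading of the first solution is that converting each complex output to its power before writing it back reduces off-chip traffic, because the next stage only needs the power. A minimal sketch follows, assuming single-precision storage (8 bytes per complex value, 4 bytes per power value). The function name is hypothetical.

```python
def power(values):
    """|z|^2 = re^2 + im^2 for each complex sample."""
    return [z.real * z.real + z.imag * z.imag for z in values]


BYTES_COMPLEX_SPF = 8   # two single-precision floats
BYTES_REAL_SPF = 4      # one single-precision float

samples = [3 + 4j, 1 - 1j, 0.5j]
print(power(samples))                       # [25.0, 2.0, 0.25]
print(BYTES_REAL_SPF / BYTES_COMPLEX_SPF)   # 0.5, i.e. half the write-back traffic
```

The extra multiplications and additions consume DSP blocks, which is the trade-off addressed by the second problem and solution pair.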

1) Multiple FIR filters 2) Power of complex values
Becomes an optimisation problem!

[Figure: DSP block usage (FFT engine, element-wise multiplications, power, unused DSP blocks) for the configurations TOLS-1024 8points, 2 x AOLS-1024 8points, 2 x AOLS-1024-P 8points, AOLS-1024-P 8points, AOLS-1024 8points, AOLS-1024-P 4points and AOLS-2048-P 4points.]

Multiple FIR Filters and FPGAs

[Figure: panels comparing the OpenCL kernels 3xAOLS-1024-P, 3xAOLS-2048-P and 3xAOLS-4096-P on 1 FPGA, 2 FPGAs and 3 FPGAs.]

Latency–FPGA vs GPU

Latencies of a single GPU (AMD Radeon R7 370) and 3 FPGAs when processing 2 and 4 Million points:

[Figure: latency versus total number of FIR filters (0 to 81) for 3xAOLS-2048-P (4-Million), 3xAOLS-2048-P (2-Million), GPU-FD (4-Million) and GPU-FD (2-Million).]

Energy–FPGA vs GPU

Energy dissipation of a single FPGA and a GPU when executing the same task with different kernels:

[Figure: energy dissipation in joules (axis 0 to 20) for GPU-FD, 3xAOLS-1024-P, 3xAOLS-2048-P, 3xAOLS-4096-P, AOLS-2048 and TOLS-1024.]

5 What’s Next

Harmonic-summing

Input: Filter-output-plane (FOP, 85 × 2^22 SPF points, ~1.33 GBytes)

Processing flow:
8 harmonic planes are generated based on the FOP and the stretched planes
One threshold for each row of each harmonic plane (overall: 85 × 8 = 680)
Hundreds of candidates are recorded

Output: Candidates
Each candidate contains the indexes of the filter, harmonic and bin, and the amplitude (up to 64-bit)
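The processing flow can be sketched on a toy plane. This is a simplified illustration under stated assumptions, not the planned implementation: stretching is applied along the bin axis only, bin positions are rounded to the nearest integer, and the names `harmonic_sum_row`, `find_candidates` and `thresholds` are hypothetical.

```python
def harmonic_sum_row(row, n_harmonics):
    """Return n_harmonics summed rows for one row of power values.

    Plane h (1-based) adds, for each bin i, the powers found at the
    fractions i*j/h (j = 1..h) of that bin, rounded to the nearest bin.
    """
    planes = []
    for h in range(1, n_harmonics + 1):
        planes.append([sum(row[(i * j + h // 2) // h] for j in range(1, h + 1))
                       for i in range(len(row))])
    return planes


def find_candidates(fop, n_harmonics, thresholds):
    """thresholds[f][h - 1] is the threshold for filter row f, harmonic plane h."""
    found = []
    for f, row in enumerate(fop):
        for h, plane in enumerate(harmonic_sum_row(row, n_harmonics), start=1):
            for b, amp in enumerate(plane):
                if amp > thresholds[f][h - 1]:
                    found.append((f, h, b, amp))   # filter, harmonic, bin, amplitude
    return found


fop = [[0.0, 0.0, 3.0, 0.0, 4.0, 0.0, 0.0, 0.0]]   # toy plane: 1 filter x 8 bins
print(find_candidates(fop, 2, [[3.5, 6.0]]))        # [(0, 1, 4, 4.0), (0, 2, 4, 7.0)]

# Size of the real input: 85 filters x 2^22 bins x 4-byte SPF values.
fop_bytes = 85 * 2 ** 22 * 4
print(fop_bytes / 2 ** 30)                          # 1.328125, i.e. about 1.33 GiB
```

Each output value needs only a handful of additions but reads from several widely separated bins, which is consistent with the slides describing memory access rather than arithmetic as the difficulty.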

Problems:
Input data size is too large (~1.33 GBytes)
On-chip memory size is too small for all planes
The computation is very cheap (SPF adds) and easy to parallelise
Off-chip memory bandwidth is the issue

Challenge:
Optimise data use (and reuse); computation is not an issue

Conclusion

FPGA-based implementation and optimisation of FT convolution (FIR filter), based on the OLA and OLS algorithms

High-level approaches to such tasks work well
Covered a large design space
Easy porting and sharing with partners

With multiple FPGAs, the FPGA implementation has an advantage over the GPU in both performance (GFLOPS) and energy efficiency
