Overview and Task
FT Convolution Decomposition
High-level Techniques and Implementation
Evaluation
What’s Next
Hardware Acceleration of Pulsar Search on FPGAs using OpenCL
Oliver Sinnen, Haomiao Wang &
Prabu Thiagaraj (Manchester Uni)
Parallel and Reconfigurable Computing
Department of Electrical and Computer Engineering
University of Auckland
Computing for SKA, 2017
Oliver Sinnen, Haomiao Wang & Prabu Thiagaraj (Manchester Uni)
FDAS on FPGA using OpenCL
Strong-field Test of Gravity using Pulsars
Image Credit: NASA/Tod Strohmayer (GSFC)/Dana Berry (Chandra X-Ray Observatory)
Image credit: NASA
Outline
1 Overview and Task
2 FT Convolution Decomposition
3 High-level Techniques and Implementation
4 Evaluation
5 What’s Next
Pulsar and Pulsar Search
Observed radiation is a pulse
Binary pulsar (Doppler effect)
Acceleration search: 1) Time-domain 2) Frequency-domain
Pulsar and Pulsar Search
Frequency-domain
Using a matched filtering technique in the Fourier domain to recover the signal into a single bin:
\[ A_{r_0} \simeq \sum_{k=[r_0]-m/2}^{[r_0]+m/2} A_k \, A^{*}_{r_0-k}, \]
where the frequency r_0 is unknown.
The summation is computed at a range of frequencies r.
Block Overview of Pulsar Search Engine
Figure: block diagram of the pulsar search engine. Block and buffer labels in the diagram:
Data Receptor (RCPT)
Beamformed Data (BFD)
Dedispersion Buffer Creator (DDBC)
Filterbank Data Chunks (FDC)
RFI Mitigation (RFIM)
Dedispersion Buffers (DB)
Dedispersion Transform (DDTR)
Flagged DB (FDB)
Periodicity Search Buffer Creator (PSBC)
Dedispersed Data Buffer (DDB)
Complex Fourier Transform (CXFT)
Periodicity Search Buffer (PSB)
Birdie Zapping (BRDZ)
Dereddening Spectrum (DRED)
Inverse Complex Fourier Transform (iCXFT)
Single Pulse Detector (SPCT)
Single Pulse Sifter (SPSIFT)
Single Pulse Optimiser (SPOPT)
Candidate Data Output Streamer (CDOS), to SDP
Fourier Domain Acceleration Search (FDAS)
Time Domain Resampler Transform (TDRT)
Fourier Transform and Power Spectrum (PWFT)
Harmonic Summing (HRMS)
Time Domain Candidate Optimisation (TDAO)
Candidate Sifting (SIFT)
Full Filterbank Buffer Creator (FFBC)
Candidate Folding and Optimisation (FLDO)
Fourier Domain Candidate Optimisation (FDAO)
From SDP: Filterbank Data for Selected SP Candidates; Filterbank Data for Candidate Folding
Legend: Common, Single Pulse, Time-domain Acc, Freq-domain Acc
Fourier-domain Acceleration Search (FDAS)
The FDAS module searches for (binary) pulsars with constant frequency derivatives in the frequency domain
Figure: position of the FDAS module within PSS Engine_i.
Over 2,000 beams (Beam_2, ..., Beam_i, ..., Beam_N) are formed at 4,096 channels/beam
Beam_i signals are de-dispersed for 6,000 DMs (DM_1, DM_2, ..., DM_j, ..., DM_6000)
Pre-processing: RFIM, DDTR, PSBC, CXFT, BRDZ, DRED
FT Convolution Module: FIR_1, ..., FIR_k, ..., FIR_85 (85 FIR filters, maximum length is 421-tap)
Post-processing
Modules in the engine: Single Pulse Search Modules; Time Domain Acceleration or Harmonic-sum Module; FDAS Module
Specification of Task
Parameter   Description                                  Value
B           # of beams                                   1000~2000
DM          # of dispersion measure (DM) trials          6000
T_obs       Observation period                           540 s
t_limit     Time limit for executing one sample group    88 ms
N           # of complex samples per group               2^22
M           # of templates/filters                       85
K           Average template/filter length               > 200
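To give a feel for what these numbers imply, the following back-of-envelope sketch estimates the arithmetic throughput a naive time-domain implementation would need. It is not from the slides: K = 200 is only the stated lower bound on the average filter length, and counting 8 real floating-point operations per complex multiply-accumulate is an assumption of this sketch.

```python
# Rough workload estimate for one sample group, using the specification table.
N = 2 ** 22          # complex samples per group
M = 85               # templates / filters
K = 200              # assumed: the table only says the average length is "> 200"
T_LIMIT_S = 0.088    # time limit per sample group (88 ms)

# Assumption: one complex multiply-accumulate = 4 mul + 2 add (multiply)
# + 2 add (accumulate) = 8 real floating-point operations.
FLOPS_PER_COMPLEX_MAC = 8

complex_macs = N * M * K
required_flops_per_s = complex_macs * FLOPS_PER_COMPLEX_MAC / T_LIMIT_S

print(f"complex MACs per group: {complex_macs:.3e}")
print(f"required throughput   : {required_flops_per_s / 1e12:.2f} TFLOPS")
```

Because K is a lower bound, the printed throughput is a lower bound as well, and it applies per beam and per DM trial.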
FT Convolution
Complex floating-point operations
Multiple long FIR filters
Large input size
Strict time limit
Number of acceleration devices (CapEx)
Energy consumption (OpEx)
Basic Element
Time-domain FIR Filter (TDFIR)
\[ y_m[i] = \sum_{k=0}^{K-1} x_m[i-k] \, h_m[k], \quad \text{for } i = 0, 1, \ldots, N-1 \]
Frequency-domain FIR Filter (FDFIR)
\[ \mathcal{F}\{f * h\} = \mathcal{F}\{f\} \cdot \mathcal{F}\{h\} \]
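The two formulations can be checked against each other in a few lines. The sketch below is illustrative only and is not the OpenCL kernel code from the talk: `fft`, `ifft`, `tdfir` and `fdfir` are hypothetical helper names, the FFT is a plain recursive radix-2 version, and samples before the start of the input are assumed to be zero.

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalised); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def ifft(X):
    """Inverse FFT with the 1/n normalisation applied once at the end."""
    n = len(X)
    return [v / n for v in fft(X, inverse=True)]

def tdfir(x, h):
    """TDFIR: y[i] = sum_k x[i-k] h[k] for i = 0..N-1, with x[j] = 0 for j < 0."""
    return [sum(x[i - k] * h[k] for k in range(len(h)) if i - k >= 0)
            for i in range(len(x))]

def fdfir(x, h):
    """FDFIR: multiply the spectra, zero-padded so the circular result is linear."""
    n = 1
    while n < len(x) + len(h) - 1:
        n *= 2
    X = fft(list(x) + [0j] * (n - len(x)))
    H = fft(list(h) + [0j] * (n - len(h)))
    y = ifft([a * b for a, b in zip(X, H)])
    return y[:len(x)]

x = [complex(i % 5, -(i % 3)) for i in range(16)]
h = [1 + 1j, 0.5j, -2 + 0j, 0.25 - 0.75j]
y_td = tdfir(x, h)
y_fd = fdfir(x, h)
max_err = max(abs(a - b) for a, b in zip(y_td, y_fd))
print(f"largest difference between TDFIR and FDFIR outputs: {max_err:.2e}")
```

The direct form costs about N·K complex multiplications per filter, while the FFT form costs on the order of N·log N, which is why the frequency-domain route becomes attractive for long filters.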
Hardware Limitation
Naïve Time Domain
DSP block: single-precision floating-point (SPF) multiplications
(A + iB) × (C + iD) = (A×C − B×D) + i(A×D + B×C)
Naïve Frequency Domain
Off-chip (global) memory: off-chip memory bandwidth
RAM block: on-chip (local) memory size
4-Million elements = 32 MBytes
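Both limits can be made concrete with a short sketch. It shows the four real multiplications behind one complex product and reproduces the 32 MByte figure for 4-Million complex SPF elements. The comparison against 50 Mb of on-chip memory uses the value from the Platform table later in the talk; `complex_mul` is a hypothetical helper name.

```python
def complex_mul(a, b, c, d):
    """(a + ib)(c + id) with four real multiplications and two additions."""
    return (a * c - b * d, a * d + b * c)

product = complex_mul(1.0, 2.0, 3.0, 4.0)   # (1 + 2i)(3 + 4i)

ELEMENTS = 2 ** 22            # "4-Million elements" on the slide
BYTES_PER_ELEMENT = 2 * 4     # real and imaginary part, 32-bit floats
footprint_bytes = ELEMENTS * BYTES_PER_ELEMENT

ON_CHIP_BITS = 50e6           # 50 Mb, taken from the Platform table
fits_on_chip = footprint_bytes * 8 <= ON_CHIP_BITS

print(product)
print(footprint_bytes // 2 ** 20, "MiB for one input array")
print("fits in on-chip memory:", fits_on_chip)
```

Every tap of every time-domain filter needs such a product on each sample, so DSP blocks run out first in the naive time-domain design, while the naive frequency-domain design is limited by where a full-length array can be stored.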
Decomposition Algorithms
Overlap-add Algorithm
Split the coefficient array -> OLA-TD
Overlap-save Algorithm
Split the input array -> OLS-FD
Figure (a) OLA: the coefficients are split into N subsets C_1, C_2, ..., C_N of length Ncoef/N. The input data is zero-padded (length Ncoef/N − 1) and convolved with subset coefficient group i to give Output data_i. Output data_1, Output data_2, ..., Output data_N are added to form the output data.
Figure (b) OLS: the input data is zero-padded (length Ncoef − 1) and split into N small groups ID_1, ID_2, ID_3, ..., ID_N. Each ID_i is convolved with the FIR filter, the Ncoef − 1 overlapping elements are discarded, and the resulting PD_1, ..., PD_N form the output data.
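The two decompositions can be sketched as follows. This is a minimal illustration of the idea on the slide and not the kernels used in the talk: each sub-convolution here is a direct time-domain sum for brevity (the OLS-FD variant would use an FFT for it), and the names `convolve`, `ola_split_coefficients` and `ols_split_input` are hypothetical.

```python
def convolve(x, h):
    """Causal FIR: y[i] = sum_k x[i-k] h[k], same length as x, x[j] = 0 for j < 0."""
    return [sum(x[i - k] * h[k] for k in range(len(h)) if i - k >= 0)
            for i in range(len(x))]

def ola_split_coefficients(x, h, parts):
    """Split h into `parts` groups, filter x with each, add the delayed outputs."""
    size = -(-len(h) // parts)          # ceiling division
    y = [0] * len(x)
    for start in range(0, len(h), size):
        partial = convolve(x, h[start:start + size])
        # A group that begins at tap `start` contributes with a delay of `start`.
        for i in range(start, len(x)):
            y[i] += partial[i - start]
    return y

def ols_split_input(x, h, block):
    """Split x into blocks, prepend len(h)-1 earlier samples, discard that overlap."""
    overlap = len(h) - 1
    padded = [0] * overlap + list(x)
    y = []
    for start in range(0, len(x), block):
        segment = padded[start:start + block + overlap]
        y.extend(convolve(segment, h)[overlap:])
    return y

x = list(range(1, 21))
h = [1, -2, 3, 0, 5, -1, 2]
reference = convolve(x, h)
y_ola = ola_split_coefficients(x, h, parts=3)
y_ols = ols_split_input(x, h, block=6)
print(y_ola == reference, y_ols == reference)
```

Splitting the coefficients keeps each time-domain filter short, while splitting the input keeps each block small enough for a fixed-size FFT whose working set fits in on-chip memory.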
High-level Techniques
Maxeler MaxCompiler using Java to develop FPGA (HPCC2016)
Open Computing Language (OpenCL) for FPGAs (Intel FPGA Cards), GPUs, and CPUs (FPT2016, best paper candidate)
Figure: OpenCL platform model. The host (Core 1 ... Core 4, memory: DDR3 and SSD) connects over PCIe to FPGA_0 ... FPGA_i. Each FPGA contains a PCIe interface, a DDR controller & PHY, a global memory interconnect, a local memory interconnect, block RAM and several kernel pipelines, and is attached to 2 x 2GB DDR3 memory (global memory).
Kernel Structures–OLA
Kernel Structures–OLS
Figure: two OLS kernel structures; kernels are connected through channels.
Structure 1: Global Memory (Bank1) supplies the input and the processed coefficients -> FFT Data Fetch Kernel (NDRange) -> FFT and Multiplication Kernel (Single) -> FFT Bit-Reverse Kernel (NDRange) -> IFFT Data Fetch Kernel (NDRange) -> IFFT Kernel (Single) -> IFFT Bit-Reverse Kernel (NDRange) -> output to Global Memory (Bank2).
Structure 2: Data Fetch and Multiplication Kernel (NDRange) -> FFT/IFFT Kernel (Single; 1st launch FFT, 2nd launch IFFT) -> Bit-Reverse Kernel (NDRange). The kernels are launched twice and the global memory banks switch roles: the source is Bank1 in the 1st launch and Bank2 in the 2nd launch, the destination is Bank2 in the 1st launch and Bank1 in the 2nd launch. The input is read in the 1st launch and the processed coefficients in the 2nd launch.
Labels in the figure: AOLS, TOLS
Platform
Table: Details of FPGA and GPU Platforms
                          FPGA                          GPU
Device (Board)            Terasic DE5-Net               Sapphire Nitro R7 370
Hardware                  Intel Stratix V 5SGXA7        AMD Radeon R7 370
Technology                28nm                          28nm
Compute resource          622,000 LEs, 256 DSP blocks   1024 Stream Processors
On-chip memory size       50Mb                          n/a
Global memory size        2 x 2GB DDR3                  3GB GDDR5
Global memory frequency   800MHz                        5,600MHz
Memory interface width    2 x 64-bit                    256-bit
Max clock frequency       n/a                           985MHz
OpenCL                    1.0                           1.2
Max power consumption     n/a                           150W
Latency–TDFIR vs FDFIR
4 TDFIR Kernels
  Naïve: TD-Naïve-64S, TD-Naïve-64N
  OLA: OLA-64S, OLA-64N
5 FDFIR Kernels
  Naïve: FD-Naïve
  OLS
    AOLS: AOLS-1024, AOLS-2048, AOLS-4096
    TOLS: TOLS-1024
Latency–TDFIR vs FDFIR
Latencies of a single FPGA (Intel Stratix V A7) in processing the same input array using 9 different OpenCL kernels:
Figure: kernel execution latency (ms, axis 0 to 250) against FIR filter length (64, 128, 256, 421) for TD-Naïve-64S, TD-Naïve-64N, OLA-64S, OLA-64N, FD-Naïve, AOLS-1024, AOLS-2048, AOLS-4096 and TOLS-1024.
Multiple FIR Filters
Even the fastest kernel cannot meet the time limit
=> Implement multiple FIR filters in parallel
Problem: Bandwidth of off-chip memory is the main bottleneck
Solution: Do more processing! Calculate the power of the complex values (needed as input in the next stage)
Problem: Number of DSP blocks limits the number of parallelisable filters
Solution: Downscale the FFT engine input size: 8 points -> 4 points
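One way to read the first solution: if each filter output is reduced to its power before it leaves the chip, every complex SPF value (8 bytes) becomes a single real SPF value (4 bytes), so the write traffic per filter halves. The sketch below illustrates that reading; the byte counts and the helper name `power` are assumptions of this sketch rather than figures from the slides.

```python
def power(z):
    """|z|^2 = re^2 + im^2: two real multiplications and one addition."""
    return z.real * z.real + z.imag * z.imag

filtered = [3 + 4j, 1 - 1j, 0.5j]          # stand-in filter outputs
powers = [power(z) for z in filtered]

N = 2 ** 22                                 # samples per group
bytes_complex_out = N * 8                   # complex SPF written per filter
bytes_power_out = N * 4                     # real SPF written per filter

print(powers)
print(bytes_complex_out // 2 ** 20, "MiB ->", bytes_power_out // 2 ** 20, "MiB per filter")
```

The extra multiplications compete for the same DSP blocks as the FFT engine, which leads to the second problem on the slide.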
Multiple FIR Filters
1) Multiple FIR filters 2) Power of complex values
Becomes an optimisation problem!
Figure: DSP block usage per kernel configuration, broken down into FFT Engine, Element-wise multiplications, Power, and Unused DSP blocks. Configurations shown:
TOLS-1024 8points
2 x AOLS-1024 8points
2 x AOLS-1024-P 8points
AOLS-1024-P 8points
AOLS-1024 8points
AOLS-1024-P 4points
AOLS-2048-P 4points
3 x AOLS-2048-P 4points
3 x AOLS-1024-P 4points
Multiple FIR Filters and FPGAs
Figure: four bar charts comparing the OpenCL kernels 3xAOLS-1024-P, 3xAOLS-2048-P and 3xAOLS-4096-P on 1 FPGA, 2 FPGAs and 3 FPGAs:
Latency of 84 FIR filters (ms, axis 0 to 600)
Kernel Performance (GFLOPS, axis 0 to 300)
Power Efficiency (GFLOPS/watt, axis 0 to 4)
Energy Dissipation (Joule, axis 0 to 20)
Latency–FPGA vs GPU
Latencies of a single GPU (AMD Radeon R7 370) and 3 FPGAs in processing 2 and 4 Million points:
Figure: kernel execution latency (ms, axis 0 to 140) against total FIR filters (0 to 81 in steps of 9) for 3xAOLS-2048-P (4-Million), 3xAOLS-2048-P (2-Million), GPU-FD (4-Million) and GPU-FD (2-Million).
Energy–FPGA vs GPU
Energy dissipation of a single FPGA and GPU in executing the same task with different kernels:
Figure: energy dissipation (Joule, axis 0 to 20) for GPU-FD, 3xAOLS-1024-P, 3xAOLS-2048-P, 3xAOLS-4096-P, AOLS-2048 and TOLS-1024.
Harmonic-summing
Input: Filter-output-plane (FOP, 85×2^22 SPF points, ~1.33 GBytes)
Processing flow:
8 harmonic planes are generated based on the FOP and the stretched planes
One threshold for each row of each harmonic plane (overall: 85×8 = 680)
Hundreds of candidates are recorded
Output: Candidates
Each candidate contains the indexes of filter, harmonic and bin, and the amplitude (up to 64-bit)
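The processing flow can be sketched on a toy plane. The slides do not define how the stretched planes are built, so the following is an assumption-laden illustration: harmonic plane k sums, for each bin b, the FOP values at the nearest bins to b·j/k for j = 1..k, rows (filters) are left unstretched, and the thresholds are arbitrary. The function names are hypothetical.

```python
def harmonic_planes(fop, n_harmonics):
    """planes[k-1][row][b] = sum over j = 1..k of fop[row][nearest(b * j / k)]."""
    planes = []
    for k in range(1, n_harmonics + 1):
        plane = []
        for row in fop:
            plane.append([sum(row[(b * j + k // 2) // k] for j in range(1, k + 1))
                          for b in range(len(row))])
        planes.append(plane)
    return planes

def find_candidates(planes, thresholds):
    """Return (filter, harmonic, bin, amplitude) for values above the row threshold."""
    candidates = []
    for k, plane in enumerate(planes, start=1):
        for f, row in enumerate(plane):
            for b, amp in enumerate(row):
                if amp > thresholds[k - 1][f]:
                    candidates.append((f, k, b, amp))
    return candidates

# Toy FOP: 2 filters x 16 bins, with a fundamental at bin 4 and a harmonic at bin 8.
fop = [[1] * 16, [1] * 16]
fop[0][4] = 5
fop[0][8] = 5

planes = harmonic_planes(fop, n_harmonics=2)
thresholds = [[4, 4], [8, 8]]     # one threshold per harmonic plane and row (assumed)
candidates = find_candidates(planes, thresholds)
print(candidates)

# Size of the real FOP: 85 rows x 2^22 bins x 4 bytes per SPF value.
fop_gib = 85 * 2 ** 22 * 4 / 2 ** 30
print(f"{fop_gib:.2f} GiB")
```

Each output value needs only a handful of additions but reads k values scattered across the plane, so the access pattern matters far more than the arithmetic.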
Harmonic-summing
Problems:
Input data size is too large (~1.33 GBytes)
On-chip memory size is too small for all planes
The computation is very cheap (SPF adds) and easy to parallelise
Off-chip memory bandwidth is the issue
Challenge:
Optimise data use (and reuse); computation is not an issue
Conclusion
FPGA-based implementation and optimisation of FT convolution (FIR filter), based on the OLA and OLS algorithms
High-level approaches to such tasks work well
Covered a large design space
Easy porting and sharing with partners
With multiple FPGAs, the FPGA implementation has an advantage over the GPU in both performance (GFLOPS) and energy efficiency