Towards a Heterogeneous Computer Architecture for CACTuS
![Page 1: Towards a Heterogeneous Computer Architecture for CACTuS](https://reader036.vdocuments.site/reader036/viewer/2022062410/5681658b550346895dd8528a/html5/thumbnails/1.jpg)
Towards a Heterogeneous Computer Architecture for CACTuS
Anthony Milton
Supervisors: Assoc. Prof. David Kearney (UniSA), Dr. Sebastien Wong (DSTO)
Reconfigurable Computing Lab, http://rcl.unisa.edu.au
![Page 2: Towards a Heterogeneous Computer Architecture for CACTuS](https://reader036.vdocuments.site/reader036/viewer/2022062410/5681658b550346895dd8528a/html5/thumbnails/2.jpg)
Collaboration Partners
![Page 3: Towards a Heterogeneous Computer Architecture for CACTuS](https://reader036.vdocuments.site/reader036/viewer/2022062410/5681658b550346895dd8528a/html5/thumbnails/3.jpg)
Motivation for Heterogeneous CACTuS
CACTuS was originally developed and prototyped in MATLAB: a great testbed for algorithm development, but with poor computational performance.
As CACTuS is a visual tracking algorithm, real-time operation is desired.

| Variant | Time/frame | Frames/sec |
|---|---|---|
| “Standard” | 269.24 ms | 3.71 |
| “Frequency Domain” | 203.74 ms | 4.91 |
![Page 4: Towards a Heterogeneous Computer Architecture for CACTuS](https://reader036.vdocuments.site/reader036/viewer/2022062410/5681658b550346895dd8528a/html5/thumbnails/4.jpg)
Motivation – Data Parallelism
Figure panels: Input Frame, Posterior Position, Observed Image, Posterior Velocity
![Page 5: Towards a Heterogeneous Computer Architecture for CACTuS](https://reader036.vdocuments.site/reader036/viewer/2022062410/5681658b550346895dd8528a/html5/thumbnails/5.jpg)
Motivation – Task Parallelism
Figure labels: SEF #1, SEF #2, SEF #n-1, SEF #n
![Page 6: Towards a Heterogeneous Computer Architecture for CACTuS](https://reader036.vdocuments.site/reader036/viewer/2022062410/5681658b550346895dd8528a/html5/thumbnails/6.jpg)
Motivation – GPUs & FPGAs
GPUs and FPGAs are both well suited to data-parallel computation.
GPUs were originally used for computer graphics and are now used in a huge number of application areas (GPGPU).
FPGAs are used for specialised applications requiring high performance at low power (radar processing, TCP/IP packet processing, …).
![Page 7: Towards a Heterogeneous Computer Architecture for CACTuS](https://reader036.vdocuments.site/reader036/viewer/2022062410/5681658b550346895dd8528a/html5/thumbnails/7.jpg)
Heterogeneous Computing
Each computational resource has strengths and weaknesses.
Heterogeneous computing uses a mix of different computing resources for computation, drawing on the strengths of each.
Figure labels: CPUs, FPGAs, GPUs
![Page 8: Towards a Heterogeneous Computer Architecture for CACTuS](https://reader036.vdocuments.site/reader036/viewer/2022062410/5681658b550346895dd8528a/html5/thumbnails/8.jpg)
Heterogeneous Computing Systems
Constructing a hardware prototype with disparate compute resources is easy.
Application development for such a system is hard: algorithm translation, design partitioning, languages and development environments, models of computation, communication and data transfer, etc.
How can designs be partitioned across the different computing resources?
![Page 9: Towards a Heterogeneous Computer Architecture for CACTuS](https://reader036.vdocuments.site/reader036/viewer/2022062410/5681658b550346895dd8528a/html5/thumbnails/9.jpg)
Project Goals
Develop a heterogeneous computer architecture for CACTuS.
Maintain tracking metrics comparable to the MATLAB “gold standard”.
Improve the execution performance of the algorithm.
Figure labels: MATLAB std, MATLAB freq, CPU, GPU, FPGA, Hetero
![Page 10: Towards a Heterogeneous Computer Architecture for CACTuS](https://reader036.vdocuments.site/reader036/viewer/2022062410/5681658b550346895dd8528a/html5/thumbnails/10.jpg)
Our Research Platform
Xenon Systems workstation: Intel Xeon X5677 quad-core CPU @ 3.46 GHz, 6 GB DDR3 DRAM, NVIDIA Quadro 4000 GPU (2 GB GDDR5 DRAM, OpenCL 1.1 device, CUDA 2.0 device).
Alpha Data ADM-XRC-6T1 FPGA board: Xilinx Virtex-6 XC6VLX550T FPGA (549,888 logic cells, 864 DSP slices, 2.844 MB BRAM), 2 GB off-chip DDR3 DRAM, connects to host via PCIe 2.0 x4.
![Page 11: Towards a Heterogeneous Computer Architecture for CACTuS](https://reader036.vdocuments.site/reader036/viewer/2022062410/5681658b550346895dd8528a/html5/thumbnails/11.jpg)
Development Approach
Maintain similar high-level abstractions across all versions.
Use third-party libraries and designs, open source where possible.
Take an incremental approach to overall development.
![Page 12: Towards a Heterogeneous Computer Architecture for CACTuS](https://reader036.vdocuments.site/reader036/viewer/2022062410/5681658b550346895dd8528a/html5/thumbnails/12.jpg)
Design Decision – Common Infrastructure
It was necessary to develop the C++/CPU version, as much of its infrastructure code would be reused by the GPU and FPGA versions.
This infrastructure includes video, MATLAB and text file I/O, visualisation, timing and unit testing.
Third-party libraries used for this infrastructure: Qt (visualisation); Boost (non-MATLAB file I/O, timing, unit testing); MatIO (MATLAB file I/O).
![Page 13: Towards a Heterogeneous Computer Architecture for CACTuS](https://reader036.vdocuments.site/reader036/viewer/2022062410/5681658b550346895dd8528a/html5/thumbnails/13.jpg)
Design Decisions – C++/CPU
To reduce development time and help ensure high-level similarity with the MATLAB code, the open-source C++ linear algebra library Armadillo was used.
At the start of development (late 2011), Armadillo did not feature any 2D FFT implementation (note 1), so the industry-standard FFTW library was used (note 2).
Note 1: 2D FFT support has since been added to Armadillo, in September 2013.
Note 2: MATLAB itself uses FFTW for computing FFTs.
![Page 14: Towards a Heterogeneous Computer Architecture for CACTuS](https://reader036.vdocuments.site/reader036/viewer/2022062410/5681658b550346895dd8528a/html5/thumbnails/14.jpg)
Design Decisions – OpenCL/GPU
There are essentially two choices of GPU programming framework: CUDA and OpenCL.
CUDA: limited to NVIDIA hardware, mature, with good supporting libraries (such as CUBLAS and CUFFT) and good development tools.
OpenCL: vendor agnostic, less mature, and portable beyond GPUs (multicore CPUs, GPUs, DSPs, FPGAs).
OpenCL was selected to avoid vendor lock-in and with an eye to the future, as OpenCL was considered likely to become dominant.
![Page 15: Towards a Heterogeneous Computer Architecture for CACTuS](https://reader036.vdocuments.site/reader036/viewer/2022062410/5681658b550346895dd8528a/html5/thumbnails/15.jpg)
Design Decisions – OpenCL/GPU
To reduce development time and help ensure high-level similarity with the MATLAB code, the open-source OpenCL computing library ViennaCL was used.
ViennaCL provided methods for most of the linear algebra operations required for CACTuS.
It did not support complex numbers, but as complex numbers are only required for 2D frequency-domain convolution, workarounds were possible.
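One common workaround when a library only offers real-valued arrays is to store the real and imaginary parts in separate arrays and build the complex per-element product from real operations. The sketch below illustrates that idea in plain Python; it is an assumption about what such a workaround could look like, not the project's actual code and not a ViennaCL API. The function name `complex_multiply_split` and the flat-list layout are hypothetical.

```python
def complex_multiply_split(a_re, a_im, b_re, b_im):
    """Per-element complex product using only real arithmetic.

    Each argument is a flat list of floats (hypothetical layout: one list
    for the real parts and one for the imaginary parts of each operand).
    """
    # (ar + j*ai) * (br + j*bi) = (ar*br - ai*bi) + j*(ar*bi + ai*br)
    out_re = [ar * br - ai * bi for ar, ai, br, bi in zip(a_re, a_im, b_re, b_im)]
    out_im = [ar * bi + ai * br for ar, ai, br, bi in zip(a_re, a_im, b_re, b_im)]
    return out_re, out_im


if __name__ == "__main__":
    re, im = complex_multiply_split([1.0, 3.0], [2.0, -1.0], [2.0, 0.5], [-1.0, 4.0])
    print(re, im)
```

Each of the four real products maps onto an ordinary element-wise multiply, so the same pattern can be expressed with any real-only linear algebra library.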
![Page 16: Towards a Heterogeneous Computer Architecture for CACTuS](https://reader036.vdocuments.site/reader036/viewer/2022062410/5681658b550346895dd8528a/html5/thumbnails/16.jpg)
Design Decisions – FPGA/Bluespec
Traditional HDLs (Verilog and VHDL) are very low level and require the designer to design control logic, implement hardware flow control, etc.: design flexibility, but lower productivity.
Bluespec (BSV) is a modern, rule-based, high-level HDL: the rule-based approach naturally matches the parallel nature of hardware, and the designer is freed from (error-prone) control logic design.
Alpha Data hardware infrastructure and SDK: data is moved to and from the FPGA via 4 DMA channels; drivers and SDK on the PC side, support hardware and reference designs on the FPGA side.
![Page 17: Towards a Heterogeneous Computer Architecture for CACTuS](https://reader036.vdocuments.site/reader036/viewer/2022062410/5681658b550346895dd8528a/html5/thumbnails/17.jpg)
Design Decisions - Heterogeneous
How best to map the algorithm to the heterogeneous platform? This is still a work in progress and is currently being explored.
![Page 18: Towards a Heterogeneous Computer Architecture for CACTuS](https://reader036.vdocuments.site/reader036/viewer/2022062410/5681658b550346895dd8528a/html5/thumbnails/18.jpg)
Bottleneck - Observe Velocity
The Observe Velocity stage of CACTuS was the primary focus for the FPGA; it generally accounts for between 40% and 90%+ of the total FLOPs of the algorithm.
To perform Observe Velocity in the frequency domain:
2D-FFT on Xs to give Xs_freq
2D-FFT on Xm to give Xm_freq
Per-element multiply between Xs_freq and Xm_freq to give Vm_freq
2D-IFFT on Vm_freq to give Vm
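The four frequency-domain steps listed for Observe Velocity can be sketched with NumPy as follows. This is a minimal illustration, assuming Xs and Xm are real-valued 2D arrays of the same size; the function names are hypothetical and the code is not taken from the CACTuS implementation. Without zero padding, the per-element product corresponds to circular (wrap-around) convolution, which the direct reference function makes explicit.

```python
import numpy as np


def observe_velocity_freq(Xs, Xm):
    """Frequency-domain product of two equally sized real 2D arrays."""
    Xs_freq = np.fft.fft2(Xs)      # 2D-FFT on Xs
    Xm_freq = np.fft.fft2(Xm)      # 2D-FFT on Xm
    Vm_freq = Xs_freq * Xm_freq    # per-element multiply
    # 2D-IFFT; the inputs are real, so only the real part is kept.
    return np.real(np.fft.ifft2(Vm_freq))


def circular_convolve_direct(a, b):
    """Reference circular convolution computed directly in the spatial domain."""
    out = np.zeros(a.shape, dtype=float)
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            # Shift b by (i, j) with wrap-around and weight it by a[i, j].
            out += a[i, j] * np.roll(b, shift=(i, j), axis=(0, 1))
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Xs = rng.random((8, 8))
    Xm = rng.random((8, 8))
    Vm = observe_velocity_freq(Xs, Xm)
    print(np.allclose(Vm, circular_convolve_direct(Xs, Xm)))
```

The direct version needs on the order of N^4 multiply-adds for an N x N image, while the FFT route needs on the order of N^2 log N, which is why the frequency-domain variant is attractive for this stage.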
![Page 19: Towards a Heterogeneous Computer Architecture for CACTuS](https://reader036.vdocuments.site/reader036/viewer/2022062410/5681658b550346895dd8528a/html5/thumbnails/19.jpg)
Implementation - Hardware
![Page 20: Towards a Heterogeneous Computer Architecture for CACTuS](https://reader036.vdocuments.site/reader036/viewer/2022062410/5681658b550346895dd8528a/html5/thumbnails/20.jpg)
Evaluation Methods – Performance and Accuracy
Evaluation of computational performance and tracking accuracy is needed to verify development and to provide a basis for comparison.
Functionality for evaluating computational performance (timing) and tracking accuracy (tracking metrics) is integrated into the common infrastructure: it allows evaluation of single executions, while external scripts allow evaluation of batch jobs.
![Page 21: Towards a Heterogeneous Computer Architecture for CACTuS](https://reader036.vdocuments.site/reader036/viewer/2022062410/5681658b550346895dd8528a/html5/thumbnails/21.jpg)
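Results – Performance

| Variant | Time/frame (ms) | Frames/sec (fps) |
|---|---|---|
| MATLAB std | 269.24 | 3.71 |
| MATLAB freq | 203.74 | 4.91 |
| CPU std | 671.78 | 1.49 |
| CPU freq | 371.26 | 2.69 |
| GPU std | 1438.11 | 0.70 |
| GPU std (note 1) | 132.26 | 7.56 |
| GPU std (note 2) | 116.37 | 8.59 |
| FPGA std | 684.15 | 1.46 |
| FPGA freq | 288.74 | 3.46 |
| Hetero std | ~110 | ~9.09 |
| Hetero freq | ~98 | ~10.20 |

Note 1: Vm performed on CPU.
Note 2: Vm performed on GPU, padded to nearest power of 2.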
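Padding an image to power-of-2 dimensions before an FFT, as mentioned in the second note, could look like the sketch below. It assumes zero padding up to the next power of two at or above each dimension; the source does not say which padding value or placement was used, and the function names are hypothetical.

```python
def next_power_of_two(n):
    """Smallest power of two that is >= n (n must be a positive integer)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return 1 << (n - 1).bit_length()


def pad_to_power_of_two(image):
    """Zero-pad a 2D list of numbers (bottom and right) to power-of-2 dimensions."""
    rows, cols = len(image), len(image[0])
    padded_rows, padded_cols = next_power_of_two(rows), next_power_of_two(cols)
    # Extend each existing row with zeros, then append all-zero rows.
    padded = [list(row) + [0.0] * (padded_cols - cols) for row in image]
    padded += [[0.0] * padded_cols for _ in range(padded_rows - rows)]
    return padded


if __name__ == "__main__":
    image = [[1.0] * 100 for _ in range(60)]
    padded = pad_to_power_of_two(image)
    print(len(padded), len(padded[0]))
```

Power-of-2 sizes generally suit radix-2 style FFT implementations, at the cost of processing more elements than the original image contains.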
Bar chart: frames per second (fps) for MATLAB_std, MATLAB_freq, CPU_std, CPU_freq, GPU_std, GPU_std1, GPU_std2, FPGA_std, FPGA_freq, Hetero_std and Hetero_freq; axis range 0 to 12.
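The frames-per-second figures follow directly from the per-frame times, since fps = 1000 / (time per frame in ms). The short sketch below recomputes them from the times reported in the results; the dictionary and function names are illustrative only, and the two Hetero entries use the approximate times quoted in the slides.

```python
# Time per frame in milliseconds, as reported in the performance results.
TIME_PER_FRAME_MS = {
    "MATLAB_std": 269.24,
    "MATLAB_freq": 203.74,
    "CPU_std": 671.78,
    "CPU_freq": 371.26,
    "GPU_std": 1438.11,
    "GPU_std1": 132.26,
    "GPU_std2": 116.37,
    "FPGA_std": 684.15,
    "FPGA_freq": 288.74,
    "Hetero_std": 110.0,   # approximate (~110 ms)
    "Hetero_freq": 98.0,   # approximate (~98 ms)
}


def frames_per_second(time_per_frame_ms):
    """Convert a per-frame time in milliseconds to frames per second."""
    return 1000.0 / time_per_frame_ms


if __name__ == "__main__":
    for variant, ms in TIME_PER_FRAME_MS.items():
        print(f"{variant}: {frames_per_second(ms):.2f} fps")
```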
![Page 22: Towards a Heterogeneous Computer Architecture for CACTuS](https://reader036.vdocuments.site/reader036/viewer/2022062410/5681658b550346895dd8528a/html5/thumbnails/22.jpg)
Limitations & Problems
Early phase of exploring algorithm mappings to the heterogeneous platform.
Third-party libraries are not efficient.
Task parallelism is not yet exploited.
Refactoring the algorithm flow control would lose the connection with the MATLAB “gold standard” version.
The software aspect of the project is complex: multiple developers, multiple third-party libraries.
The FPGA pipeline is currently limited to 2D frequency-domain convolution, which is only relevant to the predict and observe stages; it is also limited in size due to resource utilisation constraints.
Many issues were encountered with FPGA development.
![Page 23: Towards a Heterogeneous Computer Architecture for CACTuS](https://reader036.vdocuments.site/reader036/viewer/2022062410/5681658b550346895dd8528a/html5/thumbnails/23.jpg)
Lessons Learnt
Developer (in)experience greatly impacts development time and achieved performance.
OpenCL is difficult to develop with, though it is becoming easier as it matures and its associated libraries improve; CUDA might have been a better initial choice.
Using immature libraries is not the best idea (unless frequent code changes are your idea of fun).
FPGA functionality takes a lot of time and effort to develop: evaluate exactly what functionality is required to meet performance constraints.
![Page 24: Towards a Heterogeneous Computer Architecture for CACTuS](https://reader036.vdocuments.site/reader036/viewer/2022062410/5681658b550346895dd8528a/html5/thumbnails/24.jpg)
Future Work
Continue to improve exploitation of data parallelism: the current implementation is likely to be inefficient due to its use of small kernels, so consider combining small kernels.
Task parallelism is not yet exploited: incorporate multi-core threading to fully exploit it, and investigate the problem of scheduling computational resources in the system.
DRAM integration would greatly benefit FPGA performance (images are currently not large enough to amortise DMA overheads) and open up further application mappings.
![Page 25: Towards a Heterogeneous Computer Architecture for CACTuS](https://reader036.vdocuments.site/reader036/viewer/2022062410/5681658b550346895dd8528a/html5/thumbnails/25.jpg)
Questions?
![Page 26: Towards a Heterogeneous Computer Architecture for CACTuS](https://reader036.vdocuments.site/reader036/viewer/2022062410/5681658b550346895dd8528a/html5/thumbnails/26.jpg)
CPUs
Single Instruction Single Data (SISD)
Excel at sequential code, heavily branched code and task-parallel workloads
Easiest to develop for: software languages, environments, strong debugging
Easily understood memory architecture – generally transparent to developer
GPUs
Single Instruction Multiple Data (SIMD)
Excel at data-parallel workloads with deterministic memory accesses
Best architecture for floating point arithmetic
Moderate to develop for: few languages but rapidly maturing ecosystem
Moderately complex memory architecture – developer must be aware of structure
FPGAs
No fixed model of computation, designer defined
Flexible enough to excel at a variety of tasks
Best architecture for fixed point arithmetic
Excel at bit, integer and logic operations
Difficult to develop for: hardware languages (HDLs), simulators
Memory architecture required to be defined by designer
![Page 27: Towards a Heterogeneous Computer Architecture for CACTuS](https://reader036.vdocuments.site/reader036/viewer/2022062410/5681658b550346895dd8528a/html5/thumbnails/27.jpg)
Specs – Alpha Data ADM-XRC-6T1
![Page 28: Towards a Heterogeneous Computer Architecture for CACTuS](https://reader036.vdocuments.site/reader036/viewer/2022062410/5681658b550346895dd8528a/html5/thumbnails/28.jpg)
Tracking Metrics
![Page 29: Towards a Heterogeneous Computer Architecture for CACTuS](https://reader036.vdocuments.site/reader036/viewer/2022062410/5681658b550346895dd8528a/html5/thumbnails/29.jpg)
Results – Detailed Accuracy
![Page 30: Towards a Heterogeneous Computer Architecture for CACTuS](https://reader036.vdocuments.site/reader036/viewer/2022062410/5681658b550346895dd8528a/html5/thumbnails/30.jpg)
Results - Detailed Performance
![Page 31: Towards a Heterogeneous Computer Architecture for CACTuS](https://reader036.vdocuments.site/reader036/viewer/2022062410/5681658b550346895dd8528a/html5/thumbnails/31.jpg)
Nature of FPGA Design
The reconfigurable nature of FPGAs is both the major strength and the major weakness of the platform:
Freedom to create custom hardware structures and circuits for specific purposes = specialised, efficient, high-performance hardware.
No existing microarchitecture, so the designer needs to create one = long development time, hard to debug, and a huge number of options to consider and choices to be made.
Design Space Exploration (DSE): the design space (DS) encapsulates all possible variations and permutations of designs that implement a system.
![Page 32: Towards a Heterogeneous Computer Architecture for CACTuS](https://reader036.vdocuments.site/reader036/viewer/2022062410/5681658b550346895dd8528a/html5/thumbnails/32.jpg)
Challenges of HW Design
Debugging FPGA designs is hard and time consuming, requiring a combination of simulation and run-time techniques.
To simulate in software, testbenches must be developed and waveforms analysed.
To analyse behaviour at run-time in hardware, ChipScope cores must be inserted (modifying the design) and waveforms analysed.
![Page 33: Towards a Heterogeneous Computer Architecture for CACTuS](https://reader036.vdocuments.site/reader036/viewer/2022062410/5681658b550346895dd8528a/html5/thumbnails/33.jpg)
Limitations of Current FPGA Implementation
Using the 128-point configuration: BRAM utilisation is 24% and timing constraints are just met.
Using the 256-point configuration: BRAM utilisation is 85% and timing constraints are not met; timing paths associated with the larger BRAMs are the main cause of problems.
Because the 256-point configuration fails its timing constraints, the design is currently restricted to the 128-point configuration (2D-FFTs on 128 x 128 images).
Moving away from exclusive use of BRAM by incorporating off-chip DRAM will likely allow much larger input images.
![Page 34: Towards a Heterogeneous Computer Architecture for CACTuS](https://reader036.vdocuments.site/reader036/viewer/2022062410/5681658b550346895dd8528a/html5/thumbnails/34.jpg)
Future Work FPGA - DRAM
Complete the integration of DRAM into the infrastructure. Unfortunately this is not a plug-and-play solution: a reference design is available, but additional components have been required to interface between the existing Alpha Data infrastructure and the system components developed with Bluespec.
There are also additional clock domains, and many additional constraints to be considered.
A design for testing the initial integration of DRAM into the system is close to being finalised.
Modules to perform transpose operations in DRAM have already been developed, so once integration is verified, using DRAM with the 2D frequency-domain convolution design will be straightforward.
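Transpose operations matter for a 2D FFT because the transform is separable: it can be computed as 1D FFTs along the rows, a transpose, 1D FFTs along the rows again, and a final transpose. The NumPy sketch below shows that row-column decomposition in software. It is an illustration of the general technique only; the slides do not describe how the FPGA design orders these steps, and the function name is hypothetical.

```python
import numpy as np


def fft2_row_column(image):
    """2D FFT built from row-wise 1D FFTs and two transposes."""
    stage1 = np.fft.fft(image, axis=1)      # 1D FFT along each row
    stage2 = np.fft.fft(stage1.T, axis=1)   # transpose, then 1D FFT along each (former) column
    return stage2.T                         # transpose back to the original orientation


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((128, 128))
    print(np.allclose(fft2_row_column(image), np.fft.fft2(image)))
```

With this structure a single 1D FFT engine that always reads rows can be reused for both passes, provided the intermediate result can be transposed in memory.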
![Page 35: Towards a Heterogeneous Computer Architecture for CACTuS](https://reader036.vdocuments.site/reader036/viewer/2022062410/5681658b550346895dd8528a/html5/thumbnails/35.jpg)
Future work FPGA – Further Integration of HW Modules
A functional spatial convolution array has been developed in VHDL:
Not yet used or integrated into the system.
Has a transpose linear filtering architecture, essentially a systolic array.
Highly parallel, so it exhibits high performance, but with high DSP utilisation.
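As a rough software analogy, a transposed-form linear filter multiplies every incoming sample by all coefficients at once and accumulates the partial sums through a chain of registers, which is what makes it map well onto parallel multipliers. The sketch below models a 1D transposed-form FIR filter in Python and checks it against the direct form. It assumes this is the structure meant by "transpose linear filtering architecture"; the actual VHDL array is 2D and is not described in detail in the slides, and all names here are hypothetical.

```python
def fir_transposed(h, xs):
    """Transposed-form FIR filter: every tap sees the current sample."""
    n_taps = len(h)
    regs = [0.0] * (n_taps - 1)   # partial-sum registers between taps
    ys = []
    for x in xs:
        # The output combines the first tap with the oldest partial sum.
        y = h[0] * x + (regs[0] if regs else 0.0)
        # Each register takes its tap product plus the next register's old value.
        for k in range(n_taps - 2):
            regs[k] = h[k + 1] * x + regs[k + 1]
        if regs:
            regs[-1] = h[-1] * x
        ys.append(y)
    return ys


def fir_direct(h, xs):
    """Direct-form reference: y[n] = sum over k of h[k] * x[n - k]."""
    return [
        sum(h[k] * xs[n - k] for k in range(len(h)) if n - k >= 0)
        for n in range(len(xs))
    ]


if __name__ == "__main__":
    h = [1, 2, 3]
    xs = [1, 0, 0, 2]
    print(fir_transposed(h, xs), fir_direct(h, xs))
```

In hardware, each multiply in the inner loop would be a separate multiplier operating in the same clock cycle, which is consistent with the high DSP utilisation noted for the array.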
![Page 36: Towards a Heterogeneous Computer Architecture for CACTuS](https://reader036.vdocuments.site/reader036/viewer/2022062410/5681658b550346895dd8528a/html5/thumbnails/36.jpg)
Misc Information
MATLAB R2013a was used, with the Image Processing Toolbox.