FPGA Design Implementation – final release
DELIVERABLE NUMBER D6.6
DELIVERABLE TITLE FPGA Design Implementation – final release
RESPONSIBLE AUTHOR Nallatech Ltd
Co-funded by the Horizon 2020 Framework Program of the European Union
Ref. Ares(2018)3483418 - 30/06/2018
1 D6.6| FPGA design implementation – final release
OPERA: LOw Power Heterogeneous Architecture for Next Generation of SmaRt Infrastructure
and Platform in Industrial and Societal Applications
GRANT AGREEMENT N. 688386
PROJECT REF. NO H2020- 688386
PROJECT ACRONYM OPERA
PROJECT FULL NAME LOw Power Heterogeneous Architecture for Next Generation of SmaRt Infrastructure and Platform in Industrial and Societal Applications
STARTING DATE (DUR.) 01/12/2015 (36 months)
ENDING DATE 30/11/2018
PROJECT WEBSITE www.operaproject.eu
WORKPACKAGE N. | TITLE WP6 | Low Power Small Form Factor Datacentre
WORKPACKAGE LEADER Nallatech Ltd
DELIVERABLE N. | TITLE D6.6 | FPGA Design Implementation – final release
RESPONSIBLE AUTHOR Richard Chamberlain, Nallatech Ltd
DATE OF DELIVERY (CONTRACTUAL) 30/06/2018 (M31)
DATE OF DELIVERY (SUBMITTED) 30/06/2018 (M31)
VERSION | STATUS V1.0
NATURE R(Report)
DISSEMINATION LEVEL PU(Public)
AUTHORS (PARTNER) Richard Chamberlain (Nallatech), Giulio Urlini (STM), Roberto Peveri (TESEO), Daniele Paolini (TESEO)
VERSION MODIFICATION(S) DATE AUTHOR(S)
0.1 Initial update to D6.2 for review 13/06/2018 Richard Chamberlain (Nallatech)
0.2 First internal review 18/06/2018 Giulio Urlini (STM)
0.3 Second internal review 27/06/2018 Roberto Peveri (TESEO), Daniele Paolini (TESEO)
1.0 Final review 29/06/2018 Richard Chamberlain (Nallatech)
PARTICIPANTS CONTACT
STMICROELECTRONICS SRL
Giulio Urlini
Email: [email protected]
IBM ISRAEL SCIENCE AND TECHNOLOGY LTD
Joel Nider
Email: [email protected]
HEWLETT PACKARD CENTRE DE COMPETENCES (FRANCE)
Cristian Gruia
Email: [email protected]
NALLATECH LTD
Craig Petrie
Email: [email protected]
ISTITUTO SUPERIORE MARIO BOELLA
Olivier Terzo
Email: [email protected]
TECHNION ISRAEL INSTITUTE OF TECHNOLOGY
Dan Tsafrir
Email: [email protected]
CSI PIEMONTE
Vittorio Vallero
Email: [email protected]
NEAVIA TECHNOLOGIES
Stéphane Gervais
Email: [email protected]
CERIOS GREEN BV
Frank Verhagen
Email: [email protected]
TESEO SPA
Stefano Serra
Email: [email protected]
DEPARTEMENT DE L'ISERE
Olivier Latouille
Email: [email protected]
ACRONYMS LIST
Acronym Description
AVX Advanced Vector Extensions
BSP Board Support Package
CAPI Coherent Accelerator Processor Interface
FPGA Field Programmable Gate Array
GPGPU General Purpose Graphics Processing Unit
HDL Hardware Description Language
HPE Hewlett Packard Enterprise
I/O Input Output
IoT Internet of Things
MAC Multiply Accumulate Operations
MTBF Mean Time Between Failure
OpenCL Open Computing Language
PCIe PCI Express
QSFP Quad Small Form-factor Pluggable
SIMD Single Instruction Multiple Data
SoC System on Chip
SWAP Size, Weight and Power
HPS Hard Processor System
LIST OF FIGURES
Figure 1 : SOC accelerator functional diagram ............ 11
Figure 2 : SOC accelerator physical layout ............ 12
Figure 3 : SOC Accelerator functional diagram ............ 13
Figure 4 : 385A-SOC System Manager ............ 15
Figure 5 : QSFP28 clocking structure ............ 17
Figure 6 : Extended Front Panel Connections ............ 17
Figure 7 : Arria 10 FPGA External Clocking Options ............ 19
Figure 8 : 10-pin USB header ............ 20
Figure 9 : 10-pin dual USB connector ............ 20
Figure 10 : USB daisy chain connectivity ............ 20
Figure 11 : I2C Addressing ............ 22
Figure 12 : Altera temperature sensor IP ............ 26
Figure 13 : SOC Active Heat Sink ............ 27
Figure 14 : Air flow drawn through active heat sink ............ 27
Figure 15 : Autodesk CFD Arria 10 Stable Board Temperature (50 Watt design) ............ 28
Figure 16 : Altera ARM Cortex A9 Hard Processor System (HPS) ............ 29
Figure 17 : Altera SOC Device Block Diagram ............ 30
Figure 18 : Boot Memory Locations ............ 31
Figure 19 : FPGA Configuration Block Diagram ............ 32
Figure 20 : SOC Configuration with boot Sources ............ 33
Figure 21 : HPS boots from FPGA ............ 33
Figure 22 : ARM Base Configuration ............ 35
Figure 23 : Intel FPGA OpenCL tool flow ............ 37
Figure 24 : BSP Components ............ 38
Figure 25 : BSP component interaction ............ 40
Figure 26 : SOC reference IP placement ............ 40
Figure 27 : Host channel reference design placement (Nallatech 385A device) ............ 41
Figure 28 : Nallatech 385A base.aocx ............ 42
Figure 29 : Altera SDK for OpenCL software architecture ............ 43
Figure 30 : OpenCL MICMAC software components ............ 46
Figure 31 : Basic OpenCL streaming interface sequence diagram ............ 47
Figure 32 : Command kernel example sequence diagram ............ 48
Figure 33 : Host to SOC interfaces ............ 50
Figure 34 : Using the FPGA as a common interconnect ............ 52
Figure 35 : Block diagram for single channel on 385A SOC ............ 53
Figure 36 : Arria10 SOC Development Kit ............ 56
Figure 37 : Arria 10 Development Kit Block Diagram ............ 57
Figure 38 : BSP version 2 with CPU as master ............ 59
Figure 39 : BSP Version 2 Floor plan ............ 60
Figure 40 : Block diagram of ANN kernels ............ 63
Figure 41 : ANN power ............ 64
LIST OF TABLES
Table 1 : SOC Accelerator Feature List ............ 12
Table 2 : FPGA Voltage Settings ............ 14
Table 3 : ID PROM Data ............ 23
Table 4 : System Manager Status LEDs ............ 24
Table 5 : User LEDs ............ 25
Table 6 : SOC accelerator BSP components ............ 39
Table 7 : Serial link features ............ 52
Table 8 : Serial channel register addresses ............ 54
Table 9 : Arria10 SOC development kit hardware development tasks ............ 56
Table 10 : BSP development kit work ............ 58
Table 11 : BSP version resource use ............ 60
Table 12 : Example Power Monitoring Output Data ............ 62
Table 13 : ANN kernel resources required (BSP v2) ............ 64
EXECUTIVE SUMMARY
The main objective of WP6 is to bring best-in-class, power-efficient FPGA (Field Programmable Gate
Array) technology into the Hewlett Packard Enterprise (HPE) Moonshot Small Form Factor Data Centre. The
flexibility of the FPGA is the enabling technology for integrating the different processing
elements of the heterogeneous architecture.
During task 6.2, Nallatech has developed a SOC-based FPGA accelerator prototype to support the OPERA
project. This document outlines the details of the device, but does not seek to justify the design choices,
as these are covered in OPERA deliverable D6.1.
Nallatech has also developed a Board Support Package (BSP) to support the OpenCL toolflow on the SOC
accelerator prototype developed for the OPERA project. This includes the development of optical serial
connections critical to scalability and interoperability of the heterogeneous processing components. This
document also describes these aspects in detail.
This document describes the hardware, firmware and software work undertaken to support a SOC FPGA
accelerator in the HPE Moonshot server. It documents the hardware features and firmware required for
control of the SOC FPGA, and the Board Support Package (BSP) software required for programming it.
D6.6 is an update of the original D6.2 document produced in project month M15. As the project
progressed it became clear that some of the design decisions made at the start of the project required
modification in order to fulfil the OPERA objectives. Sections 1-8 remain unchanged, with the updates
described thereafter.
TABLE OF CONTENTS
1 SOC ACCELERATOR OVERVIEW ......................................................................................................... 11
1.1 OVERVIEW ................................................................................................................................ 11
1.2 SOC ACCELERATOR FEATURES .................................................................................................. 11
2 HARDWARE TECHNICAL DETAILS ...................................................................................................... 14
2.1 OVERVIEW ................................................................................................................................ 14
2.2 FORM-FACTOR ......................................................................................................................... 14
2.3 USER FPGA ............................................................................................................................... 14
2.4 8-LANE PCI-EXPRESS 3.0 INTERFACE (WITH CVP), 8-LANE PCIE MECHANICAL ........................... 14
2.5 SYSTEM MANAGER ................................................................................................................... 15
2.6 USER FPGA DDR4 SDRAM ......................................................................................................... 16
2.7 SOC HPS ................................................................................................................................... 16
2.8 2X QSFP PORTS SUPPORTING 10/40 GB/S ETHERNET ............................................................... 16
2.9 FRONT PANEL CONNECTIVITY ................................................................................................... 17
2.10 CLOCKING CIRCUIT ................................................................................................................... 18
2.10.1 PCIe Clock .......................................................................................................................... 18
2.10.2 Network Clock ................................................................................................................... 18
2.10.3 Memory & General FPGA Clocks ........................................................................................ 18
2.10.4 Configuration Clock ........................................................................................................... 18
2.10.5 Transceiver Clock............................................................................................................... 18
2.10.6 Clocking Options ............................................................................................................... 19
2.10.7 External 1 Pulse per Second Clock (1PPS) .......................................................................... 19
2.11 10-PIN USB HEADER ................................................................................................................. 20
2.12 ON-BOARD USB-BLASTER II....................................................................................................... 21
2.13 UART TO USB INTERFACE.......................................................................................................... 21
2.14 JTAG UTILITIES .......................................................................................................................... 21
2.15 I2C DEVICES .............................................................................................................................. 22
2.16 ID PROM ................................................................................................................................... 23
2.17 STATUS LEDS ............................................................................................................................ 24
2.18 USER LEDS ................................................................................................................................ 25
2.19 CARD INITIALISATION & SYSTEM RESET .................................................................................... 25
2.19.1 CvP Autonomous mode ..................................................................................................... 25
2.20 CONTROL & TEMPERATURE SENSOR ........................................................................................ 26
2.20.1 FPGA Temperature Alert Procedure .................................................................................. 26
2.21 POWER ..................................................................................................................................... 26
2.22 THERMAL ................................................................................................................................. 27
3 ARRIA 10 HARD PROCESSOR SYSTEM ............................................................................................... 29
3.1 ALTERA GX660 SOC HARD PROCESSOR SYSTEM (HPS) .............................................................. 29
3.2 HPS OVERVIEW ......................................................................................................................... 30
4 SOC ACCELERATOR CONFIGURATION ............................................................................................... 32
4.1 FPGA CONFIGURATION OVERVIEW ........................................................................................... 32
4.2 CONFIGURATION WITH BOOT SOURCES ................................................................................... 33
4.3 CONFIGURATION WITH SELF HPS BOOT ................................................................................... 33
4.3.1 QSPI Configuration ............................................................................................................ 34
4.3.2 Configuration via USB ........................................................................................................ 34
4.4 ARM SIDE CONFIGURATION...................................................................................................... 34
5 SOC ACCELERATOR BSP .................................................................................................................... 37
5.1 OPENCL TOOL FLOW................................................................................................................. 37
5.2 BSP OVERVIEW ......................................................................................................................... 38
5.3 OPERA BSP REQUIREMENTS ..................................................................................................... 38
5.4 BSP REQUIRED DELIVERABLES .................................................................................. 39
5.4.1 Base aocx file ..................................................................................................................... 39
5.4.2 Board Specification File ..................................................................................................... 42
5.4.3 Kernel Driver ..................................................................................................................... 42
5.5 PLATFORM SUPPORT ................................................................................................................ 43
5.6 SERIAL LINKS ............................................................................................................................ 43
5.7 ACHIEVING BEST FMAX OF BSP ................................................................................................. 43
5.7.1 Routing Islands .................................................................................................................. 43
5.7.2 Multiple Seed Sweeps ....................................................................................................... 43
5.8 LIMITING IMPACT ON RESOURCE ............................................................................................. 44
5.9 OPENCL INTERFACES ................................................................................................................ 44
5.9.1 Accessing the serial link interface from OpenCL ................................................................ 44
5.9.2 Example loopback design for host channels....................................................................... 45
5.9.3 DDR4 Memory Access ....................................................................................................... 45
5.10 SUPPORT FOR FUTURE INTEL TOOL VERSIONS .......................................................................... 45
5.11 MICMAC APPLICATION APPROACH ........................................................................................... 45
5.12 HOST KERNEL INTERFACE (EXAMPLE 1) .................................................................................... 47
5.13 HOST KERNEL INTERFACE (EXAMPLE 2) .................................................................................... 47
6 HOST (X86) PLATFORM .................................................................................................................... 49
6.1 HOST CHANNELS ...................................................................................................................... 49
6.2 HOST TO ARM PROPRIETARY CONTROL INTERFACE .................................................................. 49
6.2.1 X86 Host driver and API ..................................................................................................... 50
6.2.2 Multiple cards ................................................................................................................... 51
7 SERIAL INTERCONNECT DETAILS ....................................................................................................... 52
7.1 OVERVIEW ................................................................................................................................ 52
7.2 SERIAL CHANNEL IP DETAILS ..................................................................................................... 52
7.3 SERIAL CHANNEL DEBUG .......................................................................................................... 53
7.3.1 Control and Status Registers .............................................................................................. 53
7.3.2 Serial Channel Status Register ........................................................................................... 54
7.3.3 Serial Channel Control Register ......................................................................................... 54
7.3.4 Kernel Rx Ready Performance Accumulator Register ......................................................... 54
7.3.5 Kernel Tx Throughput Performance Accumulator Register ................................................ 54
7.3.6 Performance Control Register ........................................................................................... 55
7.3.7 MMD Support Serial Channel CSR Functions ..................................................................... 55
8 INITIAL DESIGN EVALUATION PLATFORM .......................................................................... 56
8.1 OVERVIEW ................................................................................................................................ 56
9 BSP VERSION 2 (FINAL) ..................................................................................................................... 59
9.1 MODIFICATIONS REQUIRED ...................................................................................................... 59
9.2 UPDATED BSP ........................................................................................................................... 59
9.3 POWER MONITORING USING THE EMBEDDED ARM ................................................................. 61
9.4 INTEL OPENCL VERSION 17.1.2 ................................................................................................. 62
10 BSP EXAMPLE USE-CASE ............................................................................................................... 63
10.1 ANN MICMAC USE-CASE ........................................................................................................... 63
10.2 POWER MONITORING .............................................................................................................. 64
11 CONCLUSION ............................................................................................................................... 65
12 APPENDIX: EXAMPLE COMMAND QUEUE OPENCL ....................................................................... 66
12.1 COMMAND QUEUE EXAMPLE ................................................................................................... 66
1 SOC ACCELERATOR OVERVIEW
1.1 OVERVIEW
In order to be state of the art and to embody the heterogeneous spirit of the OPERA project, the
key hardware requirements for the OPERA project are as follows:
• Arria 10 technology (embedded floating-point units were key; the state-of-the-art FPGA at
the commencement of the project in 2015).
• An embedded processor, which increases the heterogeneity of the system and provides more
research possibilities for compute offload and power monitoring.
• Off-chip communications for scalability and a common interconnect between different host
platforms, e.g. x86 and ARM.
• Host platform communications (CAPI and PCIe) for POWER and x86 devices.
• External DDR memory for the ARM and OpenCL kernels, large enough to store the compute data
for a number of problems.
The following documentation describes the different hardware elements that have been generated to
fulfil these requirements.
1.2 SOC ACCELERATOR FEATURES
Figure 1 : SOC accelerator functional diagram
Figure 1 shows the key features of the SOC accelerator. The PCIe x8 Gen 3 interface is used for
communication with the host system; alternatively, the ARM and OpenCL kernels can run in a standalone
configuration if this proves preferable. Two DDR memory banks are attached to provide application memory
for the FPGA and the embedded ARM processors. QSFP network ports provide high-speed communications
between multiple boards. The CPLD is used for bring-up and configuration control of the FPGA. Figure 2
illustrates the physical layout of the different hardware components.
Figure 2 : SOC accelerator physical layout
The following table lists the different hardware features of the SOC accelerator.
Feature Description
Low Profile PCI-Express form factor
This is a single-width, ½ height, ½ length (167.6mm x
68.9mm x 17mm) card, which allows multiple FPGA cards
to be integrated into the Moonshot Edgeline server.
PCI-Express 3.0 host interface Electrical and Mechanical x8 PCIe interface. Highest
speed host interface supported by FPGA.
Altera Arria® 10 SX 660 FPGA
Powerful FPGA with dual ARM Cortex-A9 processor
and FPGA fabric with up to 1 TFlop/Sec processing
capability.
2 QSFP28 ports These will be used for inter card communications.
2 banks of 4 GByte, x72, 2133MT/s, DDR4 SDRAMs
External memories supporting up to 34 GBytes/Sec.
MAX 10 FPGA (System manager) Small FPGA/CPLD for system/configuration
management of Arria 10 FPGA.
QSPI 2Gb FLASH memory 2 Giga bits for external flash storage for multiple FPGA
configuration images.
JTAG Interface USB-Blaster II Interface for FPGA JTAG access (10-pin
USB header)
USB USB on Front Panel with integrated USB Hub and board
to board USB management interconnect
User clocks External clocks for user logic.
LEDs External user LEDS
Table 1: SOC Accelerator Feature List
Figure 3 illustrates how the different features are connected on the device.
Figure 3 : SOC Accelerator functional diagram
2 HARDWARE TECHNICAL DETAILS
2.1 OVERVIEW
This section lists the key technical details of the SOC accelerator.
2.2 FORM-FACTOR
The SOC accelerator is a low-profile, single-width PCIe add-in card with an x8 electrical and x8
mechanical interface. The card is 68.90mm high and 167.65mm long (the recommended dimensions
for a low-profile PCIe card). The width of the card is 2.5mm at the rear and 14.47mm at the front,
complying with the PCI single-slot width dimensions. The PCIe interface complies with the PCIe 3.0
specification. Strict adherence to the PCIe specification is required to ensure successful integration with
the Moonshot Edgeline server.
2.3 USER FPGA
The User FPGA is an Altera Arria 10 SX660 in a F34 package. The FPGA I/O banks are powered by the
supplies as detailed in Table 2. FPGA core voltage (Vcore) is 0.9V.
Signals Bank Bank IO Voltage
QSFP 0 1E N/A1
QSFP 1 1F N/A1
PCIe x8 Port 1C, 1D N/A1
FPGA Configuration I/O 2A 1.8 V
LEDS, I2C, CLK, USB, Misc 3A, 3B, 3F, 2I 1.8 V
DDR4 Bank FPGA 3C, 3D, 3E 1.2 V
DDR4 Bank HPS 2J, 2K 1.2 V
Table 2: FPGA Voltage Settings
2.4 8-LANE PCI-EXPRESS 3.0 INTERFACE (WITH CVP), 8-LANE PCIE MECHANICAL
The SOC board has an 8-lane PCIe 3.0 interface. It does not feature a dedicated PCIe bridge chip for host
transfers, so the user FPGA design must include the Altera PCIe Hard IP core. Altera supports
multiple configurations of the PCIe core as part of QSYS; the user can set up the core for anything from 1
lane at PCIe 1.0 to 8 lanes at PCIe 3.0.
The PCIe interface has the following capabilities:
• Host PCIe bandwidth up to 8 GB/s2 (8 lanes at 8Gbps – PCIe 3.0) with CvP support using the
Altera QSYS Hard IP
• System Management Bus (SMBus)
The SOC accelerator ID PROM can be read by the host over the System Manager (SM) Bus on the PCIe
connection while the host is in standby mode (on-board ID PROM is powered by the PCIe 3.3V AUX
power supply). The I2C address of the PROM is 0x50. This is facilitated by the System Manager device.
1 VccR & VccT set to 1.03 volts, VccH at 1.8 volts, VccA at 1.8 Volts
2 Maximum theoretical data rate for 8 lanes of PCIe 3.0, the actual host bandwidth depends on the host hardware (motherboard, chipset,
processor, etc.), the Hard IP settings and the FPGA design itself.
2.5 SYSTEM MANAGER
In order to control the behaviour of the SOC FPGA device, a system manager has been created and is used
to perform the following functions:
• Configuration of FPGA with support for fallback.
• QSPI flash interfacing
• Power monitoring and sequencing
• I2C interface bridging to peripherals
• JTAG bridging to the User FPGA
• Write-back functionality, allowing the FPGA to update the flash and communicate with peripherals
• Environmental monitoring using the peripheral sensors, responding to warnings/alerts as
necessary
• GPIO hosting for miscellaneous peripheral control
• USB Blaster IP integration
A MAX10 device3 has been selected for the system management controller.
Figure 4 : 385A-SOC System Manager
A management USB header, which connects to the motherboard via a standard cable, is provided for
general monitoring and control. An FT234XD provides a USB-to-UART interface to the MAX10.
A number of FPGA signals are also grouped with the Altera Fast Passive Parallel (FPP) bus to create a
wider parallel interface. After configuration this interface (excluding any dedicated signals) is used
for read/write operations between the MAX10 and the FPGA. An FPGA-requested soft configuration (one
that does not require a power cycle) can be carried out by the MAX10. To request a soft configuration,
a user writes to the Reconfiguration Register in the System Manager Interface Core.
3 https://www.altera.com/products/fpga/max-series/max-10/overview.html
Two FPGA images, a working image and a fallback image, are stored in 2 Gbit of QSPI flash (8x256 Mbit). In
addition to FPGA configuration, the reset sequencing of all other devices is controlled by the MAX10.
Power sequencing is performed by the MAX10, which controls the power up/down sequence of the
board as well as monitoring the various “Power Good” signals. A group of board voltages is measured
via the ADCs in the MAX10.
An I2C bus collects temperature and current monitoring data, enabling the MAX10 to present
environmental data to the user. An additional SMB bus connection is provided as an option to the PCIe
fingers, enabling the server’s host sideband management layer to interface to the MAX10.
2.6 USER FPGA DDR4 SDRAM
The User FPGA fabric on the SOC accelerator is connected to one bank of DDR4 SDRAM (part number
MT40A512M8RH-083E:B) which is 72 bits wide, is configured as 4 GB and operates at 2133 MT/s.
The “Arria 10 External Memory Interfaces” (EMIF) IP should be instantiated within the QSYS design and
the parameters from the supplied QPRS files should be used.
The EMIF Hard IP (on the Arria10) has a complementary EMIF Debug Toolkit component that can be
included in any design. There is only one EMIF Debug component per I/O component (on Arria 10),
therefore the EMIF Debug component must be shared between the two available memory banks.
2.7 SOC HPS
The Arria 10 system-on-a-chip (SoC) is composed of a dual-core ARM® Cortex™ -A9 hard processor
system (HPS) and an FPGA. The HPS architecture integrates a wide set of peripherals that reduce board
size and increase performance within a system. Integrated into the HPS are a subset of peripheral
functions including:
• HPS-to-FPGA bridge port - 32, 64, or 128 bits wide
• General-purpose direct memory access (DMA) controller
• Three Ethernet media access controllers (EMACs)
• Two USB 2.0 on-the-go (OTG) controllers
• NAND Flash Controller
• QSPI flash controller
• I2C, UART, SPI, watchdogs, timers, etc.
The dual-core ARM SOC fabric on the accelerator is connected to one bank of DDR4 SDRAM which is 40
bits wide, is configured as 2 GB and operates at 2133 MT/s.
The HPS is discussed in more detail in section 3.
2.8 2X QSFP PORTS SUPPORTING 10/40 GB/S ETHERNET
The 385A-SOC features two 10/40 Gb/s capable ports. The QSFP high speed interfaces are directly
driven from IP within the user FPGA design. Altera provides IP cores for multiple high speed protocols
compatible with the 385A-SOC, each with its own reference clock requirements. The SOC
accelerator is populated with an on-board dual-frequency clock chip which, by default, feeds a
644.53125 MHz reference (for 10/40 GbE) to the dedicated transceiver IP reference FPGA clock pin.
A range of protocols can be implemented using Altera's QSYS IP core library. Different IP cores may
require different reference clock frequencies.
Each network device is allocated a dedicated MAC address. These addresses are labelled on the board
and programmed in the ID PROM.
Figure 5 : QSFP28 clocking structure
2.9 FRONT PANEL CONNECTIVITY
The front panel on the SOC accelerator provides access to the two QSFP ports and one USB port. There
are two front panel options, half-height and full-height. The half-height option allows connectivity to the two
QSFP ports and the front panel USB port. The full-height option extends this to add connectors for an
external clock and a 1PPS signal. See Figure 6.
Figure 6 : Extended Front Panel Connections
2.10 CLOCKING CIRCUIT
The clocking circuit on this board has been chosen to provide flexible, high-quality clock sources to the
FPGA and associated circuitry without adding significant cost. See Figure 7 for the clock layout.
The following clocks sources are available to the FPGA design:
• PCIe Slot Clock
• Network Clock
• Memory and General FPGA Clocks
• Configuration Clock
• Optional 1PPS External Clock input
• Optional External Clocking using clock synthesizer
2.10.1 PCIe Clock
The PCIe clock is 100MHz and is provided by the host motherboard. This clock is routed directly to the
FPGA. It must be used as the reference clock for the PCIe IP (Altera PCIe Hard IP) but can also be used
for other purposes.
2.10.2 Network Clock
The network interface clock is a dual-frequency selectable Silicon Laboratories Si532 device with a
default power-on frequency of 644.53125 MHz, which is used to generate the 10.3125 Gb/s transceiver
line rates required for 10 GbE and 40 GbE. It has an LVDS I/O standard.
The clock frequency can be changed by the FPGA. The alternative frequency is 531.25 MHz, which is the
primary clock when using Fibre Channel.
2.10.3 Memory & General FPGA Clocks
This clock source provides three buffered outputs that drive the quadrants of the FPGA. The clock
outputs are LVDS and have a power on frequency of 266 MHz; this default frequency has been chosen
for simple generation of the 1866 MHz clock which is required for the DDR4 memories.
These clocks are sourced from a Silicon Laboratories Si5338 clock generator. Additional clock
frequencies required within the FPGA can be derived from this clock source.
2.10.4 Configuration Clock
The configuration clock is a 100MHz clock that is used by the FPGA internal configuration fabric but is
also routed to the FPGA for use as a user clock. This is a standard single ended oscillator.
2.10.5 Transceiver Clock
Both QSFP modules are driven from dedicated transceivers. These transceivers require a dedicated
reference clock pin to utilize the cleanest clock source and hence support the highest-speed I/O
standards with tight jitter tolerances.
2.10.6 Clocking Options
Figure 7 : Arria 10 FPGA External Clocking Options
Figure 7 shows the full network clock circuit illustrating the flexible clocking options. The clocking
options are with respect to transceiver clocking. The transceivers in the Arria 10 are all on the same
column so can utilise any of the GXB_REFCLKs as the source for QSFP reference clocks.
Fixed Clocks:
• Clock synth chip (Si5338) driving DDR4 Ref clock and system clock inputs.
Standard Option:
• OSC 0 fitted with a dual frequency Si532 clock to generate GXB_REFCLK0
Optional Clocks:
• OSC 0 fitted with a fully programmable Si570 clock to generate GXB_REFCLK0
• OSC 1 fitted with a fully programmable Si570 clock to generate GXB_REFCLK1
• OSC 1 fitted with a dual frequency Si532 clock to generate GXB_REFCLK1
• Clock synth chip Si5346 generates GXB_REFCLK2, GXB_REFCLK3, GXB_REFCLK4 & GXB_REFCLK5.
The Si5346 outputs may be a function of the following inputs:
◦ Inputs connected to 2 FPGA-generated clocks (typically recovered clocks)
◦ Input connected to 1 external clock source (SMA connector)
◦ Reference clock (always connected), TCXO, connected to a temperature-compensated
50.00 MHz oscillator
2.10.7 External 1 Pulse per Second Clock (1PPS)
The 1PPS signal provides a means of synchronizing the FPGA timing to an external timing
signal. The input is diode protected, with a maximum of 5.5 V allowed. Vih will be approximately 3 V
depending on the FPGA I/O threshold used. To utilize this input the extended front panel is required.
2.11 10-PIN USB HEADER
The 385A-SOC features a 10-pin USB header to provide access to the on-board USB-Blaster II (JTAG chain
access) and the FPGA UART to USB interface.
Figure 8 labels the two USB interfaces accessible to the user; Figure 9 shows the 10-pin on-board USB
connector.
Figure 8: 10-pin USB header
Figure 9 : 10-pin dual USB connector
The SOC accelerator has several external USB connectors: one on the front panel and two on the rear of
the board. The front panel USB connector is multiplexed with the IN USB connector on the rear; plugging
a cable into the front panel will disconnect USB traffic routed through the rear connector. A second OUT
USB connector on the rear of the board allows USB traffic to be routed out of the USB hub so that a
downstream connection can be set up from the front panel (or rear USB IN) of the first board to up to
three downstream boards, as illustrated in Figure 10.
Figure 10 : USB daisy chain connectivity
Connecting multiple devices in this manner allows a single entry point to a system's JTAG chain. This will
be useful in heterogeneous systems where some host OSs may not support the Quartus tool flow.
2.12 ON-BOARD USB-BLASTER II
The SOC Accelerator features an on-board USB-Blaster II chip. It enables user access to the FPGA JTAG
Chain for FPGA & Configuration Flash programming purposes using the Quartus Programmer or for
debug purposes using the Quartus JTAG based utilities.
Quartus uses the built-in USB device driver on Linux to access the USB-Blaster II chip. By default, root is
the only user allowed to use these devices. You must change the permissions on the ports before you
can use the USB-Blaster II to program devices with the Quartus software.
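Intel's documented approach on Linux is a udev rule granting non-root access to devices with the Altera USB vendor ID (09fb). The file name and product IDs below follow Intel's published example and should be checked against your distribution and cable revision:

```
# /etc/udev/rules.d/51-usbblaster.rules
SUBSYSTEM=="usb", ATTR{idVendor}=="09fb", ATTR{idProduct}=="6010", MODE="0666"
SUBSYSTEM=="usb", ATTR{idVendor}=="09fb", ATTR{idProduct}=="6810", MODE="0666"
```

After creating the file, reload the udev rules and reconnect the USB cable for the change to take effect.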
It is expected that the USB interface will be used during debug of the different MIMMAC
implementations.
2.13 UART TO USB INTERFACE
The board also connects some FPGA pins to an on-board UART-to-USB chip, providing the means to create
a simple debug USB UART port. The UART-to-USB chip populated on the board is an FT234XD-R; please
visit FTDI Chip's website to download the part's datasheet4 and instructions on how to install the device
driver for your preferred operating system.
2.14 JTAG UTILITIES
The USB-Blaster II interface provides access to the FPGA JTAG chain and allows the user to reprogram
both the FPGA and the Configuration Flash through the Quartus Programmer tool.
Intel provides several other debug tools which use the JTAG chain.
• SignalTap II Logic Analyzer
• Transceiver Toolkit
• External Memory Interface Toolkit
• In-System Sources and Probes Editor
• Etc.
Please refer to Intel’s documentation for details on how to use these tools. The development of the
OpenCL BSP should remove the requirement for a user to be concerned with the different tools by
providing a software driven interface to the different card features.
4 http://www.ftdichip.com/Support/Documents/DataSheets/ICs/DS_FT234XD.pdf
2.15 I2C DEVICES
Figure 11 : I2C Addressing
An I2C bus connects the low speed peripheral control signals to the system manager for control. Figure
11 shows which devices are connected and can be monitored by the MAX10 device via the I2C bus. This
allows the serial links, clocks, PROM and sensor data to be monitored in the host system. This will be
required to control the system behaviour in respect of power monitoring and temperature control,
which is extremely important for the truck use case described in deliverable D6.1.
2.16 ID PROM
The ID PROM is connected to the I2C bus at address 0x50. This PROM contains useful information about
the device such as serial number, card version, etc. (see Table 3). This PROM can be read from the host
system via the system manager device. It is then possible to identify exactly what is located in the
system. The PROM is read only.
Bytes (decimal) | Contents | Example
0 – 3 | Reserved | Do Not Use
4 – 15 | Serial Number | 7095007
16 – 38 | Order code | P385ASOC-660-11A-10
39 – 56 | Card revision | v0201
57 – 60 | FPGA type |
61 | FPGA fabric speed | 2
62 – 67 | PROM programming date | MM/DD/YYYY
68 – 73 | QSFP28_0 MAC address | 00:0c:d7:00:1f:c7
74 – 79 | QSFP28_1 MAC address | 00:0c:d7:00:1f:c8
80 | FPGA Transceiver Speed | 3
81 – 100 | FPGA Part Number | 10AX115N3F40E2SG
101 – 115 | Reserved | Do Not Use
116 – 127 | TBC | Anything
Table 3: ID PROM Data
2.17 STATUS LEDS
The System Manager reflects its system status on LEDs D11-D18, while the user FPGA can drive LED
D19. These are not planned to be accessible from the OpenCL BSP; however, they could be added as an
external IO channel if deemed necessary.
The LEDs defined in Table 4 are bi-colour red and green; switching both on at the same time gives an
amber colour. These will be useful for determining any faults that may arise.
LED | Colour | Sequence | Description
D11 | Green | Fixed on | No errors, successful power up
D11 | Red | Flashing | Power up failure
D12 | Green | Fixed on | FPGA config completed
D12 | Green | Heartbeat | FPGA unconfigured
D12 | Red | Even mark to space | FPGA thermal cut-out
D12 | Red | Fixed on | FPGA flash config error
D12 | Amber | Fixed on | FPGA flash config failover
D13 | Reserved for future use (RFU)
D14 | Reserved for future use (RFU)
D15 | RFU
D16 | RFU
D17 | RFU
Table 4: System Manager Status LEDs
2.18 USER LEDS
The user has access to two extra LEDs connected to the System Manager, which are accessible to the
user FPGA via the system manager interface IP provided in the Quartus tools. The behaviour of these
User LEDs is given in Table 5. It is unlikely that this feature will be utilised as part of the OpenCL BSP.
LED | Behaviour
D18 | User FPGA LED 0, driven by the User FPGA via Nallatech system manager interface IP
D19 | User FPGA LED 1, driven by the User FPGA via Nallatech system manager interface IP
Table 5: User LEDs
2.19 CARD INITIALISATION & SYSTEM RESET
The SOC Accelerator is a PCIe card and is therefore a slave to a host processor. The principal reset
of this card comes through the PCIe bus from the host.
The mechanism that supports a reliable initialization of the firmware and software running on the SOC is
as follows:
• Card powers up and supplies are sequenced by the System Manager, after which the board comes
out of reset.
• When the power supplies are stable, the “All Power Good LED” comes on, the configuration logic is
released and the FPGA configuration starts.
• After FPGA configuration, PCIe training and reset should occur.
• The release of the PCIe reset by the host also deactivates the global reset to the internal FPGA
design.
• After PCIe enumeration, a soft reset comes from the PCIe core to the rest of the internal FPGA logic
to start the firmware.
2.19.1 CvP Autonomous mode
FPGA configuration from flash takes around a second, which does not satisfy the PCIe protocol
specifications. It is, however, possible to have the PCIe IP Core alone loaded in the FPGA fabric first,
with the PCIe IP up and running in under 100 ms (as the PCIe specification requires); this option is
called CvP autonomous mode and is an option in the Altera Quartus tools for the PCIe Hard IP block.
2.20 CONTROL & TEMPERATURE SENSOR
Figure 12 : Altera temperature sensor IP
The FPGA temperature can be monitored using the internal FPGA temperature sensor available in the
Arria 10 FPGA. Altera supplies a temperature sensor IP core that can be read directly from the FPGA
logic. The value will be read back to the host or embedded ARM controller as a means of tracking the
device temperature whilst running different applications.
2.20.1 FPGA Temperature Alert Procedure
The FPGA’s die temperature is monitored and a thermal event is routed to the System Manager. This
event is set, by default, to trigger when the FPGA die temperature exceeds 105 °C.
If this occurs, the Power Supply Unit (PSU) controller will immediately turn off all power supplies in order
to prevent any permanent damage to the FPGA.
2.21 POWER
The SOC accelerator board is designed to support a power consumption of up to 75W, in line with the
maximum delivered power available from a standard PCIe slot.
The total maximum power required by the OPERA use cases is expected to be well below 75 W. The
standard fan/heatsink COTS solution supports 75 W of cooling with an ambient intake temperature
of up to 35 °C. Under these conditions the FPGA die temperature will be kept below an operating junction
temperature of 85 °C.
2.22 THERMAL
The SOC accelerator can be fitted with a passive or active heatsink depending on the server
environment where the device will be installed. The option with active heatsink is shown in Figure 13.
Figure 13 : SOC Active Heat Sink
Using Autodesk CFD, the thermal characteristics of the cooling technology can be modelled prior to
manufacture. These models are always performed for the expected worst-case scenario.
Figure 14 : Air flow drawn through active heat sink
Figure 15 : Autodesk CFD Arria 10 Stable Board Temperature (50 Watt design)
3 ARRIA 10 HARD PROCESSOR SYSTEM5
The Arria 10 SOC device contains a dual-core ARM Cortex-A9 processor. This chapter describes some of
the features of this hard processor system (HPS) and its relevance to the BSP design work.
3.1 ALTERA GX660 SOC HARD PROCESSOR SYSTEM (HPS)
Figure 16 : Altera ARM Cortex A9 Hard Processor System (HPS)
Figure 16 illustrates the main components of the Arria 10 SoC Hard Processor System. The Arria
10 HPS has many features; those key to the OPERA project are listed below.
• CPU frequency 1.2 GHz, with 1.5 GHz via overdrive
• Runs 32-bit ARM instructions
• ARM NEON™ media processing engine
• Single and double precision floating-point unit
• Hard memory controller with support for DDR4 and DDR3
• FPGA-to-HPS bridge: allows IP bus masters in the logic core to access HPS bus slaves
• HPS-to-FPGA bridge: allows HPS bus masters to access bus slaves in the core fabric
5 https://www.altera.com/products/soc/portfolio/arria-10-soc/arria10-soc-hps.html
3.2 HPS OVERVIEW
Figure 17 : Altera SOC Device Block Diagram
Figure 17 illustrates the different components of the HPS and FPGA. The HPS portion of the device
contains all ARM-related interfaces, including the flash controls for configuration, clock management
and memory interconnects. The FPGA portion has access to the general FPGA fabric, FPGA IO and fixed
IP components such as PCIe. There is a small amount of shared IO between the HPS and FPGA portions of
the device, for when it makes sense to share input stimulus.
Communication between the HPS and FPGA can be done in the following ways:
• Shared External Memory: It is possible to pass data via shared DDR memory. Here both the HPS and
FPGA have access to a shared DDR memory bank. This allows the ARM to buffer large amounts of
data ready for processing in the FPGA. It can then continue to run in parallel whilst the FPGA
processes the contents of the DDR. In the OpenCL tool flow this memory appears as a global
memory.
• Shared Internal Memory: It is also possible to create an internal smaller shared memory in the FPGA
fabric that can be accessed from the HPS and FPGA. This memory uses the M20K memory blocks in
the FPGA fabric.
The different interfaces are accessed by the ARM processor via unique physical addresses. The following
is an example memory address topology that includes an external DDR memory (the real memory map
is still to be decided).
Figure 18 : Boot Memory Locations
The above figure illustrates a possible arrangement of the boot address locations. The SDRAM region is
accessible to the ARM and the FPGA fabric. The HPS-to-FPGA region has a reserved location beyond the
address 0xC0000000.
4 SOC ACCELERATOR CONFIGURATION
4.1 FPGA CONFIGURATION OVERVIEW
This section outlines the different configuration options and the setup that will be used for the
OPERA project.
The Arria 10 FPGA is connected to a 2 Gb serial flash chip (EPCQL2048) used for FPGA configuration
at board power-on. The FPGA uses the Active Serial x4 configuration mode to obtain its
configuration data from the flash at power-on, with MSEL[2..0] set to 010 for Fast Power-On Reset
(POR) Delay.
The accelerator also has an on-board Altera USB-Blaster II which can be accessed via the side USB
connector. This provides JTAG access to program the Configuration Serial Flash and the FPGA.
The following sections feature extracts from the “Arria 10 Hard Processor System Technical Reference
Manual”, shown in italics. For more detail pertaining to the HPS, refer to that document.
Figure 19 : FPGA Configuration Block Diagram
Figure 19 is a block diagram depicting the connectivity between different JTAG components on the
accelerator board. Configuration is controlled via a MAX10 device.
4.2 CONFIGURATION WITH BOOT SOURCES
Figure 20 : SOC Configuration with boot Sources
The FPGA can be configured from an external source and the ARM then booted from another external
source.
4.3 CONFIGURATION WITH SELF HPS BOOT
Figure 21 : HPS boots from FPGA
“In Figure 21, the FPGA is configured first through one of its non-HPS configuration sources, this will be
the QSPI interface for the Opera accelerator. The Configuration Subsystem (CSS) block configures the
FPGA fabric as well as the FPGA I/O, shared I/O and hard memory controller I/O. The HPS executes the
second-stage boot loader from the FPGA. In this situation, the HPS should not be released from reset
until the FPGA is powered on and programmed. Once the FPGA is in user mode and the HPS has been
released from reset, the boot ROM code begins executing. The HPS boot ROM code executes the second-
stage boot loader from the FPGA fabric over the HPS-to-FPGA bridge.” 6
This is the preferred form of configuration for the OPERA setup.
4.3.1 QSPI Configuration
The QSPI interface runs at 100 MHz and is connected to the on-board flash memory. Booting from the
flash is the fastest way to configure the device. However, loading the flash with a new image is
extremely slow and can take several minutes. Initially there were no plans to support configuration via
DMA, hence any application requiring multiple configurations would either expect a long delay between
configurations or limit the number of reconfigurations to the maximum number of images that can be
separately stored in flash memory.
This approach was changed during the lifetime of the OPERA project as it became obvious this
would be a serious limitation to FPGA acceleration where multiple designs were required per
application (see 9.1).
4.3.2 Configuration via USB
The SOC accelerator features a System Manager which hosts USB-Blaster functionality to access the
FPGA JTAG chain. By connecting the USB connector to the host’s motherboard with a 5-pin USB cable,
the user can access the JTAG chain with Quartus Programmer. Using this method, the FPGA can be
reconfigured directly or the on-board 2Gb configuration flash can be reprogrammed so the FPGA is
configured with the new design on the next power cycle. The JTAG Chain also provides debug access to
the board and enables the use of various JTAG based debug utilities like SignalTap.
4.4 ARM SIDE CONFIGURATION
There are two options for the ARM side configuration of the FPGA fabric. Either the ARM can boot first
and configure the FPGA or the FPGA can boot first and instruct the ARM processor to start. Either
method is valid and will depend upon what implementation best fits the target use case. For OPERA it is
assumed that the FPGA will configure first from external sources and the ARM will be initiated once
configuration is complete.
6 https://altera.com/documentation/sfo1410070178831.html
Figure 22 : ARM Base Configuration
“In Figure 22 the HPS boots first through one of its non-FPGA fabric boot sources. If the hard memory
controller or shared I/O are required by the HPS during booting then you can either execute a full FPGA
configuration flow or an early I/O release configuration flow. The FPGA must be in a power-on state for
the HPS to reset properly and for the second stage boot loader to initiate configuration through the FPGA
Manager. The software executing on the HPS obtains the FPGA configuration image from any of its flash
memory devices.”7
“The HPS-to-FPGA and lightweight HPS-to-FPGA bridges are both mastered by the level 3 (L3)
interconnect. The FPGA-to-HPS bridge masters the L3 interconnect. This arrangement allows any master
implemented in the FPGA fabric to access most slaves in the HPS. For example, the FPGA-to-HPS bridge
can access the accelerator coherency port (ACP) of the MPU subsystem to perform cache-coherent
accesses to the SDRAM subsystem.”8
7 https://altera.com/documentation/sfo1410070178831.html
8 https://altera.com/documentation/sfo1410070178831.html
5 SOC ACCELERATOR BSP
5.1 OPENCL TOOL FLOW
The Intel FPGA SDK for OpenCL allows the implementation of FPGA logic using OpenCL C, an ANSI C-
based language with additional OpenCL constructs. The OpenCL SDK allows code to be emulated in a
software flow before generating HDL code to be compiled through the Quartus FPGA tool chain. Once
compiled, the FPGA can be programmed with the generated binary through the Khronos OpenCL API.
The Intel FPGA SDK compiler takes a user's OpenCL code and generates the HDL code that represents the
kernel code on the target FPGA accelerator. The IP for the target accelerator is described by the SOC
accelerator board support package. Once the HDL is created, the Intel Quartus tools compile the HDL to
create a binary. This compilation process, known as place and route, can take many hours to complete.
Figure 23 illustrates the Intel FPGA OpenCL tool flow compiling a “hello world” example.
Figure 23 : Intel FPGA OpenCL tool flow
The OpenCL tool flow comes with several tools for control and debug. The BSP must also support other
features of the Intel FPGA OpenCL SDK such as board diagnosis and configuration options.
5.2 BSP OVERVIEW
This section describes the work undertaken to create a programming environment that interacts with
x86, ARM and FPGA fabric. The following sections highlight the areas of research that have been
undertaken to create a Board Support Package (BSP) suitable for the OPERA use cases.
As discussed in deliverable D6.1, heterogeneity brings with it the added complexity of multiple
operating systems and device design processes. The seamless integration of these different design
processes within a single environment will be crucial to the success of the MICMAC use case. The goal is
to evaluate what design processes work best for heterogeneous systems and what does not. In order to
evaluate the design process it is necessary to develop a new BSP.
The OPERA hardware is unlike that targeted by any BSP previously developed by Nallatech and Intel. The
presence of the ARM processor within the chip provides unique challenges and problems. The main
challenge is how to handle the presence of two master interfaces using an OpenCL tool flow that
expects a single master/slave relationship.
The Intel FPGA OpenCL tool flow provides a set of utilities and a compiler for targeting FPGA
accelerators using the OpenCL work flow. In order for the tools to be able to target an FPGA accelerator,
the accelerator vendor must create what is known as a board support package (BSP). This BSP is a non-
trivial piece of firmware that connects required Intel FPGA IP with bespoke vendor logic to give the
OpenCL tool flow access to the accelerator and its unique capabilities.
5.3 OPERA BSP REQUIREMENTS
For the OPERA SOC card the BSP must support access to the external IO interconnect, the external DDR
memories, connectivity to the ARM processor on device and also a route back to the PCIe attached
processor. This requires the creation and inclusion of different IP blocks.
Figure 24 : BSP Components
Figure 24 illustrates the different components of the SOC BSP device. The ARM and the OpenCL kernel
share data via a shared DDR interface. Data is transferred on and off the device via the serial links and
the host channel interface (PCIe).
There are various firmware and software components required by the Intel FPGA OpenCL SDK tool flow.
These are listed in the following sections.
5.4 BSP REQUIRED DELIVERABLES
5.4.1 Base aocx file
The Base aocx is a precompiled and preplaced FPGA design containing all the firmware required by the
BSP, in order to guarantee a high performance for all kernel designs. To create a BSP that is efficient, in
terms of resource and clock frequency, the location of the different IP blocks must be carefully
considered. Keeping the size of the IP blocks to a minimum is crucial to maximise resources available for
processing, whilst the efficiency of the routing between the different components is crucial for
maintaining a high clock frequency.
The base is loaded by the Intel FPGA tool flow and used as starting point for the generation of the FPGA
OpenCL application.
The following table lists the key components of the base.aocx file and gives a brief description of their
function.
BSP Component Description
Arria10 HPS: The Arria 10 Hard Processor System configures and connects the external interfaces of the
hard processor.
Clock Cross Kernel Memory 1: A memory-mapped clock-crossing bridge that allows data to be
transferred to/from the kernel clock domain into the DDR memory's clock domain.
EMIF_a10_hps: The Altera A10 External Memory Interface (EMIF) DDR interface. Provides an interface
to connect a bank of external memory to the BSP. Configured for a bank of DDR4 with a 32-bit data bus
and a 15-bit address bus.
Kernel Interface: The OpenCL Kernel Interface allows the host interface to access and control the
OpenCL kernel.
Kernel Clock Generator: The OpenCL Kernel Clock Generator generates a clock output and a clock 2x
output for use by the OpenCL kernels. An Avalon-MM slave interface allows reprogramming of the
phase-locked loops (PLLs) and reports kernel clock status information.
MM interconnect 0: The largest Avalon Memory Map interconnect, in this case connected between the
address span extender component and the DDR memory. Avalon MM interconnects are inserted into a
system to connect Avalon master and slave components. These can consume a large amount of
resource, especially if the data and address buses are wide.
Other components in the BSP: Pipeline bridges, clock crossing bridges, reset controllers, Partial
Reconfiguration, ACL Version ID register, Arria 10 temperature sensor, address span extenders and
clock controllers.
Table 6 : SOC accelerator BSP components
Figure 25 is a top-level diagram that illustrates how these components interact.
Figure 25 : BSP component interaction
These components are then placed on the FPGA in a manner that minimises the impact on the FPGA
resource and routing. Intel provides a reference design for a pure SOC system with no external IO
connections with the exception of a DDR interface. This reference design will form the basis of the fixed
logic part (base.aocx) of the Opera BSP. The two images in Figure 26 illustrate the logic placement of
the reference design. The smaller image has the reference design logic highlighted in red with the
OpenCL kernel logic coloured blue. The larger image is a zoomed-in section of the reference BSP with
the key components highlighted in different colours.
Figure 26 : SOC reference IP placement
Intel also provides an example design for using host channels. This reference design must be integrated
with the SOC reference design.
Figure 27 : Host channel reference design placement (Nallatech 385A device)
Figure 27 shows the host channel code highlighted in red and the PCIe interface code in light blue.
Drawing on the experience gained during the creation of Nallatech's 385A BSP, a new BSP will be
created incorporating the required SOC features.
Figure 28 : Nallatech 385A base.aocx
The SOC accelerator BSP will be similar to the Nallatech 385A BSP; however, the kernel control logic will
be placed close to the SOC device, which will simplify the PCIe interface component. Figure 28 shows the
385A BSP with the different BSP components labelled.
5.4.2 Board Specification File
This file is an XML description of the different attributes of the OpenCL features. It includes descriptions
of the memories and external IO connections to instruct the Intel FPGA tool flow how to wire together
the different components with the user kernel design.
5.4.3 Kernel Driver
In order to talk to the device via PCIe the host system requires a driver to facilitate PCIe communications
via the OpenCL API. This allows the FPGA to be discovered by the OpenCL SDK.
Figure 29 depicts the four layers of the Intel Software Development Kit (SDK) for OpenCL (AOCL)
software architecture: runtime, hardware abstraction layer (HAL), memory mapped device (MMD) layer,
and kernel mode driver.
Figure 29 : Altera SDK for OpenCL software architecture
5.5 PLATFORM SUPPORT
The host system must be running one of the following supported Linux target platforms:
• Red Hat Enterprise 64-bit Linux (RHEL) version 6 on the x86-64 architecture
• CentOS 64-bit Linux version 6 on the x86_64 architecture (Nallatech also tested the BSP package
against this platform)
• Ubuntu will be supported for the MICMAC environment
The OpenCL BSP provides a memory-mapped device (MMD) layer necessary for communication with the
accelerator board and the lower level kernel mode driver. All other upper software layers are provided
by the Intel FPGA SDK for OpenCL installation.
5.6 SERIAL LINKS
It is proposed that the FPGA act as a common high speed interconnect between IBM and Intel systems.
For this to be the case a serial interconnect must be created that is fast and integrated within the
OpenCL tool flow. The serial links will also provide the ability to partition algorithms over multiple
FPGA devices if required. See section 7 for more details.
5.7 ACHIEVING BEST FMAX OF BSP
To achieve the best kernel performance the BSP logic must be carefully placed on the device. This
section describes some of the work undertaken to achieve the best kernel clock performance.
5.7.1 Routing Islands
Simply placing the required BSP IP components on the device is not enough to ensure good performance
for the BSP. Problems occur when the user kernel logic is added to the system. Poor placement of the
IP components prevents the efficient routing of the kernel logic which results in a low maximum
frequency (FMax) of the application. This can be somewhat alleviated by creating logic islands for the
pipelined stages of key BSP signals. These islands are placed in locations to be as least disruptive as
possible. Figure 28 illustrates how these islands were placed for the Nallatech 385A device.
5.7.2 Multiple Seed Sweeps
Once a BSP has been placed the design is compiled to create the BSP fixed logic aocx (See 5.4.1). The
performance of this fixed logic is dependent upon what is effectively a set of random choices made by
the place and route tools. The choice of starting conditions by the tools determines the final placement
and routing and therefore the FMax of the design. By running multiple seed sweeps, where the initial
conditions are stochastically modified, a set of designs with varying FMax values are created. The one
with the highest FMax is then used as the base aocx file.
5.8 LIMITING IMPACT ON RESOURCE
The resource required for the BSP base aocx is resource that will not be available for use within a user's
kernel. Therefore it is important to keep the size of the IP components to a minimum. This can be done
by limiting the functionality and by compressing the areas within which the IP components fit.
Reducing the functionality limits what can be done by the device, hence it is often the case that multiple
BSPs are created with different features to ensure logic is not used when it’s not required. For the Opera
BSP the requirements are fixed and therefore only one BSP has been designed.
When fixing the locations of the IP blocks, care must be taken to ensure the kernel has access to key
logic elements such as DSP and M20K components. These are aligned in strips vertically down the device
with standard FPGA logic elements between. Therefore fixed logic areas are placed, where possible, in a
manner that does not prevent access. The priority is on preserving DSP blocks as these form the
foundation of any algorithm functions.
The place and route tools will place an IP component in the most efficient layout possible. This can mean
that an IP component is spread out more than is required. Any logic areas used by the IP components
will not be accessible to the user's kernel, even if the FPGA logic in these areas is empty. Hence, it
makes sense to force the BSP IP to be constrained to as small an area as possible. Constraining IP to a
fixed block can have a negative effect on clock performance and cause the IP to fail timing (not function
correctly). Therefore, it usually requires several attempts to achieve the optimal placement of the logic.
5.9 OPENCL INTERFACES
This section explains how the different IO elements are accessed from the OpenCL kernel code, i.e. how
the MICMAC code will access the SOC hardware features.
5.9.1 Accessing the serial link interface from OpenCL
The Intel OpenCL SDK allows each external IO interface to be given a unique name for identification.
Channels are declared in the user's OpenCL code prior to being used and are attached to a physical
location by using the __attribute__ keyword and the appropriate name. The channel names are defined
in the board specification file. The following code declares the serial IO interfaces as OpenCL channels
that can be accessed using the OpenCL SDK channel methods.
channel ulong4 sch_in0 __attribute__((depth(4))) __attribute__((io("kernel_input_ch0")));
channel ulong4 sch_out0 __attribute__((depth(4))) __attribute__((io("kernel_output_ch0")));
channel ulong4 sch_in1 __attribute__((depth(4))) __attribute__((io("kernel_input_ch1")));
channel ulong4 sch_out1 __attribute__((depth(4))) __attribute__((io("kernel_output_ch1")));
Reading and writing to channels is performed using the standard read_channel_altera() and
write_channel_altera() function calls provided as part of the Intel OpenCL SDK.
The width of the serial channels is hardcoded and set to 256 bits or ulong4. As each serial channel is
capable of ~5 GBytes/Sec the kernel clock frequency needs to be greater than 156MHz to maximise the
throughput of the link.
The serial channels are blocking, meaning any read from a serial link will pause the kernel until data is
available and any write to a channel will pause a kernel if the output is full. Therefore synchronisation
between multiple accelerators is data driven.
5.9.2 Example loopback design for host channels
Accessing the host channels from the OpenCL tool flow is done using the channels flagged with the
appropriate host IO attributes. The following code is an example loopback design that reads data from
the host and immediately writes it back. For more detail see section 6.1.
channel ulong4 host_in __attribute__((depth(0))) __attribute__((io("host_to_dev")));
channel ulong4 device_out __attribute__((depth(0))) __attribute__((io("dev_to_host")));

__kernel void loopback(ulong length, uint nostop)
{
    ulong counter = 0;
    ulong4 data;
    while (nostop || (counter < length)) {
        data = read_channel_altera(host_in);     /* blocking read from host    */
        write_channel_altera(device_out, data);  /* write straight back        */
        counter += 32;                           /* one ulong4 word = 32 bytes */
    }
}
5.9.3 DDR4 Memory Access
The DDR memory is accessed by declaring global memory in the OpenCL kernel code. A particular bank
of DDR memory can be targeted by setting the appropriate attribute on the global memory parameter
to a kernel.
__kernel void foo(__global __attribute__((buffer_location("DDR_0"))) int *x,
__global __attribute__((buffer_location("DDR_1"))) int *y)
5.10 SUPPORT FOR FUTURE INTEL TOOL VERSIONS
BSP development on Arria10 has proved to be challenging, largely due to immature and changing
reference BSPs, changes in the Quartus tool chain and Partial Reconfiguration issues. This has been
compounded by the lack of forward migration between Quartus versions, meaning that a BSP developed
in one version of Quartus is not usable in a different version. For that reason Opera
development will be locked to version 16.1 of the Intel FPGA tool flow, unless there is a good technical
reason to migrate.
5.11 MICMAC APPLICATION APPROACH
This section describes how the MICMAC application will interact with the OpenCL tool flow.
The majority of the MICMAC application will reside on the Moonshot host with selected elements of
code ported to the SOC accelerator. This requires the creation of:
• OpenCL kernel(s) for FPGA acceleration of key elements of the MICMAC code
• ARM application code for control of the OpenCL kernel and to perform those MICMAC
operations that cannot be accelerated on the FPGA, where it would be inefficient to move
data back to the host application.
• Host side channel interface code for configuration and the passing of control data to and from
the OpenCL kernel via MMD interface.
The MICMAC acceleration can make use of the inter device serial links to expand the FPGA acceleration
across multiple devices if required.
Figure 30 : OpenCL MICMAC software components
5.12 HOST KERNEL INTERFACE (EXAMPLE 1)
The simplest interaction between the x86 host and the FPGA is to stream data directly into the FPGA
kernel. Synchronisation is automatically achieved by the blocking nature of the interface
kernels. In this scenario the ARM does not read or write any data to the system and simply controls the
execution of the OpenCL kernels.
Figure 31 : Basic OpenCL streaming interface sequence diagram
Step Description of Figure 31
1 The ARM enqueues the command and processing OpenCL kernels. The command kernel is instructed
as to how many commands it expects to process.
2 The x86 host writes data to the FPGA host channels using the MMD interface
3 The FPGA kernel runs reading data from the host channel. The kernel can use global memory to store
temporary values if necessary
4 The kernel writes results back to the host using the host channels.
5 The ARM waits for the kernels to complete before moving onto the next processing task.
5.13 HOST KERNEL INTERFACE (EXAMPLE 2)
This section describes a possible sequence for synchronising host, ARM and FPGA code to allow the host
to read and write to global memory. The following setup assumes there is an OpenCL command kernel
which handles data received from the host. This command kernel reads and writes host data from the
FPGA global memory and synchronises host commands with the ARM via the OpenCL kernels. The code
this description is based upon is listed in section 9.
Figure 32 : Command kernel example sequence diagram
Step Description of Figure 32
1 The ARM enqueues the command and processing OpenCL kernels. The command kernel is instructed
as to how many commands it expects to process.
2 The command kernel waits for data from the host channel interface. This data is in the form of a
header and data payload. The header instructs the kernel whether the data is a read or write to global
memory or an acknowledgement.
3 The first command received is a write to global memory. The command is therefore followed by the
data payload.
4 The next command in the command queue is an acknowledgement. This is used to instruct the
processing kernel that the required data has been written to global memory. This can be done using a
simple blocking OpenCL channel connected between the command and processing kernels.
5 The processing kernel has been waiting for the acknowledgement in step 4 and will now start
processing the data written to the global memory.
6 Once the processing kernel is complete it writes an acknowledgement to the command kernel.
7 The command kernel has been waiting for an acknowledgement; it reads the results from global
memory and writes the data to the host channel.
8 The host has been waiting for the data and can now read the results for further processing.
9 The ARM has been waiting using the OpenCL clWaitForEvents method and completes once all the
expected commands have been processed. The ARM can then move onto the next task.
6 HOST (X86) PLATFORM
6.1 HOST CHANNELS
As the SOC ARM processor is the master in this system, the host x86 processor will use host channels to
read and write data to the OpenCL kernels on the FPGA. Host channels allow the x86 host to write
directly to an IO channel in the BSP rather than to global memory as in a typical OpenCL system.
Avoiding global memory writes from the host removes a conflict that would otherwise arise between
the ARM and x86 both trying to access the DDR memory.
Host channels are implemented through the MMD layer, as the standard OpenCL API does not support
this kind of interface. The MMD layer is a thin software layer for communicating directly with the board.
It is used by any utility programs that support OpenCL-enabled hardware, e.g. board diagnosis. Here it
has been expanded to include the following API calls for host communications.
MMD API Command Description
int aocl_mmd_hostchannel_create: Opens a channel between host and kernel.
int aocl_mmd_hostchannel_destroy: Closes a channel between host and kernel.
void *aocl_mmd_hostchannel_get_buffer: Provides the host with a pointer to a buffer for writing to or
reading from the kernel. If the direction of the channel was 1 during create, the pointer returned is a
buffer for writing data into the kernel. If the direction was 0, the pointer returned is a buffer for
reading data out of the kernel.
size_t aocl_mmd_hostchannel_ack_buffer: Acknowledges to the channel that the user has written or
read data from it. This makes the data, or additional buffer space, available for the kernel to read or
for further host transfers.
Intel have provided a reference design for the Arria10 SOC development board (See section 8), which
will form the basis of the host channel interface for the Opera accelerator.
The host interface is not able to write directly to the embedded ARM processor and can only present
data to and read data from OpenCL kernels. If data is to be passed from the host to the ARM, the
OpenCL kernel must read the data from the host channels and write it to the attached DDR memory,
which is accessible to both the OpenCL kernel and the ARM.
6.2 HOST TO ARM PROPRIETARY CONTROL INTERFACE
As part of the Opera project Nallatech has developed a host to ARM communication channel to allow
data and control information to be passed between an application on the ARM and an application
running on the Moonshot x86 system. This is done by implementing an Ethernet-over-PCIe interface. A
driver on the Moonshot host and an equivalent driver on the ARM open an Ethernet port that allows any
communication that would normally be possible via a typical Ethernet connection (e.g. ssh, scp, nfs, etc.).
This interface is not fast relative to the OpenCL PCIe interface and should only be used for control and
monitoring. For faster communications the host-channel interface, described in section 6.1, should be
used.
Figure 33 : Host to SOC interfaces
A small memory (262 Kbytes) in the FPGA logic provides a buffer that can be accessed from both the
host and the ARM. This can be used for passing useful information from the host to the ARM, such as
OpenCL attributes. The depth of this buffer can be reduced or increased depending upon the demands
of the MICMAC application. Reducing the size will release extra FPGA memory resource for the OpenCL
kernel code.
Figure 33 illustrates the two available host to SOC interfaces.
6.2.1 X86 Host driver and API
The host side driver is used to get a handle to the host-attached SOC accelerator card. This handle can
then be used to perform some simple control instructions via a set of API commands. Some of the details
are still to be determined and the API functions may change in the future.
Host API Command Description
NALLA_HANDLE NALLA_385Asoc_Open(uint32_t cardNumber, uint32_t flags); Gets a handle to the attached accelerator
void NALLA_385Asoc_Close(NALLA_HANDLE cardHandle);
Releases the handle opened by
NALLA_385Asoc_Open
uint32_t NALLA_385Asoc_Status(NALLA_HANDLE cardHandle, uint32_t command, void* status);
Retrieve the value of the various status registers.
These are still to be decided, but will include the
firmware version, timestamp, optical link status,
reset, device ID, etc...
size_t NALLA_385Asoc_Write(NALLA_HANDLE cardHandle, void* data, uint32_t offset, uint32_t lengthBytes, uint32_t flags);
Write a block of data to the accelerator at an offset.
size_t NALLA_385Asoc_Read(NALLA_HANDLE cardHandle, void* data, uint32_t offset, uint32_t lengthBytes, uint32_t flags);
Read a block of data from the accelerator at an offset.
6.2.1.1 SOC ARM driver and API
The ARM processor has an equivalent driver API for interacting with SOC to host communications. The
API is preliminary and may be subject to change.
This driver gives the ARM the ability to access the shared memory buffer.
6.2.2 Multiple cards
In order to differentiate between different SOC accelerators in the same system, it is envisioned that
each host/card pair has its own subnet address, e.g.:
Card 1: 192.168.100.2 (PC Host 192.168.100.1)
Card 2: 192.168.101.2 (PC Host 192.168.101.1)
Card 3: 192.168.102.2 (PC Host 192.168.102.1)
7 SERIAL INTERCONNECT DETAILS
7.1 OVERVIEW
In order to provide scalability and communication between different hardware vendors, the
development of a serial interconnect was required. This serial interconnect acts as a common interface,
via the FPGA, for IBM POWER and x86 systems.
Figure 34: Using the FPGA as a common interconnect
7.2 SERIAL CHANNEL IP DETAILS
Each serial channel within a BSP consists of an input and output external I/O channel. The BSP serial
channel IP has the following features.
Feature Description
Configuration 4-lanes, full duplex, with flow control
Encoding 64B/66B
Forward Error Correction KR-FEC
Line rate (each lane) 10.3125 Gbits/sec
Channel Latency ~390ns
User Interface, transmit 256 Bit Avalon Streaming
User Interface, receive 256 Bit Avalon Streaming
Maximum transfer bandwidth per channel per direction 39.6875 Gbits/sec
Table 7 : Serial link features
Figure 35 : Block diagram for single channel on 385A SOC (QSFP+ Tx/Rx over 40G Ethernet fibre into a
x4 serial channel interface in the OpenCL BSP; 256-bit Avalon-ST kernel channels into the OpenCL
user/kernel domain at kernel clocks below 400MHz; CSR accessed via Avalon-MM; 644.53125MHz
reference and 100MHz clocks)
Figure 35 illustrates how the serial channel IP is used in the BSP. The SOC accelerator has two QSFP+
module sites directly connected to the High Speed Serial ports of the FPGA. The BSP contains all the
relevant IP to enable external Altera I/O channel interfaces into an OpenCL Kernel.
7.3 SERIAL CHANNEL DEBUG
A simple register interface in the serial channel IP is connected to the PCIe BAR to aid larger system
debug.
The following registers are defined in the PCIe BAR memory map:
SOC BSP: serial channel CSR registers are in BAR4
• serial channel 0 CSR offset : 0x20000
• serial channel 1 CSR offset : 0x20040
7.3.1 Control and Status Registers
Registers are defined as follows:
Address Offset | Host Access | Default Value | Description
0x0 Rd 0x00000000
0x0 Rd 0x00000000
Serial Channel Status
[0] Kernel Stream Sink Ready
[3:1] Not Used (returns "000")
[7:4] Tx PHY Ready
[11:8] Tx PHY Calibration Complete (not tx_cal_busy)
[12] PLL Locked
[15:13] Not Used (returns "000")
[16] All Rx Lanes Deskewed
[19:17] Not Used (returns "000")
[23:20] RX Lane Aligned
[27:24] Rx PHY Ready
[31:28] Rx PHY Calibration Complete
0x1 Rd/Wr 0x00000000
Serial Channel Control
[0] Reset the Serial Channels Receiver (always returns '0')
[3:1] Not Used (returns "000")
[4] Reset the Serial Channels Transmitter (always returns '0')
[7:5] Not Used (returns "000")
[31:8] Not Used (returns 0x000000)
0x2 Rd 0x00000FFF
Kernel Rx Ready Performance Accumulator
[11:00] rx_rdy_perf_acc
[31:12] Not used
0x3 Rd 0x00000FFF
Kernel Tx Valid Performance Accumulator
[11:00] tx_thpt_perf_acc
[31:12] Not used
0x4 Rd/Wr 0x00000000
Performance Control
[0] Reset Rx Performance Reg (always returns '0')
[3:1] Not Used (returns "000")
[4] Reset Tx Performance Reg (always returns '0')
[7:5] Not Used (returns "000")
[31:8] Not Used (returns 0x000000)
Table 8 : Serial channel register addresses
7.3.2 Serial Channel Status Register
Under normal operating conditions this register reads 0xFFF11FF1. If the input kernel is able to supply
data into the channel faster than data can be read from the channel then the Kernel Sink Stream Ready
signal will de-assert to apply back pressure to the Tx Kernel. In this case the register reads 0xFFF11FF0.
When QSFP+ modules are not fitted (in the case of the 385A) then the register reads 0xF0001FF0.
7.3.3 Serial Channel Control Register
Asserting a reset to the Serial Channel's Transmitter will reset the Tx interface and flush any data in the
Tx FIFO. The channel's receiver will lose lane alignment and this will lead to a reset of its Rx interface.
Asserting a reset to the Serial Channel's Receiver will reset the Rx interface and flush any data in the Rx
FIFO.
7.3.4 Kernel Rx Ready Performance Accumulator Register
This gives a continuous, short term measure of the ratio of high to low on the kernel_stream_src_ready
signal. The actual ratio is calculated by the host using the formula:
Kernel Rx Performance Ratio = rx_rdy_perf_acc/4095
This ratio can be used to determine how effectively a kernel is servicing the stream. A good kernel will
keep the 'ready' signal high all the time (ratio = 1.0).
7.3.5 Kernel Tx Throughput Performance Accumulator Register
This gives a continuous, short term measure of the ratio of data transfers through the
kernel_stream_snk port with respect to the kernel clock. It is calculated by the host using the formula:
Kernel Tx Throughput Ratio = tx_thpt_perf_acc/4095
This ratio can be used to determine how effectively a kernel is providing data for transmission. For a
steady stream of data, a good kernel will get close to the ratio of Channel Clock frequency (156.25MHz)
to Kernel Clock frequency.
7.3.6 Performance Control Register
Resets the Performance Accumulator Registers.
7.3.7 MMD Support Serial Channel CSR Functions
The current Nallatech MMD implementation contains helper functions for the serial channel CSR
registers. The functions are defined in aocl_mmd.h, provided with the BSP.
8 INITIAL DESIGN EVALUATION PLATFORM
8.1 OVERVIEW
The initial design work for the SOC Accelerator was performed using an Arria 10 development kit from
Altera, purchased for the OPERA project. This development kit has the targeted FPGA device plus many
peripherals attached, used for testing the developed firmware and for clarifying the software stack for
controlling the device.
Figure 36 : Arria10 SOC Development Kit
Firmware development work / Description
Host to ARM communication interface: Development of the Ethernet over PCIe interface. See section 6.1.
HPS boot and FPGA configuration synchronisation: Development of the software stack for controlling the boot of the ARM device and understanding the synchronisation protocol with the configuration of the FPGA fabric.
HPS to FPGA fabric interface: Development of firmware to interface the HPS ARM processor with the FPGA external DDR memory interfaces, giving the ARM access to the shared DDR memory.
Clock BIST prototype run for x86 and HPS: Development of BIST software and firmware in preparation for the delivery of the SOC accelerator prototype cards.
PCIe interface prototyping: By attaching a Samtec FMC cable via the development kit's FMC connector, it is possible to test the PCIe interface logic without a standard PCIe connector.
HPS reset/recovery: Development of firmware and the software stack to facilitate HPS reset and recovery.
UART Interface: Development of a UART (Universal Asynchronous Receiver/Transmitter) controller connected directly to the ARM processors.
HPS Clocking: Verifying the correct clock configuration for the HPS system.
Table 9 : Arria10 SOC development kit hardware development tasks
Figure 37 : Arria 10 Development Kit Block Diagram
As can be seen, the Arria10 development board has many of the features critical to the development of
the SOC accelerator. The firmware and basic software stack developed here could then be used to test
the initial development of the BSP until the prototype cards became available. The serial link interfaces
cannot be tested on the development board.
Table 10 lists the BSP tasks performed using the Arria10 development board.
BSP development work / Description
ARM to DDR4 shared memory: The DDR4 memory accessed via the HPS can be shared with the FPGA fabric. This is a verification of Altera IP.
PCIe host channels: Using the Samtec FMC connector, the PCIe host channels can be functionally tested. The host channels were also tested and verified on a Nallatech 510T device to ascertain the potential performance of this interface.
Kernel boot and control: Testing the software stack and kernel interrupt system for controlling OpenCL kernels from the ARM processor.
Table 10 : BSP development kit work
9 BSP VERSION 2 (FINAL)
9.1 MODIFICATIONS REQUIRED
The initial board support package created for OPERA used the embedded ARM processor as the master
controller. This was to facilitate the offload of work between the ARM and the FPGA. However, as the
project progressed it was clear that the MICMAC software to be accelerated on the FPGA (see D6.4)
could not be partitioned to benefit from the embedded ARM on the device. Also, using the ARM as the
host required a different programming approach, as described in the Intel FPGA OpenCL
documentation, which increased design verification time. It also relied on host channel support in the
Intel tools, which did not become available in time as expected. This became more significant with the
addition of the CNN offload late in the project's lifetime, as the CNN offload relied heavily on this host
interface.
The host channel interface also prevented dynamic reconfiguration of the FPGA device, restricting each
application to a single FPGA image. For the MICMAC code this was not practical, as multiple
applications are run sequentially.
Therefore, the decision was taken to create a new BSP where the host CPU is the master as with a
traditional FPGA OpenCL accelerator card. Here, the ARM is no longer used for code acceleration, but
for running kernel-level performance measurements. This diagnostic code runs in parallel to the
offloaded FPGA acceleration, providing high fidelity power information.
Sections 5.12, 5.13 and 6 are no longer valid for this version of the BSP.
9.2 UPDATED BSP
Figure 38 : BSP version 2 with CPU as master
Version 2 of the BSP is illustrated in Figure 38. The power monitoring software is always present but
does not need to be running in order to use the FPGA.
Figure 39 : BSP Version 2 Floor plan
Figure 39 shows the floor plan of the FPGA with the CPU as host. The design has been made as
efficient as possible so that the FPGA can deliver the best possible performance.
Resource Total available BSP usage
ALMs 251680 46970 (18.7%)
FFs 1006720 187880 (18.7%)
RAMs 2131 418 (19.6%)
DSPs 1687 129 (7.7%)
Table 11 : BSP version 2 resource use
9.3 POWER MONITORING USING THE EMBEDDED ARM
In order to use the monitoring software the FPGA-SOC accelerator must be a revision 2 card or higher.
This is due to subtle hardware changes in the routing to the onboard system manager. A new
uboot configuration was created which includes the power monitoring software, and the FPGA card needs
to have its HPS boot system updated with the appropriate files. If the power monitoring system is not
being used, only the FPGA flash needs to be updated with the latest BSP design.
Once loaded the host can retrieve the FPGA’s current power status using the following utility created for
the OPERA project:
sudo ./sysfsOpenclMonitor /sys/bus/pci/devices/0000\:0b\:00.0/resource2
The following is an example output:
/sys/bus/pci/devices/0000:0b:00.0/resource2 opened.
After mmap
PCI Memory mapped to address 0x7fbf97951000.
OFFSET_VCC_12V0 : 11.649859 V
OFFSET_VCC_0V95 : 0.948706 V
OFFSET_VCCT_1V0 : 1.023748 V
OFFSET_MEM_1V2 : 1.198236 V
OFFSET_VCCR_1V0 : 1.020697 V
OFFSET_VCC_2V5 : 2.498360 V
OFFSET_VCC_1V8 : 1.833961 V
OFFSET_VCC_5V0 : 5.015022 V
OFFSET_12V_CURRENT : 1.847419
OFFSET_TEMPERATURE : 51.960938 degC
The overall FPGA power can be determined from the multiplication of the OFFSET_VCC_12V0 value with
the OFFSET_12V_CURRENT value.
This output will be used to populate the required fields used by RedFish (See D4.3).
The current root filesystem is set to start the “devOpenclMonitor” application on the HPS at boot. This application
watches for kernel activity and creates log files in the tmp filesystem on /mnt/ramdisk. The number of kernels that
are monitored is modified by the utility “sysfsOpenclControl”, which allows the values of shared control registers to
be updated from the host. E.g.:
./sysfsOpenclControl /sys/bus/pci/devices/0000\:0b\:00.0/resource2 <Number of kernels to log>
The monitoring software is designed to start logging information when it sees a kernel start event in the
OpenCL kernel control firmware. These signals have been directly wired to a shared address visible to the ARM
operating system, where a simple spin lock waits for changes to its state. The FPGA voltages, currents and
temperatures are then written every millisecond to flash memory on the FPGA board whilst the kernel is running
(see Table 12 for example output).
Time (usec) V12 (V) V12 current (A) Temperature (degC) Power (W)
1 11.46903 2.416846 57.49219 27.71887
1097 11.42034 2.518529 57.49219 28.76245
2174 11.42034 2.518529 57.49219 28.76245
Table 12 : Example Power Monitoring Output Data
9.4 INTEL OPENCL VERSION 17.1.2
Initially, the version of the OpenCL tools used by the OPERA project was to be fixed at the beginning of
the project. However, Intel have made significant improvements to the tools that justify updating the
OPERA BSP to the latest available version. This has given improvements in clock frequency and resource
use, as well as enabling some new compile features that have reduced development time.
10 BSP EXAMPLE USE-CASE
10.1 ANN MICMAC USE-CASE
This section describes how the final BSP described in section 9 was used to accelerate part of the
MICMAC software. For details regarding the overall MICMAC software port please see D6.8.
The Approximate Nearest Neighbour (ANN) function is used extensively within the MICMAC software to
find correlating points between different images of varying scale and orientation. It is a compute-intensive
task that requires many floating point calculations. First, some pre-processing is done on the images to
create tie-points used by the ANN algorithm. These tie-points are multidimensional values (128
dimensions in this case). For each image thousands of tie-points are created, and each must be compared
to all tie-points in another image to find the closest matching pairs. This is an n x n problem. The CPU
employs a tree structure that checks one dimension at a time, an approximation that removes the need
to match all points against each other. This is not particularly accurate, but some 50x faster than a full
n x n search.
The recursive nature of this tree search does not fit well on FPGAs, so a brute force approach that
compares all points is the only option on the FPGA. This reduces the acceleration gained by the FPGA, as
the brute force approach is significantly more compute intensive; however, the accuracy of the code is
significantly improved. The CPU tree search returns approximately 30% of valid matches, whereas the
brute force approach is 100% accurate.
To maximise the performance of the FPGA, the design is split into producer, consumer and compute
kernels. Having a single kernel responsible for global memory communications reduces the resources
required compared to having each compute kernel access global memory directly.
The following diagram illustrates how the kernels are connected.
Figure 40 : Block diagram of ANN kernels
Multiple compute kernels are created until resources are exhausted on the FPGA.
The following code is used to calculate the distance between two points on opposing images. Note that
it uses integer arithmetic for the calculation. This halves the DSP resources required by the FPGA,
doubling the number of distances that can be computed in parallel.
unsigned int CalcDistanceFunction16bit(dim_data_type A, dim_data_type B)
{
unsigned short diff[DIM]; // DIM = 128 dimensions per tie-point
#pragma unroll
for (int p = 0; p < DIM; p++)
diff[p] = (unsigned short)abs((short)((int)A.data[p] - (int)B.data[p]));
unsigned int total = 0;
#pragma unroll
for (int p = 0; p < DIM; p++)
total += (0xffffffff & ((diff[p] & 0xffff) * (diff[p] & 0xffff))) >> 8;
return total;
}
Each compute kernel processes 4 points in parallel. This is because the local M20K memory used to
buffer the input data can be accessed 4 times in parallel before replication is required. As M20K blocks
are the dominating resource for this code, replication needs to be avoided wherever possible.
Table 13 lists the resources required by the ANN kernels when compiled using the latest BSP and version
17.1.2 of the Intel OpenCL tools.
Kernel/Partition ALUTs (%) FFs (%) RAMs (%) DSPs (%)
Kernel system partition 23 23 24 8
Compute kernel 15 5 14 16
Producer 3 3 7 0
Consumer 1 1 1 0
Table 13 : ANN kernel resources required (BSP v2)
10.2 POWER MONITORING
When the ANN code is running with the power monitoring enabled in the ARM processor, the following
plot can be produced. It shows the processing time on the x axis (usecs) versus the power draw on the
y axis (W). The power analysis of the ANN code will be discussed in more detail in OPERA deliverable D4.3.
Figure 41 : ANN power ("ANN Power Consumption" plot; power draw ranges from roughly 0 to 35 W over approximately 0 to 1400000 usecs of kernel run time)
11 CONCLUSION
The purpose of this deliverable was to deliver a platform that enables the Arria10 SOC device as an
offload acceleration device and as an enabler of heterogeneity between x86 and Power systems.
During the course of the OPERA project it was found that the ARM processor provided no significant
compute benefit when accelerating the software deployed for the different use cases in OPERA. As a
result, having the ARM as the master proved to be a bottleneck in performance and programmability,
inhibiting the partners' ability to code and accelerate applications. Therefore, a second BSP was created
that followed the standard Intel OpenCL programming tool flow. This expedited the development of
application code and allowed the consortium to include CNN offload within the OPERA project timeline.
The ARM processor is now used for power/system monitoring. This was used to measure the efficiency
of different FPGA implementations, which would not have been possible without the ARM's close
interaction with the FPGA as part of the SOC package. This monitoring approach is described in more
detail in deliverable D4.3.
In conclusion, the best approach is to use SOC devices where monitoring and low-level system
management are required. For the applications studied in OPERA the SOC does not offer any
performance benefit versus a non-SOC device; however, it does provide the ability to monitor
performance for power-based design optimisations. Therefore, SOC devices have a place in HPC systems
that require fine-grained performance monitoring or low-level control offloaded from the host platform.
12 APPENDIX: EXAMPLE COMMAND QUEUE OPENCL
12.1 COMMAND QUEUE EXAMPLE
/*
This kernel can be used as template for the MICMAC use case.
*/
#define IDLE 0x0
#define WRITE_TO_GLOBAL_MEM 0x1
#define READ_FROM_GLOBAL_MEM 0x2
#define HOST_SYNC 0x3
#define KERNEL_SYNC 0x4
// Declare host input and output channels
channel ulong4 device_in __attribute__((depth(0)))
__attribute__((io("host_to_dev")));
channel ulong4 device_out __attribute__((depth(0)))
__attribute__((io("dev_to_host")));
// Declare serial link input and output channels.
// Used for inter-card communication.
channel ulong4 sch_in0 __attribute__((depth(4))) __attribute__((io("kernel_input_ch0")));
channel ulong4 sch_out0 __attribute__((depth(4))) __attribute__((io("kernel_output_ch0")));
channel ulong4 sch_in1 __attribute__((depth(4))) __attribute__((io("kernel_input_ch1")));
channel ulong4 sch_out1 __attribute__((depth(4))) __attribute__((io("kernel_output_ch1")));
// Create a synchronisation channel to allow the host to synchronise with the application kernel and
// vice versa
channel bool HostToKernelRequestChannel;
channel bool KernelToHostAcknowledgeChannel;
// Example helper kernel for reading from host to global memory, to replace normal clEnqueueWriteBuffer
// commands. Here the first word read from the interface is a header. This can be user defined to
// fit with the user's needs. A header is not necessary if the packet sizes are known.
// The number of commands to service is set by the ARM host.
__kernel
void ServiceHostCommandQueue(__global ulong4 *restrict ddr_buffer,int NoCommands)
{
int command = 0;
unsigned char state=IDLE;
unsigned int packet_count = 0;
unsigned int PacketSize = 0; // Bytes in multiples of 16 (I.e. ulong4)
unsigned int Offset = 0; // Bytes in multiples of 16 (I.e. ulong4)
while (command < NoCommands) // Fully pipelined main loop for best performance
{
ulong4 Data;
if (state != READ_FROM_GLOBAL_MEM)
Data = read_channel_altera(device_in);
switch (state)
{
case IDLE:
packet_count = 0; // Reset packet count
state = Data.s0;
PacketSize = Data.s1; // Bytes in multiples of 16 (I.e. ulong4)
Offset = Data.s2; // Bytes in multiples of 16 (I.e. ulong4)
break;
case WRITE_TO_GLOBAL_MEM:
ddr_buffer[Offset+(packet_count>>4)] = Data;
if (packet_count != (PacketSize-16))
packet_count += 16;
else
{
state = IDLE;
command++;
}
break;
case READ_FROM_GLOBAL_MEM:
{
// Braces required: a declaration cannot directly follow a case label in C.
ulong4 output = ddr_buffer[Offset+(packet_count>>4)];
write_channel_altera(device_out,output);
if (packet_count != (PacketSize-16))
packet_count += 16;
else
{
state = IDLE;
command++;
}
break;
}
// Synchronisation routines.
case HOST_SYNC:
// Instructs the kernel that the host is ready for it to start,
// i.e. data has been written to global memory.
write_channel_altera(HostToKernelRequestChannel,1);
state = IDLE;
break;
case KERNEL_SYNC:
// Command queue will pause until kernel sends acknowledgement
read_channel_altera(KernelToHostAcknowledgeChannel);
state = IDLE;
break;
default : break;
}
}
}