

FPGA Design Implementation – final release

DELIVERABLE NUMBER D6.6

DELIVERABLE TITLE FPGA Design Implementation – final release

RESPONSIBLE AUTHOR Nallatech Ltd

Co-funded by the Horizon 2020 Framework Program of the European Union

Ref. Ares(2018)3483418 - 30/06/2018

Page 2: FPGA Design Implementation – final release

1 D6.6| FPGA design implementation – final release

OPERA: LOw Power Heterogeneous Architecture for Next Generation of SmaRt Infrastructure

and Platform in Industrial and Societal Applications

GRANT AGREEMENT N. 688386

PROJECT REF. NO H2020- 688386

PROJECT ACRONYM OPERA

PROJECT FULL NAME LOw Power Heterogeneous Architecture for Next Generation of SmaRt Infrastructure and Platform in Industrial and Societal Applications

STARTING DATE (DUR.) 01/01/2015

ENDING DATE 30/11/2018

PROJECT WEBSITE www.operaproject.eu

WORKPACKAGE N. | TITLE WP6 | Low Power Small Form Factor Datacentre

WORKPACKAGE LEADER Nallatech Ltd

DELIVERABLE N. | TITLE D6.6 | FPGA Design Implementation – final release

RESPONSIBLE AUTHOR Richard Chamberlain, Nallatech Ltd

DATE OF DELIVERY (CONTRACTUAL) 30/06/2018 (M31)

DATE OF DELIVERY (SUBMITTED) 30/06/2018 (M31)

VERSION | STATUS V1.0

NATURE R(Report)

DISSEMINATION LEVEL PU(Public)

AUTHORS (PARTNER) Richard Chamberlain (Nallatech), Giulio Urlini (STM), Roberto Peveri (TESEO), Daniele Paolini (TESEO)


VERSION MODIFICATION(S) DATE AUTHOR(S)

0.1 Initial update to D6.2 for review 13/06/2018 Richard Chamberlain (Nallatech)

0.2 First internal review 18/06/2018 Giulio Urlini (STM)

0.3 Second internal review 27/06/2018 Roberto Peveri (TESEO), Daniele Paolini (TESEO)

1.0 Final review 29/06/2018 Richard Chamberlain (Nallatech)


PARTICIPANTS CONTACT

STMICROELECTRONICS SRL: Giulio Urlini, Email: [email protected]

IBM ISRAEL SCIENCE AND TECHNOLOGY LTD: Joel Nider, Email: [email protected]

HEWLETT PACKARD CENTRE DE COMPETENCES (FRANCE): Cristian Gruia, Email: [email protected]

NALLATECH LTD: Craig Petrie, Email: [email protected]

ISTITUTO SUPERIORE MARIO BOELLA: Olivier Terzo, Email: [email protected]

TECHNION ISRAEL INSTITUTE OF TECHNOLOGY: Dan Tsafrir, Email: [email protected]

CSI PIEMONTE: Vittorio Vallero, Email: [email protected]

NEAVIA TECHNOLOGIES: Stéphane Gervais, Email: [email protected]

CERIOS GREEN BV: Frank Verhagen, Email: [email protected]

TESEO SPA: Stefano Serra, Email: [email protected]

DEPARTEMENT DE L'ISERE: Olivier Latouille, Email: [email protected]


ACRONYMS LIST

Acronym Description

AVX     Advanced Vector Extensions
BSP     Board Support Package
CAPI    Coherent Accelerator Processor Interface
FPGA    Field Programmable Gate Array
GPGPU   General Purpose Graphics Processing Unit
HDL     Hardware Description Language
HPE     Hewlett Packard Enterprise
I/O     Input Output
IoT     Internet of Things
MAC     Multiply accumulate operations
MTBF    Mean Time Between Failure
OpenCL  Open Computing Language
PCIe    PCI Express
QSFP    Quad Small Form-factor Pluggable
SIMD    Single Instruction Multiple Data
SoC     System on Chip
SWAP    Size, Weight and Power
HPS     Hard Processor System

LIST OF FIGURES

Figure 1: SOC accelerator functional diagram .......... 11
Figure 2: SOC accelerator physical layout .......... 12
Figure 3: SOC Accelerator functional diagram .......... 13
Figure 4: 385A-SOC System Manager .......... 15
Figure 5: QSFP28 clocking structure .......... 17
Figure 6: Extended Front Panel Connections .......... 17
Figure 7: Arria 10 FPGA External Clocking Options .......... 19
Figure 8: 10-pin USB header .......... 20
Figure 9: 10-pin dual USB connector .......... 20
Figure 10: USB daisy chain connectivity .......... 20
Figure 11: I2C Addressing .......... 22
Figure 12: Altera temperature sensor IP .......... 26
Figure 13: SOC Active Heat Sink .......... 27
Figure 14: Air flow drawn through active heat sink .......... 27
Figure 15: Autodesk CFD Arria 10 Stable Board Temperature (50 Watt design) .......... 28
Figure 16: Altera ARM Cortex A9 Hard Processor System (HPS) .......... 29
Figure 17: Altera SOC Device Block Diagram .......... 30
Figure 18: Boot Memory Locations .......... 31
Figure 19: FPGA Configuration Block Diagram .......... 32
Figure 20: SOC Configuration with boot Sources .......... 33
Figure 21: HPS boots from FPGA .......... 33
Figure 22: ARM Base Configuration .......... 35


Figure 23: Intel FPGA OpenCL tool flow .......... 37
Figure 24: BSP Components .......... 38
Figure 25: BSP component interaction .......... 40
Figure 26: SOC reference IP placement .......... 40
Figure 27: Host channel reference design placement (Nallatech 385A device) .......... 41
Figure 28: Nallatech 385A base.aocx .......... 42
Figure 29: Altera SDK for OpenCL software architecture .......... 43
Figure 30: OpenCL MICMAC software components .......... 46
Figure 31: Basic OpenCL streaming interface sequence diagram .......... 47
Figure 32: Command kernel example sequence diagram .......... 48
Figure 33: Host to SOC interfaces .......... 50
Figure 34: Using the FPGA as a common interconnect .......... 52
Figure 35: Block diagram for single channel on 385A SOC .......... 53
Figure 36: Arria10 SOC Development Kit .......... 56
Figure 37: Arria 10 Development Kit Block Diagram .......... 57
Figure 38: BSP version 2 with CPU as master .......... 59
Figure 39: BSP Version 2 Floor plan .......... 60
Figure 40: Block diagram of ANN kernels .......... 63
Figure 41: ANN power .......... 64

LIST OF TABLES

Table 1: SOC Accelerator Feature List .......... 12
Table 2: FPGA Voltage Settings .......... 14
Table 3: ID PROM Data .......... 23
Table 4: System Manager Status LEDs .......... 24
Table 5: User LEDs .......... 25
Table 6: SOC accelerator BSP components .......... 39
Table 7: Serial link features .......... 52
Table 8: Serial channel register addresses .......... 54
Table 9: Arria10 SOC development kit hardware development tasks .......... 56
Table 10: BSP development kit work .......... 58
Table 11: BSP version resource use .......... 60
Table 12: Example Power Monitoring Output Data .......... 62
Table 13: ANN kernel resources required (BSP v2) .......... 64


EXECUTIVE SUMMARY

The main objective of WP6 is to bring top-of-the-class, power-efficient FPGA (Field Programmable Gate Array) technology into the Hewlett Packard Enterprise (HPE) Moonshot Small Form Factor Data Centre. The flexibility of the FPGA is the enabling technology for integrating the different processing elements of the heterogeneous architecture.

During task 6.2, Nallatech developed a SOC-based FPGA accelerator prototype to support the OPERA project. This document outlines the details of the device, but does not seek to justify the design choices, as these are covered in OPERA deliverable D6.1.

Nallatech has also developed a Board Support Package (BSP) to support the OpenCL toolflow on the SOC

accelerator prototype developed for the OPERA project. This includes the development of optical serial

connections critical to scalability and interoperability of the heterogeneous processing components. This

document also describes these aspects in detail.

This document describes the hardware, firmware and software work undertaken to support a SOC FPGA

accelerator in the HP Moonshot server. It documents the hardware features and firmware required for

control of the SOC FPGA and the Board Support Package (BSP) software required for programming it.

D6.6 is an update of the original D6.2 document produced in project month M15. As the project progressed, it became clear that some of the design decisions made at the start of the project required modification in order to fulfil the OPERA objectives. Sections 1-8 remain unchanged, with the updates described thereafter.


TABLE OF CONTENTS

1 SOC ACCELERATOR OVERVIEW ......................................................................................................... 11

1.1 OVERVIEW ................................................................................................................................ 11

1.2 SOC ACCELERATOR FEATURES .................................................................................................. 11

2 HARDWARE TECHNICAL DETAILS ...................................................................................................... 14

2.1 OVERVIEW ................................................................................................................................ 14

2.2 FORM-FACTOR ......................................................................................................................... 14

2.3 USER FPGA ............................................................................................................................... 14

2.4 8-LANE PCI-EXPRESS 3.0 INTERFACE (WITH CVP), 8-LANE PCIE MECHANICAL ........................... 14

2.5 SYSTEM MANAGER ................................................................................................................... 15

2.6 USER FPGA DDR4 SDRAM ......................................................................................................... 16

2.7 SOC HPS ................................................................................................................................... 16

2.8 2X QSFP PORTS SUPPORTING 10/40 GB/S ETHERNET ............................................................... 16

2.9 FRONT PANEL CONNECTIVITY ................................................................................................... 17

2.10 CLOCKING CIRCUIT ................................................................................................................... 18

2.10.1 PCIe Clock .......................................................................................................................... 18

2.10.2 Network Clock ................................................................................................................... 18

2.10.3 Memory & General FPGA Clocks ........................................................................................ 18

2.10.4 Configuration Clock ........................................................................................................... 18

2.10.5 Transceiver Clock............................................................................................................... 18

2.10.6 Clocking Options ............................................................................................................... 19

2.10.7 External 1 Pulse per Second Clock (1PPS) .......................................................................... 19

2.11 10-PIN USB HEADER ................................................................................................................. 20

2.12 ON-BOARD USB-BLASTER II....................................................................................................... 21

2.13 UART TO USB INTERFACE.......................................................................................................... 21

2.14 JTAG UTILITIES .......................................................................................................................... 21

2.15 I2C DEVICES .............................................................................................................................. 22

2.16 ID PROM ................................................................................................................................... 23


2.17 STATUS LEDS ............................................................................................................................ 24

2.18 USER LEDS ................................................................................................................................ 25

2.19 CARD INITIALISATION & SYSTEM RESET .................................................................................... 25

2.19.1 CvP Autonomous mode ..................................................................................................... 25

2.20 CONTROL & TEMPERATURE SENSOR ........................................................................................ 26

2.20.1 FPGA Temperature Alert Procedure .................................................................................. 26

2.21 POWER ..................................................................................................................................... 26

2.22 THERMAL ................................................................................................................................. 27

3 ARRIA 10 HARD PROCESSOR SYSTEM ............................................................................................... 29

3.1 ALTERA GX660 SOC HARD PROCESSOR SYSTEM (HPS) .............................................................. 29

3.2 HPS OVERVIEW ......................................................................................................................... 30

4 SOC ACCELERATOR CONFIGURATION ............................................................................................... 32

4.1 FPGA CONFIGURATION OVERVIEW ........................................................................................... 32

4.2 CONFIGURATION WITH BOOT SOURCES ................................................................................... 33

4.3 CONFIGURATION WITH SELF HPS BOOT ................................................................................... 33

4.3.1 QSPI Configuration ............................................................................................................ 34

4.3.2 Configuration via USB ........................................................................................................ 34

4.4 ARM SIDE CONFIGURATION...................................................................................................... 34

5 SOC ACCELERATOR BSP .................................................................................................................... 37

5.1 OPENCL TOOL FLOW................................................................................................................. 37

5.2 BSP OVERVIEW ......................................................................................................................... 38

5.3 OPERA BSP REQUIREMENTS ..................................................................................................... 38

5.4 BSP REQUIRED DELIVERABLES .................................................................................. 39

5.4.1 Base aocx file ..................................................................................................................... 39

5.4.2 Board Specification File ..................................................................................................... 42

5.4.3 Kernel Driver ..................................................................................................................... 42

5.5 PLATFORM SUPPORT ................................................................................................................ 43

5.6 SERIAL LINKS ............................................................................................................................ 43


5.7 ACHIEVING BEST FMAX OF BSP ................................................................................................. 43

5.7.1 Routing Islands .................................................................................................................. 43

5.7.2 Multiple Seed Sweeps ....................................................................................................... 43

5.8 LIMITING IMPACT ON RESOURCE ............................................................................................. 44

5.9 OPENCL INTERFACES ................................................................................................................ 44

5.9.1 Accessing the serial link interface from OpenCL ................................................................ 44

5.9.2 Example loopback design for host channels....................................................................... 45

5.9.3 DDR4 Memory Access ....................................................................................................... 45

5.10 SUPPORT FOR FUTURE INTEL TOOL VERSIONS .......................................................................... 45

5.11 MICMAC APPLICATION APPROACH ........................................................................................... 45

5.12 HOST KERNEL INTERFACE (EXAMPLE 1) .................................................................................... 47

5.13 HOST KERNEL INTERFACE (EXAMPLE 2) .................................................................................... 47

6 HOST (X86) PLATFORM .................................................................................................................... 49

6.1 HOST CHANNELS ...................................................................................................................... 49

6.2 HOST TO ARM PROPRIETARY CONTROL INTERFACE .................................................................. 49

6.2.1 X86 Host driver and API ..................................................................................................... 50

6.2.2 Multiple cards ................................................................................................................... 51

7 SERIAL INTERCONNECT DETAILS ....................................................................................................... 52

7.1 OVERVIEW ................................................................................................................................ 52

7.2 SERIAL CHANNEL IP DETAILS ..................................................................................................... 52

7.3 SERIAL CHANNEL DEBUG .......................................................................................................... 53

7.3.1 Control and Status Registers .............................................................................................. 53

7.3.2 Serial Channel Status Register ........................................................................................... 54

7.3.3 Serial Channel Control Register ......................................................................................... 54

7.3.4 Kernel Rx Ready Performance Accumulator Register ......................................................... 54

7.3.5 Kernel Tx Throughput Performance Accumulator Register ................................................ 54

7.3.6 Performance Control Register ........................................................................................... 55

7.3.7 MMD Support Serial Channel CSR Functions ..................................................................... 55


8 INITIAL DESIGN EVALUATION PLATFORM .......................................................................... 56

8.1 OVERVIEW ................................................................................................................................ 56

9 BSP VERSION 2 (FINAL) ..................................................................................................................... 59

9.1 MODIFICATIONS REQUIRED ...................................................................................................... 59

9.2 UPDATED BSP ........................................................................................................................... 59

9.3 POWER MONITORING USING THE EMBEDDED ARM ................................................................. 61

9.4 INTEL OPENCL VERSION 17.1.2 ................................................................................................. 62

10 BSP EXAMPLE USE-CASE ............................................................................................................... 63

10.1 ANN MICMAC USE-CASE ........................................................................................................... 63

10.2 POWER MONITORING .............................................................................................................. 64

11 CONCLUSION ............................................................................................................................... 65

12 APPENDIX: EXAMPLE COMMAND QUEUE OPENCL ....................................................................... 66

12.1 COMMAND QUEUE EXAMPLE ................................................................................................... 66


1 SOC ACCELERATOR OVERVIEW

1.1 OVERVIEW

In order to be state of the art and to encompass the spirit of heterogeneity of the OPERA project, the key hardware requirements for the OPERA project are as follows:

• Arria 10 technology (embedded floating-point units were key; this was the state-of-the-art FPGA at the commencement of the project in 2015).

• An embedded processor, which increases the heterogeneity of the system and provides more research possibilities for compute offload and power monitoring.

• Off-chip communications for scalability and a common interconnect between different host platforms, e.g. x86 and ARM.

• Host platform communications (CAPI and PCIe) for Power and x86 devices.

• External DDR memory for the ARM and OpenCL kernels, large enough to store the compute data for a number of problems.

The following documentation describes the different hardware elements that have been generated to

fulfil these requirements.

1.2 SOC ACCELERATOR FEATURES

Figure 1 : SOC accelerator functional diagram

Figure 1 shows the key features of the SOC accelerator. The PCIe x8 Gen 3 interface is used for communication with the host system; alternatively, the ARM and OpenCL kernels can run in a standalone configuration if this proves preferable. Two DDR memory banks are attached, providing application memory for the FPGA and the embedded ARM processors. QSFP network ports provide high-speed communications between multiple boards. The CPLD is used for bring-up and configuration control of the FPGA. Figure 2 illustrates the physical layout of the different hardware components.


Figure 2 : SOC accelerator physical layout

The following table lists the different hardware features of the SOC accelerator.

Feature Description

Low Profile PCI-Express form factor: A single-width, half-height, half-length (167.6 mm x 68.9 mm x 17 mm) card, which allows FPGA cards to be integrated into the Moonshot Edgeline server.

PCI-Express 3.0 host interface: Electrical and mechanical x8 PCIe interface; the highest-speed host interface supported by the FPGA.

Altera Arria 10 SX 660 FPGA: Powerful FPGA with a dual ARM Cortex-A9 processor and FPGA fabric with up to 1 TFlop/Sec processing capability.

2 QSFP28 ports: Used for inter-card communications.

2 banks of 4 GByte, x72, 2133 MT/s DDR4 SDRAM: External memories supporting up to 34 GBytes/Sec.

MAX 10 FPGA (System Manager): Small FPGA/CPLD for system/configuration management of the Arria 10 FPGA.

QSPI 2 Gb FLASH memory: 2 gigabits of external flash storage for multiple FPGA configuration images.

JTAG interface: USB-Blaster II interface for FPGA JTAG access (10-pin USB header).

USB: USB on the front panel with an integrated USB hub and board-to-board USB management interconnect.

User clocks: External clocks for user logic.

LEDs: External user LEDs.

Table 1: SOC Accelerator Feature List

Figure 3 illustrates how the different features are connected on the device.



Figure 3 : SOC Accelerator functional diagram


2 HARDWARE TECHNICAL DETAILS

2.1 OVERVIEW

This section lists the key technical details of the SOC accelerator.

2.2 FORM-FACTOR

The SOC accelerator is a low profile, single width PCIe add-in card with an x8 electrical and x8 mechanical interface. The device is 68.90 mm high and 167.65 mm long (the recommended dimensions for a low profile PCIe card). The width of the card is 2.5 mm at the rear and 14.47 mm at the front, complying with the PCI single slot width dimensions. The PCIe interface complies with the PCIe 3.0 specification. Strictly adhering to the PCIe specification is required to ensure successful integration with the Moonshot Edgeline server.

2.3 USER FPGA

The User FPGA is an Altera Arria 10 SX660 in a F34 package. The FPGA I/O banks are powered by the

supplies as detailed in Table 2. FPGA core voltage (Vcore) is 0.9V.

Signals | Bank | Bank I/O Voltage
QSFP 0 | 1E | N/A1
QSFP 1 | 1F | N/A1
PCIe x8 Port | 1C, 1D | N/A1
FPGA Configuration I/O | 2A | 1.8 V
LEDS, I2C, CLK, USB, Misc | 3A, 3B, 3F, 2I | 1.8 V
DDR4 Bank FPGA | 3C, 3D, 3E | 1.2 V
DDR4 Bank HPS | 2J, 2K | 1.2 V

Table 2: FPGA Voltage Settings

2.4 8-LANE PCI-EXPRESS 3.0 INTERFACE (WITH CVP), 8-LANE PCIE MECHANICAL

The SOC board has an 8-lane PCIe 3.0 interface. It does not feature a dedicated PCIe chip for PCIe host transfers, hence the user FPGA design must include the Altera PCIe Hard IP core. Altera supports multiple configurations of the PCIe core as part of QSYS; the user can set up the core for anything from 1 lane at PCIe 1.0 to 8 lanes at PCIe 3.0.

The PCIe interface has the following capabilities:

• Host PCIe bandwidth up to 8 GB/s2 (8 lanes at 8Gbps – PCIe 3.0) with CvP support using the

Altera QSYS Hard IP

• System Management Bus (SMBus)

The SOC accelerator ID PROM can be read by the host over the System Manager (SM) Bus on the PCIe

connection while the host is in standby mode (on-board ID PROM is powered by the PCIe 3.3V AUX

power supply). The I2C address of the PROM is 0x50. This is facilitated by the System Manager device.

1 VccR & VccT set to 1.03 V, VccH at 1.8 V, VccA at 1.8 V

2 Maximum theoretical data rate for 8 lanes of PCIe 3.0, the actual host bandwidth depends on the host hardware (motherboard, chipset,

processor, etc.), the Hard IP settings and the FPGA design itself.
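The "up to 8 GB/s" figure quoted above can be sanity-checked with a short calculation. This is a sketch of the standard PCIe 3.0 arithmetic (8 GT/s per lane with 128b/130b line encoding), not a measurement on this board:

```python
# Theoretical PCIe 3.0 x8 host bandwidth (see footnote 2): 8 GT/s per lane
# with 128b/130b line encoding. Actual throughput is lower and depends on
# the host platform, the Hard IP settings and the FPGA design.
LANES = 8
RATE_PER_LANE = 8e9        # transfers/s per lane (8 GT/s, PCIe 3.0)
ENCODING = 128 / 130       # 128b/130b encoding efficiency

raw_bits = LANES * RATE_PER_LANE            # 64 Gb/s raw
payload_bytes = raw_bits * ENCODING / 8     # ~7.88 GB/s, quoted as "up to 8 GB/s"

print(f"raw: {raw_bits / 1e9:.0f} Gb/s, payload: {payload_bytes / 1e9:.2f} GB/s")
```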


2.5 SYSTEM MANAGER

In order to control the behaviour of the SOC FPGA device, a system manager has been created; it is used to perform the following functions:

• Configuration of FPGA with support for fallback.

• QSPI flash interfacing

• Power monitoring and sequencing

• I2C interface bridging to peripherals

• JTAG bridging to the User FPGA

• Write back functionality, allows FPGA to update flash and communicate with peripherals

• Environmental monitoring using peripheral sensors, responding to warnings/alerts as necessary.

• GPIO hosting for miscellaneous peripheral control

• USB Blaster IP integration

A MAX10³ device has been selected for the system management controller.

Figure 4 : 385A-SOC System Manager (block diagram: the MAX10 system manager bridging the host SMBus on the PCIe fingers, power supply sequencing and "Power Good" monitoring, a 12V current transducer, clocks, I2C sensors and ID PROM, JTAG, UART and control signals to the Arria 10 SOC)

A management USB header which connects to the motherboard via a standard cable is provided for general monitoring and control. An FT234XD provides a USB to UART interface to the MAX10.

A number of FPGA signals are also grouped with the Altera Fast Passive Parallel (FPP) bus to create a wider parallel interface. After configuration this interface (excluding any dedicated signals) will be used for read/write operations between the MAX10 & FPGA. An FPGA-requested soft configuration (one that does not require a power cycle) can be carried out by the MAX10. To request a soft configuration, a user writes to the Reconfiguration Register in the System Manager Interface Core.

3 https://www.altera.com/products/fpga/max-series/max-10/overview.html

Two FPGA images, a working and a fallback image, are stored in 2Gbit QSPI flash (8x256Mbit). In

addition to FPGA configuration, the reset sequencing of all other devices is controlled by the MAX10.
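The capacity budget for the two images follows directly from the flash size. The equal split between the working and fallback slots in the sketch below is an illustrative assumption; the document does not state the actual slot sizes:

```python
# Capacity budget for the 2 Gbit QSPI configuration flash (8 x 256 Mbit)
# holding a working and a fallback FPGA image. The equal split between the
# two slots is an illustrative assumption.
FLASH_BITS = 8 * 256 * 2**20      # 2 Gbit total
IMAGES = 2                        # working + fallback

flash_bytes = FLASH_BITS // 8     # 256 MiB of storage
per_image = flash_bytes // IMAGES # 128 MiB available per image

print(f"total: {flash_bytes // 2**20} MiB, per image: {per_image // 2**20} MiB")
```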

Power sequencing is performed by the MAX10, which controls the power up/down sequence of the board as well as monitoring the various “Power Good” signals. A group of board voltages is measured via the ADCs in the MAX10.

An I2C bus collects temperature and current monitoring data, enabling the MAX10 to present

environmental data to the user. An additional SMB bus connection is provided as an option to the PCIe

fingers, enabling the server’s host sideband management layer to interface to the MAX10.

2.6 USER FPGA DDR4 SDRAM

The User FPGA fabric on the SOC accelerator is connected to one bank of DDR4 SDRAM (part number

MT40A512M8RH-083E:B) which is 72 bits wide, and is configured as 4 GB and operates at 2133 MT/s.
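The 2133 MT/s figure can be related back to the "up to 34 GBytes/Sec" quoted in Table 1. The sketch below assumes that 64 of the 72 bits per bank carry data (8 bits ECC) and that both banks listed in Table 1 are counted; neither assumption is stated explicitly in the text:

```python
# Peak DDR4 bandwidth behind Table 1's "up to 34 GBytes/Sec" figure.
# Assumes 64 of the 72 bits per bank carry data (8 bits ECC) and that both
# banks listed in Table 1 are counted - assumptions, not stated explicitly.
MTS = 2133e6        # transfers per second per data pin
DATA_BITS = 64      # payload bits per bank (x72 minus 8 ECC bits)
BANKS = 2

per_bank = MTS * DATA_BITS / 8      # bytes/s for one bank (~17.1 GB/s)
total = BANKS * per_bank            # ~34.1 GB/s across both banks

print(f"per bank: {per_bank / 1e9:.1f} GB/s, total: {total / 1e9:.1f} GB/s")
```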

The “Arria 10 External Memory Interfaces” (EMIF) IP should be instantiated within the QSYS design, and the parameters from the provided QPRS files should be used.

The EMIF Hard IP (on the Arria 10) has a complementary EMIF Debug Toolkit component that can be included in any design. There is only one EMIF Debug component per I/O component on Arria 10; therefore the EMIF Debug component must be shared between the two available memory banks.

2.7 SOC HPS

The Arria 10 system-on-a-chip (SoC) is composed of a dual-core ARM® Cortex™ -A9 hard processor

system (HPS) and an FPGA. The HPS architecture integrates a wide set of peripherals that reduce board

size and increase performance within a system. Integrated into the HPS are a subset of peripheral

functions including:

• HPS-to-FPGA bridge port - 32, 64, or 128 bits wide

• General-purpose direct memory access (DMA) controller

• Three Ethernet media access controllers (EMACs)

• Two USB 2.0 on-the-go (OTG) controllers

• NAND Flash Controller

• QSPI flash controller

• I2C, UART, SPI, Watchdogs, Timers, etc

The dual-core ARM HPS on the accelerator is connected to one bank of DDR4 SDRAM which is 40 bits wide, is configured as 2 GB and operates at 2133 MT/s.

The HPS is discussed in more detail in section 3.

2.8 2X QSFP PORTS SUPPORTING 10/40 GB/S ETHERNET

The 385A-SOC features two 10/40 Gb/s capable ports. The QSFP high speed interfaces are directly driven from IP within the user FPGA design. Altera provides IP cores for multiple high speed protocols compatible with the 385A-SOC, which all have their own reference clock requirements. The SOC accelerator is populated with an on-board dual frequency clock chip which, by default, feeds a 644.53125 MHz reference (for 10/40 GbE) to the dedicated transceiver IP reference clock pin on the FPGA.

A range of protocols can be implemented using Altera’s QSYS IP core library. Different reference clock frequencies might be required for such IP cores.


Each network device is allocated a dedicated MAC address. These addresses are labelled on the board and programmed in the ID PROM.

Figure 5 : QSFP28 clocking structure (clock generators OSC 0 and OSC 1 plus a Si5346 jitter attenuation chip, fed by a TCXO, 10MHz and 1PPS inputs and the FPGA recovered clocks, generate the QSFP28 reference clocks; the optional advanced clocking block is controlled from the FPGA)

2.9 FRONT PANEL CONNECTIVITY

The front panel on the SOC accelerator provides access to the two QSFP ports and one USB port. There

are two front panel options, a half-height and full height. The half-height allows connectivity to the two

QSFP and front panel USB ports. The full height option extends this to facilitate connectors for an

external clock and a 1PPS signal. See Figure 6.

Figure 6 : Extended Front Panel Connections


2.10 CLOCKING CIRCUIT

The clocking circuit on this board has been chosen to provide flexible, high-quality clock sources to the FPGA and associated circuitry without adding significant cost. See Figure 7 for the clock layout.

The following clocks sources are available to the FPGA design:

• PCIe Slot Clock

• Network Clock

• Memory and General FPGA Clocks

• Configuration Clock

• Optional 1PPS External Clock input

• Optional External Clocking using clock synthesizer

2.10.1 PCIe Clock

The PCIe clock is 100MHz and is provided by the host motherboard. This clock is routed directly to the

FPGA. It must be used as the reference clock for the PCIe IP (Altera PCIe Hard IP) but can also be used

for other purposes.

2.10.2 Network Clock

The network interface clock is a dual-frequency, selectable Silicon Laboratories Si532 with a default power-on frequency of 644.53125 MHz, which is used to generate the 10.3125 Gb/s transceiver line rates required for 10 GbE and 40 GbE. It uses the LVDS I/O standard.

The clock frequency can be changed by the FPGA. The alternative frequency is 531.25 MHz, which is the primary clock when using Fibre Channel.
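The two oscillator frequencies map directly onto the transceiver line rates. The x16 multiplier in the sketch below is the usual serial PLL setting for these rates and is an assumption here; the exact PLL configuration is chosen in the transceiver IP:

```python
# Reference clock to line rate relationship for the dual-frequency network
# clock. The x16 multiplier is an assumption (set in the transceiver IP).
def line_rate(refclk_hz, multiplier=16):
    """Serial line rate produced from a transceiver reference clock."""
    return refclk_hz * multiplier

print(line_rate(644.53125e6))  # 10.3125 Gb/s -> 10 GbE / 40 GbE lane rate
print(line_rate(531.25e6))     # 8.5 Gb/s    -> 8G Fibre Channel line rate
```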

2.10.3 Memory & General FPGA Clocks

This clock source provides three buffered outputs that drive the quadrants of the FPGA. The clock

outputs are LVDS and have a power on frequency of 266 MHz; this default frequency has been chosen

for simple generation of the 1866 MHz clock which is required for the DDR4 memories.

These clocks are sourced from a Silicon Laboratories Si5338 clock generator. Additional clock

frequencies required within the FPGA can be derived from this clock source.

2.10.4 Configuration Clock

The configuration clock is a 100MHz clock that is used by the FPGA internal configuration fabric but is

also routed to the FPGA for use as a user clock. This is a standard single ended oscillator.

2.10.5 Transceiver Clock

Both QSFP modules are driven from dedicated transceivers. These transceivers require a dedicated reference clock pin to utilize the cleanest clock source and hence support the highest-speed I/O standards with tight jitter tolerances.


2.10.6 Clocking Options

Figure 7 : Arria 10 FPGA External Clocking Options (the Si5338 clock synthesizer, OSC0, OSC1 and a second clock synthesizer fed by the TCXO, 1PPS and EXT_CLK inputs and the FPGA recovered clocks drive the GXB_REFCLK0-5 transceiver reference clock inputs of the Arria 10 SOC)

Figure 7 shows the full network clock circuit illustrating the flexible clocking options. The clocking options are with respect to transceiver clocking. The transceivers in the Arria 10 are all in the same column, so they can utilise any of the GXB_REFCLKs as the source for QSFP reference clocks.

Fixed Clocks:

• Clock synth chip (Si5338) driving DDR4 Ref clock and system clock inputs.

Standard Option:

• OSC 0 fitted with a dual frequency Si532 clock to generate GXB_REFCLK0

Optional Clocks:

• OSC 0 fitted with a fully programmable Si570 clock to generate GXB_REFCLK0

• OSC 1 fitted with a fully programmable Si570 clock to generate GXB_REFCLK1

• OSC 1 fitted with a dual frequency Si532 clock to generate GXB_REFCLK1

• Clock synth chip Si5346 generates GXB_REFCLK2, GXB_REFCLK3, GXB_REFCLK4 & GXB_REFCLK5. The Si5346 outputs may be a function of the following inputs:

◦ Inputs connected to 2 FPGA generated clocks (typically recovered clocks)

◦ Input connected to 1 external clock source (SMA connector)

◦ Reference clock (always connected), a temperature compensated 50.00 MHz oscillator (TCXO)

2.10.7 External 1 Pulse per Second Clock (1PPS)

The 1PPS signal can be used to provide a means of synchronizing the FPGA timing to an external timing

signal. The input is diode protected with a maximum 5.5V allowed. Vih will be approximately 3V

depending on the FPGA I/O threshold used. To utilize this input the extended front panel is required.


2.11 10-PIN USB HEADER

The 385A-SOC features a 10-pin USB header to provide access to the on-board USB-Blaster II (JTAG chain

access) and the FPGA UART to USB interface.

Figure 8 labels the two USB interfaces accessible to the user; Figure 9 shows the 10-pin on-board USB

connector.

Figure 8: 10-pin USB header

Figure 9 : 10-pin dual USB connector

The SOC accelerator has several external USB connectors, one on the front panel and two on the rear of the board. The front panel USB connector is multiplexed with the IN USB connector on the rear. Plugging a cable into the front panel will disconnect USB traffic routed through the rear connector. A second OUT USB connector on the rear of the board allows USB traffic to be routed out of the USB hub such that a downstream connection can be set up from the front panel (or rear USB IN) of the first board to up to three downstream boards, as illustrated in Figure 10.

Figure 10 : USB daisy chain connectivity


Connecting multiple devices in this manner allows a single entry point to a system’s JTAG chain. This will be useful in heterogeneous systems where some host OSes may not support the Quartus tool flow.

2.12 ON-BOARD USB-BLASTER II

The SOC Accelerator features an on-board USB-Blaster II chip. It enables user access to the FPGA JTAG

Chain for FPGA & Configuration Flash programming purposes using the Quartus Programmer or for

debug purposes using the Quartus JTAG based utilities.

Quartus uses the built-in USB device driver on Linux to access the USB-Blaster II chip. By default, root is the only user allowed to use these devices. You must change the permissions on the ports before you can use the USB-Blaster II to program devices with the Quartus software.
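On a typical Linux host this permission change is done with a udev rule. The fragment below follows Intel's published cable driver instructions and uses the Altera/Intel USB vendor ID (09fb) with the USB-Blaster II product IDs; the file name is conventional, and the IDs should be checked against `lsusb` output on the actual system:

```
# /etc/udev/rules.d/51-usbblaster.rules
# Grant all users access to the USB-Blaster II. After installing, reload:
#   sudo udevadm control --reload && sudo udevadm trigger
SUBSYSTEM=="usb", ATTR{idVendor}=="09fb", ATTR{idProduct}=="6010", MODE="0666"
SUBSYSTEM=="usb", ATTR{idVendor}=="09fb", ATTR{idProduct}=="6810", MODE="0666"
```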

It is expected that the USB interface will be used during debug of the different MIMMAC

implementations.

2.13 UART TO USB INTERFACE

The board also connects some FPGA pins to an on-board UART to USB chip, providing the means to create a simple debug USB UART port. The UART to USB chip populated on the board is an FT234XD-R; please visit FTDI Chip’s website to download the part’s datasheet⁴ and instructions on how to install the device driver for your preferred Operating System.

2.14 JTAG UTILITIES

The USB-Blaster II interface provides access to the FPGA JTAG chain and allows the user to reprogram both the FPGA and the Configuration Flash using the Quartus Programmer Tool.

Intel provides several other debug tools which use the JTAG chain.

• SignalTap II Logic Analyzer

• Transceiver Toolkit

• External Memory Interface Toolkit

• In-System Sources and Probes Editor

• Etc.

Please refer to Intel’s documentation for details on how to use these tools. The development of the

OpenCL BSP should remove the requirement for a user to be concerned with the different tools by

providing a software driven interface to the different card features.

4 http://www.ftdichip.com/Support/Documents/DataSheets/ICs/DS_FT234XD.pdf


2.15 I2C DEVICES

Figure 11 : I2C Addressing (the MAX10 system manager hosts I2C buses 0-4 and 6 plus the SM bus to the PCIe connector; attached devices are the TMP432ADGST temperature sensor at 0x4C, the AT24C16C-STUM EEPROM at 0x50, the Si5338B clock generator at 0x70, the ZL8802 core power controller at 0x31, QSFP0 and QSFP1 at 0xA0 and the Si5346A clock chip at 0x6C)

An I2C bus connects the low speed peripheral control signals to the system manager for control. Figure 11 shows which devices are connected and can be monitored by the MAX10 device via the I2C bus. This allows the serial links, clocks, PROM and sensor data to be monitored in the host system. This will be required to control the system behaviour with respect to power monitoring and temperature control, which is extremely important for the truck use case described in deliverable D6.1.


2.16 ID PROM

The ID PROM is connected to the I2C bus at address 0x50. This PROM contains useful information about

the device such as serial number, card version, etc. (see Table 3). This PROM can be read from the host

system via the system manager device. It is then possible to identify exactly what is located in the

system. The PROM is read only.

Bytes (decimal) | Contents | Example
0 – 3 | Reserved | Do Not Use
4 – 15 | Serial Number | 7095007
16 – 38 | Order code | P385ASOC-660-11A-10
39 – 56 | Card revision | v0201
57 – 60 | FPGA type |
61 | FPGA fabric speed | 2
62 – 67 | PROM programming date | MM/DD/YYYY
68 – 73 | QSFP28_0 MAC address | 00:0c:d7:00:1f:c7
74 – 79 | QSFP28_1 MAC address | 00:0c:d7:00:1f:c8
80 | FPGA Transceiver Speed | 3
81 – 100 | FPGA Part Number | 10AX115N3F40E2SG
101 – 115 | Reserved | Do Not Use
116 – 127 | TBC | Anything

Table 3: ID PROM Data
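As an illustration of the layout in Table 3, a hypothetical host-side decoder for the raw 128-byte PROM contents might look as follows. The field offsets are taken from the table; ASCII encoding and zero/space padding of the text fields are assumptions, as is the function name:

```python
# Hypothetical decoder for the 128-byte ID PROM laid out as in Table 3.
# Text fields are assumed to be ASCII, padded with zeros or spaces.
def decode_id_prom(blob: bytes) -> dict:
    assert len(blob) == 128, "ID PROM is 128 bytes"
    text = lambda lo, hi: blob[lo:hi + 1].rstrip(b"\x00 ").decode("ascii")
    mac = lambda lo: ":".join(f"{b:02x}" for b in blob[lo:lo + 6])
    return {
        "serial_number": text(4, 15),
        "order_code": text(16, 38),
        "card_revision": text(39, 56),
        "fabric_speed": blob[61],
        "programming_date": text(62, 67),
        "qsfp28_0_mac": mac(68),
        "qsfp28_1_mac": mac(74),
        "transceiver_speed": blob[80],
        "fpga_part_number": text(81, 100),
    }
```

The blob itself would come from reading the 128 bytes at I2C address 0x50 via the System Manager.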


2.17 STATUS LEDS

The System Manager reflects its status on LEDs D11-D17, while the user FPGA can drive LEDs D18 and D19. These are not planned to be accessible from the OpenCL BSP; however, they could be added as an external IO channel if deemed necessary.

The LEDs defined in Table 4 are bi-colour red and green. Switching both on at the same time gives an amber colour. These will be useful for determining any faults that may arise.

LED | Colour | Sequence | Description
D11 | Green | Fixed on | No errors, successful power up
D11 | Red | Flashing | Power up failure
D12 | Green | Fixed on | FPGA config completed
D12 | Green | Heartbeat | FPGA unconfigured
D12 | Red | Even mark to space | FPGA thermal cut-out
D12 | Red | Fixed on | FPGA flash config error
D12 | Amber | Fixed on | FPGA flash config failover
D13 | Reserved for future use (RFU)
D14 | Reserved for future use (RFU)
D15 | RFU
D16 | RFU
D17 | RFU

Table 4: System Manager Status LEDs



2.18 USER LEDS

The user has access to two extra LEDs connected to the System Manager, which are accessible to the

user FPGA via the system manager interface IP provided in the Quartus tools. The pin location of these

User LEDs is given in Table 5. It’s unlikely that this feature will be utilised as part of the OpenCL BSP.

LED | Behaviour
D18 | User FPGA LED 0, driven by User FPGA via Nallatech system manager interface IP.
D19 | User FPGA LED 1, driven by User FPGA via Nallatech system manager interface IP.

Table 5: User LEDs

2.19 CARD INITIALISATION & SYSTEM RESET

The SOC Accelerator is a PCIe card and therefore a slave to a host processor; the principal reset of this card will come through the PCIe bus from the host.

The mechanism that supports a reliable initialization of the firmware and software running on the SOC is

as follows:

• Card powers up and supplies are sequenced by the System Manager, after which the board comes

out of reset.

• When the power supplies are stable, the “All Power Good LED” comes on, the configuration logic is

released and the FPGA configuration starts.

• After FPGA configuration, PCIe training and reset should occur.

• The release of the PCIe reset by the host also deactivates the global reset to the internal FPGA

design.

• After PCIe enumeration, a soft reset comes from the PCIe core to the rest of the internal FPGA logic

to start the firmware.

2.19.1 CvP Autonomous mode

FPGA configuration from flash takes around a second, which does not satisfy the PCIe protocol specifications. It is however possible to have the PCIe IP Core alone loaded in the FPGA fabric first, with the PCIe IP up and running in under 100 ms (per the PCIe specification); this option is called CvP autonomous mode and is an option in the Altera Quartus Tools for the PCIe Hard IP block.


2.20 CONTROL & TEMPERATURE SENSOR

Figure 12 : Altera temperature sensor IP

The FPGA temperature can be monitored using the internal FPGA temperature sensor available in the

Arria 10 FPGA. Altera supplies a temperature sensor IP core that can be read directly from the FPGA

logic. The value will be read back to the host or embedded ARM controller as a means of tracking the

device temperature whilst running different applications.

2.20.1 FPGA Temperature Alert Procedure

The FPGA’s die temperature is monitored and a thermal event is routed to the System Manager. This event is set, by default, to trigger when the FPGA die temperature goes above 105 °C.

If this occurs, the Power Supply Unit (PSU) controller will immediately turn off all power supplies in order to prevent any permanent damage to the FPGA.

2.21 POWER

The SOC accelerator board is designed to support a power consumption of up to 75W, in line with the

maximum delivered power available from a standard PCIe slot.

The total maximum power required by the Opera use cases is expected to be well below 75 W. The standard Fan/Heatsink COTS solution supports 75 Watts of cooling with an ambient intake temperature up to 35 °C. Under these conditions the FPGA die temperature will be kept below an operating junction temperature of 85 °C.


2.22 THERMAL

The SOC accelerator can be fitted with a passive or active heatsink depending on the server

environment where the device will be installed. The option with active heatsink is shown in Figure 13.

Figure 13 : SOC Active Heat Sink

Using Autodesk CFD the thermal characteristics of the cooling technology can be modeled prior to

manufacture. These models are always performed for the expected worst case scenario.

Figure 14 : Air flow drawn through active heat sink


Figure 15 : Autodesk CFD Arria 10 Stable Board Temperature (50 Watt design)


3 ARRIA 10 HARD PROCESSOR SYSTEM⁵

The Arria 10 SOC device contains a dual-core ARM Cortex-A9 processor. This chapter describes some of the features of this hard processor system (HPS) and its relevance to BSP design work.

3.1 ALTERA SX660 SOC HARD PROCESSOR SYSTEM (HPS)

Figure 16 : Altera ARM Cortex A9 Hard Processor System (HPS)

Figure 16 illustrates the main components of the Arria 10 SoC Hard Processor System. The Arria 10 series HPS has many features, with those key to the Opera project listed below:

• CPU frequency of 1.2 GHz, with 1.5 GHz via overdrive

• Runs 32-bit ARM instructions

• ARM NEON™ media processing engine

• Single and double precision floating-point unit

• Hard memory controller with support for DDR4 and DDR3

• FPGA-to-HPS bridge: allows IP bus masters in the logic core to access HPS bus slaves

• HPS-to-FPGA bridge: allows HPS bus masters to access bus slaves in the core fabric

5 https://www.altera.com/products/soc/portfolio/arria-10-soc/arria10-soc-hps.html


3.2 HPS OVERVIEW

Figure 17 : Altera SOC Device Block Diagram

Figure 17 illustrates the different components of the HPS and FPGA. The HPS portion of the device contains all ARM related interfaces, including the flash controls for configuration, clock management and memory interconnects. The FPGA portion has access to the general FPGA fabric, FPGA IO and fixed IP components such as PCIe. There is a small amount of shared IO between the HPS and FPGA portions of the device, for cases where it makes sense to share input stimulus.

Communication between the HPS and FPGA can be done in the following ways:

• Shared External Memory: It is possible to pass data via shared DDR memory. Here both the HPS and FPGA have access to a shared DDR memory bank. This allows the ARM to buffer large amounts of data ready for processing in the FPGA. It can then continue to run in parallel whilst the FPGA processes the contents of the DDR. In the OpenCL tool flow this memory appears as a global memory.

• Shared Internal Memory: It is also possible to create an internal smaller shared memory in the FPGA

fabric that can be accessed from the HPS and FPGA. This memory uses the M20K memory blocks in

the FPGA fabric.


The different interfaces are accessed by the ARM processor via unique physical addresses. The following is an example memory address topology that includes an external DDR memory (the real memory map is still to be decided).

Figure 18 : Boot Memory Locations

The above figure illustrates a possible arrangement of the boot address locations. The SDRAM region is

accessible to the ARM and the FPGA fabric. The HPS-to-FPGA region has a reserved location beyond the

address 0xC0000000.
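For illustration, assuming the 0xC0000000 base shown in Figure 18 (the real memory map is still to be decided), a helper for computing physical addresses in the HPS-to-FPGA window might look as follows; the 1 GiB span and the function name are assumptions. On Linux these addresses would typically be reached by mmap()-ing /dev/mem:

```python
# Sketch: physical addresses of FPGA slaves in the HPS-to-FPGA bridge
# window, assuming the 0xC0000000 base of Figure 18. The 1 GiB span is an
# assumption; the real memory map is still to be decided.
H2F_BASE = 0xC0000000
H2F_SPAN = 0x40000000   # assumed 1 GiB window above the base

def h2f_phys(offset):
    """Physical address of a bridge slave at `offset` into the window."""
    if not 0 <= offset < H2F_SPAN:
        raise ValueError("offset outside the HPS-to-FPGA window")
    return H2F_BASE + offset

print(hex(h2f_phys(0x1000)))  # 0xc0001000
```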


4 SOC ACCELERATOR CONFIGURATION

4.1 FPGA CONFIGURATION OVERVIEW

This section outlines the different configuration options and the setup that will be used for the Opera project.

The Arria 10 FPGA is connected to a 2Gb Serial Flash chip (EPCQL2048) used for FPGA configuration upon board power on. The FPGA uses the Active Serial x4 configuration mode to obtain its configuration data from the flash at power on; MSEL[2..0] is set to 010 for Fast Power-On Reset (POR) Delay.

The accelerator also has an on-board Altera USB-Blaster II, accessible through the USB connector on the side of the board. This provides JTAG access to program the Configuration Serial Flash and the FPGA.

The following section features extracts from the “Arria 10 Hard Processor System Technical Reference

Manual” in italic. For more detail pertaining to the HPS refer to this document.

Figure 19 : FPGA Configuration Block Diagram (the MAX10 system manager connects the USB-Blaster JTAG interface, the USB hub/UART paths and the 2Gb QSPI flash bank (8x256Mb) to the Arria 10 SOC via JTAG and the FPP32 interface)

Figure 19 is a block diagram depicting the connectivity between different JTAG components on the

accelerator board. Configuration is controlled via a MAX10 device.


4.2 CONFIGURATION WITH BOOT SOURCES

Figure 20 : SOC Configuration with boot Sources

The FPGA can be configured from an external source and the ARM then booted from another external

source.

4.3 CONFIGURATION WITH SELF HPS BOOT

Figure 21 : HPS boots from FPGA

“In Figure 21, the FPGA is configured first through one of its non-HPS configuration sources, this will be

the QSPI interface for the Opera accelerator. The Configuration Subsystem (CSS) block configures the

FPGA fabric as well as the FPGA I/O, shared I/O and hard memory controller I/O. The HPS executes the


second-stage boot loader from the FPGA. In this situation, the HPS should not be released from reset

until the FPGA is powered on and programmed. Once the FPGA is in user mode and the HPS has been

released from reset, the boot ROM code begins executing. The HPS boot ROM code executes the second-

stage boot loader from the FPGA fabric over the HPS-to-FPGA bridge.” 6

This is the preferred form of configuration for the Opera setup.

4.3.1 QSPI Configuration

The QSPI interface runs at 100 MHz and is connected to the on board flash memory. Booting from the

flash is the fastest way to configure the device. However, loading the flash with a new image is

extremely slow and can take several minutes. Initially there were no plans to support configuration via

DMA, hence any application requiring multiple configurations would expect a long delay between

configurations or limit the number of reconfigurations to the maximum number of images that can be

separately stored in flash memory.

This approach was changed during the lifetime of the OPERA project, as it became obvious that this would be a serious limitation for FPGA acceleration where multiple designs are required per application (see 9.1).

4.3.2 Configuration via USB

The SOC accelerator features a System Manager which hosts USB-Blaster functionality to access the

FPGA JTAG chain. By connecting the USB connector to the host’s motherboard with a 5-pin USB cable,

the user can access the JTAG chain with Quartus Programmer. Using this method, the FPGA can be

reconfigured directly or the on-board 2Gb configuration flash can be reprogrammed so the FPGA is

configured with the new design on the next power cycle. The JTAG Chain also provides debug access to

the board and enables the use of various JTAG based debug utilities like SignalTap.

4.4 ARM SIDE CONFIGURATION

There are two options for the ARM side configuration of the FPGA fabric. Either the ARM can boot first

and configure the FPGA or the FPGA can boot first and instruct the ARM processor to start. Either

method is valid and will depend upon what implementation best fits the target use case. For OPERA it is

assumed that the FPGA will configure first from external sources and the ARM will be initiated once

configuration is complete.

6 https://altera.com/documentation/sfo1410070178831.html


Figure 22 : ARM Base Configuration

“In Figure 22 the HPS boots first through one of its non-FPGA fabric boot sources. If the hard memory

controller or shared I/O are required by the HPS during booting then you can either execute a full FPGA

configuration flow or an early I/O release configuration flow. The FPGA must be in a power-on state for

the HPS to reset properly and for the second stage boot loader to initiate configuration through the FPGA

Manager. The software executing on the HPS obtains the FPGA configuration image from any of its flash

memory devices.”7

“The HPS-to-FPGA and lightweight HPS-to-FPGA bridges are both mastered by the level 3 (L3)

interconnect. The FPGA-to-HPS bridge masters the L3 interconnect. This arrangement allows any master

implemented in the FPGA fabric to access most slaves in the HPS. For example, the FPGA-to-HPS bridge

7 https://altera.com/documentation/sfo1410070178831.html


can access the accelerator coherency port (ACP) of the MPU subsystem to perform cache-coherent

accesses to the SDRAM subsystem.”8

8 https://altera.com/documentation/sfo1410070178831.html


5 SOC ACCELERATOR BSP

5.1 OPENCL TOOL FLOW

The Intel FPGA SDK for OpenCL allows the implementation of FPGA logic using OpenCL C, an ANSI C-based language with additional OpenCL constructs. The OpenCL SDK allows code to be emulated in a software flow before generating HDL code to be compiled through the Quartus FPGA tool chain. Once compiled, the FPGA can be programmed with the generated binary through the Khronos OpenCL API.

The Intel FPGA SDK compiler takes a user's OpenCL code and generates the HDL code that represents the

kernel code and the target FPGA accelerator. The IP for the target accelerator is described by the SOC

accelerator board support package. Once the HDL is created the Intel Quartus tools compile the HDL to

create a binary. This compilation process, known as place and route, can take many hours to complete.

Figure 23 illustrates the Intel FPGA OpenCL tool flow compiling a “hello world” example.

Figure 23 : Intel FPGA OpenCL tool flow

The OpenCL tool flow comes with several tools for control and debug. The BSP must also support other

features of the Intel FPGA OpenCL SDK such as board diagnosis and configuration options.


5.2 BSP OVERVIEW

This section describes the work undertaken to create a programming environment that interacts with

x86, ARM and FPGA fabric. The following sections highlight the areas of research that have been

undertaken to create a Board Support Package (BSP) suitable for the OPERA use cases.

As discussed in deliverable D6.1, heterogeneity brings with it the added complexity of multiple

operating systems and device design processes. The seamless integration of these different design

processes within a single environment will be crucial to the success of the MICMAC use case. The goal is

to evaluate which design processes work best for heterogeneous systems and which do not. In order to

evaluate the design process it is necessary to develop a new BSP.

The OPERA hardware is unlike any accelerator for which Nallatech and Intel have previously developed a BSP. The presence of the ARM processor within the chip provides unique challenges and problems. The main challenge is how to handle the presence of two master interfaces within an OpenCL tool flow that expects a single master/slave relationship.

The Intel FPGA OpenCL tool flow provides a set of utilities and a compiler for targeting FPGA

accelerators using the OpenCL work flow. In order for the tools to be able to target an FPGA accelerator,

the accelerator vendor must create what is known as a board support package (BSP). This BSP is a non-

trivial piece of firmware that connects required Intel FPGA IP with bespoke vendor logic to give the

OpenCL tool flow access to the accelerator and its unique capabilities.

5.3 OPERA BSP REQUIREMENTS

For the OPERA SOC card the BSP must support access to the external IO interconnect, the external DDR

memories, connectivity to the ARM processor on device and also a route back to the PCIe attached

processor. This requires the creation and inclusion of different IP blocks.

Figure 24 : BSP Components

Figure 24 illustrates the different components of the SOC BSP device. The ARM and the OpenCL kernel

share data via a shared DDR interface. Data is transferred on and off the device via the serial links and host

channel interface (PCIe).


There are various firmware and software components required by the Intel FPGA OpenCL SDK tool flow.

These are listed in the following sections.

5.4 BSP REQUIRED DELIVERABLES

5.4.1 Base aocx file

The Base aocx is a precompiled and preplaced FPGA design containing all the firmware required by the

BSP, in order to guarantee high performance for all kernel designs. To create a BSP that is efficient, in

terms of resource and clock frequency, the location of the different IP blocks must be carefully

considered. Keeping the size of the IP blocks to a minimum is crucial to maximise resources available for

processing, whilst the efficiency of the routing between the different components is crucial for

maintaining a high clock frequency.

The base is loaded by the Intel FPGA tool flow and used as a starting point for the generation of the FPGA

OpenCL application.

The following table lists the key components of the base.aocx file and a brief description of their

function.

BSP Component and description:

Arria10 HPS
    The Arria 10 Hard Processor System configures and connects the external interfaces of the hard processor.

Clock Cross Kernel Memory 1
    A memory-mapped clock crossing bridge that allows data to be transferred between the kernel clock domain and the DDR memory's clock domain.

EMIF_a10_hps
    The Altera Arria 10 External Memory Interface (EMIF) DDR interface. Provides an interface to connect a bank of external memory to the BSP. Configured for a bank of DDR4 with a 32-bit data bus and a 15-bit address bus.

Kernel Interface
    The OpenCL Kernel Interface allows the host interface to access and control the OpenCL kernel.

Kernel Clock Generator
    The OpenCL Kernel Clock Generator generates a clock output and a 2x clock output for use by the OpenCL kernels. An Avalon-MM slave interface allows reprogramming of the phase-locked loops (PLLs) and reports kernel clock status information.

MM interconnect 0
    The largest Avalon Memory Map interconnect, in this case connected between the address span extender component and the DDR memory. Avalon MM interconnects are inserted into a system to connect Avalon master and slave components. They can consume a large amount of resource, especially if the data and address buses are wide.

Other components in a BSP
    Pipeline bridges, clock crossing bridges, reset controllers, Partial Reconfiguration, ACL Version ID register, Arria 10 temperature sensor, address span extenders and clock controllers.

Table 6 : SOC accelerator BSP components

Figure 25 is a top-level diagram that illustrates how these components interact.


Figure 25 : BSP component interaction

These components are then placed on the FPGA in a manner that minimises the impact on the FPGA

resource and routing. Intel provides a reference design for a pure SOC system with no external IO

connections with the exception of a DDR interface. This reference design will form the basis of the fixed

logic part (base.aocx) of the Opera BSP. A couple of images in Figure 26 illustrate the logic placement of

the reference design. The smaller image has the reference design logic highlighted in red with the

OpenCL kernel logic coloured blue. The larger image is a zoomed in section of the reference BSP with

the key components highlighted in different colours.

Figure 26 : SOC reference IP placement

Intel also provides an example design for using host channels. This reference design must be integrated

with the SOC reference design.


Figure 27 : Host channel reference design placement (Nallatech 385A device)

Figure 27 shows the host channel code highlighted in red and the PCIe interface code in light blue.

Drawing on the experience gained during the creation of Nallatech's 385A BSP, a new BSP will be created incorporating the required SOC features.


Figure 28 : Nallatech 385A base.aocx

The SOC accelerator BSP will be similar to the Nallatech 385A BSP; however, the kernel control logic will be placed close to the SOC device, which will simplify the PCIe interface component. Figure 28 shows the 385A BSP with the different BSP components labelled.

5.4.2 Board Specification File

This file is an XML description of the different attributes of the OpenCL features. It includes descriptions

of the memories and external IO connections to instruct the Intel FPGA tool flow how to wire together

the different components with the user kernel design.

5.4.3 Kernel Driver

In order to talk to the device via PCIe the host system requires a driver to facilitate PCIe communications

via the OpenCL API. This allows the FPGA to be discovered by the OpenCL SDK.

Figure 29 depicts the four layers of the Intel Software Development Kit (SDK) for OpenCL (AOCL)

software architecture: runtime, hardware abstraction layer (HAL), memory mapped device (MMD) layer,

and kernel mode driver.


Figure 29 : Altera SDK for OpenCL software architecture

5.5 PLATFORM SUPPORT

The host system must be running one of the following supported Linux target platforms:

• Red Hat Enterprise 64-bit Linux (RHEL) version 6 on the x86-64 architecture

• Nallatech also tested the BSP package against CentOS 64-bit Linux version 6 on the x86_64

architecture.

• Ubuntu will be supported for the MICMAC environment

The OpenCL BSP provides a memory-mapped device (MMD) layer necessary for communication with the

accelerator board and the lower level kernel mode driver. All other upper software layers are provided

by the Intel FPGA SDK for OpenCL installation.

5.6 SERIAL LINKS

It is proposed that the FPGA act as a common high speed interconnect between IBM and Intel systems.

For this to be the case a serial interconnect must be created that is fast and integrated within the

OpenCL tool flow. The serial links will also provide the ability to partition algorithms over multiple

FPGA devices if required. See section 7 for more details.

5.7 ACHIEVING BEST FMAX OF BSP

To achieve the best kernel performance the BSP logic must be carefully placed on the device. This

section describes some of the work undertaken to achieve the best kernel clock performance.

5.7.1 Routing Islands

Simply placing the required BSP IP components on the device is not enough to ensure good performance for the BSP. Problems occur when the user kernel logic is added to the system. Poor placement of the IP components prevents the efficient routing of the kernel logic, which results in a low maximum frequency (FMax) for the application. This can be somewhat alleviated by creating logic islands for the pipelined stages of key BSP signals. These islands are placed in locations where they are as unobtrusive as possible. Figure 28 illustrates how these islands were placed for the Nallatech 385A device.

5.7.2 Multiple Seed Sweeps

Once a BSP has been placed the design is compiled to create the BSP fixed logic aocx (See 5.4.1). The

performance of this fixed logic is dependent upon what is effectively a set of random choices made by

the place and route tools. The choice of starting conditions by the tools determines the final placement

and routing and therefore the FMax of the design. By running multiple seed sweeps, where the initial

conditions are stochastically modified, a set of designs with varying FMax values are created. The one

with the highest FMax is then used as the base aocx file.


5.8 LIMITING IMPACT ON RESOURCE

The resource required for the BSP base aocx is resource that will not be available for use within a user's kernel. Therefore it is important to keep the size of the IP components to a minimum. This can be done by limiting the functionality and by compressing the areas within which the IP components fit. Reducing the functionality limits what can be done by the device; hence it is often the case that multiple BSPs are created with different features to ensure logic is not used when it is not required. For the Opera BSP the requirements are fixed and therefore only one BSP has been designed.

When fixing the locations of the IP blocks, care must be taken to ensure the kernel has access to key

logic elements such as DSP and M20K components. These are aligned in strips vertically down the device

with standard FPGA logic elements between them. Therefore fixed logic areas are placed, where possible, in a manner that does not prevent access. The priority is on preserving DSP blocks as these form the

foundation of any algorithm functions.

The place and route tools will place an IP component in the most efficient layout possible. This can mean

that an IP component is spread out more than is required. Any logic areas used by the IP components

will not be accessible to the user's kernel, even if the FPGA logic in these areas is empty. Hence, it makes sense to constrain the BSP IP to as small an area as possible. Constraining IP to a fixed block can have a negative effect on clock performance and cause the IP to fail timing (i.e. not function correctly). Therefore, it usually requires several attempts to achieve the optimal placement of the logic.

5.9 OPENCL INTERFACES

This section explains how the different IO elements are accessed from the OpenCL kernel code, i.e. how

the MICMAC code will access the SOC hardware features.

5.9.1 Accessing the serial link interface from OpenCL

The Intel OpenCL SDK allows each external IO interface to be given a unique name for identification.

Channels are declared in the user's OpenCL code prior to being used and attached to a physical location

by using the __attribute__ keyword and the appropriate name. The channel names are defined in the

board specification file. The following code declares the serial IO interfaces as OpenCL channels that can

be accessed using the OpenCL SDK channel methods.

channel ulong4 sch_in0 __attribute__((depth(4))) __attribute__((io("kernel_input_ch0")));
channel ulong4 sch_out0 __attribute__((depth(4))) __attribute__((io("kernel_output_ch0")));
channel ulong4 sch_in1 __attribute__((depth(4))) __attribute__((io("kernel_input_ch1")));
channel ulong4 sch_out1 __attribute__((depth(4))) __attribute__((io("kernel_output_ch1")));

Reading and writing to channels is performed using the standard read_channel_altera() and

write_channel_altera() function calls provided as part of the Intel OpenCL SDK.

The width of the serial channels is hardcoded to 256 bits (ulong4). As each serial channel is capable of ~5 GBytes/s, the kernel clock frequency needs to be greater than 156 MHz to maximise the throughput of the link.


The serial channels are blocking, meaning any read from a serial link will pause the kernel until data is

available and any write to a channel will pause a kernel if the output is full. Therefore synchronisation

between multiple accelerators is data driven.

5.9.2 Example loopback design for host channels

Accessing the host channels from the OpenCL tool flow is done using the channels flagged with the

appropriate host IO attributes. The following code is an example loopback design that reads data from

the host and immediately writes it back. For more detail see section 6.1.

channel ulong4 host_in __attribute__((depth(0))) __attribute__((io("host_to_dev")));
channel ulong4 device_out __attribute__((depth(0))) __attribute__((io("dev_to_host")));

__kernel void loopback(ulong length, uint nostop)
{
    ulong counter;
    ulong4 data;

    counter = 0;
    while (nostop | (counter < length)) {
        data = read_channel_altera(host_in);
        write_channel_altera(device_out, data);
        counter += 32;  /* one ulong4 transfer = 32 bytes */
    }
}

5.9.3 DDR4 Memory Access

The DDR memory is accessed by declaring global memory in the OpenCL kernel code. A particular bank

of DDR memory can be targeted by setting the appropriate attribute on the global memory parameter

to a kernel.

__kernel void foo(__global __attribute__((buffer_location("DDR_0"))) int *x,
                  __global __attribute__((buffer_location("DDR_1"))) int *y)

5.10 SUPPORT FOR FUTURE INTEL TOOL VERSIONS

BSP development on Arria 10 has proved to be challenging, largely due to immature and changing reference BSPs, changes in the Quartus tool chain and Partial Reconfiguration issues. This has been compounded by the lack of forward migration between Quartus versions, meaning that a BSP developed in one version of Quartus is not usable in a different version. For that reason Opera development will be locked to version 16.1 of the Intel FPGA tool flow, unless there is a good technical reason to migrate.

5.11 MICMAC APPLICATION APPROACH

This section describes how the MICMAC application will interact with the OpenCL tool flow.


The majority of the MICMAC application will reside on the Moonshot host with selected elements of

code ported to the SOC accelerator. This requires the creation of:

• OpenCL kernel(s) for FPGA acceleration of key elements of the MICMAC code

• ARM application code to control the OpenCL kernel and to perform those MICMAC operations that cannot be accelerated on the FPGA, where it would be inefficient to move the data back to the host application.

• Host side channel interface code for configuration and the passing of control data to and from

the OpenCL kernel via MMD interface.

The MICMAC acceleration can make use of the inter device serial links to expand the FPGA acceleration

across multiple devices if required.

Figure 30 : OpenCL MICMAC software components


5.12 HOST KERNEL INTERFACE (EXAMPLE 1)

The simplest interaction between the x86 host and the FPGA is to stream data directly into the FPGA kernel. Synchronisation between the two is automatically achieved by the blocking nature of the interface kernels. In this scenario the ARM does not read or write any data to the system and simply controls the execution of the OpenCL kernels.

Figure 31 : Basic OpenCL streaming interface sequence diagram

Step Description of Figure 31

1 The ARM enqueues the command and processing OpenCL kernels. The command kernel is instructed

as to how many commands it expects to process.

2 The x86 host writes data to the FPGA host channels using the MMD interface

3 The FPGA kernel runs reading data from the host channel. The kernel can use global memory to store

temporary values if necessary

4 The kernel writes results back to the host using the host channels.

5 The ARM waits for the kernels to complete before moving onto the next processing task.

5.13 HOST KERNEL INTERFACE (EXAMPLE 2)

This section describes a possible sequence for synchronising host, ARM and FPGA code to allow the host

to read and write to global memory. The following setup assumes there is an OpenCL command kernel

which handles data received from the host. This command kernel reads and writes host data from the

FPGA global memory and synchronises host commands with ARM via the OpenCL kernels. The code this

description is based upon is listed in section 9.


Figure 32 : Command kernel example sequence diagram

Step Description of Figure 32

1 The ARM enqueues the command and processing OpenCL kernels. The command kernel is instructed

as to how many commands it expects to process.

2

The command kernel waits for the data from the host channel interface. This data is in the form of a

header and data payload. The header instructs the kernel whether the data is a read or write to global

memory or an acknowledgement.

3 The first command received is a write to global memory. The command is therefore followed by the

data payload.

4

The next command in the command queue is an acknowledgement. This is used to instruct the

processing kernel that the required data has been written to global memory. This can be done using a

simple blocking OpenCL channel connected between the command and processing kernel.

5 The processing kernel has been waiting for the acknowledgement in step 4 and will now start

processing the data written to the global memory.

6 Once the processing kernel is complete it writes an acknowledgement to the command kernel.

7 The command kernel, which has been waiting for an acknowledgement, reads the results from global memory and writes the data to the host channel.

8 The host has been waiting for the data and can now read the results for further processing.

9 The ARM has been waiting using the OpenCL clWaitForEvents method and completes once all the

expected commands have been processed. The ARM can then move onto the next task.


6 HOST (X86) PLATFORM

6.1 HOST CHANNELS

As the SOC ARM processor is the master in this system, the host x86 processor will use host channels to read

and write data to the OpenCL kernels on the FPGA. Host channels allow the x86 host to write directly to

an IO channel in the BSP rather than global memory as per a typical OpenCL system. Avoiding global

memory writes from the host removes a conflict that would otherwise arise between the ARM and x86

both trying to access the DDR memory.

Host channels are implemented through the MMD layer as the standard OpenCL API does not support

this kind of interface. The MMD layer is a thin software layer for communicating directly with the board.

It is used for any utility programs that support OpenCL-enabled hardware, e.g. board diagnosis. Here it

has been expanded to include the following API calls for host communications.

MMD API Command and description:

int aocl_mmd_hostchannel_create
    Opens a channel between host and kernel.

int aocl_mmd_hostchannel_destroy
    Closes a channel between host and kernel.

void *aocl_mmd_hostchannel_get_buffer
    Provides the host with a pointer to a buffer for writing to or reading from the kernel. If the direction of the channel was 1 during create, the returned pointer is a buffer for writing data into the kernel. If the direction was 0, the returned pointer is a buffer for reading data out of the kernel.

size_t aocl_mmd_hostchannel_ack_buffer
    Acknowledges to the channel that the user has written or read data from it. This makes the data, or additional buffer space, available for the kernel to read or the host to write.

Intel have provided a reference design for the Arria10 SOC development board (See section 8), which

will form the basis of the host channel interface for the Opera accelerator.

The host interface is not able to write directly to the embedded ARM processor and can only present data to, and read data from, OpenCL kernels. If data is to be passed to the ARM from the host, the OpenCL kernel must read the data from the host channels and write it to the attached DDR memory, which is accessible to both the OpenCL kernel and the ARM.

6.2 HOST TO ARM PROPRIETARY CONTROL INTERFACE

As part of the Opera project Nallatech has developed a host to ARM communication channel to allow

data and control information to be passed between an application on the ARM and an application

running on the Moonshot x86 system. This is done by implementing an Ethernet-over-PCIe interface. A driver on the Moonshot host and an equivalent driver on the ARM open an Ethernet port that allows any communication that would normally be possible via a typical Ethernet connection (e.g. ssh, scp, nfs, etc.).

This interface is not fast relative to the OpenCL PCIe interface and should only be used for control and

monitoring. For faster communications the host-channel interface, described in section 6.1, should be

used.


Figure 33 : Host to SOC interfaces

A small memory (262 Kbytes) in the FPGA logic creates a buffer through which data can be accessed by

both the host and ARM. This can be used for passing useful information from the host to ARM such as

OpenCL attributes. The depth of this buffer can be reduced or increased depending upon the demands

of the MICMAC application. Reducing the size will release extra FPGA memory resource for the OpenCL

kernel code.

Figure 33 illustrates the two available host to SOC interfaces.

6.2.1 X86 Host driver and API

The host side driver is used to get a handle to the host attached SOC accelerator card. This handle can

then be used to perform some simple control instructions via a set of API commands. Some of the details are still to be determined and the API functions may change in the future.

Host API Command and description:

NALLA_HANDLE NALLA_385Asoc_Open(uint32_t cardNumber, uint32_t flags);
    Gets a handle to the attached accelerator.

void NALLA_385Asoc_Close(NALLA_HANDLE cardHandle);
    Releases the handle opened by NALLA_385Asoc_Open.

uint32_t NALLA_385Asoc_Status(NALLA_HANDLE cardHandle, uint32_t command, void* status);
    Retrieves the value of the various status registers. These are still to be decided, but will include the firmware version, timestamp, optical link status, reset, device ID, etc.

size_t NALLA_385Asoc_Write(NALLA_HANDLE cardHandle, void* data, uint32_t offset, uint32_t lengthBytes, uint32_t flags);
    Writes a block of data to the accelerator at an offset.

size_t NALLA_385Asoc_Read(NALLA_HANDLE cardHandle, void* data, uint32_t offset, uint32_t lengthBytes, uint32_t flags);
    Reads a block of data from the accelerator at an offset.


6.2.1.1 SOC ARM driver and API

The ARM processor has an equivalent driver API for interacting with SOC to host communications. The

API is preliminary and may be subject to change.

This driver gives the ARM the ability to access the shared memory buffer.

6.2.2 Multiple cards

In order to differentiate between different SOC accelerators in the same system, it is envisioned that

each host/card has its own subnet address. E.g.:

Card 1: 192.168.100.2 (PC Host 192.168.100.1)

Card 2: 192.168.101.2 (PC Host 192.168.101.1)

Card 3: 192.168.102.2 (PC Host 192.168.102.1)


7 SERIAL INTERCONNECT DETAILS

7.1 OVERVIEW

In order to provide scalability and communication between different hardware vendors, the development of a serial interconnect was required. This serial interconnect acts as a common interface, via the FPGA, for POWER and x86 systems.

Figure 34: Using the FPGA as a common interconnect

7.2 SERIAL CHANNEL IP DETAILS

Each serial channel within a BSP consists of an input and output external I/O channel. The BSP serial

channel IP has the following features.

Feature Description

Configuration 4-lanes, full duplex, with flow control

Encoding 64B/66B

Forward Error Correction KR-FEC

Line rate (each lane) 10.3125 Gbits/sec

Channel Latency ~390ns

User Interface, transmit 256 Bit Avalon Streaming

User Interface, receive 256 Bit Avalon Streaming

Maximum transfer bandwidth per channel per direction 39.6875 Gbits/sec

Table 7 : Serial link features


[Figure: Arria-10 FPGA with a 4-lane serial channel interface to QSFP+ Tx/Rx optics (40G Ethernet fibre); 256-bit Avalon-ST Tx/Rx kernel channels in the OpenCL user/kernel domain (kernel clock < 400 MHz); CSR access over Avalon-MM; 100 MHz and 644.53125 MHz reference clocks within the OpenCL BSP.]

Figure 35 : Block diagram for single channel on 385A SOC

Figure 35 illustrates how the serial channel IP is used in the BSP. The SOC accelerator has two QSFP+

module sites directly connected to the High Speed Serial ports of the FPGA. The BSP contains all the

relevant IP to enable external Altera I/O channel interfaces into an OpenCL Kernel.

7.3 SERIAL CHANNEL DEBUG

A simple register interface in the serial channel IP is connected to the PCIe BAR to aid debugging of larger systems.

The following registers are defined in the PCIe BAR memory map:

SOC BSP: serial channel CSR registers are in BAR4

• serial channel 0 CSR offset : 0x20000

• serial channel 1 CSR offset : 0x20040

7.3.1 Control and Status Registers

Registers are defined as follows:

Address Offset  Host Access  Default Value  Description

0x0 Rd 0x00000000

Serial Channel Status

[0] Kernel Stream Sink Ready

[3:1] Not Used (returns "000")

[7:4] Tx PHY Ready

[11:8] Tx PHY Calibration Complete (not tx_cal_busy)

[12] PLL Locked

[15:13] Not Used (returns "000")

[16] All Rx Lanes Deskewed

[19:17] Not Used (returns "000")

[23:20] RX Lane Aligned

[27:24] Rx PHY Ready

[31:28] Rx PHY Calibration Complete

0x1 Rd/Wr 0x00000000

Serial Channel Control

[0] Reset the Serial Channels Receiver (always returns '0')

[3:1] Not Used (returns "000")

[4] Reset the Serial Channels Transmitter (always returns '0')


[7:5] Not Used (returns "000")

[31:8] Not Used (returns 0x000000)

0x2 Rd 0x00000FFF

Kernel Rx Ready Performance Accumulator

[11:00] rx_rdy_perf_acc

[31:12] Not used

0x3 Rd 0x00000FFF

Kernel Tx Valid Performance Accumulator

[11:00] tx_thpt_perf_acc

[31:12] Not used

0x4 Rd/Wr 0x00000000

Performance Control

[0] Reset Rx Performance Reg (always returns '0')

[3:1] Not Used (returns "000")

[4] Reset Tx Performance Reg (always returns '0')

[7:5] Not Used (returns "000")

[31:8] Not Used (returns 0x000000)

Table 8 : Serial channel register addresses

7.3.2 Serial Channel Status Register

Under normal operating conditions this register reads 0xFFF11FF1. If the input kernel is able to supply

data into the channel faster than data can be read from the channel then the Kernel Sink Stream Ready

signal will de-assert to apply back pressure to the Tx Kernel. In this case the register reads 0xFFF11FF0.

When QSFP+ modules are not fitted (in the case of the 385A) then the register reads 0xF0001FF0.

7.3.3 Serial Channel Control Register

Asserting a reset to the Serial Channel's Transmitter will reset the Tx interface and flush any data in the Tx FIFO. The channel's receiver will lose lane alignment, and this will lead to a reset of its Rx interface.

Asserting a reset to the Serial Channel's Receiver will reset the Rx interface and flush any data in the Rx FIFO.

7.3.4 Kernel Rx Ready Performance Accumulator Register

This gives a continuous, short-term measure of the fraction of time the kernel_stream_src_ready signal is held high. The actual ratio is calculated by the host using the formula:

Kernel Rx Performance Ratio = rx_rdy_perf_acc/4095

This ratio can be used to determine how effectively a kernel is servicing the stream. A good kernel will

keep the 'ready' signal high all the time (ratio = 1.0).

7.3.5 Kernel Tx Throughput Performance Accumulator Register

This gives a continuous, short term measure of the ratio of data transfers through the

kernel_stream_snk port with respect to the kernel clock. It is calculated by the host using the formula:

Kernel Tx Throughput Ratio = tx_thpt_perf_acc/4095

This ratio can be used to determine how effectively a kernel is providing data for transmission. For a

steady stream of data, a good kernel will get close to the ratio of Channel Clock frequency (156.25MHz)

to Kernel Clock frequency.


7.3.6 Performance Control Register

Resets the Performance Accumulator Registers.

7.3.7 MMD Support Serial Channel CSR Functions

The current Nallatech MMD implementation contains helper functions for the serial channel CSR

registers. The functions are defined in aocl_mmd.h provided with BSP.


8 INITIAL DESIGN EVALUATION PLATFORM

8.1 OVERVIEW

The initial design work for the SOC Accelerator was performed using an Arria 10 development kit from Altera, purchased for the OPERA project. This development kit has the targeted FPGA device plus many attached peripherals for testing developed firmware and for clarifying the software stack for controlling the device.

Figure 36 : Arria10 SOC Development Kit

Firmware development work. Description

Host to ARM communication interface The development of the Ethernet-over-PCIe interface. See Section 6.1.

HPS boot and FPGA configuration synchronisation

This is the development of the software stack for

controlling the boot of the ARM device and

understanding the synchronisation protocol with the

configuration of the FPGA fabric.

HPS to FPGA fabric interface

The development of firmware to interface the HPS

ARM processor with FPGA external DDR memory

interfaces, to give the ARM access to the shared DDR

memory.

Clock BIST prototype run for x86 and HPS

The development of BIST software and firmware in

preparation for the delivery of the SOC accelerator

prototype cards.

PCIe interface prototyping

Attaching a Samtec FMC cable via the development

kit’s FMC connector, it is possible to test the PCIe

interface logic without a standard PCIe connector.

HPS reset/recovery The development of firmware and software stack to

facilitate HPS reset and recovery.

UART Interface

Development of a UART (Universal Asynchronous Receiver/Transmitter) controller connected directly to the ARM processors.

HPS Clocking Verifying the correct clock configuration for the HPS

system.

Table 9 : Arria10 SOC development kit hardware development tasks


Figure 37 : Arria 10 Development Kit Block Diagram

As can be seen, the Arria10 development board has many features critical to the development of the SOC accelerator. The firmware and basic software stack developed could then be used to test the initial development of the BSP until the prototype cards became available. The serial link interfaces cannot be tested on the development board.

Table 10 lists the BSP tasks performed using the Arria10 development board.


BSP development work. Description

ARM to DDR4 shared memory The DDR4 memory accessed via the HPS can be shared

with the FPGA fabric. This is a verification of Altera IP.

PCIe host channels

Using the Samtec FMC connector the PCIe host

channels can be functionally tested.

The host channels were also tested and verified on a Nallatech 510T device to ascertain the potential performance of this interface.

Kernel boot and control

Testing the software stack and kernel interrupt system

for controlling OpenCL kernels from the ARM

processor.

Table 10 : BSP development kit work


9 BSP VERSION 2 (FINAL)

9.1 MODIFICATIONS REQUIRED

The initial board support package created for OPERA used the embedded ARM processor as the master controller. This was to facilitate the offload of work between the ARM and the FPGA. However, as the project progressed it became clear that the MICMAC software to be accelerated on the FPGA (see D6.4) could not be partitioned to benefit from the embedded ARM on the device. In addition, using the ARM as the host required a different programming approach, as described within the Intel FPGA OpenCL documentation, increasing design verification time. It also relied on host channel support in the Intel tools, which was not available in time as expected. This became more significant with the addition of the CNN offload to the OPERA project late in the project's lifetime, which relied heavily on this host interface.

The host channel interface also prevented dynamic reconfiguration of the FPGA device, restricting each application to a single FPGA image. For the MICMAC code this was not practical, as multiple applications are run sequentially.

Therefore, the decision was taken to create a new BSP where the host CPU is the master, as with a traditional FPGA OpenCL accelerator card. Here, the ARM is no longer used for code acceleration, but for running kernel-level performance measurements. This diagnostic code runs in parallel to the FPGA-offloaded acceleration, providing high-fidelity power information.

Sections 5.12, 5.13 and 6 are no longer valid for this version of the BSP.

9.2 UPDATED BSP

Figure 38 : BSP version 2 with CPU as master

Version 2 of the BSP is illustrated in Figure 38. The power monitoring software is always present but does not need to be running to use the FPGA.


Figure 39 : BSP Version 2 Floor plan

Figure 39 shows the floor plan of the FPGA with the CPU as host. The design has been made as efficient as possible so the FPGA can deliver the best possible performance.

Resource  Total resources  BSP resources
ALMs  251680  46970 (18.7%)
FFs  1006720  187880 (18.7%)
RAMs  2131  418 (19.6%)
DSPs  1687  129 (7.7%)

Table 11 : BSP version resource use


9.3 POWER MONITORING USING THE EMBEDDED ARM

In order to use the monitoring software the FPGA-SOC accelerator must be a revision 2 card or higher. This is due to subtle hardware changes regarding the routing to the onboard system manager. A new U-Boot configuration was created which includes the power monitoring software, and the FPGA card needs to have its HPS boot system updated with the appropriate files. If the power monitoring system is not being used, only the FPGA flash needs to be updated with the latest BSP design.

Once loaded the host can retrieve the FPGA’s current power status using the following utility created for

the OPERA project:

sudo ./sysfsOpenclMonitor /sys/bus/pci/devices/0000\:0b\:00.0/resource2

The following is an example output:

/sys/bus/pci/devices/0000:0b:00.0/resource2 opened.

After mmap

PCI Memory mapped to address 0x7fbf97951000.

OFFSET_VCC_12V0 : 11.649859 V

OFFSET_VCC_0V95 : 0.948706 V

OFFSET_VCCT_1V0 : 1.023748 V

OFFSET_MEM_1V2 : 1.198236 V

OFFSET_VCCR_1V0 : 1.020697 V

OFFSET_VCC_2V5 : 2.498360 V

OFFSET_VCC_1V8 : 1.833961 V

OFFSET_VCC_5V0 : 5.015022 V

OFFSET_12V_CURRENT : 1.847419

OFFSET_TEMPERATURE : 51.960938 degC

The overall FPGA power can be determined by multiplying the OFFSET_VCC_12V0 value (volts) by the OFFSET_12V_CURRENT value (amps).

This output will be used to populate the required fields used by RedFish (See D4.3).

The current root filesystem is set to start the "devOpenclMonitor" application on the HPS at boot. This will watch for kernel activity and create log files in the tmp filesystem on /mnt/ramdisk. The number of kernels that are monitored is modified by the utility "sysfsOpenclControl", which allows the values of shared control registers to be updated from the host. E.g.:

/sysfsOpenclControl /sys/bus/pci/devices/0000\:0b\:00.0/resource2 <Number of kernels to log>

The monitoring software is designed to start logging information when it sees a kernel start event in the OpenCL kernel control firmware. These signals have been directly wired to a shared address in the ARM operating system, where a simple spin lock waits for changes to its state. The FPGA voltages, currents and temperatures are then written every millisecond to flash memory on the FPGA board whilst the kernel is running (see the example output in Table 12).


Time (usec)  V12 (V)  Current V12 (A)  Temperature (degC)  Power (W)
1  11.46903  2.416846  57.49219  27.71887
1097  11.42034  2.518529  57.49219  28.76245
2174  11.42034  2.518529  57.49219  28.76245

Table 12 : Example Power Monitoring Output Data

9.4 INTEL OPENCL VERSION 17.1.2

Initially, the version of the OpenCL tools used by the OPERA project was to be fixed at the beginning of the project. However, Intel has made significant improvements to the tools that justify updating the OPERA BSP to the latest available version. This has given improvements in clock frequency and resource use, as well as enabling some new compile features that have reduced development time.


10 BSP EXAMPLE USE-CASE

10.1 ANN MICMAC USE-CASE

This section describes how the final BSP described in section 9 was used to accelerate part of the

MICMAC software. For details regarding the overall MICMAC software port please see D6.8.

The Approximate Nearest Neighbour (ANN) function is used extensively within the MICMAC software to find correlating points between different images of varying scale and orientation. It is a compute-intensive task that requires many floating point calculations. First, some pre-processing is done on the images to create tie-points used by the ANN algorithm. These tie-points are multidimensional values (128 dimensions in this case). For each image thousands of tie-points are created, and each must be compared to all other tie-points in another image to find the closest matching pairs. This is an n x n problem. The CPU employs a tree structure that checks one dimension at a time, in an approximation that removes the need to match all points against each other. This is not particularly accurate, but some 50x faster than a full n x n search.

The recursive nature of this tree search does not fit well on FPGAs, and a brute force approach that compares all points is the only option on the FPGA. This reduces the acceleration gained by the FPGA, as the brute force approach is significantly more compute intensive; however, the accuracy of the code is significantly improved. The CPU tree search returns approximately 30% of valid matches, whereas the brute force approach is 100% accurate.

To maximise performance, the FPGA design is split into producer, consumer and compute kernels. Having a single kernel responsible for global memory communications reduces the resource required compared to having each compute kernel access global memory directly.

The following diagram illustrates how the kernels are connected.

Figure 40 : Block diagram of ANN kernels

Multiple compute kernels are created until resources are exhausted on the FPGA.

The following code is used to calculate the distance between two points on opposing images. Note that the code uses integer arithmetic for the calculation. This halves the DSP resource required by the FPGA, doubling the number of distances that can be computed in parallel.

unsigned int CalcDistanceFunction16bit(dim_data_type A, dim_data_type B)
{
    // Per-dimension absolute differences, fully unrolled.
    unsigned short diff[128];
    #pragma unroll
    for (int p = 0; p < DIM; p++)
        diff[p] = (unsigned short)abs((short)((int)A.data[p] - (int)B.data[p]));

    // Accumulate squared differences, scaled down by 256 to stay within 32 bits.
    unsigned int total = 0;
    #pragma unroll
    for (int p = 0; p < DIM; p++)
        total += (0xffffffff & ((diff[p] & 0xffff) * (diff[p] & 0xffff))) >> 8;

    return total;
}


Each compute kernel processes 4 points in parallel. This is because the local M20K memory used to buffer the input data can be accessed 4 times in parallel before replication is required. As M20K memories are the dominating resource for this code, replication needs to be avoided wherever possible.

Table 13 lists the resources required by the ANN kernels when compiled using the latest BSP and version

17.1.2 of the Intel OpenCL tools.

Table 13 : ANN kernel resources required (BSP v2)

10.2 POWER MONITORING

When the ANN code is running with the power monitoring enabled in the ARM processor, the following plot can be produced. It shows the processing time on the x axis (usecs) versus the power draw on the y axis. The power analysis of the ANN code will be discussed in more detail in OPERA deliverable D4.3.

Figure 41 : ANN power

[Plot: ANN Power Consumption — power draw (0 to 35 W, y axis) versus time (-200000 to 1400000 usec, x axis)]

Kernel/Partition  ALUTs (%)  FFs (%)  RAMs (%)  DSPs (%)
Kernel system partition  23  23  24  8
Compute kernel  15  5  14  16
Producer  3  3  7  0
Consumer  1  1  1  0


11 CONCLUSION

The purpose of this deliverable was to deliver a platform that enables the Arria10 SOC device as an offload acceleration device and as an enabler for heterogeneity between x86 and Power systems.

During the course of the OPERA project it was found that the ARM processor provided no significant compute benefit when accelerating the software deployed for the different use cases in OPERA. To this end, having the ARM as the master proved to be a bottleneck in performance and programmability, inhibiting the partners' ability to code and accelerate applications. Therefore, a second BSP was created that followed the standard Intel OpenCL programming tool flow. This expedited the development of application code and allowed the consortium to include CNN offload within the OPERA project timeline.

The ARM processor is now used for power/system monitoring. This was used to measure the efficiency of different FPGA implementations, which would not be possible without the ARM's close interaction with the FPGA as part of the SOC package. This monitoring approach is described in more detail in deliverable D4.3.

In conclusion, the best approach is to use SOC devices where monitoring and low-level system management are required. For the applications studied in OPERA the SOC does not offer any performance benefit versus a non-SOC device; however, it does provide the ability to monitor performance for power-based design optimisations. Therefore, SOC devices have a place in HPC systems that require fine-grained performance monitoring or low-level control offloaded from the host platform.


12 APPENDIX: EXAMPLE COMMAND QUEUE OPENCL

12.1 COMMAND QUEUE EXAMPLE

/*

This kernel can be used as template for the MICMAC use case.

*/

#define IDLE 0x0

#define WRITE_TO_GLOBAL_MEM 0x1

#define READ_FROM_GLOBAL_MEM 0x2

#define HOST_SYNC 0x3

#define KERNEL_SYNC 0x4

// Declare host input and output channels

channel ulong4 device_in __attribute__((depth(0)))

__attribute__((io("host_to_dev")));

channel ulong4 device_out __attribute__((depth(0)))

__attribute__((io("dev_to_host")));

// Declare serial link input and output channels.

// Used for inter-card communication.

channel ulong4 sch_in0 __attribute__((depth(4))) __attribute__((io("kernel_input_ch0")));

channel ulong4 sch_out0 __attribute__((depth(4))) __attribute__((io("kernel_output_ch0")));

channel ulong4 sch_in1 __attribute__((depth(4))) __attribute__((io("kernel_input_ch1")));

channel ulong4 sch_out1 __attribute__((depth(4))) __attribute__((io("kernel_output_ch1")));

// Create a synchronisation channel to allow the host to synchronise with the application kernel and
// vice versa

channel bool HostToKernelRequestChannel;

channel bool KernelToHostAcknowledgeChannel;

// Example helper kernel for reading from the host into global memory, to replace normal clEnqueueWriteBuffer
// commands. Here the first word read from the interface is a header. This can be user defined to
// fit with the user's needs. A header is not necessary if the packet sizes are known.
// The number of commands to service is set by the ARM host.

__kernel

void ServiceHostCommandQueue(__global ulong4 *restrict ddr_buffer,int NoCommands)

{

int command = 0;

unsigned char state=IDLE;

unsigned int packet_count = 0;

unsigned int PacketSize = 0; // Bytes in multiples of 16 (I.e. ulong4)

unsigned int Offset = 0; // Bytes in multiples of 16 (I.e. ulong4)


while (command < NoCommands) // Fully pipelined main loop for best performance

{

ulong4 Data;

if (state != READ_FROM_GLOBAL_MEM)

Data = read_channel_altera(device_in);

switch (state)

{

case IDLE:

packet_count = 0; // Reset packet count

state = Data.s0;

PacketSize = Data.s1; // Bytes in multiples of 16 (I.e. ulong4)

Offset = Data.s2; // Bytes in multiples of 16 (I.e. ulong4)

break;

case WRITE_TO_GLOBAL_MEM:

ddr_buffer[Offset+(packet_count>>4)] = Data;

if (packet_count != (PacketSize-16))

packet_count += 16;

else

{

state = IDLE;

command++;

}

break;

case READ_FROM_GLOBAL_MEM: {
// Braced block: a declaration may not directly follow a case label in C.
ulong4 output = ddr_buffer[Offset+(packet_count>>4)];
write_channel_altera(device_out,output);
if (packet_count != (PacketSize-16))
packet_count += 16;
else
{
state = IDLE;
command++;
}
break;
}

// Synchronisation routines.

case HOST_SYNC:

// Instructs the kernel that the host is ready for it to start.

// I.e. data has been written to global memory.

write_channel_altera(HostToKernelRequestChannel,1);

state = IDLE;

break;

case KERNEL_SYNC:

// Command queue will pause until kernel sends acknowledgement

read_channel_altera(KernelToHostAcknowledgeChannel);

state = IDLE;

break;

default : break;

}


}

}