

[Figure labels, functional blocks: interface to measured value logging; interface to workstation; interface to sensor system; interface to actuators; integrated sequence control system; preprocessing of measured data; control; sensor data compaction; internal communication & synchronisation.]

[Figure labels, control structure (translated from German): mechatronic system of the NPMM-200; trajectory planning; reference trajectory; feedforward control; state controller; follow-up controller; state reconstruction; disturbance observer; kinematic boundary conditions; dynamic management; management of reference variables; management of disturbance variables.]

Ilmenau University of Technology
Faculty of Computer Science and Automation
Institute for Computer Engineering
Computer Architecture and Embedded Systems Group
www.tu-ilmenau.de/ra

Dr.-Ing. Alexander Pacholik, Dipl.-Inf. Johannes Klöckner, Dipl.-Inf. Marcus Müller, MSc. Dipl.-Ing. Irina Gushchina,

Prof. Dr.-Ing. habil. Wolfgang Fengler

{alexander.pacholik, johannes.kloeckner, marcus.mueller, irina.guschtschina, wolfgang.fengler}@tu-ilmenau.de

Ilmenau University of Technology, Germany

www.tu-ilmenau.de

LISARD: LABVIEW-INTEGRATED SOFTCORE ARCHITECTURE

FOR RECONFIGURABLE DEVICES


Abstract - The development of industrial control and measurement systems is often based on modular commercial off-the-shelf hardware. Recently, reconfigurable I/O modules with field-programmable gate arrays (FPGAs) have gained significance for these platforms, since they allow data processing functionality to be implemented very close to the data acquisition interfaces. However, algorithm complexity and floating-point support are limited by FPGA resources and design methods. This contribution presents a DSP softcore architecture that can be configured for a specific application and is built around a scalable double-precision floating-point arithmetic/logic unit. The core can be used seamlessly as a functional component in LabVIEW-based FPGA designs. A small study compares the performance of an example application implemented on the presented core with other embedded architectures.

Motivation: The Collaborative Research Centre 622 „Nano-positioning and Nano-measuring Machines“

In the context of this research and development project, high-performance data acquisition, processing and control algorithms have to be implemented optimally to satisfy challenging requirements for process precision under strict real-time conditions. For example, the control system of a nanometer-scale positioning and measuring machine incorporates multiple connected recursive filters with a target loop frequency of 100 kHz, which leaves a processing time of 10 µs per loop iteration. To achieve the desired process quality, double-precision floating-point computations are required.

[Figure labels: PXI embedded controller with LabVIEW RT; supervisory application; PXI backplane; PXI FPGA card; soft-CPU; floating-point algorithm; data acquisition and output (DAC, ADC, DIO).]

LiSARD Architecture: a floating-point microprocessor architecture, integrated in LabVIEW for implementation on PXI FPGA modules

Conceptual Consideration

In the following, the application developer, who uses the softcore component within a control application, has to be distinguished from the function developer, who tailors the application-specific softcore component. The application developer is provided with an HDL component that computes a fixed, complex floating-point arithmetic function. The function developer has to create an application-specific component from a flexible base architecture by designing the algorithmic function and customizing the core structure to fit the algorithmic requirements.

[Figure: Softcore in a functional component for on-chip design. Blocks: program memory interface, program memory, data memory interface, data memory, input registers, output registers, synchronization (Sync.), core pipeline with arithmetic operators.]

The interface of the LiSARD component embedded in LabVIEW

Processor Pipeline Architecture

LiSARD Core Pipeline

The core pipeline is the heart of the LiSARD component. Its main task is to manage pipelined arithmetic operators with different pipeline delays by scheduling input and output operands correctly.

Because it reduces hardware effort and makes parallelism easy to exploit, the core architecture is based on the VLIW (very long instruction word) approach.

The main component of the architecture is the arithmetic logic unit (ALU) containing the required floating-point operators with one write and two read operands. The overall processing of the ALU is managed by the instruction words stored in the program memory. These words are divided into two parts: execution (contains the addresses of the source operands in the data memory and the operation mode) and write-back (contains the address of the destination operand and its source in the ALU).

In the first stage an instruction word is fetched from program memory; it is decoded in the second stage. The third stage fetches the source operands from the data memory. The execution stage provides the source operands and instruction code to the ALU. In the last stage the calculated result is collected from the respective operator and written back to the data memory.
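The two-part instruction word can be illustrated with a small Python sketch. The field names follow the labels of the pipeline figure (Source1, Source2, ExecuteMode, WriteBackOperation, Target), and the 10-bit address width follows from the data memory depth of 2^10 in the Standard configuration. The split of the remaining 8 bits of the 38-bit word into two 4-bit fields, as well as the field order, is an assumption made for illustration and is not taken from the poster.

```python
# Field layout of one instruction word, most significant field first.
# Execution part: source1, source2, execute_mode.
# Write-back part: write_back_operation, target.
# ASSUMPTION: the 4/4 split of the non-address bits and the field order
# are hypothetical; only the 38-bit total and 2**10 depth come from the text.
ADDR_BITS = 10
FIELDS = [
    ("source1", ADDR_BITS),
    ("source2", ADDR_BITS),
    ("execute_mode", 4),
    ("write_back_operation", 4),
    ("target", ADDR_BITS),
]
WORD_BITS = sum(width for _, width in FIELDS)  # 3 * 10 + 4 + 4 = 38


def encode(fields):
    """Pack a dict of field values into one integer instruction word."""
    word = 0
    for name, width in FIELDS:
        value = fields[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def decode(word):
    """Unpack an integer instruction word into a dict of field values."""
    fields = {}
    for name, width in reversed(FIELDS):
        fields[name] = word & ((1 << width) - 1)
        word >>= width
    return fields


example = {
    "source1": 3,
    "source2": 17,
    "execute_mode": 1,
    "write_back_operation": 2,
    "target": 40,
}
word = encode(example)
print(f"{word:0{WORD_BITS}b}")
print(decode(word))
```

Keeping the layout in a single table makes it easy to try other configurations, for example 8-bit addresses for the Minimal configuration with its data memory depth of 2^8.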

Microcode Development

Assembler code allows a low-level design entry to quickly implement small algorithms. This assembler code (ASM) can easily be transformed into the VLIW binary code by instruction scheduling.

For high-level design entry, data-flow graphs are considered. In contrast to control-flow oriented programs in languages like C and assembler, which require a lot of effort to optimize the exploitation of instruction-level parallelism (ILP), data-flow programs, e.g. LabVIEW VIs, represent so-called superblocks of instructions that inherently express global data and instruction dependencies. Data-flow optimization thus makes it possible to determine a globally optimal order of instructions with regard to runtime and resource utilization for a given algorithm.
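How a data-flow graph maps onto issue slots can be sketched with a greedy list scheduler in Python. This is a simple heuristic for illustration, not the data-flow optimization used by the authors, and it does not guarantee a globally optimal order. The operator latencies, the example expression, and the assumption that every ALU accepts one new operation per tick are all hypothetical.

```python
from collections import namedtuple

# One node of the data-flow graph: a name, an operator kind, and the names
# of the operations whose results it consumes.
Op = namedtuple("Op", ["name", "kind", "deps"])

# ASSUMED operator pipeline delays in ticks (not taken from the poster).
LATENCY = {"add": 3, "mul": 4, "div": 10}


def list_schedule(ops, n_alus=1):
    """Greedy list scheduling of an acyclic graph.

    Each tick, up to n_alus operations whose inputs are available are
    issued in program order. Returns (start ticks, total ticks).
    """
    finish = {}  # operation name -> tick at which its result is available
    start = {}   # operation name -> tick at which it is issued
    pending = list(ops)
    tick = 0
    while pending:
        ready = [
            op for op in pending
            if all(finish.get(dep, float("inf")) <= tick for dep in op.deps)
        ]
        for op in ready[:n_alus]:
            start[op.name] = tick
            finish[op.name] = tick + LATENCY[op.kind]
            pending.remove(op)
        tick += 1
    return start, max(finish.values())


# y = (a*b + c*d) / (e*f), with a..f already present in data memory.
graph = [
    Op("m1", "mul", []),
    Op("m2", "mul", []),
    Op("m3", "mul", []),
    Op("s1", "add", ["m1", "m2"]),
    Op("q", "div", ["s1", "m3"]),
]

start_1, ticks_1 = list_schedule(graph, n_alus=1)
start_2, ticks_2 = list_schedule(graph, n_alus=2)
print(ticks_1, ticks_2)  # prints: 18 17
```

In this toy graph the second ALU saves only one tick, because the dependent addition and division dominate the runtime.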

[Figure: Software development tool chain. Design entry: dataflow graph, assembler source. Transformation: dataflow optimization, instruction scheduling. Output: binary code.]

The synchronization implements a two-wire handshake for input and output and triggers the computation.

Kalman Filter execution performance on different target architectures

1. The execution performance comprises the execution time per Kalman filter iteration and the overall data path latency including data transfer (controller and LiSARD implementations).

2. Runtime on the PXI controller has been determined for a native LabVIEW (generic and minimized) implementation. The jitter results from task scheduling and compulsory data cache misses in the LabVIEW RT runtime environment. Comparison with the pure algorithm written in data-flow optimized C code (DLL in LabVIEW) shows the significant overhead of the RT execution environment.

LiSARD configuration                               Slices         FF             LUT            DSP
Standard (generic and minimized implementation)    18672 (36 %)   14508 (28 %)   13180 (26 %)   13 (27 %)
Minimal (data-flow optimized implementation)       10818 (21 %)   8218 (16 %)    8043 (26 %)    13 (27 %)

LiSARD softcore resource utilization

Scalability

The micro-architecture allows the extension to multiple homogeneous or heterogeneous ALUs. The table lists the performance increase for the Kalman filter example running on multiple homogeneous ALUs of the configuration described above.

# ALUs       1        2        3        4
Ticks        215      143      118      116
Runtime      1.8 µs   1.2 µs   1.0 µs   1.0 µs
Speedup      1.0      1.5      1.8      1.9
Efficiency   1        0.75     0.60     0.46
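The speedup and efficiency rows follow directly from the tick counts: speedup is the single-ALU tick count divided by the tick count for n ALUs, and efficiency is the speedup divided by n. The short Python sketch below recomputes them; the function names are illustrative. The results agree with the table up to rounding in the last digit.

```python
# Tick counts per number of ALUs, taken from the scalability table.
TICKS = {1: 215, 2: 143, 3: 118, 4: 116}


def speedup(n_alus):
    """Single-ALU tick count divided by the tick count with n_alus ALUs."""
    return TICKS[1] / TICKS[n_alus]


def efficiency(n_alus):
    """Speedup per ALU: 1.0 means perfect scaling."""
    return speedup(n_alus) / n_alus


for n in TICKS:
    print(f"{n} ALUs: speedup {speedup(n):.2f}, efficiency {efficiency(n):.2f}")

# Clock frequency implied by the first column: 215 ticks in 1.8 µs.
# Ticks per microsecond equal megahertz.
implied_clock_mhz = TICKS[1] / 1.8
print(f"implied clock: about {implied_clock_mhz:.0f} MHz")
```

The fourth ALU reduces the runtime by only two ticks, which suggests that the data dependencies of the algorithm, rather than the number of ALUs, limit further speedup.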

A commonly used development platform in the measurement and control domain is PXI hardware together with the graphical framework LabVIEW. In a typical application, the programmable hardware serves as the connection to the specific protocols of sensors and actuators and provides adequate timing of data acquisition and actuation, while the controller CPU is used for computing. The communication latency from sensors to the controller CPU and back to actuators over the PXI backplane is approximately 30 µs. Relocating the floating-point algorithm implementation to the FPGA is proposed to reduce the transmission delays and thus decrease the closed-loop period of control applications. Due to the limited support for floating-point hardware synthesis, efficient implementation of floating-point control algorithms on the FPGA using hardware description languages (HDL) is time-consuming and expensive compared to CPU or DSP programming.
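The timing argument can be checked with a few lines of Python, using the 100 kHz target loop frequency from the motivation and the 30 µs backplane latency stated above. The sketch assumes a loop that waits for each complete transfer before starting the next iteration; the variable names are illustrative.

```python
# Figures from the text.
loop_frequency_hz = 100e3        # target control loop frequency
backplane_round_trip_s = 30e-6   # sensors -> controller CPU -> actuators

# Time available per loop iteration.
loop_period_s = 1.0 / loop_frequency_hz

# ASSUMPTION: one full round trip per iteration, no overlapping transfers.
fits_in_period = backplane_round_trip_s <= loop_period_s
max_loop_frequency_hz = 1.0 / backplane_round_trip_s

print(f"loop period: {loop_period_s * 1e6:.0f} µs")
print(f"round trip fits in period: {fits_in_period}")
print(f"loop rate limit from transfer alone: {max_loop_frequency_hz / 1e3:.1f} kHz")
```

Under this assumption the transfer alone takes three loop periods, so even a controller with zero computation time could not close the loop at 100 kHz.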

PXI System Setup with FPGA Module

• The arithmetic operators are taken from the FPLibrary or from the Xilinx ISE 10.x core generator.

• The LiSARD core features separate memories for program and data.

• The memories can be accessed from outside the core during system runtime. This enables debugging and allows the core to be reprogrammed without recompiling the FPGA design.

The Kalman filter mainly consists of matrix operations on relatively small matrices with double-precision floating-point operands:

• the generic version requires one division, 280 multiplications and 220 additions

• the structurally minimized version requires one division, 80 multiplications and 85 additions
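The operation mix of many multiplications and additions with a single division can be illustrated with a textbook Kalman filter iteration in plain Python. The sketch assumes a scalar measurement, so that the innovation covariance is a scalar and one reciprocal suffices. The poster does not state the dimensions or structure of the filter actually used, so the function signature and the matrices below are hypothetical.

```python
def mat_vec(m, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def mat_mul(a, b):
    """Matrix-matrix product for list-of-rows matrices."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def kalman_step(x, p, z, a, q, h, r):
    """One predict/update iteration for a scalar measurement z.

    x: state (n), p: covariance (n x n), a: transition matrix (n x n),
    q: process noise (n x n), h: measurement row (n), r: noise variance.
    """
    n = len(x)
    # Predict: x = A x, P = A P A^T + Q
    x_pred = mat_vec(a, x)
    apat = mat_mul(mat_mul(a, p), transpose(a))
    p_pred = [[apat[i][j] + q[i][j] for j in range(n)] for i in range(n)]
    # Update: S = H P H^T + R is a scalar for a scalar measurement
    ph = mat_vec(p_pred, h)
    s = sum(hi * phi for hi, phi in zip(h, ph)) + r
    inv_s = 1.0 / s  # the only division in the whole iteration
    k = [phi * inv_s for phi in ph]
    innovation = z - sum(hi * xi for hi, xi in zip(h, x_pred))
    x_new = [xi + ki * innovation for xi, ki in zip(x_pred, k)]
    # P = P - K (H P)
    hp = [sum(h[i] * p_pred[i][j] for i in range(n)) for j in range(n)]
    p_new = [[p_pred[i][j] - k[i] * hp[j] for j in range(n)] for i in range(n)]
    return x_new, p_new


# One-state example: prior x = 0 with variance 1, measurement z = 2.
x1, p1 = kalman_step([0.0], [[1.0]], 2.0, [[1.0]], [[0.0]], [1.0], 1.0)
print(x1, p1)  # prints: [1.0] [[0.5]]
```

Computing the reciprocal once and multiplying by it keeps the slow division operator off the critical path for all but one operation, which suits an ALU with pipelined multipliers and adders.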

The implementation of the Kalman filter on the LiSARD core is compared to equivalent implementations targeting an Intel Core 2 Duo T9400 in the PXI real-time controller NI PXI-8108 and to the direct hardware implementation for the Xilinx Virtex 5 LX 85 on the NI 7853R PXI FPGA module.

Standard: 32 inputs, 16 outputs, data memory depth of 2^10, providing a 38-bit instruction word length, and program memory depth of 2048 entries.

Minimal: 4 inputs, 4 outputs, data memory depth of 2^8 and program memory depth of 256 entries.

Resource utilization of LabVIEW FPGA implementations

FPGA implementation               Slices         FF             LUT            DSP
FPGA generic                      42008 (81 %)   37472 (72 %)   38372 (75 %)   40 (83 %)
FPGA optimized area (120 MHz)     23856 (46 %)   26022 (50 %)   30618 (60 %)   34 (71 %)
FPGA optimized speed (80 MHz)     41490 (80 %)   34870 (67 %)   30618 (60 %)   40 (83 %)

System overview

Case Study and Performance Evaluation: The implementation of a Kalman filter

Conclusion - The softcore architecture for use as a function component in LabVIEW is highly configurable in terms of VLIW organization, memory organization, operator selection and I/O connectivity. The architecture is designed to process complex floating-point algorithms for real-time control applications on distributed FPGA platforms. The LiSARD core is not intended to replace general-purpose softcore CPUs, but to be used as a DSP core in a data-flow design or to supplement a CPU in an SoC.

[Figure labels, processor pipeline. Stages: instruction fetch, instruction decode, operand fetch, execute, write back. Blocks: program memory read interface, instruction decoder, data memory read interface, ALUs, multiplexers (MUX), data memory write interface, input registers, output registers. Instruction word fields: Source1, Source2, ExecuteMode, WriteBackOperation, Target.]