field-programmable technology: today s and tomorrow s ... engineering, imperial college, ... 8 1 9...

Download Field-programmable technology: today s and tomorrow s ...  Engineering, Imperial College, ... 8 1 9 3 9 57 9 1 9 1 3 5 7 1 9 9 2 0 0 3 2 0 0 1 0 2 0 0 7 9 x x x x x ... Delta Gamma Accumulator Control 5

Post on 01-May-2018

215 views

Category:

Documents

2 download

Embed Size (px)

TRANSCRIPT

  • Field-programmable technology: todays and tomorrows Wayne LUK

    Computer Engineering, Imperial College, London, UK - w.luk@imperial.ac.uk

    1

    1

    Field-Programmable Technology: Todays and Tomorrows

    Wayne LukImperial College London

    TWEPP-07Prague

    4 September 20072

    Outline: technology = devices + design1. overview: motivation and vision2. field-programmable devices: today

    - Xilinx Virtex-4, Virtex-5; Stretch S5 3. field-programmable design : today

    - enhance optimality and re-use 4. field-programmable devices: tomorrow

    - hybrid FPGA, die stacking 5. field-programmable design : tomorrow

    - guided synthesis, representation, upgradability6. summary Thanks to colleagues, students and collaborators from Imperial College, University of

    British Columbia, Chinese University of Hong Kong, University of Massachusetts Amherst, UK Engineering and Physical Sciences Research Council, Stretch, Xilinx

    3

    1. Motivation: good - Moores Law

    rising capacity speed

    falling power/MHz price

    Source: Xilinx4

    1

    Logi

    cT r

    ansi

    stor

    spe

    r Ch i

    p`(K

    )

    Pro d

    uct i v

    i tyTr

    ans.

    / Sta

    f f- M

    onth

    10

    100

    1,000

    10,000

    100,000

    1,000,000

    10,000,000

    10

    100

    1,000

    10,000

    100,000

    1,000,000

    10,000,000

    100,000,000Logic (Moores Law) Transistors/ChipTransistor/Staff Month

    58%/Yr. compoundComplexity growth rate

    21%/Yr. compoundProductivity growth rate

    1 98 1

    1 98 3

    1 98 5

    1 98 7

    1 98 9

    1 99 1

    1 99 3

    1 99 5

    1 99 7

    1 99 9

    2 00 3

    2 00 1

    2 00 5

    2 00 7

    2 00 9

    xx xx x x

    x

    challenge: reduce the design productivity gap

    Not so good: design productivity gap

    Source: SEMATECH

    5

    Our vision: unified synthesis + analysis

    C/Matlab/Premiere

    Compiler

    Machinecode

    Run-timeinterface

    Configurationinformation

    Fixed processor Custom processor

    Custom computing system

    /Progolapplicationdescription

    manual/automatic partition, compile, analysis, validate

    system-specific architecture

    system-specific programming interface

    Java/

    CPUs + DSPs FPGAs + sensors

    6

    500MHz FlexibleSoft Logic Architecture

    500MHz Programmable DSP Execution Units

    0.6-11.1GbpsSerial Transceivers

    450MHz PowerPC Processorswith

    Auxiliary Processor Unit

    1Gbps DifferentialI/O

    500MHz multi-portDistributed RAM

    500MHz DCM DigitalClock Management

    2a. Devices today: Virtex-4 FPGA

    fine-grain fabric + special function units challenge: use resources effectively

    Source: Xilinx

    47

  • 2

    7

    2b. Virtex-5 FPGA

    Source: Xilinx

    capacity rises

    logic speed rises

    I/O speed rises

    all good news?

    8

    Growing gap: amount of gates vs I/O

    Source: Xilinx

    I/O off-chip serial

    I/O on-chip scalable flexible easy to use

    interconnect heterogeneous customisable

    9

    2c. Software configurable engine: S5Wide Register File (WRF) 32 Wide Registers (WR) 128-bit WideLoad/Store Unit 128-bit Load/Store Auto Increment/Decrement Immediate, Indirect, Circular Variable-byte Load/Store Variable-bit Load/Store

    ISEF Instruction Specialization Fabric Compute Intensive Arbitrary Bit-width Operations 3 Inputs and 2 Outputs Pipelined, Bypassed, Interlocked Random Logic Support Internal State Registers

    ALUFPU

    32-BIT RF

    CO

    NTR

    OL 128-BIT WRF32-BIT RF

    ALUFPU

    S5 ENGINE

    ISEFInstruction-Set

    Extension Fabric

    DATA RAM32KB

    SRAM256KB

    D-CACHE32KB

    I-CACHE32KB

    MMU

    RISC Processor Tensilica Xtensa V 32 KB I & D Cache On-Chip Memory, MMU 24 Channels of DMA, FPU

    Source: Stretch

    10

    3. Design today: overview structural or register-transfer level (RTL)

    e.g. VHDL, Verilog low-level, little automation, small designs

    behavioural, system-level descriptions e.g. SystemC (public-domain: systemc.org) MARTES: +UML for real-time embedded systems

    general-purpose software languages e.g. C, Java; with hardware support: Handel-C high-level, large automation, large designs

    special-purpose descriptions e.g. System Generator (signal processing) high-level, domain-specific optimisations

    11

    3a. Enhance optimality and re-use

    design optimality: quality select algorithm and devices: meet requirements mapping: regular to systolic, rest to processor I/O: dictates on-chip parallelism, buffering schemes control speed/area/power: pipelining, layout plan partitioning: coarse vs fine grain logic and memory

    design re-use: productivity separate aspects specific to application/technology library of customisable components with trade-offs compose and customise to meet requirements uniform interface to memory and I/O: hide details pre-verified parts: ease system verification

    12

    Example: systolic summation tree n-input adder

    tree of (n-1)2-input adders

    each adder has k stages each stage

    has s-bit adder figure shows

    k = 3 s = 3

    high k, low s less cycle time more area less power

    48

  • 3

    13

    Finance application: value-at-risk

    sampling from multivariateGaussian distribution

    DSP units: matrix multiplication

    systolic tree: accumulate result

    RC2000 board Virtex-4 xc4vsx55 400MHz

    64DSPs

    48DSPs

    GaussianRNG

    Delta GammaAccumulator

    Control RC2000Interface

    IO Clock domain

    FIFO

    FIFO

    + + + +

    + +

    +

    33 times faster than 2.2GHz quad Opteron(including all IO overheads, PC-FPGA communications, and using AMD

    optimised BLAS for software)14

    CO

    NTR

    OL

    3b. Stretch program development profile code

    identify hotspots special instruction

    implement Cfunctions in single instructions

    bit-width optim. software compiler

    generate instructionschedule instruction

    multiple data (WR)perform operations

    in parallel efficient data

    movement intrinsic load store

    operations20+ DMA channels

    APPLICATIONC/C++

    COMPILEDMACHINE

    CODE

    Compiler

    InstructionDefinition

    NEW INSTRUCTIONS

    INSTRUCTIONGENERATION

    TAILOR ISEF TO APPLICATION

    AUTOMATIC

    Source: Stretch

    15

    500MHz FlexibleSoft Logic Architecture

    500MHz Programmable DSP Execution Units

    0.6-11.1GbpsSerial Transceivers

    450MHz PowerPC Processorswith

    Auxiliary Processor Unit

    1Gbps DifferentialI/O

    500MHz multi-portDistributed RAM

    500MHz DCM DigitalClock Management

    4. Devices tomorrow: more diversed

    Source: Xilinx

    Replaced by other functional units, e.g. floating-point units

    16

    4a. Hybrid FPGA: architecture

    most digital circuits datapath: regular, word-based logic control: irregular, bit-based logic

    hybrid FPGA customised coarse-grained block:

    domain-specific requirements fine-grained blocks:

    existing FPGA architecture good match to computing

    applications for given domain

    17

    Coarse-grained fabric library

    D=9, M=4, R=3, F=3, 2 add, 2 mul: best density over benchmarks18

    Evaluation

    6 benchmark circuits digital signal processing kernels: e.g. bfly (for FFT) linear algebra: e.g. matrix multiplication complete application: e.g. bgm (financial model)

    circuits: partitioned to control + datapath control: vendor tools to fine-grained units datapath: manually map to coarse-grained units

    comparison directly synthesized to Xilinx Virtex-II devices

    49

  • 4

    19

    Results

    2.4818.3Geometric Mean

    2.4316.724.343020710.001810bgm

    2.2515.121.9382389.74545ode

    2.6313.823.488898.90642mm3

    2.6130.423.68112909.06371fir4

    2.2514.522.78961410.11661dscg

    2.7224.324.57137339.02565bfly

    Delay(times)

    Area(times)

    Delay(ns)

    Area(slices)

    Delay(ns)

    Area(slices)

    XC2V3000-6Floating Point hybrid FPGA

    20

    4b. On-chip memory bandwidth

    storage hierarchy: registers, LUT RAM, block RAM processor cache: address lack of I/O bandwidth

    Source: Xilinx

    21

    High density die stacking

    Source: Xilinx 22

    Programmable circuit board

    Source: Xilinx

    die stacking: 3D interchip connections customisable system-in-package: productivity gap?

    23

    5a. Design tomorrow: guided synthesis guided transformation of design descriptions

    automate tedious and error-prone steps applicable to various levels of abstraction

    focus: two timing models strict timing model: cyclecycle--accurate accurate -- efficiencyefficiency flexible timing model: behaviouralbehavioural -- productivityproductivity

    combine cycle-accurate and behavioural models rapid development with high quality design maintainability and portability

    based on high-level language library developer: provide optimised building blocks application developer: customise building blocks

    24

    Timing models: strict vs flexible

    {delta = b*b - ((a*c) 0)

    num_sol = 2;else if (delta == 0)

    num_sol = 1;else

    num_sol = 0;}

    {delta = b*b - ((a*c) 0)

    num_sol = 2;else if (delta == 0)

    num_sol = 1;else

    num_sol = 0;}

    b a c

    * * 0

    num_sol

    == 02

    1

    0

    2

    MU

    X

    MU

    X

    executed executed at cycleat cycle 11

    executed executed at cycle 2at cycle 2

    cc

    cctrue

Recommended

View more >