Presentation By: Brendan Adkins and Tarun Khubchandani | EECS 507 Winter 2020 | 1

NeuFlow: A Runtime Reconfigurable Dataflow Processor for Vision

Paper by: Clement Farabet, Berin Martini, Benoit Corda, Polina Akselrod, Eugenio Culurciello and Yann LeCun

Presentation by: Brendan Adkins and Tarun Khubchandani

Farabet and LeCun’s original talk: https://www.youtube.com/watch?v=KaJtT1K3GtI

Introduction and Background

Computer Vision

● Extract high-level information from images
○ Map high-dimensional data to a low-dimensional space
● Object recognition
○ Dense feature extraction from regularly spaced samples
● GPUs are increasingly prominent in CV
○ Inexpensive, easily available, easily programmable
○ Poor performance-to-power ratio compared to custom HW

Contribution

● Provide real-time detection, categorization and localization of pipelined megapixel images
○ ~10 W, about 10x less power than a laptop computer
○ 100x speedup in application
● Similar work being carried out at NEC Labs, Stanford and KAIST

Architecture

Dataflow Grid

● 2D grid of Processing Tiles (PT)
○ Bank of processing operators
○ Routing MUX connecting local data lines to global/neighbor tiles
● Smart DMA
○ Asynchronous data transfers with priority to off-chip memory
● Global/local data lines
○ Global lines connect PTs to the Smart DMA (SDMA)
○ Local data lines connect neighbors
● Runtime configuration bus
○ Reconfigure grid to specialize at runtime

Runtime Reconfiguration

● FPGA:
○ Versatile, simple processing elements (~10^4/package)
○ ~ms reconfiguration time but ~hr synthesis time
● Multicore Processor:
○ Simple usage (extensions to programming languages for parallelism)
○ Far fewer processing elements (10-100)
● Proposed Architecture:
○ Halfway between the above options
○ Specialized for vision applications

Optimized Processing Tiles

● Specialized heavily on 2D convolutions
○ Top-row PTs are MACs used as convolvers, implemented in the FPGA by hardwired MAC units
○ Middle row is general-purpose ops
○ Bottom row is non-linear mapping (normalization, linear activation, etc.), done with look-up or linear decomposition
● Pipelined to produce 1 result/cycle
● Pixels stored as Q8.8 and scaled to 32-bit in operations
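The Q8.8 storage and 32-bit widening above can be illustrated with a small software model. This is only a sketch under assumptions: the function names (`to_q88`, `conv2d_q88`), the saturation behavior, and the truncating shift are illustrative choices, not details taken from the paper or the hardware.

```python
FRAC_BITS = 8              # Q8.8: 8 integer bits, 8 fractional bits
Q_MIN, Q_MAX = -32768, 32767


def saturate(q: int) -> int:
    """Clamp to the signed 16-bit range (assumed overflow policy)."""
    return max(Q_MIN, min(Q_MAX, q))


def to_q88(x: float) -> int:
    """Quantize a real value to a signed Q8.8 integer."""
    return saturate(int(round(x * (1 << FRAC_BITS))))


def from_q88(q: int) -> float:
    """Convert a Q8.8 integer back to a real value."""
    return q / (1 << FRAC_BITS)


def conv2d_q88(image, kernel):
    """Sliding-window multiply-accumulate over Q8.8 inputs ('valid' region).

    The kernel is not flipped, as is conventional in convolutional networks.
    Each product of two Q8.8 values is a Q16.16 value, so the accumulator
    models the wider (32-bit) format. Python ints never overflow; a real
    32-bit accumulator could.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0  # Q16.16 accumulator
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            # Shift back to Q8.8 (floor) and clamp to 16 bits.
            row.append(saturate(acc >> FRAC_BITS))
        out.append(row)
    return out
```

Widening before accumulation keeps the fractional precision of every product until the final shift back to Q8.8.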

Architecture Constraints

● High throughput, but not necessarily low latency
○ Operations replicated in both dimensions
○ Number of similar computations exceeds the latency of the pipelined processor
● Must be stallable
○ Allows any path to be configured, even if requiring more bandwidth than available
○ Achieved with FIFO buffers
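The stall requirement can be sketched as a bounded FIFO between a fast producer and a slower consumer. The class and function names, the FIFO depth, and the cycle model are all assumptions made for illustration; the slides do not describe the actual buffer design.

```python
from collections import deque


class StallableFifo:
    """Bounded FIFO: a full buffer stalls the producer instead of dropping data."""

    def __init__(self, depth: int):
        self.depth = depth
        self.items = deque()

    def can_push(self) -> bool:
        return len(self.items) < self.depth

    def can_pop(self) -> bool:
        return len(self.items) > 0

    def push(self, item) -> None:
        assert self.can_push(), "producer must stall when the FIFO is full"
        self.items.append(item)

    def pop(self):
        return self.items.popleft()


def simulate(n_items: int, depth: int, consumer_period: int):
    """Producer offers one item per cycle; consumer accepts one item every
    `consumer_period` cycles. Returns (items received, producer stall cycles)."""
    fifo = StallableFifo(depth)
    received, stalls, next_item, cycle = [], 0, 0, 0
    while len(received) < n_items:
        # Consumer side: drains only on its own schedule.
        if cycle % consumer_period == 0 and fifo.can_pop():
            received.append(fifo.pop())
        # Producer side: pushes if there is room, otherwise stalls this cycle.
        if next_item < n_items:
            if fifo.can_push():
                fifo.push(next_item)
                next_item += 1
            else:
                stalls += 1
        cycle += 1
    return received, stalls
```

With a slow consumer the producer accumulates stall cycles, but every item still arrives in order, which is the property that lets an over-subscribed path be configured safely.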

Architecture Constraints

● Configuration time ≈ system latency
○ Crucial to runtime reconfiguration, achieved with configuration bus
● Coarse-grained processing elements
○ Maximize ratio between computing and routing

Smart DMA

● Custom engine allowing multiple asynchronous accesses
● Arbiter MUX/DEMUX gives high-bandwidth access to memory
● Ports can be configured to read/write specific chunks and communicate status to the Control Unit
○ Dataflow: operation driven fully by data
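A rough software model of the port and arbiter idea follows. It assumes a fixed-priority policy, one word transferred per cycle, and a flat list as off-chip memory; the `Port` fields and the `arbitrate` function are hypothetical names, and the real Smart DMA may arbitrate quite differently.

```python
from dataclasses import dataclass, field


@dataclass
class Port:
    """One DMA port configured for a specific chunk of memory."""
    name: str
    priority: int          # lower value is served first (assumed policy)
    base: int              # first address of the chunk
    length: int            # number of words in the chunk
    write: bool            # True: port writes to memory, False: port reads
    data: list = field(default_factory=list)  # words to write, or words read
    offset: int = 0        # progress within the chunk

    @property
    def done(self) -> bool:
        """Status flag a control unit could poll."""
        return self.offset >= self.length


def arbitrate(ports, memory):
    """Grant memory to the highest-priority unfinished port, one word per
    cycle, until every port has completed. Returns the grant order."""
    grants = []
    while True:
        pending = [p for p in ports if not p.done]
        if not pending:
            break
        port = min(pending, key=lambda p: p.priority)
        addr = port.base + port.offset
        if port.write:
            memory[addr] = port.data[port.offset]
        else:
            port.data.append(memory[addr])
        port.offset += 1
        grants.append(port.name)
    return grants
```

Each port only knows its own chunk and its `done` status, so the units attached to the ports can run independently while the arbiter serializes access to the shared memory.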

Compiler

Purpose

● Extracting levels of parallelism from graph descriptions of algorithms
● Graphs are given in the Torch5 environment
○ Matrix representation similar to MATLAB
● Known sequences of operators are matched to pre-optimized routines

Training a XOR gate in Torch 5, http://torch5.sourceforge.net/manual/newbieTutorial.html
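The operator-matching step can be pictured as a longest-match rewrite over a sequence of operators. The sketch below is written in Python rather than Lua, works on a flat list instead of a graph, and uses made-up operator and routine names; it illustrates the idea and is not the LuaFlow implementation.

```python
# Hypothetical table mapping operator sequences to pre-optimized routines.
ROUTINES = {
    ("conv", "tanh"): "fused_conv_tanh",
    ("conv", "tanh", "subsample"): "fused_conv_tanh_subsample",
}


def lower(ops):
    """Replace known operator sequences with routine names, preferring the
    longest pattern that matches at each position. Unmatched operators
    pass through unchanged."""
    patterns = sorted(ROUTINES, key=len, reverse=True)
    out = []
    i = 0
    while i < len(ops):
        for pattern in patterns:
            if tuple(ops[i:i + len(pattern)]) == pattern:
                out.append(ROUTINES[pattern])
                i += len(pattern)
                break
        else:
            # No known sequence starts here; keep the single operator.
            out.append(ops[i])
            i += 1
    return out
```

For example, `lower(["conv", "tanh", "subsample", "conv", "tanh", "abs"])` yields one three-operator routine, one two-operator routine, and the unmatched `abs`.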

Parallelization Methods

● Across modules
○ Special cases
○ Cascading convolutions and nonlinear mapping
● Across images
○ Can use multiple PTs to convolve multiple inputs with a kernel at once
○ NeuFlow/LuaFlow’s strength and the simplest method
● Within an image

Application and Performance Comparison

Application: Street Scenes

● Trained with LabelMe dataset of Spanish cities
● 3 phases of training
● Post-training network mapped to NeuFlow using LuaFlow

Phase 1: CN1

● 3 convolutions
● Small kernels (5x5)
○ Small receptive field
● Focus on minimizing cross entropy to promote rare categories

Phase 2: CN2

● 3 convolutions
● Kernels increased to 9x9 size
○ Receptive field 2% of image

Phase 3: CN3

● 4 convolutions
● Kernels kept at 9x9 size
○ Receptive field 5% of image
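Why larger kernels and an extra layer widen the receptive field can be checked with the standard formula for stacked convolutions. The slides do not give strides, pooling stages, or image size, so this sketch assumes stride 1 unless told otherwise and does not try to reproduce the 2% and 5% figures.

```python
def receptive_field(kernel_sizes, strides=None):
    """Receptive field (in input pixels, along one axis) of a stack of
    convolution layers. Assumes stride 1 for every layer unless `strides`
    is given; pooling layers can be modeled as extra (kernel, stride) pairs."""
    if strides is None:
        strides = [1] * len(kernel_sizes)
    rf = 1      # a single output pixel sees one input pixel before any layer
    jump = 1    # spacing, in input pixels, between adjacent units of a layer
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump
        jump *= s
    return rf


# Stride-1 stacks matching the kernel sizes and layer counts of each phase:
cn1 = receptive_field([5, 5, 5])        # three 5x5 convolutions
cn2 = receptive_field([9, 9, 9])        # three 9x9 convolutions
cn3 = receptive_field([9, 9, 9, 9])     # four 9x9 convolutions
```

Under these assumptions the field grows from 13 to 25 to 33 pixels per axis across the three phases; any subsampling between layers would enlarge it further.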

Performance Comparison

● V6
○ Competitive GOP rate
○ Strength in power efficiency indicates a potential use case in power-constrained systems for which the speed of an mGPU would suffice
● IBM
○ Vast projected improvement in GOP rate and efficiency
○ Could fully eclipse the mGPU in speed and efficiency

Questions?
