Post on 02-Aug-2020
Presentation By: Brendan Adkins and Tarun Khubchandani | EECS 507 Winter 2020 | 1
NeuFlow: A Runtime Reconfigurable Dataflow Processor for Vision
Paper by: Clement Farabet, Berin Martini, Benoit Corda, Polina Akselrod, Eugenio Culurciello and Yann LeCun
Presentation by: Brendan Adkins and Tarun Khubchandani
Farabet and LeCun’s original talk: https://www.youtube.com/watch?v=KaJtT1K3GtI
Introduction and Background
Computer Vision
● Extract high-level information from images
○ Form a relationship between high-dimensional data and a low-dimensional space
● Object Recognition
○ Dense feature extraction from regularly spaced samples
● GPUs increasing in prominence in CV
○ Inexpensive, easily available, easily programmable
○ Poor performance-to-power ratio compared to custom HW
Contribution
● Provide real-time detection, categorization and localization of pipelined megapixel images
○ 10x less power consumption than laptop computer (~10W)
○ 100x speedup in application
● Similar work being carried out at NEC Labs, Stanford and KAIST
Architecture
Dataflow Grid
● 2D Grid of Processing Tiles (PT)
○ Bank of Processing Operators
○ Routing MUX connecting local data lines to global/neighbor tiles
● Smart DMA
○ Asynchronous data transfers with priority to off-chip memory
● Global/Local Data Lines
○ Global lines connect PT to SDMA
○ Local Data lines connect neighbors
● Runtime Configuration Bus
○ Reconfigure grid to specialize at runtime
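The grid description above can be illustrated with a toy software model. This is only a sketch under stated assumptions: the operator names, the `Grid` class and its `configure`/`stream` methods are hypothetical and are not taken from the paper or from the NeuFlow toolchain. It shows the one idea the slide makes: the tiles stay in place, and only their operator selection changes between runs.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# Hypothetical operator bank available in every tile (names are illustrative).
OPERATORS: Dict[str, Callable[[int], int]] = {
    "pass": lambda x: x,
    "double": lambda x: 2 * x,
    "square": lambda x: x * x,
    "relu": lambda x: max(0, x),
}

Tile = Tuple[int, int]


class Grid:
    """Toy 2D grid: each tile holds the name of one configured operator."""

    def __init__(self, rows: int, cols: int) -> None:
        self.config: Dict[Tile, str] = {
            (r, c): "pass" for r in range(rows) for c in range(cols)
        }

    def configure(self, settings: Dict[Tile, str]) -> None:
        # Stands in for the runtime configuration bus: only the operator
        # selections change, the grid itself is not rebuilt.
        for tile, op in settings.items():
            if tile not in self.config or op not in OPERATORS:
                raise ValueError(f"bad configuration: {tile} -> {op}")
            self.config[tile] = op

    def stream(self, path: List[Tile], data: Iterable[int]) -> Iterator[int]:
        # Route each value through the tiles on `path`, in order.
        for value in data:
            for tile in path:
                value = OPERATORS[self.config[tile]](value)
            yield value


grid = Grid(2, 2)
path = [(0, 0), (0, 1), (1, 1)]

grid.configure({(0, 0): "double", (0, 1): "square"})
first = list(grid.stream(path, [1, 2, 3]))

# Reconfigure the same tiles for a different computation.
grid.configure({(0, 0): "relu", (0, 1): "double"})
second = list(grid.stream(path, [-1, 2, 3]))
```

The model ignores timing, data-line widths and the Smart DMA entirely; it is meant only to make the "reconfigure, then stream" pattern concrete.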
Runtime Reconfiguration
● FPGA:
○ Versatile, simple processing elements (~10^4/package)
○ ~ms reconfiguration time but ~hr synthesis time
● Multicore Processor:
○ Simple usage (extensions to programming languages for parallelism)
○ Far fewer processing elements (10-100)
● Proposed Architecture:
○ Halfway between above options
○ Applications specialise in vision
Optimized Processing Tiles
● Specialized heavily on 2D convolutions
○ Top row PTs are MACs used as convolvers, implemented in FPGA by hardwired MAC
○ Middle row is general purpose ops
○ Bottom row is non-linear mapping (normalization, linear activation, etc.). Done with look-up or linear decomposition
● Pipelined to have 1 result/cycle
● Pixels stored Q8.8 and scaled to 32-bit in operations
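The Q8.8 storage format mentioned above can be made concrete with a small fixed-point multiply-accumulate sketch. The helper names (`to_q88`, `from_q88`, `mac_q88`) are hypothetical, and the reading that products are kept at full width and rescaled once at the end is an assumption about what "scaled to 32-bit in operations" means, not a statement of the actual hardware datapath.

```python
FRAC_BITS = 8                 # Q8.8: 8 integer bits, 8 fractional bits
Q88_MIN, Q88_MAX = -32768, 32767


def to_q88(x: float) -> int:
    """Quantize a real value to signed Q8.8, saturating to the 16-bit range."""
    raw = round(x * (1 << FRAC_BITS))
    return max(Q88_MIN, min(Q88_MAX, raw))


def from_q88(raw: int) -> float:
    """Convert a Q8.8 integer back to a float."""
    return raw / (1 << FRAC_BITS)


def mac_q88(pixels, weights) -> int:
    """Multiply-accumulate Q8.8 operands in a wide accumulator.

    The product of two Q8.8 values is a Q16.16 value that fits in 32 bits.
    Python integers are unbounded, so accumulator overflow is not modelled
    here; real hardware would have to wrap or saturate.
    """
    acc = 0
    for p, w in zip(pixels, weights):
        acc += p * w                      # Q16.16 product
    result = acc >> FRAC_BITS             # rescale once, back to Q8.8
    return max(Q88_MIN, min(Q88_MAX, result))


pixels = [to_q88(v) for v in (0.5, 1.25, -2.0)]
weights = [to_q88(v) for v in (2.0, 0.5, 0.25)]
dot = from_q88(mac_q88(pixels, weights))  # 0.5*2.0 + 1.25*0.5 - 2.0*0.25
```

Rescaling once after the sum, rather than after every product, avoids accumulating a rounding error per term; that is the usual motivation for a wide accumulator.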
Architecture Constraints
● High throughput, but not necessarily low latency
○ Operations replicated in both dimensions
○ # similar computations > latency in pipelined processor
● Must be stallable
○ Allows any path to be configured, even if requiring more bandwidth than available
○ Achieved with FIFO buffer
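The stallability requirement above can be sketched as a cycle-by-cycle simulation of a producer, a bounded FIFO and a slower consumer. The function name `simulate` and its parameters are hypothetical; the model only shows that when the FIFO is full the producer holds its data instead of dropping it, so an over-subscribed path still delivers every item, just more slowly.

```python
from collections import deque


def simulate(items, fifo_depth: int, consumer_period: int):
    """Toy model of back-pressure through a bounded FIFO.

    The consumer accepts one item every `consumer_period` cycles; the
    producer pushes one item per cycle unless the FIFO is full.
    Returns (received_items, producer_stall_cycles).
    """
    fifo = deque()
    pending = deque(items)
    received = []
    stalls = 0
    cycle = 0
    while pending or fifo:
        # Consumer side: drain one item on its active cycles.
        if fifo and cycle % consumer_period == 0:
            received.append(fifo.popleft())
        # Producer side: push if there is room, otherwise stall.
        if pending:
            if len(fifo) < fifo_depth:
                fifo.append(pending.popleft())
            else:
                stalls += 1
        cycle += 1
    return received, stalls


slow_received, slow_stalls = simulate(range(6), fifo_depth=2, consumer_period=3)
fast_received, fast_stalls = simulate(range(6), fifo_depth=2, consumer_period=1)
```

With a consumer that keeps up (`consumer_period=1`) the producer never stalls; with a slower consumer the stall count grows but the received sequence is unchanged.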
Architecture Constraints
● Configuration time ≈ system latency
○ Crucial to runtime reconfiguration, achieved with configuration bus
● Coarse grained processing elements
○ Maximize ratio between computing and routing
Smart DMA
● Custom engine to allow multiple asynchronous accesses
● Arbiter MUXes/DEMUXes access to memory at high bandwidth
● Ports can be configured to R/W specific chunks and communicate status to Control Unit
○ Dataflow: Operation driven fully by data
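The arbitration idea above can be sketched as several ports queueing memory requests while an arbiter grants one access per cycle. Everything here is an assumption made for illustration: the `Port` type, the strict-priority policy, and the convention that a lower number wins are not described in the slides, which only say that transfers are asynchronous and prioritized.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Port:
    name: str
    priority: int                                    # lower number is served first (assumed)
    queue: List[int] = field(default_factory=list)   # pending addresses


def arbitrate(ports: List[Port]) -> Optional[Port]:
    """Pick the pending port with the best priority; None if all are idle."""
    pending = [p for p in ports if p.queue]
    return min(pending, key=lambda p: p.priority) if pending else None


def run(ports: List[Port]) -> List[Tuple[str, int]]:
    """Grant one memory access per cycle until every port is drained."""
    log = []
    while (winner := arbitrate(ports)) is not None:
        log.append((winner.name, winner.queue.pop(0)))
    return log


ports = [
    Port("weights", 1, [0x10, 0x14]),
    Port("pixels", 0, [0x00, 0x04]),
    Port("results", 2, [0x80]),
]
grants = run(ports)
```

Strict priority is the simplest policy to show, but it can starve low-priority ports under sustained load; a real arbiter would likely need some fairness mechanism.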
Compiler
Purpose
● Extracting levels of parallelism from graph descriptions of algorithms
● Graphs are given in the Torch5 environment
○ Matrix representation similar to MATLAB
● Known sequences of operators are matched to pre-optimized routines
Training a XOR gate in Torch 5, http://torch5.sourceforge.net/manual/newbieTutorial.html
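The last point of the Purpose slide, matching known operator sequences to pre-optimized routines, can be sketched as a greedy rewrite over a flat operator list. The pattern table `FUSED`, the routine names and the `lower` function are hypothetical; the real compiler works on graphs described in Torch5, and its matching rules are not given in the slides. Python is used here only for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical table of fused routines, keyed by the operator sequence
# each one replaces.
FUSED: Dict[Tuple[str, ...], str] = {
    ("conv", "tanh", "pool"): "conv_tanh_pool",
    ("conv", "tanh"): "conv_tanh",
}


def lower(ops: List[str]) -> List[str]:
    """Greedy left-to-right rewrite, trying the longest pattern first."""
    patterns = sorted(FUSED, key=len, reverse=True)
    out: List[str] = []
    i = 0
    while i < len(ops):
        for pat in patterns:
            if tuple(ops[i:i + len(pat)]) == pat:
                out.append(FUSED[pat])
                i += len(pat)
                break
        else:
            # No known routine starts here: keep the generic operator.
            out.append(ops[i])
            i += 1
    return out


network = ["conv", "tanh", "pool", "conv", "tanh", "linear"]
lowered = lower(network)
```

Trying longer patterns first means a `conv, tanh, pool` run becomes a single fused routine rather than a `conv_tanh` followed by a separate `pool`.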
Parallelization Methods
● Across modules
○ Special cases
○ Cascading convolutions and nonlinear mapping
● Across images
○ Can use multiple PTs to convolve multiple inputs with a kernel at once
○ NeuFlow/LuaFlow’s strength and the simplest method
● Within an image
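Parallelism across images, as listed above, can be sketched as one shared kernel applied to several inputs by a pool of workers. The worker pool stands in for processing tiles purely as an analogy; `convolve2d` and `convolve_batch` are hypothetical names, and the code says nothing about how LuaFlow actually schedules tiles.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List

Image = List[List[int]]


def convolve2d(image: Image, kernel: Image) -> Image:
    """'Valid' 2D correlation (no kernel flip), as is usual in ConvNets."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(ow)
        ]
        for r in range(oh)
    ]


def convolve_batch(images: List[Image], kernel: Image, tiles: int) -> List[Image]:
    # Each worker stands in for one processing tile; all share the same kernel.
    with ThreadPoolExecutor(max_workers=tiles) as pool:
        return list(pool.map(lambda img: convolve2d(img, kernel), images))


kernel = [[1, 0], [0, 1]]
images = [
    [[1, 2], [3, 4]],
    [[0, 1], [1, 0]],
]
outputs = convolve_batch(images, kernel, tiles=2)
```

Because the images are independent, no communication between workers is needed, which is consistent with the slide calling this the simplest method.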
Application and Performance Comparison
Application: Street Scenes
● Trained with LabelMe dataset of Spanish cities
● 3 phases of training
● Post-training, network mapped to NeuFlow using LuaFlow
Phase 1: CN1
● 3 Convolutions
● Small kernels (5x5)
○ Small receptive field
● Focus on minimizing cross entropy to promote rare categories
Phase 2: CN2
● 3 Convolutions
● Kernels increased to 9x9 size
○ Receptive field 2% of image
Phase 3: CN3
● 4 Convolutions
● Kernels kept at 9x9 size
○ Receptive field 5% of image
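The growth of the receptive field across the three phases can be illustrated with the standard formula for stacked convolutions. This is a sketch under a strong assumption: every layer is given stride 1 and no subsampling, because the slides do not state the strides or pooling used. The pixel widths it produces are therefore lower bounds for illustration and are not expected to reproduce the 2% and 5% figures quoted on the slides.

```python
from typing import List, Tuple


def receptive_field(layers: List[Tuple[int, int]]) -> int:
    """Receptive-field width of a stack of layers.

    `layers` is a list of (kernel_size, stride) pairs, ordered from input
    to output. Each layer widens the field by (kernel - 1) input steps,
    scaled by the product of the strides of the layers before it.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


# Assumed stride-1 stacks matching the layer counts and kernel sizes above.
cn1 = receptive_field([(5, 1)] * 3)
cn2 = receptive_field([(9, 1)] * 3)
cn3 = receptive_field([(9, 1)] * 4)

# Inserting a stride-2 stage (hypothetical) enlarges the field further.
cn1_strided = receptive_field([(5, 1), (2, 2), (5, 1), (5, 1)])
```

The qualitative trend matches the slides: larger kernels and an extra convolution each widen the region of the image that one output location sees.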
Performance Comparison
● V6
○ Competitive GOP rate
○ Strength in power efficiency indicates a potential use case in power-constrained systems where the speed of an mGPU would suffice
● IBM
○ Vast projected improvement in GOP rate and efficiency
○ Could fully eclipse the mGPU in speed and efficiency
Questions?