"designing and selecting instruction sets for vision," a presentation from cadence
Post on 21-Aug-2015
Copyright © 2015 Cadence Design Systems 1
Chris Rowen
12 May 2015
Designing and Selecting Instruction Sets for Vision
Cadence in a Nutshell
• A top design automation supplier:
  • analog, digital and system verification tools
  • interface, processor and protocol verification IP
• Leading supplier of DSP and other data-rich embedded processing cores and software: Xtensa® Innovation Platform℠
• IVP family: advanced imaging and vision DSP cores with almost 1000 library functions and applications
• Massively parallel SIMD/VLIW processors with automated configurability and extensibility of ISA, memory, and interface
• One of Fortune Magazine’s “Top 100 Places to Work”
Outline
• The Vision Performance Challenge
• The Vision Instruction Set Puzzle
• Application Diversity Drives ISA Flexibility
• The Hardwired Accelerator Problem
• Examples:
  • Pedestrian Detection
  • Lane Departure Warning
  • Convolutional Neural Network
• Wrap-up
The Vision Performance Challenge
ADAS processing requirements are high
VGA: approaching 100 GOPS
Source: SoC for car navigation systems with a 53.3 GOPS image recognition engine, Hot Chips 21 (2009)
The Vision Performance Challenge
1080p60 ADAS is a teraOp problem
• Complexity grows an order of magnitude for full HD processing
• Accelerating algorithmic sophistication
• Scaling best addressed by
  • more parallelism
  • application-specific optimizations
  • architectural enhancements
  • move to advanced process nodes
• A good architecture
  • accelerates core functions
  • supports a wide range of application-specific optimizations
[Figure: Computation increase with resolution (brute force approach). x-axis: QVGA, VGA, HD, Full HD; y-axis scale: 0 to 30.]
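As a rough illustration of the brute-force scaling argument, the sketch below computes per-frame work relative to QVGA under the assumption that work grows linearly with pixel count. The pixel dimensions are the standard definitions of each format, not values taken from the slide, and the slide does not state how its curve was computed; the resulting factor of 27 for Full HD merely fits the chart's 0 to 30 scale.

```python
# Standard pixel dimensions for the formats named on the slide (assumed, not from the slide).
RESOLUTIONS = {
    "QVGA": (320, 240),
    "VGA": (640, 480),
    "HD": (1280, 720),
    "Full HD": (1920, 1080),
}


def relative_compute(baseline="QVGA"):
    """Per-frame work relative to `baseline`, assuming work scales linearly with pixel count."""
    base_w, base_h = RESOLUTIONS[baseline]
    base_pixels = base_w * base_h
    return {name: (w * h) / base_pixels for name, (w, h) in RESOLUTIONS.items()}


if __name__ == "__main__":
    for name, factor in relative_compute().items():
        print(f"{name:8s} {factor:5.2f}x")
```

Under the same assumption, `relative_compute("VGA")` gives the growth from a VGA baseline; algorithms whose cost also grows with search range or scale count would scale faster than this linear model.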
The Vision Instruction Set Puzzle
Key dimensions:
• Local memory bandwidth
• Memory hierarchy for data streaming
• Data types
• SIMD/vector organization
• Scalar operation bandwidth
• Instruction issue parallelism (VLIW)
• Vision-specific operations
• Multi-processor support
What to look for:
1. High local memory bandwidth
2. Effective latency hiding for DDR access
3. Data types: 8b, 16b, 32b fixed-point, floating point
4. Sustained ops/cycle from combination of VLIW and SIMD
5. Vision-specific operations: 2D data access, histogram, convolution, search, non-linear functions
6. Automatic compiler inference of vectors, complex operations
7. Scale-up with custom operations
8. Scale-up with parallel cores
Application Diversity Drives ISA Flexibility
• Real design is full of trade-offs:
  • Memory reference vs. ALU ops
  • Multiplies vs. other ALU ops
  • Mix of scalar vs. vector ops
  • Vector computation vs. data reorganization
• Measured a set of 45 major kernels and applications in vision and imaging
• Look at key ratios to assess trends
Functions include:
• Face detection
• Fast9
• SURF
• Oriented FAST and Rotated BRIEF feature detector (ORB)
• Harris Corners
• H.265 Motion Compensation
• Haar Cascade and Classifiers
• Optical Flow
• Affine transform
• Perspective Warp
• Various filters: bilateral, denoising
• High Dynamic Range
• Color Space and format conversions
• Histogram equalization
Application Diversity Drives ISA Flexibility
• Typically several ALU ops per load operation
• Wide range of ALU : Load/store ratio (1:2 to 5:1)
• Many important functions don’t do multiplies
• A fraction have very heavy multiply usage, e.g. convolutions
• ISA should handle wide range of ratios efficiently
[Figure: Memory ops vs ALU ops. x-axis: Vector ALU Ops per Instruction (0 to 2.5); y-axis: Vector Load/Store Ops per Instruction (0 to 1.6).]
[Figure: Multiply ops vs Other ALU ops. x-axis: Multiply Vector Ops per Instruction (0 to 1); y-axis: Other ALU Vector Ops per Instruction (0 to 2.5).]
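One way to see why the operation mix matters is a simple issue-slot bound: the busiest resource sets the minimum cycles per loop iteration. The sketch below is a toy model, not a description of any Cadence core; the `Machine` slot counts and the `KernelMix` operation counts are hypothetical values chosen to span the 5:1 and 1:2 ALU-to-load/store ratios quoted above.

```python
import math
from dataclasses import dataclass


@dataclass
class Machine:
    # Hypothetical per-cycle issue limits; not taken from any real core.
    load_store_slots: int
    alu_slots: int
    multiply_slots: int


@dataclass
class KernelMix:
    # Vector operations needed per loop iteration of a kernel (illustrative counts).
    load_store_ops: int
    alu_ops: int
    multiply_ops: int


def min_cycles(kernel, machine):
    """Lower bound on cycles per iteration: the most heavily loaded resource dominates."""
    return max(
        math.ceil(kernel.load_store_ops / machine.load_store_slots),
        math.ceil(kernel.alu_ops / machine.alu_slots),
        math.ceil(kernel.multiply_ops / machine.multiply_slots),
    )


if __name__ == "__main__":
    machine = Machine(load_store_slots=2, alu_slots=2, multiply_slots=1)
    kernels = {
        "ALU-heavy (5:1)": KernelMix(load_store_ops=2, alu_ops=10, multiply_ops=0),
        "memory-heavy (1:2)": KernelMix(load_store_ops=4, alu_ops=2, multiply_ops=0),
        "multiply-heavy": KernelMix(load_store_ops=2, alu_ops=1, multiply_ops=8),
    }
    for name, mix in kernels.items():
        print(name, min_cycles(mix, machine))
```

In this model a machine tuned for one mix leaves slots idle on the others, which is the balance problem the measured ratios point to. Real schedules also depend on latencies and register pressure, which the bound ignores.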
Application Diversity Drives ISA Flexibility
• A successful architecture maximizes the fraction of kernels that can be vectorized
• A small number of functions may still use scalar ops heavily
• On-the-fly data reorganization may be important in a few kernels
• ALU : Reorg ratio varies from 10:1 to 1:1
• Efficient data reorganization boosts benefit of vectorization
[Figure: Scalar ops vs Vector ops. x-axis: Scalar Ops per Instruction (0 to 3.5); y-axis: Vector Ops per Instruction (0 to 4).]
[Figure: ALU ops vs Data Reorganization ops. x-axis: ALU Vector Ops per Instruction (0 to 2.5); y-axis: Data Reorg Vector Ops per Instruction (0 to 1).]
The Hardwired Accelerator Problem
• Certain tasks beg for immense performance; hardwired functions are tempting
• Issue 1: Changes in algorithms are hard to anticipate. Hardwired functions are often under-used on deployed systems
• Issue 2: Hardwired functions are difficult to control from software: operation start/stop, memory management, context switching
• Techniques to improve hardwired functions:
  • Flexible chaining of interface to hardwired blocks
  • More reusable primitives
  • Direct incorporation into processor ISA
    • Instruction-mapped instead of memory-mapped
[Figure: block diagram showing Processor and Accelerator.]
Pedestrian Detection Application Example
Key functions, with % of processing and the operations each requires:
• Pyramid generation (10%)
  • Fractional co-ordinate calculations (16b co-ordinates)
  • Pixel interpolations (8b values)
• Gradient magnitude and orientation calculation (25%)
  • Finite differences or Sobel (8b pixels)
  • Sum of squares (8/16b gradients)
  • Square root (16/32b values)
  • Divide (8/16b values)
  • Arctan (8/16b values)
• Histogram of Gradients calculation (25%)
  • Magnitude projection on bins (16b values)
  • Weighted histograms (16b values)
• Histogram normalization (5%)
  • L1 (sum) or L2 (sum of squares) (16b values)
  • Square root (32b values)
  • Divide (16b values)
• SVM Classifier (35%)
  • Multiply accumulate (16b values)
A good architecture supports a wide variety of operations and precisions
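To make the operation list concrete, here is a minimal sketch of the gradient, histogram, and normalization steps for a single cell. It uses floating-point Python for readability rather than the 8b/16b fixed-point arithmetic a DSP would use, and the 9-bin unsigned-orientation layout, the function names, and the `eps` term are common conventions assumed for illustration, not details taken from the presentation.

```python
import math


def cell_histogram(cell, bins=9):
    """Orientation histogram of one cell, given as a list of rows of pixel intensities."""
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    rows, cols = len(cell), len(cell[0])
    # Skip the border so every pixel has both neighbours for the central difference.
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]      # horizontal finite difference
            gy = cell[y + 1][x] - cell[y - 1][x]      # vertical finite difference
            magnitude = math.sqrt(gx * gx + gy * gy)  # sum of squares, then square root
            # Unsigned orientation folded into [0, 180) degrees.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            # The extra modulo guards against rounding that lands exactly on 180.0.
            hist[int(angle // bin_width) % bins] += magnitude  # magnitude-weighted vote
    return hist


def l2_normalize(hist, eps=1e-6):
    """L2 normalization: divide each bin by the square root of the sum of squares."""
    norm = math.sqrt(sum(v * v for v in hist) + eps)
    return [v / norm for v in hist]


if __name__ == "__main__":
    # A vertical edge: all gradient energy should fall into the first orientation bin.
    edge = [[0, 0, 10, 10] for _ in range(4)]
    print(l2_normalize(cell_histogram(edge)))
```

Even this small sketch touches subtraction, multiply, square root, arctangent, divide, and scatter-style histogram updates, which is the variety the slide is pointing at.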
Pedestrian Detection Application Example
• Camera system parameters (resolution, field of view, focal length) determine person height vs. distance
• Dynamically trade off detection latency for far-away pedestrians based on vehicle speed: higher resolution levels may not need a high frame rate
Ref: “Pedestrian Detection: An Evaluation of the State of the Art”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 34, Issue 4
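[Figure: pinhole camera geometry. A pedestrian of height H at distance D is imaged through focal length f and field of view fov onto an image height h. Accompanying plot: Detection resolution.]
Using the pinhole camera model, with fov taken as the full vertical field of view:
$$h_{pixels} = \frac{H}{D}\, f_{pixels} = \frac{H}{D} \cdot \frac{ImageHeight}{2\tan\left(\frac{fov}{2}\right)}$$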
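The pinhole relation can be evaluated directly to see how quickly a pedestrian shrinks with distance. The sketch below assumes the standard pinhole definition of focal length in pixels, with fov as the full vertical field of view; the function names and the example numbers (a 1.8 m person, a 1080-row image, a 90-degree field of view) are illustrative assumptions.

```python
import math


def focal_length_pixels(image_height_px, fov_deg):
    """Focal length in pixels: half the image height subtends half the field of view."""
    return image_height_px / (2.0 * math.tan(math.radians(fov_deg) / 2.0))


def pedestrian_height_pixels(person_height_m, distance_m, image_height_px, fov_deg):
    """Projected height in pixels of a person of height H at distance D."""
    return person_height_m / distance_m * focal_length_pixels(image_height_px, fov_deg)


if __name__ == "__main__":
    # Hypothetical camera: 1080 rows, 90-degree vertical field of view.
    for distance in (10, 20, 40, 80):
        h = pedestrian_height_pixels(1.8, distance, 1080, 90.0)
        print(f"{distance:3d} m -> {h:6.1f} px")
```

Because projected height falls off as 1/D, distant pedestrians are only resolvable at the higher resolution levels, which is where the trade-off between resolution and frame rate comes from.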
Lane Departure Warning Processing Functions
Pipeline: Camera Input → Pre-processing → Feature extraction → Post-processing → Tracking
Additional inputs: road and vehicle model; vehicle data (speed, steering)
Pre-processing:
• Color conversion
• Noise removal
• Contrast enhancement
• Steerable/Gabor filters
• Image segmentation (intensity, color)
• Pyramid generation
• Perspective warp
Feature extraction:
• Edge detection (Sobel, Canny)
• Edge magnitude and orientation
• Edge directional response
• Thresholding
• Morphology
• Corner detection (Harris, Fast, ..)
Post-processing:
• Hough transform
• Neural network
• Template matching and updating
• Road model fitting
• Outlier removal
• Connected components
Tracking:
• Kalman filter
Road and vehicle model:
• Constant curvature, parabolic
[Diagram label: High computations]
A wide variety of functions are used and must be well supported
Cascade Classifier Application Example
• Parallelism drops quickly in a traditional SIMD implementation after the early stages of the cascade
• Need an architectural approach to exploit available parallelism:
  • Distributed detection windows
  • Distributed features within a detection window
• Switch type of parallelism as you progress through the cascade:
  • parallelize over pixels in window
  • parallelize over windows
  • parallelize over features
• A good architecture supports many types of parallelism
[Figure: Parallelism in conventional SIMD processor in different cascade stages. x-axis: Cascade Stages (1 to 22); y-axis: Conventional SIMD parallelism (0 to 1).]
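The drop in parallelism can be reproduced with a toy model. The sketch below assumes one detection window per SIMD lane and that a batch keeps executing until its last surviving window is rejected; the function name and the exit-stage data are made up for illustration and are not measurements from the presentation.

```python
def simd_utilization(exit_stages, lanes, num_stages):
    """Fraction of busy lanes per cascade stage with one window per lane.

    exit_stages[i] is the last stage (1-based) that window i executes. Each batch
    of `lanes` windows occupies all lanes until every window in it has exited.
    """
    active = [0] * num_stages  # lane-slots doing useful work at each stage
    issued = [0] * num_stages  # lane-slots occupied at each stage
    for start in range(0, len(exit_stages), lanes):
        batch = exit_stages[start:start + lanes]
        for stage in range(1, max(batch) + 1):
            issued[stage - 1] += lanes
            active[stage - 1] += sum(1 for e in batch if e >= stage)
    return [a / i if i else 0.0 for a, i in zip(active, issued)]


if __name__ == "__main__":
    # Hypothetical data: most windows are rejected at stage 1, a few survive longer.
    exits = [1, 1, 1, 2, 1, 1, 3, 1]
    print(simd_utilization(exits, lanes=4, num_stages=3))
```

In this model a single surviving window keeps a whole batch of lanes occupied, so utilization collapses after the first stage. Spreading the features of one window across lanes in later stages is one of the alternative forms of parallelism listed above.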
Convolutional Neural Network (CNN) Example
• Key computational kernels in CNN are
  • Convolution (highest cost)
  • Subsampling (box filter, max pooling)
  • Non-linear function (tanh, sigmoid)
• For practical implementations, a range of tradeoffs is possible for convolutions
  • Precision tradeoffs
  • Separable kernels
  • Symmetric kernels
A good architecture supports a range of options for fast convolutions
Pipeline: Input → Convolution → Non-linearity → Sub-sampling → Repeat … → Classifier → Result (face identified)
• Convolution: models locally receptive visual cortex cells by sampling a small region and generating features
• Non-linearity: a function such as tanh models the on-off behavior of neurons
• Sub-sampling: models cells with larger receptive fields (provides local invariance)
• Repeat: repeat the previous steps for each neural network layer
• Classifier: final classifier stage
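The separable-kernel tradeoff mentioned above can be shown in a few lines. The sketch below is a plain-Python illustration, assuming "valid" boundaries and the unflipped (correlation) form that CNN layers commonly use; the function names are hypothetical. When a K×K kernel is the outer product of a column vector and a row vector, two 1D passes give the same result with 2K multiplies per output instead of K².

```python
def conv2d_valid(image, kernel):
    """Direct 2D correlation with 'valid' boundaries (no padding, no kernel flip)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [
        [
            sum(image[y + i][x + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


def separable_conv2d_valid(image, col_taps, row_taps):
    """Same result for kernel = outer(col_taps, row_taps), computed as two 1D passes."""
    horizontal = conv2d_valid(image, [row_taps])               # 1 x K pass along rows
    return conv2d_valid(horizontal, [[t] for t in col_taps])  # K x 1 pass along columns


def outer(col_taps, row_taps):
    """Build the full 2D kernel from its two 1D factors."""
    return [[c * r for r in row_taps] for c in col_taps]


if __name__ == "__main__":
    image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 17]]
    col, row = [1, 2, 1], [1, 0, -1]  # Sobel-style factors, chosen for illustration
    print(conv2d_valid(image, outer(col, row)))
    print(separable_conv2d_valid(image, col, row))
```

Trained CNN kernels are not separable in general, so using this shortcut usually means constraining or approximating the kernels, which is part of the tradeoff space.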
Wrap-Up
How to Choose a Vision Processor ISA:
• Measure on your real application; don’t just look at paper feeds-and-speeds
• Expect massive parallelism
• Look for balance and versatility in available operations
• Consider not just raw ops rate, but also ability to handle complex data organization and on-the-fly reorganization
• The compiler is part of the ISA: look at efficiency, robustness and analysis tools
• Judge hardwired accelerators by reusability on possible future applications
• Look for multi-processor support in hardware and software
Wish List
• More readily-available imaging/vision source code, including OpenVX graphs
• Open reference video streams for testing vision apps
• More substance and less hype around CNN and ADAS
• Standard input data sets
• Standard description of neural networks
• Reference trained parameters
Resources
• Cadence Imaging/Vision Products: http://ip.cadence.com/ipportfolio/tensilica-ip/image-video-processing
• Some Cadence Vision Partners
  • Morpho: http://www.morphoinc.com/en/
  • Almalence: http://www.almalence.com/
  • Irida Labs: http://www.iridalabs.gr/
  • Ittiam: http://www.ittiam.com/
  • Dream Chip: http://www.dreamchip.de/
• OpenVX: https://www.khronos.org/openvx/
Cadence, Xtensa and Tensilica are registered trademarks of Cadence Design Systems, Inc. All
other trademarks and logos are the property of their respective holders.