"designing and selecting instruction sets for vision," a presentation from cadence

Chris Rowen

12 May 2015

Designing and Selecting Instruction Sets for Vision

• A top design automation supplier:

• analog, digital and system verification tools

• interface, processor and protocol verification IP

• Leading supplier of DSP and other data-rich embedded processing cores and software: Xtensa® Innovation Platform℠

• IVP family: advanced imaging and vision DSP cores with almost 1000 library functions and applications

• Massively parallel SIMD/VLIW processors with automated configurability and extensibility of ISA, memory, and interface

• One of Fortune Magazine’s “Top 100 Places to Work”

Cadence in a Nutshell

• The Vision Performance Challenge

• The Vision Instruction Set Puzzle

• Application Diversity Drives ISA Flexibility

• The Hardwired Accelerator Problem

• Examples:

• Pedestrian Detection

• Lane Departure Warning

• Convolutional Neural Network

• Wrap-up

Outline

ADAS Processing Requirements are high

VGA: approaching 100 GOPS

The Vision Performance Challenge

Source: SoC for car navigation systems with a 53.3 GOPS image recognition engine, Hot Chips 21 (2009)

• Complexity grows an order of magnitude for full HD processing

• Accelerating algorithmic sophistication

• Scaling best addressed by

• more parallelism

• application specific optimizations

• architectural enhancements

• move to advanced process nodes

• A good architecture

• accelerates core functions

• supports a wide range of application specific optimizations

The Vision Performance Challenge

1080p60 ADAS is a teraOp problem

[Figure: Computation increase with resolution (brute force approach); x-axis: QVGA, VGA, HD, Full HD]
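As a rough illustration of the brute-force scaling, the sketch below scales a reference workload linearly with pixel rate. The 100 GOPS VGA reference comes from the slide; the 30 frames-per-second reference rate, the pixel dimensions used for each resolution name, and the function name `scale_gops` are assumptions made for this example.

```python
# Brute-force scaling sketch: cost is assumed proportional to pixels per second.
# The 30 fps reference frame rate is an assumption; the slides do not state it.
RESOLUTIONS = {
    "QVGA": (320, 240),
    "VGA": (640, 480),
    "HD": (1280, 720),
    "Full HD": (1920, 1080),
}


def scale_gops(ref_gops, ref_res, ref_fps, target_res, target_fps):
    """Scale a reference GOPS figure by the ratio of pixel rates."""
    ref_w, ref_h = RESOLUTIONS[ref_res]
    tgt_w, tgt_h = RESOLUTIONS[target_res]
    pixel_ratio = (tgt_w * tgt_h) / (ref_w * ref_h)
    return ref_gops * pixel_ratio * (target_fps / ref_fps)


if __name__ == "__main__":
    estimate = scale_gops(100.0, "VGA", 30, "Full HD", 60)
    print(f"1080p60 estimate: {estimate:.0f} GOPS")
```

Under these assumptions the estimate is 1350 GOPS, which is consistent with the slide's description of 1080p60 ADAS as a teraOp problem.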

Key dimensions:

• Local memory bandwidth

• Memory hierarchy for data streaming

• Data types

• SIMD/vector organization

• Scalar operation bandwidth

• Instruction issue parallelism (VLIW)

• Vision-specific operations

• Multi-processor support

The Vision Instruction Set Puzzle

What to look for:

1. High local memory bandwidth

2. Effective latency hiding for DDR access

3. Data types: 8b, 16b, 32b fixed-point, floating point

4. Sustained ops/cycle from combination of VLIW and SIMD

5. Vision-specific operations: 2D data access, histogram, convolution, search, non-linear functions

6. Automatic compiler inference of vectors, complex operations

7. Scale-up with custom operations

8. Scale-up with parallel cores

• Real design is full of trade-offs

• Memory reference vs. ALU

• Multiplies vs. other ALU ops

• Mix of scalar vs. vector ops

• Vector computation vs. data reorganization

• Measured a set of 45 major kernels and applications in vision and imaging

• Look at key ratios to assess trends
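The four trade-offs above can be expressed as ratios over a kernel's operation counts. The following is a minimal sketch of such a calculation; the `OpCounts` structure, the `key_ratios` function, and the sample counts are hypothetical and are not taken from the measured set of 45 kernels.

```python
from dataclasses import dataclass


@dataclass
class OpCounts:
    """Hypothetical per-kernel operation counts (e.g. from a profiler)."""
    loads_stores: int  # memory reference operations
    vector_alu: int    # vector ALU ops other than multiplies
    vector_mul: int    # vector multiply ops
    scalar: int        # scalar ops
    reorg: int         # data reorganization ops (shuffles, selects, ...)


def _ratio(numerator, denominator):
    # Avoid division by zero for kernels that lack one class of operation.
    return numerator / denominator if denominator else float("inf")


def key_ratios(c):
    """Return the four ratios named in the trade-off list."""
    alu = c.vector_alu + c.vector_mul
    return {
        "alu_per_mem": _ratio(alu, c.loads_stores),
        "mul_per_other_alu": _ratio(c.vector_mul, c.vector_alu),
        "scalar_per_vector": _ratio(c.scalar, alu + c.reorg),
        "alu_per_reorg": _ratio(alu, c.reorg),
    }


if __name__ == "__main__":
    # Illustrative counts only.
    sample = OpCounts(loads_stores=10, vector_alu=30, vector_mul=10, scalar=5, reorg=4)
    for name, value in key_ratios(sample).items():
        print(f"{name}: {value:.2f}")
```

Comparing these ratios across many kernels, rather than for a single benchmark, is what shows how wide a range the ISA has to handle.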

Application Diversity Drives ISA Flexibility

Functions include:

• Face detection

• Fast9

• SURF

• Oriented FAST and Rotated BRIEF feature detector (ORB)

• Harris Corners

• H.265 Motion Compensation

• Haar Cascade and Classifiers

• Optical Flow

• Affine transform

• Perspective Warp

• Various Filters—bilateral, denoising

• High Dynamic Range

• Color Space and format conversions

• Histogram equalization

• Typically several ALU ops per load operation

• Wide range of ALU : Load/store ratio (1:2 to 5:1)

• Many important functions don’t do multiplies

• A fraction have very heavy multiply usage, e.g. convolutions

• ISA should handle wide range of ratios efficiently

[Figure: Memory ops vs ALU ops; axis: Vector ALU Ops per Instruction, 0 to 2.5]

[Figure: Multiply ops vs Other ALU ops; axis: Multiply Vector Ops per Instruction, 0 to 1]

• A successful architecture maximizes the fraction of kernels that can be vectorized

• A small number of functions may still use scalar ops heavily

• On-the-fly data reorganization may be important in a few kernels

• ALU : Reorg ratio varies from 10:1 to 1:1

• Efficient data reorganization boosts benefit of vectorization

[Figure: Scalar ops vs Vector ops; axis: Scalar Ops per Instruction, 0 to 3.5]

[Figure: ALU ops vs Data Reorganization ops; axis: ALU Vector Ops per Instruction, 0 to 2.5]

• Certain tasks beg for immense performance; hardwired functions are tempting

• Issue 1: Changes in algorithms hard to anticipate. Hardwired functions often under-used on deployed systems

• Issue 2: Hardwired functions difficult to control from software: operation start/stop, memory management, context switching

• Techniques to improve hardwired functions:

• Flexible chaining of interface to hardwired blocks

• more reusable primitives

• Direct incorporation into processor ISA

• Instruction-mapped instead of memory-mapped

The Hardwired Accelerator Problem

[Figure: Processor and Accelerator]

Pedestrian Detection Application Example

Key Functions (% of Processing)

• Pyramid generation: 10%

• Gradient magnitude and orientation calculation: 25%

• Histogram of Gradients calculation: 25%

• Histogram normalization: 5%

• SVM Classifier: 35%

• Fractional co-ordinate calculations (16b co-ordinates)

• Pixel interpolations (8b values)

• Finite differences or Sobel (8b pixels)

• Sum of squares (8/16b gradients)

• Square root (16/32b values)

• Divide (8/16b values)

• Arctan (8/16b values)

• Magnitude projection on bins (16b values)

• Weighted histograms (16b values)

• L1 (sum) or L2 (sum of squares) (16b values)

• Square root (32b values)

• Divide (16b values)

• Multiply accumulate (16b values)

A good architecture supports a wide variety of operations and precisions
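To make the mix of operations concrete, here is a minimal pure-Python sketch of a gradient-orientation histogram for one cell. It strings together finite differences, sum of squares, square root, arctan, magnitude-weighted binning and L2 normalization. It uses floating point for readability, whereas the slide lists 8b/16b/32b fixed-point precisions; the function name `hog_cell`, the 9-bin unsigned-orientation layout, and the single-cell normalization are assumptions for illustration, not the presenter's implementation.

```python
import math


def hog_cell(pixels, num_bins=9):
    """Return an L2-normalized orientation histogram for one cell.

    pixels: 2D list of intensities (e.g. 8-bit values).
    Orientations are unsigned (0 to 180 degrees); border pixels are skipped.
    """
    height, width = len(pixels), len(pixels[0])
    hist = [0.0] * num_bins
    bin_width = 180.0 / num_bins
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Finite differences (central) in x and y.
            gx = pixels[y][x + 1] - pixels[y][x - 1]
            gy = pixels[y + 1][x] - pixels[y - 1][x]
            # Sum of squares followed by square root gives the magnitude.
            mag = math.sqrt(gx * gx + gy * gy)
            # Arctan gives the orientation, folded into [0, 180).
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            # Weighted histogram: each pixel votes with its magnitude.
            bin_index = int(angle // bin_width) % num_bins
            hist[bin_index] += mag
    # L2 normalization: sum of squares, square root, divide.
    norm = math.sqrt(sum(v * v for v in hist))
    return [v / norm for v in hist] if norm > 0 else hist


if __name__ == "__main__":
    ramp = [[10 * x for x in range(5)] for _ in range(5)]  # horizontal ramp
    print(hog_cell(ramp))
```

Even this small kernel touches most of the operation classes listed above, which is why support for a variety of operations and precisions matters more than peak multiply-accumulate rate alone.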

• Camera system parameters (resolution, field of view, focal length) determine person height vs. distance

• Dynamically trade off detection latency for far-away pedestrians based on vehicle speed: higher resolution levels may not need a high frame rate

Pedestrian Detection Application Example

Ref: “Pedestrian Detection: An Evaluation of the State of the Art”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 34, Issue 4

Detection resolution

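Using the pinhole camera model:

$$h_{pixels} = \frac{H}{D}\, f_{pixels}, \qquad f_{pixels} = \frac{ImageHeight}{2\tan(fov/2)}$$

where $H$ is the person's height, $D$ is the distance to the person, and $fov$ is the vertical field of view.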
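A small worked example of the pinhole relationship, assuming the focal length in pixels is ImageHeight / (2 tan(fov / 2)) with a vertical field of view. The function names and the sample numbers (a 1.8 m person, 1080-line image, 90 degree field of view) are illustrative assumptions, not values from the presentation.

```python
import math


def focal_length_pixels(image_height_px, vertical_fov_deg):
    """Focal length in pixels for a pinhole camera with the given vertical FOV."""
    half_fov = math.radians(vertical_fov_deg) / 2.0
    return image_height_px / (2.0 * math.tan(half_fov))


def pedestrian_height_pixels(person_height_m, distance_m, image_height_px, vertical_fov_deg):
    """Projected height of a person in pixels: h = (H / D) * f_pixels."""
    f_px = focal_length_pixels(image_height_px, vertical_fov_deg)
    return person_height_m / distance_m * f_px


if __name__ == "__main__":
    for distance in (10.0, 20.0, 40.0, 80.0):
        h = pedestrian_height_pixels(1.8, distance, 1080, 90.0)
        print(f"{distance:5.1f} m -> {h:6.1f} px")
```

Because projected height falls off as 1/D, far-away pedestrians only become detectable at the higher-resolution pyramid levels, which is the basis for trading frame rate against resolution.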

Lane Departure Warning Processing Functions

[Figure: processing pipeline; recoverable stage labels: Camera, processing, Feature extraction, processing, Tracking, Road and vehicle]

• Color conversion

• Noise removal

• Contrast enhancement

• Steerable/Gabor filters

• Image segmentation (intensity, color)

• Pyramid generation

• Perspective

• Edge detection (Sobel, Canny)

• Edge magnitude and orientation

• Edge directional response

• Thresholding

• Morphology

• Corner detection (Harris, Fast,

• Hough transform

• Neural network

• Template matching and updating

• Road model fitting

• Outlier removal

• Connected components

Vehicle (speed, steering)

Constant curvature, Parabolic

• Kalman filter

High computations

A wide variety of functions are used and must be well supported

Cascade Classifier Application Example

• Parallelism drops quickly in traditional SIMD implementation after early stages of cascade

• Need architectural approach to exploit available parallelism:

• Distributed detection windows

• Distributed features within a detection window

• Switch type of parallelism as you progress through the cascade:

• parallelize over pixels in window

• parallelize over windows

• parallelize over features

• A good architecture supports many types of parallelism

[Figure: Parallelism in conventional SIMD processor in different cascade stages; x-axis: Cascade Stages, 1 to 22]
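The drop in parallelism can be illustrated with a toy model. In the sketch below, each SIMD lane evaluates one detection window and windows are rejected at different cascade stages. It compares lane utilization when windows stay in their original lanes with utilization when surviving windows are repacked into full vectors. The rejection pattern, vector width, and function names are all hypothetical; repacking is just one possible way to switch the type of parallelism and is not presented here as the presenter's method.

```python
def trailing_zeros(n):
    """Number of trailing zero bits in a positive integer."""
    return (n & -n).bit_length() - 1


def lane_utilization(alive, width):
    """Return (fixed_lane_utilization, repacked_utilization) for one stage.

    alive: list of booleans, one per window, in original lane order.
    A vector must execute if any of its lanes holds a live window.
    """
    n_alive = sum(alive)
    if n_alive == 0:
        return (0.0, 0.0)
    groups = [alive[i:i + width] for i in range(0, len(alive), width)]
    active_groups = sum(1 for g in groups if any(g))
    repacked_groups = -(-n_alive // width)  # ceiling division
    return (n_alive / (active_groups * width),
            n_alive / (repacked_groups * width))


def simulate(num_windows=64, width=8, num_stages=5):
    """Toy cascade: window i survives as many stages as (i + 1) has trailing zero bits."""
    last_stage = [trailing_zeros(i + 1) for i in range(num_windows)]
    results = []
    for stage in range(num_stages):
        alive = [ls >= stage for ls in last_stage]
        results.append(lane_utilization(alive, width))
    return results


if __name__ == "__main__":
    for stage, (fixed, repacked) in enumerate(simulate()):
        print(f"stage {stage}: fixed lanes {fixed:.3f}, repacked {repacked:.3f}")
```

With these settings, fixed-lane utilization halves at each of the first stages (1.0, 0.5, 0.25, 0.125) while the repacked variant stays at 1.0 until fewer than one vector's worth of windows remains.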

• Key computational kernels in CNN are

• Convolution (Highest cost)

• Subsampling (box filter, max pooling)

• Non-linear function (Tanh, Sigmoid)

• For practical implementations, a range of trade-offs is possible for convolutions

• Precision tradeoffs

• Separable kernels

• Symmetric kernels

Convolutional Neural Network (CNN) Example

A good architecture supports a range of options for fast convolutions
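As one example of such an option, a separable kernel can be applied as a row pass followed by a column pass, which reduces the multiply count. The sketch below is a plain-Python illustration that counts multiplies for both approaches; the function names are hypothetical, and the operation is written as correlation (no kernel flip), which gives the same result as convolution for the symmetric kernel used here.

```python
def conv2d_valid(image, kernel):
    """Direct 2D correlation over the 'valid' region. Returns (output, multiply_count)."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    out, muls = [], 0
    for y in range(h - kh + 1):
        row_out = []
        for x in range(w - kw + 1):
            acc = 0
            for j in range(kh):
                for i in range(kw):
                    acc += image[y + j][x + i] * kernel[j][i]
                    muls += 1
            row_out.append(acc)
        out.append(row_out)
    return out, muls


def conv_separable_valid(image, col, row):
    """Apply kernel = outer(col, row) as a horizontal pass then a vertical pass."""
    horiz, muls_h = conv2d_valid(image, [row])                 # 1 x K pass
    result, muls_v = conv2d_valid(horiz, [[c] for c in col])   # K x 1 pass
    return result, muls_h + muls_v


if __name__ == "__main__":
    img = [[(x * y + x) % 7 for x in range(6)] for y in range(6)]
    col, row = [1, 2, 1], [1, 2, 1]
    full_kernel = [[c * r for r in row] for c in col]
    full, m_full = conv2d_valid(img, full_kernel)
    sep, m_sep = conv_separable_valid(img, col, row)
    print(full == sep, m_full, m_sep)
```

For a K×K separable kernel, the per-pixel cost on large images approaches 2K multiplies instead of K², so the saving grows with kernel size; the small 6×6 example understates it because of border effects.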

[Figure: CNN pipeline: Input → Convolution → Non-linearity → Sub-sampling → Repeat … → Classifier → Result (face identified)]

Convolutions model locally receptive visual cortex cells by sampling a small region and generating features

Non-linearity like tanh function models on-off behavior of neurons

Subsampling models cells with larger receptive fields (provide local invariance)

Repeat neural network layers

Classifier

How to Choose a Vision Processor ISA:

• Measure on your real application; don’t just look at paper feeds-and-speeds

• Expect massive parallelism

• Look for balance and versatility in available operations

• Consider not just raw ops rate, but also ability to handle complex data organization and on-the-fly reorganization

• The compiler is part of the ISA: look at efficiency, robustness and analysis tools

• Judge hardwired accelerators by reusability on possible future applications

• Look for multi-processor support in hardware and software

Wrap-Up

• More readily-available imaging/vision source code, including OpenVX graphs

• Open reference video streams for testing vision apps

• More substance and less hype around CNN and ADAS

• Standard input data sets

• Standard description of neural networks

• Reference trained parameters

Wish List

• Cadence Imaging/Vision Products: http://ip.cadence.com/ipportfolio/tensilica-ip/image-video-processing

• Some Cadence Vision Partners

• Morpho: http://www.morphoinc.com/en/

• Almalence: http://www.almalence.com/

• Irida Labs: http://www.iridalabs.gr/

• Ittiam: http://www.ittiam.com/

• Dream Chip: http://www.dreamchip.de/

• OpenVX: https://www.khronos.org/openvx/

Resources

Cadence, Xtensa and Tensilica are registered trademarks of Cadence Design Systems, Inc. All other trademarks and logos are the property of their respective holders.

