


ALWAYS-ON VISION PROCESSING UNIT FOR MOBILE APPLICATIONS

MYRIAD 2 IS A MULTICORE, ALWAYS-ON SYSTEM ON CHIP THAT SUPPORTS COMPUTATIONAL IMAGING AND VISUAL AWARENESS FOR MOBILE, WEARABLE, AND EMBEDDED APPLICATIONS. THE VISION PROCESSING UNIT INCORPORATES PARALLELISM, INSTRUCTION SET ARCHITECTURE, AND MICROARCHITECTURAL FEATURES TO PROVIDE HIGHLY SUSTAINABLE PERFORMANCE EFFICIENCY ACROSS A RANGE OF COMPUTATIONAL IMAGING AND COMPUTER VISION APPLICATIONS, INCLUDING THOSE WITH LOW-LATENCY REQUIREMENTS ON THE ORDER OF MILLISECONDS.

Brendan Barry, Cormac Brick, Fergal Connor, David Donohoe, David Moloney, Richard Richmond, Martin O’Riordan, Vasile Toma
Movidius

IEEE Micro, March/April 2015. Published by the IEEE Computer Society. 0272-1732/15/$31.00 © 2015 IEEE

Computer vision is beginning to transition from the laboratory to the real world in a host of applications that require a new approach to supporting the associated power, weight, and space constraints. Current computational platforms and programming paradigms derive from PC-based implementations of computer vision algorithms using conventional CPUs, GPUs, and high-bandwidth DRAMs, with programming environments like OpenCV and Matlab. Although convenient for prototyping new algorithms, this approach of building a reference implementation on a PC has no direct path to implementation in applications such as mobile phones, tablets, wearable devices, drones, and robots, where cost, power, and thermal dissipation are key concerns. Work has begun in earnest on transitioning such applications from the lab to embedded applications using conventional mobile phone application processors. However, such processors still present issues to the application developer; specifically, the hard real-time requirements imposed by managing low-latency computer vision are very difficult to satisfy on even a capable platform running an operating system such as Android, which has not been expressly designed for low latency. Additionally, satisfying the computational requirements of computer vision applications requires almost all of the computational resources on a typical application processor, including a multicore processor, digital signal processor (DSP), hardware accelerators, and GPU, leaving very little left over to run virtual reality workloads involving the GPU. Marshalling these diverse resources is possible, but it puts a major strain on both the programmer and the application processor fabric, meaning that corners must be cut in terms of quality while still straining to keep the power below 10 W. Dissipating 10 W in a drone might not be a thermal issue, but it leaves less power available for flying, whereas in a mobile phone only 3 to 4 W can be dissipated on average before the phone becomes literally too hot to handle. Clearly, across all of the possible applications, a better solution is


required to evolve to truly capable computer vision systems with intelligent local decision making for autonomous robots and drones, as well as truly immersive virtual reality experiences. The vision processing unit (VPU) solution outlined in our Hot Chips presentation1 and expanded upon here uses a combination of low-power very long instruction word (VLIW) processors with vector and SIMD operations capable of very high parallelism in a clock cycle, allied with hardware acceleration for key image processing and computer vision kernels, backed by a very high-bandwidth on-chip multicore memory subsystem. The Myriad 2 VPU aims to provide an order of magnitude higher performance efficiency, allowing high-performance computer vision systems with very low latency to be built while dissipating less than 1 W.

Always-on computer vision

The possibility of computationally fusing images has existed in the minds of scientists like Lippmann2 since the early 1900s, yet it has had to wait for Moore’s law to catch up with the enormous computational and power-efficiency requirements to be realized, especially in mobile devices. Computational imaging and visual awareness applications are shifting the imaging paradigm in mobile devices from taking pictures (and video) to making them: replacing complex optics with simplified lens assemblies and multiple apertures; combining images captured by heterogeneous sensors including RGB (red, green, blue), infrared (IR), and depth; and extracting scene metadata from still images and video streams. This revolution in image and video capture is enabling a broad range of new application areas and use cases that were previously limited to desktops and workstations rather than battery-powered mobile devices, including mobile handsets, tablets, wearables, and personal robotics. These devices support a broad range of computer vision use cases, including multiaperture (array) cameras, high-dynamic-range photography and video, multispectral imaging (for example, IR plus visible spectrum and depth sensors working in concert), gesture or facial recognition and analytics, virtual reality gaming, visual search, simultaneous localization and mapping for robotics, and many others.

Always-on computer vision has the potential to greatly enhance user experiences; however, these applications are in a constant state of flux, precluding the use of completely fixed-function hardware because the algorithms are subject to change. The current crop of heterogeneous platforms, such as the mobile phone application processors shown in Figure 1, are often used to prototype algorithms and appear limited to VGA 30 frames-per-second (fps) resolution and 6 to 8 W power dissipation (http://elinux.org/Jetson/Computer Vision Performance). Despite the impressive computational resources available in current application processors, implementing even relatively simple pipelines involves a high level of complexity in marshalling heterogeneous resources. A typical application processor contains multiple ARM processors with NEON SIMD extensions and other DSPs and GPUs, as shown in Figure 1, all of which must be coordinated and must allow collaborative access to data structures in memory through shared access or copied data. Indeed, the latency of GPU access can rule out its usage for some applications, such as low-latency virtual reality gaming. In such architectures, the power overhead of explicitly copying blocks of data (memcopy) can be on the order of 1 W, and the latency in moving frames to and from the GPU subsystem can be on the order of 5 to 10 ms, with target frame rates that can be in excess of 60 fps.
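The impact of those transfer latencies can be made concrete with a little arithmetic on the figures quoted above (an illustrative calculation, not from the article):

```python
# Illustrative arithmetic on the latency figures quoted above:
# at 60 fps, a 5-10 ms GPU round trip consumes a large share of
# the per-frame time budget before any processing happens.

frame_budget_ms = 1000 / 60                   # ~16.7 ms per frame at 60 fps
transfer_ms = (5.0, 10.0)                     # quoted memcopy/GPU latency
fractions = tuple(t / frame_budget_ms for t in transfer_ms)
# fractions ~ (0.30, 0.60): 30-60% of the frame budget is lost to data movement
```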

A typical example of this new breed of computational imaging devices and applications is Google’s Project Tango (www.google.com/atap/projecttango), which uses two Movidius Myriad 1 computational vision processors.3 Indeed, the next generation of computational vision requirements for array cameras, complex vision pipelines, and image signal processing pipelines for new sensor types such as RGB near-infrared, RGBC (red, green, blue, clear), and sensor fusion for heterogeneous sensors in devices such as Project Tango will demand exceptional levels of processing backed by high memory bandwidths for sustainable performance within a global 1-W power envelope to enable always-on user experiences (see Figure 2).

In this context, Moore’s law (Dennard planar 2D scaling) is slowing.4 Long before


we reach the purported 5-nm limit around 2020, the already high wafer prices will rule out custom silicon solutions for all but the highest-volume applications; this means that solutions will have to be highly programmable to be economically viable. Horowitz identifies power efficiency as a major issue,5 and wirelessly transmitting data for remote computation can cost up to a million times more energy per operation compared to processing locally in a device. Coupled with this, the power required for cloud-based video applications scales up to 1,000 times worse than the power consumed by local in-device computation,6 meaning that large amounts of new infrastructure would have to be built if computation is not pushed out to the network edge, and ideally right beside the image sensor. Another reason for pushing processing and even limited decision making to the network edge in applications such as virtual or augmented reality, autonomous vehicles, and robots is that datacenters have major latency issues. Privacy is clearly a major issue if video or still images are being uploaded to the cloud for processing, and, again, processing at the network edge and exchanging metadata can help mitigate privacy concerns.

The industry has an opportunity to distribute computation and build systems that can think locally and act globally by preprocessing data within personal devices using low-power processors. In this model, data is processed as close as possible to the sensor, in the same way as our human brains, eyes, and ears are located in close proximity. Only metadata rather than video is streamed to the cloud, resolving the power-efficiency, latency, and security issues and requiring radically less expensive infrastructure. The key challenge, of course, is programming such systems in a highly portable and performance- and power-efficient way.

Vision processor design challenges

While clearly interesting from a research and commercial point of view, computational imaging presents a series of fundamental

Figure 1. The architecture of a typical application processor contains multiple subsystems of dedicated ISP hardware, digital signal processors (DSPs), multicore processors, and a GPU subsystem that usually has its own private memory spaces between which data must be moved under program control or direct memory access hardware.


processing challenges: extreme computational complexity (hundreds to thousands of Gflops), a requirement for sustained performance, and high sensitivity to latency. Although these concerns are all valid, computational imaging has not been a reality in mobile devices until recently, primarily because of two power-consumption issues: the heat-dissipation potential in small-form-factor mobile phones, and the energy density of available battery technologies. Current mobile processing architectures (such as CPUs and GPUs) can’t deal efficiently with computational imaging and computer vision workloads under these constraints, and the future does not bode well as Moore’s law is slowing down, causing the power and performance benefits in the transition to the next node to decrease. In our opinion, the next decade will mark the era of special-purpose processors focused on decreasing the energy per operation in a new way.

The general design principles followed in the first-generation Myriad 1 design and extended in Myriad 2 are based on Agarwal’s observation that beyond a certain frequency limit for any particular design and target process technology, the designer starts to pay quadratically in power for linear increases in operating frequency. Thus, increased parallelism is preferred in terms of maximizing throughput per milliwatt per megahertz. The objective in the Myriad 2 design in 28-nm process technology was to achieve 20 to 30× the processing per watt achieved in 65 nm by the previous-generation Myriad 1.3 A 5× performance increase compared to Myriad 1 was achieved by increasing the number of Streaming Hybrid Architecture Vector Engine (SHAVE) vector processors from 8 to 12 and increasing the clock rate from 180 to 600 MHz. The remaining 15 to 25× performance increase was achieved by including 20 hardware accelerators in Myriad 2, two of which can output one fully computed output pixel per cycle. The hardware accelerator rationale fits with the Moore’s law trend below 28 nm, because memory access is expensive in terms of

Figure 2. The requirements of always-on computer vision applications extend to all parts of the image and video-processing chain, from novel and heterogeneous sensors to array cameras to dedicated image and video processing acceleration to vector and SIMD processors and high-level APIs that let programmers extract performance from the underlying hardware and firmware.


power, and memory is now scaling at 20 to 30 percent (down from 50) versus 50 percent for logic below the 28-nm node.7 Generally, this means that the chip architect must look at trading arithmetic operations per memory access for lower memory storage and bandwidth requirements, which also has the benefit of globally lower power.8
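Agarwal’s observation above can be illustrated with the standard dynamic-power model (a sketch for intuition only, not taken from the article): if supply voltage must rise roughly linearly with frequency beyond the efficiency knee, dynamic power grows roughly cubically with frequency, so energy per operation grows quadratically for a merely linear gain in throughput.

```python
# Illustrative model of the quadratic power penalty (assumes V scales
# linearly with f beyond the efficiency knee; relative units throughout).

def dynamic_power(freq: float) -> float:
    """Relative dynamic power ~ C * V^2 * f, with V proportional to f."""
    voltage = freq                    # hypothetical linear V-f relationship
    return voltage ** 2 * freq

def energy_per_op(freq: float) -> float:
    """Relative energy per operation = power / throughput (~ f)."""
    return dynamic_power(freq) / freq

# Doubling frequency doubles throughput but quadruples energy per op,
# which is why Myriad 2 favors 12 parallel SHAVEs at a moderate clock.
```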

Myriad 2 system architecture

As Figure 2 shows, application processor systems on chip (SoCs) are typically based on one or more 32-bit reduced-instruction-set computing (RISC) processors surrounded by hardware accelerators that share a common multilevel cache and DRAM interface. This shared infrastructure is attractive from a cost perspective but creates major bottlenecks in terms of memory access, where multimedia applications demand real-time performance but must contend with user applications and a platform OS such as Android. Thus, the platform often underdelivers in terms of computational video performance or power efficiency. A more attractive model is to build a software-programmable architecture as a coprocessor to an application processor, rather than using fixed-function or configurable hardware.1 Such a coprocessor can then abstract away many of the hard real-time requirements of dealing with multiple camera sensors and accelerometers from the application processor, making the entire ensemble appear as a single super-intelligent MIPI camera below the Android camera hardware abstraction layer. As a result, the architecture presented in this article focuses on power-efficient operation, as well as area efficiency, allowing product derivatives to be implemented entirely in software where previously hardware and mask costs would have been incurred. The software-programmable model is especially interesting in areas where standards do not exist, such as body-pose estimation.9

As a VPU SoC, Myriad 2 has a software-controlled, multicore, multiported memory subsystem and caches that can be configured to handle a large range of workloads and provide high, sustainable on-chip data and instruction bandwidth to support the 12 SHAVE processors, two RISC processors, and high-performance video hardware accelerator filters. A multichannel (or multiagent) direct memory access engine offloads data movement between the processors, hardware accelerators, and memory, and a large range of peripherals, including cameras, LCD panels, and mass storage, communicate with processors and hardware accelerators. Additional programmable hardware acceleration helps speed up hard-to-parallelize functions required by video codecs, such as H.264 CABAC, VP9, and SVC, as well as video-processing kernels for always-on computer vision applications. Up to 12 independent high-definition cameras can be connected to 12 programmable MIPI D-PHY lanes supporting CSI-2, organized in six pairs, each of which can be independently clocked, to provide an aggregate bandwidth of 18 Gbits per second. The device also contains a USB 3.0 interface and a gigabit Ethernet media access controller, as well as various interfaces such as Serial Peripheral Interface (SPI), Inter-Integrated Circuit (I2C), and Universal Asynchronous Receiver Transmitter (UART), connected to a reduced number of I/O pins using a tree of software-controlled multiplexers. In this way, these interfaces support a broad range of use cases in a low-cost plastic ball-grid-array package with integrated 2- to 4-Gbit low-power DDR2/3 synchronous DRAM stacked in package, using a combination of flip-chip bumping for the VPU die and wire bonding for the stacked DRAM.
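To put the 18 Gbits per second of aggregate MIPI bandwidth in perspective, a quick estimate of per-camera demand is useful. Note that the sensor format here (10-bit RAW at 1080p30) is an assumption for illustration; the article does not specify one.

```python
# Rough sizing of the 18 Gbits/s aggregate MIPI CSI-2 bandwidth.
# The 10-bit RAW 1080p30 stream format is an assumed example, not
# something the article specifies.

def camera_gbps(width: int, height: int, fps: int, bits_per_pixel: int) -> float:
    """Raw sensor bandwidth in Gbits/s (ignoring protocol overhead)."""
    return width * height * fps * bits_per_pixel / 1e9

per_cam = camera_gbps(1920, 1080, 30, 10)   # ~0.62 Gbits/s per camera
max_cams = int(18 / per_cam)                # bandwidth for ~28 such streams,
                                            # though the device caps at 12 cameras
```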

Because power efficiency is paramount, the device employs a total of 17 power islands, including one for each of the 12 integrated SHAVE processors, allowing very fine-grained power control in software. The device supports 8-, 16-, 32-, and some 64-bit integer operations as well as fp16 (OpenEXR) and fp32 arithmetic, and it can aggregate 1,000 Gflops (fp16). The resulting architecture offers increased performance per watt across a broad spectrum of computer vision and computational imaging applications, from augmented reality10 to simultaneous localization and mapping.11

SHAVE processor microarchitecture

To guarantee sustained high performance and minimize power use, the Movidius


SHAVE processor contains wide and deep register files coupled with a variable-length long instruction word (VLLIW) controlling multiple functional units, including extensive SIMD capability for high parallelism and throughput at both a functional-unit and processor level. The SHAVE processor is a hybrid stream-processor architecture combining the best features of GPUs, DSPs, and RISC, with 8-, 16-, and 32-bit integer and 16- and 32-bit floating-point arithmetic as well as unique features such as hardware support for sparse data structures. The architecture maximizes performance per watt while maintaining ease of programmability, especially in terms of support for design and porting of multicore software applications.

As Figure 3 shows, VLIW packets control multiple functional units that have SIMD capability for high parallelism and throughput at a functional-unit and processor level. The functional units are the predicated execution unit (PEU), the branch and repeat unit (BRU), two 64-bit load-store units (LSU0 and LSU1), the 128-bit vector arithmetic unit (VAU), the 32-bit scalar arithmetic unit (SAU), the 32-bit integer arithmetic unit (IAU), and the 128-bit compare-move unit (CMU), each of which is enabled separately by a header in the variable-length instruction. The functional units are fed with operands from a 128-bit × 32-entry vector register file (VRF) with 12 ports and a 32-bit × 32-entry general register file (IRF) with 18 ports, delivering an aggregate 1,900 Gbytes per second (GBps) of SHAVE register bandwidth across all 12 SHAVEs at 600 MHz. The additional blocks in the diagram are the instruction decode (IDC) and the debug control unit (DCU). A

Figure 3. The Streaming Hybrid Architecture Vector Engine (SHAVE) processor microarchitecture is an eight-slot very long instruction word processor capable of performing multiple 128-bit vector operations in parallel with multiple load/store, scalar floating-point, integer, and control-flow operations in a single clock cycle.


constant instruction fetch width of 128 bits for variable-length instructions (with an average instruction width of around 80 bits) packed contiguously in memory, together with the five-entry instruction prefetch buffer, guarantees that at least one instruction is always ready while taking account of branches.
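The quoted 1,900 GBps of aggregate register bandwidth can be checked from the port counts given above (a back-of-envelope sketch; decimal gigabytes assumed):

```python
# Back-of-envelope check of the quoted aggregate SHAVE register
# bandwidth, using the port counts and clock given in the text.

VRF_PORTS, VRF_BITS = 12, 128    # vector register file: 12 ports x 128 bits
IRF_PORTS, IRF_BITS = 18, 32     # general register file: 18 ports x 32 bits
SHAVES, FREQ_HZ = 12, 600e6

bytes_per_cycle = (VRF_PORTS * VRF_BITS + IRF_PORTS * IRF_BITS) / 8  # 264 bytes
aggregate_gbps = bytes_per_cycle * SHAVES * FREQ_HZ / 1e9
# ~1,901 GBps, matching the quoted ~1,900 GBps
```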

Computer-vision hardware acceleration

On Myriad 1, SHAVE implementations of software filters offered, on average, in the range of 1.5 to tens of cycles per pixel (the fastest kernel was a downscaling Gaussian 5×1 at 0.675 cycles/pixel). With Myriad 2, designers made a concerted effort to profile the key performance-limiting kernels and define hardware accelerators that could be flexibly combined into hybrid hardware/software pipelines for computer vision and advanced imaging applications. These hardware accelerators support context switching in order to allow the processing of multiple parallel camera streams. For Myriad 2, the global performance target was to achieve 20 to 30× the performance of the previous-generation Myriad 1 VPU.
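Cycles-per-pixel figures translate directly into frame-rate headroom. The conversion below is illustrative; it assumes Myriad 2’s 600-MHz clock from elsewhere in the article, and the 10-cycles/pixel “typical” filter is a hypothetical mid-range case from the quoted 1.5-to-tens range.

```python
# Converting cycles-per-pixel into throughput (illustrative; assumes
# the 600-MHz Myriad 2 clock; the 10-cycles/pixel case is hypothetical).

def mpix_per_s(cycles_per_pixel: float, clock_hz: float) -> float:
    """Pixel throughput in Mpixels/s for one processor."""
    return clock_hz / cycles_per_pixel / 1e6

NEED_1080P60 = 1920 * 1080 * 60 / 1e6    # ~124 Mpixels/s for 1080p at 60 fps

fast = mpix_per_s(0.675, 600e6)          # fastest software kernel: ~889 Mpixels/s
typical = mpix_per_s(10, 600e6)          # 60 Mpixels/s: under the 1080p60 need,
                                         # hence the case for hardware accelerators
```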

The programmable hardware accelerators implemented in the Myriad 2 VPU include a polyphase resizer, lens-shading correction, Harris corner detector, Histogram of Oriented Gradients edge operator, convolution filter, sharpening filter, gamma correction, tone mapping, and luminance and chrominance denoising. Each accelerator has multiple memory ports to support its memory requirements and local decoupling buffers to minimize instantaneous bandwidth to and from the 2-Mbyte multicore memory subsystem. A local pipeline controller in each filter manages the read and writeback of results to the memory subsystem. The filters are connected to the multicore memory subsystem via a crossbar, and each filter can output one fully computed pixel per cycle for the input data, resulting in an aggregate throughput of 600 Mpixels per second at 600 MHz.

Multicore memory subsystem

The ability to combine image processing and computer vision processing into pipelines consisting of hardware and software elements was a key requirement, because current application processors are limited in this respect. Thus, sharing data flexibly between SHAVE processors and hardware accelerators via the multiported memory subsystem was a key challenge in the design of the Myriad 2 VPU. In the 28-nm Myriad 2 VPU, 12 SHAVE processors, hardware accelerators, shared data, and SHAVE instructions reside in a shared 2-Mbyte memory block called Connection Matrix (CMX) memory, which can be configured to accommodate different instruction and data mixes depending on the workload. The CMX block comprises 16 blocks of 128 Kbytes, each of which in turn comprises four 32-Kbyte RAM instances organized as 4,096 words of 64 bits each, which are independently arbitrated, allowing each RAM block in the memory subsystem to be accessed independently. The 12 SHAVEs acting together can move (theoretical maximum) 12 × 128 bits of code and 24 × 64 bits of data, for an aggregate CMX memory bandwidth of 3,072 bits per cycle (1,536 bits of data). This software-controlled multicore memory subsystem and caches can be configured to handle many workloads, providing high sustainable on-chip bandwidth with a peak of 307 GBps (sixty-four 64-bit ports operating at 600 MHz) to support data and instruction supply to the 12 SHAVE processors and hardware accelerators (see Figure 4). Furthermore, the CMX subsystem supports multiple traffic classes, from latency-tolerant hardware accelerators to latency-intolerant SHAVE vector processors, allowing construction of arbitrary pipelines from a mix of software running on SHAVEs and hardware accelerators, which can operate at high, sustained rates on multiple streams simultaneously without performance loss and at ultra-low power levels.
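The CMX figures above are internally consistent, as a quick sanity check shows (illustrative arithmetic on the numbers quoted in the text):

```python
# Sanity-checking the CMX bandwidth figures quoted in the text.

code_bits = 12 * 128          # 12 SHAVEs fetching 128-bit instruction words
data_bits = 24 * 64           # two 64-bit LSU ports per SHAVE x 12 SHAVEs
total_bits_per_cycle = code_bits + data_bits   # 3,072 bits per cycle

peak_gbps = 64 * 64 / 8 * 600e6 / 1e9          # sixty-four 64-bit ports
                                               # at 600 MHz -> 307.2 GBps
```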

Myriad 2 implementation

Figure 5 shows a die plot of the 27-mm2 device, highlighting the major functional blocks. Because power efficiency is paramount in mobile applications, Myriad 2 provides extensive clock and functional-unit gating and support for dynamic clock and voltage scaling for dynamic power reduction. It also contains 17 power islands: one for each of the 12 SHAVE processors; one for


the CMX memory subsystem; one for the media subsystem, including video hardware accelerators and RISC1; one for RISC2 and peripherals; one for the clock and power management; and one always-on domain. This allows fine-grained power control in software with minimal latency to return to normal operating mode, including maintenance of the static RAM (SRAM) state, which eliminates the need to reboot from external storage. The device has been designed to operate at 0.9 V for 600 MHz.

Myriad 2 applications

The SHAVE DSP supports streaming workloads from the ground up, making decisions about pixels or groups of pixels as they are processed by the 128-bit VAU. Because 128-bit comparison using the CMU and predication using the PEU can be performed in parallel with the VAU, higher performance can be achieved than on a GPU, where decisions about streaming data incur performance loss due to branch divergence. For example, the SHAVE processor excels when processing the FAST9 algorithm, in which 25 pixels on a Bresenham circle around the center pixel must be evaluated for each pixel in, for instance, a 1080p frame. The FAST9 algorithm looks for nine contiguous pixels out of 25 on a Bresenham circle around the center that are above or below a center-pixel luminance value, meaning hundreds of operations must be computed for each pixel in a

[Figure 4 block diagram labels: 12 SHAVE cores, each with 4 x 32-Kbyte slices of the 2-Mbyte multiported CMX RAM; SIPP hardware filters (RAW, lens-shading correction, Debayer, chroma and luminance denoising, sharpen filter, polyphase scaler, edge operator, convolution kernel, LUT, median filter, color combination, Harris corner); RISC1 (4/4-Kbyte L1, 32-Kbyte L2) and RISC2 (32/32-Kbyte L1, 128-Kbyte L2, 64-Kbyte ROM); 256-Kbyte L2 cache; 32 hardware mutexes; 17 independent power islands; Inter-SHAVE Interconnect (ISI); AMC crossbar; DDR controller to stacked KGD 1-8 Gbit LP-DDR2/3; and software-controlled I/O multiplexing (MIPI D-PHY x12 lanes, NAL, LCD, CIF, USB3 OTG, SPI x3, I2C x3, I2S x3, SDIO, 1-Gb Ethernet, UART, JTAG).]

Figure 4. Myriad 2 vision processing unit (VPU) system on chip (SoC) block diagram shows the 12 SHAVE cores and associated Inter-SHAVE Interconnect, located below the multicore memory subsystem (connection matrix, or CMX). Above the CMX are the hardware accelerators for computer vision and image processing, controlled by a first reduced-instruction-set computing (RISC) processor, and above that again are the I/O peripherals, which are controlled by a second RISC processor. Finally, all of the processors in the system share access to the 64-bit DRAM interface.

.................................................................

MARCH/APRIL 2015 63


high-definition image. This requires hundreds of instructions on a scalar processor, and performance optimization requires the use of machine learning and training to improve detector performance.12
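For reference, the scalar segment test that makes FAST9 so expensive per pixel can be sketched as follows; this is a plain Python illustration of the algorithm from reference 12, not the SHAVE code:

```python
# Scalar FAST9 segment test: a pixel is a corner if 9 contiguous pixels
# on the 16-point Bresenham circle (radius 3) are all brighter than
# center + t or all darker than center - t.

# Offsets of the 16-pixel Bresenham circle of radius 3.
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2),
          (1, 3), (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1),
          (-2, -2), (-1, -3)]

def is_fast9_corner(img, x, y, t):
    c = img[y][x]
    ring = [img[y + dy][x + dx] for dx, dy in CIRCLE]
    for sign in (+1, -1):                 # brighter pass, then darker pass
        run = 0
        # Walk the circle twice so wrap-around runs are counted.
        for p in ring + ring:
            run = run + 1 if sign * (p - c) > t else 0
            if run >= 9:
                return True
    return False
```

Even in this compact form, every pixel pays for up to 32 compares per pass; the SHAVE formulation below collapses that work into a handful of wide VLIW packets.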

GPU-Optimized FAST (GO-FAST) is an adaptation of FAST for SHAVE, making use of decomposition and code optimization to produce identical outputs efficiently on stream processors. The SHAVE assembler code shown in Figure 6 illustrates the operations required to find a contiguous segment of nine pixels on a circle surrounding a center pixel.

On SHAVE, conversely, the massive parallelism allows the core of the FAST9 algorithm to be implemented in three lines of VLIW assembler using a brute-force implementation that evaluates eight rotations per VLIW packet using three VLIW slots (PEU, VAU, and CMU). The result is that a single SHAVE processor running at less than 200 MHz can implement the FAST9 detector operating on 1080p input frames at over 60 Hz while detecting hundreds of feature points.
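The bitwise decomposition that GO-FAST exploits can be shown in scalar form: build a 16-bit brighter/darker mask for the circle, then AND the mask with eight of its rotations, so that any surviving bit marks a run of nine contiguous set bits. The sketch below is illustrative Python, not the SHAVE kernel, but it is the same decomposition the VAU.AND/CMU.CMVV/PEU.ORACC slots evaluate in parallel:

```python
def rot16(m, k):
    """Rotate a 16-bit mask left by k positions."""
    return ((m << k) | (m >> (16 - k))) & 0xFFFF

def has_9_contiguous(mask):
    # AND the mask with its eight rotations; a surviving bit marks the
    # end of a run of 9 contiguous set bits on the circle.
    acc = mask
    for k in range(1, 9):
        acc &= rot16(mask, k)
    return acc != 0

def fast9_bitwise(ring, c, t):
    """ring: the 16 circle pixels; c: center pixel; t: threshold."""
    bright = sum(1 << i for i, p in enumerate(ring) if p > c + t)
    dark = sum(1 << i for i, p in enumerate(ring) if p < c - t)
    return has_9_contiguous(bright) or has_9_contiguous(dark)
```

Because the rotate-and-AND chain is branch-free, it maps cleanly onto wide vector slots with no divergence penalty, which is exactly where stream processors like SHAVE win over the scalar formulation.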

The following list shows the code implementing the 528 (16 + 4 x 128) equivalent


Figure 5. Myriad 2 VPU SoC die plot and 6 x 6 mm ball-grid array packaging. In the die plot, the SHAVE processors and hardware accelerators are clustered around the CMX memory system. On the outer rim of the die, the DRAM interface and MIPI, USB, and other peripherals, as well as phase-locked loops, dominate the periphery of the die, with the remainder of the die being occupied by the RISC processor and peripheral subsystems.

[Figure 6 table: a four-cycle VLIW schedule across the PEU, VAU, and CMU slots; two CMU.CMVV.i16 compares and two VAU.AND.u16 operations feed PEU.ORACC, with a final PEU.PC8C.OR that executes mark_as_corner.]

Figure 6. SHAVE very long instruction word mapping of GPU-Optimized FAST (kernel) to the microarchitecture. Here, the parallelism using the predicated execution unit, vector arithmetic unit, and compare-move unit allows eight of the 16 possible rotations about the center pixel to be evaluated in a single cycle, a complex sequence of operations that would require tens of cycles on a conventional machine.

..............................................................................................................................................................................................

HOT CHIPS

.................................................................

64 IEEE MICRO


operations, resulting in a performance efficiency of 1,188 GOPS/W (16-bit), using all 12 SHAVEs at the maximum 600-MHz clock rate, under typical process, voltage, and temperature conditions, for a total power dissipation of 800 mW, including peripherals and stacked DRAM:

- PEU: 8.
- VAU: 128.
- CMU: 128.
- Operations/corner: 528.
- Operations/cycle: 132.
- MHz: 600.
- SHAVEs: 12.
- mW: 800.
- GOPS/W Myriad 2: 1,188.
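As a quick sanity check, the 1,188 GOPS/W figure follows directly from the numbers above:

```python
# Reproduce the 1,188 GOPS/W figure from the listed parameters.
ops_per_corner = 16 + 4 * 128       # 528 equivalent operations per pixel
cycles_per_corner = 4               # the four-cycle VLIW schedule (Figure 6)
ops_per_cycle = ops_per_corner / cycles_per_corner   # 132
freq_hz = 600e6                     # maximum SHAVE clock rate
shaves = 12
power_w = 0.800                     # includes peripherals and stacked DRAM

gops = ops_per_cycle * freq_hz * shaves / 1e9        # 950.4 GOPS (16-bit)
gops_per_watt = gops / power_w
print(round(gops_per_watt))         # 1188
```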

As we previously mentioned, applications such as augmented reality10 require low latency. To this end, the Myriad 2 VPU has hard real-time support for low-latency computer vision, which is essential for line-sync-based, super-low-latency processing. This is enabled by deterministic data access to 2 Mbytes of on-chip SRAM and the stacked DDR, which is dedicated to computer vision tasks without the CPU contention typically seen in application processors, where multiple ARM cores running Android compete with DSPs and GPUs. Finally, the VPU has low-latency (less than 2 μs worst-case) interrupt handling with no contention with peripherals such as USB or SPI, and 64-bit timestamp support. As a result, an OpenCV-compatible, multiscale Haar cascade with 20 stages, computed using 12 SHAVEs and one hardware accelerator, can calculate 50,000 multiscale classifications for each 1080p-resolution frame in less than 7 ms.
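A rough back-of-envelope view of what that figure implies in throughput terms:

```python
# Throughput implied by the Haar-cascade figure: 50,000 multiscale
# classifications per 1080p frame, each frame in under 7 ms.
classifications = 50_000
frame_time_s = 0.007                      # worst-case per-frame budget

rate = classifications / frame_time_s     # classifications per second
fps_headroom = 1 / frame_time_s           # frames/s at this budget

print(f"{rate / 1e6:.1f} M classifications/s")   # 7.1 M classifications/s
print(f"{fps_headroom:.0f} frames/s headroom")   # ~143 frames/s
```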

The Myriad 2 microarchitecture presented here is a first generation of a true VPU, which we believe will become as synonymous with vision processing as the GPU is with graphics processing. The future challenge in evolving the VPU concept will be achieving yet higher processing efficiency per milliwatt. Improving absolute efficiency will rely on improved analysis and microarchitectural enhancements to reduce the power dissipated per clock cycle. Larger gains in processing efficiency may come through using hardware-assisted event filtering in the always-on power domain to detect inertial-measurement-unit or other events and to power up the Myriad hardware accelerators and SHAVE processors only when it is highly likely that there is useful visual data to process. This will reduce the average power dissipation for many interesting use cases in wearable cameras, phones and tablets, and even small drones or robots. We also expect future Myriad variants to provide enhanced support for visual object detection and classification, as well as for path finding and artificial intelligence, by providing enhanced hardware and software support for deep learning and convolutional neural networks. This will allow local low-latency decisions to be made on the basis of the output of the vision processing pipeline, which is essential for the safe operation of autonomous vehicles, robots, and drones that interact closely with humans. MICRO

Acknowledgments

The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 611183 (EXCESS Project, www.excess-project.eu).

References
1. D. Moloney et al., "Myriad 2: 'Eye of the Computational Vision Storm,'" Hot Chips 26, 2014.
2. "Integral Photography: A New Discovery by Professor Lippmann," Scientific American, 19 Aug. 1911, p. 164.
3. D. Moloney, "1TOPS/W Software Programmable Media Processor," Hot Chips 23, 2011.
4. A. Lal Shimpi, "Intel: By 2020 the Size of Meaningful Compute Approaches Zero," AnandTech, 10 Sept. 2012; www.anandtech.com/show/6253/intel-by-2020-the-size-of-meaningful-compute-approaches-zero.
5. J. Lipsky, "ISSCC Keynote: No Silicon, Software Silos," EE Times, 5 Sept. 2014; www.eetimes.com/document.asp?doc_id=1320954.
6. R. Merritt, "Google Cloud Needs Lower Latency," EE Times, 10 Feb. 2014; www.eetimes.com/document.asp?doc_id=1320945.
7. Z. Or-Bach, "28nm—The Last Node of Moore's Law," EE Times, 19 Mar. 2014; www.eetimes.com/author.asp?doc_id=1321536.
8. D. Bates et al., "Exploiting Tightly-Coupled Cores," J. Signal Processing Systems, Aug. 2014, doi: 10.1007/s11265-014-0944-6.
9. J. Shotton et al., "Real-Time Human Pose Recognition in Parts from Single Depth Images," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2011, pp. 1297-1304.
10. J. Toledo et al., "Augmented Reality," Design of Embedded Augmented Reality Systems, S. Maad, ed., InTech, 2010.
11. D. Scaramuzza and F. Fraundorfer, "Visual Odometry: Part I—The First 30 Years and Fundamentals," IEEE Robotics and Automation Mag., vol. 18, no. 4, 2011, pp. 80-92.
12. E. Rosten and T. Drummond, "Machine Learning for High-Speed Corner Detection," Proc. European Conf. Computer Vision, 2006, pp. 430-443.

Brendan Barry is vice president of engineering at Movidius, where he leads development of hardware and design tools and contributes to the Myriad and SHAVE microarchitecture and physical design. His research interests include hardware computer-vision acceleration and processor and SoC architecture. Barry has a BEng in electronic engineering from Dublin City University. Contact him at [email protected].

Cormac Brick is vice president of software engineering at Movidius, where he works on integrating Movidius Myriad technology to enable new computer vision and imaging products. His research interests include computer vision, sensor fusion, and artificial intelligence. Brick has a BEng in electronic engineering from University College Cork. Contact him at [email protected].

Fergal Connor is a staff engineer at Movidius, where he is working on the next generation of the SHAVE processor. His research interests include power-efficient processor architectures and multicore processor design. Connor has an MEng in electronic engineering from Dublin City University. Contact him at [email protected].

David Donohoe is an embedded software architect at Movidius, where he leads the research, development, and software specification activities around high-throughput pixel-processing pipelines targeting embedded and mobile applications. Donohoe has a BEng in electronic engineering from Dublin City University. Contact him at [email protected].

David Moloney is the chief technology officer of Movidius. His research interests include processor architecture; computer vision algorithms, hardware acceleration, and systems; hardware and multiprocessor design for DSP communications; high-performance computing; and multimedia applications. Moloney has a PhD in mechanical engineering on high-performance computer architecture from Trinity College Dublin. Contact him at [email protected].

Richard Richmond is a staff engineer at Movidius. His research interests include low-power hardware acceleration of computationally intensive image signal processing and computer vision algorithms. Richmond has an MEng in electronic engineering from Edinburgh University. Contact him at [email protected].

Martin O'Riordan is the chief compiler architect at Movidius, where he leads SHAVE processor compiler development based on the LLVM infrastructure and also worked on the SHAVE instruction set architecture and Myriad microarchitecture. O'Riordan has a BSc in computer science from Trinity College Dublin. Contact him at [email protected].

Vasile Toma is a design engineer at Movidius. His research interests include chip-card security, SoC and processor architecture, DSP, and computer vision. Toma has a BEng in electronic engineering from the Politehnica University of Timisoara, Romania. Contact him at [email protected].
