
14.1: A 2.9 TOPS/W Deep Convolutional Neural Network SoC in FD-SOI 28nm for Intelligent Embedded Systems

© 2017 IEEE International Solid-State Circuits Conference 1 of 32

Giuseppe Desoli, Nitin Chawla, Thomas Boesch, Surinder-pal Singh,

Elio Guidetti, Fabio De Ambroggi, Tommaso Majo, Paolo Zambotti,

Manuj Ayodhyawasi, Harvinder Singh, Nalin Aggarwal

STMicroelectronics




Outline

• Introduction
• Chip architecture
• DCNN mapping
• Hardware accelerators
• Physical implementation
• Results




Deep Convolutional Neural Networks: a key enabler
• DCNNs excel in many computer vision applications
• Being applied to other tasks too
• Can become a key component of ‘intelligent’ IoT networks for edge computing
• BUT …
[Figure: object detection and scene classification examples (tags: Coffee, Croissant, Beverage, Morning, Breakfast, Food). Image credits: Stanford's CS 231N course by Andrej Karpathy and Justin Johnson; Clarifai.com; LSVRC2014 images]



DCNN’s Complexity Evolution

Network (year)       Layers   Operations (GOPS)   Parameters (Millions)
ANNs (1997-2007)     3        0.0002              0.01
AlexNet (2012)       7        1.0                 60
GoogLeNet (2014)     22       1.5                 50
VGG19 (2014)         19       19.6                138
ResNet (2015)        152      11.3                150



AlexNet basics
Krizhevsky et al., NIPS 2012

Layer        Parameters   Operations
CONV 11x11   35K          105M
CONV 5x5     307K         223M
CONV 3x3     884K         149M
CONV 3x3     649K         224M
CONV 3x3     442K         74M
FC           37M          37M
FC           16M          16M
FC           4M           4M

Tot. Parameters: ~ 60M
Tot. Operations: 832 M
[Figure: layer pipeline with RELU, NORM and POOL stages after convolutions and RELU after FC layers; legend: Pooling, Kernels, Feature Maps, Activation]
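The per-layer figures on this slide can be checked against the stated totals. The sketch below simply sums the values as read from the slide; the variable names are illustrative and not part of the original material.

```python
# Per-layer figures as read from the slide, first to last layer.
conv_params_k = [35, 307, 884, 649, 442]   # thousands of parameters, five CONV layers
fc_params_m = [37, 16, 4]                  # millions of parameters, three FC layers
conv_ops_m = [105, 223, 149, 224, 74]      # millions of operations, five CONV layers
fc_ops_m = [37, 16, 4]                     # millions of operations, three FC layers

total_params_m = sum(conv_params_k) / 1000 + sum(fc_params_m)
total_ops_m = sum(conv_ops_m) + sum(fc_ops_m)

print(f"parameters: {total_params_m:.1f} M")   # 59.3 M, i.e. the "~60M" on the slide
print(f"operations: {total_ops_m} M")          # 832 M
```

The convolutional layers hold only about 2.3M of the parameters, while the FC layers hold the remaining 57M.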



A complete SoC for DCNN applications
Enabling use of DCNNs in embedded systems:
• Power efficiency, low cost
• Efficient memory hierarchy
• Flexibility to adapt to different DCNN topologies
• Input/output capability
• Integrated with state-of-the-art Deep Learning tools
[Block diagram: full xBAR (64 bits) connecting SRAM, HW Acc subsystem, host, peripherals + external memory, and two groups of x4 DSP clusters]



DSP Cluster
• 8 DSP clusters, each with 2 32-bit DSPs, 4-way 16KB I-Cache, 64KB Local RAM and a shared 64KB RAM
• (6uW/[email protected]) up to 1GHz in ST FD-SOI 28nm technology
• ISA extensions for DCNN execution
• DMA: 2D-DMA with independent channels, linked list, stride, padding, etc.
[Block diagram: two STRED5 cores, each with I+D MEM 64K, LXBAR and I$ 16K, sharing C-MEM 64K through CXBAR]



Memory Hierarchy
• 4 MB (4x16x64KB) of shared RAM, grouped with a 64-bit bus port per bank to sustain peak DCNN throughput
• L2 SW-controlled cache for feature maps and parameters; fits an AlexNet level of complexity (compressed)
• Each 64KB cut has individual sleep line control to activate it on demand
Energy/power per word access: local SRAM 1x, on-chip SRAM 10x, LPDDR 100x



HW Accelerators subsystem
• 8 Conv Accelerators
• 16 CDNN specialized streaming DMA engines
• Additional IPs: H264, MJPEG, 2 Census, 2 croppers, corner detector, 4 color conv, 4 sensor input IFs, 1 DVI output IF, digital MIC array IF
• Configurable framework supports data-flow based processing
[Block diagram: stream switch connecting the bus interface, DMA 0 to DMA 15, CA 0 to CA 7, sensor IFs, display, H264, JPEG and other IPs]



AlexNet HW/SW partitioning
• CONV layers: 1 Conv Acc → 36 MACs
• Non-conv layers to DSPs to accommodate DCNN future evolution (leaky RELU, etc.)
Tot. Operations: 832 M
[Figure: AlexNet layer pipeline split between HW Acc and DSP, annotated "85-90% of total operations"; per-layer operations: 105M 223M 149M 224M 74M 37M 16M 4M]



AlexNet memory footprint
• On-chip SRAM
  – 2318 KB for parameters (8 bits)
  – 1436 KB for feature maps (16 bits)
• ~10 MB of external RAM for FC layers
Tot. Parameters: ~ 60M
[Figure: AlexNet layer pipeline split between SRAM and EXT MEM; per-layer parameters: 35K 307K 884K 649K 442K 37M 16M 4M]
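As a quick check that this footprint fits the 4 MB (4x16x64KB) shared RAM described on the Memory Hierarchy slide, the sketch below adds up the two on-chip figures. The count of 64KB cuts is an illustration that assumes the data is packed without gaps, which the slides do not state.

```python
# Shared RAM organisation from the Memory Hierarchy slide: 4 x 16 x 64KB.
banks, cuts_per_bank, cut_kb = 4, 16, 64
sram_kb = banks * cuts_per_bank * cut_kb       # 4096 KB

# On-chip footprint from this slide.
params_kb = 2318                               # convolution parameters, 8 bits
fmaps_kb = 1436                                # feature maps, 16 bits

used_kb = params_kb + fmaps_kb                 # 3754 KB
free_kb = sram_kb - used_kb                    # 342 KB

# Assumption: perfect packing, so the number of 64KB cuts in use is a ceiling division.
cuts_in_use = -(-used_kb // cut_kb)
cuts_total = banks * cuts_per_bank

print(f"used {used_kb} KB of {sram_kb} KB, {free_kb} KB free")
print(f"{cuts_in_use} of {cuts_total} cuts needed")
```

Under that packing assumption, the few cuts left unused are the kind of candidates the per-cut sleep line control could keep inactive.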



Logical to physical mapping
• Feature maps and kernels are sliced into batches processed iteratively and results are accumulated
• Batch size is set per layer, matching feature and kernel parameters to HW resources and ceilings
[Figure: input feature map (feature width, feature height, feature depth) split into BATCH 0 … BATCH N of a given batch size; kernels KER 0 … KER Q (kernel width) applied per batch and summed (Σ) into the output feature map]
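The batch-and-accumulate idea can be sketched for a single output value. This sketch assumes the slicing runs along the feature depth, as the figure labels suggest; the function names and the test data are hypothetical.

```python
def window_sum(feature_slices, kernel_slices):
    """Sum of products over every depth slice of one kernel window."""
    return sum(
        f * w
        for f_slice, k_slice in zip(feature_slices, kernel_slices)
        for f, w in zip(f_slice, k_slice)
    )

def window_sum_batched(feature_slices, kernel_slices, batch_size):
    """Same result, computed batch by batch with a running accumulator."""
    acc = 0
    for start in range(0, len(feature_slices), batch_size):
        stop = start + batch_size
        acc += window_sum(feature_slices[start:stop], kernel_slices[start:stop])
    return acc

# Hypothetical data: depth 6, a 2x2 window flattened to 4 values per slice.
features = [[d + i for i in range(4)] for d in range(6)]
kernels = [[(d * i) % 3 - 1 for i in range(4)] for d in range(6)]

full = window_sum(features, kernels)
for batch_size in (1, 2, 4, 6):
    assert window_sum_batched(features, kernels, batch_size) == full
```

Because the partial sums are simply added, the batch size can be chosen per layer without changing the result, only the amount of data that must be resident at once.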



Reconfigurable Accelerator Framework
DESIGN / STARTUP / RUNTIME
• Design time parametric
• Unidirectional stream links create ad-hoc processing chains
• DMA engines run autonomously through linked lists and synch up with the DSP clusters
[Block diagram: stream switch with control registers, bus arbiter & interface, DMAs 0 to 15, CA 0 to CA 7, sensor IFs, crop, color conv, JPEG and display IF; example streams: RGB image, kernel, feature map, batch, batch-1]



Reconfigurable Accelerator Framework

Virtual stream links

More flexible than hardwired data paths
More power efficient than a bus
• Ferry data to/from acc, IFs and DMA engines
• Flow control supported
• Streams multicast to multiple destinations
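A toy software model can illustrate how configurable unidirectional links with multicast behave. It is only a sketch: the class, the port names and the routing table are hypothetical, and flow control is not modelled.

```python
from collections import defaultdict

class StreamSwitch:
    """Toy model of a switch that routes unidirectional stream links."""

    def __init__(self):
        self.routes = defaultdict(list)    # source port -> list of destination ports

    def link(self, src, *dsts):
        """Configure a link from one source to one or more destinations."""
        self.routes[src].extend(dsts)

    def push(self, src, word, sinks):
        """Deliver one word from src to every destination linked to it."""
        for dst in self.routes[src]:
            sinks[dst].append(word)

switch = StreamSwitch()
switch.link("DMA0", "CA0", "CA1")   # multicast one stream to two accelerators
sinks = defaultdict(list)
for word in (1, 2, 3):
    switch.push("DMA0", word, sinks)

print(dict(sinks))
```

Multicast means a stream fetched once from memory can feed several consumers, which is one way such a fabric can save bandwidth compared with repeated bus transfers.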




Virtual Stream Links

[Figure: example link topologies between interfaces (IF), accelerators (A), DMA engines (DMA E) and buffers (BUF), with sync: simple chains; chains with forks; joins with a single IF; joins with multiple IFs; forks and hops]



DCNN Convolution Accelerator
• Kernels: 1x1 to 12x12, 8 bit => 16 bit, up to 4 in parallel
• Batch size: up to 16
• Feat. size: 512 (ker > 6x6), 1024 (ker > 3x3), 2048 (others); H/V padding, H/V stride
• MACs: 16 bits inputs, 40 bits accum.; scaling, rounding, saturation
[Block diagram: feature stream, kernel stream and ACCUM(N-1) stream enter through IF/BUF stages; kernel buffer and feature line buffer (36 and 12 read ports) feed 36 x 16x16 bit MACs and a 13 input adder tree under CTRL; CONV. OUT leaves as the output data stream]
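The MAC datapath listed above (16-bit inputs, 40-bit accumulation, then scaling, rounding and saturation) can be sketched in integer arithmetic. The slide does not give the rounding mode or the scaling form, so this sketch assumes a power-of-two scale implemented as a right shift with round-half-up, and saturation at every accumulation step; the function names are illustrative.

```python
def saturate(value, bits):
    """Clamp a signed integer to the range of a two's-complement word."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))

def mac_window(features, weights, shift, acc_in=0):
    """Accumulate 16-bit products in 40 bits, then scale, round and saturate to 16 bits.

    acc_in stands in for a partial sum from a previous batch (an assumption
    about how the ACCUM(N-1) stream is used).
    """
    acc = acc_in
    for f, w in zip(features, weights):
        acc = saturate(acc + f * w, 40)
    if shift > 0:
        acc = (acc + (1 << (shift - 1))) >> shift   # add half an LSB, then shift
    return saturate(acc, 16)

print(mac_window([3, -2], [5, 7], shift=0))             # 1
print(mac_window([100, 100], [100, 100], shift=4))      # 1250
print(mac_window([1000] * 36, [1000] * 36, shift=10))   # 32767 (saturated)
```

The wide accumulator is what lets 36 full-scale 16x16 products be summed without overflow before the result is brought back to 16 bits.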



Exploit Parallelism and Locality

Chained and parallel batch execution on multiple CAs reduces bandwidth, power, and # of DMA channels
[Figure panels: Parallel Batch Execution; Chained Batch Execution; Parallel/Chained Batch Execution. Each shows DMA engines streaming FMAP, kernels (KER. 0/1 or KER. 0-3) and BATCH(N-1) partial results into CA 0 … CA N (or CA 0.0 … CA N.M), with OUT K0 / OUT K1 written back by DMA]



Parameter Compression
• Kernel weights can be quantized non-linearly with 8 or fewer bits (e.g. with KNN)
• Convolution Accelerator supports decompression in HW
• AlexNet top-1 classification error rate increase of 0.3%
[Plots: Layer 1 and Layer 3; x axis 0 to 254; y axis -12000 to 12000 (Layer 1) and -1000 to 2000 (Layer 3)]
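Non-linear quantization of this kind is commonly realised as a codebook: each weight is stored as a short index, and decompression is a table lookup. The sketch below is not the authors' method. It swaps in a plain 1-D k-means (Lloyd's algorithm) to build the codebook, and the weight values and function names are hypothetical.

```python
def nearest(codebook, w):
    """Index of the codebook entry closest to w."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - w))

def build_codebook(weights, bits=8, iterations=10):
    """1-D k-means over the weights; returns sorted integer centroids."""
    k = min(1 << bits, len(set(weights)))
    ordered = sorted(weights)
    # Start from evenly spaced quantiles of the weight distribution.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        sums, counts = [0.0] * k, [0] * k
        for w in weights:
            i = nearest(centroids, w)
            sums[i] += w
            counts[i] += 1
        centroids = [sums[i] / counts[i] if counts[i] else centroids[i] for i in range(k)]
    return sorted(round(c) for c in centroids)

def compress(weights, codebook):
    return [nearest(codebook, w) for w in weights]

def decompress(indices, codebook):
    return [codebook[i] for i in indices]    # table lookup

# Hypothetical 16-bit weights, quantized to 2 bits to keep the example small.
weights = [-12000, -11800, -300, -250, 0, 10, 240, 310, 11900, 12050]
codebook = build_codebook(weights, bits=2)
indices = compress(weights, codebook)
restored = decompress(indices, codebook)
```

With 8-bit indices the codebook would hold up to 256 entries, so storage for the weights halves relative to 16-bit values while the lookup restores 16-bit operands for the MACs.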



Ultra-Wide DVFS Range

Voltage (V)   Frequency (MHz)   GOPS/W
0.575         200               2930
0.6           266               2691
0.7           450               1977
0.825         650               1423
1             950               969
1.1           1175              801

• LVT design with heterogeneous Poly-Bias levels → perf vs leakage
• GALS and low insertion delay clock networks to minimize on-chip variation margins
• Mono-supply memories with fine-grained power switches and sleep mode
• DVFS energy efficiency improvements via body bias



High Performance Low Voltage Mono-supply SRAM
• Wide Voltage Range
  – 0.12 µm² single p-well bitcell with reduced variability
  – In-situ tracking of bitcell current and programmable read time for best speed and lowest dynamic power
  – In-situ tracking of wordline delay and slope for robust low voltage read/write
• Energy conservation
  – Independent array and periphery power switches
  – In-built isolation for FSM stability in power-down modes
  – Extensive internal signal and clock gating
  – Programmable buffers to optimize performance and power across instances



Chip Specs

Technology        FD-SOI 28nm
Package           FBGA 15x15x1.83
Frequency         200MHz–1.175GHz
Supply voltages   0.575–1.1 V digital, 1.8V I/O
Power (**)        41 mW
On-chip RAM       4x1MB, 8x192KB, 128KB
Host              ARM® Cortex®-M4
No of DSPs        16
Peak DSP perf     75 GOPS (dual 16b MAC(*) loop)
No of CAs         8
CAs perf          676 GOPS(*) peak

(*) 1 MAC defined as 2 OPS (ADD + MUL)
(**) Only HW Acc avg power for AlexNet
[Die photo: 6239.2 um x 5598.2 um; labeled blocks: PLL, chip to chip, HW Acc subsystem, global memory subsystem, (DSP) cores and local mems, Cam IFs, M4]
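The peak figures in the specs appear consistent with a simple count of MAC units times clock rate. The sketch below assumes the peaks are quoted at the top frequency of 1.175 GHz, that each CA has 36 MACs (from the partitioning slide), and that "dual 16b MAC" means 2 MACs per DSP per cycle; the helper name is illustrative.

```python
OPS_PER_MAC = 2   # footnote (*): 1 MAC = ADD + MUL

def peak_gops(units, macs_per_unit, freq_ghz):
    """Peak throughput in GOPS for a set of identical MAC engines."""
    return units * macs_per_unit * OPS_PER_MAC * freq_ghz

ca_peak = peak_gops(units=8, macs_per_unit=36, freq_ghz=1.175)
dsp_peak = peak_gops(units=16, macs_per_unit=2, freq_ghz=1.175)
ca_peak_200mhz = peak_gops(units=8, macs_per_unit=36, freq_ghz=0.2)

print(f"CAs:  {ca_peak:.1f} GOPS at 1.175 GHz")     # 676.8
print(f"DSPs: {dsp_peak:.1f} GOPS at 1.175 GHz")    # 75.2
print(f"CAs:  {ca_peak_200mhz:.1f} GOPS at 200 MHz")  # 115.2
```

The same formula gives the ceiling available at the lowest DVFS point, where the efficiency measurements are reported.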



AlexNet CAs Performance

                                     16(F)x16(W)→16                     8(F)x8(W)→16
Layer   MOPS     Time [ms]   Load %  GOPs/W max  GOPs/W avg  Pwr [mW]   GOPs/W max  GOPs/W avg  Pwr [mW]
1       210.8    2.5         80      1228        988         86         1810        1456        58
2       447.8    6.5         86      1475        1262        54         2175        1861        37
3       299      3.6         73      1987        1445        58         2930        2131        39
4       224.2    2.7         73      1987        1445        58         2930        2130        39
5       149.6    1.8         72      1987        1434        58         2930        2114        39
Total   1331.6   17.1        77      1677        1287        61         2473        1898        41

Conditions: 200MHz @ 0.575V 25C, 4 chains of 2 CAs, batch of 1 image (227x227)
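The average efficiency column can be cross-checked from the other columns: operations divided by time gives the sustained throughput, and dividing by power gives GOPS/W. The sketch below uses the 16-bit figures as read from the table; treating the average column as this ratio is an interpretation, not something the slide states.

```python
# (layer, MOPS, time in ms, power in mW, reported average GOPS/W), 16-bit mode
rows = [
    ("1", 210.8, 2.5, 86, 988),
    ("2", 447.8, 6.5, 54, 1262),
    ("3", 299.0, 3.6, 58, 1445),
    ("4", 224.2, 2.7, 58, 1445),
    ("5", 149.6, 1.8, 58, 1434),
    ("Total", 1331.6, 17.1, 61, 1287),
]

def avg_gops_per_watt(mops, time_ms, power_mw):
    gops = mops / time_ms               # MOPS per millisecond equals GOPS
    return gops / (power_mw / 1000)     # divide by power in watts

for layer, mops, time_ms, power_mw, reported in rows:
    estimate = avg_gops_per_watt(mops, time_ms, power_mw)
    print(f"{layer:>5}: estimated {estimate:7.1f} GOPS/W, reported {reported}")
```

The estimates land within a few percent of the reported values, with the residual attributable to rounding of the time and power columns.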



Comparisons with prior work

                       This work                [3]    [4]    [5]             [6]
Process (nm)           28 FD-SOI                65     65     28              65
VDD (V)                0.575–1.1                1.2    NA     0.9             0.82–1.17
Power (mW) (1)         39                       45     485    15970           278
Freq (MHz)             200–1175                 125    980    606             100–250
Memory (kB)            4096, 8*192+128          36     44     16x2048, 4096   108
Die size (mm2)         2.2 CAs, 34 whole chip   16     3.02   67              12.25
Peak perf (GOPS)       676 (CAs), 76 (DSP)      64     452    5580            84
Peak effic. (GOPS/W)   2930                     1420   932    350             302

(1) Power for best efficiency (this work @ 200 MHz, 0.575V, 25C)
Peak efficiency of this work relative to prior work: 2.06x [3], 3.14x [4], 8.37x [5], 9.7x [6]

[3] J. Sim et al., 2016
[4] T. Chen et al., 2015
[5] Y. Chen et al., 2014
[6] Y. Chen et al., 2016
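The improvement factors quoted in the comparison follow directly from the peak efficiency row. A minimal check, with dictionary keys chosen here for illustration:

```python
# Peak efficiency in GOPS/W, from the comparison table.
peak_efficiency = {"This work": 2930, "[3]": 1420, "[4]": 932, "[5]": 350, "[6]": 302}

ratios = {
    ref: peak_efficiency["This work"] / value
    for ref, value in peak_efficiency.items()
    if ref != "This work"
}

for ref, ratio in ratios.items():
    print(f"{ref}: {ratio:.2f}x")   # 2.06x, 3.14x, 8.37x, 9.70x
```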



AlexNet Complete Application

• 37.5 mW @ 200MHz, 0.6V
• 10 FPS (38 ms DSPs, 62 ms 2 chained CAs)
• Dynamic: 10 mW CAs + 17 mW system
• Static: 0.6 mW CAs + 9.9 mW system
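The power and frame-rate figures above are mutually consistent, as the arithmetic below shows. It assumes the DSP and CA phases run back to back within a frame; the energy per frame is a derived value that does not appear on the slide.

```python
dynamic_mw = {"CAs": 10.0, "system": 17.0}
static_mw = {"CAs": 0.6, "system": 9.9}

total_mw = sum(dynamic_mw.values()) + sum(static_mw.values())   # about 37.5 mW

# Assumption: 38 ms on the DSPs followed by 62 ms on the two chained CAs.
frame_ms = 38 + 62
fps = 1000 / frame_ms                                           # 10 frames per second

energy_per_frame_mj = total_mw * frame_ms / 1000                # mW x ms = uJ, so /1000 for mJ

print(f"total power: {total_mw:.1f} mW")
print(f"frame rate:  {fps:.0f} FPS")
print(f"energy per frame: {energy_per_frame_mj:.2f} mJ")
```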

[Data-flow diagram: Sensor IF, RGB→YUV, CROP 227x227, JPEG, DMAs, MEM (IN FMAP, KER. MEM, OUT FMAP), two chained CAs, DSPs, and HOST + SPI + DMA sending the input image to the PC]



Demo session

[Board photo labels: User Switch Setting; ST Morpho and Arduino Expansion Headers; 2 MIPI & 2 Parallel Interface; Orlando Expansion (Non STM32ODE); JTAG]

Orlando - 28FDSOI @1GHz
• ARM M4 Cortex
• 8x2 REISC5 Cluster
• 8 CDNN HW Accelerator
• Image/Video Acc.
• On Chip RAM 48Mb
• 2 CSI2 Camera
• 2 Parallel Camera
• Chip to Chip Bridge 8GHz
• On Board Memory
  – Pseudo RAM 128MB
  – 128MB NOR FLASH

Tools
• GCC Tool Chain
• ARM tool Chain
• model2plateform-CDNN engine programmer



Summary
• An ultra-low-power SoC for DCNN real-world embedded and IoT applications
  – Reconfigurable HW acceleration data-flow framework
  – Parametric HW accelerator for convolutional layers of large DCNNs
  – Exploits different kinds of parallelism to improve performance and reduce power
  – DSP array with an optimized ISA for other DCNN layers and additional processing needs
  – Designed in FD-SOI 28nm, ultra-wide DVFS capability
  – Peak efficiency of 2.9 TOPS/W on AlexNet
