a 2.9 tops/w deep convolutional neural network soc in fd ... · 14.1: a 2.9 tops/w deep...
TRANSCRIPT
14.1: A 2.9 TOPS/W Deep Convolutional Neural Network SoC in FD-SOI 28nm for Intelligent Embedded Systems
© 2017 IEEE International Solid-State Circuits Conference 1 of 32
Giuseppe Desoli, Nitin Chawla, Thomas Boesch, Surinder-pal Singh,
Elio Guidetti, Fabio De Ambroggi, Tommaso Majo, Paolo Zambotti,
Manuj Ayodhyawasi, Harvinder Singh, Nalin Aggarwal
STMicroelectronics
A 2.9 TOPS/W Deep Convolutional Neural Network
SoC in FD-SOI 28nm for Intelligent Embedded Systems
14.1: A 2.9 TOPS/W Deep Convolutional Neural Network SoC in FD-SOI 28nm for Intelligent Embedded Systems
© 2017 IEEE International Solid-State Circuits Conference 2 of 32
Outline
IntroductionChip architectureDCNN mappingHardware acceleratorsPhysical implementationResults
14.1: A 2.9 TOPS/W Deep Convolutional Neural Network SoC in FD-SOI 28nm for Intelligent Embedded Systems
© 2017 IEEE International Solid-State Circuits Conference 3 of 32
Outline
IntroductionChip architectureDCNN mappingHardware acceleratorsPhysical implementationResults
14.1: A 2.9 TOPS/W Deep Convolutional Neural Network SoC in FD-SOI 28nm for Intelligent Embedded Systems
© 2017 IEEE International Solid-State Circuits Conference 4 of 32
Deep Convolutional Neural Networks a key enabler
Coffee Croissant BeverageMorningBreakfast Food
Object Detection
• DCNNs excel in many computer vision applications
• Being applied to other tasks too
• Can become a key component of ‘intelligent’ IoTnetworks for edge computing
• BUT …Stanford's CS 231N course by Andrej Karpathy and Justin Johnson
Clarifai.com LSVRC2014 images
Scene Classification
14.1: A 2.9 TOPS/W Deep Convolutional Neural Network SoC in FD-SOI 28nm for Intelligent Embedded Systems
© 2017 IEEE International Solid-State Circuits Conference 5 of 32
DCNN’s Complexity Evolution
0.0002 1.0 1.5
19.6
11.3
0.01
6050
138
150
ANNs(1997-2007)
3 layers
AlexNET(2012)
7 layers
GoogleLeNet(2014)
22 layers
VGG19(2014)
19 layers
ResNet(2015)
152 layers
Operations (GOPS)
Parameters (Millions)
14.1: A 2.9 TOPS/W Deep Convolutional Neural Network SoC in FD-SOI 28nm for Intelligent Embedded Systems
© 2017 IEEE International Solid-State Circuits Conference 6 of 32
AlexNet basics
Krizhevsky et all, NIPS 2012
35K 307K 884K 649K 442KTot. Parameters: ~ 60M
37M 16M 4M
Tot. Operations: 832 MC
ON
V 11
x11
REL
U, N
OR
MPO
OL
CO
NV
5x5
CO
NV
3x3
CO
NV
3x3
REL
U, P
OO
L
CO
NV
3x3
REL
U, P
OO
L
FC FCFC
105M 223M 149M 224M 74M
REL
U
REL
U37M 16M 4M
Pooling
Kernels Feature Maps
Activation
14.1: A 2.9 TOPS/W Deep Convolutional Neural Network SoC in FD-SOI 28nm for Intelligent Embedded Systems
© 2017 IEEE International Solid-State Circuits Conference 7 of 32
IntroductionChip architectureDCNN mappingHardware acceleratorsPhysical implementationResults
14.1: A 2.9 TOPS/W Deep Convolutional Neural Network SoC in FD-SOI 28nm for Intelligent Embedded Systems
© 2017 IEEE International Solid-State Circuits Conference 8 of 32
A complete SoC for DCNN applications
Enabling use of DCNNs in embedded systems:• Power efficiency, low cost• Efficient memory hierarchy• Flexibility to adapt to
different DCNN topologies• Input/output capability• Integrated with state of the
art Deep Learning tools
Full xBAR (64 bits)
SRAMHW Acc
Subsystem
HOSTPERIPH + EXT MEM
DSPcluster
DSPcluster
x4 x4
14.1: A 2.9 TOPS/W Deep Convolutional Neural Network SoC in FD-SOI 28nm for Intelligent Embedded Systems
© 2017 IEEE International Solid-State Circuits Conference 9 of 32
DSP Cluster8 DSP clusters, each with 232-bit DSPs, 4-way 16KB I-Cache, 64KB Local RAM anda shared 64KB RAM
(6uW/[email protected]) up to1GHz in ST FD-SOI 28nmtechnology
ISA extensions for DCNNexecution
STRED5
I+DMEM64K
LXBAR
I$ 16K
C-MEM64K
CXBAR
STRED5
I+DMEM64K
LXBAR
I$ 16K
DMA2D-DMA with independentchannels, linked list, stride,padding, etc.
14.1: A 2.9 TOPS/W Deep Convolutional Neural Network SoC in FD-SOI 28nm for Intelligent Embedded Systems
© 2017 IEEE International Solid-State Circuits Conference 10 of 32
Memory Hierarchy
4 MB (4x16x64KB) ofshared RAM grouped witha 64 bits bus port x bank tosustain peak DCNNthroughput
L2 SW controlled cache for feature maps and parametersFits an AlexNet level of complexity (compressed)
Each 64KB cut hasindividual sleep linecontrol to activate it ondemand
Local SRAMOn-chip SRAM
LPDDR
Energy/power x word access
1x 10x100x
14.1: A 2.9 TOPS/W Deep Convolutional Neural Network SoC in FD-SOI 28nm for Intelligent Embedded Systems
© 2017 IEEE International Solid-State Circuits Conference 11 of 32
HW Accelerators subsystem
Stream
Switch
Bus Interface
Display
DMA 15
. .
H264
DMA0
Sensor IF
IPx
JPEGCA 0
CA 7
. . .. .
Sensor IF
8 Conv Accelerators16 CDNN specialized streaming DMA engines
Additional IPsH264, MJPEG, 2 Census, 2 croppers, Corner detector, 4 color conv, 4 sensors input IFs, 1 DVI output IF, digital MIC array IF
Configurable framework supports data-flow based processing
14.1: A 2.9 TOPS/W Deep Convolutional Neural Network SoC in FD-SOI 28nm for Intelligent Embedded Systems
© 2017 IEEE International Solid-State Circuits Conference 12 of 32
IntroductionChip architectureDCNN mappingHardware acceleratorsPhysical implementationResults
14.1: A 2.9 TOPS/W Deep Convolutional Neural Network SoC in FD-SOI 28nm for Intelligent Embedded Systems
© 2017 IEEE International Solid-State Circuits Conference 13 of 32
AlexNet HW/SW partitioningC
ON
V 11
x11
REL
U, N
OR
MPO
OL
CO
NV
5x5
CO
NV
3x3
CO
NV
3x3
REL
U, P
OO
L
CO
NV
3x3
REL
U, P
OO
L
FC FCFC
105M 223M 149M 224M 74M 37M 16M 4M
REL
U
REL
U
HW
A
cc
DSP
• CONV layers: 1 Conv Acc → 36 MACs• Non conv layers to DSPs to accommodate
DCNN future evolution (leaky RELU, etc.)
85-90% of total operations
Tot. Operations: 832 M
14.1: A 2.9 TOPS/W Deep Convolutional Neural Network SoC in FD-SOI 28nm for Intelligent Embedded Systems
© 2017 IEEE International Solid-State Circuits Conference 14 of 32
AlexNet memory footprint
35K 307K 884K 649K 442K 37M 16M 4M
SRA
M
EXT
MEM
• On-chip SRAM• 2318 KB for parameters (8 bits)• 1436 KB for feature maps (16 bits)
• ~10 MB of external RAM for FC layers
Tot. Parameters: ~ 60M
CO
NV
11x1
1R
ELU
, NO
RM
POO
L
CO
NV
5x5
CO
NV
3x3
CO
NV
3x3
REL
U, P
OO
L
CO
NV
3x3
REL
U, P
OO
L
FC FCFCREL
U
REL
U
14.1: A 2.9 TOPS/W Deep Convolutional Neural Network SoC in FD-SOI 28nm for Intelligent Embedded Systems
© 2017 IEEE International Solid-State Circuits Conference 15 of 32
Logical to physical mapping
KERQ
KER0
FEATUREDEPTH
FEATURE WIDTH
BATCH SIZE
KERNEL WIDTH
FEATURE HEIGHT
KER 0…
…IN FEATURE
MAP
BATCH 0
BATCH N
Σ
KER QΣ
Feature maps and kernels are sliced into batches processed iteratively and results are accumulated
Batch size set x layerMatching features andkernels parameters to HWresources and ceilings
OUT FEATUREMAP
14.1: A 2.9 TOPS/W Deep Convolutional Neural Network SoC in FD-SOI 28nm for Intelligent Embedded Systems
© 2017 IEEE International Solid-State Circuits Conference 16 of 32
IntroductionChip architectureDCNN mappingHardware acceleratorsPhysical implementationResults
14.1: A 2.9 TOPS/W Deep Convolutional Neural Network SoC in FD-SOI 28nm for Intelligent Embedded Systems
© 2017 IEEE International Solid-State Circuits Conference 17 of 32
RUNTIMESTARTUP
DESIGNReconfigurable Accelerator
FrameworkDesign time parametric
Unidirectional stream links create ad-hoc processing chains.
DMA engines run autonomously through linked lists and synch up with the DSP clusters
Ctrl Regs Bus Arbiter & Interface
Display IF
. . .
Color conv
15
Sensor IF
JPEG
CA 0
. . .
CA 1
CA 7
012314
Crop
4
KERNELFMAP
BATCHBATCH-1
RGBIMAGE
Sensor IF
DMAs
14.1: A 2.9 TOPS/W Deep Convolutional Neural Network SoC in FD-SOI 28nm for Intelligent Embedded Systems
© 2017 IEEE International Solid-State Circuits Conference 18 of 32
Reconfigurable Accelerator Framework
Virtual stream links
More flexible than hardwired data pathsMore power efficient than a bus
• Ferry data to/from acc,IFs and DMA engines
• Flow control supported
• Streams multicast tomultiple destinations
Ctrl Regs Bus Arbiter & Interface
Display IF
. . .
Color conv
15
Sensor IF
JPEG
CA 0
. . .
CA 1
CA 7
012314
Crop
4
KERNELFMAP
BATCHBATCH-1
RGBIMAGE
Sensor IF
DMAs
14.1: A 2.9 TOPS/W Deep Convolutional Neural Network SoC in FD-SOI 28nm for Intelligent Embedded Systems
© 2017 IEEE International Solid-State Circuits Conference 19 of 32
Virtual Stream Links
Sync.
A 0IF 0 A 1 DMA E0A 0IF 0 A 1 DMA E0
A 2 DMA E1A 0IF 0 DMA E1A 1DMA E0
A 0IF 0DMA E0
A 1IF 1
A 0IF 0DMA E0
BUF
A 0 A 1 DMA E0DMA E0
A2
Chains with forks
Simple chains
Joins single IF
Joins with multiple IF
Forksand hops
A2
A2
14.1: A 2.9 TOPS/W Deep Convolutional Neural Network SoC in FD-SOI 28nm for Intelligent Embedded Systems
© 2017 IEEE International Solid-State Circuits Conference 20 of 32
DCNN Convolution Accelerator
Kernels 1x1 to 12x128bit => 16 bitup to 4 in parallel
Batch size
Up to 16
Feat.size
512: ker > 6x61024: ker > 3x32048: othersH/V paddingH/V stride
MACs 16 bits inputs40 bits accum.scaling, rounding, saturation
KER. BUFFER
FEATURE LINE
BUFFER
BUF
36 READPORTS
36 x 16x16BIT
MACs13 INPUT ADDER
TREE
IF
FEATURE STREAM
KERNELSTREAM
ACCUM(N-1)
STREAM
OUPUT DATA STREAM
CONV. OUT
CTRL
CTRL
12 READPORTS
BUFIF
BUFIF BUF IF
14.1: A 2.9 TOPS/W Deep Convolutional Neural Network SoC in FD-SOI 28nm for Intelligent Embedded Systems
© 2017 IEEE International Solid-State Circuits Conference 21 of 32
Exploit Parallelism and Locality
CA 0
CA N DMA
…
FMAPKER. 0/1
BATCH(N-1) K0
OUT K1 DMABATCH(N-1) K1
DMAOUT K0
CA 0
CA N DMA
…
OUT
DMAFMAP next batch
CA 0.0
CA 0.M DMA
…
FMAP KER. 0-3
BATCH(N-1) K0
OUT K0
DMA
CA N.0DMA
CA N.M DMA
…BATCH(N-1) K1
OUT K1
Chained and parallel batch execution on multiple CAs reduces bandwidth, power, and # of DMA channels
Parallel Batch Execution
Chained Batch Execution
Parallel/Chained Batch Execution
DMA DMA
DMA FMAPKER. 0/1
BATCH(N-1) K0
FMAP next batch
14.1: A 2.9 TOPS/W Deep Convolutional Neural Network SoC in FD-SOI 28nm for Intelligent Embedded Systems
© 2017 IEEE International Solid-State Circuits Conference 22 of 32
Parameter Compression
• Kernel weights can be quantized non linearly with 8 offewer bits (e.g. with KNN),
• Convolution Accelerator supports decompression in HW• AlexNet top-1 classification error rate increase of 0.3%
-12000
-8000
-4000
0
4000
8000
12000
0 127 254
Layer 1
-1000
-500
0
500
1000
1500
2000
0 127 254
Layer 3
14.1: A 2.9 TOPS/W Deep Convolutional Neural Network SoC in FD-SOI 28nm for Intelligent Embedded Systems
© 2017 IEEE International Solid-State Circuits Conference 23 of 32
IntroductionChip architectureDCNN mappingHardware acceleratorsPhysical implementationResults
14.1: A 2.9 TOPS/W Deep Convolutional Neural Network SoC in FD-SOI 28nm for Intelligent Embedded Systems
© 2017 IEEE International Solid-State Circuits Conference 24 of 32
Ultra-Wide DVFS Range
200266
450650
95011752930
2691
1977
1423969 801
0.575 0.6 0.7 0.825 1 1.1
Wide DVFS Range
Frequency GOPS/W
• LVT design with heterogeneous Poly-Bias levels → perf vs leakage
• GALS and low insertiondelay clock networks tominimize on chip variationmargins;
• Mono Supply memories with fine grained power switches and sleep mode
• DVFS energy efficiency improvements via body bias
14.1: A 2.9 TOPS/W Deep Convolutional Neural Network SoC in FD-SOI 28nm for Intelligent Embedded Systems
© 2017 IEEE International Solid-State Circuits Conference 25 of 32
High Performance Low Voltage Mono-supply SRAM
• Wide Voltage Range 0.12u2 single p-well bitcell with reduced variability In-situ tracking of bitcell current and programmable
read time for best speed and lowest dynamic power In-situ tracking of wordline delay and slope for robust
low voltage read/ write
• Energy conservation Independent array and periphery power switches In-built isolation for FSM stability in power-down modes Extensive internal signal and clock gating Programmable buffers to optimize performance and
power across instances
14.1: A 2.9 TOPS/W Deep Convolutional Neural Network SoC in FD-SOI 28nm for Intelligent Embedded Systems
© 2017 IEEE International Solid-State Circuits Conference 26 of 32
Chip SpecsTechnology FD-SOI 28nmPackage FBGA 15x15x1.83Frequency 200MHz–1.175GHzSupply voltages
0.575–1.1 V digital1.8V I/O
Power (**) 41 mW
On-chip RAM 4x1MB 8x192KB128KB
Host ARM® Cortex®-M4No of DSPs 16Peak DSP perf
75 GOPS (dual 16b MAC(*) loop)
No of CAs 8CAs perf 676 GOPS(*) peak
PLL
CHIPTO
CHIP
HW A
ccSU
BSYS
TEM
GLOBALMEMORY
SUBSYSTEM
(DSP) CORES ANDLOCAL MEMS
CamIFs
M4
(*) 1 MAC defined as 2 OPS (ADD + MUL)(**) Only HW Acc avg power for AlexNet
6239.2 um
5598
.2 um
14.1: A 2.9 TOPS/W Deep Convolutional Neural Network SoC in FD-SOI 28nm for Intelligent Embedded Systems
© 2017 IEEE International Solid-State Circuits Conference 27 of 32
IntroductionChip architectureDCNN mappingHardware acceleratorsPhysical implementationResults
14.1: A 2.9 TOPS/W Deep Convolutional Neural Network SoC in FD-SOI 28nm for Intelligent Embedded Systems
© 2017 IEEE International Solid-State Circuits Conference 28 of 32
AlexNet CAs Performance
Layer MOPS Time[ms]
Load % GOPs/W Pwr
[mW] GOPs/W Pwr[mW]
16(F)x16(W)→16 8(F)x8(W)→16max avg max avg
1 210.8 2.5 80 1228 988 86 1810 1456 582 447.8 6.5 86 1475 1262 54 2175 1861 373 299 3.6 73 1987 1445 58 2930 2131 394 224.2 2.7 73 1987 1445 58 2930 2130 395 149.6 1.8 72 1987 1434 58 2930 2114 39
Total 1331.6 17.1 77 1677 1287 61 2473 1898 41
200MHz @ 0.575V 25C, 4 chains of 2 CAs, batch of 1 image (227x227)
14.1: A 2.9 TOPS/W Deep Convolutional Neural Network SoC in FD-SOI 28nm for Intelligent Embedded Systems
© 2017 IEEE International Solid-State Circuits Conference 29 of 32
Comparisons with prior workThis work [3] [4] [5] [6]
Process (nm) 28 FD-SOI 65 65 28 65
VDD (V) 0.575–1.1 1.2 NA 0.9 0.82–1.17(1)Power
(mW) 39 45 485 15970 278
Freq (MHz) 200-1175 125 980 606 100–250Memory
(kB)4096
8*192+128 36 44 16x20484096 108
Die size (mm2)
2.2 CAs34 whole chip
16 3.02 67 12.25
Peak perf(GOPS)
676 (CAs)76 (DSP) 64 452 5580 84
Peak effic.(GOPS/W) 2930 1420 932 350 302
(1) Power for best efficiency (this work @ 200 MHz, 0.575V, 25C)
2930
1420932
350 302
Thiswork
[3] [4] [5] [6]
2.06x 3.14x 8.37x 9.7x
[3] J. Sim, et al. 2016[4] T. Chen et al 2015[5] Y. Chen, et al 2014[6] Y. Chen et al 2016
14.1: A 2.9 TOPS/W Deep Convolutional Neural Network SoC in FD-SOI 28nm for Intelligent Embedded Systems
© 2017 IEEE International Solid-State Circuits Conference 30 of 32
AlexNet Complete Application
37.5 mW @ 200MHz, 0.6V10 FPS (38 ms DSPs, 62 ms 2 chained CAs)
Dynamic: 10 mW CAs + 17 mW systemStatic: 0.6 mW CAs + 9.9 mW system
Sensor IF
RGB→YUV
CROP227x227
JPEG DMA MEM
DMA
DMAIN FMAP
KER. MEMDMA
OUT FMAPDSPs
DMADMA
CA
CAHOST
+SPI+
DMA
To PC
Input image
14.1: A 2.9 TOPS/W Deep Convolutional Neural Network SoC in FD-SOI 28nm for Intelligent Embedded Systems
© 2017 IEEE International Solid-State Circuits Conference 31 of 32
Demo session
User Switch Setting
ST Morpho And Arduino Expansion Headers
2 MIPI & 2 Parallel Interface
Orlando Expansion (Non STM32ODE)
JTAG
Orlando -28FDSOI @1GHz• ARM M4 Cortex• 8X2 REISC5 Cluster• 8 CDNN HW Accerlator• Image/Video Acc.• On Chip RAM 48Mb• 2 CSI2 Camera• 2 Parallel Camera• Chip to Chip Bridge 8GHz• On Board Memory
• Pseudo RAM128MB• 128MB NOR FLASH
Tools• GCC Tool Chain• ARM tool Chain• model2plateform-CDNN
engine programmer
14.1: A 2.9 TOPS/W Deep Convolutional Neural Network SoC in FD-SOI 28nm for Intelligent Embedded Systems
© 2017 IEEE International Solid-State Circuits Conference 32 of 32
Summary• An ultra-low-power SoC for DCNN real-word
embedded and IoT applications– Reconfigurable HW acceleration data-flow framework– Parametric HW accelerator for convolutional layers of
large DCNNs– Exploits different kinds of parallelism to improve
performance and reduce power– DSP array with an optimized ISA for other DCNN’s
layers and additional processing needs– Designed in FD-SOI28, ultra wide DVFS capability– Peak efficiency of 2.9 TOPS/W on AlexNet