SOI Workshop & Tutorial. Nanjing. Sept 2017. Eric Flamand
GAP8 IOT Application Processor
A PULP/RISCV BASED PLATFORM FOR NEAR-SENSOR ANALYTICS
Eric Flamand. CoFounder & CTOGreenwaves Technologies
SOI Workshop & Tutorial. Nanjing. Sept 2017. Eric Flamand
2
?
SOI Workshop & Tutorial. Nanjing. Sept 2017. Eric Flamand
Cost of transporting these data over the air ?
Even assuming distributed computing is marginaly more efficient than centralized we win big if data volume to be exchanged over the air is srinked by several order of magnitude moving from quantitative data to qualitative data!
Serial short reach link: best results around 0.5 pJ/bitLTE: between 300 and 600 uJ/bit
3
SOI Workshop & Tutorial. Nanjing. Sept 2017. Eric Flamand
So in practice ….If we want data reach sensors
Move from (raw) data to meta data (abstract/pertinent)Perform this transformation close to sensorWhile fitting in a tight power and cost budgetAnd being seamlessly integrated to the Internet over the air
4
SOI Workshop & Tutorial. Nanjing. Sept 2017. Eric Flamand
Three main sources of intensive data
• Image: Raw input in the order of 100KB/s for a small sensor
• Scene classification• Posture analysis• Identification
• Voice/Sound: Raw input in the order of 10KB/s per mic
• Recognition• Identification• Signature analysis
• Vibrations: Raw input in the order of 10KB/s
• Preventive maintenance• Monitoring
Once properly processed, common denominator is:extremely compact output (single index, alarm, …)
Output is a single index
Output is a single index
Output is a single index or an alarm
Bandwidth is reduced by
several order of magnitude
5
SOI Workshop & Tutorial. Nanjing. Sept 2017. Eric Flamand
What we want to achieve
Giga/Mega Bytes per second of incoming raw data from sensors
Few (Kilo) Bytes per second of outgoing, heavily processed data @ minimum
Joule per operation
6
SOI Workshop & Tutorial. Nanjing. Sept 2017. Eric Flamand
System level view
Long range, low BW
Short range, BW
Low rate (periodic) data
SW update, commands
U Controller
L2 Memory
IOs
ULP
Parallel Processor
Battery powered
Sense Analyze and Classify,Manage radio diversity
Transmit
LTE CatMNB-IOT…..
Resolution just enough for the job, order of QVGA/VGA
Standard SW stack here (legacy)
Hooked as an a computing accelerator. This is where we want to make the difference
7
SOI Workshop & Tutorial. Nanjing. Sept 2017. Eric Flamand
General pattern for content understanding
• Extract descriptors from raw data• 2D: Corners, blobs, HOG, DOG, …• 1D: LPC coefficients, Cepstral coeffs, …
• Use descriptors to classify data among representative families• Machine learning (CNN, SVM, Boost), Bayesian, ….
Usually highly parallel
Also highly parallel
8
SOI Workshop & Tutorial. Nanjing. Sept 2017. Eric Flamand
GAP8: Ultra Low Power IoT Processor
GAP8 has a unique energy efficiency across a very large range of computing power
Architecture efficiency• Extended Risc-V ISA• Low contention shared memory 8 +1
core clustered architecture• Tight synchronization• CNN based pattern matching engine
(HWCE)
Application affinity• Dominant signal processing
part• Limited memory requirement• Limited SW legacy
Performances• up to 12GOPS• up to 0.4GOPS @ 1mW, • up to 40MOPS @ 300uW• 3 uWatt stand-by power
consumption
Leveraging open source projects• Risc-V (Berkeley)• PULP (ETHZ, UniBo)
HW features• Smart IOs• Voltage
regulator/DVFS • RTC• Secured execution
Low cost processor
• 55nm LP• 0.5MB L2• aQFN 84
Core
0
Core
1
Core
2
Core
3
Core
4
Core
5
Core
6
Core
7
Logarithmic Interconnect
Shared L1 Memory
Shared Instruction CacheDbg Unit
DMA
HWCE
HW Sync
ClusterL2 Memory
LVDSUARTSPII2SI2C
// 10bGPIOsJtag
FabricCtrler
I$
L1Mic
ro D
MA
ClkDbg
Rom
PMU
DC/DC
RTC
Greenwaves Technologies. Confidential 10
eR
ISC
-V
Logarithmic Interconnect
Shared L1 Memory
Shared Instruction CacheDbg Unit
DMA
CNN-HWE
HW Sync
ClusterL2 Memory
LVDSUARTSPII2SI2C
// 10bGPIOs
HyperBus
eRISC-V
I$
L1
Mic
ro D
MA
ClkDbg
Rom
eR
ISC
-V
eR
ISC
-V
eR
ISC
-V
eR
ISC
-V
eR
ISC
-V
eR
ISC
-V
eR
ISC
-V
GAP8 Hierarchical Architecture
monitoring event qualification,protocol stack,system control
data analysis & classification,SW modem
Smart I/Osvoltage regulator & RTCSRAM in retentive mode
extended RISC-V extended RISC-Vefficient 8 core parallelization
HW synchronizationshared instruction cache
CNN HW engine
quasi stand-by low computing power high computing power
uWs mWs 10 to 20 mWs
primary energy consumption
primary energy consumption
Greenwaves Technologies. Confidential 11
GAP8 architectural energy efficiency gains
data analysis & classification,
SW modem
extended RISC-Vefficient 8 core parallelization
HW synchronizationshared instruction cache
CNN HW engine
3-5x
1.4x
5x
1.5x
overall, in practice on
targeted algorithms,
typically 20x
eR
ISC
-V
Logarithmic Interconnect
Shared L1 Memory
Shared Instruction CacheDbg Unit
DMA
CNN-HWE
HW Sync
ClusterL2 Memory
LVDSUARTSPII2SI2C
// 10bGPIOs
HyperBus
eRISC-V
I$
L1
Mic
ro D
MA
ClkDbg
Rom
eR
ISC
-V
eR
ISC
-V
eR
ISC
-V
eR
ISC
-V
eR
ISC
-V
eR
ISC
-V
eR
ISC
-V
Greenwaves Technologies. Confidential 12
GAP8 Advanced Power Management
eR
ISC
-V
Logarithmic Interconnect
Shared L1 Memory
Shared Instruction CacheDbg Unit
DMA
CNN-HWE
HW Sync
ClusterL2 Memory
LVDSUARTSPII2SI2C
// 10bGPIOs
HyperBus
eRISC-V
I$
L1
Mic
ro D
MA
ClkDbg
Rom
eR
ISC
-V
eR
ISC
-V
eR
ISC
-V
eR
ISC
-V
eR
ISC
-V
eR
ISC
-V
eR
ISC
-V
✔ Embedded DC/DC, low current✔ Real Time Clock 32KHz only✔ L2 Memory partially retentive
MCU sleep mode
eR
ISC
-V
Logarithmic Interconnect
Shared L1 Memory
Shared Instruction CacheDbg Unit
DMA
CNN-HWE
HW Sync
ClusterL2 Memory
LVDSUARTSPII2SI2C
// 10bGPIOs
HyperBus
eRISC-V
I$
L1
Mic
ro D
MA
ClkDbg
Rom
eR
ISC
-V
eR
ISC
-V
eR
ISC
-V
eR
ISC
-V
eR
ISC
-V
eR
ISC
-V
eR
ISC
-V
✔ Embedded DC/DC, high current✔ Voltage can dynamically change✔ One clock gen active, frequency can
dynamically change✔ Systematic clock gating
MCU active modeeR
ISC
-V
Logarithmic Interconnect
Shared L1 Memory
Shared Instruction CacheDbg Unit
DMA
CNN-HWE
HW Sync
ClusterL2 Memory
LVDSUARTSPII2SI2C
// 10bGPIOs
HyperBus
eRISC-V
I$
L1
Mic
ro D
MA
ClkDbg
Rom
eR
ISC
-V
eR
ISC
-V
eR
ISC
-V
eR
ISC
-V
eR
ISC
-V
eR
ISC
-V
eR
ISC
-V
✔ Embedded DC/DC, high current✔ Voltage can dynamically change✔ Two clock gen active, frequencies can
dynamically change✔ Systematic Clock Gating
MCU + Parallel processor active mode
uW r
ange
1 m
W r
ange
10-2
0 m
W r
ange
Ultra fast switching time from one mode to anotherUltra fast voltage and frequency change time
Highly optimized system level power consumption
SOI Workshop & Tutorial. Nanjing. Sept 2017. Eric Flamand
Qualitative data from real life applications
SOI Workshop & Tutorial. Nanjing. Sept 2017. Eric Flamand
The work horse for radio, sound and vibration: FFT
Key operations for performanceComplex MultiplicationsComplex RotationsPost modified accessesVectorial operations
All these butterflies are evaluated in parallel
Radix4 Butterfly
SOI Workshop & Tutorial. Nanjing. Sept 2017. Eric Flamand
The work horse for radio, sound and vibration: FFT
FFT 256 FFT 1024 FFT 4096
1167 4842 22710
Number of Cycles running on 8 cores
FFT 256 FFT 1024 FFT 4096
11264 56320 225280
Number of operations (*,+,>>,Ld/St)
Ideal means perfect scaling, perf on N cores = Perf on 1 core / N
SOI Workshop & Tutorial. Nanjing. Sept 2017. Eric Flamand
The work horse for radio, sound and vibration: FFT
ARM FFT1024 Q15 Data are with CMSIS optimized library
X 9X 12X 180
SOI Workshop & Tutorial. Nanjing. Sept 2017. Eric Flamand
Visual Localization: FFT2D + HOG
1 2 4 80
500000
1000000
1500000
2000000
2500000
3000000
3500000
4000000
FFT2D 256x256. Radix 4
FFT 2D 256 (Radix4)
Ideal
Number of Cores
Cyc
les
1 2 4 80
10
20
30
40
50
60
70
Histogram Of Gradients
8x8 Cells, Blocks: [2x2] Cells, 50% Overlap. 640x480 BW Image
actual
ideal
Number of Cores
Cyc
les
Pe
r P
ixe
l
1 384 000 cycles per image 589 000 cycles per image
We need only 2 MHz per image
SOI Workshop & Tutorial. Nanjing. Sept 2017. Eric Flamand
CNN based Image Classification32x32
5x5 Conv
MaxPool 2x2:1
Linear
28x28 28x28
MaxPool 2x2:1
14x14 14x14
5x5 Conv
MaxPool 2x2:1
10x10 10x10
MaxPool 2x2:1
5x5 5x5
1 10
1 8
1
1
1 12
12
8
Cifar10
Layer0 Layer1 Layer2 Total0
100000
200000
300000
400000
500000
600000
700000
800000
229049
473663
8330
711042
1782643088
4119
65033
Cifar10
1 Core
2 Cores4 Cores
8 Cores
8 Cores + HWCECyc
les
28x28
5x5 Conv
MaxPool 2x2:1
Linear
24x24 24x14
MaxPool 2x2:1
12x12 12x12
5x5 Conv
MaxPool 2x2:1
8x8 8x8
MaxPool 2x2:1
4x4 4x4
1 10
1 32
1
1
1 64
64
32
ReLu ReLu
24x24 24x141 32
ReLu ReLu
8x8 8x81 64
Mnist
0
1000000
2000000
3000000
4000000
5000000
6000000
7000000
8000000
9000000
889282
6706656
24787
7620725
57536
749451
9744
816731
Mnist
1 Core
2 Cores
4 Cores
8 Cores
8 Cores + HWCECyc
les
1 core to 8 cores + HWCE: 10.9 speedup 1 core to 8 cores + HWCE: 9.3 speedup
CIFAR10 MNIST
SOI Workshop & Tutorial. Nanjing. Sept 2017. Eric Flamand
Global Normalization
Contrastive Normalization
Convolution 3x3
ReLU
MaxPool 2x2
Convolution 3x3
ReLU
MaxPool 2x2
Convolution 3x3
ReLU
MaxPool 2x2
Linear
ReLU
Linear
128 x 128
128 x 128
1 x 128 x 128 => 32 x 126 x 126
32 x 126 x126 => 32 x 63 x 63
32 x 63 x 63 => 32 x 61 x 61
32 x 61 x 61 => 32 x 30 x 30
32 x 61 x 61 => 32 x 61 x 61
32 x 30 x 30 => 32 x 28 x 28
32 x 28 x 28 => 32 x 14 x 14
32 x 28 x 28 => 32 x 28 x 28
6272 x 64 => 64
64 => 64
64 x 13 => 13
32 x 126 x 126 => 32 x 126 x 126
Trainable Par: 421 263Neurons: 1 511 904
050
100150200250300350400450
391.3
205.8
112.869.1
33.3
CNN 13 Layers, 128x128 Input, 14 Outputs
Time @ 250MHz, ms
33ms per image
CNN based Image Classification
SOI Workshop & Tutorial. Nanjing. Sept 2017. Eric Flamand
People Counting• Filtering + Difference of Gradient + SVM-RBF• Open Space. Accuracy: approx 90%• 1 Image every 3 minutes => 10 years on a battery
0
20000000
40000000
60000000
80000000
100000000
120000000
Cycles
0100000020000003000000400000050000006000000700000080000009000000
Cycles
x 88.6 X 7.4
SOI Workshop & Tutorial. Nanjing. Sept 2017. Eric Flamand
Audio Processing
FFT
N sub bands, one triangle filter per sub band, mel scale Spectru m
3N Coefficients (Coeff, Coeff', Coeff'') at each sample time
0.65 MHz on 8 Cores
SOI Workshop & Tutorial. Nanjing. Sept 2017. Eric Flamand
Hierarchical Power Processing?
Sound ? Phonem ? Word ?
Core
0
Core
1
Core
2
Core
3
Core
4
Core
5
Core
6
Core
7
Logarithmic Interconnect
Shared L1 Memory
Shared Instruction CacheDbg Unit
DMA
HWCE
HW Sync
ClusterL2 Memory
LVDSUARTSPII2SI2C
// 10bGPIOsJtag
FabricCtrler
I$
L1Mic
ro D
MA
ClkDbg
Rom
PMUDC/D
C
RTC
Gradually turn on resources
SOI Workshop & Tutorial. Nanjing. Sept 2017. Eric Flamand
Conclusion• GAP8 bridges the gap between ultra low power
MCU and multi-core processor:– Smart sensing on data rich sensors achievable
within tinny power budget: uW in Idle, mW in micro controller mode, 5-20mW in number crunching mode : Few Mops to up to 12Gops
– Low cost bill of material
• GAP8 agile power management architecture combined with IOT low duty cycling is a perfect fit for FDSOI process