TRANSCRIPT
“Speech Recognition on low power devices”
Vikrant Tomar and Sam Myer – Fluent.ai Inc.
September 15, 2020
tinyML Talks Sponsors
Additional Sponsorships available – contact [email protected] for info
Confidential Presentation ©2020 Deeplite, All Rights Reserved
VISIT bit.ly/Deeplite FOR MORE INFO
WE USE AI TO MAKE OTHER AI FASTER, SMALLER AND MORE POWER EFFICIENT
Automatically compress SOTA models like MobileNet to <200KB with little to no drop in accuracy for inference on resource-limited MCUs
Reduce model optimization trial & error from weeks to days using Deeplite's design space exploration
Deploy more models to your device without sacrificing performance or battery life with our easy-to-use software
bit.ly/Deeplite
Copyright © Edge Impulse Inc.
TinyML for all developers
Get your free account at http://edgeimpulse.com
[Diagram: the Edge Impulse workflow – acquire valuable training data securely from real sensors in real time, enrich data and train ML algorithms, test the impulse with real-time device data flows, then choose embedded and edge compute deployment options, all built on an open source SDK.]
Maxim Integrated: Enabling Edge Intelligence
Sensors and Signal Conditioning
Health sensors measure PPG and ECG signals critical to understanding vital signs. Signal chain products enable measuring even the most sensitive signals.
Low Power Cortex M4 Micros
The biggest (3MB flash and 1MB SRAM) and the smallest (256KB flash and 96KB SRAM) Cortex M4 microcontrollers enable algorithms and neural networks to run at wearable power levels
Advanced AI Acceleration
AI inferences at a cost and power point that makes sense for the edge. Computation capability to give vision to the IoT, without the power cables. Coming soon!
Qeexo AutoML for Embedded AI
Automated Machine Learning Platform that builds tinyML solutions for the Edge using sensor data

QEEXO AUTOML: END-TO-END MACHINE LEARNING PLATFORM

Key Features
• Wide range of ML methods: GBM, XGBoost, Random Forest, Logistic Regression, Decision Tree, SVM, CNN, RNN, CRNN, ANN, Local Outlier Factor, and Isolation Forest
• Easy-to-use interface for labeling, recording, validating, and visualizing time-series sensor data
• On-device inference optimized for low latency, low power consumption, and a small memory footprint
• Supports Arm® Cortex™-M0 to M4 class MCUs
• Automates complex and labor-intensive processes of a typical ML workflow – no coding or ML expertise required!
• Extensive, highly-optimized feature spaces
• Super-compact code for MCUs and Gateways
• Sensor selection and placement analysis
• AI-driven component specs
• Automated data quality checks
• Data collection, augmentation & labeling services
• No open source – clean licensing

Target Markets/Applications
• Industrial Predictive Maintenance
• Smart Home
• Wearables
• Automotive
• Mobile
• IoT

For a limited time, sign up to use Qeexo AutoML at automl.qeexo.com for FREE to bring intelligence to your devices!
Next-Generation AI Tools for
Product Development
Get started w/ a special tinyML Talks offer for corporate customers: https://reality.ai/get-started
SynSense (formerly known as aiCTX) builds ultra-low-power (sub-mW) sensing and inference hardware for embedded, mobile and edge devices. We design systems for real-time always-on smart sensing, for audio, vision, bio-signals and more.
https://SynSense.ai
Next tinyML Talks
Date: Tuesday, September 29
• Michael Gielda, VP Business Development and co-founder, Antmicro – "Running TF Lite on Microcontrollers without hardware in Renode"
• Stuart Feffer, Co-founder and CEO, Reality AI – "Building Products using Edge AI / TinyML on MCUs"
Webcast start time is 8 am Pacific time. Each presentation is approximately 30 minutes in length.
Please contact [email protected] if you are interested in presenting
Vikrant Tomar
Vikrant is Founder and CTO of Fluent.ai Inc. He is a scientist and executive with nearly 10 years of experience in speech recognition and machine/deep learning. He obtained his PhD in automatic speech recognition at McGill University, Canada, where he worked on manifold learning and deep learning approaches for acoustic modeling. In the past, he has also worked at Nuance Communications Inc. and Vestec Inc. as a Research Scientist.
Sam Myer
Sam is the lead developer at Fluent.ai Inc., where his responsibilities include Fluent's embedded speech recognition engine. He has an M.Sc. in signal processing from Queen Mary University of London and a B.Sc. in computer science from McGill University. Sam has extensive software development experience encompassing nearly 15 years and multiple cities including New York, Berlin and Montreal.
Overview
• About Fluent.ai
• Model Transformation
• Model Compression
• Fluent.ai µCore
• Demos
About Fluent.ai
• Founded in 2015 after over 7 years of ground-breaking machine
learning/AI research by international thought-leaders
• Research partnerships with many leading research labs and
institutions
• Strong and experienced team of leading scientists, engineers, sales
staff and managers/executives (~25)
• Working with customers in North America, Europe and Asia on robust, multilingual and offline deployments
Strong Institutional Backers
Fluent.ai Technology
[Diagram: a conventional pipeline passes the utterance “Please turn on the lights” through Speech to Text and then NLP to reach the intent, while end-to-end Spoken Language Understanding maps the audio directly to the “Lights On” intent.]
Advantages of end-to-end SLU over conventional approaches:
• Smaller training data needs
• Higher accuracy and robustness against noise
• Offline and personalizable
• Any language, multi-language
Our Right to Win
“Fluent’s unique models can be trained quickly to deliver the required accuracy in many dialects, languages and noise conditions and be embedded on the world’s devices.”
– William Tunstall-Pedoe, Founder of Evi (acquired by Amazon Alexa), Advisor/Investor in Fluent.ai
Large Global Opportunity & Demand
• Wake up your voice-enabled device with one of our low-power keyword spotting solutions that beat state-of-the-art systems.
• Less than 5% false rejection rate (FRR) at 3 false accepts per 24 hours
Target applications: single, multiple, or user-trainable wake words for smart-home devices, smart toys, wearables, car infotainment, robotics & industrial IoT, and voice remotes.
System requirements (Arm Cortex M4): 25 KB RAM, 200 KB storage, 100 ms latency, 48 MHz minimum frequency.
Voice AI for on-device speech understanding
World’s first end-to-end spoken language understanding system => faster, more flexible and more accurate voice user interfaces than conventional technologies.
• Offline/on-device for guaranteed privacy and security
• Any language and accent, multiple languages concurrently
• Personalizable by the end-user
• Lower development cost, faster time-to-market
Target applications: smart speakers, smart home hubs, wearables, car infotainment, and voice remotes.
System requirements (Arm Cortex M4): 100 KB RAM, 550 KB storage, < 200 ms latency, 48 MHz minimum frequency.
Demo: Multilingual voice control on ESP32 – https://bit.ly/fluent-m5-demo-public
Community contributions
• Fluent Speech Commands Dataset
• Speech to intent dataset
• ~28,000 utterances from ~100 speakers
• 31 intents, 254 commands
• Link: bit.ly/fluent-speech-commands
• Downloaded over 500 times and used in many research papers
• SpeechBrain project at Mila
• A PyTorch-based speech toolkit
Fluent.ai µCore
Fluent.ai µCore
• Proprietary low-resource spoken language understanding
library
• Detects wake phrase(s) or keywords + commands
Challenges
• Taking neural networks from GPUs to MCUs:
  • Low footprint (memory and CPU usage)
  • Real-time processing for low-latency recognition
Model compression & Fluent.ai µCore
Fluent.ai Transformer
1 Model compression
• Compress the size of the model by removing unimportant weights:
  • Filter pruning, kernel pruning, or layer pruning
  • One-shot or iterative pruning
  • Fine-grained or coarse-grained pruning
• Some popular pruning methods (a magnitude-pruning sketch follows below):
  • Level pruner
  • Slim pruner
  • NetAdapt
  • AGP
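As a rough illustration of the simplest of these, here is a minimal sketch of level (magnitude-based) pruning in PyTorch. The tiny model, the 50% sparsity target, and the use of torch.nn.utils.prune are illustrative assumptions, not Fluent.ai's actual compression pipeline.

```python
# Minimal sketch of one-shot, fine-grained "level" (magnitude) pruning.
# The model and the 50% sparsity target are placeholders for illustration.
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(
    nn.Conv1d(40, 64, kernel_size=3),  # e.g. 40 acoustic-feature channels in
    nn.ReLU(),
    nn.Conv1d(64, 64, kernel_size=3),
    nn.ReLU(),
)

for module in model.modules():
    if isinstance(module, nn.Conv1d):
        # Zero out the 50% smallest-magnitude weights in each layer.
        prune.l1_unstructured(module, name="weight", amount=0.5)
        # For coarse-grained filter pruning one would instead use, e.g.:
        # prune.ln_structured(module, name="weight", amount=0.5, n=2, dim=0)
        prune.remove(module, "weight")  # bake the pruning mask into the weights

total = sum(p.numel() for p in model.parameters())
zeros = sum(int((p == 0).sum()) for p in model.parameters())
print(f"overall sparsity: {zeros / total:.1%}")
```

An iterative variant would fine-tune between pruning rounds; the AGP pruner mentioned above instead grows the sparsity target gradually over training.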
1 Automated model compression (AMC)
• Uses a reinforcement learning algorithm (DDPG) to automatically learn the pruning ratio for each layer
• Reward is a function of accuracy and FLOPS (an illustrative sketch follows below)
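The talk only states that the reward combines accuracy and FLOPS; the exact function is not given. As a hedged illustration, the sketch below uses the -error * log(FLOPs) form from the original AMC paper.

```python
import math

def amc_style_reward(accuracy: float, flops: float, flops_budget: float) -> float:
    """Illustrative reward for the DDPG agent: higher accuracy and lower FLOPS
    are better. The exact formula used by Fluent.ai is not specified in the
    talk; this mirrors the -error * log(FLOPs) form from the AMC paper and
    simply rejects candidate policies that exceed the FLOPS budget."""
    if flops > flops_budget:
        return -1.0
    error = 1.0 - accuracy
    return -error * math.log(flops)

# Example with the Fluent_MN numbers from the next slide (FLOPS in the slide's units):
print(amc_style_reward(accuracy=0.99270, flops=9.06, flops_budget=10.0))
```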
1 AMC on Fluent_MN
• Compressing the Fluent_MN architecture by up to 50% with less than 1 percent accuracy loss

              Accuracy   FLOPS
Original      99.634     14.61
Compressed    99.270     9.06
2 Transforming models
• Trained model on GPU using PyTorch
• Perform post-processing
• Generate C++ code describing the model
• Compile model C++ code with the library

Generated C++ code features:
• Conditional compilation
• 8-bit quantization (an illustrative sketch follows below)
• Weight reordering
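As a hedged sketch of what 8-bit quantization typically involves, the snippet below quantizes a weight tensor symmetrically to int8 with a single per-tensor scale. The actual scheme used by the Fluent.ai transformer (per-channel scales, weight reordering, the generated C++ itself) is not detailed in the talk.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor 8-bit quantization (illustrative only).
    Returns int8 values (which could be stored in flash) plus the float scale
    needed to dequantize or to fold into the layer's output scaling."""
    scale = float(np.max(np.abs(weights))) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal((64, 64)).astype(np.float32) * 0.1
q, scale = quantize_int8(w)
print("worst-case absolute error:", float(np.max(np.abs(dequantize(q, scale) - w))))
```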
Fluent.ai µCore
3 Real-time processing
Convolution in existing libraries is designed for images, not time-series.

Training (batched utterances) vs. inference (streaming audio):
• Entire utterance is available vs. audio streamed one frame at a time
• Decoding latency not considered vs. latency minimized for a good user experience
• Finite utterance length vs. continuous listening, with the NN applied in overlapping windows
• Activations for the entire network stored in memory vs. memory usage must be minimized
Layer types
• Streaming layer types
• Unidirectional recurrent layers (GRU, LSTM)
• Convolution / depthwise-separable
• Windowed functions (e.g. MaxPool)
• Streaming cumulative functions (e.g. GlobalMaxPool; see the sketch after this list)
• Skip connections
• Activation functions (ReLU, sigmoid, tanh)
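To make the "streaming cumulative function" idea concrete, here is a minimal sketch of a global max-pool updated one frame at a time; the class name and API are assumptions for illustration, not the µCore implementation.

```python
from typing import Optional
import numpy as np

class StreamingGlobalMaxPool:
    """Streaming cumulative function sketch: instead of waiting for the whole
    utterance, a running element-wise maximum is updated as each frame arrives,
    so memory stays at one frame regardless of utterance length."""
    def __init__(self):
        self.running_max: Optional[np.ndarray] = None

    def process(self, frame: np.ndarray) -> np.ndarray:
        if self.running_max is None:
            self.running_max = frame.copy()
        else:
            np.maximum(self.running_max, frame, out=self.running_max)
        return self.running_max  # current pooled value after each frame
```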
Fluent µCore -- NN Layer structure
• All NN weights are stored in Flash
• Arm MCU platform allows network
weights to be fetched layer by layer
• Only activation buffer is stored in
RAM
• Process function
• Uses CMSIS
• Calculates activations as data is
received and updates buffer
• 1 frame input/output
Sequence of layers
• Layers are joined in sequence
• An input frame propagates through the layers
• Layers are independent (a minimal chaining sketch follows below)
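A minimal sketch of this chaining, assuming a per-layer process() that takes one frame and returns either an output frame or None while the layer is still buffering; the class names and API are illustrative, not the actual C++/CMSIS µCore code.

```python
from typing import List, Optional
import numpy as np

class StreamingLayer:
    """Each layer owns its own small state/buffer and exposes process(),
    which consumes one input frame and returns one output frame, or None
    when no output is due yet (e.g. while a convolution buffer fills)."""
    def process(self, frame: np.ndarray) -> Optional[np.ndarray]:
        raise NotImplementedError

class ReLU(StreamingLayer):
    def process(self, frame):
        return np.maximum(frame, 0.0)  # stateless: always produces an output

def step(layers: List[StreamingLayer], frame: np.ndarray) -> Optional[np.ndarray]:
    """Propagate one frame through the chain; stop as soon as a layer is still
    buffering (returns None), since later layers have nothing to consume yet."""
    out: Optional[np.ndarray] = frame
    for layer in layers:
        out = layer.process(out)
        if out is None:
            return None
    return out
```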
Convolution example
Streaming convolution, kernel size = 3, stride = 2
[Animated diagram over a sequence of input frames: each incoming frame is pushed into a small buffer; while the buffer holds fewer than three frames the process step emits NULL, and once the buffer is full one output frame is produced; with a stride of 2, subsequent outputs are emitted only on every second input frame, and frames that can no longer be used are discarded.]
A minimal code sketch of this behaviour follows below.
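Here is a minimal Python sketch of the streaming convolution shown above (kernel size 3, stride 2). It keeps only kernel-size frames in its buffer and emits NULL (None) until a full window is available, then one output every stride frames. This is an illustrative re-implementation, not the µCore code, which is generated C++ using CMSIS.

```python
from typing import List, Optional
import numpy as np

class StreamingConv1d:
    """Streaming 1-D convolution sketch: each call consumes one input frame and
    returns either one output frame or None. Only kernel_size frames are kept
    in the buffer, so memory stays constant regardless of utterance length."""

    def __init__(self, weights: np.ndarray, bias: np.ndarray,
                 kernel_size: int = 3, stride: int = 2):
        self.weights = weights          # shape: (out_channels, in_channels, kernel_size)
        self.bias = bias                # shape: (out_channels,)
        self.kernel_size = kernel_size
        self.stride = stride
        self.buffer: List[np.ndarray] = []   # most recent input frames
        self.frames_seen = 0

    def process(self, frame: np.ndarray) -> Optional[np.ndarray]:
        self.frames_seen += 1
        self.buffer.append(frame)
        self.buffer = self.buffer[-self.kernel_size:]   # drop frames no window can use
        if self.frames_seen < self.kernel_size:
            return None                                  # buffer not yet full (NULL output)
        if (self.frames_seen - self.kernel_size) % self.stride != 0:
            return None                                  # stride skips this position
        window = np.stack(self.buffer, axis=-1)          # (in_channels, kernel_size)
        return np.einsum("oik,ik->o", self.weights, window) + self.bias

# Feed frames one at a time, as during always-on listening.
rng = np.random.default_rng(0)
conv = StreamingConv1d(weights=rng.standard_normal((8, 4, 3)), bias=np.zeros(8))
for t in range(6):
    out = conv.process(rng.standard_normal(4))
    print(f"frame {t + 1}: {'output' if out is not None else 'NULL'}")
```

Feeding six frames prints NULL, NULL, output, NULL, output, NULL, matching the animation above.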
Advantages of streaming NN
• No need to keep features/activations for the entire utterance => lower memory requirements
• Live processing while the user is speaking => lower latency
• Redundant calculations eliminated by not using overlapping windows => lower CPU usage
µCore vs tflite-micro
• Same Fluent wakeword model running on µCore and tflite-micro on a Linux machine
[Bar chart comparing tensor RAM (kB), decoding time (ms), MIPS (MHz/s) and output interval (ms) for Fluent µCore vs. tflite-micro with 1.16 s and 1.48 s windows.]
µCore Summary
• Low footprint
  • CPU-efficient code, reduced model size & memory load operations
  • Small code size, to fit into limited flash memory
  • Small memory footprint (RAM)
• Streaming NN: optimized for low latency / real-time operations
• Cross platform (e.g., Arm Cortex M4, M33 @ 48 MHz, DSPG, XMOS)
Cheaper yet effective device designs!
More demos
Demo: Multiple wake words and multilingual intent on Arm Cortex M4
• Arm Cortex-M4 microcontroller running at up to 100 MHz
bit.ly/fluent_ww_air_cortexm4
Demo: Smart-home voice control on Cortex M7 – https://bit.ly/fluent-m7-demo
Copyright Notice
This presentation in this publication was presented as a tinyML® Talks webcast. The content reflects the opinion of the author(s) and their respective companies. The inclusion of presentations in this publication does not constitute an endorsement by tinyML Foundation or the sponsors.
There is no copyright protection claimed by this publication. However, each presentation is the work of the authors and their respective companies and may contain copyrighted material. As such, it is strongly encouraged that any use reflect proper acknowledgement to the appropriate source. Any questions regarding the use of any materials presented should be directed to the author(s) or their companies.
tinyML is a registered trademark of the tinyML Foundation.
www.tinyML.org