Developing Your Own Wake Word Engine
Just Like “Alexa” and “OK Google”
Xuchen Yao, CEO, KITT.AI
Guoguo Chen, CTO, KITT.AI
What’s a “wake word”?
• Wake word (a.k.a. hot word)
– Offline
– Code runs on CPU/DSP/MCU
– 24/7
– Always listening
• One-shot understanding
– Online
– Code runs in the cloud
– On demand
– Explicit permission
• Example: “Alexa” / “OK Google” / “Hey Siri” (wake word), then “what’s the weather today?” (one-shot understanding)
Conversational UI Pipeline
wake up device → speech to text → text understanding → dialogue management → text to speech
(data passed between stages: voice, text)
a customizable hotword detection engine
a.k.a: deep neural network in 2MB of RAM
hotword.io video blog
10,000+ developers, 7000+ unique hotwords
Who’s using it (released 5/2016)
Dominating developer community for hotword detection
Use Cases
#1 Hotword: Smart Mirror, https://github.com/evancohen/smart-mirror (credits to Evan Cohen)
Command & Control: GoPiGo (credits to Paul Matz)
Project RePL (credits to Chris Burns)
Speech Pipeline
Voice → Microphone Array → Wake Word Detection (local) → Speech Recognition (cloud/local)
Voice
• Telephone (8 kHz sampling)
• Others (16 kHz)
• Noises: TV, radio, street, café, car, music
• Pitch: children, adults, seniors
• Accent: US/UK/Europe/Asian…
Microphone Array
• Close talking
• Far field (3-9 feet)
• 2, 4, or 6 microphones
• Linear/circular
• Adaptive Echo Cancellation
• Beam forming
Wake Word Detection (local)
• Voice Activity Detection
• Auto Gain Control
• Fast response (0.1 second)
• High accuracy
Speech Recognition (cloud/local)
• IBM/Microsoft/Nuance/Google
• Alexa Voice Service
• Kaldi
• PocketSphinx
• HTK
• Command & Control
• Language Understanding
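Voice Activity Detection, listed in the pipeline, can be illustrated with a very small energy-based detector. This is a minimal sketch and not the engine's actual front end: the function names, the 10 ms frame size, and the -35 dBFS threshold are all assumptions chosen for illustration, and samples are assumed to be floats in [-1.0, 1.0].

```python
import math


def frame_rms(frame):
    """Root-mean-square level of one frame of float samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def simple_vad(samples, sample_rate=16000, frame_ms=10, threshold_db=-35.0):
    """Return one boolean per frame: True if the frame is louder than threshold_db.

    Hypothetical helper: frame_ms and threshold_db are illustrative defaults,
    not values taken from the slides.
    """
    frame_len = sample_rate * frame_ms // 1000
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        rms = frame_rms(samples[start:start + frame_len])
        # Clamp to avoid log10(0) on digital silence.
        level_db = 20.0 * math.log10(max(rms, 1e-10))
        flags.append(level_db > threshold_db)
    return flags
```

A fixed threshold like this breaks down with the TV, street, and café noise listed on the slide, which is one reason practical systems pair it with gain control and a trained model.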
Supported Platforms and Wrappers
• Raspberry Pi
• Mac OS X
• iPhone/iPad/iPod
• x86/64bit Ubuntu
• Android
• Pine 64
• Intel Edison
• Samsung Artik
• Allwinner R-series
• Ingenic X1000
• Rockchip
Personal vs. Universal models
                        Personal       Universal
Voice samples needed    3              At least 1,500
Speaker-independent     No             Yes
Speaker-specific        Sort of        No
Robust against noise    No             Yes
Free                    Yes            No
Time needed             Immediately    2 weeks
Customizing a universal model
define hotword → collect voice → train a model → deliver & evaluate → deploy to beta users → ship & success
• collect voice from device
• hotword web API
• Iterate & Improve
desired performance:
>90% detection rate
<= 3 false alarms in 24 hours
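The two targets above can be checked from a labeled test run. The sketch below is one way to compute them; the function names and the example counts are hypothetical, and it assumes false alarms are counted on audio that contains no hotword.

```python
def detection_rate(num_detected, num_hotword_utterances):
    """Fraction of true hotword utterances that triggered a detection."""
    return num_detected / num_hotword_utterances


def false_alarms_per_24h(num_false_alarms, negative_audio_hours):
    """False triggers normalized to a 24-hour span of hotword-free audio."""
    return num_false_alarms * 24.0 / negative_audio_hours


def meets_target(num_detected, num_hotword_utterances,
                 num_false_alarms, negative_audio_hours,
                 min_detection_rate=0.90, max_false_alarms_per_24h=3.0):
    """True if the run satisfies >90% detection and <=3 false alarms per 24 hours."""
    rate = detection_rate(num_detected, num_hotword_utterances)
    alarms = false_alarms_per_24h(num_false_alarms, negative_audio_hours)
    return rate > min_detection_rate and alarms <= max_false_alarms_per_24h
```

For example, 46 detections out of 50 hotword utterances with 5 false alarms over 48 hours of negative audio gives a 92% detection rate and 2.5 false alarms per 24 hours, which meets both targets.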
Science behind wake word
Challenges
• High detection rate
• Low false alarm
• Efficient: detect every 0.1 second
• Small RAM: <2MB
• Too much ambiguity, not much context
Is this “Alexa”? (figure: short window vs. longer window)
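The "detect every 0.1 second" requirement and the short-versus-longer-window question can be pictured as a fixed-length buffer that is re-scored at a fixed hop. The following is a rough sketch only: the 1.0 second window is an assumption (the slides do not give a window length), and `sliding_windows` is a hypothetical helper that stands in for whatever buffering the engine really uses.

```python
from collections import deque


def sliding_windows(samples, sample_rate=16000, window_s=1.0, hop_s=0.1):
    """Yield the most recent window of audio once per hop.

    window_s is an illustrative assumption; hop_s matches the 0.1 second
    detection interval named on the slide.
    """
    window_len = int(round(sample_rate * window_s))
    hop_len = int(round(sample_rate * hop_s))
    buffer = deque(maxlen=window_len)  # old samples fall off automatically
    since_last = 0
    for sample in samples:
        buffer.append(sample)
        since_last += 1
        if len(buffer) == window_len and since_last >= hop_len:
            since_last = 0
            yield list(buffer)
```

At 16 kHz a 0.1 second hop is 1,600 new samples per decision, and a 1 second window of 16-bit audio is about 32 KB, so the audio buffer itself is a small part of a 2 MB budget. A longer window gives the model more context but raises both memory use and the work done per hop.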
Existing Algorithm
• Advantage:
–Simplified pipeline
–Simplified decoder
• Disadvantage:
–Massive hotword specific training data
Possible Ways to Improve
• Data augmentation
– Adding noise
– Adding reverberation
– And so on…
(figure panels: original; add noise; add noise and reverberation)
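The two augmentations named above can be sketched in plain Python. This is an illustration under assumptions, not the training pipeline from the talk: `add_noise` mixes noise at a chosen signal-to-noise ratio, `add_reverb` convolves with a room impulse response, and both names, the SNR parameter, and the list-of-floats audio format are hypothetical. The noise is assumed to be non-silent.

```python
import math


def rms(signal):
    """Root-mean-square level of a list of float samples."""
    return math.sqrt(sum(s * s for s in signal) / len(signal))


def add_noise(speech, noise, snr_db):
    """Mix noise into speech so that speech-to-noise ratio equals snr_db."""
    # Loop the noise clip if it is shorter than the speech.
    tiled = [noise[i % len(noise)] for i in range(len(speech))]
    # Scale the noise so that 20*log10(rms(speech) / rms(scaled noise)) == snr_db.
    gain = rms(speech) / (rms(tiled) * 10 ** (snr_db / 20.0))
    return [s + gain * n for s, n in zip(speech, tiled)]


def add_reverb(speech, impulse_response):
    """Convolve speech with an impulse response, truncated to the input length."""
    out = [0.0] * len(speech)
    for i, s in enumerate(speech):
        for j, h in enumerate(impulse_response):
            if i + j < len(out):
                out[i + j] += s * h
    return out
```

Applying `add_reverb` and then `add_noise` to each recorded sample, with several noise clips and SNR values, turns one recording into many training examples. The direct convolution here is quadratic and only suitable for short clips.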
Possible Ways to Improve
• Network models
– Model selection
• Feedforward models? Recurrent models?
– Model compression
• 32-bit float → 16-bit float → 8-bit integer
• Parameters with small absolute value
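Both compression ideas above can be shown on a flat list of weights. The sketch below assumes symmetric linear quantization to 8-bit integers and simple magnitude pruning; the slides do not say which schemes were actually used, and every function name here is hypothetical.

```python
def quantize_int8(weights):
    """Symmetric linear quantization: return (list of ints in [-127, 127], scale)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from quantized values."""
    return [q * scale for q in quantized]


def prune_small(weights, threshold):
    """Zero out parameters whose absolute value is below threshold."""
    return [0.0 if abs(w) < threshold else w for w in weights]


def model_bytes(num_params, bits_per_param):
    """Storage needed for the parameters alone, in bytes."""
    return num_params * bits_per_param // 8
```

As a worked example with an assumed model size: one million parameters take 4,000,000 bytes as 32-bit floats, 2,000,000 bytes as 16-bit floats, and 1,000,000 bytes as 8-bit integers, so only the narrower formats leave room inside a 2 MB RAM budget.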
Possible Ways to Improve
• Decoder redesigning
– Modeling smaller units
• Syllables, phones, etc
– False alarm suppression
• Additional classifier?
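One way to read "modeling smaller units" is a decoder that only fires when the sub-word units of the hotword appear in the right order. The sketch below is an assumption-laden illustration, not the decoder from the talk: it takes per-frame posteriors from some acoustic model (not shown), smooths them with a moving average, and checks the order greedily. The threshold, smoothing width, and function names are hypothetical.

```python
def smooth(posteriors, width=3):
    """Moving average of each unit's posterior over the last `width` frames."""
    smoothed = []
    for t in range(len(posteriors)):
        window = posteriors[max(0, t - width + 1):t + 1]
        smoothed.append([sum(column) / len(window) for column in zip(*window)])
    return smoothed


def detect_ordered_units(posteriors, threshold=0.5, width=3):
    """posteriors[t][u] is the probability of sub-word unit u at frame t.

    Return True only if every unit exceeds `threshold` (after smoothing)
    in order, each strictly later than the previous one.
    """
    if not posteriors:
        return False
    smoothed = smooth(posteriors, width)
    frame = 0
    for unit in range(len(smoothed[0])):
        # Advance to the first remaining frame where this unit fires.
        while frame < len(smoothed) and smoothed[frame][unit] < threshold:
            frame += 1
        if frame == len(smoothed):
            return False
        frame += 1  # the next unit must fire on a later frame
    return True
```

Requiring the units in order rejects audio that contains only part of the hotword or the right sounds in the wrong sequence, which is a simple form of false alarm suppression. A separate second-stage classifier, as the slide suggests, would be a further check on top of this.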
Training with Tesla K20/K80
• Positive data
– 1,500 hotword samples
• Negative data
– Thousands of hours of speech
• Training time
– Half a day with 4 K80 GPUs
Software Architecture
(architecture diagram; recoverable labels)
• Frontend, Backend
• Devices, Production Cloud
• KITT.AI Scientific Computing Deep Learning Cloud
• Traffic, ELB, Content
• Websocket (audio, msg), HTTPS
• Message Queue
• Data → Training → Model → Deploy
Running Your First Snowboy Demo