The Rise of Voice Platforms: Comparing Voice-Related APIs
Voice First Footprint
In 2017 there will be 33 million devices
● The Voice 2017 Report - VoiceLabs analysis combined with research from CIRP, KPCB and InfoScout
Voice adoption
The ‘Voice First’ era has already started
● Alexa in 4% of US households (end 2016)
● Siri handles over 2bn commands a week
● 20% of Google searches on Android handsets input by voice
Alexa
Google Home
Ding Dong
Voice Devices
Creating an open ecosystem
Amazon Echo: Skills and Alexa Voice Service
Google Home: Google Assistant Actions
Speech Recognition API
Developing for Amazon Alexa
● Limited understanding: Amazon Echo is built for predefined options (e.g. no custom notes). The session ends after 8 sec.
● Predefined wake word defines the customer experience: Only 4 wake words are available, and one must be used in every conversation.
● No notifications and no presence: You can’t alert the user of an event. You cannot react to events such as the user coming home.
● No audio / no identification: Anybody can use Alexa (guests, etc.) and access all information.
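The short session window affects how a skill phrases its answers. A minimal sketch, assuming the Alexa Skills Kit JSON response format (`version`, `outputSpeech`, `reprompt`, `shouldEndSession`); the helper name `build_alexa_response` and the sample prompts are hypothetical and not from the talk:

```python
import json


def build_alexa_response(speech_text, reprompt_text=None, end_session=True):
    """Build a response body in the Alexa Skills Kit JSON response format."""
    response = {
        "outputSpeech": {"type": "PlainText", "text": speech_text},
        "shouldEndSession": end_session,
    }
    if reprompt_text is not None:
        # The reprompt is spoken if the user stays silent after the first prompt.
        response["reprompt"] = {
            "outputSpeech": {"type": "PlainText", "text": reprompt_text}
        }
    return {"version": "1.0", "response": response}


if __name__ == "__main__":
    # Offer a small, predefined set of options and keep the session open.
    body = build_alexa_response(
        "Which room should I switch on, kitchen or living room?",
        reprompt_text="Please say kitchen or living room.",
        end_session=False,
    )
    print(json.dumps(body, indent=2))
```

Asking for one of a few fixed options and adding a reprompt is one way to work within a device that is built for predefined choices and closes the session quickly.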
Technology Stack
Components enabling Voice User Interfaces
Applications: Implemented use cases leveraging the hardware and AI software.
AI Software: Software that interprets speech, enables conversations, and provides a natural voice.
Hardware: Devices the consumer interacts with, such as Amazon Echo or Google Home.
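The three layers can be pictured as a small pipeline. This is only an illustrative sketch: the class `VoiceStack`, its field names, and the stub layers are hypothetical, and real hardware and AI services are replaced by plain functions so the example runs on its own:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoiceStack:
    """Three layers of a voice user interface, wired as plain callables."""

    capture_audio: Callable[[], bytes]       # Hardware: the microphone device
    speech_to_text: Callable[[bytes], str]   # AI software: interprets speech
    handle_request: Callable[[str], str]     # Application: the use case
    text_to_speech: Callable[[str], bytes]   # AI software: produces the voice

    def run_once(self) -> bytes:
        """Pass one utterance through hardware, AI software and application."""
        audio_in = self.capture_audio()
        text_in = self.speech_to_text(audio_in)
        text_out = self.handle_request(text_in)
        return self.text_to_speech(text_out)


# Stub layers so the sketch runs without a device or a cloud API.
stack = VoiceStack(
    capture_audio=lambda: b"what time is it",
    speech_to_text=lambda audio: audio.decode("utf-8"),
    handle_request=lambda text: "It is noon." if "time" in text else "Sorry?",
    text_to_speech=lambda text: text.encode("utf-8"),
)

if __name__ == "__main__":
    print(stack.run_once())
```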
Speech Recognition API
Real-time speech-to-text APIs

Feature           | Google⁴                            | IBM³                       | Microsoft²
Status            | Beta                               | Beta/Production            | Preview
Language support¹ | 43 (89)                            | 8 (14)                     | 6 (7)
Cost/min          | 0.024 € (0.006 per 15 sec)         | 0.02 €                     | 0.06 € (1,000 calls of 15 sec each for $4)
Speaker detection | No                                 | English (8 kHz)            | No
Audio formats     | FLAC, Linear16, MULAW, AMR, AMR_WB | FLAC, PCM, WAV, OGG, MULAW | PCM single channel, Siren, SirenSR
Noise friendly    | Yes                                | Unknown                    | Unknown
Word hints        | Yes                                | No                         | No

¹ Language support: languages supported (languages supported including dialects)
² Microsoft: https://www.microsoft.com/cognitive-services/en-us/speech-api
³ IBM: http://www.ibm.com/watson/developercloud/speech-to-text.html
⁴ Google: https://cloud.google.com/speech/
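The cost row is easier to compare with a little arithmetic. A rough sketch using only the per-minute prices listed on the slide (not current list prices); the monthly volume of 1,000 minutes is hypothetical, and billing each request in started 15-second blocks is an assumption, not something the slide states:

```python
from decimal import Decimal
import math

# Prices as listed on the slide, per minute of audio; not current list prices.
PRICE_PER_MIN = {
    "Google": Decimal("0.024"),
    "IBM": Decimal("0.02"),
    "Microsoft": Decimal("0.06"),
}
GOOGLE_PRICE_PER_15_SEC = Decimal("0.006")


def google_cost(request_seconds):
    """Cost of one request, ASSUMING billing in started 15-second blocks."""
    blocks = math.ceil(request_seconds / 15)
    return blocks * GOOGLE_PRICE_PER_15_SEC


def monthly_cost(provider, minutes):
    """Flat per-minute estimate for a hypothetical monthly volume."""
    return PRICE_PER_MIN[provider] * minutes


if __name__ == "__main__":
    # Four 15-second blocks add up to the per-minute price in the table.
    print("Google, one 60 sec request:", google_cost(60))
    for name in PRICE_PER_MIN:
        print(name, "for 1,000 minutes:", monthly_cost(name, 1000))
```

Under the block-billing assumption, many short requests cost more per minute of speech than a few long ones, because a 20-second request is charged as two blocks.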
● High audio capture quality: Use lossless coding. Capture audio at 16,000 Hz or higher. Use the native sample rate.
● No additional noise: The APIs include noise reduction. Duplicate noise reduction can reduce the quality. Echo and noise have a huge impact on speech recognition quality.
● User education: Educate users to stay close to the microphone.
● One speaker per stream: In multi-speaker settings, try to separate the audio streams, as the current APIs are built for dictation.
● Provide context: Context matters a lot. Provide word hints to help the system detect words correctly.
Speech Recognition API
Best practices
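Some of these practices can be checked in code before audio is sent to an API. A minimal sketch using Python's standard `wave` module; the helper names are hypothetical, the single-channel check is only a rough proxy for "one speaker per stream", and the field names in `recognition_config` are an assumption based on the Google Cloud Speech-to-Text v1 REST `RecognitionConfig` and should be checked against the current documentation:

```python
import io
import wave


def check_wav(data: bytes) -> list:
    """Return best-practice warnings for an in-memory WAV file."""
    warnings = []
    with wave.open(io.BytesIO(data), "rb") as wav:
        if wav.getframerate() < 16000:
            warnings.append("sample rate below 16,000 Hz")
        if wav.getnchannels() != 1:
            # Rough proxy only: mono audio can still contain several speakers.
            warnings.append("audio is not single channel")
    return warnings


def make_wav(sample_rate: int, channels: int, seconds: float = 0.1) -> bytes:
    """Create a silent 16-bit PCM WAV in memory (test data only)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 16-bit samples, uncompressed (lossless)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * channels * int(sample_rate * seconds))
    return buffer.getvalue()


def recognition_config(sample_rate: int, hints: list) -> dict:
    """Request config with word hints (field names are an assumption)."""
    return {
        "encoding": "LINEAR16",
        "sampleRateHertz": sample_rate,
        "languageCode": "en-US",
        "speechContexts": [{"phrases": hints}],
    }


if __name__ == "__main__":
    print(check_wav(make_wav(16000, 1)))  # good capture settings
    print(check_wav(make_wav(8000, 2)))   # low rate, two channels
    print(recognition_config(16000, ["Alexa", "Echo"]))
```

The config carries the native sample rate of the recording rather than a resampled value, in line with the first bullet.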
Problem
Real life - Voice is in the early days
Speech-to-text-quality
Speaker recognition
Language mixing
Punctuation
We are building a voice first company and are looking for support
- Technical Research
- Deep Learning & NLP Scientist
- Software Engineers
Christian Rebernik Contact: [email protected]