www.everyzing.com
Are We Ready?
A Look at the State of the Art in Speech-to-text Applications
Marie Meteer
August 2007
Overview
• Speech Recognition: The State of the Art
  – A look back at where it came from
  – Elements of the models
  – State of the art performance
• Applications: Making them work
  – Call Center Analytics
  – Voicemail Transcription
  – Needles in Haystacks
  – Multimedia search
BBN Technologies’ Speech Milestones
[Timeline graphic; years shown: 1976, 1982, 1986, 1992, 1994, 1995, 1998, 2000, 2002, 2003, 2004, 2005]
• Early continuous speech recognizer using natural language understanding
• Early adopter of statistical hidden Markov models
• Introduced context dependent phonetic units
• First software-only, real-time, large-vocabulary, speaker-independent, continuous speech recognizer
• First 40,000 word real time speech recognizer
• Pioneered statistical language understanding and data extraction
• Rough ’n’ Ready prototype system for browsing audio
• Audio Indexer System – 1st generation
• DARPA EARS Program Award
• Broadcast Monitoring System delivered to U.S. Gov’t. – 2nd generation
• Exceeded DARPA EARS targets
• AVOKE STX 1.0 introduced
• AVOKE STX 2.0 with Domain Development Tools
Progress in Speech Recognition, 1990s
[Chart: word error rate (%) by year, 1987 to 1998, for benchmark tasks: Resource Management, Resource Management (speaker dependent), Connected Digits, Airline Task, WSJ 5K Vocab, WSJ 64K Vocab, Broadcast News, SWBD Conversational Telephone, Call Home]
BBN’s 2003 Performance Exceeds Word Error Rate Goals
[Chart: DARPA EARS for ASR Performance. Word error rate by year (2002, 2005, 2007); series: Broadcast news ceiling, Broadcast news floor, Telephony ceiling, Telephony floor, and a point labeled 2003]
Elements of a Speech Model
• Dictionary
  – List of all the words and their pronunciations, the sequence of “phonemes” that make up the word
    • Real Networks: R-IY-L N-EH-T-W-ER-K-S
  – Dictionary tool automatically creates phonetic pronunciations for most words
• Acoustic Model
  – Captures the relationship between the sounds and the phonemes
  – Specific to a language (e.g. English, Spanish) and a channel (e.g. telephony, broadcast)
• Domain Model
  – Captures the sequences of words in the language using a “tri-gram” model, that is the likelihood of a word given the two previous words
  – Can be as general as “Conversational” or as specific as “Technology”
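A minimal sketch of the tri-gram idea in the Domain Model bullet: estimate the likelihood of a word given the two previous words by counting. The function names, the padding symbols, and the toy corpus are illustrative assumptions, not part of any BBN or EveryZing system. Production recognizers also smooth these estimates so that unseen word sequences do not get zero probability; this sketch does not.

```python
from collections import Counter

def train_trigram(sentences):
    """Count trigrams and their two-word contexts from tokenized sentences."""
    trigram_counts = Counter()
    context_counts = Counter()
    for words in sentences:
        # Pad so the first real word also has a two-word context.
        padded = ["<s>", "<s>"] + words + ["</s>"]
        for i in range(2, len(padded)):
            context = (padded[i - 2], padded[i - 1])
            trigram_counts[context + (padded[i],)] += 1
            context_counts[context] += 1
    return trigram_counts, context_counts

def trigram_prob(word, prev2, prev1, trigram_counts, context_counts):
    """Unsmoothed estimate of P(word | prev2, prev1); 0.0 if the context is unseen."""
    context = (prev2, prev1)
    if context_counts[context] == 0:
        return 0.0
    return trigram_counts[context + (word,)] / context_counts[context]

# Toy corpus (hypothetical): three short "sports" sentences.
corpus = [
    "the game starts at noon".split(),
    "the game starts at seven".split(),
    "the game ends at noon".split(),
]
tri, ctx = train_trigram(corpus)
p_starts = trigram_prob("starts", "the", "game", tri, ctx)
print(f"P(starts | the game) = {p_starts:.3f}")
```

In this toy corpus "the game" is followed by "starts" in two of three sentences, so the estimate is two thirds. A domain-specific model such as "Technology" would be trained the same way, only on text from that domain.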
Model Requirements
• Acoustic Data
  – Minimum of 50-100 hours of transcribed data
  – The English Broadcast News model is trained on 1600 hours of broadcast news data
  – Training data must be a precise transcription with a corresponding audio file (including partial words, “um”, laughter, etc.)
• Domain Modeling Data
  – Text data, either transcribed from audio or taken off the web
  – Does not have to be as precise as for acoustic modeling
  – Has to model both the vocabulary and “style” of speaking
• Dictionary
  – Phonetic pronunciations of all of the words
Word Accuracy
• Recognition performance varies based on audio quality and domain
  – Within News, factors include
    • Speaker
    • Audio quality
    • Background music
  – Across Domains, factors include
    • Speaking style
    • Out of vocabulary rate
    • Audio quality

DOMAIN                            ACCURACY (%)
News                              74.5
Movie Reviews                     77.8
Technology                        79.4
Gaming                            59.45
Religion                          68.2

SPEAKER                           ACCURACY (%)
Male Anchor                       82
Female Anchor                     76
Non-native over the telephone     53
Commercial                        55
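The slide does not say how these accuracy figures were computed. A common convention, assumed here, is word accuracy = 1 - word error rate (WER), where WER is the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the recognizer output, divided by the number of reference words. The function name and the sample sentences below are illustrative only.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)

# Hypothetical example: one dropped word and one substituted word.
wer = word_error_rate("give me a call at noon", "give me call at new")
accuracy = 1.0 - wer
print(f"WER = {wer:.3f}, word accuracy = {accuracy:.3f}")
```

Because insertions count as errors, WER can exceed 1.0 on very poor output, in which case accuracy under this convention becomes negative.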
Document Retrieval Accuracy
• To correctly retrieve a document, a search term only has to be found once in the document
• The table below reports on document retrieval accuracy based on words occurring 2 or more times in the document compared with overall word accuracy.
[Chart: ACCURACY, RECALL, and PRECISION plotted for items 1 to 15; y-axis from 0 to 120]
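A rough worked example of why retrieval can beat raw word accuracy: if a search term occurs several times in a document, the recognizer only has to get one occurrence right. The sketch below assumes each occurrence is recognized independently with the same probability, which is a simplification (errors on the same word in the same recording are often correlated). The function name is illustrative.

```python
def prob_found_at_least_once(word_accuracy, occurrences):
    """Chance that at least one of `occurrences` instances is recognized,
    assuming independent errors with per-word accuracy `word_accuracy`."""
    if not 0.0 <= word_accuracy <= 1.0:
        raise ValueError("word_accuracy must be between 0 and 1")
    if occurrences < 1:
        raise ValueError("occurrences must be at least 1")
    return 1.0 - (1.0 - word_accuracy) ** occurrences

# Using the News word accuracy from the Word Accuracy slide (74.5%).
for k in (1, 2, 3):
    p = prob_found_at_least_once(0.745, k)
    print(f"{k} occurrence(s): {p:.3f}")
```

Under this assumption, a term spoken twice in a News clip would be retrievable about 93% of the time even though per-word accuracy is 74.5%.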
Markets and Applications
• Call Center Recording
• Government Intelligence
• Enterprise Search (webcasts, corp info)
• Broadcast Monitoring & Retrieval (audio/video publication)
• Digital Asset Production
• Consumer Search (video search)
AVOKE Caller Experience Analytics
• Breakthrough Caller Experience Analytics
  – The Only True End-to-End Solution
    • From dialing to termination
  – Multiple Techniques To Extract Understanding
    • Prompt and speech recognition, telephony data, and human annotation
  – Data-Driven Insights
    • With drill-down to listen for root cause
  – Zero Integration
    • No on-site hardware or software
• To Manage & Optimize Contact Processes
  – Improve Operational Visibility
  – Reduce Agent Time by 15-30+%
  – Boost First Call Resolution
  – Eliminate Customer Dis-Satisfiers
Full Text & Keyword Search
Search for words spoken by callers or agents
View call with full text of caller and call center – including all IVR(s), queue(s) and agent(s)
Voicemail Transcription
• Requirements
  – Near real time transcription
  – High accuracy, especially on names
  – Frequently very noisy conditions (non-native speaker calling on a cell phone from a street corner in Germany)
• Solution
  – Speech recognition automates a “first pass”
  – Human correction provides accuracy
  – Full human transcription on poor quality calls
Voicemail Solution: Human in the Loop
• Phone message is left
• Speech recognizer produces a rough transcript
• Transcribers fix the output of the speech recognizer
• Correct transcription goes back to the server
Example message: “Hi Tom. I can’t make the meeting but I’m available to call in. Give me a call at 101-555-1212. Thanks.”
Result: High Quality, Lower Cost
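One way to picture the human-in-the-loop flow is as a routing decision on each message: good-enough rough transcripts go to a transcriber for correction, and poor quality calls go to full human transcription. The slides do not say how poor quality calls are detected; the sketch below assumes a hypothetical recognizer confidence score and an arbitrary threshold. All names here are illustrative.

```python
from dataclasses import dataclass

@dataclass
class RoughTranscript:
    text: str
    confidence: float  # hypothetical recognizer score in [0, 1]

def route(draft, threshold=0.5):
    """Pick the human task for a voicemail.

    Returns "correct" when a transcriber should fix the rough transcript,
    or "transcribe" when the call should be transcribed from scratch.
    The 0.5 threshold is an arbitrary placeholder, not a tuned value.
    """
    return "correct" if draft.confidence >= threshold else "transcribe"

clear_call = RoughTranscript("hi tom i can't make the meeting", confidence=0.9)
noisy_call = RoughTranscript("hi tom eye cans make", confidence=0.2)
print(route(clear_call))
print(route(noisy_call))
```

The cost saving comes from the first branch: correcting a mostly right transcript takes less human time than typing the whole message.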
Custom Applications: Broadcast Monitoring
• Automatic translation of Arabic transcript from Language Weaver MT
• Automatic transcription of Arabic speech from BBN Audio Indexer
• Real-time streaming video (<5 min delay)
MultiMedia Search
…let’s look at the overall picture not just Obama and and Clinton Brett how do you assess the overall dynamics of what's happened over the course of the last three months how big -- victory for the president how big a defeat for the Democrat well it it. He would have been a bigger defeat it was a victory. This is this is -- reprieve cents for the president it's only as bill pointed out for months worth of funding. And it's and this issue's going to come up again in the Democrats are going to continue to try to impose restrictions on the with a president for a just war -- vote to be funded completely which is what. We're just talking about so. This is just justices have a battle he wanted that's that's nice for him but there's another one coming in just a few months. And of course what we have now is this whole idea that is taken hold and it's it's out there in the in the public parlance about September being in the big month not helpful to the president's cause -- -- for prisoners efforts you know we're not going to -- all the troops on the ground until next month and then visiting get to bounce of the summer to try to fix the situation. Probably unrealistic which in September's going to be a tough month of. ...
Problem:
Search engines have historically had very little to work with in terms of properly discovering and indexing multimedia content.
Opportunity:
The value of multimedia content is “trapped” inside the files, out of view of search engines. Titles and tags miss key concepts within the files.
Multimedia Consumption
Consumption:
• Automatic extraction of key terms and concepts for tagging, categorization
• Patent-pending “Snippet” navigation technology enables users to jump to relevant segments of the clip
• Social media integration drives RSS subscription, bookmarking, etc.
• Full text output enables related content presentation
Multimedia Discovery
Search Term          EveryZing Results    FoxSports Results    EveryZing Increase
Manny Ramirez        22                   7                    214%
Yankees              281                  111                  153%
Manchester United    21                   2                    950%
Golf                 214                  170                  25%
Federer              45                   15                   200%
David Beckham        36                   17                   111%
Tom Brady            53                   31                   71%
Example: FoxSports.com
• EveryZing Media Merchandising indexes the full contents of FoxSports multimedia files.
• As a result, EveryZing is able to significantly increase the number of keyword results.
• Great discovery leads to increased consumption and enhanced monetization opportunities.
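The "EveryZing Increase" column appears to be the relative gain over the FoxSports result count, (EveryZing - FoxSports) / FoxSports. A short check under that assumption, using the counts from the slide; the function and variable names are illustrative:

```python
def percent_increase(new_count, old_count):
    """Relative increase of new_count over old_count, in percent."""
    if old_count <= 0:
        raise ValueError("old_count must be positive")
    return (new_count - old_count) / old_count * 100.0

# (EveryZing results, FoxSports results, increase reported on the slide)
rows = {
    "Manny Ramirez": (22, 7, 214),
    "Yankees": (281, 111, 153),
    "Manchester United": (21, 2, 950),
    "Golf": (214, 170, 25),
    "Federer": (45, 15, 200),
    "David Beckham": (36, 17, 111),
    "Tom Brady": (53, 31, 71),
}
for term, (everyzing, foxsports, reported) in rows.items():
    computed = percent_increase(everyzing, foxsports)
    print(f"{term}: computed {computed:.1f}%, slide {reported}%")
```

Every computed value agrees with the slide to within one percentage point; the small differences come from rounding.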
Summary
• Speech recognition takes an inaccessible data structure (audio) and turns it into an accessible one (text)
• It’s far from perfect, but it’s a big jump from nothing
• Takeaway: It’s the task that matters. Find the right role, and speech recognition works
• (Corollary: A good prompt is worth two years of research)
Media Merchandising Solutions
Thank you!
Marie Meteer VP of Speech and NLP
mmeteer@everyzing.com
www.everyzing.com